OpenOffice
6 languages, 15 bitexts
total number of files: 10983
total number of tokens: 2612156
total number of sentence fragments: 246760
The original documentation of the office package
OpenOffice.org (http://www.openoffice.org/)
contains 2014 English documents which have been partly
translated into 5 languages: French, Spanish, Swedish, German,
and Japanese. The original documentation in English comprises
about 500,000 words and translations contain between 400,000
and 500,000 words per language. All documents have been
tokenized and, except for the Spanish part, tagged with parts of
speech. The English part of the corpus has been marked with
syntactic chunks as well.
Download
Upper-right triangle: sample files (test = sentence alignment samples, language IDs = XML file samples)
Bottom-left triangle: XML files (ces = sentence alignment files in XCES format, language IDs = gzipped tar archives of corpus files in XML)
Statistics
Number of files, tokens, and sentence fragments per language
Number of aligned sentences per target language

| language | files | tokens | sentences | de | en | es | fr | jp | sv |
|---|---|---|---|---|---|---|---|---|---|
| de | 2014 | 474436 | 47482 | | 42903 | 37764 | 37085 | 31107 | 37947 |
| en | 2014 | 478654 | 44961 | 42903 | | 38583 | 38014 | 33143 | 38906 |
| es | 1738 | 491426 | 40009 | 37764 | 38583 | | 38477 | 33445 | 39479 |
| fr | 1739 | 496780 | 39462 | 37085 | 38014 | 38477 | | 33295 | 38726 |
| jp | 1739 | 267665 | 34167 | 31107 | 33143 | 33445 | 33295 | | 34026 |
| sv | 1739 | 403195 | 40679 | 37947 | 38906 | 39479 | 38726 | 34026 | |
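The figures on this page can be cross-checked against each other. The following is a minimal sketch, not part of the corpus distribution: it copies the numbers from the statistics above into two hypothetical dictionaries (`per_language` and `aligned` are illustrative names) and confirms that 6 languages yield 15 bitexts and that the per-language rows sum to the totals stated at the top of the page.

```python
from itertools import combinations

# Per-language figures copied from the statistics on this page:
# language ID -> (files, tokens, sentence fragments)
per_language = {
    "de": (2014, 474436, 47482),
    "en": (2014, 478654, 44961),
    "es": (1738, 491426, 40009),
    "fr": (1739, 496780, 39462),
    "jp": (1739, 267665, 34167),
    "sv": (1739, 403195, 40679),
}

# Aligned sentence counts per language pair. Each pair is stored once,
# because the alignment matrix on this page is symmetric.
aligned = {
    ("de", "en"): 42903, ("de", "es"): 37764, ("de", "fr"): 37085,
    ("de", "jp"): 31107, ("de", "sv"): 37947,
    ("en", "es"): 38583, ("en", "fr"): 38014, ("en", "jp"): 33143,
    ("en", "sv"): 38906,
    ("es", "fr"): 38477, ("es", "jp"): 33445, ("es", "sv"): 39479,
    ("fr", "jp"): 33295, ("fr", "sv"): 38726,
    ("jp", "sv"): 34026,
}

# Every unordered pair of languages is one bitext.
pairs = list(combinations(sorted(per_language), 2))

# Column-wise sums: (total files, total tokens, total sentence fragments).
totals = tuple(sum(column) for column in zip(*per_language.values()))

print(len(pairs), "bitexts")
print("files, tokens, sentence fragments:", totals)
```

Storing each language pair once keeps the sketch short; a lookup in either direction can sort the two language IDs before indexing `aligned`.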