6 languages, 15 bitexts
total number of files: 10983
total number of tokens: 2612156
total number of sentence fragments: 246760

The original documentation of the office package OpenOffice.org (http://www.openoffice.org/) contains 2014 English documents, which have been partly translated into 5 languages: French, Spanish, Swedish, German, and Japanese. The original documentation in English comprises about 500,000 words, and translations contain between 400,000 and 500,000 words per language. All documents have been tokenized and, except for the Spanish part, tagged with parts of speech. The English part of the corpus has also been marked with syntactic chunks.


Upper-right triangle: sample files (test = sentence alignment samples, language IDs = XML file samples)
Bottom-left triangle: XML-files (ces = sentence alignment files in XCES format, language IDs = gzipped tar-archives of corpus files in XML)

      de    en    es    fr    jp    sv
de    de    test  test  test  test  test
en    ces   en    test  test  test  test
es    ces   ces   es    test  test  test
fr    ces   ces   ces   fr    test  test
jp    ces   ces   ces   ces   jp    test
sv    ces   ces   ces   ces   ces   sv
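
The sentence alignment files are described above only as being "in XCES format". As a minimal sketch of how such a file might be read, the Python snippet below parses a small hand-written sample. The sample document, the element names (cesAlign, linkGrp, link), the attribute names (fromDoc, toDoc, xtargets), and the file paths are assumptions made for illustration and are not taken from the corpus files; the function name read_links is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element names, attribute names, and document
# paths are assumptions for illustration, not copied from the corpus.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/sample.xml.gz" toDoc="sv/sample.xml.gz">
    <link xtargets="s1;s1" />
    <link xtargets="s2 s3;s2" />
    <link xtargets=";s3" />
  </linkGrp>
</cesAlign>
"""


def read_links(xml_text):
    """Return a list of (from_doc, to_doc, source_ids, target_ids) tuples."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # Assumed layout: source ids and target ids separated by ';',
            # with ids on each side separated by spaces.
            source, _, target = link.get("xtargets", "").partition(";")
            links.append((from_doc, to_doc, source.split(), target.split()))
    return links


for from_doc, to_doc, source_ids, target_ids in read_links(SAMPLE):
    print(from_doc, source_ids, "<->", to_doc, target_ids)
```

Under these assumptions, a link may join one sentence to one, several to one, or leave one side empty, so each side is kept as a list of sentence IDs rather than a single value.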


Number of files, tokens, and sentence fragments per language
Number of aligned sentences per target language

language  files   tokens  sentences      de      en      es      fr      jp      sv
de         2014   474436      47482       -   42903   37764   37085   31107   37947
en         2014   478654      44961   42903       -   38583   38014   33143   38906
es         1738   491426      40009   37764   38583       -   38477   33445   39479
fr         1739   496780      39462   37085   38014   38477       -   33295   38726
jp         1739   267665      34167   31107   33143   33445   33295       -   34026
sv         1739   403195      40679   37947   38906   39479   38726   34026       -
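
The summary figures at the top of this page can be cross-checked against the per-language rows. The following is a small sketch in Python, with the numbers copied by hand from the table above; the variable names STATS and ALIGNED are chosen for illustration only.

```python
from itertools import combinations

# language: (files, tokens, sentence fragments), copied from the table above
STATS = {
    "de": (2014, 474436, 47482),
    "en": (2014, 478654, 44961),
    "es": (1738, 491426, 40009),
    "fr": (1739, 496780, 39462),
    "jp": (1739, 267665, 34167),
    "sv": (1739, 403195, 40679),
}

# Aligned sentences per language pair; each value appears twice in the
# table above because the alignment counts are symmetric.
ALIGNED = {
    ("de", "en"): 42903, ("de", "es"): 37764, ("de", "fr"): 37085,
    ("de", "jp"): 31107, ("de", "sv"): 37947, ("en", "es"): 38583,
    ("en", "fr"): 38014, ("en", "jp"): 33143, ("en", "sv"): 38906,
    ("es", "fr"): 38477, ("es", "jp"): 33445, ("es", "sv"): 39479,
    ("fr", "jp"): 33295, ("fr", "sv"): 38726, ("jp", "sv"): 34026,
}

# Every unordered pair of the 6 languages forms one bitext.
pairs = list(combinations(sorted(STATS), 2))

# Column-wise sums over all languages: files, tokens, sentence fragments.
totals = tuple(sum(column) for column in zip(*STATS.values()))

print(len(pairs))  # 15
print(totals)      # (10983, 2612156, 246760)
```

The 15 pairs match the "15 bitexts" in the summary, and the column sums match the stated totals for files, tokens, and sentence fragments.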