Corpus Linguistics

You are now in section > Corpora > More Languages

CRATER Multilingual Aligned Annotated Corpus

Org: 
Time:

Size: ca. 1 million words
Contents: trilingual (English, French and Spanish); telecommunications texts
Access: online access; the text files can also be downloaded via FTP
Notes: aligned at the sentence level; POS-tagged in all three languages

Danish corpus project Korpus 2000

Org:  The Society for Danish Language and Literature
Time: The project was finished and made available in spring 2002

Size: approx. 28 million words
Contents: various texts written from 1998 to 2002
Access: freely available to the public
Notes: "It is also possible to search the Korpus 90 (1988-1992) which is similar to the Korpus 2000 in its composition and size and hence serves as an older comparative corpus for the Korpus 2000."

ECI/MCI - European Corpus Initiative Multilingual Corpus I

Org:  Distributed by ELSNET
Time: Corpus was first distributed in 1994

Size: several corpora ranging from 4 to 34 million words
Contents: French, Spanish, Dutch, German and English texts
Access: available on CD-ROM
Notes:

ET10-63 Corpus (bilingual, parallel)

Org: 
Time:

Size: 1,250,000 words in each language
Contents: English and French official documents on telecommunications
Access: unknown
Notes: POS-tagged and lemmatized

Modern Languages French Corpus

Org:  Cambridge University Press/Cornell University (Aaron Lawson)
Time: 1997 - 199?

Size: unknown
Contents: "[A] collective effort to collect 5 million words of spoken American English for use in developing research and teaching materials based on actual usage."
Access: some parts are freely available on Lawson's site (e.g. French Spoken Corpus)
Notes: Corpus documentation

Oslo Corpus of Bosnian Texts

Org:  IMS - Institut fuer Maschinelle Sprachverarbeitung
Time:

Size: 1.5 million words
Contents: "[It] comprises several different genres: fiction (novels and short stories), essays, children's stories, folklore, islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, have for the most part been published in the 1990s."
Access: "The Oslo Corpus of Bosnian Texts is available for anybody who wants to use it for non-commercial academic research."
Notes:

Uppsala Russian Corpus

Org:  Slaviska Institutionen, Uppsala Universitet
Time: 1960-1989

Size: ca. 1 million words
Contents: "600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose"
Access: online search access (Cyrillic or Latin transliteration) at Tübingen
Notes:

 


webmaster@corpus-linguistics.de