Corpus Linguistics

Parallel Corpora

Compara

Org: Ana Frankenberg-Garcia; Diana Santos
Time: miscellaneous
Size: 1 million and still growing
Contents: "open-ended collection of Portuguese-English and English-Portuguese source texts and translations"
Access: free
Notes: Corpus manual

CRATER Multilingual Aligned Annotated Corpus

Org:
Time:
Size: ca. 1 million words
Contents: trilingual (English, French and Spanish); telecommunications texts
Access: online access; the text files can also be downloaded via FTP
Notes: aligned at the sentence level; POS-tagged in all three languages

ECPC - English-Chinese Parallel Corpus

Org: Wang Lixun (University of Birmingham)
Time: miscellaneous
Size: unknown
Contents: complete texts, such as novels and essays
Access: free
Notes: more on the corpus design

ENPC - English-Norwegian Parallel Corpus

Org: University of Oslo, Norway
Time: completed in
Size: 100 original texts and 100 translated texts, amounting to some 2.6 million words in all
Contents: fictional and non-fictional texts
Access: restricted to the staff and students of the University of Oslo
Notes:

ESPC - The English-Swedish Parallel Corpus

Org: Bengt Altenberg; Karin Aijmer; Mikael Svensson
Time: project ran from 1993 to 2001
Size: 64 English text samples and translations; 72 Swedish text samples and translations; total corpus size 2.8 million words
Contents: "With few exceptions, the samples have been taken from texts published since 1980. Most major regional varieties of English are represented (British, American, Canadian, Irish, South African) but no attempt has been made to achieve a systematic or 'representative' distribution of these. Only written texts are represented. A number of prepared speeches have been included but they have their origin in writing and do not reflect genuine speech. Other categories that are missing in the corpus are, for example, newspaper text, private letters and business correspondence."
Access: restricted to researchers and students at the Universities of Lund and Göteborg
Notes: Corpus manual

IJS-ELAN Slovene-English Parallel Corpus

Org: Dept. of Intelligent Systems, Institute Jozef Stefan
Time:
Size: 1 million words from 15 parallel Slovene-English / English-Slovene texts
Contents:
Access: free; the corpus can be downloaded or accessed via an online concordancer
Notes: "the corpus is tokenised, sentence segmented and aligned; encoded as a translation memory in SGML TEI P3"

