Corpus Linguistics

You are now in section > Corpora > English Corpora > A - E

 

A - E

ACE
Bank of English
BNC
BROWN
CEECS
Christine
CIC
CLC
COLT
CPSA
ECI

ACE - Australian Corpus of English

Org:  Macquarie University, Sydney, Australia
Time: 1986

Size: 1 million words (500 samples of 2,000 words each)
Contents: written and spoken, multigeneric (15 different genres)
Access: available on the ICAME CD-ROM
Notes: modelled on BROWN and LOB for linguistic research

 

Bank of English - Cobuild

Org:  Cobuild and the University of Birmingham, UK (John Sinclair)
Time: majority of the material originates after 1990

Size: 415 million words in Oct 2000 (still growing)
Contents: written and spoken; multigeneric
Access: CobuildDirect allows restricted access to the corpus through a Java or Telnet interface; restricted concordance and collocation queries are possible
Notes:

BNC - British National Corpus

Org:  Industrial/academic consortium led by Oxford University Press
Time: completed in 1994; first release in 1995; second release in 2001

Size: over 100 million words (4,125 texts)
Contents: multigeneric; 90% written and 10% spoken materials
Access: licensed; a guest account is available via the SARA client at the BNC Online Service, and a simple search can be run at the BNC site
Notes: SGML markup according to the TEI guidelines; POS tagging carried out with CLAWS

 

BROWN University Corpus

Org:  Brown University, Rhode Island, U.S.
Time: 1960s

Size: ca. 1 million words
Contents: American written English; 500 text samples of approximately 2,000 words distributed over 15 text categories
Access: available on the ICAME CD-ROM
Notes:

CEECS - Corpus of Early English Correspondence Sampler

Org:  University of Helsinki, Finland
Time: 1418-1680

Size: approx. 450,000 words
Contents: click here for a list of included texts
Access: available on the ICAME CD-ROM
Notes: represents the non-copyrighted materials included in the Corpus of Early English Correspondence

 

CHRISTINE Corpus

Org:  Geoffrey Sampson, University of Essex, UK
Time: first distributed in August 2000

Size:
Contents: spoken English, particularly spontaneous, informal speech
Access: freely available for download
Notes: see also SUSANNE

CIC - Cambridge International Corpus

Org:  Cambridge University Press
Time: ongoing

Size: 300 million words and expanding
Contents: multigeneric; written and spoken British and American materials, learners' English
Access: "Currently, it can only be used by authors and writers working for Cambridge University Press and by members of staff at UCLES."
Notes: "Authors, editors and lexicographers use the CIC [...] when they are working on books for Cambridge University Press."

 

CLC - Cambridge Learner Corpus

Org:  Cambridge University Press and UCLES
Time: ongoing

Size: 10 million words and expanding
Contents: anonymised exam scripts written by students taking UCLES English exams around the world
Access: "Currently, it can only be used by authors and writers working for Cambridge University Press and by members of staff at UCLES."
Notes: It forms part of the Cambridge International Corpus

COLT - Bergen Corpus of London Teenage Language

Org:  University of Bergen, Norway
Time: material collected in 1993

Size: 500,000 words; the pilot version consists of 151 texts
Contents: transcripts of spoken 'London Teenage Language'
Access: the pilot version can be searched online; registered users can search the entire corpus online; COLT is also available on the ICAME CD-ROM
Notes: COLT is part of the BNC; it is tagged for word classes

 

CPSA  - Corpus of Spoken Professional American English

Org:  Contact: Michael Barlow
Time: 1994-1998

Size: 2 main sub-corpora, 1 million words each
Contents: short interchanges by 400 speakers; professional activities broadly tied to academia and politics
Access: registered users only ($79 for an individual user of the tagged version)
Notes: The tagging was performed by Tony McEnery and Paul Baker using the CLAWS programme at UCREL, Lancaster University; available both tagged and untagged

ECI Corpus

Org:  ELSNET
Time: materials collected between 1984 and 1993

Size: four different corpora ranging from 4 to 34 million words
Contents: German, French and Dutch newspaper texts; parallel texts in English, Spanish and French
Access: available on CD-ROM for €50, for research purposes only
Notes:  

 



webmaster@corpus-linguistics.de