Corpora installed on lab machines
Corpora are installed in two directories:
- /usr/local/corpora
- /home1/corpora
The /usr/local/corpora directory is available only on the daughter machines (i.e. not on bulba). The /home1/corpora directory is available on all machines and is also accessible from the Windows machines at \\bulba\windows\corpora.
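Because only one of the two directories may exist on a given machine, a script can check both before opening a corpus. The following is a minimal sketch, not a lab-provided tool. It assumes each corpus sits in its own subdirectory directly under one of the two roots. The function names find_corpus and list_corpora are hypothetical, and the on-disk subdirectory names are not given on this page, so they may differ from the names listed below.

```python
from pathlib import Path

# The two install locations named above. /usr/local/corpora is only
# present on the daughter machines, so it may be missing on bulba.
CORPUS_ROOTS = [Path("/usr/local/corpora"), Path("/home1/corpora")]


def find_corpus(name, roots=CORPUS_ROOTS):
    """Return the first existing directory called `name` under a root, or None.

    Assumption: each corpus lives in its own subdirectory of a root.
    """
    for root in roots:
        candidate = root / name
        if candidate.is_dir():
            return candidate
    return None


def list_corpora(roots=CORPUS_ROOTS):
    """Return the sorted names of all subdirectories under the roots that exist."""
    names = set()
    for root in roots:
        if root.is_dir():  # skip roots that are absent on this machine
            names.update(p.name for p in root.iterdir() if p.is_dir())
    return sorted(names)
```

Calling list_corpora() on a lab machine shows what is actually installed there, and find_corpus(name) returns None instead of raising an error when a corpus is absent, so a script can fall back cleanly.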
BritishNationalCorpus - 100 million word corpus of spoken and written British English
BrownCorpus - One-million-word corpus of written American English, compiled in the 1960s at Brown University
CHILDES - Child language database
ChinesePenn - Treebank of 325 articles from Xinhua newswire between 1994 and 1998 in GB encoding.
CmuDict - Machine-readable pronunciation dictionary
JapaneseNews - approximately 30 million words of text from the Nihon Keizai Shimbun, 1994
MandarinNews - articles drawn from the People's Daily newspaper and the Xinhua newswire, formatted to include TREC document IDs
MandarinBroadcastNews - 30 hours of recorded broadcasts and transcripts
MICASE - Michigan Corpus of Academic Spoken English
MorphDb - XTAG morphology database
MultilanguageTdt3 - news data from late 1998 in Mandarin and English
NorthAmericanNews - composed of approximately 310 million words of news text
PennTreebank - contains 1989 Wall Street Journal material, some ATIS-3 material, the Brown Corpus and the Switchboard Corpus
ReutersNews - the most widely used test collection for text categorization research
SpokenAmericanEnglish - Santa Barbara Corpus of Spoken American English Part I
SUSANNE - 130,000-word subset of the Brown Corpus, annotated in accordance with the SUSANNE scheme
TwentyNewsgroups - a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups
Tdt2Mandarin - Topic detection and tracking
WebKb - corpus of web pages for text classification
Corpora available on CD
ATIS0 - Air Travel Information System Corpus
ContinuousSpeech - Continuous Speech Recognition Corpus II
FfmTimit - corpus of recorded read speech
ResourceManagement - Resource Management Continuous Speech Database
TipsterCorpus - Information Retrieval Collection
Corpora available on the Web