Corpora installed on lab machines

Corpora are installed in two directories, /usr/local/corpora and /home1/corpora:

The /usr/local/corpora directory is available only on the daughter machines (i.e. not bulba), whereas the /home1/corpora directory is available on all machines and is also accessible from the Windows machines at \\bulba\windows\corpora.
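
Because the two directories are not present on every machine, it can help to check what is visible from the machine you are logged in to. The following is a minimal sketch, not a lab-provided tool: the function name available_corpora is hypothetical, and it assumes each corpus appears as an entry directly under one of the two directories, which this page does not state.

```python
from pathlib import Path

# The two directories named on this page; which ones exist depends on the machine.
CORPUS_ROOTS = [Path("/usr/local/corpora"), Path("/home1/corpora")]


def available_corpora(roots=CORPUS_ROOTS):
    """Map each root directory that exists to the sorted names of its entries.

    Hypothetical helper: assumes corpora sit directly under each root.
    """
    found = {}
    for root in roots:
        if root.is_dir():  # skip roots that are missing on this machine
            found[str(root)] = sorted(entry.name for entry in root.iterdir())
    return found


if __name__ == "__main__":
    for root, names in available_corpora().items():
        print(root)
        for name in names:
            print("  " + name)
```

On a machine where neither directory exists, the script prints nothing.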

BritishNationalCorpus - 100-million-word corpus of spoken and written British English

BrownCorpus - Corpus of American English texts, compiled in the 1960s at Brown University

CHILDES - Child language database

ChinesePenn - Treebank of 325 articles from Xinhua newswire between 1994 and 1998 in GB encoding.

CmuDict - Machine-readable pronunciation dictionary

JapaneseNews - c. 30 million words of text from the Nihon Keizai Shimbun, 1994

MandarinNews - articles drawn from the People's Daily newspaper and the Xinhua newswire, formatted to include TREC document IDs

MandarinBroadcastNews - 30 hours of recorded broadcasts and transcripts

MICASE - Michigan Corpus of Academic Spoken English

MorphDb - XTAG morphology database

MultilanguageTdt3 - news data from late 1998 in Mandarin and English

NorthAmericanNews - composed of approximately 310 million words of news text

PennTreebank - contains 1989 Wall Street Journal text, some ATIS-3 material, the Brown Corpus, and the Switchboard Corpus

ReutersNews - the most widely used test collection for text categorization research

SpokenAmericanEnglish - Santa Barbara Corpus of Spoken American English Part I

SUSANNE - 130,000-word subset of the Brown Corpus, annotated in accordance with the SUSANNE scheme

TwentyNewsgroups - a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups

Tdt2Mandarin - Topic detection and tracking

WebKb - corpus of web pages for text classification

Corpora available on CD

ATIS0 - Air Travel Information System Corpus

CmuKidsSpeech

ContinuousSpeech - Continuous Speech Recognition Corpus II

FfmTimit - corpus of recorded read speech

ResourceManagement - Resource Management Continuous Speech Database

TipsterCorpus - Information Retrieval Collection

Corpora available on the Web

Online Corpora


None: LabCorpora (last edited 2009-08-17 23:11:43 by localhost)