Ling 571: Computational Corpus Linguistics (Fall 2007)
| Course type | Lecture / Lab |
|---|---|
| Instructor | Rob Malouf |
| Time | MWF 14:00–14:50 |
| Location | BA 412 |
Requirements
The final grade will be based on homework assignments (30%), a midterm project (30%), and a final project (40%).
Throughout the term, there will be five hands-on homework assignments in which students apply the techniques learned in class to actual corpus materials. Because it is important not to fall behind, late assignments will be accepted for partial credit for up to one week after the due date, unless prior arrangements are made. Working in groups is encouraged, but please include the names of all collaborators on the assignment.
The final project should be a program (with documentation) to perform some substantial corpus processing task. Alternatively, the final project can be the collection and annotation of a new corpus. More details about both projects will be given later in the term.
Readings
There are two required paper textbooks for this course:
Geoffrey Sampson and Diana McCarthy (eds.) 2005. Corpus Linguistics: Readings in a Widening Discipline. Continuum International.
and
Jon Lasser. 2000. Think Unix. Pearson Education.
These books are available in the campus bookstore. We will also be using three online textbooks:
Martin Wynne (ed.) 2005. Developing Linguistic Corpora: A Guide to Good Practice. Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm
and
Alan Gauld. 2001. Learn to Program using Python. Addison-Wesley Professional. http://www.freenetpages.co.uk/hp/alan.gauld/
and
Allen Downey, Jeffrey Elkner, and Chris Meyers. 2007. How to Think Like a Computer Scientist: Learning with Python. Green Tea Press. http://ibiblio.org/obp/thinkCS/python/english2e/html/index.html
In addition, you might find it useful to have a more comprehensive reference on Python or regular expressions, such as:
David M. Beazley. 2006. Python Essential Reference. Third Edition. New Riders.
or
Jeffrey Friedl. 2002. Mastering Regular Expressions. Second Edition. O’Reilly.
or
Alex Martelli, et al. 2005. Python Cookbook. Second Edition. O’Reilly.
These should be easy to find at local or online bookstores.
Additional readings will be made available in class or via the "Resources" section of the course web page.
Lab
For homework assignments and projects, we will be using the computational linguistics lab, part of the Social Sciences Research Lab in the basement of the Professional Services and Fine Arts building. Information about how to use the lab will be made available before the first assignment.
Schedule
- Week 1–2 Introduction
  Background · Why corpus linguistics? · What is a corpus? · Corpus types · Constructing corpora · Computational linguistics lab · Introduction to Unix
- Week 3–4 Second generation corpora
  British National Corpus · Basic corpus tools · Multimedia corpora · Corpus design
- Week 5–6 Python
  What is Python? · Basic Python programming · Tokenization · Python data structures
- Week 7–9 Building corpus tools
  Counting · Dictionaries · Tokenization revisited · Concordancers · Stemming
- Week 10–12 Quantitative linguistics
  Quantitative data analysis · Collocations and idioms · Text types and genre
- Week 13–14 Annotation
  Tagging · Parsing · Treebanks · Unicode · XML
- Week 15 Future prospects
  Very very large corpora · World Wide Web as a corpus · Bioinformatics · Computational linguistics