
Ling 571: Computational Corpus Linguistics (Fall 2007)

by Rob Malouf last modified 2008-05-01 14:13
Advances in technology have revolutionized the way linguists approach their data. With computers, extremely large bodies of text ("corpora") can be collected and analyzed at a level of detail that would have been unthinkable only a generation ago. For linguists and computer scientists alike, the accelerating growth of the World Wide Web and other natural language resources has made techniques for dealing with very large texts more important than ever. Through a combination of lectures, demonstrations, and hands-on exercises, this course will give students an introduction to the skills necessary for computer-aided text manipulation. Students will learn to construct and search text databases using Unix tools, to write Python programs that manipulate large natural language corpora, and to use statistical software to perform quantitative analysis of linguistic data.
Course type: Lecture / Lab
Instructor: Rob Malouf
Time: MWF 14:00–14:50
Location: BA 412

Requirements

The final grade will be based on homework assignments (30%), a midterm project (30%), and a final project (40%).

Throughout the term, there will be five hands-on homework assignments in which students apply the techniques learned in class to actual corpus materials. Because it is important not to fall behind, late assignments will be accepted for partial credit for only one week after the due date, unless prior arrangements are made. Working in groups is encouraged, but please include the names of all collaborators on the assignment.

The final project should be a program (with documentation) that performs a substantial corpus-processing task. Alternatively, the final project can be the collection and annotation of a new corpus. More details about both projects will be given later in the term.

Readings

There are two required paper textbooks for this course:

Geoffrey Sampson and Diana McCarthy (eds.) 2005. Corpus Linguistics: Readings in a Widening Discipline. Continuum International.

and

Jon Lasser. 2000. Think Unix. Pearson Education.

These books are available in the campus bookstore. We will also be using three online textbooks:

Martin Wynne (ed.) 2005. Developing Linguistic Corpora: A Guide to Good Practice. Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm

and

Alan Gauld. 2001. Learn to Program using Python. Addison-Wesley Professional. http://www.freenetpages.co.uk/hp/alan.gauld/

and

Allen Downey, Jeffrey Elkner, and Chris Meyers. 2007. How to Think Like a Computer Scientist: Learning with Python. Green Tea Press. http://ibiblio.org/obp/thinkCS/python/english2e/html/index.html

In addition, you might find it useful to have a comprehensive reference for Python or regular expressions, such as:

David M. Beazley. 2006. Python Essential Reference. Third Edition. New Riders.
or
Jeffrey Friedl. 2002. Mastering Regular Expressions. Second Edition. O’Reilly.
or
Alex Martelli, et al. 2005. Python Cookbook. Second Edition. O’Reilly.

These should be easy to find at local or online bookstores.

Additional readings will be made available in class or via the "Resources" section of the course web page.

Lab

For homework assignments and projects, we will use the computational linguistics lab, which is part of the Social Sciences Research Lab in the basement of the Professional Services and Fine Arts building. Information about how to use the lab will be made available before the first assignment.

Schedule

  • Week 1–2 Introduction
    Background · Why corpus linguistics? · What is a corpus? · Corpus types · Constructing corpora · Computational linguistics lab · Introduction to Unix
  • Week 3–4 Second generation corpora
    British National Corpus · Basic corpus tools · Multimedia corpora · Corpus design
  • Week 5–6 Python
    What is Python? · Basic Python programming · Tokenization · Python data structures
  • Week 7–9 Building corpus tools
    Counting · Dictionaries · Tokenization revisited · Concordancers · Stemming
  • Week 10–12 Quantitative linguistics
    Quantitative data analysis · Collocations and idioms · Text types and genre
  • Week 13–14 Annotation
    Tagging · Parsing · Treebanks · Unicode · XML
  • Week 15 Future prospects
    Very very large corpora · World Wide Web as a corpus · Bioinformatics · Computational linguistics

Prerequisites

None.
