This tutorial is a condensed version of The CQP User's Manual. Although some information from that manual is omitted here, this tutorial should give you enough to become comfortable with most of the cqp searches you are likely to need. The original version of this tutorial was written by Jeanette Pettibone, and can be found here.

Although cqp is quite a powerful tool for searching large corpora, it is not necessarily designed to be user friendly. It works in a Unix-style command-line environment. If you need more help with Unix itself, you can look at our GeneralUnixTutorial.

1. What is cqp?

cqp - or "Corpus Query Processor" - is a tool developed to retrieve information from large corpora. One advantage of cqp is its efficiency: you can retrieve a large amount of information in a short amount of time. Also, because part-of-speech tags are encoded in the corpora, you can search not only for a word but also for a word with a particular part of speech. For example, if you want to look for the noun "rain" (and not the verb), you can make that part of the query (see Specifying the part of speech). A further advantage is that you can set the size of the surrounding context - a few words on either side, the whole sentence - whatever is most appropriate for the type of search you are conducting. Once you have made a query, you can save the results as a subcorpus and run further queries on that subcorpus.

cqp is part of the IMS Corpus Workbench, developed at Universität Stuttgart.

2. The corpora

Here at SDSU we have three corpora encoded for use with cqp -- the Brown Corpus, the British National Corpus (BNC) and the Michigan Corpus of Academic Spoken English (MICASE).

3. Getting started

To tell the computer that you want to use cqp, you must first call the program.

% cqp

After you type this command and press the carriage return, the computer will let you know that you are connected to cqp and that it is waiting for you to tell it which corpus you want to search. This is the prompt:

[no corpus]>

Now you want to type in the name of the corpus.

[no corpus]> BROWN;

or

[no corpus]> show;

The first tells cqp that you want to search the Brown corpus. But if you forget how to call the corpus you want, you can ask cqp to show you the list of possible corpora.

Note: it is important that you type a semi-colon after every command in cqp. Because cqp lets you enter long queries, the semi-colon tells cqp that you have finished entering your query.

When a corpus is activated, the prompt will change:

BROWN>

4. Simple searches

4.1. Searching for a word

When you are simply searching for a word, you only need to type the word in double quotes followed by a semi-colon:

BROWN> "helpful";

The results will look as follows:

         15779: e electric gadget is most <helpful> when there are many crow
         18298: he border will be no_more <helpful> .  The material for comp
         40881: e seasons . It is usually <helpful> to make a sketch_map in 
         41023: s or verify opinions most <helpful> to the planning study of 
         44917: . And dancing school , so <helpful> in artistic and psycholo
         62851: samll firms . This may be <helpful> in improving the competi
        103284: . The reader will find it <helpful> to think_of the special
        141792: t for_granted , try to be <helpful> , but do n't ask questio
        209867: uled as the Martians were <helpful> . Part of the time saved

The results shown here use the default context. Changing the context is discussed later (see Changing context parameters of searches).

Note that within double quotes blanks and case are significant. If you do the same search above but with a capital letter, you get a different result:

BROWN> "Helpful";
0 matches.

5. Using regular expressions

5.1. Alternatives

cqp is case sensitive, so if you run a search on "the" you are only searching for lower-case "the" and not "The". However, this does not mean that you have to run separate searches to get both "the" and "The". Instead you can use the disjunction operator (|), which is like saying "the" or "The". For example:

BROWN> "the|The";

(Actually, a search on "the|The" would be pointless, because it is such a common word: almost every line of the corpus would contain a match.)

For other ways of expressing options, see groups and ranges.

5.2. Repetition: Kleene star and Kleene plus

You can use the Kleene star (*), which means "zero or more", or the Kleene plus (+), which means "one or more", when you want to indicate repetition. The Kleene star and plus are often used with the wildcard character (.), which can stand for any single character. For example, "help.*" matches any word that begins with "help", including "help" itself.
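
As a rough illustration of the difference between the two operators, here is a minimal Python sketch. It is an analogy, not cqp itself: it assumes that a cqp word pattern has to match the whole word, which Python's re.fullmatch imitates, and the word list and the helper name token_matches are made up for the example.

```python
import re

def token_matches(pattern, token):
    # Hypothetical helper: fullmatch requires the pattern to cover the whole
    # token, which is how a quoted word pattern is assumed to behave in cqp.
    return re.fullmatch(pattern, token) is not None

# Made-up word list, not taken from any corpus.
tokens = ["help", "helpful", "helps", "unhelpful", "he"]

# "help" followed by zero or more characters
star = [t for t in tokens if token_matches("help.*", t)]
# "help" followed by one or more characters
plus = [t for t in tokens if token_matches("help.+", t)]

print(star)
print(plus)
```

In this sketch the star pattern accepts the bare word "help", while the plus pattern requires at least one further character after it.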

5.3. Groups

We saw in Alternatives that you can use the disjunction operator to run a query on either "the" or "The" with "the|The". However, it is more economical to say that only the "t"/"T" alternates. We can show this with grouping:

BROWN> "(t|T)he";

Also, if you want to indicate repetition of more than just one character, you use grouping. If you want to search for "ab|abab|ababab|abababab" etc you can do the following:

BROWN> "(ab)+";

See also ranges for another way to express alternatives between single characters.

5.4. Ranges

If you want to include a range, whether of digits (e.g. 0-9) or of letters (e.g. a-z), use square brackets ([]).

BROWN> "[0-9]";
BROWN> "[a-z]";

We can also use square brackets as another way of expressing alternatives. Of the following examples, the first means "a|b|c|d" and the second is equivalent to "(t|T)he" as seen in groups or "the|The" as seen in alternatives.

BROWN> "[abcd]";
BROWN> "[tT]he";

5.5. Omissions

Putting "?" after a character makes that character optional. For example, in

BROWN> "walks?";

the "s" can be omitted, so the query matches both "walk" and "walks". This is equivalent to "walk|walks" (see alternatives).
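
The equivalences described in this section can be checked with a small Python sketch. Again this is only an analogy under the assumption that a pattern must cover the whole word (imitated with re.fullmatch); the word list and the helper same_matches are invented for the illustration and do not come from cqp.

```python
import re

def accepted(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

def same_matches(pattern_a, pattern_b, tokens):
    """Hypothetical helper: True if both patterns accept exactly the same tokens."""
    return accepted(pattern_a, tokens) == accepted(pattern_b, tokens)

# Made-up word list, not taken from any corpus.
tokens = ["the", "The", "THE", "then", "walk", "walks", "walked",
          "ab", "abab", "aba", "7", "x"]

print(same_matches("the|The", "(t|T)he", tokens))     # alternatives vs. group
print(same_matches("(t|T)he", "[tT]he", tokens))      # group vs. range
print(same_matches("walks?", "walk|walks", tokens))   # omission vs. alternatives
print(accepted("(ab)+", tokens))                      # repetition of a group
print(accepted("[0-9]", tokens))                      # a single digit
```

Each pair of patterns accepts the same words from the list, which is the sense in which the tutorial calls them different ways of writing the same query.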

6. Complex searches

Now we can combine what we know about simple searches and regular expressions to make more complex searches.

6.1. Searching for adjacent words

To search for two words, we can simply repeat a word search:

BROWN> "quite" "the";
BROWN> "give" "up";

If we want to do a search for "give up" or "gives up" or "given up" or "gave up" - we can do this a few different ways:

BROWN> ("give" "up")|("gives" "up")|("given" "up")|("gave" "up");
BROWN> "give|gives|given|gave" "up";
BROWN> "give[sn]?|gave" "up";

Remember to be careful about blanks - "give | gives" means "give " or " gives" with a space as part of the word.
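
The following Python sketch illustrates both points: that the long and short forms of the first word accept the same words, and that blanks around "|" become part of the alternatives. It is an analogy only, assuming whole-word matching (imitated with re.fullmatch); the word list is made up.

```python
import re

long_form = "give|gives|given|gave"
short_form = "give[sn]?|gave"

# Made-up test words; the last one ends in a blank on purpose.
tokens = ["give", "gives", "given", "gave", "giving", "give "]

long_hits = [t for t in tokens if re.fullmatch(long_form, t)]
short_hits = [t for t in tokens if re.fullmatch(short_form, t)]
print(long_hits)
print(short_hits)

# With blanks around "|", each blank belongs to one of the alternatives.
spaced = "give | gives"
plain_ok = re.fullmatch(spaced, "give") is not None
blank_ok = re.fullmatch(spaced, "give ") is not None
print(plain_ok, blank_ok)
```

The spaced pattern rejects the plain word "give" and only accepts "give" followed by a blank, which is why stray blanks inside the double quotes usually lead to empty results.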

6.2. Searching for non-adjacent words

Searching for a particular pair of words that are not necessarily adjacent is easy with cqp. We use a pair of empty square brackets to represent a possible word.

BROWN> "give" [] "up";
BROWN> "give" []* "up";
BROWN> "give" []{2} "up";
BROWN> "give" []{0,2} "up";

A single [] means that exactly one word stands between the target words, while []* uses the Kleene star to allow "zero or more" words between them. Curly brackets indicate either an exact number ({2} - exactly two words between) or a range ({0,2} - between zero and two words).

* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
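
To make the counting concrete, here is a minimal Python sketch of the idea behind [], []* and []{m,n}. It does not use cqp: it assumes a sentence is a string of words separated by single blanks, and the helper gap_match and its parameters are hypothetical names chosen for this example.

```python
import re

def gap_match(sentence, first, last, min_gap, max_gap=None):
    """Check for `first`, then min_gap..max_gap arbitrary words, then `last`.

    Hypothetical helper: words are assumed to be separated by single blanks,
    and max_gap=None means "no upper limit", as in {n,}.
    """
    upper = "" if max_gap is None else str(max_gap)
    any_word = r"(?:\S+ )"  # stands in for the empty brackets []
    pattern = (
        r"(?<!\S)" + re.escape(first) + " "
        + any_word + "{" + str(min_gap) + "," + upper + "}"
        + re.escape(last) + r"(?!\S)"
    )
    return re.search(pattern, sentence) is not None

one_word = gap_match("they give it up", "give", "up", 1, 1)      # like []
no_word = gap_match("they give up", "give", "up", 1, 1)          # [] needs one word
any_number = gap_match("give all of it up", "give", "up", 0)     # like []*
exactly_two = gap_match("give it all up", "give", "up", 2, 2)    # like []{2}
too_many = gap_match("give all of it up", "give", "up", 0, 2)    # like []{0,2}
print(one_word, no_word, any_number, exactly_two, too_many)
```

The last call fails because three words stand between "give" and "up", one more than the range {0,2} allows.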

6.3. More examples

BROWN> "in" ("the" "city" "of")? "Boston|Washington";

This means that "in" must be present, but "the city of" can be omitted (see omissions). Then we need either "Boston" or "Washington" (see alternatives).

But be careful of the double quotes. If you write this:

BROWN> "in" ("the" "city" "of")? "Boston"|"Washington";

cqp reads everything before the "|" as one alternative and "Washington" alone as the other, as if you had written:

BROWN> ("in" ("the" "city" "of")? "Boston" ) | ("Washington");

Another way to say the same thing as the first is:

BROWN> "in" ("the" "city" "of")? ("Boston"|"Washington");

but the first way is the shortest to write.

BROWN> "Clinton" "said" "in"?;

This matches "Clinton said" with or without a following "in", so it will capture both "Clinton said nothing" and "Clinton said in Columbus, Ohio..."

6.4. Specifying the part of speech

Because searching for words is the most used option of cqp, it is the default query. But in fact, the query

BROWN> "helpful";

is actually just an abbreviation for the query

BROWN> [word="helpful"];

We can also search by part of speech in the following way:

BROWN> [pos="DT"];

This will find all determiners. (To find out how each corpus labels the various parts of speech, see the tag sets.)

To look for a word as a particular part of speech, you can run a query specifying both the word and the part of speech in the following way:

BROWN> [word="rain" & pos="N.*"];

Note that we use "&" for logical "and"; we can also use "|" for logical "or" and "!" for logical "not". As the tag sets often have more than one noun tag, we use "N.*" to match any tag that begins with "N", that is, any kind of noun.
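
The way conditions on a word and its tag combine can be sketched in Python. This is a simplified model, not cqp: each token is a small dictionary with made-up "word" and "pos" values, the tags are illustrative rather than taken from a particular tag set, and the helpers has and select are hypothetical.

```python
import re

# Made-up sample tokens; the tags are illustrative only.
tokens = [
    {"word": "rain", "pos": "NN"},
    {"word": "rain", "pos": "VB"},
    {"word": "rains", "pos": "NNS"},
    {"word": "the", "pos": "DT"},
]

def has(attribute, pattern):
    """Build a test that checks one attribute of a token against a pattern."""
    return lambda token: re.fullmatch(pattern, token[attribute]) is not None

def select(test):
    """Return (word, tag) pairs for the tokens that pass the test."""
    return [(t["word"], t["pos"]) for t in tokens if test(t)]

is_rain = has("word", "rain")
is_noun = has("pos", "N.*")

# logical "and": the word is "rain" and the tag starts with "N"
rain_as_noun = select(lambda t: is_rain(t) and is_noun(t))
# logical "or": the word is "rain" or the tag starts with "N"
rain_or_noun = select(lambda t: is_rain(t) or is_noun(t))
# logical "not": the word is "rain" but the tag does not start with "N"
rain_not_noun = select(lambda t: is_rain(t) and not is_noun(t))

print(rain_as_noun)
print(rain_or_noun)
print(rain_not_noun)
```

The first selection corresponds to the query [word="rain" & pos="N.*"] above: only the token that is both spelled "rain" and tagged as a noun is kept.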

6.5. Sentence boundaries

When you want to specify that the search word is at the beginning or end of a sentence, you can use the symbols "<s>" (start of a sentence) and "</s>" (end of a sentence).

To look for a noun at the beginning of a sentence, the query is as follows:

BROWN> <s> [pos="N.*"];

For a noun at the end of a sentence:

BROWN> [pos="N.*"] </s>;

If you want to capture words on both sides of a sentence boundary, you can do the following:

BROWN> [pos="N.*"] [] <s> "It";

The catchall [] will capture the sentence-final punctuation (the period, for example).

7. Subcorpora

7.1. Creating subcorpora

Each query creates a subcorpus. This means that every time you run a search, the results are stored in a subcorpus, and by default cqp names this subcorpus "Last". If you do not save the results, "Last" will be overwritten by the new results the next time you run a search. However, if you want to keep this subcorpus and run searches on it, you can give it a name in two different ways: either by renaming a corpus (in this case your last query, "Last")

cqp>MyQuery = Last;  

or by running a "named" query

cqp>MyQuery = query;  

You can now see your subcorpus in the list of corpora in the system with the "show" command. Now that we know about subcorpora, you might want to know about three types of "show":

cqp>show system;   // displays the names of the system corpora  

cqp>show sub;       // displays the names of the user's subcorpora

cqp>show;            // displays both the system corpora and the subcorpora

7.2. Searching subcorpora

To run a search on your subcorpus, activate it by name, just as you activated the system corpus. If you cannot remember the name of your subcorpus, you can use the show command.

BROWN> MyQuery;
BROWN:MyQuery[9]>

After you type in the name of your subcorpus, the prompt will change, showing you that you are now searching on a subcorpus of the Brown corpus that has a certain number of lines - in this case, the subcorpus MyQuery has 9 lines.

If you want to do a search of the previous query without renaming it, you can do the same as above but with "Last" - which is the default name for the very last query done.

BROWN> Last;
BROWN:Last[9]>

8. Renaming a corpus

A corpus can be renamed by entering

> corpus-name-1 = corpus-name-2;

where corpus-name-1 is the new corpus and corpus-name-2 is an existing corpus which is to be copied. However, to avoid large-scale copying, cqp does not allow you to rename a system corpus in this way.

9. Displaying the contents of a corpus

You probably won't want to display the whole contents of a 1-million-word corpus, let alone a 100-million-word one, but you might want to display the contents of your subcorpus. There is an easy command for this, "cat":

BROWN> cat MyQuery;

This shows you the results of MyQuery. If you type

BROWN> cat;

with no name, cqp displays "Last", the default corpus, which holds your last query.

10. Changing context parameters of searches

10.1. Changing context parameters in a query

It is fairly easy to change the context of a query. For an individual query, you can use the restructure operator (expand to ... ) to specify the expansion of context as follows:

BROWN> SC = "helpful" expand to 1 s;

This expands the context of this particular search to one sentence ("s"). The next time you run a search, the context will go back to the default. To permanently change the context, see Changing the default context.

Whereas expand to adds context on both sides of the match, you can restrict the direction of context expansion with expand right to ... or expand left to ...

10.2. Changing context parameters with subcorpora

When changing the context of a subcorpus, you must store the result under a new name.

BROWN> SC = "helpful";
BROWN> SCS = SC expand to s;

SC is the name of the subcorpus with the results of the search "helpful"; we are now creating a new subcorpus named SCS that stores the results of the subcorpus SC when the context is expanded to the whole sentence ("s") in which the word is contained.

11. Ending your session

To end your session, you need only type

BROWN> exit;

None: CorpusQueryProcessor (last edited 2009-08-17 23:11:43 by localhost)