This tutorial is a condensed version of the CQP User's Manual. Although some information from that manual is omitted here, this tutorial should give you enough to get comfortable quickly with most of the cqp searches you will want to run. The original version of this tutorial was written by Jeanette Pettibone.
Although cqp is quite a powerful tool for doing searches on large corpora, it is not necessarily designed to be user friendly. It works in a unix-style command line environment. If you need more help with unix itself, you can look at our GeneralUnixTutorial.
1. What is cqp?
cqp, or "Corpus Query Processor", is a tool developed to retrieve information from large corpora. One advantage of cqp is efficiency: you can get a large amount of information in a short amount of time. Because part-of-speech tags are encoded in the corpora, you can search not only for a word but also for a word with a particular part of speech. For example, if you want to look for the noun "rain" (and not the verb), you can make that part of the query (see Specifying the part of speech). A further advantage is that you can set the size of the surrounding context (a few words on either side, the whole sentence, or whatever is most appropriate for the type of search you are conducting). Once you have made a query, you can save the results as a subcorpus and run further queries on that subcorpus.
cqp is part of the IMS Corpus Workbench, developed at Universität Stuttgart.
2. The corpora
Here at SDSU we have three corpora encoded for use with cqp -- the Brown Corpus, the British National Corpus (BNC) and the Michigan Corpus of Academic Spoken English (MICASE).
== The Brown corpus ==
The BrownCorpus has about one million words - so it's a relatively small corpus by today's standards. Brown was put together at Brown University in the 60s. It is a balanced corpus of American English - composed of texts from different genres: newspaper, scientific, legal, academic, novels,...
For more information, see the BrownCorpus page.
The Brown tag set
== The BNC ==
The BritishNationalCorpus, at over 100 million words, is one of the largest corpora available today. It is a balanced corpus of British English.
For more information, see the BritishNationalCorpus page.
The BNC tag set
We have two versions of the BNC, the "little BNC" and the later-released "BNC-World". Both are accessible on the daughter machines, but only the "little BNC" is available on bulba right now.
== MICASE ==
MICASE consists of 1.7 million words of classroom lectures and other spoken language collected at the University of Michigan. It is not tagged.
For more information, see the MICASE page.
3. Getting started
To tell the computer that you want to use cqp, you must first call the program.
% cqp
After you type in this command and hit the carriage return, the computer will let you know that you are connected to cqp and that it is waiting for you to tell it which corpus you want to search. This is the prompt:
[no corpus]>
Now you want to type in the name of the corpus.
[no corpus]> BROWN;
or
[no corpus]> show;
The first tells cqp that you want to search the Brown corpus. If you forget the name of the corpus you want, the second asks cqp to show you the list of available corpora.
Note: it is important that you type a semi-colon after every command in cqp. Because cqp lets you enter large queries, the semi-colon tells cqp that you have finished entering your query.
When a corpus is activated, the prompt will change:
BROWN>
4. Simple searches
4.1. Searching for a word
When you are simply searching for a word, you only need to type the word in double quotes followed by a semi-colon:
BROWN> "helpful";
The results will look as follows:
15779: e electric gadget is most <helpful> when there are many crow
18298: he border will be no_more <helpful> . The material for comp
40881: e seasons . It is usually <helpful> to make a sketch_map in
41023: s or verify opinions most <helpful> to the planning study of
44917: . And dancing school , so <helpful> in artistic and psycholo
62851: samll firms . This may be <helpful> in improving the competi
103284: . The reader will find it <helpful> to think_of the special
141792: t for_granted , try to be <helpful> , but do n't ask questio
209867: uled as the Martians were <helpful> . Part of the time saved
The results shown here use the default context. Changing the context is discussed later.
Note that within double quotes, blanks and case are significant. If you do the same search as above but with a capital letter, you get a different result:
BROWN> "Helpful";
0 matches.
5. Using regular expressions
5.1. Alternatives
cqp is case sensitive, so if you run a search on "the" you are only searching for lower-case "the" and not "The". However, this does not mean that you have to do separate searches to get both "the" and "The". Instead you can use the disjunction operator (|). Using the operator is like saying "the" or "The". For example:
BROWN> "the|The";
(Actually, running a search on "the|The" would be pointless, because it is such a common word: almost every line in the corpus would return a result.)
For other ways of expressing options, see groups and ranges.
5.2. Repetition: Kleene star and Kleene plus
You can use the Kleene star (which means "zero or more") or the Kleene plus (which means "one or more") when you want to indicate repetition. The Kleene star and plus are often used with the wildcard character (.), which can stand for any character.
=== Kleene star ===
Here is an example of how the Kleene star (*) is used:
BROWN> "walk.*";
Here, the "." stands for any single character and the "*" means zero or more times. Put together, this query matches "walk", "walks", "walked" and "walking", as well as any other word that begins with "walk" (for example "walker").
=== Kleene plus ===
Here is an example of how the Kleene plus (+) is used:
BROWN> "walk.+";
Here, the "." stands for any single character and the "+" means one or more times. Put together, this query matches "walks", "walked", "walking" and any other longer word that begins with "walk". Because the plus means ONE or more, "walk" itself is not a result of this query.
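As a rough check on the difference between the two operators, here is a small Python sketch. It assumes that Python's re module can stand in for cqp's regular expression engine (the two agree on these simple patterns, but they are not the same engine), and the word list is made up rather than taken from corpus output. re.fullmatch is used because cqp matches a pattern against the whole word, not a part of it.

```python
import re

# Made-up word list standing in for corpus tokens (an assumption of this sketch).
words = ["walk", "walks", "walked", "walking", "walker", "sidewalk", "Walk"]


def matching(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


star = matching(r"walk.*", words)  # zero or more characters after "walk"
plus = matching(r"walk.+", words)  # at least one character after "walk"

print("walk.* ->", star)
print("walk.+ ->", plus)
```

The only difference between the two result lists is the bare word "walk". "sidewalk" and "Walk" are rejected by both patterns because matching is anchored to the whole word and is case sensitive.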
5.3. Groups
We saw in Alternatives that you can use the disjunction operator to run a query on either "the" or "The" with "the|The". However, it is more economical to say that only the "t/T" is being alternated. We can show this with grouping:
BROWN> "(t|T)he";
Also, if you want to indicate repetition of more than just one character, you use grouping. If you want to search for "ab|abab|ababab|abababab" etc you can do the following:
BROWN> "(ab)+";
See also Ranges for another way to express alternation.
5.4. Ranges
If you want to include a range, whether of numbers (e.g. from 0-9) or of letters (from a-z), you use square brackets ([]).
BROWN> "[0-9]";
BROWN> "[a-z]";
Also, we can use the square brackets as another way of expressing alternations. Of the following examples, the first one means "a|b|c|d" and the second is either "(t|T)he" as seen in groups or "the|The" as seen in alternations.
BROWN> "[abcd]";
BROWN> "[tT]he";
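The equivalence between alternatives, groups and ranges can be checked with a short Python sketch. It assumes Python's re module as a stand-in for cqp's regular expressions (they behave the same for these simple patterns, but are different engines), and the token list is invented for illustration.

```python
import re

# Invented tokens for illustration; not taken from any corpus.
tokens = ["the", "The", "THE", "then", "he", "ab", "abab", "aba", "7", "x"]


def matching(pattern, items):
    """Keep the items that the pattern matches in full (whole-token matching)."""
    return [t for t in items if re.fullmatch(pattern, t)]


# Three spellings of the same alternation.
the_variants = {p: matching(p, tokens) for p in ["the|The", "(t|T)he", "[tT]he"]}

# Repetition of a group, and two single-character ranges.
repeated_ab = matching(r"(ab)+", tokens)
digits = matching(r"[0-9]", tokens)
lowercase_letters = matching(r"[a-z]", tokens)

for pattern, hits in the_variants.items():
    print(pattern, "->", hits)
print(repeated_ab, digits, lowercase_letters)
```

All three spellings select the same two tokens. Note that "[a-z]" on its own matches only one-letter tokens, because a range stands for exactly one character.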
5.5. Omissions
Using "?" after a character means that the character can be omitted from the query. For example, in
BROWN> "walks?";
the "s" can be omitted. This is the same as saying "walk|walks" (see Alternatives).
6. Complex searches
Now we can combine what we know about simple searches and regular expressions to make more complex searches.
6.1. Searching for adjacent words
To search for two words, we can simply repeat a word search:
BROWN> "quite" "the";
BROWN> "give" "up";
If we want to do a search for "give up" or "gives up" or "given up" or "gave up" - we can do this a few different ways:
BROWN> ("give" "up")|("gives" "up")|("given" "up")|("gave" "up");
BROWN> "give|gives|given|gave" "up";
BROWN> "give[sn]?|gave" "up";
Remember to be careful about blanks: "give | gives" means "give " or " gives", with a space as part of the word.
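To see that the compact pattern covers exactly the same verb forms as the spelled-out one, and what a stray blank does, here is a small Python sketch. It assumes Python's re module as an approximation of cqp's regular expressions, and the list of forms is invented; the entry with a trailing blank is there only to show the effect and would not normally occur as a corpus token.

```python
import re

# Invented test forms; "give " (with a trailing blank) is deliberately artificial.
forms = ["give", "gives", "given", "gave", "giving", "gaves", "give "]

compact = [w for w in forms if re.fullmatch(r"give[sn]?|gave", w)]
spelled_out = [w for w in forms if re.fullmatch(r"give|gives|given|gave", w)]

# A blank inside the pattern is matched literally, so the real words no longer match.
with_blanks = [w for w in forms if re.fullmatch(r"give | gives", w)]

print(compact)
print(spelled_out)
print(with_blanks)
```

The first two lists are identical, while the pattern with blanks matches only the artificial entry that itself contains a blank.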
6.2. Searching for non-adjacent words
Searching for a particular pair of words that are not necessarily adjacent is easy with cqp. We use a pair of empty square brackets to represent a possible word.
BROWN> "give" [] "up";
BROWN> "give" []* "up";
BROWN> "give" []{2} "up";
BROWN> "give" []{0,2} "up";
The first query uses [] to require exactly one word between the target words. The second uses []* (the Kleene star) to allow zero or more words between them. The curly brackets indicate either an exact number ({2}: exactly two words between) or a range ({0,2}: between zero and two words).
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
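The gap quantifiers can be pictured with a short Python sketch that works on a list of words instead of characters. The function find_with_gap and the example sentence are hypothetical and only illustrate the idea; in particular, the sketch reports every qualifying pair, whereas cqp applies its own matching strategy to decide which matches it returns.

```python
def find_with_gap(tokens, first, last, lo, hi):
    """Return (start, end) index pairs where `first` is followed by `last`
    with between lo and hi arbitrary tokens in between (inclusive)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for gap in range(lo, hi + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == last:
                hits.append((i, j))
    return hits


# Invented example sentence, already split into words.
sentence = "they give up and we give it all up".split()

one_word_between = find_with_gap(sentence, "give", "up", 1, 1)    # like []
two_words_between = find_with_gap(sentence, "give", "up", 2, 2)   # like []{2}
up_to_two_between = find_with_gap(sentence, "give", "up", 0, 2)   # like []{0,2}

print(one_word_between, two_words_between, up_to_two_between)
```

In this sentence "give up" is adjacent once and separated by two words once, so the single [] finds nothing, {2} finds the second occurrence, and {0,2} finds both.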
6.3. More examples
BROWN> "in" ("the" "city" "of")? "Boston|Washington";
This means that "in" must be present, but "the city of" can be omitted (see Omissions). Then we need either "Boston" or "Washington" (see Alternatives).
But be careful with the double quotes. If you write this:
BROWN> "in" ("the" "city" "of")? "Boston"|"Washington";
the alternation applies to the whole preceding sequence, so "Washington" on its own becomes one of the two alternatives, as if you had written:
BROWN> ("in" ("the" "city" "of")? "Boston" ) | ("Washington");
Another way to say the same thing as the first query is:
BROWN> "in" ("the" "city" "of")? ("Boston"|"Washington");
but the first way might be the most efficient.
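BROWN> "Clinton" "said" "in"?;
This matches "Clinton said" whether or not "in" follows, so it captures both "Clinton said nothing" (matching "Clinton said") and "Clinton said in Columbus, Ohio..." (matching "Clinton said in").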
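The quoting pitfall with "Boston"|"Washington" can be imitated in Python by flattening each word sequence into one space-separated string. This is only an analogy under that assumption: cqp works on sequences of tokens, not on strings, and the phrases below are invented. What carries over is the scope of the | operator.

```python
import re

# Word sequences flattened into strings (an assumption made for this analogy).
grouped = r"in (the city of )?(Boston|Washington)"
ungrouped = r"in (the city of )?Boston|Washington"

phrases = ["in Boston", "in the city of Washington", "in Washington", "Washington"]

grouped_hits = [p for p in phrases if re.fullmatch(grouped, p)]
ungrouped_hits = [p for p in phrases if re.fullmatch(ungrouped, p)]

print(grouped_hits)
print(ungrouped_hits)
```

With the alternation grouped, every phrase must start with "in". Without the grouping, the pattern splits into "in ... Boston" on one side and a bare "Washington" on the other, so "in Washington" is lost and "Washington" alone is accepted.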
6.4. Specifying the part of speech
Because searching for words is the most used option of cqp, it is the default query. But in fact, the query
BROWN> "helpful";
is actually just an abbreviation for the query
BROWN> [word="helpful"];
We can also search by part of speech in the following way:
BROWN> [pos="DT"];
This will find all determiners. (To find out how each corpus labels the various parts of speech, see the tag sets.)
To look for a word as a particular part of speech, you can run a query specifying both the word and the part of speech in the following way:
BROWN> [word="rain" & pos="N.*"];
Note that we use "&" for logical "and". We can also use "|" for logical "or" and "!" for logical "not". As the tag sets often have more than one type of noun, we use "N.*" to say that the part of speech can be any tag that begins with "N".
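The way a query such as [word="rain" & pos="N.*"] filters tokens can be sketched in Python. The select function, the sample tokens and their tags are all hypothetical (the tags are illustrative and not guaranteed to match any corpus's tag set); the sketch only shows that both conditions are tested against the same token and that each value is treated as a regular expression over the whole attribute.

```python
import re

# Hypothetical (word, tag) pairs; the tags are for illustration only.
tagged = [
    ("The", "AT"), ("rain", "NN"), ("fell", "VBD"),
    ("when", "WRB"), ("clouds", "NNS"), ("rain", "VB"),
]


def select(tokens, word=None, pos=None):
    """Return the positions of tokens whose word and tag both match in full.
    A condition left as None is not checked."""
    hits = []
    for i, (w, p) in enumerate(tokens):
        word_ok = word is None or re.fullmatch(word, w) is not None
        pos_ok = pos is None or re.fullmatch(pos, p) is not None
        if word_ok and pos_ok:
            hits.append(i)
    return hits


noun_rain = select(tagged, word="rain", pos="N.*")  # word AND part of speech
any_rain = select(tagged, word="rain")              # word only
any_noun = select(tagged, pos="N.*")                # part of speech only

print(noun_rain, any_rain, any_noun)
```

Only the first "rain" satisfies both conditions; the second one is tagged as a verb in this made-up sample and is filtered out.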
6.5. Sentence boundaries
When you want to specify that the search word is at the beginning or end of a sentence, you can use the symbols "<s>" (start of a sentence) and "</s>" (end of a sentence).
To look for a noun at the beginning of a sentence, the query is as follows:
BROWN> <s> [pos="N.*"];
For a noun at the end of a sentence:
BROWN> [pos="N.*"] </s>;
If you want to capture words on both sides of a sentence boundary, you can do the following:
BROWN> [pos="N.*"] [] <s> "It";
The catch-all [] will capture the sentence-final punctuation (the period, for example).
7. Subcorpora
7.1. Creating subcorpora
Each query creates a subcorpus. Every time you run a search, the results are stored in a subcorpus, which cqp names "Last" by default. If you do not save the results, "Last" will be overwritten by the results of your next search. If you want to keep this subcorpus and run searches on it, you can name it in two different ways: either by copying your last query result ("Last") under a new name
cqp>MyQuery = Last;
or by running a "named" query
cqp>MyQuery = query;
You can now see your subcorpus in the list of corpora in the system with the "show" command. Now that we know about subcorpora, you might want to know about three types of "show":
cqp>show system; // displays the names of the system corpora
cqp>show sub; // displays the names of the user's subcorpora
cqp>show; // displays both the system corpora and the subcorpora
7.2. Searching subcorpora
To run a search on your subcorpus, call it by name, just as you called the system corpus. If you cannot remember the name of your subcorpus, you can use the show command.
BROWN> MyQuery;
BROWN:MyQuery[9]>
After you type in the name of your subcorpus, the prompt will change, showing you that you are now searching on a subcorpus of the Brown corpus that has a certain number of lines - in this case, the subcorpus MyQuery has 9 lines.
If you want to do a search of the previous query without renaming it, you can do the same as above but with "Last" - which is the default name for the very last query done.
BROWN> Last;
BROWN:Last[9]>
8. Renaming a corpus
A corpus can be renamed by entering
> corpus-name-1 = corpus-name-2;
where corpus-name-1 is a new corpus and corpus-name-2 is an existing corpus which is to be copied. However, to avoid large-scale copying, cqp does not allow you to rename a system corpus.
9. Displaying the contents of a corpus
You probably won't want to display the whole contents of a 1 million word corpus, let alone a 100 million word one, but you might want to display the contents of your subcorpus. There is an easy command for this: "cat".
BROWN> cat MyQuery;
will show you the result of MyQuery. If you type
BROWN> cat;
without a name, cqp uses the default corpus, "Last", so this will display your last query.
10. Changing context parameters of searches
10.1. Changing context parameters in a query
It is fairly easy to change the context of a query. For an individual query, you can use the restructure operator (expand to ... ) to specify the expansion of context as follows:
BROWN> SC = "helpful" expand to 1 s;
This will expand the context of this particular search to 1 sentence ("s"). The next time you run a search, the context will go back to the default. To permanently change the context, see Changing the default context. Different ways to expand are:
expand to number: adds that number of corpus positions to each interval of a given corpus.
expand to structure-name: expands the intervals of a corpus to the boundaries of the structure structure-name.
expand to number structure-name: expands the intervals of the given corpus to the boundaries of the following (right context) and previous (left context) number structural annotations called structure-name.
Whereas expand to adds context to both sides of the query, you can restrict the direction of context expansion with expand right to ... or expand left to ...
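What the expand options do to a match can be pictured as arithmetic on (start, end) corpus positions. The Python sketch below is only an illustration: the function names, the sentence boundaries and the clamping at the edges of the corpus are assumptions of the sketch, not a description of how cqp is implemented.

```python
def expand_by_positions(interval, n, corpus_size, direction="both"):
    """Widen a (start, end) match interval by n corpus positions.
    direction is "both", "left" or "right". Clamping to the corpus
    bounds is an assumption of this sketch."""
    start, end = interval
    if direction in ("both", "left"):
        start = max(0, start - n)
    if direction in ("both", "right"):
        end = min(corpus_size - 1, end + n)
    return (start, end)


def expand_to_structure(interval, regions):
    """Widen a match interval to the boundaries of the regions (for
    example sentences) that contain its first and last positions."""
    start, end = interval
    new_start, new_end = start, end
    for region_start, region_end in regions:
        if region_start <= start <= region_end:
            new_start = region_start
        if region_start <= end <= region_end:
            new_end = region_end
    return (new_start, new_end)


# Hypothetical 21-position corpus divided into three sentences.
sentences = [(0, 4), (5, 11), (12, 20)]
match = (7, 7)  # a one-word match at position 7

both_sides = expand_by_positions(match, 2, 21)            # like: expand to 2
left_only = expand_by_positions(match, 2, 21, "left")     # like: expand left to 2
whole_sentence = expand_to_structure(match, sentences)    # like: expand to s

print(both_sides, left_only, whole_sentence)
```

Expanding by a number moves the boundaries by a fixed distance, while expanding to a structure snaps them to the edges of the enclosing sentence, whatever its length.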
10.2. Changing context parameters with subcorpora
When changing the context of a subcorpus, you must assign the result to a new named subcorpus.
BROWN> SC = "helpful";
BROWN> SCS = SC expand to s;
SC is the name of the subcorpus with the results of the search "helpful"; we are now creating a new subcorpus named SCS that stores the results of the subcorpus SC when the context is expanded to the whole sentence ("s") in which the word is contained.
11. Ending your session
To end your session, you need only type
BROWN> exit;