This tutorial is a condensed version of "The CQP User's Manual", which is available with the cqp software. Though some information from that manual is omitted here, this tutorial should still give you enough to get comfortable quickly with most of the cqp searches that you'd want to do.
For help logging on to a computer in the Computational Linguistics Lab, please see the guide to logging in.
Although cqp is quite a powerful tool for doing searches on large corpora, it is not necessarily designed to be user friendly. It works in a unix-type environment. If you need more help with unix itself, you can look at our Unix Tutorial.
To read the original document that was used to make this tutorial, go to The CQP User's Manual.
For administration information, see the Corpus Administrator's Manual.
cqp - or "Corpus Query Processor" - is a tool developed to retrieve information from large corpora.
One of the advantages of using cqp is that it is efficient. You can get a large amount of information in a short amount of time. Also, because part of speech tags are encoded in the corpora, you can search not only for a word but for a word with a particular part of speech. For example, if you want to look for the noun "rain" (and not the verb), you can make that part of the query (see Specifying a part of speech). A further advantage is that you can set the size of the surrounding context - a few words on either side, the whole sentence - whatever is most appropriate for the type of search you are conducting. Once you have made a query, you can save the results as a subcorpus and run further queries on that subcorpus.
cqp is part of the IMS Corpus Workbench, developed at Universität Stuttgart.
Here at SDSU we have two corpora encoded for use with cqp.
The Brown corpus has about one million words - so it's a relatively small corpus by today's standards. Brown was put together at Brown University in the 1960s. It is a balanced corpus of American English, composed of texts from different genres: newspaper, scientific, legal, academic, novels, and so on.
For more information, go to the Brown Corpus Manual.
Also see the tag set.
The British National Corpus, at over 100 million words, is one of the largest corpora available today. It is a balanced corpus of British English. The BNC has both part of speech tags and stem information in it.
For more information, see what is the BNC.
Also see the tag set.
To tell the computer that you want to use cqp, you must first call the program.
% cqp
After you type in this command and hit the carriage return, the computer will let you know that you are connected to cqp and that it is waiting for you to tell it which corpus you want to search. This is the prompt:
[no corpus]>
Now you want to type in the name of the corpus.
[no corpus]> BROWN;
or
[no corpus]> show;
The first tells cqp that you want to search the Brown corpus. But if you forget what the corpus you want is called, you can ask cqp to show you the list of available corpora.
Note: it is important that you type a semi-colon after every command in cqp. Because cqp lets you enter large queries, the semi-colon tells cqp that you are done adding information and your query is finished.
When a corpus is activated, the prompt will change:
BROWN>
When you are simply searching for a word, you only need to type the word in double quotes followed by a semi-colon:
BROWN> "helpful";
The results will look as follows:
15779: e electric gadget is most <helpful> when there are many crow
18298: he border will be no_more <helpful> . The material for comp
40881: e seasons . It is usually <helpful> to make a sketch_map in
41023: s or verify opinions most <helpful> to the planning study of
44917: . And dancing school , so <helpful> in artistic and psycholo
62851: samll firms . This may be <helpful> in improving the competi
103284: . The reader will find it <helpful> to think_of the special
141792: t for_granted , try to be <helpful> , but do n't ask questio
209867: uled as the Martians were <helpful> . Part of the time saved
Notice that the results only include the word "helpful" and no other variations of the word. The double quotes stand for word boundaries, so you will only get the word you searched for without any additional morphology. For example, if you were to do a search for "respective", you would not find examples of "respectively", and vice versa. Look in Using regular expressions to learn about searching for both words simultaneously.
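The whole-word behavior can be imitated outside cqp. The following is only a rough analogy in Python, not how cqp is implemented: it assumes the corpus is a plain list of tokens (the `tokens` list and the `match_word` helper are made up for illustration) and uses `re.fullmatch` so that a pattern has to cover the entire token.

```python
import re

# Hypothetical toy "corpus": one string per token.
tokens = ["their", "respective", "roles", "were", "respectively",
          "helpful", "and", "unhelpful"]


def match_word(pattern, tokens):
    """Return positions of tokens that the pattern matches in full,
    mimicking a quoted cqp pattern that must cover the whole word."""
    regex = re.compile(pattern)
    return [i for i, tok in enumerate(tokens) if regex.fullmatch(tok)]


hits_respective = match_word("respective", tokens)
hits_helpful = match_word("helpful", tokens)

print(hits_respective)  # only the position of "respective"
print(hits_helpful)     # only the position of "helpful", not "unhelpful"
```

Note that "respectively" and "unhelpful" are skipped even though they contain the search string, which is the same effect the double quotes have in cqp.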
The word that you searched for is often called the "key word in context" - KWIC for short. The context is the amount of data shown on either side of the key word. In this case, we see the default context for cqp - 26 characters on either side of the key word. A character is any letter, number, piece of punctuation, the space, etc. Changing the size of the context will be discussed later.
Note that within double quotes, blanks and case are significant. If you do the same search as above but with a capital letter, you get a different result:
BROWN> "Helpful";
0 matches.
If your search returns many results and cqp has to print out multiple pages, you will see a prompt in the lower left corner. It looks like a colon (:) and it is just cqp's way of asking you whether you want to go on or not.
:
When you see this colon, some options are:
cqp is case sensitive, so if you run a search on "the" you are only searching for lower-case "the" and not "The". However, this does not mean that to get both "the" and "The" you have to do separate searches. Instead you can use the disjunction operator (|) - using the operator is like saying "the" or "The". For example:
BROWN> "the|The";
(Actually, doing a search on "the|The" would be pointless, because it is such a common word. Almost every line in the corpus would return a result if you ran this query.)
For other ways of expressing options, see groups and ranges.
You can use the Kleene star (which means "zero or more") or the Kleene plus (which means "one or more") when you want to indicate repetition. The Kleene star and plus are often used with the wildcard character (.), which can stand for any character.
Here is an example of how the Kleene star (*) is used:
BROWN> "walk.*";
Here, the "." is a wildcard character which stands for any single character and the "*" means zero or more times. So there could be any character or characters after the specified "walk" until cqp detects a word boundary. Put together, the query above is the same as saying "walk|walks|walked|walking|walker" and so on.
Here is an example of how the Kleene plus (+) is used:
BROWN> "walk.+";
Here, the "." stands for a single character and the "+" means one or more times. Put together, this query is the same as saying "walks|walked|walking" and so on. Because the plus means ONE or more, "walk" is not a result of this query.
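The difference between the star and the plus is easy to check with any regular expression engine. The Python sketch below is an analogy only: it assumes a made-up token list and uses `re.fullmatch` to stand in for cqp's whole-word matching.

```python
import re

# Hypothetical tokens; "sidewalk" and "talk" are included as non-matches.
tokens = ["walk", "walks", "walked", "walking", "walker", "sidewalk", "talk"]

# ".*" allows zero or more extra characters after "walk".
star = [t for t in tokens if re.fullmatch(r"walk.*", t)]

# ".+" requires at least one extra character, so bare "walk" drops out.
plus = [t for t in tokens if re.fullmatch(r"walk.+", t)]

print(star)
print(plus)
```

"sidewalk" is excluded in both cases because the pattern has to match from the very start of the word.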
We saw in Alternatives that you can use the disjunction operator to run a query on either "the" or "The" with "the|The". However, it is more efficient to say that only the "t/T" is being alternated. We can show this with grouping:
BROWN> "(t|T)he";
Also, if you want to indicate repetition of more than just one character, you use grouping. If you want to search for "ab|abab|ababab|abababab" etc you can do the following:
BROWN> "(ab)+";
Also look to ranges for another way to express this relationship.
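As a quick check of how grouping behaves, the two grouped patterns above can be tried in Python. This is a sketch by analogy, assuming an invented token list and using `re.fullmatch` in place of cqp's word-boundary quotes.

```python
import re

# Hypothetical tokens chosen to include near misses ("THE", "aba").
tokens = ["the", "The", "THE", "ab", "abab", "aba", "ababab"]

# The parentheses limit the alternation to the first letter only.
the_hits = [t for t in tokens if re.fullmatch(r"(t|T)he", t)]

# The parentheses make "+" repeat the whole group "ab", not just "b".
ab_hits = [t for t in tokens if re.fullmatch(r"(ab)+", t)]

print(the_hits)
print(ab_hits)
```

Without the parentheses, "ab+" would repeat only the "b" and match words like "abbb" instead.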
If you want to include a range - whether of numbers (e.g. from 0-9) or of letters (from a-z) - you can use square brackets ([]).
BROWN> "[0-9]";
BROWN> "[a-z]";
Also, we can use the square brackets as another way of expressing alternations. Of the following examples, the first one means "a|b|c|d" and the second is either "(t|T)he" as seen in groups or "the|The" as seen in alternations.
BROWN> "[abcd]";
BROWN> "[tT]he";
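One point that is easy to miss is that a bracketed range stands for exactly one character. The Python sketch below illustrates this by analogy (the token list is invented, and `re.fullmatch` again stands in for cqp's whole-word matching).

```python
import re

# Hypothetical tokens, including a two-digit number.
tokens = ["7", "42", "a", "d", "e", "the", "The"]

# "[0-9]" matches a single digit, so "42" is not a match on its own.
one_digit = [t for t in tokens if re.fullmatch(r"[0-9]", t)]

# "[0-9]+" combines the range with the Kleene plus to allow longer numbers.
any_number = [t for t in tokens if re.fullmatch(r"[0-9]+", t)]

# "[abcd]" is shorthand for "a|b|c|d".
abcd = [t for t in tokens if re.fullmatch(r"[abcd]", t)]

# "[tT]he" is shorthand for "(t|T)he".
the_hits = [t for t in tokens if re.fullmatch(r"[tT]he", t)]

print(one_digit, any_number, abcd, the_hits)
```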
Using "?" after a character means that the character is optional: it may or may not appear in the matched word. For example, in
BROWN> "walks?";
the "s" can be omitted. This is similar to saying "walk|walks" (see alternatives).
Now we can combine what we know about simple searches and regular expressions to make more complex searches.
To search for two words, we can simply repeat a word search:
BROWN> "quite" "the";
BROWN> "give" "up";
The first of these examples will give us results with the word "quite" preceding the word "the", and the second example will give results with the word "give" preceding "up".
Check carefully the use of the quotation marks. Each word must be surrounded by its own set of quotation marks. Do not group the two words together. Remember, cqp thinks of the double quotes as word boundaries, and so if both words are in one set of quotes, it will look for one word that happens to have the space character in it.
If we want to do a search for "give up" or "gives up" or "given up" or "gave up" - we can do this a few different ways:
BROWN> ("give" "up")|("gives" "up")|("given" "up")|("gave" "up");
BROWN> "give|gives|given|gave" "up";
BROWN> "give[sn]?|gave" "up";
Remember to be careful about blanks - "give | gives" means "give " or " gives" with a space as part of the word.
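The one-pattern-per-word idea can be sketched in Python as follows. This is an illustrative analogy, not cqp's algorithm: `find_sequence` is a made-up helper that slides over a hypothetical token list and requires each pattern to match its own token in full.

```python
import re


def find_sequence(tokens, patterns):
    """Return start positions where consecutive tokens match the patterns
    one-to-one, each pattern covering its whole token."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits


# Hypothetical sentence, already split into tokens.
tokens = "he gave up but she never gives up and I give it up".split()

# Two patterns, one per word, like "give[sn]?|gave" "up" in cqp.
adjacent = find_sequence(tokens, ["give[sn]?|gave", "up"])

# Both words inside one pattern: no single token contains a space.
one_pattern = find_sequence(tokens, ["give up"])

print(adjacent)
print(one_pattern)
```

The second call returns nothing, which is the same pitfall as putting both words inside one pair of double quotes in cqp. "give it up" is also not found, because the two words are not adjacent.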
The BNC also has lemma information in it. So if we were searching the BNC, we could do the above search with the attribute lemma, described later in this tutorial.
Searching for a particular pair of words that are not necessarily adjacent is easy with cqp. We use a pair of empty square brackets to stand for any one word.
BROWN> "give" [] "up";
BROWN> "give" []* "up";
BROWN> "give" []{2} "up";
BROWN> "give" []{0,2} "up";
We can use [] to mean that there is exactly one word between the target words, or []* (with the Kleene star) to say "zero or more" words between them. The curly brackets are used to indicate either an exact number ({2} - exactly two words between) or a range ({0,2} - between zero and two words).
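The counting of words in the gap can be sketched in Python. This is a simplified illustration, not cqp's matching strategy: `find_with_gap` is an invented helper that compares plain strings, works on a hypothetical token list, and reports every allowed gap size, whereas cqp has its own rules for choosing among overlapping matches. The unbounded []* case is not modeled.

```python
def find_with_gap(tokens, first, last, min_gap, max_gap):
    """Return (start, end) position pairs where `first` is followed by
    `last` with between min_gap and max_gap arbitrary tokens in between."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == last:
                hits.append((i, j))
    return hits


# Hypothetical token list with gaps of zero, one, and two words.
tokens = "give up or give it up or give it all up".split()

exactly_one = find_with_gap(tokens, "give", "up", 1, 1)   # like []
exactly_two = find_with_gap(tokens, "give", "up", 2, 2)   # like []{2}
zero_to_two = find_with_gap(tokens, "give", "up", 0, 2)   # like []{0,2}

print(exactly_one)
print(exactly_two)
print(zero_to_two)
```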
BROWN> "in" ("the" "city" "of")? "Boston|Washington";
This query means that "in" must be present, but "the city of" can be omitted (see omission). Then we need either "Boston" or "Washington" (see alternatives).
But be careful of the double quotes. If you write this:
BROWN> "in" ("the" "city" "of")? "Boston"|"Washington";
then the whole first part is treated as one alternative and "Washington" as the other - like this:
BROWN> ("in" ("the" "city" "of")? "Boston" ) | ("Washington");
Another way to say the same thing as the first is:
BROWN> "in" ("the" "city" "of")? ("Boston"|"Washington");
but we see that the first way is probably the most concise.
BROWN> "Clinton" "said" "in"?;
This will capture either "Clinton said nothing" or "Clinton said in Columbus, Ohio..."
Because searching for words is the most used option of cqp, it is the default query. But in fact, the query
BROWN> "helpful";
is actually just an abbreviation for the query
BROWN> [word="helpful"];
We can also search by part of speech in the following way:
BROWN> [pos="DT"];
This will allow you to search for all determiners. (To find out how each corpus labels the various parts of speech, see the tag sets.)
To look for a word as a particular part of speech, you can run a query specifying both the word and the part of speech in the following way:
BROWN> [word="rain" & pos="N.*"];
Note that we use "&" for logical "and" - we can use "|" for logical "or" and "!" as logical "not."
As the tag sets often have more than one type of noun, we use "N.*" to say that we just want the part of speech
to be a noun.
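One way to picture a query with two attributes is to think of every corpus position as a small record with a "word" field and a "pos" field. The Python sketch below follows that picture; it is an analogy only, and the `matches` helper, the toy corpus, and the tags in it are invented for illustration (check the real tag sets before relying on any tag name).

```python
import re

# Hypothetical tagged corpus: one record per token. Tags are illustrative.
corpus = [
    {"word": "The", "pos": "DT"},
    {"word": "rain", "pos": "NN"},
    {"word": "fell", "pos": "VBD"},
    {"word": "It", "pos": "PP"},
    {"word": "will", "pos": "MD"},
    {"word": "rain", "pos": "VB"},
]


def matches(token, **conditions):
    """True if every attribute pattern matches its attribute value in full,
    which plays the role of joining conditions with "&"."""
    return all(re.fullmatch(pattern, token[attr])
               for attr, pattern in conditions.items())


# Like [word="rain" & pos="N.*"]: the word "rain", but only as a noun.
noun_rain = [i for i, tok in enumerate(corpus)
             if matches(tok, word="rain", pos="N.*")]

# Like [word="rain"]: every "rain", regardless of part of speech.
any_rain = [i for i, tok in enumerate(corpus) if matches(tok, word="rain")]

print(noun_rain)
print(any_rain)
```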
Earlier we did a search for the adjacent words give up, but we wanted to look for gives up, given up, gave up, etc. at the same time. We did
BROWN> "give[sn]?|gave" "up";
We could also do the same thing with lemma. For example,
BNC> [lemma="give"] "up";
This should match the same forms as the previous search, along with any other inflected forms of "give" (such as "giving").
When you want to specify that the search word is at the beginning or end of a sentence, you can use the symbols "<s>" (start of a sentence) and "</s>" (end of a sentence).
To look for a noun at the beginning of a sentence, the query is as follows:
BROWN> <s> [pos="N.*"];
For a noun at the end of a sentence:
BROWN> [pos="N.*"] </s>;
If you want to capture words on both sides of a sentence boundary, you can do the following:
BROWN> [pos="N.*"] [] <s> "It";
The catchall [] will capture the sentence boundary punctuation (the period, for example).
Each query creates a subcorpus. This means that every time you run a search, the results are stored in a subcorpus - and by default cqp names this subcorpus "Last". If you do not save the results, the next time you run a search, "Last" will be overwritten by the new results. However, if you want to save this subcorpus and run searches on it, you can give it a name in two different ways:
cqp>MyQuery = Last;
cqp>MyQuery = query;
show
You can now see your subcorpus in the list of corpora in the system with the "show" command. Now that we know about subcorpora, you might want to know about three types of "show":
To run a search on your subcorpus, call it just as you called the system corpus. If you cannot remember the name of your subcorpus, you can use the show command.
BROWN> MyQuery;
BROWN:MyQuery[9]>
After you type in the name of your subcorpus, the prompt will change, showing you that you are now searching on a subcorpus of the Brown corpus that has a certain number of lines - in this case, the subcorpus MyQuery has 9 lines.
If you want to do a search of the previous query without renaming it, you can do the same as above but with "Last" - which is the default name for the very last query done.
BROWN> Last;
BROWN:Last[9]>
A corpus can be renamed by entering
corpus-name-1 = corpus-name-2
where corpus-name-1 is a new corpus and corpus-name-2 is an existing corpus which is to be copied. However, cqp does not allow you to rename a system corpus in order to avoid large scale copying.
You probably won't want to display the whole contents of a 1 million word corpus, let alone a 100 million word one, but you might want to display the contents of your subcorpus. There is an easy command for this: "cat".
BROWN> cat MyQuery;
will show you the result of MyQuery. If you type
BROWN> cat;
the default corpus is the last query, "Last", so this will display the results of your last query.
It is fairly easy to change the context of a query. For an individual query, you can use the restructure operator (expand to ... ) to specify the expansion of context as follows:
BROWN> SC = "helpful" expand to 1 s;
This will expand the context of this particular search to 1 sentence ("s"). The next time you run a search, the context will go back to the default. To permanently change the context, see Changing the default context. Different ways to expand are:
expand to number structure-name
which expands the intervals of the given corpus to the boundaries of the following (right context) and previous (left context) number structural annotations called structure-name;
When changing context of subcorpora, you must rename the corpus.
BROWN> SC = "helpful";
BROWN> SCS = SC expand to s;
SC is the name of the subcorpus with the results of the search "helpful"; we are now creating a new subcorpus named SCS that stores the results of the subcorpus SC when the context is expanded to the whole sentence ("s") in which the word is contained.
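Conceptually, expanding to "s" replaces each match with the sentence that contains it. The Python sketch below shows that idea under simplifying assumptions: the corpus is a hypothetical token list, sentence boundaries are given as (first, last) position pairs, and `expand_to_sentence` is an invented helper rather than anything cqp provides.

```python
def expand_to_sentence(match_pos, sentence_spans):
    """Return the (first, last) token positions of the sentence that
    contains match_pos, or the match itself if no sentence covers it."""
    for first, last in sentence_spans:
        if first <= match_pos <= last:
            return (first, last)
    return (match_pos, match_pos)


# Hypothetical corpus of two sentences, with punctuation as its own token.
tokens = "It was helpful . We left early .".split()
sentence_spans = [(0, 3), (4, 7)]

# Positions of the matches, as a stand-in for the subcorpus SC.
sc = [i for i, tok in enumerate(tokens) if tok == "helpful"]

# Expanded intervals, as a stand-in for the subcorpus SCS.
scs = [expand_to_sentence(pos, sentence_spans) for pos in sc]

for first, last in scs:
    print(" ".join(tokens[first:last + 1]))
```

The original match positions in `sc` are left untouched and the expanded intervals are stored under a new name, which mirrors why the expanded subcorpus gets its own name in cqp.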
This is only for those comfortable in a unix-like environment. You can create a file called .cqprc in your home directory. For example, open it in emacs:
% emacs ~/.cqprc
Now you only need to type in one line:
set context 1 s
To end your session, you need only type
BROWN> exit;