General Help Facilities

  1. For the beginner
    1. Just to get started today.
    2. Our Unix help page
    3. Introduction to Linux (compressed)
    4. Emacs command reference card (ps, pdf)
    5. Part I of the Linux cookbook
    6. Python help (place to start)
  2. For the more advanced
    1. Linux Cookbook (compressed)
    2. Pocket Guide to Linux (compressed)
    3. Guide to shell-scripting in Unix (compressed)

Searching Exercises

For the following exercises, try using the files in /opt/corpora/nanc/wsj/1994 for your searches.

This means that if your current directory is /opt/corpora/nanc/wsj/1994, the command to search for lines containing the string 'find' is:

    gunzip -c * | egrep 'find'

Here 'gunzip -c *' sends a decompressed version of every file in the directory to STDOUT, and egrep 'find' searches all the input it receives on STDIN for the string 'find'.
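
If you want to try this away from the corpus directory, here is a minimal sketch that builds two tiny compressed files and runs the same pipeline on them. The file names a.gz and b.gz and their contents are made up for illustration, and grep -E is used as the standard spelling of egrep.

```shell
# Build a tiny stand-in corpus in a temporary directory (hypothetical files).
tmpdir=$(mktemp -d)
printf 'We find the report.\nNothing here.\n' | gzip > "$tmpdir/a.gz"
printf 'They could not find it.\n' | gzip > "$tmpdir/b.gz"

# From inside that directory, decompress every file to STDOUT and
# keep only the lines that contain the string "find".
hits=$(cd "$tmpdir" && gunzip -c * | grep -E 'find')
echo "$hits"
```

One matching line comes back from each file, because gunzip writes the decompressed files one after another to STDOUT before grep sees them.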

Where word lists are called for, it is useful to be able to turn a file into one-word-per-line format. This can be done with the command:

    cat file | tr -sc "[:alpha:]" "[\n*]"

The tr command translates characters. The "c" in the "-sc" option tells tr to take the complement of the first set of characters and translate it into the second set. "[:alpha:]" denotes the set of alphabetic characters (letters only, not digits), so this translates all NON-alphabetic characters in the file, including spaces, digits, and punctuation, into newlines, in effect putting each word on a separate line. The "s" in the "-sc" option squeezes each run of consecutive nonalphabetic characters into a single translation step, so that, for example, two consecutive spaces become a single newline.
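
To see the effect of the two option letters, here is a small sketch that feeds tr one made-up sentence instead of a file. The sentence is only an illustration; it contains a double space, a digit, and punctuation.

```shell
# Every run of non-letters (two spaces, " 3 ", ", ", ".") becomes one newline.
words=$(printf 'stocks  fell 3 points, then rose.\n' | tr -sc "[:alpha:]" "[\n*]")
echo "$words"
```

The digit 3 disappears along with the spaces and punctuation, since it is not in "[:alpha:]", and no blank lines appear because of the "s" option.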

This prints the file on standard out. To redirect it to a file, do:

    cat file | tr -sc "[:alpha:]" "[\n*]" > outfile

To produce a vocabulary list from a file in one-word-per-line format, sort it (into alphabetic order) and then remove adjacent duplicate lines using sort and uniq:

    sort one-word-per-line-file | uniq > vocab_list

To summarize all of the above points, here is one line that will produce an alphabetized vocabulary file in your home directory for a compressed Wall Street Journal file:

    gunzip -c ws941116.gz | tr -sc "[:alpha:]" "[\n*]" | sort | uniq > ~/wsj94116_vocab.txt
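
The same one-liner can be tried on a small scale. This sketch assumes a made-up compressed file, sample.gz, in a temporary directory, and writes the vocabulary there rather than in your home directory.

```shell
# Create a two-line compressed sample (hypothetical stand-in for a WSJ file).
tmpdir=$(mktemp -d)
printf 'the market rose and the dollar fell.\nthe market fell.\n' | gzip > "$tmpdir/sample.gz"

# Decompress, split into one word per line, sort, and drop duplicates.
gunzip -c "$tmpdir/sample.gz" | tr -sc "[:alpha:]" "[\n*]" | sort | uniq > "$tmpdir/sample_vocab.txt"

vocab=$(cat "$tmpdir/sample_vocab.txt")
echo "$vocab"
```

Ten running words shrink to six vocabulary entries, because "the", "market", and "fell" each occur more than once.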

Problem 1

A. Try to write a single regular expression pattern that matches all forms of the verb 'find' (meaning "to successfully conclude a search").

Here's a nice answer: Egrep command

B. What other words match this expression? Here are some: Other words

C. Modify the pattern in answer to exercise 1a, so as to rule out matching 'founder' and 'foundation':

    New pattern

D. The revised pattern still gives you a lot of output to check by eye. Use a second regular expression search of that output to make sure there are no hits on 'foundation'. Use the same idea to check that the answer to part A DOES have hits on 'foundation'.

    New commands
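
The idea of searching the output of a search can be sketched on a few made-up lines. The pattern 'f(i|ou)nd' below is only a stand-in for an answer to part A, not necessarily the linked one, and grep -E is the standard spelling of egrep.

```shell
# Four invented lines; "fund" should not match the stand-in pattern.
sample=$(printf 'they found the error\nthe foundation met today\nthe fund grew\nwe will find it\n')

# First search: keep lines matching the stand-in pattern.
first_pass=$(printf '%s\n' "$sample" | grep -E 'f(i|ou)nd')

# Second search: count the lines of that output that contain 'foundation'.
foundation_hits=$(printf '%s\n' "$first_pass" | grep -E -c 'foundation')
echo "$foundation_hits"
```

A count above zero means the first pattern still lets 'foundation' through; running the same second search on the output of a revised pattern should give a count of zero.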

E. Here's another way to revise the pattern:

    Alternative

Problem 2

A. Try to write a single regular expression to find words that start with lower-case 's' followed by two lower-case consonants in the file 'ws940701':

Here's a nice answer: Egrep command

B. Here's an automatically generated list of the words that come up: Word list

Here's the unix-ese I used to generate this list: Complex Unix command

Try out the commands between the vertical bars in sequence so you understand what each step of this sequence does. In each of the following sets of directions "bulba%" stands for the Linux prompt (which may not be quite the same for all of you). What follows the prompt are commands you should actually type, up until you get to the symbol "#", which is the standard comment character in Unix shell scripts. What follows "#" is a comment on or explanation of what you just did.

  1. bulba% cd /home/ssrl/compling; cat reverse_alphabet # notice there are 2 of each letter
  2. bulba% sort reverse_alphabet # Look what happens
  3. bulba% cd $HOME; cp /home/ssrl/compling/reverse_alphabet . # copy the file "reverse_alphabet" to your home directory
  4. bulba% sort reverse_alphabet > alphabet # store the sorted file in a file
  5. bulba% cat alphabet # inspect the file you just created
  6. bulba% uniq alphabet # what changes?
  7. bulba% uniq alphabet > uniqified_alphabet # Store the uniqified file
  8. bulba% wc alphabet uniqified_alphabet # The first column is lines, the second words, the third characters
  9. bulba% sort reverse_alphabet | uniq > new_uniqified_alphabet # Doing it all in one step
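
If you cannot get at /home/ssrl/compling, the steps above can be tried with a stand-in file. The sketch below assumes a shortened reverse_alphabet holding the letters e through a in reverse order, each appearing twice on non-adjacent lines; the real file's layout may differ.

```shell
# Stand-in for reverse_alphabet (assumed layout, 10 lines).
workdir=$(mktemp -d)
printf '%s\n' e d c b a e d c b a > "$workdir/reverse_alphabet"

# uniq on the unsorted file removes nothing: the duplicates are not adjacent.
unsorted_lines=$(uniq "$workdir/reverse_alphabet" | wc -l)

# Sort first, then uniq, as in steps 4 through 7.
sort "$workdir/reverse_alphabet" > "$workdir/alphabet"
uniq "$workdir/alphabet" > "$workdir/uniqified_alphabet"

# Compare line counts, as wc does in step 8.
sorted_lines=$(wc -l < "$workdir/alphabet")
unique_lines=$(wc -l < "$workdir/uniqified_alphabet")
echo "$unsorted_lines $sorted_lines $unique_lines"
```

Sorting keeps all ten lines but brings the duplicates together, which is what lets uniq cut the file down to five.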

Problem 3

A. Using the command line in "Complex Unix command" from Problem 2 as a model, construct a list of words beginning with the prefix "exo" found in /opt/corpora/nanc/wsj/1994/ws940701.

Here's a good answer: Another complex Unix command

Here's an automatically generated list of the words that come up: Word list

B. Now construct a list of words beginning with the prefix "exo" found in the Wall Street Journal in 1994.

Here's the revised command: Revised command

Here's the word list constructed from that command:

    Revised word list

Do you think all of these words really use the prefix 'exo', or do some of them just start with the letters 'exo'? HINT: Think about the meanings. Do all of them have to do with the outside of something or getting to the outside of something?

C. Now look for all instances of words beginning with the prefix "intra" found in the Wall Street Journal in 1994.

Here's a good answer: Another complex Unix command

Here's the automatically generated list: Word list


Python

  1. Running and modifying the 'hello world' program
  2. Python is an interpreted language