For the following exercises, try using the files in /opt/corpora/nanc/wsj/1994 for your searches.
This means if you are connected to /opt/corpora/nanc/wsj/1994, the command for searching for lines containing the string 'find' is:
Where word lists are called for, it is useful to be able to turn a file into one-word-per-line format. This can be done with the command:
This prints the file on standard out. To redirect it to a file, do:
To produce a vocabulary list from a file in one-word per line format, sort it (into alphabetic order) and then remove adjoining duplicate lines using sort and uniq:
To summarize the points above, here is one line that will produce an alphabetized vocabulary file in your home directory for a compressed Wall Street Journal file:
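The handout's own one-liner is a Unix pipeline. As a rough Python analogue of the same three steps (one word per line, sort, drop duplicates), the sketch below may help. It is written for Python 3, and the function names, the sample sentence, and the choice to treat any run of letters as a word are assumptions made for illustration.

```python
import re

def one_word_per_line(text):
    """Split text into words, mimicking a one-word-per-line file."""
    # Assumption: a "word" is any run of ASCII letters; the handout's
    # own command may tokenize differently.
    return re.findall(r"[A-Za-z]+", text)

def vocabulary(words):
    """Sort the words and drop duplicates, like sort followed by uniq."""
    return sorted(set(words))

# Invented sample text, not taken from the corpus.
sample = "The market found a floor. The market finds a way."
words = one_word_per_line(sample)
vocab = vocabulary(words)
print("\n".join(vocab))
```

Note that upper-case words sort before lower-case ones here, so "The" comes first in the output.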
A. Try to write a single regular expression pattern that matches all forms of the verb 'find' (meaning "to successfully conclude a search").
B. What other words match this expression? Here are some: Other words
C. Modify the pattern in answer to exercise 1a, so as to rule out matching 'founder' and 'foundation':
D.
The revised pattern still gives you a lot of output to check by eye. Use a second regular expression search of that output to make sure there are no hits on 'foundation.' Use the same idea to check that the answer to part A DOES have hits on 'foundation'.
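The two-step check described in part D can also be sketched in Python 3. The patterns and sample lines below are illustrative assumptions, not the handout's answers: `loose` stands in for a part A style pattern and `strict` for a part C style revision.

```python
import re

# Hypothetical patterns, chosen only to illustrate the check.
loose = re.compile(r"f(i|ou)nd")                   # also hits 'foundation'
strict = re.compile(r"\b(find(s|ing)?|found)\b")   # whole words only

# Invented sample lines, not taken from the corpus.
lines = [
    "They find it hard to agree.",
    "The foundation was laid in 1994.",
    "She found the report and is finding more.",
    "The founder retired.",
]

def hits(pattern, lines):
    """Return the lines that the pattern matches anywhere."""
    return [line for line in lines if pattern.search(line)]

def count_foundation(lines):
    """Second search: count the lines that mention 'foundation'."""
    return sum(1 for line in lines if re.search(r"foundation", line))

loose_hits = hits(loose, lines)
strict_hits = hits(strict, lines)
print(count_foundation(loose_hits), count_foundation(strict_hits))
```

The second search finds 'foundation' in the output of the loose pattern but not in the output of the strict one, which is the result part D asks you to confirm.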
E.
Here's another way to revise the pattern:
A. Try to write a single regular expression to find words that start with lower-case 's' followed by two lower-case consonants in the file 'ws940701':
Here's a nice answer: Egrep command
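For comparison, the same kind of pattern can be tried in Python 3. This is a sketch, not the egrep answer linked above: the pattern, the definition of a consonant as any lower-case letter other than a, e, i, o, u, and the sample text are all assumptions.

```python
import re

# Assumption: "consonant" means a lower-case letter that is not a, e, i, o, u
# (so 'y' counts as a consonant here).
CONSONANT = "[b-df-hj-np-tv-z]"
pattern = re.compile(r"\bs" + CONSONANT + r"{2}[a-z]*")

# Invented sample text, not taken from ws940701.
text = "The strong spring storm hit a small school; sales and Street stayed flat."
matches = pattern.findall(text)
print(matches)
```

Words such as "storm" and "small" are skipped because their third letter is a vowel, and "Street" is skipped because it starts with an upper-case 'S'.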
B. Here's an automatically generated list of the words that come up: Word list
Here's the unix-ese I used to generate this list: Complex Unix command
Try out the commands between the vertical bars in sequence so you understand what each step of this sequence does. In each of the following sets of directions "bulba%" stands for the Linux prompt (which may not be quite the same for all of you). What follows the prompt are commands you should actually type, up until you get to the symbol "#", which is a standard Unix scripting language comment character. What follows "#" is a comment on or explanation of what you just did.
A
Using the command line in Complex Unix command from problem 2 as a model, construct a list of words beginning with the prefix "exo" found in /opt/corpora/nanc/wsj/1994/ws940701.
Here's a good answer: Another complex Unix command
Here's an automatically generated list of the words that come up: Word list
B
Now construct a list of words beginning with the prefix "exo" found in the Wall Street Journal in 1994.
Here's the revised command: Revised command
Here's the word list constructed from that command:
Do you think all of these words really use the prefix 'exo', or do some of them just start with the letters 'exo'? HINT: Think about the meanings. Do all of them have to do with the outside of something or getting to the outside of something?
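The same word-list idea can be sketched in Python 3. The sketch only collects words that start with the letters "exo"; deciding whether "exo" is a true prefix in each word still has to be done by eye. The function name and the sample text are hypothetical.

```python
import re

def words_starting_with(text, letters):
    """Return the sorted set of lower-cased words beginning with `letters`."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return sorted({w for w in words if w.startswith(letters)})

# Invented sample text, not taken from the corpus.
sample = "Exotic exports and exorbitant fees; the exoskeleton and an exodus."
exo_words = words_starting_with(sample, "exo")
print(exo_words)
```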
C
Now look for all instances of words beginning with the prefix "intra" found in the Wall Street Journal in 1994.
Here's a good answer: Another complex Unix command
Here's the automatically generated list: Word list
#!/usr/bin/python
print 'hello world'
Now save the change, rerun the program and verify that the change has taken effect.
The first line of the file is:
#!/usr/bin/python
It is an instruction to Unix about what to do with this file, in particular, what program to call to execute it. A full path name is given and is advisable so that the program will run anywhere.
The second line is a Python instruction:
print 'hello world\n'
bulba% ls hello_world.py
bulba% python
>>> from sys import *
That was just for practice.
>>> import hello_world
hello world
Notice the program actually RUNS when you import it. That's because that's how you wrote it. The instruction in the file is just executed when loaded.
#!/usr/bin/python
def hello ():
    print 'hello world'
if __name__=='__main__':
    hello()
Indent as above! Indentation is important!
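The effect of missing indentation can be checked without saving a file. The sketch below is written for Python 3 (so print is a function, unlike the Python 2 print statements in this handout) and uses the built-in compile() to test two versions of the function. The variable and helper names are illustrative.

```python
# Two versions of the same function: one indented, one not.
indented = "def hello():\n    print('hello world')\n"
unindented = "def hello():\nprint('hello world')\n"

def compiles(source):
    """Return True if the source compiles, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False

indented_ok = compiles(indented)
unindented_ok = compiles(unindented)
print(indented_ok, unindented_ok)
```

Only the indented version compiles; Python rejects the other one before running any of it.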
The following will cause an error:
#!/usr/bin/python
def hello ():
print 'hello world'
The exact number of spaces indented is not significant, but the fact that there IS indentation matters. This is why it's useful to edit Python with an editor that knows Python.
Here are two:
bulba% idle
Click File>Open and select 'hello_world.py'. Then in the new window that opens with the file, click Run>Run Module.
In the hello_world window, if you type "def hello():" and hit carriage return, the proper indentation will appear.
If you hit a carriage return after the print line, indentation will also appear. But now you don't want it. Just hit the Backspace key.
bulba% emacs hello_world.py
Click Python>Start interpreter. Then click on the hello_world window (you'll have 2) and click Python>Execute Buffer.
Loading the original file by menu in idle should look like this:
>>> ================================ RESTART ================================
>>>
Hello world!
The program executes. Loading the revised file by menu should look like this in idle:
>>> ================================ RESTART ================================
>>>
The program does not execute.
Loading the file only defines the function. We need to call the function to make it run:
>>> hello()
Hello world!
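The difference between defining a function and calling it can be illustrated with a small sketch. It assumes Python 3 and uses exec() on a source string in place of a real file; the `run_as` helper is hypothetical. Defining hello() prints nothing by itself, and the `if __name__ == '__main__'` test calls it only when the module's name is '__main__', which is not the case when the file is imported.

```python
import contextlib
import io

# Python 3 version of the revised file, held in a string.
source = (
    "def hello():\n"
    "    print('hello world')\n"
    "if __name__ == '__main__':\n"
    "    hello()\n"
)

def run_as(name):
    """Execute the source with __name__ set to `name`; return what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {"__name__": name})
    return buffer.getvalue()

as_program = run_as("__main__")     # like running the file directly
as_import = run_as("hello_world")   # like importing it as a module
print(repr(as_program), repr(as_import))
```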
>>> import hello_world
Then:
>>> hello()
Traceback (most recent call last):
  File "...", line 1, in -toplevel-
    hello()
NameError: name 'hello' is not defined
In Python, importing a module defines its names in that module's own namespace. So the function "hello" is not defined at the top level. The function "hello_world.hello" is defined:
>>> hello_world.hello()
Hello world!
You can import things into the top level name space if you desire. The command for that is:
>>> from hello_world import *
>>> hello()
Hello world!
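The two import styles can be compared side by side in a self-contained sketch. It assumes Python 3 and builds a stand-in module in memory instead of reading hello_world.py; the module name hello_world_demo and the dictionaries `plain` and `star` (each playing the role of a fresh top-level name space) are hypothetical.

```python
import sys
import types

# Build a stand-in module in memory and register it so it can be imported.
module = types.ModuleType("hello_world_demo")
exec("def hello():\n    return 'Hello world!'\n", module.__dict__)
sys.modules["hello_world_demo"] = module

# Plain import: only the module name enters the name space.
plain = {}
exec("import hello_world_demo", plain)

# Star import: the module's public names are copied into the name space.
star = {}
exec("from hello_world_demo import *", star)

print("hello" in plain, "hello_world_demo" in plain)
print("hello" in star, star["hello"]())
```

After the plain import, hello is reachable only through the module name; after the star import, it can be called directly.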