LINGUIST 138/238 -- SYMBSYS 138. Autumn 2004. Homework 3

LINGUIST 138/238 - SYMBSYS 138 - Autumn 2004
Homework 3: Part of Speech Tagging

Due: October 19 at the start of class

Read this entire page before starting!!

Do exercises 8.1, 8.2, and 8.3 from the reading chapter ("Word Classes and Part-of-Speech Tagging") in Jurafsky and Martin. Exercise 8.3 requires a partner in the class, so make sure to pick a partner in class soon, and in any case by Thursday.
Implement the Most-Frequent-Tag algorithm for part-of-speech tagging as discussed in class on Tuesday. You should create your dictionary of possible tags, and your tag frequencies, from the file /afs/ir/class/linguist238/WWW/restricted/brown.train.txt. For any word that appears in your test set but is not in the dictionary (i.e., an unknown word), assign the tag NN.
Compute the accuracy of your Most-Frequent-Tag algorithm on the test set in /afs/ir/class/linguist238/WWW/restricted/brown.test.txt.
Have a look at some of the tags that you got wrong. Write me two rules (just descriptively, in English, you don't have to write any code) which would have improved your tagging if you had run them as post-processors to your Most-Frequent-Tag algorithm.
Improve the unknown-word tagging algorithm so that it does something smarter than just assigning all unknown words the tag NN. Think about the examples we discussed in class Tuesday.
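
To make the Most-Frequent-Tag baseline and its evaluation concrete, here is a minimal sketch in Python. It is only one possible shape for the program, not a required design. It assumes each line of the data files holds one sentence of word/TAG tokens separated by whitespace; check the actual format of brown.train.txt before relying on that. The function names and the toy sentences are made up for illustration and do not come from the Brown files.

```python
from collections import Counter, defaultdict


def parse_tagged_line(line):
    """Split one line of 'word/TAG word/TAG ...' text into (word, tag) pairs.

    ASSUMPTION: tokens look like word/TAG. The real file format may differ.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so words containing '/' survive.
        word, sep, tag = token.rpartition("/")
        if sep:  # skip tokens that have no slash at all
            pairs.append((word, tag))
    return pairs


def train_most_frequent_tag(tagged_sentences):
    """Map each training word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, unknown_tag="NN"):
    """Tag each word with its most frequent tag; unknown words get unknown_tag."""
    return [model.get(word, unknown_tag) for word in words]


def evaluate(tagged_sentences, model, unknown_tag="NN"):
    """Return (accuracy, fraction of test tokens that are unknown words)."""
    correct = total = unknown = 0
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_words(words, model, unknown_tag)
        for (word, gold), guess in zip(sentence, predicted):
            total += 1
            correct += guess == gold
            unknown += word not in model
    if total == 0:
        return 0.0, 0.0
    return correct / total, unknown / total


# Toy data standing in for brown.train.txt and brown.test.txt (hypothetical).
train = [
    parse_tagged_line("the/AT dog/NN runs/VBZ"),
    parse_tagged_line("the/AT runs/NNS were/BED long/JJ"),
    parse_tagged_line("he/PPS runs/VBZ"),
]
test = [parse_tagged_line("the/AT cat/NN runs/VBZ fast/RB")]

model = train_most_frequent_tag(train)
accuracy, unknown_rate = evaluate(test, model)
print(f"accuracy: {accuracy:.1%}, unknown words: {unknown_rate:.1%}")
```

In the toy test sentence, "cat" and "fast" are unknown and both receive NN, which is right for "cat" and wrong for "fast". For the real data, you would read the two files line by line and pass the parsed sentences to the same functions.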

What to turn in:

Your writeups for problems 8.1, 8.2, and 8.3
Your program (as usual, you may write in any programming language you want).
As usual, a sample run of your program. Make sure your sample run prints out the percent-correct accuracy you compute for the test set, and the percentage of "unknown words" in your test set. If your program has any interesting features, show those in the sample run as well. For example, your sample run must show off your improved unknown-word code.
A list of 5 errors your system made (with enough context so we can understand the errors)
Your two descriptive rules, together with some examples of the errors that these rules would correct (or you can just have the examples be ones from the 5 errors mentioned above)
A description of your improved unknown-word algorithm
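
As a rough illustration of what an improved unknown-word algorithm might look like, the sketch below guesses a tag from the surface form of the word (digits, capitalization, common suffixes). The specific rules and their order are assumptions chosen for illustration, not the examples from class, and the tag names (CD, NP, RB, VBG, VBN, NNS) assume a Brown-style tagset; adjust them to whatever tags actually appear in the training file.

```python
import re


def guess_unknown_tag(word, default="NN"):
    """Guess a tag for a word missing from the dictionary, from its form alone.

    ASSUMPTION: Brown-style tag names. All rules here are illustrative.
    """
    # Numbers such as 42, 3.5, or 1,000
    if re.fullmatch(r"\d[\d.,]*", word):
        return "CD"
    # Capitalized words are often proper nouns (sentence-initial words
    # will be over-tagged by this rule).
    if word[:1].isupper():
        return "NP"
    lower = word.lower()
    if lower.endswith("ly"):
        return "RB"
    if lower.endswith("ing"):
        return "VBG"
    if lower.endswith("ed"):
        return "VBN"
    # Plural-looking nouns, but not words like "glass"
    if lower.endswith("s") and not lower.endswith("ss"):
        return "NNS"
    return default


for example in ["1,000", "Stanford", "quickly", "tagging", "tagged", "taggers", "glass"]:
    print(example, guess_unknown_tag(example))
```

A function like this would replace the fixed NN fallback for words not found in the dictionary. Whether each rule helps or hurts is an empirical question, so compare accuracy on the test set with and without it.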

How to turn it in: