LINGUIST 138/238 -- SYMBSYS 138. Autumn 2004. Homework 5

Due: Thursday November 11

Read this entire page before starting!!

You are going to implement a small version of the Direct approach to machine translation (MT). Here's what you do!

Choose a language X. Pick one that you know well enough to work with. If you don't know any language besides English this well, just choose the one you know best, perhaps getting a friend who is a native (or just a good-enough) speaker of language X to help you.
Create a test document in Language X; one paragraph is probably fine, let's say 10 sentences. Don't write these 10 sentences yourself; take 10 real sentences from some source (a newspaper, a novel, a web site, etc.).
Create a bilingual English-X dictionary covering each word in your test document. It's very difficult to get a good downloadable dictionary, so do this using a web-based or print English-X dictionary (for example, here's a web-based dictionary for a couple of languages, based on Collins dictionaries). Just create a little dictionary file that has each word in your test document and a corresponding English translation. *Don't* use your knowledge of the context of the sentence to pick the translating word in English. Just use the first (or most frequent, or something like that) definition in the dictionary.
Now write code (again, any programming language is ok) to implement the following "Direct MT" system:
1. Use your bilingual dictionary to translate each word from Language X into English.
2. Run your simple part-of-speech tagger (from homework 3) on the English target words.
3. Now write at least 10 simple part-of-speech-based reordering transformations to reorder the words in your 10 new "English" sentences to look more like real English!
4. See how close you can get to a real translation!!
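To make the four steps above concrete, here is a minimal Python sketch of one way such a pipeline could be organized. Everything specific in it is an assumption made for illustration: the tab-separated dictionary format, the function names, the tiny Spanish sample, the tag names, and the toy tagger that stands in for your own homework 3 tagger. It implements only one reordering rule, not the ten the assignment asks for.

```python
def load_dictionary(lines):
    """Parse 'source<TAB>english' lines into a dict (this file format is an assumption)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, english = line.split("\t", 1)
        table[source.lower()] = english
    return table


def translate_words(sentence, table):
    """Step 1: word-for-word lookup; unknown words pass through unchanged."""
    return [table.get(word.lower(), word) for word in sentence.split()]


# Hypothetical stand-in for the homework 3 tagger: a tiny lookup table.
TOY_TAGS = {"the": "DET", "house": "NOUN", "white": "ADJ", "is": "VERB", "big": "ADJ"}


def tag(words):
    """Step 2: tag the English words; unknown words default to NOUN."""
    return [(word, TOY_TAGS.get(word.lower(), "NOUN")) for word in words]


def swap_noun_adj(tagged):
    """Step 3 (one example rule): rewrite NOUN ADJ as ADJ NOUN."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # move past the pair that was just swapped
        else:
            i += 1
    return out


# Rules are applied in order; a real solution would list at least 10 here.
RULES = [swap_noun_adj]


def direct_mt(sentence, table):
    """Run lookup, tagging, and every reordering rule on one sentence."""
    tagged = tag(translate_words(sentence, table))
    for rule in RULES:
        tagged = rule(tagged)
    return " ".join(word for word, _ in tagged)


# Illustrative Spanish-English entries, one per line, tab-separated.
SAMPLE_DICT = "la\tthe\ncasa\thouse\nblanca\twhite\nes\tis\ngrande\tbig\n"
table = load_dictionary(SAMPLE_DICT.splitlines())
print(direct_mt("la casa blanca es grande", table))  # prints: the white house is big
```

Keeping each rule as a separate function over (word, tag) pairs makes it easy to switch rules on and off one at a time, which helps when you describe what each rule was supposed to fix and when you do the error analysis.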
Now do an error analysis: what kinds of errors are still left, and what kind of knowledge would you have needed to fix them?
Turn in, in the normal way:
1. Your code
2. Your input file and your dictionary.
3. The output of running your code on your input.
4. A description, for each rule you write, of what it was supposed to do and what differences between Language X and English it was supposed to fix (and make sure you give a good example of each rule being used in the output of your code).
5. Your error analysis of the remaining errors.