Project - Concordance

This CS193Q project works on files and dicts. A concordance is like an index of a book, except that for each word it includes a reference to each line where that word appears.

concordance.zip

1. clean()

We'll say that the "clean" version of a word has all of the non-alphabetic chars removed from its beginning and end, so the clean form of "--Isn't--" is "Isn't". Chars in the middle of the word are left alone. Write a clean() function that forms the cleaned version of a word. This function should have at least 5 Doctests; the other functions do not need Doctests. Use the lowercase, clean version of each word as the key in the dict below.
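
One possible shape for clean(), offered as a minimal sketch and not as the required solution: move an index inward from each end until it lands on an alphabetic char, then slice. Using str.isalpha() as the test for "alphabetic" and the variable names here are assumptions, and the Doctests shown only illustrate the format.

```python
def clean(word):
    """
    Return word with non-alphabetic chars removed from both ends.
    >>> clean("--Isn't--")
    "Isn't"
    >>> clean('123')
    ''
    """
    # Advance start past leading non-alphabetic chars.
    start = 0
    while start < len(word) and not word[start].isalpha():
        start += 1
    # Back end up past trailing non-alphabetic chars.
    end = len(word)
    while end > start and not word[end - 1].isalpha():
        end -= 1
    return word[start:end]
```

A word with no alphabetic chars at all cleans to the empty string, so the code that builds the dict probably wants to skip that case.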

2. Build Concordance

We'll say that a concordance is a dict with a key for the lowercase, clean version of every word that appears in a text. The value for each key is a nested dict, where each key is an int line number (the first line is 1), and its value is that line of text from the original, with whitespace removed from the ends (str.strip()). The concordance captures all the words in the text, and all the locations where those words appear.

So for this input poem:

Roses are red
Violets are blue
"RED" BLUE.

The concord dict is:

{'roses': {1: 'Roses are red'}, 'are': {1: 'Roses are red', 2: 'Violets are blue'}, 'red': {1: 'Roses are red', 3: '"RED" BLUE.'}, 'violets': {2: 'Violets are blue'}, 'blue': {2: 'Violets are blue', 3: '"RED" BLUE.'}}

Process the file line by line to build up the concord dict.

With the dict-count algorithm, the value for each key is an int, and a newly seen word sets its value to 0. In this case, the value for each word is a nested dict, so the initial value is {}, and later operations use d[key] etc. as usual to edit the nested dict.
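
The pattern described above might look roughly like this. It is a sketch under assumptions: the function name build_concord(), splitting each line on whitespace with str.split(), and the compact stand-in for clean() are all hypothetical choices, not part of the starter code.

```python
def clean(word):
    # Stand-in for the clean() from part 1: trim non-alphabetic ends.
    start = 0
    while start < len(word) and not word[start].isalpha():
        start += 1
    end = len(word)
    while end > start and not word[end - 1].isalpha():
        end -= 1
    return word[start:end]


def build_concord(lines):
    """Given an iterable of text lines, return the concord dict."""
    concord = {}
    for line_num, line in enumerate(lines, start=1):
        text = line.strip()
        for token in text.split():
            word = clean(token).lower()
            if word == '':
                continue  # token had no alphabetic chars
            if word not in concord:
                concord[word] = {}  # first time seen: empty nested dict
            concord[word][line_num] = text
    return concord


poem = ['Roses are red\n', 'Violets are blue\n', '"RED" BLUE.\n']
concord = build_concord(poem)
print(concord['red'])
```

An open file can be looped over line by line in the same way, so a file object could be passed in place of the list used here.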

3. main() Output Options

For each of these 2 options, decompose out helper functions as needed, and write logic in main() to call your functions. The print() function should only appear directly in main() for the -raw option.

1. With the -raw command line option, print the raw dict data structure. (Calling print() directly is ok for this one.)

$ python3 concordance.py -raw poem.txt 
{'roses': {1: 'Roses are red'}, 'are': {1: 'Roses are red', 2: 'Violets are blue'}, 'red': {1: 'Roses are red', 3: '"RED" BLUE.'}, 'violets': {2: 'Violets are blue'}, 'blue': {2: 'Violets are blue', 3: '"RED" BLUE.'}}

Memory use aside: note you're not duplicating the line of text for each dict entry - with Python's shallow use of =, all the entries point to the 1 copy of the line in memory.
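
A small check of that claim, using the is operator to compare object identity. The variable names are just for illustration.

```python
# Build the stripped line once, then store it under two different words.
line = '  Roses are red\n'.strip()
concord = {'roses': {}, 'are': {}}
concord['roses'][1] = line
concord['are'][1] = line

# Both entries refer to the same str object, not to two copies.
same = concord['roses'][1] is concord['are'][1]
print(same)
```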

2. With the -index command line option, print a "report" for all the key words in alphabetical order: the word on a line followed by the text of all the original lines where it appears, with each listed with its line number, all followed by a blank line, like this:

$ python3 concordance.py -index poem.txt
are
1 Roses are red
2 Violets are blue

blue
2 Violets are blue
3 "RED" BLUE.

red
1 Roses are red
3 "RED" BLUE.

roses
1 Roses are red

violets
2 Violets are blue
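
One way to decompose the report, as a sketch: a helper that builds the output lines for a given word order, which keeps the formatting separate from the ordering. The name report_lines() and the choice to return a list of strings are assumptions made for this example.

```python
def report_lines(concord, words):
    """Return the report as a list of str lines, for words in the given order."""
    lines = []
    for word in words:
        lines.append(word)
        nested = concord[word]
        for line_num in sorted(nested.keys()):
            lines.append(str(line_num) + ' ' + nested[line_num])
        lines.append('')  # blank line after each word
    return lines


concord = {'roses': {1: 'Roses are red'},
           'are': {1: 'Roses are red', 2: 'Violets are blue'}}
# -index: the words in alphabetical order
for out in report_lines(concord, sorted(concord.keys())):
    print(out)
```

Because the word order is a parameter, the same helper could be reused for a different ordering of the words.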

3. (optional) The -top option has output like -index, except the words are ordered so that the words that appear on the most lines come first. Use sorted() with a key= lambda to custom sort the items. Note that len(dict) returns the number of key/value pairs in a dict.

$ python3 concordance.py -top poem.txt 
are
1 Roses are red
2 Violets are blue

red
1 Roses are red
3 "RED" BLUE.

blue
2 Violets are blue
3 "RED" BLUE.

roses
1 Roses are red

violets
2 Violets are blue

We won't worry about the order when the number of lines is a tie, so for poem.txt we just know that "roses" and "violets" are listed last.
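
The sort itself could be sketched like this, assuming the concord dict from the poem. Here each item from concord.items() is a (word, nested dict) pair, so the lambda looks at len() of the nested dict; reverse=True puts the largest counts first. The variable names are hypothetical.

```python
concord = {'roses': {1: 'Roses are red'},
           'are': {1: 'Roses are red', 2: 'Violets are blue'},
           'red': {1: 'Roses are red', 3: '"RED" BLUE.'},
           'violets': {2: 'Violets are blue'},
           'blue': {2: 'Violets are blue', 3: '"RED" BLUE.'}}

# Sort the (word, nested) pairs by how many lines each word appears on.
items = sorted(concord.items(), key=lambda pair: len(pair[1]), reverse=True)
words = [pair[0] for pair in items]
print(words)
```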

Get this code working and cleaned up and you're all done.