Mimic

Homework 7b - Mimic

This small project will use a dict data structure for a text algorithm with some whimsy to it. The program will be able to produce a random imitation of any source text. The code is short but very algorithmic.

The file mimic.zip has the mimic project directory. For this project, the handout will outline the problem to solve. The file "mimic.py" is included as a starting point, but it is basically blank, so you are practicing creating and testing code from scratch. This project does not require map/lambda.

Bigrams Algorithm

Given a text, we'll say the "bigrams" data records what words come immediately after other words in the text. For example the text:

This and that and the other.

Yields the bigrams:

'This' → 'and'
'and' → 'that', 'the'
'that' → 'and'
'the' → 'other.'

In the text, 'This' is followed only by 'and', but 'and' is followed by both 'that' and 'the'.

For every pair of adjacent words in the text, call them First and Second. The bigrams dict has a key in it for every First word in the text. The value for each First in the dict is a list of all the Second words for that First (duplicates are allowed in the list). As a one-off addition, the empty string '' is treated as being First before the very first word. So for example the text 'This and that and the other.' yields the following bigrams dict:

{
 '': ['This'],
 'This': ['and'],
 'and': ['that', 'the'],
 'that': ['and'],
 'the': ['other.'],
}

To build the bigrams structure, loop through all the words in the text, keeping track of both the word and the previous word for each iteration. This table shows the 6 loops required to go through the 6 words of 'This and that and the other.':

LOOP     WORD    PREVIOUS
1        This    ''
2        and     This
3        that    and
4        and     that
5        the     and
6        other.  the

Each time through the loop, use the previous/word pair to update the bigrams dict.

Gettysburg Bigrams

Here's a bit of the bigrams of the Gettysburg address:

{
 '': ['Four'],
 'Four': ['score'],
 'score': ['and'],
 'and': ['seven', 'dedicated', 'so', 'proper', 'dead,', 'that'],
 'seven': ['years'],
 ...
}

The word 'Four' only has the word 'score' after it, but the word 'and' has many words which appear after it.

Bigrams Mimic Algorithm

For any particular word, the bigrams data encodes what words might come after it. What is the bigrams data good for? One neat application is chasing through the bigrams structure, creating a random text that mimics the original text. Here is an outline of the algorithm to generate a n words of mimic text from the bigrams:

Start with the empty string '' as the choice
Look up the list of words that come after the choice in the bigrams
Select one of those words at random and print it followed by a space but no '\n'
print(word + ' ', end='')
Go back to step 2 with the word just printed as the choice and repeat
If in step 2 the choice is not in the bigrams, use '' as the choice instead which is always present
Do the steps 2..4 in a loop n times to print n words
After printing all the words, do one print() to add one '\n' at the end

Mimic Output

Here is some mimic output text based on the gettysburg address. It has a feverish quality, somehow echoing the words and tone of the original but without completely making sense.

Four score and dedicated to the living, rather, to the great civil war, testing whether that all men are met on a great task remaining before us to dedicate a great battle-field of that we can not have come to add or any nation so nobly advanced. It is for By default, the mimic program produces 50 word outputs, and there's a command line option to specify any number. The output ends with word 50, so like the above example, the ending tends to be abrupt.

Here we have 100 words from the declaration of independence:

When in the earth, the Laws of mankind requires that among the earth, the separate and equal station to the causes which have connected them with another and of the separate and to which the opinions of Nature's God entitle them, a decent respect to the powers of mankind requires that they are created equal, that all men are Life, Liberty and the pursuit of mankind requires that they are Life, Liberty and the powers of Nature and of human events it becomes necessary for one people to the powers of Nature and of Nature and equal station to assume

Someday perhaps you can adapt this work to get a grant as an experimental computer literature artist. But for now in CS106A, this amounts to less than one page of Python code!

Aside: you are building what is known as a "Markov Model": imagine each word is a node with arrows leading out of it to each possible next word. The algorithm randomly chases through this network, picking a random arrow out of a node for each hop.

each word is a node, arrows to possible next words

Design Your Functions

For this assignment, we outline what the program needs to do, but you decompose your own functions and code. The requirement is that you decompose your program into three or more functions (counting main() as one function), and these functions fit together to solve the whole problem. We will not be super picky about the exact decomposition of the functions. They just need to be proper black-box functions and work together to solve the whole problem.

Very often in CS106A programs, there are functions that read a file and return a data structure, and later functions take in that data structure as a parameter for further computation or printing. (The wordcount.zip lecture example has this structure.)

Create at least 1 function to read a filename and create a bigrams data structure from its text. For this project, it's best to read the whole file as one big string, so that the last word on one line comes before the first word on the next line (code fragment below). This function must have at least 2 DocTests. The files test-tiny.txt test-small.txt and test-roses.txt are quite small and suitable for testing. For example, the result for test-small.txt should be:
```
{'': ['a'], 'a': ['b', 'c', 'd.'], 'b': ['a'], 'c': ['a']}
```
Use str.split() to separate out the "words" of the text, leaving the punctuation in place, so 'and.' counts as a word and leave the upper/lower case intact. The str.split() function merges all the lines in the text together, treating each newline '\n' like a space. This is perfect for this algorithm.
Very often we use for-line-in-file to process a file line by line. The bigrams algorithm is unusual in that you can process the whole file as one big string. Here is a reminder of the code to load a whole text file into a string:
```
with open(filename) as f:
    text = f.read()       # reads entire file into 1 string
    words = text.split()  # split text into list of word strings
```
If you get a UnicodeDecodeError opening the file, it means your computer is not using 'utf-8' by default to decode the file. To fix this, just add an explicit encoding= to the file open like this: open(filename, encoding='utf-8')
Create at least 1 function to do the random-print algorithm, taking in any needed data structures and an int n parameter indicating how many words to print. The printing code does not need tests, since the randomness is not well suited to testing.
For your functions other than main(), write a Pydoc """Given xxx does yyy""" sentence, summarizing what the function does with its input parameters. PyCharm will paste in ":param s: :return:" in the Pydoc, which you can delete; those are advanced Pydoc features which are not needed for CS106A, and in fact these are rarely used in ordinary Python programs.
Write code in main() to call your code for the 2 command line forms of this program:
Case-1 The default which prints 50 words (len(args) == 1)
```
$ python3 mimic.py gettysburg.txt
```
Case-2 A number is typed in after the filename to specify a custom number of words to print (len(args) == 2)
```
$ python3 mimic.py gettysburg.txt 200
```

When it's working, try making longer random texts from from "alice-book.txt" and "independence.txt" and "apple-license.txt", like this:

$ python3 mimic.py apple-license.txt 200

Once your code is cleaned up and works right, please turn in your mimic.py file on Paperless as usual. That's nice looking python project - creating neat output with not much code.

Appendix A: Previous State

A standard code-pattern that comes up sometimes is to loop through elements as usual, but for each element, the code also has access to what the previous element was. The standard solution is to introduce a "previous" variable, give it an initial value, and then at the bottom of the loop, assign previous to the current value, so it's ready for the next iteration.

Here's the technique in a for/range loop

previous = -1   # set special previous value for the very first elem
for i in range(100):
    # in here, use i and previous
    print(previous, i)
    # at the end of the loop, set previous for next iteration
    previous = i

Appendix B: Random Choice

The Python "random" module contains functions which provide pseudo-random numbers. For this program, call the random.choice() function which returns, at random, one element from a non-empty list. The needed "import random" is provided at the top of the starter file.

>>> import random
>>> lst = ['a', 'b', 'c']
>>> random.choice(lst)
'a'
>>> random.choice(lst)
'b'
>>> random.choice(lst)
'a'