Homework 6b - Mimic

All parts of HW6 are due Wed Nov 13th at 11:55pm as usual.

This small project will use a dict data structure for a text algorithm with some whimsy to it. The program will be able to produce a random imitation of any source text.

The file mimic.zip has the mimic project directory. For this project, the handout will outline the problem to solve, but the starter file does not include any functions. You will need to create a set of functions that work together to solve the whole problem.

Bigrams Algorithm

Given a text, we'll say the "bigrams" data records what words come immediately after other words in the text. For example, the text:

This and that and the other.

Yields the bigrams:

'This' → 'and'
'and' → 'that', 'the'
'that' → 'and'
'the' → 'other.'

In the text, 'This' is followed only by 'and', but 'and' is followed by both 'that' and 'the'.

Call the two words in the pair "previous" and "next". In terms of a data structure, the bigrams dict has a key for every previous word in the text (the last word in the text is not necessarily a previous word). The value of each key is a list of all the words that immediately follow that previous word in the text (duplicates are allowed in the list). As a one-off addition, the empty string '' is treated as being previous to the very first word. So, for example, the text 'This and that and the other.' yields the following bigrams dict:

{
 '': ['This'],
 'This': ['and'],
 'and': ['that', 'the'],
 'that': ['and'],
 'the': ['other.'],
}

Here's a bit of the bigrams of the Gettysburg Address:

{
 '': ['Four'],
 'Four': ['score'],
 'score': ['and'],
 'and': ['seven', 'dedicated', 'so', 'proper', 'dead,', 'that'],
 'seven': ['years'],
 ...
}

The word 'Four' only has the word 'score' after it, but the word 'and' has many words which appear after it.
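As a rough sketch of one way to build this dict (the function name make_bigrams and its words-list parameter are illustrative choices, not requirements of the handout), the previous-variable pattern described in Appendix A works well here:

def make_bigrams(words):
    """
    Given a list of words, return the bigrams dict:
    each previous word maps to the list of words that follow it.
    """
    bigrams = {}
    previous = ''  # '' is treated as previous to the very first word
    for word in words:
        if previous not in bigrams:
            bigrams[previous] = []
        bigrams[previous].append(word)  # duplicates are allowed
        previous = word  # the current word becomes previous for the next iteration
    return bigrams

With that sketch, make_bigrams('This and that and the other.'.split()) should yield exactly the dict shown above.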

Bigrams Output

What is the bigrams data good for? The bigrams data enables an algorithm that chases through the words, creating a random text that resembles the original, like this:

  1. Start with the empty string '' as the choice
  2. Look up words that come after the choice in the bigrams
  3. Select one of those words at random and print it followed by a space but no '\n'
  4. Go back to step 2 with the word just printed as the choice and repeat
  5. If in step 2 the choice is not in the bigrams, use '' instead, which is always present

Always Do The Minimum

The printing process is guided by a "minimum" int parameter, such as 25. The printing algorithm continues to put out words until at least the minimum number of words have been printed. Once the minimum is reached (i.e. the 25th word is printed), a word ending with a '.' or ';' ends the printing at that word. This amounts to some loop/break code: a loop to do all the words, but exiting the loop at just the right moment. In this way the random text sort of ends at the end of a sentence instead of cutting off mid-phrase. Friendly reminder of a Python builtin: str.endswith()
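Here is a rough sketch of how that printing loop could look (the name print_mimic and its parameters are illustrative, not a required interface; the bigrams dict is assumed to be built as described above):

import random

def print_mimic(bigrams, minimum):
    """
    Given the bigrams dict and an int minimum, print a random
    mimic text of at least minimum words, stopping after a word
    that ends with '.' or ';' once the minimum has been reached.
    """
    choice = ''  # step 1: start with the empty string
    count = 0
    while True:
        if choice not in bigrams:
            choice = ''  # step 5: '' is always present in the dict
        word = random.choice(bigrams[choice])  # steps 2-3: pick a next word at random
        print(word, end=' ')  # a space but no '\n'
        count += 1
        # stop at a sentence-ending word once the minimum is reached
        if count >= minimum and word.endswith(('.', ';')):
            break
        choice = word  # step 4: the printed word becomes the new choice
    print()  # end the output with a newline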

Mimic Output

Here is some random text we get with this algorithm from the Gettysburg Address. It has a feverish quality, somehow echoing the words and tone of the original but without completely making sense.

Four score and proper that these honored dead shall not consecrate -- we here have a great battle-field of that nation, under God, shall have died in a new birth of that field, as a great battle-field of devotion to the people, for us the living, rather, to the people, by the earth.

Here we have the Declaration of Independence. You can see where it sort of gets stuck in a loop at times:

When in the causes which impel them to assume among the causes which have connected them to which the Course of human events it becomes necessary for one people to the separation. We hold these truths to the Course of Nature's God entitle them, a decent respect to assume among these truths to be self-evident, that all men are endowed by their Creator with another and equal station to be self-evident, that all men are created equal, that they are endowed by their Creator with another and equal station to assume among these are endowed by their Creator with certain unalienable Rights, that they are created equal, that they should declare the pursuit of human events it becomes necessary for one people to the powers of Nature's God entitle them, a decent respect to the Laws of mankind requires that they should declare the pursuit of Nature and the Course of Happiness.

Someday perhaps you can adapt this work to get a grant as an experimental computer literature artist. But for now in CS106A, this amounts to less than one page of Python code!

Aside: you are building what is known as a "Markov Model": imagine each word is a node with arrows leading out of it to each possible next word. The algorithm randomly chases through this network, picking a random arrow out of a node for each hop.

(Figure: each word is a node, with arrows leading to its possible next words)

Design Your Functions

For this assignment, we outline what the program needs to do, but you decompose your own functions and code. The requirement is that you decompose your program into three or more functions (including main()), and these functions fit together to solve the whole problem. Very often in CS106A programs, there are functions that read a file and return a data structure, and later functions take in that data structure as a parameter for further computation or printing. (The index example has this structure.)

When it's working, try making longer random texts from "alice-book.txt", "independence.txt", and "apple-license.txt", like this:

$ python3 mimic.py apple-license.txt 200
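One way the pieces might fit together in main(), reading the filename and minimum from the command line (the helper names read_words, make_bigrams, and print_mimic are the same illustrative names used in the sketches above, not required names):

import sys


def read_words(filename):
    # Read the whole file and split it into a list of words
    with open(filename) as f:
        return f.read().split()


def main():
    # Command line: python3 mimic.py filename minimum
    args = sys.argv[1:]
    filename = args[0]
    minimum = int(args[1])
    words = read_words(filename)    # file -> list of words
    bigrams = make_bigrams(words)   # list of words -> bigrams dict
    print_mimic(bigrams, minimum)   # bigrams dict -> printed random text


if __name__ == '__main__':
    main()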

Once your code is cleaned up and works right, please turn in your mimic.py file through Paperless as usual. That's a neat algorithm, and the sort of data slicing and wrangling where Python works well.

Appendix A: Previous State

A standard code-pattern that comes up sometimes is to loop through elements as usual, but for each element, the code also has access to what the previous element was. The standard solution is to introduce a "previous" variable, give it an initial value, and then at the bottom of the loop, assign previous to the current value, so it's ready for the next iteration.

Here's the technique in a for/range loop:

previous = -1   # set special previous value for the very first elem
for i in range(100):
    # in here, use i and previous
    print(previous, i)
    # at the end of the loop, set previous for next iteration
    previous = i
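The same pattern applies when looping over a list of words, with '' as the special previous value for the very first word (a small illustrative example, not part of the starter code):

words = 'This and that and the other.'.split()
previous = ''  # '' plays the role of previous for the first word
for word in words:
    print(previous, '->', word)
    previous = word  # ready for the next iteration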

Appendix B: Random Choice

The Python "random" module contains functions which provide pseudo-random numbers. We'll explore modules in more detail in later weeks, but for now we'll just use this one. For this program, call the random.choice() function which returns, at random, one element from a non-empty list. The needed "import random" is provided at the top of the starter file.

>>> import random
>>> lst = ['a', 'b', 'c']
>>> random.choice(lst)
'a'
>>> random.choice(lst)
'b'
>>> random.choice(lst)
'a'

Appendix C: Pydoc + DocTest Examples

Here are some representative Python examples, showing Pydoc and Doctests. The num_histogram() example demonstrates writing a test that refers to a local file by its name.

def index_line(index, line):
    """
    Given an index dict and a line of text.
    Use split() to access all the words in the line.
    Using the lowercase form of each word as the key,
    update and return the index dict.
    >>> index_line({}, 'tea time')
    {'tea': ['tea time'], 'time': ['tea time']}
    >>> index_line({'tea': ['tea time'], 'time': ['tea time']}, 'coffee time')
    {'tea': ['tea time'], 'time': ['tea time', 'coffee time'], 'coffee': ['coffee time']}
    """


def parse_tags(tweet):
    """
    Given a tweet string like '@bob: this #is #it', returns
    a list of tags like ['#is', '#it'].
    >>> parse_tags('This #tweet is #fire')
    ['#tweet', '#fire']
    >>> parse_tags('@user: #tag1 #tag2')
    ['#tag1', '#tag2']
    """


def num_histogram(filename):
    """
    Given filename of a text file of scores.
    Each line of the file contains an int 0..100.
    Compute a counts dict, using a "bucket" of
    0, 10, 20, .... e.g. bucket 20 counts scores 20..29
    >>> num_histogram('small.txt')
    {60: 3, 70: 4, 80: 2, 90: 1}
    """