## Homework 6b - Mimic

All parts of HW6 are due Wed Feb 26th at 11:55pm.

This small project will use a dict data structure for a text algorithm with some whimsy to it. The program will be able to produce a random imitation of any source text.

The file mimic.zip has the mimic project directory. For this project, the handout will outline the problem to solve, but the starter file does not include any functions. You will need to create a set of functions that work together to solve the whole problem.

## Bigrams Algorithm

Given a text, we'll say the "bigrams" data records what words come immediately after other words in the text. For example the text:

`This and that and the other.`

Yields the bigrams:

```'This' → 'and'
'and' → 'that', 'the'
'that' → 'and'
'the' → 'other.'
```

In the text, 'This' is followed only by 'and', but 'and' is followed by both 'that' and 'the'.

Call the 2 words in the pair "previous" and "next". In terms of a data structure, the bigrams dict has a key in it for every previous word in the text (the last word in the text is not necessarily a previous word). The value of each key is a list of all the words that immediately follow previous in the text (duplicates are allowed in the list). As a one-off addition, the empty string '' is treated as being previous to the very first word. So for example the text 'This and that and the other.' yields the following bigrams dict:

```{
'': ['This'],
'This': ['and'],
'and': ['that', 'the'],
'that': ['and'],
'the': ['other.'],
}
```

Here's a bit of the bigrams of the Gettysburg address:

```{
'': ['Four'],
'Four': ['score'],
'score': ['and'],
'and': ['seven', 'dedicated', 'so', 'proper', 'dead,', 'that'],
'seven': ['years'],
...
}
```

The word 'Four' only has the word 'score' after it, but the word 'and' has many words which appear after it.

## Bigrams Output

What is the bigrams data good for? The bigrams data enables an algorithm that chases through the words, creating a random text that resembles the original like this:

2. Look up words that come after the choice in the bigrams
3. Select one of those words at random and print it followed by a space but no '\n'
4. Go back to step 2 with the word just printed as the choice and repeat
5. If in step 2 the choice is not in the bigrams, use '' instead which is always present

## Always Do The Minimum

The printing process is guided by a "minimum" int parameter, such as 25. The printing algorithm continues to put out words until at least the minimum number of words have been printed. Once the minimum is reached (i.e. the 25th word is printed), a word ending wth a '.' or ';' ends the printing at that word. This amounts to some loop/break code — a loop to do all the words, but exiting the loop on just the right moment. In this way the random text sort of ends at the end of a sentence instead cutting off mid-phrase. Friendly reminder of Python builtin: str.endswith()

## Mimic Output

Here is some random text we get with this algorithm from the gettysburg address. It has a feverish quality, somehow echoing the words and tone of the original but without completely making sense.

Four score and proper that these honored dead shall not consecrate -- we here have a great battle-field of that nation, under God, shall have died in a new birth of that field, as a great battle-field of devotion to the people, for us the living, rather, to the people, by the earth.

Here we have the declaration of independence. You can see where it sort of gets stuck in a loop at times:

When in the causes which impel them to assume among the causes which have connected them to which the Course of human events it becomes necessary for one people to the separation. We hold these truths to the Course of Nature's God entitle them, a decent respect to assume among these truths to be self-evident, that all men are endowed by their Creator with another and equal station to be self-evident, that all men are created equal, that they are endowed by their Creator with another and equal station to assume among these are endowed by their Creator with certain unalienable Rights, that they are created equal, that they should declare the pursuit of human events it becomes necessary for one people to the powers of Nature's God entitle them, a decent respect to the Laws of mankind requires that they should declare the pursuit of Nature and the Course of Happiness.

Someday perhaps you can adapt this work to get a grant as an experimental computer literature artist. But for now in CS106A, this amounts to less than one page of Python code!

Aside: you are building what is known as a "Markov Model": imagine each word is a node with arrows leading out of it to each possible next word. The algorithm randomly chases through this network, picking a random arrow out of a node for each hop.

For this assignment, we outline what the program needs to do, but you decompose your own functions and code. The requirement is that you decompose your program into three or more functions (including main()), and these functions fit together to solve the whole problem. Very often in CS106A programs, there are functions that read a file and return a data structure, and later functions take in that data structure as a parameter for further computation or printing. (The "index" lecture example has this structure.)

• Create at least 1 function to read a file and create a bigrams data structure from it. This must have at least 2 DocTests.
• Use str.split() to separate out the "words" of the text, leaving the punctuation in place, so 'and.' counts as a word and leave the upper/lower case intact.
• Very often we use for-line-in-file to process a file line by line. The bigrams algorithm is unusual in that you can process the whole file as one big string. Here is a reminder of the code to load a whole text file into a string:
```with open(filename) as f:
words = text.split()  # as above, use split()
```
• Create at least 1 function to do the random-print algorithm. Part of the printing is the minimum parameter, either 25 as a default, or the number supplied on the command line (see below). The staff solution uses 2 functions to do the printing. The printing code does not need tests, since the randomness is not well suited to testing. Sometimes we do printing directly in main(), but for this problem, the printing is complicated enough to go in its own functions.
• For your functions other than main(), write a Pydoc """Given xxx does yyy""" sentence, summarizing what the function does with its input parameters. The DocTests go at the end of the Pydoc. PyCharm will paste in ":param s: :return:" in the Pydoc, which you can delete; those are advanced Pydoc features which are not needed for CS106A, and in fact these are rarely used in ordinary Python programs.
• Create code in main() to call your functions for the 2 command line forms of this program:
• Case-1 The default which prints with a minimum of 25 (len(args) == 1)
```\$ python3 mimic.py gettysburg.txt
```
• Case-2 A minimum number is typed in after the filename to provide a custom minimum (len(args) == 2)
```\$ python3 mimic.py gettysburg.txt 500
```
• The files test-tiny.txt test-small.txt and test-roses.txt are quite small and suitable for testing.

When it's working, try making longer random texts from from "alice-book.txt" and "independence.txt" and "apple-license.txt", like this:

```\$ python3 mimic.py apple-license.txt 200
```

Once your code is cleaned up and works right, please turn in your mimic.py file on Paperless as usual. That's a neat algorithm, and the sort of data slicing and wrangling where Python works well.

## Appendix A: Previous State

A standard code-pattern that comes up sometimes is to loop through elements as usual, but for each element, the code also has access to what the previous element was. The standard solution is to introduce a "previous" variable, give it an initial value, and then at the bottom of the loop, assign previous to the current value, so it's ready for the next iteration.

Here's the technique in a for/range loop

```previous = -1   # set special previous value for the very first elem
for i in range(100):
# in here, use i and previous
print(previous, i)
# at the end of the loop, set previous for next iteration
previous = i
```

## Appendix B: Random Choice

The Python "random" module contains functions which provide pseudo-random numbers. We'll explore modules in more detail in later weeks, but for now we'll just use this one. For this program, call the random.choice() function which returns, at random, one element from a non-empty list. The needed "import random" is provided at the top of the starter file.

```>>> import random
>>> lst = ['a', 'b', 'c']
>>> random.choice(lst)
'a'
>>> random.choice(lst)
'b'
>>> random.choice(lst)
'a'
```

## Appendix C: Pydoc + DocTest Examples

Here are some representative Python examples, showing Pydoc and Doctests. The num_histogram() example demonstrates writing a test that refers to a local file by its name.

```def index_line(index, line):
"""
Given an index dict and a line of text.
Use split() to access all the words in the line.
Using the lowercase form of each word as the key,
update and return the index dict.
>>> index_line({}, 'tea time')
{'tea': ['tea time'], 'time': ['tea time']}
>>> index_line({'tea': ['tea time'], 'time': ['tea time']}, 'coffee time')
{'tea': ['tea time'], 'time': ['tea time', 'coffee time'], 'coffee': ['coffee time']}
"""

def parse_tags(tweet):
"""
Given a tweet string like '@bob: this #is #it', returns
a list of tags like ['#is', '#it'].
>>> parse_tags('This #tweet is #fire')
['#tweet', '#fire']
>>> parse_tags('@user: #tag1 #tag2')
['#tag1', '#tag2']
"""

def num_histogram(filename):
"""
Given filename of a text file of scores.
Each line of the file contains an int 0..100.
Compute a counts dict, using a "bucket" of
0, 10, 20, .... e.g. bucket 20 counts scores 20..29
>>> num_histogram('small.txt')
{60: 3, 70: 4, 80: 2, 90: 1}
"""
```