Homework 5a - Parse Mystery

For this project you have a text file which has, hidden within it, data that identifies a mystery object on campus. Your mission is to decipher the file to figure out what the campus object is.

All parts of HW5 are due Wed Oct 30th 11:55 pm as usual.

Homework 5 Warmups

To get started with parsing, we have 2 warmup functions. Complete the code for these functions.

> Parse Warmups

Submit to Paperless: submit work

Install Pillow

This project will use the "Pillow" library. A library is a body of already written code which you import and use, and in this case the Pillow library contains code to manipulate images. We'll use libraries in more detail later. For this assignment you need to install Pillow on your machine so your code can use it.

Open a "terminal" window - the same type of window where you type "python3 foo.py" to run programs. The easiest way to get a terminal is to use the Terminal tab at the lower-left within PyCharm. Type the following command (shown in bold as usual). Note that "Pillow" starts with an uppercase P. (On Windows, type "python" instead of "python3".)

$ python3 -m pip install Pillow
..prints stuff...
Successfully installed Pillow-5.4.1

To test that Pillow is working, type the following command to a terminal inside your parse-mystery folder. This runs the "simpleimage.py" code included in the folder. When run like this, simpleimage.py creates and displays a big yellow rectangle.

# (inside your parse-mystery folder)
$ python3 simpleimage.py
# yellow rectangle appears

If you cannot get Pillow installed successfully, you can still write most of the code for this project. You will need to comment out the line "from simpleimage import SimpleImage" near the top of the starter file, and see the "fix Pillow" hours listed on our course page.

Frightful Numbers

Download the parse-mystery.zip to get started.

For this project, you will need to make sense of text lines like the following which have numbers hidden in them. The numbers are frightfully messed up (these are from 480k.txt):

47$ 42^ 18$bj55
77$b 51*25 b35 44*35
*32 j46@ 65^!05$#Z90^(32 x
wait there's no digits in this one at all!
31*32^#34)68^ 60!38$ 74 b148^*60 53#38 c21  28*)

Each number is represented by some text starting with a digit, following these rules:

1. The numbers are non-negative integers, like 123

2. The first char of each number is always a digit.

3. If a '$' char appears immediately after a number, its digits are backwards. So '211$' is the number 112.

4. If a '^' char appears immediately after a number, it's as if that number is not present in the data, and it is omitted from the output. So for example '176^' would be omitted.

5. The numbers are separated from each other by random chars which are not '^' or '$' or digits.
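As a small illustration of rules 3 and 4, here is a sketch of how one number could be handled once its digits and the char right after them are known. The name decode_number() is hypothetical and is not part of the starter code. It assumes the digits have already been located in the line, which is the harder part of parse_line(). It uses a slice to reverse the digits; a reverse(s) helper would work equally well.

```python
def decode_number(digits, marker):
    """
    Hypothetical helper, not in the starter code.
    digits: the run of digit chars for one number, e.g. '211'
    marker: the char immediately after the digits, or '' if there is none
    Returns the int value, or None if the number should be omitted.
    """
    if marker == '^':
        # Rule 4: act as if this number is not in the data
        return None
    if marker == '$':
        # Rule 3: the digits are written backwards
        digits = digits[::-1]
    return int(digits)


print(decode_number('211', '$'))   # 112
print(decode_number('176', '^'))   # None
print(decode_number('55', 'b'))    # 55
```

Any char other than '$' or '^' after the digits is just a separator, so the number is used as written.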

Here is the first example line:


What are the numbers in there?

[800, 600, 64, 63, 61, 60]

This looks rather impossible at first. But using graduated tests and some Python, you can boil that mess down to some nice clean numbers.

a. parse_line(s)

Given a string text line from a data file, extract all the numbers as described above from the line and return them in order in a list. If the line contains zero valid numbers, return the empty list.

This may be the most difficult function yet this quarter. In this case, decomposition, apart from reverse(), is not a big help. The lines of complexity in parse_line() all go together in this one function.

Three tiny tests are provided to get started. Build your code gradually while building up the tests: Make the tiny tests work first. Then add tests with a couple more digits or other chars. Add a test with no digits at all. Then combine 2 or more issues in one string. Smallness is a virtue in the early tests, e.g. '2^'. When the string is only 2 chars long, it's easy to trace through the lines of code to see why a test is failing.

Build up larger cases, and cases with 2 or 3 different number types in the string. In this way, you can gradually build up your code to solve all the cases. Your parse_line() must have at least 8 Doctests. We are requiring you to write that many Doctests, so you may as well get some benefit from them as you get this code working.

When you are feeling brave, add tests made from whole lines from 480k.txt as shown above.

If you have a reverse(s) function from HW4, you can paste it into this file and use it here as a helper, and you can write any new helper functions. Later we'll see how to share functions across files, but for now just paste the helper function in. Any helper functions should have Pydoc and at least three Doctests, as on HW4.
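For reference, here is a sketch of what a helper with Pydoc and three Doctests can look like. This is one possible reverse(s), written for illustration; your HW4 version may be written differently and is fine to use instead.

```python
def reverse(s):
    """
    Given a string s, return a new string with
    the chars of s in reverse order.
    >>> reverse('abc')
    'cba'
    >>> reverse('a')
    'a'
    >>> reverse('')
    ''
    """
    result = ''
    for ch in s:
        # Put each char in front of what has been built so far
        result = ch + result
    return result
```

Note how the Doctests start with the smallest cases, including the empty string.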

b. parse_file(filename)

Read all the lines out of the given file. Extract all the numbers out of each line, gathering all the numbers together in one giant list, maintaining their order from the file.

Here are the contents of the "3lines.txt" file, showing a few lines of sample data.

47$ 42^ 18$bj55

One Doctest is provided for parse_file() that references this file. Doctests can refer to test data files in the same directory as the source code in this way.
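The gathering step can be sketched as below. This is a sketch under assumptions, not the required solution: the name gather_numbers() is hypothetical, and line_parser is a stand-in parameter for your parse_line() so the example can run on its own.

```python
def gather_numbers(filename, line_parser):
    """
    Sketch: read every line of the file, run line_parser on each line,
    and collect all the resulting numbers in one list, in file order.
    line_parser is a stand-in for your parse_line().
    """
    nums = []
    with open(filename) as f:
        for line in f:
            # extend() adds each number from this line onto the end of nums
            nums.extend(line_parser(line))
    return nums
```

In your own parse_file(), you would call parse_line(line) directly in the loop. A line with no valid numbers returns the empty list, so it adds nothing to the result.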

c. main() -nums Test

The main() code is provided for this project. If the command line is "-nums file.txt", main() calls your parse_file() function and prints the list returned. The files 3lines.txt and 10lines.txt have some test lines, so this is another way to check that the numbers look reasonable.

$ python3 parse-mystery.py -nums 3lines.txt
[800, 600, 64, 63, 61, 60, 74, 81, 55, ...

Programming strategy aside: when your program is, say, 50% built, it's helpful that you can run that 50% to see some sort of output from it, confirming that what's built so far works. You don't want to write 100% of the code, and only then start running it. The Doctests are a big help with this strategy.

Grayscale Pixels

The following fact is needed for this puzzle: For each pixel in a grayscale image, the red, green, and blue values in each pixel must be equal. So for example a pixel might have red=50 green=50 blue=50 to be dark gray, or red=212 green=212 blue=212 to be light gray.

Mystery Data

The mystery campus object is somehow described in the file "480k.txt" which contains 480002 ints. The beginning of the list of ints looks like this (this is just the list your parse_file() returns):

[800, 600, 64, 63, 61, 60, 74, 81, 55,...

What could these 480002 numbers represent?

It's a grayscale image! The first number is the width. The second number is the height. The remaining 480000 (800 * 600 = 480000) numbers are the grayscale values, one number per pixel. The pixel values are laid out in the 1-dimensional list. After the width and height are all the values of the top y=0 row, then all the values for y=1, then all the values for y=2 and so on in 1 big list:

[width, height, y=0x=0, y=0x=1, y=0x=2, .... y=0x=799, y=1x=0, y=1x=1, y=1x=2 ... y=599x=799]

This ordering of the pixels is the same, standard order that the range/y/x loops visit the pixels of an image.
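To make the layout concrete, here is a sketch that goes in the opposite direction from the puzzle: it takes a tiny 2-d grid of gray values and produces the flat list. The name flatten_image() and the tiny sample data are made up for illustration and are not part of the starter code.

```python
def flatten_image(rows):
    """
    Hypothetical illustration, not in the starter code.
    rows: a list of rows, top row first, each row a list of gray values.
    Returns [width, height] followed by all the values, row by row.
    """
    height = len(rows)
    width = len(rows[0])
    flat = [width, height]
    for y in range(height):
        for x in range(width):
            flat.append(rows[y][x])
    return flat


# A tiny image, 3 pixels wide and 2 pixels high
tiny = [
    [10, 20, 30],   # y=0 row
    [40, 50, 60],   # y=1 row
]
print(flatten_image(tiny))   # [3, 2, 10, 20, 30, 40, 50, 60]
```

The mystery files were laid out in this same order, so your job is to reverse the process.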

d. solve_mystery()

Parse all the numbers from the given filename. Figure out the width and height of the desired image. The code to create a blank image and loop over it is the same as in week 2 - it's included in the starter file.

SimpleImage starter code:

    width = ???  # determine proper width and height values
    height = ???
    image = SimpleImage.blank(width, height)
    for y in range(image.height):
        for x in range(image.width):
            pixel = image.get_pixel(x, y)
            # use pixel.red etc. in here

    # this puts the built image on screen
    image.show()

Fix this code to set the pixel.red etc. of every pixel in the image using the parsed out values, using the provided range/y/x loops. This is a bit of a code puzzle. The range/y/x loops are going over the 2-d image. Each time through that loop, you want to grab the right parsed value. Think about which value you want the first time the loop runs, the second time, the third time, etc. to work out a pattern.

The provided main() is set up to call your solve_mystery() function when there is just 1 command line argument, like this:

$ python3 parse-mystery.py 480k.txt

The 480k.txt thing lives a bit downhill from Lake Lagunita. The 100k.txt shows a famous thing on campus (and the file is smaller, if the large files take too long on your computer). The 600k.txt file is the biggest, showing a famous thing in California. For a smaller test, the 60k.txt shows a pair of friends. Your code should be able to solve all of these. The little 3lines.txt and 10lines.txt test files do not contain images.

Protip: type the first few letters of a filename, then hit tab to auto-complete it. Hit tab a couple times, and it will list all the candidate filenames that match what you have typed so far.

When your code is cleaned up with good style and solves these puzzles correctly, please turn in your "parse-mystery.py" file on Paperless. Also make sure the warmup functions are done.