Homework 6a - Baby Names

This project uses the Social Security Administration's baby names data set - names of babies born in the US going back more than 100 years.

All parts of HW6 are due Wed Nov 13th at 11:55pm as usual. The file babynames.zip contains a "babynames" folder to get started. In a later project, you will build upon the data capabilities here.

Warmups

We have 6 warmup functions to get you started. These use the list patterns from lecture and 2 dict problems.

> List Patterns

Submit to Paperless: submit work

Baby Data

Let's see what form the data is in to start with. At the Social Security baby names site, you can visit a different web page for each year. Here's what the data looks like in a web page (indeed, this is pretty close to the birth year for many students in CS106A - hey there Emily and Jacob!)

Popularity in 2000
Rank Male name Female name
1 Jacob Emily
2 Michael Hannah
3 Matthew Madison
4 Joshua Ashley
5 Christopher Sarah
...

In this data set, rank 1 means the most popular name, rank 2 means next most popular, and so on down through rank 1000. The data is divided into "male" and "female" columns. (To be strictly accurate, at birth when this data is collected, not all babies are categorized as male or female. That's rare enough to not affect the numbers at this level.)

baby-2000.txt

A web page is encoded as - you guessed it! - plain text in a format called HTML. For your project, we have done a superficial cleanup of the HTML text and stored it in files such as "baby-2000.txt", which look like:

2000
1,Jacob,Emily
2 , Michael , Hannah
3,Matthew,Madison
4,Joshua,Ashley
5,Christopher,Sarah
6,Nicholas,Alexis
7,Andrew,Samantha
...
997,Vincenzo,Maiya
998,Dayne,Melisa
999,Francesco,Adrian
1000,Isaak,Marlen

Data Organization

A door is what a dog is perpetually on the wrong side of. - Ogden Nash

Data in the real world is very often not in the form you need. Reasonably for the Social Security Administration, their data is organized by year. Each year they get all those forms filled out by parents, they crunch it all together, and eventually publish the data for that year, such as we have as baby-2000.txt.

However, the most interesting analysis of the data requires organizing it by name, across many years. This will be the main, highly realistic data challenge for this project.

Names Data Structure

We'll say that the "names" dict structure for this program has a key for every name. The value for each name is a nested dict, mapping int year to int rank:

{
'Aaden': {2010: 560},
'Aaliyah': {2000: 211, 2010: 56},
...
}

Each name has data for 1 or more years, but which years have data for each name jumps around. In the above data, 'Aaliyah' jumped from rank 211 in 2000 to 56 in 2010 (these names are alphabetically first in the 2000 + 2010 data set). An empty dict is a valid names data structure - it just has zero names in it.
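As a small illustration of how this nested structure is read, here is a sketch using only the two example names shown above. The variable names "names" and "empty" are just for this demo.

```python
# A tiny "names" dict in the shape described above (data copied from the example).
names = {
    'Aaden': {2010: 560},
    'Aaliyah': {2000: 211, 2010: 56},
}

# The outer key is the name; the inner dict maps int year -> int rank.
print(names['Aaliyah'][2010])       # rank of 'Aaliyah' in 2010
print(sorted(names['Aaliyah']))     # years that have data for 'Aaliyah'
print(2000 in names['Aaden'])       # is there 2000 data for 'Aaden'?

empty = {}  # also a valid names dict, with zero names in it
print(len(empty))
```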

Functions below will work on this "names" data structure.

a. Add Name

The add_name() function takes in the data for a single name (year, rank, and name), and adds it into the names dict. Later phases can call this function in a loop to build up the whole data set.

The dict is passed in as a parameter. Python never passes a copy, but instead passes a reference to the one dict in memory. In this way, if add_name() modifies the passed in "names" dict, that's the same dict being used by the caller. The function also returns the names dict to facilitate writing Doctests.
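The reference behavior can be seen in a minimal sketch. The function add_marker() here is a made-up helper for the demo, not part of the starter code.

```python
def add_marker(d):
    """Hypothetical helper: modify the passed-in dict and return it."""
    d['seen'] = True
    return d


original = {}
returned = add_marker(original)

# The caller's dict was changed by the function, and the
# returned value is that very same dict, not a copy.
print(original)
print(returned is original)
```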

The starter code includes a single Doctest as an example (below). The test passes in the empty dict as the input names, and some fake data for baby 'abe'.

def add_name(names, year, rank, name):
    """
    Add the given data: int year, int rank, str name
    to the given names dict and return it.
    (1 test provided, more tests TBD)
    >>> add_name({}, 2000, 10, 'abe')
    {'abe': {2000: 10}}
    """

Write at least 2 additional tests for add_name(). Pass in a non-empty names dict for at least 1 Doctest to test that names and years accumulate in the dict correctly. This function is short but dense. Doctests are a good fit for this situation, letting you explicitly identify and work on the tricky cases.

The Sammy Issue

In rare cases a name, e.g. 'Sammy', appears twice in the data: once as a male name and once as a female name. We need a policy for how to handle that case. Our policy will be to store whichever rank number is smaller, e.g. if 'Sammy' comes in for a year at rank 100, and also comes in for that year at rank 200, use the rank 100. Your tests should include this case. This sort of rare case in the data is more likely to cause bugs; it doesn't fit the common data pattern you have in mind as you write the code. In baby-2000.txt the name 'Christian' shows this issue, and there are other such names in this giant data set.

CS Observation: if 99% of the data is one way, and 1% is some other way, that doesn't mean the 1% is going to require less work just because it's rare. A hallmark of computer code is that it forces you to handle 100% of the cases.
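To make the "store the smaller rank" policy concrete, here is a sketch on a single year-to-rank dict. The helper keep_smaller_rank() is hypothetical and is only one possible way to express the policy; it is not the add_name() function itself.

```python
def keep_smaller_rank(ranks, year, rank):
    """
    Hypothetical helper: record rank for year in the year->rank dict,
    keeping the smaller rank if that year is already present.
    """
    if year not in ranks or rank < ranks[year]:
        ranks[year] = rank
    return ranks


ranks = {}
keep_smaller_rank(ranks, 2000, 200)   # first sighting of the name for 2000
keep_smaller_rank(ranks, 2000, 100)   # smaller rank replaces 200
keep_smaller_rank(ranks, 2000, 300)   # larger rank is ignored
print(ranks)
```

Note that the order in which the two ranks arrive should not matter, which is worth covering in a test.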

b. Add File

The simple baby text format for this data looks like:

2000
1,Jacob,Emily
2 , Michael , Hannah
3,Matthew,Madison
4,Joshua,Ashley
5,Christopher,Sarah
6,Nicholas,Alexis
7,Andrew,Samantha
...
997,Vincenzo,Maiya
998,Dayne,Melisa
999,Francesco,Adrian
1000,Isaak,Marlen

The year is on the first line. The later lines each have the rank, male name, female name separated from each other by commas. There may be superfluous whitespace chars separating the data, as in the rank 2 line above. Don't assume the data runs to exactly 1000, which would make the function too single-purpose. Just process all the lines there are.
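One way to cope with the stray whitespace is to split on commas and strip each piece. The sketch below assumes every data line has exactly three comma-separated fields; parse_line() is a hypothetical helper name, not something in the starter code.

```python
def parse_line(line):
    """
    Hypothetical helper: split one data line such as '2 , Michael , Hannah\n'
    into (int rank, str male_name, str female_name).
    """
    parts = line.split(',')
    rank = int(parts[0].strip())
    return rank, parts[1].strip(), parts[2].strip()


print(parse_line('2 , Michael , Hannah\n'))
print(parse_line('3,Matthew,Madison\n'))
```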

Write the code to add the contents of one file.txt to the names dict parameter, which is returned. Tests are provided for this function, using the feature that a Doctest can refer to a file in the same directory. Here the tests use the relatively small test files "small-2000.txt" and "small-2010.txt" to build a names dict.

In this case, you want to treat the first line of the file differently than all the other lines. Therefore the standard for-line-in-file is a little awkward, but there are other ways to get the lines of a text file. Here is a friendly reminder of the Python ways to read a file:

# Always open the file first
with open(filename, 'r') as f:

    # 1. Go through all the lines, the super common pattern
    for line in f:
        ...

    # 2. Alternative: read the entire file contents into 1 text string
    text = f.read()

    # 3. Alternative: read the entire file contents in as a list of strings,
    # one string for each line. Similar to #1, but a list that can be
    # processed with a later foreach loop, you can grab a subset of the lines
    # with a slice, etc.
    lines = f.readlines()
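As one possible illustration of alternative 3, the sketch below reads all the lines into a list and then uses a slice to separate the first line from the rest. The helper name read_header_and_rest() and the throwaway demo file are assumptions made for this example.

```python
import os
import tempfile


def read_header_and_rest(filename):
    """
    Hypothetical helper: return (first_line, list_of_remaining_lines),
    with surrounding whitespace stripped from each line.
    """
    with open(filename, 'r') as f:
        lines = f.readlines()
    lines = [line.strip() for line in lines]
    return lines[0], lines[1:]


# Demo on a throwaway file with the same shape as the small test files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo-2000.txt')
    with open(path, 'w') as f:
        f.write('2000\n1 , A , B\n2,C,A\n')
    header, rest = read_header_and_rest(path)

print(header)
print(rest)
```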

c. read_files()

Write code for read_files() which takes a list of filenames, building and returning a names dict of all their data. This function is called by main() to build up the names dict from all the files mentioned on the command line. No tests are required for this short function.

d. search_names()

Write code for search_names() which searches for a target string and returns a sorted list of all the name strings that match the target (no year or rank data). In this case, the target matches a name, case-insensitively, if the target appears anywhere in the name. For example the target strings 'aa' and 'AA' both match 'Aaliyah' and 'Ayaan'. Return the empty list if no names match the target string. This function is called by main() for the -search command line argument.

Write at least 3 Doctests for search_names(), which is the most algorithmic function here. You can make up a tiny names dict just for the tests.
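The matching rule for a single name can be sketched as follows. The helper matches() is hypothetical and covers only the comparison of one target against one name, not the full search_names() function.

```python
def matches(target, name):
    """Hypothetical helper: True if target appears anywhere in name, ignoring case."""
    return target.lower() in name.lower()


print(matches('aa', 'Aaliyah'))
print(matches('AA', 'Ayaan'))
print(matches('aa', 'Emily'))
```

Lowercasing both strings before the "in" test is one common way to get a comparison that ignores case.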

Provided: main() and print_names()

We've provided the main() function. Given 1 or more baby data file arguments, main() reads them in with your read_files() function, and then calls the provided print_names() function (2 lines long!) to print all the data out.

The files small-2000.txt and small-2010.txt have just a few test names, A B C D E, so they are good to hand-check that your output is correct, and of course your Doctests are working on your decomposed functions to check them individually. The output should be the same if small-2010.txt is loaded before small-2000.txt.

Running your code to load multiple files:

$ python3 babynames.py small-2000.txt small-2010.txt 
A [(2000, 1), (2010, 2)]
B [(2000, 1)]
C [(2000, 2), (2010, 1)]
D [(2010, 1)]
E [(2010, 2)]

For reference, here are the contents of the small files:

small-2000.txt:

2000
1 , A , B
2,C,A

small-2010.txt:

2010
1,C,D
2 , A  , E

Try It With Real Data


The small files test that the code is working correctly, but are no fun. The provided main() function looks at all the files listed on the command line, and loads them all by passing them to your read_files() function. You can take a look at 4 decades of data with the following command in the terminal (use the tab-key to complete file names without all the typing).

$ python3 babynames.py baby-1980.txt baby-1990.txt baby-2000.txt baby-2010.txt
...tons of output!...

Optional *

(optional experiment to try) Windows users - I apologize, this key command line feature does not work in the Windows terminal. You can install the Windows Subsystem for Linux to get a terminal on Windows where this works.

A handy feature of the terminal is that you can enter baby-*.txt to mean all the filenames with that pattern: baby-1900.txt baby-1910.txt ... baby-2010.txt. This is an incredibly handy shorthand when you are working through some data problem with a bunch of different files. This also maybe explains why CS and data-science people tend to use patterns to name their data files, so the filenames work with this * feature. You can demonstrate this with the "ls" command, which prints out filenames, like this (in Windows PowerShell, the "ls" command with this pattern does work):

$ ls baby-*.txt
baby-1900.txt	baby-1930.txt	baby-1960.txt	baby-1990.txt
baby-1910.txt	baby-1940.txt	baby-1970.txt	baby-2000.txt
baby-1920.txt	baby-1950.txt	baby-1980.txt	baby-2010.txt

This * feature fits perfectly with babynames.py. The following terminal command loads all 12 baby-xxx.txt files without typing in anything else:

$ python3 babynames.py baby-*.txt

This terminal command is expanding the * to hit all the files, running the 24,000-odd data points through your functions to get it all organized in the blink of an eye ... that's how the data scientists do it.
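To get a feel for which filenames a pattern like baby-*.txt selects, here is a sketch using the fnmatch module from Python's standard library, which applies the same style of wildcard matching. This is only an illustration of the pattern; the terminal does its own expansion before babynames.py ever runs, and the list of filenames below is made up for the demo.

```python
from fnmatch import fnmatch

# Made-up list of filenames, standing in for a directory listing.
filenames = ['babynames.py', 'baby-1900.txt', 'baby-2010.txt', 'small-2000.txt']

# Keep only the names that fit the pattern; * stands for any run of chars.
matching = [name for name in filenames if fnmatch(name, 'baby-*.txt')]
print(matching)
```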

Search

Organizing all the data and dumping it out is impressive, but it is a blunt instrument. The main() function connects to your search function like this: if the first 2 command line args are "-search target", then main() reads in all the data and calls your search_names() function to find matching names and print them. Here is an example with the search target "aa":

$ python3 babynames.py -search aa baby-2000.txt baby-2010.txt
Aaden
Aaliyah
Aarav
Aaron
Aarush
Ayaan
Isaac
Isaak
Ishaan
Sanaa

When everything is working, please turn in your babynames.py on Paperless. We'll do something fun with this data and code in the next homework, but for now you've solved the key part of reading and organizing a realistic mass of data.