Section 6. Nested Structures

Written by Juliette Woodrow, Brahm Capoor, Nick Parlante, John Dalloul, and Zheng Lian

This week's section will give you practice with the dictionary data structure. In particular, you'll explore the nested dictionary structure and how it lends itself to representing complex information. You'll also explore how to use decomposition-by-variable to more easily and readably update a nested dictionary structure.

Counting by Consonants

Implement the following function:

def count_by_consonants(filename):
    """
    Reads in the file whose name is filename and returns a dictionary
    that contains counts of words that share the same consonants in order.
    """

For example, if the file consonants.txt contains this text:

great
grate
teeny
greet
tiny
bump 

calling count_by_consonants('consonants.txt') will return the dictionary {'grt': 3, 'tny': 2, 'bmp': 1}.

Wannabe Astronauts

The inspiration for this problem came from this video by Veritasium on YouTube

How lucky do you have to be to be an astronaut?

In 2020 more than 12,000 people applied to join NASA's newest class of astronauts. In a few years, 10-12 of those people will be selected to join the Artemis program and become astronauts. Will those 10-12 truly be the best of the bunch, or could there be some luck involved? Source

Programming a Simulation

Let's explore this question by making a simulation using code. Complete function make_applicants(num_applicants) that creates num_applicants applicants who want to be astronauts, where each applicant is created with a random float of skill and luck. Before writing this function, take a minute to consider what data structures would be helpful here. What information needs to be stored for each applicant? How can we store the information on all of the applicants? This function should return the data structure containing information on all of the applicants.

Then run these applicants through a skill test: each applicant must pass a minimum skill score (like 0.90) to make the first cut. To do this, write a function make_first_cut_applicants(applicants) which takes as parameter a data structure holding all applicants and their scores and returns a list of all of the skilled astronauts.

After this first cut, all remaining applicants are highly skillful, but it's time to select the best of the best. Find the best applicants by combining their luck and skill scores (made from some proportion such as 95% skill and 5% luck) and make the final list of astronauts from the applicants with the ten highest combined scores. To do so, first write a function calc_combined_score(applicant) which takes as parameter an applicant and returns their combined luck and skill score. Then use this function to create a list of the top ten applicants.

When you find the top ten astronauts, compare them to the group of highly skilled applicants that made the first cut. Find the average skill and luck scores of each group,then use the percent difference function to see how important each score was in the final decision process.

So is Success All Hard Work? - Discussion

If you were to ask one of the selected astronauts why they were chosen, what might they say?

Stanford accepted 2,349 out of 45,227 applicants for the class of 2024. What happens when you use those numbers instead? Where do you think this difference comes from? Check out the numbers for other classes at the bottom of this page.

What do the results of this simulation say about the possible effects of luck and skill in highly selective programs? In what ways can this simulation be misleading?

If you like, try altering some of the constants in your code, such as increasing the percentage of luck in the combined score, lowering the skill score needed to pass the first cut, or selecting a greater proportion of applicants.

Drawing Friend Graphs

Python is commonly used to help analyze data sets by creating visualizations of said data. One common example of data visualization is to look at relationships between users social networks. Programmers can use python to help draw the network. The visual can be used to measure statistics of the network or to find clusters, which are groups of individuals who have many of the same friends. There are many reasons to search for close-knit groups in social networks - for instance, to create targeted marketing schemes or to suggest groups they'd like to be a part of. Cluster analysis has been also used to identify terrorist cells.

In this problem, you are going to use your python programming skills to create a visual representation of a social network where users can follow each other. More specifically, you will use information from two text files and draw lines representing the relationships in the network. If a user follows another user, your output should have a line connecting the two people.

You will be given two files through command-line arguments. The first file is a list of each person in the network, each on a separate line, where their name is followed by a colon and then a list of all the people that they follow. The second file is a list of coordinates. Each line has the name of a person in the network, followed by a colon and then a comma separated list of two integers representing the x and y coordinates you will use to place the node representing that person on the canvas. For example, two matching files, users.txt and coords.txt, might look like this (note that the users needn't necessarily be in the same order in both files):

users.txt

Juliette: Brahm, Nick, Julie, Cynthia
Brahm: Juliette, Nick, Cynthia, Mehran, Chris, Cynthia
Mehran: Oliver, Chris
Chris: Mehran, Oliver
Nick: Juliette, Julie, Keith
Julie: Juliette, Nick, Cynthia
Oliver: Mehran, Chris
Cynthia: Juliette, Julie, Keith
Keith: Nick, Cynthia  

coords.txt

Brahm: 141, 343
Chris: 390, 65
Cynthia: 100, 250
Julie: 185, 670
Juliette: 238, 409
Keith: 30, 400
Mehran: 550, 145
Nick: 14, 585
Oliver: 699, 18

Your job is to implement the following function:

def draw_friend_graph(canvas, friends_file, coordinates_file):
    """
    Draws a graph representing the friend network. For each user, 
    draw a circle at their respective coordinates and a label with their name. 
    Next, draw lines connecting that circle to the circle of each person they
    follow. 
    """

As you consider how best to approach the problem and store your data, keep in mind that relationships aren't necessarily symmetric. For example, note that Brahm follows several users in the example above, but despite his best efforts to use in-vogue hashtags, has but a single humble follower.

You may assume that the two files provided are valid and represent the same users, but an interesting extension to this problem might be to verify that.

First Letter Index

Implement the following function:

def first_list(strs):
    """
    Given a list of strings, create and return a dictionary whose 
    keys are the unique first characters of the strings and whose 
    values are lists of words beginning with those characters, in 
    the same order that they appear in strs.

    >>> first_list(['banter', 'brahm', 'aardvark', 'python', 'antiquated'])
    {'b': ['banter', 'brahm'], 'a': ['aardvark', 'antiquated'], 'p': ['python']}
    """

Cryptography

Cryptography is the study of techniques for communicating messages secretly. Imagine that Alice and Bob want to send messages to each other, but Eve can snoop on the messages they're sending and read them. Alice wants to figure out way to "encrypt" her messages so that if Eve reads the message, she won't be able to understand it, but Bob will be able to "decrypt" the message. We're going to write a program to help Alice and Bob do this.

Alice decides that she'll replace every letter in her original message with a different letter. She's defined a variable in the program called ENCRYPTION_DICT that keeps track of these associations:

ENCRYPTION_DICT = {
 'A': 'T',
 'B': 'H',
 'C': 'E',
 'D': 'Q',
 'E': 'U',
 'F': 'I',
 'G': 'C',
 'H': 'K',
 ...
}

Alice and Bob exchanged this dictionary before we went into quarantine, so they both know that this is the strategy, but Eve doesn't know that! Note that in order to avoid ambiguity, the values in this dictionary are unique (that is, each letter is a value in the dictionary exactly once).

Encryption

To start us off, implement the following function:

def encrypt(plaintext):
    """
    Takes in plaintext as an input and returns 'ciphertext': the result 
    of substituting each letter in the plaintext by its corresponding 
    encrypted character in ENCRYPTION_DICT. 

    The plaintext comprises entirely of uppercase letters and non-alphabetic 
    characters like punctuation. Non-alphabetic characters needn't be encrypted, 
    but rather should appear in the plaintext in their original form. 

    >>> encrypt("HEY, HOW'S IT GOING?")
    "KUD, KXZ'S BV CXBFC?"
    >>> encrypt("I LOVE CS 106A!")
    'B WXLU ES 106T!'
    >>> encrypt("UNICORNS ARE THE MOST BEAUTIFUL ANIMALS IN EXISTENCE")
    'AFBEXPFS TPU VKU NXSV HUTAVBIAW TFBNTWS BF UYBSVUFEU'
    """

Decryption

Now that Bob has received Alice's encrypted message, he needs to "decrypt" it, or convert it back to the original message. To help him do so, implement the following function:

def decrypt(ciphertext):
    """
    Uses ENCRYPTION_DICT to decrypt each of the alphabetic characters of
    ciphertext.

    >>> decrypt("KUD, KXZ'S BV CXBFC?")
    "HEY, HOW'S IT GOING?"
    >>> decrypt('B WXLU ES 106T!')
    'I LOVE CS 106A!'
    >>> decrypt('AFBEXPFS TPU VKU NXSV HUTAVBIAW TFBNTWS BF UYBSVUFEU')
    'UNICORNS ARE THE MOST BEAUTIFUL ANIMALS IN EXISTENCE'
    """

Hint

Note that in order to successfully decrypt a message, you need the 'reverse' of ENCRYPTION_DICT: rather than associating plaintext characters with their encrypted counterparts, we need to go the other way.

We suggest writing a function reverse_encryption_dict to decompose out this problem. In class, we saw an example of this wherein a reversed dictionary associated keys with lists of values, because multiple keys can share the same value. Note that this doesn't apply to the case of ENCRYPTION_DICT, because each character is guaranteed to have a unique character. How does this affect how you implement reverse_encryption_dict?

This is an interesting encryption scheme (known formally as a substitution cipher), but unfortunately isn't very secure. What issues do you see with it?

Recipes

Inspired by an unhealthy amount of Netflix and quarantine cooking, Parth and Peter are opening up a bakery. To try and keep costs low, they want to automate as much of their inventory tracking as possible.

Thanks to 106A, they think they can use dictionaries to help.

All of the ingredients they have available will be stored in a pantry dictionary, whose keys are ingredients and values are weights (Of course they're using the metric system), like so:

pantry = {
    'flour': 400,
    'sugar': 300,
    'salt': 10,
    'chocolate': 150
} 

Each recipe is also stored in a dictionary, like the following uninspiring concoction:

recipe = {
  'flour': 200,
  'salt': 2.5
}

(As you can tell from the recipe above, it doesn't look like their bakery is going to do very well, but that's not important right now.)

Recipes and the pantry are first stored as files, where each key value pair is on a different line, separated by ':: '. The pantry list and recipe above would look like:

pantry.txt

flour:: 400
sugar:: 300
salt:: 10
chocolate:: 150   

recipe.txt

flour:: 200
salt:: 2.5  

Our goal is to write a few functions to help Parth and Peter run their bakery. Begin by implementing the following function:

def read_dict_from_file(filename):
    """
    Takes in the name of a file containing a recipe or 
    pantry list and reads it into a dictionary.
    
    An example doctest using the file above:
    >>> read_dict_from_file('recipe.txt')
    {'flour': 200, 'salt': 2.5}
    """

Once you've developed the infrastructure to construct recipe and pantry list dictionaries, implement the following functions:

def can_make(recipe, pantry):
    """
    Given the contents of the pantry, returns a boolean indicating 
    whether or not it is possible to follow the recipe. Note that 
    the parameters to this function are dictionaries, and not 
    filenames. The pantry should not be modified in this function
    """
    pass

def make_recipe(recipe, pantry):
    """
    Given a recipe and a pantry with enough ingredients to make the recipe, 
    modify the contents of the pantry to remove as many quantities as the 
    recipe requires. You may modify the pantry in place, but return the modified 
    pantry in order to test the output using doctests. 

    # using the recipe and pantry defined above
    >>> make_recipe(recipe, pantry)
    {'flour': 200, 'sugar': 300, 'salt': 7.5, 'chocolate': 150}
    """
    pass

Note that make_recipe assumes the pantry is sufficient for the recipe but this is not necessarily always the case; thus, every call to make_recipe will need to be guarded by a call to can_make_recipe.

Whilst implementing can_make function, you might find the dictionary's .get function helpful, which accepts a parameter suggesting what to return if the value is not in the dictionary. For example, calling recipe.get('yeast', 42) will return 42 if 'yeast' is not a key in the recipe dictionary.

Finally, implement a main function, which first reads in a pantry file from a user and then continuously asks for recipe filenames from the user and either prints an error message if the pantry does not have sufficient ingredients, or removes the ingredients from the pantry if it does exist, printing the pantry afterwards.

A sample run of the program is below, which repeatedly tries to make the recipe above in the vain hope that someone will actually want to eat it. Note that your program should be able to accept any valid recipe filename, though. User input is bolded and italicized.

$ python3 pantry_manager.py 
Enter pantry filename: pantry.txt
What recipe should we bake next (Press enter to quit.)? recipe.txt
You can make that recipe! Your pantry now looks like this:
{'flour': 200, 'sugar': 300, 'salt': 7.5, 'chocolate': 150}

What recipe should we bake next (Press enter to quit.)? recipe.txt
You can make that recipe! Your pantry now looks like this:
{'flour': 0, 'sugar': 300, 'salt': 5, 'chocolate': 150}

What recipe should we bake next (Press enter to quit.)? recipe.txt
You can't make that recipe. 

What recipe should we bake next (Press enter to quit.)? User presses enter immediately

Anagrams

Two words are anagrams if they consist of the same letters in a different order. Your job in this problem is to write a program that allows users to see what words are anagrams of words that they type. Here's some sample output (user input is bolded and italicized):

$ python3 anagrams.py
Word: listen
['enlist', 'inlets', 'listen', 'silent', 'slinte', 'tinsel']
Word: race
['acer', 'acre', 'care', 'cera', 'crea', 'race']
Word: python
['phyton', 'python', 'typhon']
Word: programming
['programming']
Word: piech
piech is not in the dictionary
Word: microphone
['microphone', 'neomorphic']
Word: User presses enter immediately

The design of this program is largely up to you, although we have a few suggestions for you:

We've provided a constant LEXICON whose value is the name of a file containing all of the words in the English language. It may be helpful to find some way of associating words with their anagrams.
A useful observation is that two anagrams are identical when their characters are sorted. For example, 'LISTEN' and 'SILENT', when sorted are both 'eilnst'. Python has an inbuilt sorted function, which returns a list of the characters in a string in alphabetical order. This list can then be converted back to a string like so:

>>> string = 'banter'
>>> sorted_characters = sorted('banter')
>>> sorted_characters
['a', 'b', 'e', 'n', 'r', 't']
>>> sorted_string = ''.join(sorted_characters)
'abenrt'

Big Tweet Data

In this program, you'll write a program that reads through a large collection of tweets and store the data to keep track of how hashtags occur in tweets. This is a great example of how Python can be used in data analysis tasks.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of the poster's username (prefixed by a '@' symbol), followed by a colon and then the text of the tweet. Each character in this file can be a character from any language, or an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education #vt #realestate
https://t.co/iqxXtruqrt

We provide 3 such files for you in the PyCharm Project: small-tweets.txt, big-tweets.txt and huge-tweets.txt.

Building a `user_tags` Dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user's name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary. Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.

1. Warmup questions

Given the dictionary above, what updates we would make to it in each of the following cases?

We encounter a new tweet that reads '@BarackObama: #Obamacare signups now!'.
We encounter a new tweet that reads '@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6'.

2. Implementing `add_tweet`

The add_tweet function is the core of this whole program, and is responsible for performing the update to a user_tags dictionary described above. The tests shown below represent a sequence of tweets, expressed as a series of Doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and gets back a dictionary with 1 user and 2 tags. The 2nd test then takes that returned dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.

We've provided you with two functions entitled parse_tags and parse_user, both of which take as a parameter the tweet in question and return a list of tags in the tweet and the username that posted the tweet, respectively.

def add_tweet(user_tags, tweet):
    """
    Given a user_tags dict and a tweet, parse out the user and tags,
    and add those counts to the user_tags dict which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 2}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 2}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
    """ 

3. Implementing `parse_tweets`

Use add_tweet in a loop to build up and return a user_tags dict. This should look mostly like other file-reading functions you've written, and your job is to make sure you understand how to follow the pattern of creating and updating a dictionary suggested by the add_tweet function. Restated, the responsibility of add_tweet is to update a dictionary, and parse_tweets must create and maintain that dictionary as it is updated.

Running your program

We provide a main function that calls the parse_tweets function you implemented in a variety of ways. To use it, run the program from the terminal. Run with just 1 argument (a data filename), it reads in all the data from that file and prints out a summary of each user and all their tweets and counts:

$ python3 tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1

When run with the '-users' argument, main prints out all the usernames:

$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA  

When run with the '-user' argument followed by a username, the program prints out the data for just that user.

$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3 

Extension

You probably won't get to this extension in section, but if you have time, implement this additional function which you can then leverage to answer some interesting questions about Hashtag use.

Implementing `flat_counts`

It's natural to be curious about how often tags are used across users. This function takes in a user_tags dictionary and computes a new "flat" count dictionary:

def flat_counts(user_tags):
    """
    Given a user_tags dicts, sum up the tag counts across all users,
    return a "flat" counts dict with a key for each tag,
    and its value is the sum of that tag's count across users.
    >>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
    {'#apple': 2, '#banana': 2}
    """    

main will call that function with the -flat argument, like so:

$ python3 tweets.py -flat small-tweets.txt
flat
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3
 .
 .
 .