Section #6: Dictionaries

February 13th, 2022


Written by Juliette Woodrow, Brahm Capoor, Nick Parlante, Anna Mistele, John Dalloul, and Jonathan Kula

Learning goals

  • Understanding the dictionary data structure, and how it differs from lists
  • Understanding the difference between keys and values in a dictionary, and how to get and set the value associated with a particular key
  • Understanding the nested dictionary structure, and how it lends itself to representing complex information
  • Understanding how to use decomposition by variable name to more easily and readably update a nested dictionary structure

Warmup with Nested Dictionaries

def suffix_list(strs) Given a list of strings, create and return a dictionary whose keys are the suffixes of those strings and whose values are lists of words ending with those suffixes, in the same order that they appear in strs. A suffix is defined as the last 2 characters of a string, and a string that is less than 2 characters long has no suffix.
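
For reference, here is one possible sketch of this warmup (certainly not the only correct way to write it):

def suffix_list(strs):
    result = {}
    for s in strs:
        if len(s) < 2:
            continue              # strings shorter than 2 characters have no suffix
        suffix = s[-2:]           # the last 2 characters
        if suffix not in result:
            result[suffix] = []   # first word seen with this suffix
        result[suffix].append(s)  # keeps words in the order they appear in strs
    return result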

Big Tweet Data!

In this problem, you'll write a program that reads through a large collection of tweets and stores data to keep track of how hashtags occur in users' tweets. This is a great example of how Python can be used for data analysis tasks.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of the poster's username (prefixed by an '@' symbol), followed by a colon and then the text of the tweet. Each character in this file can be a character from any language, or an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education
https://t.co/iqxXtruqrt

We provide 3 such files for you in the PyCharm Project: small-tweets.txt, big-tweets.txt and huge-tweets.txt.

Building a user_tags Dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user's name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary (since it uses the dict-count algorithm we've seen a bunch in class). Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.
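
As a concrete illustration of that strategy, here is how the update for a single user/tag pair could look, using the example dictionary above and the decomposition-by-variable-name idea from the learning goals (the variable names are just one choice):

user_tags = {'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}
user = '@BarackObama'
tag = '#SCOTUS'

# Pull the inner dict out into its own variable, then treat it like an
# ordinary counts dict.
if user not in user_tags:
    user_tags[user] = {}      # first tweet we have seen from this user
counts = user_tags[user]      # the nested 'counts' dict for this user
if tag not in counts:
    counts[tag] = 0           # first time this user has used this tag
counts[tag] += 1              # '#SCOTUS' is now 5 in user_tags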

1. Warmup questions

Given the dictionary above, what updates would we make to it in each of the following cases?

  • We encounter a new tweet that reads '@BarackObama: #Obamacare signups now!'.
  • We encounter a new tweet that reads '@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6'.
  • We encounter a new tweet that reads '@BarackObama: #NationalDogDay'.
  • We encounter a new tweet that reads '@BarackObama: Reminder to sign up for #Obamacare'.

2. Implementing add_tweet

The add_tweet() function is the core of this whole program, and is responsible for performing the update to a user_tags dictionary described above. The tests shown below represent a sequence, expressed as a series of Doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and gets back a dictionary with 1 user and 1 tag. The second test then takes that returned dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.

You're all string parsing experts by now, so we won't make you do that work anymore. We've provided you with two functions, parse_tags and parse_user, both of which take the tweet in question as a parameter and return, respectively, the tag in the tweet and the username that posted it.

def add_tweet(user_tags, tweet):
    """
    Given a user_tags dict and a tweet, parse out the user and the hashtag,
    and add those counts to the user_tags dict which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple')
    {'@alice': {'#apple': 1}}
    >>> add_tweet({'@alice': {'#apple': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
    """
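
If you get stuck, here is a minimal sketch of one way add_tweet could go, assuming the provided parse_tags(tweet) and parse_user(tweet) helpers return None when a tweet has no tag or no user (check the starter code for their exact behavior):

def add_tweet(user_tags, tweet):
    user = parse_user(tweet)
    tag = parse_tags(tweet)
    if user is None or tag is None:
        return user_tags          # nothing to count for this tweet
    if user not in user_tags:
        user_tags[user] = {}      # first tweet we have seen from this user
    counts = user_tags[user]      # nested counts dict for this user
    if tag not in counts:
        counts[tag] = 0
    counts[tag] += 1
    return user_tags
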
3. Implementing parse_tweets

Use add_tweet in a loop to build up and return a user_tags dict. This should look mostly like other file-reading functions you've written, and your job is to make sure you understand how to follow the pattern of creating and updating a dictionary suggested by the add_tweet function. Restated, the responsibility of add_tweet is to update a dictionary, and parse_tweets must create and maintain that dictionary as it is updated.
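
As a sketch of that pattern (assuming parse_tweets takes a filename; match your signature to the starter code), it could look something like:

def parse_tweets(filename):
    user_tags = {}          # parse_tweets creates and owns the dict
    with open(filename) as f:
        for line in f:
            # add_tweet updates the dict (or returns it unchanged)
            user_tags = add_tweet(user_tags, line.strip())
    return user_tags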

Running your program

We provide a main function that calls the parse_tweets function you implemented in a variety of ways. To use it, run the program from the terminal. Run with just 1 argument (a data filename), it reads in all the data from that file and prints out a summary of each user and all of their hashtag counts:

$ python3 tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1

When run with the '-users' argument, main prints out all the usernames:

$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA

When run with the '-user' argument followed by a username, the program prints out the data for just that user.

$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3

Extensions

You probably won't get to this extension in section, but if you have time, implement this additional function, which you can then leverage to answer some interesting questions about hashtag use.

Implementing flat_counts

It's natural to be curious about how often tags are used across users. This function takes in a user_tags dictionary and computes a new "flat" count dictionary:

def flat_counts(user_tags):
    """
    Given a user_tags dict, sum up the tag counts across all users.
    Return a "flat" counts dict with a key for each tag, whose value
    is the sum of that tag's count across all users.
    >>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
    {'#apple': 2, '#banana': 2}
    """
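
If you attempt it, one possible sketch of flat_counts looks like this (it just sums the nested counts into a single dict):

def flat_counts(user_tags):
    flat = {}
    for user in user_tags:
        counts = user_tags[user]      # nested counts dict for this user
        for tag in counts:
            if tag not in flat:
                flat[tag] = 0
            flat[tag] += counts[tag]  # add this user's count for the tag
    return flat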

main will call that function when the program is run with the -flat argument, like so:

$ python3 tweets.py -flat small-tweets.txt
flat
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3
 .
 .
 .

Nested Dictionaries

To get some practice with dictionaries and nesting collections, implement the following functions:

  1. def first_chars(strs) Given a list of strings, create and return a counts dictionary whose keys are the unique first characters of the strings in strs and whose values are the number of words that start with that character.
    strs = ['stanford', 'python', 'computer', 'science', 'democracy', 'day']
    counts = {'s': 2, 'p': 1, 'c': 1, 'd': 2}
  2. def first_list(strs) Given a list of strings, create and return a dictionary whose keys are the unique first characters of the strings and whose values are lists of words beginning with those characters, in the same order that they appear in strs.

  3. def add_price(name, type, price, gas_prices) Given the name of a gas station (e.g. "Speedway"), the type of gas (e.g. "unleaded"), a price (e.g. 3.50), and a gas_prices dictionary, add the price to the gas_prices dictionary and return it. The gas_prices dictionary should be set up so that each key is a different gas station and each value is another dictionary whose keys are types of gas and whose values are gas prices. For example, add_price("Speedway", "unleaded", 3.50, {}) should return: {"Speedway" : {"unleaded" : 3.50}} If there is already a price for a given gas station / gas type pairing, overwrite it. (One possible sketch follows this list.)
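
For reference, here is one possible sketch of add_price; it uses the same pull-out-the-inner-dict pattern as the tweet problem:

def add_price(name, type, price, gas_prices):
    if name not in gas_prices:
        gas_prices[name] = {}      # first price recorded for this station
    prices = gas_prices[name]      # nested dict mapping gas type -> price
    prices[type] = price           # overwrites any existing price for this type
    return gas_prices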

Word counts

This problem is a challenge problem. It is a bit too much to be covered in section this week, but it does cover a lot of 106A concepts so feel free to come chat about it at office hours.

Your job is to write a full Python program, including a main function, for a file called special-count.py. This program should implement the following behavior (one possible sketch appears after the examples below):

  1. If special-count.py is called with just a filename (e.g. python3 special-count.py myfile.txt) then it should read in the file named myfile.txt (you may assume that this file is formatted as single words separated by newlines) and produce counts of words that share the same consonants in order. For example, if we had the following text in myfile.txt:

    great
    grate
    greet
    teeny
    tiny
    bump

    Your program should produce the following output. Note that the output is in sorted order.

    $ python3 special-count.py myfile.txt
    bmp -> 1
    grt -> 3
    tny -> 2
  2. If special-count.py is called with the additional flag -vowels (e.g. python3 special-count.py -vowels myfile.txt) then it should produce the same output, only grouping words by their vowels rather than their consonants. So, using the same file as before produces the following output:

    $ python3 special-count.py -vowels myfile.txt
    ae -> 1
    ea -> 1
    ee -> 2
    i -> 1
    u -> 1
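
If you want a starting point for this challenge, here is a hedged sketch of one possible structure for special-count.py. The helper names make_key and count_keys are made up for this sketch (they are not required), it assumes lowercase words, and it treats 'y' as a consonant, which matches the expected output above:

import sys

VOWELS = 'aeiou'


def make_key(word, use_vowels):
    """Build the grouping key for word: its vowels in order if use_vowels
    is True, otherwise its consonants in order."""
    key = ''
    for ch in word:
        if (ch in VOWELS) == use_vowels:
            key += ch
    return key


def count_keys(filename, use_vowels):
    """Read one word per line from filename and count words per key."""
    counts = {}
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if not word:
                continue          # skip blank lines
            key = make_key(word, use_vowels)
            if key not in counts:
                counts[key] = 0
            counts[key] += 1
    return counts


def main():
    args = sys.argv[1:]
    use_vowels = False
    if args and args[0] == '-vowels':
        use_vowels = True
        args = args[1:]
    counts = count_keys(args[0], use_vowels)
    for key in sorted(counts):        # print in sorted order, as shown above
        print(key, '->', counts[key])


if __name__ == '__main__':
    main()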