Section #6: Dictionaries

July 26th, 2021


Written by Juliette Woodrow, Brahm Capoor, Nick Parlante, Anna Mistele, John Dalloul, and Jonathan Kula

Learning goals

  • Understanding the dictionary data structure, and how it differs from lists
  • Understanding the difference between keys and values in a dictionary, and how to get and set the value associated with a particular key
  • Understanding the nested dictionary structure, and how it lends itself to representing complex information
  • Understanding how to use decomposition-by-variable name to more easily and readably update a nested dictionary structure

Nested Dictionaries

To get some practice with dictionaries and nesting collections, implement the following functions:

  1. def print_course_info(explore_courses, course_id, info) which takes in a nested dictionary of courses and their information, a specific course id, and a specific type of information to print about that course id. The function should print out the information if the dictionary has that information for the given course. If the course is in the dictionary, but it does not have that information stored, it should print 'We do not have that information for the given course.' If the course is not in the dictionary at all, it should print 'Sorry, that course is not in our system.'

    explore_courses is a nested dictionary where the outer key is the ID of a course, such as 'cs106a', and the value is a dictionary with keys representing information about the course, such as 'name', 'units', 'ways', 'instructor', and 'ta'. Each of these keys maps to its corresponding value, which is either a string or an int. The explore_courses dictionary is not complete for all courses, so some info may be missing for certain courses. Below is an example explore_courses dictionary and some example calls to print_course_info.

                    
    EXPLORE_COURSES = {
        "cs106a": {
            "name": "Programming Methodology",
            "units": 5,
            "ways": "FR",
            "instructor": "Nick Parlante",
            "ta": "Juliette Woodrow"
        },
        "cs106b": {
            "name": "Programming Abstractions",
            "units": 5,
            "ways": "FR",
            "instructor": "Chris Gregg",
            "ta": "Chase Davis"
        },
        "taps104": {
            "name": "Intermediate Improvisation",
            "units": 3,
            "instructor": "Dan Klein"
        }
    }
    
    >>> print_course_info(EXPLORE_COURSES, 'cs106b', 'ta')
    Chase Davis
    
    >>> print_course_info(EXPLORE_COURSES, 'cs107', 'ta')
    Sorry, that course is not in our system.
    
    >>> print_course_info(EXPLORE_COURSES, 'taps104', 'ways')
    We do not have that information for the given course.
                    
                  

  2. def add_price(name, type, price, gas_prices) which takes in the name of a gas station (e.g. "Speedway"), the type of gas (e.g. "unleaded"), a price (e.g. 3.50), and a gas_prices dictionary. The function adds the price to gas_prices and returns the updated dictionary. The gas_prices dictionary should be set up so that each key is a gas station and each value is another dictionary whose keys are gas types and whose values are gas prices. For example, add_price("Speedway", "unleaded", 3.50, {}) should return {"Speedway": {"unleaded": 3.50}}. If there is already a price for a given gas station / gas type pairing, overwrite it.

  3. def first_list(strs) Given a list of strings, create and return a dictionary whose keys are the unique first characters of the strings and whose values are lists of words beginning with those characters, in the same order that they appear in strs.

  4. def suffix_list(strs) Given a list of strings, create and return a dictionary whose keys are the suffixes of those strings and whose values are lists of words ending with those suffixes, in the same order that they appear in strs. A suffix is defined as the last 2 characters of a string, and a string that is less than 2 characters long has no suffix.
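
One possible approach to these four functions is sketched below. It is only an illustration of the "check for the key, then work on the inner collection" pattern, not an official solution. Skipping empty strings in first_list is an assumption, since the problem statement does not say how to treat them.

```python
def print_course_info(explore_courses, course_id, info):
    """Print one piece of information about a course, or an apology."""
    if course_id not in explore_courses:
        print('Sorry, that course is not in our system.')
        return
    course = explore_courses[course_id]  # inner dict for this course
    if info in course:
        print(course[info])
    else:
        print('We do not have that information for the given course.')


def add_price(name, type, price, gas_prices):
    """Record a price for (station, gas type), overwriting any old price.

    The parameter name `type` follows the handout's signature, although it
    shadows Python's built-in of the same name inside this function.
    """
    if name not in gas_prices:
        gas_prices[name] = {}
    station = gas_prices[name]  # decomposition by variable
    station[type] = price
    return gas_prices


def first_list(strs):
    """Group strings by their first character, keeping the order of strs."""
    firsts = {}
    for s in strs:
        if len(s) > 0:  # assumption: empty strings have no first character
            ch = s[0]
            if ch not in firsts:
                firsts[ch] = []
            firsts[ch].append(s)
    return firsts


def suffix_list(strs):
    """Group strings by their last 2 characters, keeping the order of strs."""
    suffixes = {}
    for s in strs:
        if len(s) >= 2:  # shorter strings have no suffix
            suffix = s[-2:]
            if suffix not in suffixes:
                suffixes[suffix] = []
            suffixes[suffix].append(s)
    return suffixes
```

In add_price, storing the inner dictionary in the variable station means the update is a single, readable assignment instead of a doubly indexed expression.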

Word counts

Your job is to write a full Python program, including a main function, for a file called special-count.py. This program should implement the following behavior:

  1. If special-count.py is called with just a filename (e.g. python3 special-count.py myfile.txt) then it should read in the file named myfile.txt (you may assume that this file is formatted as single words separated by newlines) and produce counts of words that share the same consonants in order. For example, if we had the following text in myfile.txt:

                    
                  
    great
    grate
    greet
    teeny
    tiny
    bump
                    
                  

    Your program should produce the following output. Note that the output is in sorted order.

                    
    $ python3 special-count.py myfile.txt
    bmp -> 1
    grt -> 3
    tny -> 2
                    
                  
  2. If special-count.py is called with the additional flag -vowels (e.g. python3 special-count.py -vowels myfile.txt) then it should produce the same output, only grouping words by their vowels rather than their consonants. So, using the same file as before produces the following output:
                
    $ python3 special-count.py -vowels myfile.txt
    ae -> 1
    ea -> 1
    ee -> 2
    i -> 1
    u -> 1
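
The core of this program can be sketched as a key function plus the usual dict-count loop. The sketch below is a rough illustration, not the official solution: the names word_key, count_keys, and print_counts are hypothetical, file reading and command-line handling are left out, and it assumes that the vowels are a, e, i, o, u (so 'y' is treated as a consonant, which matches the 'tny' line in the sample output) and that words are lowercased before grouping.

```python
VOWELS = 'aeiou'  # assumption: 'y' counts as a consonant, as in 'tny'


def word_key(word, use_vowels):
    """Return the vowels of word (if use_vowels) or its consonants, in order."""
    key = ''
    for ch in word.lower():  # assumption: grouping ignores case
        if ch.isalpha() and (ch in VOWELS) == use_vowels:
            key += ch
    return key


def count_keys(words, use_vowels):
    """Dict-count algorithm: map each key to the number of words producing it."""
    counts = {}
    for word in words:
        key = word_key(word, use_vowels)
        if key not in counts:
            counts[key] = 0
        counts[key] += 1
    return counts


def print_counts(counts):
    """Print each 'key -> count' line in sorted key order."""
    for key in sorted(counts.keys()):
        print(key, '->', counts[key])
```

A main function would then only need to inspect the command-line arguments, read the words from the file, and call count_keys with use_vowels set to True when the -vowels flag is present.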
                  
                

Big Tweet Data!

In this problem, you'll write a program that reads through a large collection of tweets and stores data to keep track of how hashtags occur in users' tweets. This is a great example of how Python can be used in data analysis tasks.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of the poster's username (prefixed by a '@' symbol), followed by a colon and then the text of the tweet. Each character in this file can be a character from any language, or an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

          
          
@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education
https://t.co/iqxXtruqrt
          
        

We provide 3 such files for you in the PyCharm Project: small-tweets.txt, big-tweets.txt and huge-tweets.txt.

Building a user_tags Dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user's name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary (since it uses the dict-count algorithm we've seen a bunch in class). Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.

1. Warmup questions

Given the dictionary above, what updates would we make to it in each of the following cases?

  • We encounter a new tweet that reads '@BarackObama: #Obamacare signups now!'.
  • We encounter a new tweet that reads '@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6'.

2. Implementing add_tweet

The add_tweet() function is the core of this whole program, and is responsible for performing the update to a user_tags dictionary described above. The tests shown below represent a sequence, expressed as a series of Doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and gets back a dictionary with 1 user and 1 tag. The 2nd test then takes that returned dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.

You're all string parsing experts by now, so we won't make you do that work anymore. We've provided you with two functions, parse_tag and parse_user. Each takes the tweet in question as a parameter; they return the tag in the tweet and the username that posted the tweet, respectively.

          
          
def add_tweet(user_tags, tweet):
    
    """
    Given a user_tags dict and a tweet, parse out the user and the hashtag,
    and add those counts to the user_tags dict which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tag(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple')
    {'@alice': {'#apple': 1}}
    >>> add_tweet({'@alice': {'#apple': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 1}, '@bob': {'#apple': 1}}
    """
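
A minimal sketch of how add_tweet might be filled in follows. It is not the official solution. The parse_user and parse_tag functions here are simplified, hypothetical stand-ins for the provided helpers (the real ones may behave differently, for example around punctuation), and the sketch assumes that a tweet with a user but no hashtag still gives that user an empty counts dictionary.

```python
def parse_user(tweet):
    """Hypothetical stand-in: return the '@name' before the first colon, or ''."""
    colon = tweet.find(':')
    if tweet.startswith('@') and colon != -1:
        return tweet[:colon]
    return ''


def parse_tag(tweet):
    """Hypothetical stand-in: return the first word starting with '#', or ''."""
    for word in tweet.split():
        if word.startswith('#') and len(word) > 1:
            return word
    return ''


def add_tweet(user_tags, tweet):
    """Add the tweet's user and hashtag counts to user_tags and return it."""
    user = parse_user(tweet)
    if user == '':
        return user_tags  # no user in the tweet: leave the dict unchanged

    if user not in user_tags:
        user_tags[user] = {}
    counts = user_tags[user]  # decomposition by variable: the inner dict

    tag = parse_tag(tweet)
    if tag != '':
        # Standard dict-count update on the inner dictionary
        if tag not in counts:
            counts[tag] = 0
        counts[tag] += 1
    return user_tags
```

Because counts refers to the same inner dictionary that is stored in user_tags, updating counts updates the nested structure in place.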
          
        
3. Implementing parse_tweets

Use add_tweet in a loop to build up and return a user_tags dict. This should look mostly like other file-reading functions you've written, and your job is to make sure you understand how to follow the pattern of creating and updating a dictionary suggested by the add_tweet function. Restated, the responsibility of add_tweet is to update a dictionary, and parse_tweets must create and maintain that dictionary as it is updated.
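
The create-then-update pattern could look roughly like the sketch below. The parse_tweets(filename) signature is an assumption, and the add_tweet defined here is only a compact stand-in with simplified parsing so that the example runs on its own; in the real program you would call your own add_tweet from step 2.

```python
def add_tweet(user_tags, tweet):
    """Compact stand-in for step 2's add_tweet (simplified parsing)."""
    user, colon, text = tweet.partition(':')
    if colon == '' or not user.startswith('@'):
        return user_tags  # no user found: dict unchanged
    if user not in user_tags:
        user_tags[user] = {}
    counts = user_tags[user]
    for word in text.split():
        if word.startswith('#'):
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return user_tags


def parse_tweets(filename):
    """Read tweets from filename, one per line, and return a user_tags dict."""
    user_tags = {}  # parse_tweets creates the dictionary...
    with open(filename, encoding='utf-8') as f:
        for line in f:
            tweet = line.strip()
            # ...and add_tweet updates it, one tweet at a time
            user_tags = add_tweet(user_tags, tweet)
    return user_tags
```

Opening the file with encoding='utf-8' is a choice made here so that emoji and non-English characters are read consistently across platforms.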

Running your program

We provide a main function that calls the parse_tweets function you implemented in a variety of ways. To use it, run the program from the terminal. When run with just 1 argument (a data filename), it reads in all the data from that file and prints out a summary of each user and all their hashtags and counts:

          
          
$ python3 tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1
          
        

When run with the '-users' argument, main prints out all the usernames:

          
          
$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA
          
        

When run with the '-user' argument followed by a username, the program prints out the data for just that user.

          
          
$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
          
        

Extensions

You probably won't get to this extension in section, but if you have time, implement this additional function, which you can then use to answer some interesting questions about hashtag use.

Implementing flat_counts

It's natural to be curious about how often tags are used across users. This function takes in a user_tags dictionary and computes a new "flat" count dictionary:

          
          
def flat_counts(user_tags):
    """
    Given a user_tags dict, sum up the tag counts across all users,
    return a "flat" counts dict with a key for each tag,
    and its value is the sum of that tag's count across users.
    >>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
    {'#apple': 2, '#banana': 2}
    """
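
One way flat_counts might be written is with a nested loop: the outer loop visits each user's counts dictionary, and the inner loop adds each tag's count into the flat dictionary. This is a sketch of that idea, not necessarily the intended solution.

```python
def flat_counts(user_tags):
    """Sum each tag's count across all users into one flat dict."""
    flat = {}
    for user in user_tags.keys():
        counts = user_tags[user]  # this user's inner counts dict
        for tag in counts.keys():
            if tag not in flat:
                flat[tag] = 0
            flat[tag] += counts[tag]  # add this user's count for the tag
    return flat
```

Note that the update adds counts[tag] rather than 1, since each inner entry may already represent several uses of the tag.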
          
        

main will call that function with the -flat argument, like so:

          
          
$ python3 tweets.py -flat small-tweets.txt
flat
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3
 .
 .
 .