Section #6: Dictionaries


Written by Juliette Woodrow, Anna Mistele, John Dalloul, Jonathan Kula, and Elyse Cornwall


This week in section, we'll get practice with the dictionary data structure, and see how it differs from lists. We'll get and set the value associated with a certain key in the dictionary, and use nested dictionaries to represent complex information. We'll use decomposition-by-variable to make these nested structures easier to work with.

Big Tweet Data!

In this problem, you'll write a program that reads through a collection of tweets and stores the data in a dictionary to keep track of how hashtags occur in users' tweets. This is a great example of how Python can be used in data analysis tasks.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of a username (prefixed by an '@' symbol), followed by a colon, and then the text of the tweet, which contains exactly one hashtag. The characters in this file can come from any language, and can include emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

          
@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.
@taylorswift13: life was a willow and it bent right to your wind. The #willowMusicVideo is out now!
@NASA: The #CRS18 Cygnus spacecraft is named after the late Sally Ride, the first American woman in space and an advocate for STEM education.
@NASA: Liftoff of #CRS18 is now set for 5:32am ET (1032 UTC). We are go for launch from @NASA_Wallops!
@NASA: Welcome to the International Space Station, #Crew5!

        

We provide two such files for you in the PyCharm project: small-tweets.txt and big-tweets.txt.

The users Dictionary

Central to this program is a users dictionary, in which each key is a Twitter username like '@taylorswift13'. The value for each key in this dictionary is a second, nested tag_counts dictionary which counts how many times that user has used particular hashtags. As a matter of nomenclature, we'll call the outer dictionary users and then each inner dictionary tag_counts. Here's a users dictionary based on the tweets in small-tweets.txt above:

        
{
  '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
  '@NASA': {'#CRS18': 2, '#Crew5': 1}
}
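To make the nested structure concrete, here is a minimal sketch of reading from and updating the users dictionary shown above. The variable names crs18_count and same_count are ours, chosen only for illustration.

```python
users = {
    '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
    '@NASA': {'#CRS18': 2, '#Crew5': 1},
}

# Decomposition-by-variable: pull the inner dictionary into its own variable,
# then work with it like any ordinary (non-nested) dictionary.
tag_counts = users['@NASA']
crs18_count = tag_counts['#CRS18']

# The same lookup written in one step, for comparison.
same_count = users['@NASA']['#CRS18']

# tag_counts refers to the same inner dictionary that users holds,
# so updating it also changes what we see through users.
tag_counts['#Crew5'] += 1
print(users['@NASA'])
```

Because tag_counts is a reference to the inner dictionary rather than a copy, there is no need to store it back into users after updating it.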

1. Warmup: Adding Tweets

Consider the users dictionary from the example above. How would this dictionary look after reading in the following lines from our file?

2. Implementing add_tweet()

Now that you've got a feel for how this process works, let's implement it in Python. We want to write the function add_tweet(users, tweet), which takes in a nested users dictionary and a string tweet, updates the dictionary with the data from this tweet, and then returns the dictionary. Since you're all string parsing experts by now, we won't make you write the code for parsing out the username and hashtag from each tweet. Instead, we've provided you with the functions parse_user(tweet) and parse_tag(tweet), which take in a tweet and return its username and its hashtag, respectively:

          
parse_user('@taylorswift13: The #willowMusicVideo premieres in 1 hour!') -> '@taylorswift13'
parse_tag('@taylorswift13: The #willowMusicVideo premieres in 1 hour!') -> '#willowMusicVideo'
          
        

Let's walk through an example based on our warmup. Suppose we have a users dictionary with some tweet data already in it, and we want to add one more tweet to this dictionary using our add_tweet() function:

        
users = {
  '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
  '@NASA':          {'#CRS18': 2, '#Crew5': 1}
}

add_tweet(users, '@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.')

After calling add_tweet(), our users dictionary should look like this. Notice that the count for '#RedTaylorsVersion' has incremented by one.

  
{
  '@taylorswift13': {'#RedTaylorsVersion': 2, '#willowMusicVideo': 1},
  '@NASA':          {'#CRS18': 2, '#Crew5': 1}
}
        
      

Hint: when updating the users dictionary, use "decomposition-by-variable" to put the specific user's tag_counts dictionary into a variable and then update that using the counts algorithm we've seen this week in lecture. How might this look different for a username we haven't encountered before?
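One possible way to put the hint into practice is sketched below; it is not the only correct solution. The parse_user() and parse_tag() functions here are hypothetical stand-ins written so the example runs on its own. The versions provided in the project may be implemented differently.

```python
def parse_user(tweet):
    """Hypothetical stand-in for the provided helper: the text before the first colon."""
    return tweet.split(':', 1)[0]


def parse_tag(tweet):
    """Hypothetical stand-in: the first '#' plus the letters and digits that follow it."""
    start = tweet.find('#')
    end = start + 1
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]


def add_tweet(users, tweet):
    user = parse_user(tweet)
    tag = parse_tag(tweet)

    # A username we haven't seen before needs an empty inner dictionary first.
    if user not in users:
        users[user] = {}

    # Decomposition-by-variable: work with this user's inner dictionary directly.
    tag_counts = users[user]

    # Standard counts algorithm on the inner dictionary.
    if tag not in tag_counts:
        tag_counts[tag] = 0
    tag_counts[tag] += 1

    return users


users = {
    '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
    '@NASA': {'#CRS18': 2, '#Crew5': 1},
}
add_tweet(users, '@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.')
print(users['@taylorswift13'])
```

The only difference for a brand-new username is the first if statement: once the empty inner dictionary exists, the rest of the function runs exactly as it does for a user we have already seen.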

3. Implementing read_tweets()

Now, you'll read in all of the tweets from a tweets file and construct a users dictionary. To implement the function read_tweets(), call add_tweet() in your file reading loop to build up a users dictionary, then return it at the end of the function after reading all the lines of the file.
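The file reading loop might look like the sketch below. We assume read_tweets() takes a filename as its only parameter, which the handout does not state explicitly. The add_tweet() here is a compact stand-in (with its own hypothetical parsing) so the example runs on its own; in the project you would call the add_tweet() you wrote in part 2. The demo writes a few abridged tweets to a temporary file rather than relying on small-tweets.txt being present.

```python
import os
import re
import tempfile


def add_tweet(users, tweet):
    """Compact stand-in for add_tweet() from part 2; the parsing here is hypothetical."""
    user = tweet.split(':', 1)[0]
    tag = re.search(r'#\w+', tweet).group()
    tag_counts = users.setdefault(user, {})
    tag_counts[tag] = tag_counts.get(tag, 0) + 1
    return users


def read_tweets(filename):
    """Build a users dictionary from a file containing one tweet per line."""
    users = {}
    with open(filename) as f:
        for line in f:
            tweet = line.strip()  # drop the trailing newline
            if tweet:             # skip blank lines, if there are any
                add_tweet(users, tweet)
    return users


# Demo on a temporary file holding three abridged tweets.
sample = (
    '@NASA: The #CRS18 Cygnus spacecraft is named after the late Sally Ride.\n'
    '@NASA: Liftoff of #CRS18 is now set for 5:32am ET (1032 UTC).\n'
    '@NASA: Welcome to the International Space Station, #Crew5!\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
users = read_tweets(tmp.name)
os.remove(tmp.name)
print(users)
```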

Running your program

We provide a main function that calls your read_tweets() function in a few different ways, depending on the command-line arguments. To use it, run the program from the terminal. Run with just one argument (a data filename) to read in all the data from that file and print out a summary of each user and all their hashtags:

          
$ python3 big_tweet_data.py small-tweets.txt
@NASA
  #CRS18 -> 2
  #Crew5 -> 1
@taylorswift13
  #RedTaylorsVersion -> 1
  #willowMusicVideo -> 1
          
        

When run with the '-users' argument, main prints out all the usernames:

          
$ python3 big_tweet_data.py -users small-tweets.txt
users
@NASA
@taylorswift13
          
        

When run with the '-user' argument followed by a username, the program prints out the data for just that user:

          
$ python3 big_tweet_data.py -user @taylorswift13 small-tweets.txt
user: @taylorswift13
  #RedTaylorsVersion -> 1
  #willowMusicVideo -> 1
          
        

Extension: flat_counts

If you happen to have some extra time and want to explore this data further, here's an optional extension problem. It's natural to be curious about how often tags are used across all users, not just by a specific user. The function flat_counts(users) takes in a users dictionary and computes a new "flat" counts dictionary that maps each hashtag to its total count across all users:

          
users = {
  '@elyse': {'#cs106a': 4, '#Stanford': 2},
  '@nick':  {'#cs106a': 5, '#yotter': 12}
}

flat_counts(users) -> {'#cs106a': 9, '#Stanford': 2, '#yotter': 12}
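A minimal sketch of one way to compute the flat counts follows, using a nested loop over each user's inner dictionary. Treat it as an illustration of the idea, not as the reference solution.

```python
def flat_counts(users):
    flat = {}
    for user in users:
        # Decomposition-by-variable: grab this user's inner dictionary.
        tag_counts = users[user]
        for tag in tag_counts:
            # Counts algorithm, except we add this user's count instead of 1.
            if tag not in flat:
                flat[tag] = 0
            flat[tag] += tag_counts[tag]
    return flat


users = {
    '@elyse': {'#cs106a': 4, '#Stanford': 2},
    '@nick': {'#cs106a': 5, '#yotter': 12},
}
print(flat_counts(users))
```

Note that this version builds a brand-new dictionary and leaves the users dictionary unchanged.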
          
          

You can test flat_counts like so:

          
$ python3 big_tweet_data.py -flat small-tweets.txt