Written by Juliette Woodrow, Anna Mistele, John Dalloul, Jonathan Kula, and Elyse Cornwall
This week in section, we'll get practice with the dictionary data structure, and see how it differs from lists. We'll get and set the value associated with a certain key in the dictionary, and use nested dictionaries to represent complex information. We'll use decomposition-by-variable to make these nested structures easier to work with.
In this problem, you'll write a program that reads through a collection of tweets and stores the data in a dictionary to keep track of how hashtags occur in users' tweets. This is a great example of how Python can be used in data analysis tasks.
For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of a username (prefixed by a '@' symbol), followed by a colon, and then the text of the tweet, which contains exactly one hashtag. Each character in this file can be a character from any language, or an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:
@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.
@taylorswift13: life was a willow and it bent right to your wind. The #willowMusicVideo is out now!
@NASA: The #CRS18 Cygnus spacecraft is named after the late Sally Ride, the first American woman in space and an advocate for STEM education.
@NASA: Liftoff of #CRS18 is now set for 5:32am ET (1032 UTC). We are go for launch from @NASA_Wallops!
@NASA: Welcome to the International Space Station, #Crew5!
We provide two such files for you in the PyCharm project: small-tweets.txt and big-tweets.txt.
users Dictionary
Central to this program is a users dictionary, in which each key is a Twitter username like '@taylorswift13'. The value for each key in this dictionary is a second, nested tag_counts dictionary, which counts how many times that user has used particular hashtags. As a matter of nomenclature, we'll call the outer dictionary users and each inner dictionary tag_counts. Here's a users dictionary based on the tweets in small-tweets.txt above:
{
'@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
'@NASA': {'#CRS18': 2, '#Crew5': 1}
}
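To make the nesting concrete, here is a small sketch of getting and setting values in this structure. The variable names crs18_count, same_count, and crew5_count are ours, chosen only for illustration, and the new value 5 is arbitrary.

```python
# The users dictionary built from small-tweets.txt, as shown above.
users = {
    '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
    '@NASA': {'#CRS18': 2, '#Crew5': 1},
}

# Outer lookup: the value for a username is that user's tag_counts dictionary.
tag_counts = users['@NASA']

# Inner lookup: the value for a hashtag is its count.
crs18_count = tag_counts['#CRS18']

# Both lookups can also be chained in a single expression.
same_count = users['@NASA']['#CRS18']

# Setting a value through the variable changes the nested dictionary in place,
# because tag_counts and users['@NASA'] refer to the same dictionary object.
tag_counts['#Crew5'] = 5  # arbitrary value, just to show the update
crew5_count = users['@NASA']['#Crew5']
```

Pulling the inner dictionary out into its own variable, as in the tag_counts line, is the decomposition-by-variable idea we'll lean on in the rest of this problem.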
Consider the users dictionary from the example above. How would this dictionary look after reading in the following lines from our file?
'@taylorswift13: The #willowMusicVideo premieres in 1 hour!'
'@NASA: It's #WorldSpaceWeek! We're celebrating the theme of space and sustainability.'
'@dog_rates: This is Connie. She knows the best things in life are free. 14/10 #SeniorPupSaturday'
add_tweet()
Now that you've got a feel for how this process works, let's implement it in Python. We want to write the function add_tweet(users, tweet), which takes in a nested users dictionary and a string tweet, updates the dictionary to store the data from this tweet, and returns it. Since you're all string parsing experts by now, we won't make you write the code for parsing out the username and hashtag from each tweet. Instead, we've provided you with the functions parse_user(tweet) and parse_tag(tweet), which take in a tweet and return the username or hashtag from the tweet, respectively:
parse_user('@taylorswift13: The #willowMusicVideo premieres in 1 hour!') -> '@taylorswift13'
parse_tag('@taylorswift13: The #willowMusicVideo premieres in 1 hour!') -> '#willowMusicVideo'
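You don't need to write these helpers, but if you're curious how they could work, here is a minimal sketch. It is not the provided implementation: it assumes the username is everything before the first colon, and that a hashtag is the '#' plus the run of alphanumeric characters that follows it.

```python
def parse_user(tweet):
    """Return the username, assumed to be everything before the first colon."""
    colon = tweet.find(':')
    return tweet[:colon]


def parse_tag(tweet):
    """Return the hashtag, assumed to be '#' plus the alphanumeric run after it."""
    start = tweet.find('#')
    end = start + 1
    # Advance past the characters that belong to the hashtag.
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]


tweet = '@taylorswift13: The #willowMusicVideo premieres in 1 hour!'
print(parse_user(tweet))
print(parse_tag(tweet))
```

The provided functions may handle more cases than this sketch does, so use those in your solution.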
Let's walk through an example based on our warmup. Let's say we have a users dictionary with some tweet data already in it, and we want to add a few more tweets to this dictionary using our add_tweet() function:
users = {
'@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
'@NASA': {'#CRS18': 2, '#Crew5': 1}
}
add_tweet(users, '@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.')
After calling add_tweet(), our users dictionary should look like this. Notice that the count for '#RedTaylorsVersion' has incremented by one.
{
'@taylorswift13': {'#RedTaylorsVersion': 2, '#willowMusicVideo': 1},
'@NASA': {'#CRS18': 2, '#Crew5': 1}
}
Hint: when updating the users dictionary, use "decomposition-by-variable" to put the specific user's tag_counts dictionary into a variable, and then update that using the counts algorithm we've seen this week in lecture. How might this look different for a username we haven't encountered before?
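If you get stuck, here is one possible shape for add_tweet() that follows the hint. Treat it as a sketch, not the reference solution. The parse_user() and parse_tag() definitions below are simplified stand-ins for the provided helpers, included only so the example runs on its own.

```python
def parse_user(tweet):
    # Stand-in for the provided helper: everything before the first colon.
    return tweet[:tweet.find(':')]


def parse_tag(tweet):
    # Stand-in for the provided helper: '#' plus the alphanumeric run after it.
    start = tweet.find('#')
    end = start + 1
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]


def add_tweet(users, tweet):
    """Record one tweet's hashtag in the nested users dictionary."""
    user = parse_user(tweet)
    tag = parse_tag(tweet)

    # A username we haven't seen before starts with an empty tag_counts.
    if user not in users:
        users[user] = {}

    # Decomposition-by-variable: name the inner dictionary, then update it.
    tag_counts = users[user]

    # Counts algorithm: start at 0 if the key is new, then add one.
    if tag not in tag_counts:
        tag_counts[tag] = 0
    tag_counts[tag] += 1

    return users


users = {
    '@taylorswift13': {'#RedTaylorsVersion': 1, '#willowMusicVideo': 1},
    '@NASA': {'#CRS18': 2, '#Crew5': 1},
}
add_tweet(users, '@taylorswift13: This is my last day of life before #RedTaylorsVersion. Midnight.')
print(users['@taylorswift13'])
```

Notice that the same "is this key new?" check appears twice: once for the username in the outer dictionary, and once for the hashtag in the inner one.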
read_tweets()
Now, you'll read in all of the tweets from a tweets file and construct a users dictionary. To implement the function read_tweets(), call add_tweet() in your file reading loop to build up a users dictionary, then return it at the end of the function after reading all the lines of the file.
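The file reading loop described above might be sketched as follows. This is one possible approach under a few assumptions: that read_tweets() takes the filename as its only parameter, that the file is UTF-8 encoded, and that blank lines should be skipped. The parse_user(), parse_tag(), and add_tweet() definitions are simplified stand-ins so the example runs on its own.

```python
def parse_user(tweet):
    # Stand-in for the provided helper: everything before the first colon.
    return tweet[:tweet.find(':')]


def parse_tag(tweet):
    # Stand-in for the provided helper: '#' plus the alphanumeric run after it.
    start = tweet.find('#')
    end = start + 1
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]


def add_tweet(users, tweet):
    # Stand-in for your add_tweet(): count this tweet's hashtag for its user.
    user = parse_user(tweet)
    tag = parse_tag(tweet)
    if user not in users:
        users[user] = {}
    tag_counts = users[user]
    if tag not in tag_counts:
        tag_counts[tag] = 0
    tag_counts[tag] += 1
    return users


def read_tweets(filename):
    """Build and return a users dictionary from a file with one tweet per line."""
    users = {}
    # Assumption: the file is UTF-8, which covers emoji and other languages.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            tweet = line.strip()  # drop the trailing newline
            if tweet:  # assumption: skip any blank lines
                add_tweet(users, tweet)
    return users
```

Starting from an empty dictionary and letting add_tweet() do all the updating keeps read_tweets() short: it only has to worry about the file.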
We provide a main function that calls your read_tweets() function in a variety of ways. To use it, run the program from the terminal. Run with just one argument (a data filename) to read in all the data from that file and print out a summary of each user and all their hashtags:
$ python3 big_tweet_data.py small-tweets.txt
@NASA
#CRS18 -> 2
#Crew5 -> 1
@taylorswift13
#RedTaylorsVersion -> 1
#willowMusicVideo -> 1
When run with the '-users' argument, main prints out all the usernames:
$ python3 big_tweet_data.py -users small-tweets.txt
users
@NASA
@taylorswift13
When run with the '-user' argument followed by a username, the program prints out the data for just that user:
$ python3 big_tweet_data.py -user @taylorswift13 small-tweets.txt
user: @taylorswift13
#RedTaylorsVersion -> 1
#willowMusicVideo -> 1
flat_counts
If you happen to have some extra time and want to explore this data further, here's an optional extension problem. It's natural to be curious about how often tags are used across all users, not just for a specific user. The flat_counts function takes in a users dictionary and computes a new "flat" count dictionary:
users = {
'@elyse': {'#cs106a': 4, '#Stanford': 2},
'@nick': {'#cs106a': 5, '#yotter': 12}
}
flat_counts(users) -> {'#cs106a': 9, '#Stanford': 2, '#yotter': 12}
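If you'd like to check your approach, here is one way flat_counts() could be sketched, using the same counts algorithm with a nested loop. It is an illustration, not the reference solution.

```python
def flat_counts(users):
    """Return one dictionary of hashtag totals summed across all users."""
    counts = {}
    for user in users:
        # Decomposition-by-variable: name this user's inner dictionary.
        tag_counts = users[user]
        for tag in tag_counts:
            # Counts algorithm, but adding this user's count instead of 1.
            if tag not in counts:
                counts[tag] = 0
            counts[tag] += tag_counts[tag]
    return counts


users = {
    '@elyse': {'#cs106a': 4, '#Stanford': 2},
    '@nick': {'#cs106a': 5, '#yotter': 12},
}
print(flat_counts(users))
```

The only change from the usual counts algorithm is the last line of the inner loop, which adds each user's existing count to the running total.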
You can test flat_counts like so:
$ python3 big_tweet_data.py -flat small-tweets.txt