Learning goals

Nested Dictionaries

To get some practice with dictionaries and nesting collections, implement the following functions:

  1. def int_counts(ints) Given a list of integers, create and return an int-count dict: that is, each unique integer in ints is a key in the dictionary and the corresponding value is the number of times that integer appeared in the list.


  2. def first_list(strs) Given a list of strings, create and return a dictionary whose keys are the unique first characters of the strings and whose values are lists of words beginning with those characters, in the same order that they appear in strs.


  3. def suffix_list(strs) Given a list of strings, create and return a dictionary whose keys are the suffixes of those strings and whose values are lists of words ending with those suffixes, in the same order that they appear in strs. A suffix is defined as the last 2 characters of a string, and a string that is less than 2 characters long has no suffix.


Word counts

Your job is to write a full Python program, including a main function, in a file called special-count.py. This program should implement the following behavior:

  1. If special-count.py is called with just a filename (e.g. python3 special-count.py myfile.txt) then it should read in the file named myfile.txt (you may assume that this file is formatted as single words separated by newlines) and produce counts of words that share the same consonants in order. For example, if we had the following text in myfile.txt:


    Your program should produce the following output. Note that the output is in sorted order.

    $ python3 special-count.py myfile.txt
    bmp -> 1
    grt -> 3
    tny -> 2
  2. If special-count.py is called with the additional flag -vowels (e.g. python3 special-count.py -vowels myfile.txt) then it should produce the same output, only grouping words by their vowels rather than their consonants. So, using the same file as before produces the following output:
    $ python3 special-count.py -vowels myfile.txt
    ae -> 1
    ea -> 1
    ee -> 2
    i -> 1
    u -> 1
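Both modes reduce to computing a grouping key for each word and then counting those keys. A possible sketch of the key computation (the helper name `word_key` is my own, not part of the assignment spec; note that 'y' is treated as a consonant, matching the sample output above):

```python
VOWELS = 'aeiou'

def word_key(word, vowels_only):
    """Return the word's vowels in order if vowels_only, else its consonants."""
    word = word.lower()
    if vowels_only:
        return ''.join(ch for ch in word if ch in VOWELS)
    return ''.join(ch for ch in word if ch.isalpha() and ch not in VOWELS)
```

With the key in hand, the rest of the program is the familiar dict-count pattern, followed by printing the keys in sorted order.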

Big Tweet Data!

In this problem, you'll write a program that reads through a large collection of tweets and stores data to keep track of how hashtags occur in tweets. This is a great example of how Python can be used for data analysis tasks. Begin by downloading the PyCharm project here.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of the poster's username (prefixed by an '@' symbol), followed by a colon and then the text of the tweet. Each character in this file can come from any language, or be an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education #vt #realestate

We provide 3 such files for you in the PyCharm Project: small-tweets.txt, big-tweets.txt and huge-tweets.txt.

Building a user_tags Dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user's name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary (since it uses the dict-count algorithm we've seen a bunch in class). Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.

1. Warmup questions

Given the dictionary above, what updates would we make to it in each of the following cases?

2. Implementing add_tweet

The add_tweet() function is the core of this whole program, and is responsible for performing the update to a user_tags dictionary described above. The tests shown below form a sequence, expressed as a series of doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and gets back a dictionary with 1 user and 2 tags. The second test then takes that returned dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.

You're all string parsing experts by now, so we won't make you do that work anymore. We've provided you with two functions named parse_tags and parse_user, both of which take the tweet in question as a parameter and return, respectively, a list of the tags in the tweet and the username that posted it.

def add_tweet(user_tags, tweet):
    """
    Given a user_tags dict and a tweet, parse out the user and tags,
    and add those counts to the user_tags dict, which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 2}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 2}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
    """
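One way the update logic could look is sketched below. The parse_user and parse_tags definitions here are simplified stand-ins for the provided functions, included only so the sketch runs on its own; the versions in the project may handle more cases:

```python
def parse_user(tweet):
    """Simplified stand-in: return the leading @username, or None if absent."""
    first = tweet.split(':')[0]
    return first if first.startswith('@') else None

def parse_tags(tweet):
    """Simplified stand-in: return all #hashtag words in the tweet."""
    return [word for word in tweet.split() if word.startswith('#')]

def add_tweet(user_tags, tweet):
    """Add the tweet's tag counts under its user; return the updated dict."""
    user = parse_user(tweet)
    if user is None:
        return user_tags  # no user in this tweet: dict is unchanged
    counts = user_tags.setdefault(user, {})  # this user's nested 'counts' dict
    for tag in parse_tags(tweet):
        counts[tag] = counts.get(tag, 0) + 1
    return user_tags
```

The key move is `setdefault(user, {})`: it creates the nested counts dict the first time a user appears, and retrieves the existing one on every later tweet from that user.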
3. Implementing parse_tweets

Use add_tweet in a loop to build up and return a user_tags dict. This should look mostly like other file-reading functions you've written, and your job is to make sure you understand how to follow the pattern of creating and updating a dictionary suggested by the add_tweet function. Restated, the responsibility of add_tweet is to update a dictionary, and parse_tweets must create and maintain that dictionary as it is updated.

Running your program

We provide a main function that calls the parse_tweets function you implemented in a variety of ways. To use it, run the program from the terminal. Run with just 1 argument (a data filename), it reads in all the data from that file and prints out each user along with all their tag counts:

$ python3 tweets.py small-tweets.txt
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
 #education -> 1
 #vt -> 1
 #realestate -> 1

When run with the '-users' argument, main prints out all the usernames:

$ python3 tweets.py -users small-tweets.txt

When run with the '-user' argument followed by a username, the program prints out the data for just that user.

$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3


You probably won't get to this extension in section, but if you have time, implement this additional function, which you can then leverage to answer some interesting questions about hashtag use.

Implementing flat_counts

It's natural to be curious about how often tags are used across users. This function takes in a user_tags dictionary and computes a new "flat" count dictionary:

def flat_counts(user_tags):
    """
    Given a user_tags dict, sum up the tag counts across all users and
    return a "flat" counts dict: one key per tag, whose value is the sum
    of that tag's counts across all users.
    >>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
    {'#apple': 2, '#banana': 2}
    """
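A minimal sketch of the nested iteration this requires (one possible approach):

```python
def flat_counts(user_tags):
    """Sum each tag's counts across all users into one flat counts dict."""
    totals = {}
    for counts in user_tags.values():  # each user's nested counts dict
        for tag, n in counts.items():
            totals[tag] = totals.get(tag, 0) + n
    return totals
```

Because we only need the inner dictionaries, iterating over `user_tags.values()` is enough; the usernames themselves never appear in the flat result.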

main will call that function with the -flat argument, like so:

$ python3 tweets.py -flat small-tweets.txt
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3