Section 5. File Reading, Dictionaries, Nested Structures


Juliette Woodrow, Brahm Capoor, Andrew Tierno, Peter Maldonado, Kara Eng, Tori Qiu and Parth Sarin

Here is the online IDE for section this week: Online IDE

File Reading

Smallest Unique Positive Integer

Implement the following function:

def find_smallest_int(filename)

The function takes as a parameter a filename string representing a file with a single integer on each line, and returns the smallest unique positive integer in the file. An integer is positive if it is greater than 0, and unique if it occurs exactly once in the file. For example, suppose filename.txt looks like this:

 42
 1
 13
 12
 1
 -8
 20

Calling find_smallest_int('filename.txt') would return 12. You may assume that each line of the file contains exactly one integer (which may not be positive) and that there is at least one positive integer in the file.

def find_smallest_int(filename):
    nums_so_far = []
    duplicates = []
    with open(filename) as f:
        for line in f:
            num = int(line)
            if num > 0:
                # if we've seen this number already
                if num in nums_so_far:
                    # record that it's a duplicate
                    duplicates.append(num)
                # note that we've seen this number
                nums_so_far.append(num)
    
    uniques = []
    for elem in nums_so_far:
        if elem not in duplicates:
            uniques.append(elem)
    return min(uniques)
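The same result can be reached by counting occurrences in a dictionary, which ties in with the dictionary exercises later in this section. The following is a minimal sketch, not the reference solution. The name find_smallest_int_counts is hypothetical, and the sketch assumes one integer per line (as the problem allows) and, additionally, that at least one positive integer in the file is unique.

```python
def find_smallest_int_counts(filename):
    # Hypothetical alternative to find_smallest_int above.
    counts = {}  # maps each positive integer to how many times it appears
    with open(filename) as f:
        for line in f:
            num = int(line)
            if num > 0:
                if num not in counts:
                    counts[num] = 0
                counts[num] += 1

    # Keep only the numbers that occurred exactly once.
    uniques = []
    for num in counts:
        if counts[num] == 1:
            uniques.append(num)
    # Assumes at least one positive integer is unique; min([]) raises ValueError.
    return min(uniques)
```

Because a dictionary lookup does not scan the whole collection the way `num in nums_so_far` does on a list, this version should scale better on large files.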

Warming up with Nested Dictionaries

To get some practice with dictionaries and nesting collections, implement the following functions:

  1. def first_chars(strs) Given a list of strings, create and return a counts dictionary whose keys are the unique first characters of the strings in strs and whose values are the count of the words that start with that character. As an example, the following strs list would lead to the following counts dict:

     strs = ['stanford', 'python', 'computer', 'science', 'democracy', 'day']
     counts = {'s': 2, 'p': 1, 'c': 1, 'd': 2}

  2. def first_list(strs) Given a list of strings, create and return a dictionary whose keys are the unique first characters of the strings and whose values are lists of words beginning with those characters, in the same order that they appear in strs.

def first_chars(strs):
    counts = {}
    for s in strs:
        ch = s[0]
        if ch not in counts:
            counts[ch] = 0
        counts[ch] += 1
    return counts

        
def first_list(strs):
    unique_firsts = {}
    for s in strs:
        ch = s[0]
        if ch not in unique_firsts:
            unique_firsts[ch] = []
        unique_firsts[ch].append(s)
    return unique_firsts
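Both functions follow the same "check, initialize, update" pattern. As a sketch of a more compact way to write that pattern, the versions below use the built-in dict methods get and setdefault. The names first_chars_get and first_list_setdefault are hypothetical, and like the solutions above they assume no string in strs is empty.

```python
def first_chars_get(strs):
    counts = {}
    for s in strs:
        ch = s[0]
        # get returns 0 when ch has not been seen yet
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def first_list_setdefault(strs):
    firsts = {}
    for s in strs:
        # setdefault stores an empty list for a new key, then returns the list
        firsts.setdefault(s[0], []).append(s)
    return firsts


strs = ['stanford', 'python', 'computer', 'science', 'democracy', 'day']
print(first_chars_get(strs))
# {'s': 2, 'p': 1, 'c': 1, 'd': 2}
print(first_list_setdefault(strs))
# {'s': ['stanford', 'science'], 'p': ['python'], 'c': ['computer'], 'd': ['democracy', 'day']}
```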

Big Tweet Data

In this problem, you'll write a program that reads through a large collection of tweets and stores data to keep track of how hashtags occur in tweets. This is a great example of how Python can be used in data analysis tasks.

Our Dataset

For the purposes of this problem, each tweet is represented as a single line of text in a file. Each line consists of the poster's username (prefixed by a '@' symbol), followed by a colon and then the text of the tweet. Each character in this file can be a character from any language, or an emoji, although you don't need to do anything special to deal with these characters. One such file in the PyCharm project we provide is small-tweets.txt, which is reproduced here:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters: https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate: https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education #vt #realestate https://t.co/iqxXtruqrt

We provide 3 such files for you in the PyCharm Project: small-tweets.txt, big-tweets.txt and huge-tweets.txt.

Building a user_tags Dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user's name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

  {'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary. Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.
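One detail that makes this strategy work is that looking up a user returns the nested counts dictionary itself, not a copy, so changing it also changes user_tags. A small sketch using the example dictionary above (the '#SOTU' tag added here is made up for illustration):

```python
user_tags = {'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

# counts refers to the same inner dict that user_tags holds
counts = user_tags['@BarackObama']
counts['#SCOTUS'] += 1   # existing tag: bump its count
counts['#SOTU'] = 1      # illustrative new tag: start it at 1

print(user_tags)
# {'@BarackObama': {'#SCOTUS': 5, '#Obamacare': 3, '#SOTU': 1}}
```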

Warmup questions

Given the dictionary above, what updates would we make to it in each of the following cases?

  1. We encounter a new tweet that reads '@BarackObama: #Obamacare signups now!'.
  2. We encounter a new tweet that reads '@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6'.

Implementing add_tweet

The add_tweet function is the core of this whole program, and is responsible for performing the update to a user_tags dictionary described above. The tests shown below represent a sequence of tweets, expressed as a series of Doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and gets back a dictionary with 1 user and 2 tags. The 2nd test then takes that returned dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.

We've provided you with two functions entitled parse_tags and parse_user, both of which take as a parameter the tweet in question and return a list of tags in the tweet and the username that posted the tweet, respectively.
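The provided helpers are not reproduced in this handout. To make their contract concrete, here is a rough sketch of how functions like them might behave. sketch_parse_user and sketch_parse_tags are hypothetical stand-ins, and the rules they use (the user is everything before the first colon on a line that starts with '@', and a tag is '#' followed by letters, digits, or underscores) are assumptions rather than the course's actual implementation.

```python
import re


def sketch_parse_user(tweet):
    # Assumption: a valid tweet line starts with '@' and has a colon
    # right after the username. Otherwise report "no user" as ''.
    colon = tweet.find(':')
    if not tweet.startswith('@') or colon == -1:
        return ''
    return tweet[:colon]


def sketch_parse_tags(tweet):
    # Assumption: a tag is '#' followed by one or more word characters,
    # so trailing punctuation such as '?' is not part of the tag.
    return re.findall(r'#\w+', tweet)


tweet = '@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk'
print(sketch_parse_user(tweet))   # @BarackObama
print(sketch_parse_tags(tweet))   # ['#SOTU']
```

Returning an empty string when no user is found is a guess at the convention; whatever the real parse_user returns in that case is what add_tweet needs to check for.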

def add_tweet(user_tags, tweet):
    """
    Given a user_tags dict and a tweet, parse out the user and tags,
    and add those counts to the user_tags dict which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 2}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 2}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
    """

Implementing parse_tweets

Use add_tweet in a loop to build up and return a user_tags dict. This should look mostly like other file-reading functions you've written, and your job is to make sure you understand how to follow the pattern of creating and updating a dictionary suggested by the add_tweet function. Restated, the responsibility of add_tweet is to update a dictionary, and parse_tweets must create and maintain that dictionary as it is updated.

Running your program

We provide a main function that calls the parse_tweets function you implemented in a variety of ways. To use it, run the program from the terminal. When run with just one argument (a data filename), it reads in all the data from that file and prints out a summary of each user with all their tags and counts:

$ python3 tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1

When run with the '-users' argument, main prints out all the usernames:

$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA

When run with the '-user' argument followed by a username, the program prints out the data for just that user.

$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3

def add_tweet(user_tags, tweet):
    user = parse_user(tweet)
    if user == '':
        return user_tags

    # if user is not in there, put them in with empty counts
    if user not in user_tags:
        user_tags[user] = {}

    # counts is the nested tag -> count dict
    # go through all the tags and modify it
    counts = user_tags[user]
    parsed_tags = parse_tags(tweet)
    for tag in parsed_tags:
        if tag not in counts:
            counts[tag] = 0
        counts[tag] += 1

    return user_tags

def parse_tweets(filename):
    user_tags = {}
    # here we specify encoding 'utf-8' which is how this text file is encoded
    # python's default encoding depends on the platform, so it's better to be explicit
    with open(filename, encoding='utf-8') as f:
        for line in f:
            add_tweet(user_tags, line)
    return user_tags

def user_total(user_tags, user):
    """
    Optional. Given a user_tags dict and a user, figure out the total count
    of all their tags and return that number.
    If the user is not in the user_tags, return 0.
    """
    if user not in user_tags:
        return 0
    counts = user_tags[user]
    total = 0
    for tag in counts.keys():
        total += counts[tag]
    return total

def flat_counts(user_tags):
    """
    Given a user_tags dict, sum up the tag counts across all users,
    return a "flat" counts dict with a key for each tag,
    and its value is the sum of that tag's count across users.
    """
    counts = {}
    for user in user_tags.keys():
        tags = user_tags[user]
        for tag in tags:
            if tag not in counts:
                counts[tag] = 0
            counts[tag] += tags[tag]
    return counts
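
For comparison, the last two helpers can be written more compactly with dict methods. This is only a sketch: user_total_sum and flat_counts_items are hypothetical names, and they are meant to return the same results as user_total and flat_counts above.

```python
def user_total_sum(user_tags, user):
    # get returns an empty dict for an unknown user, so the sum is 0
    return sum(user_tags.get(user, {}).values())


def flat_counts_items(user_tags):
    counts = {}
    for tags in user_tags.values():
        # items yields each tag together with that user's count for it
        for tag, n in tags.items():
            counts[tag] = counts.get(tag, 0) + n
    return counts


user_tags = {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
print(user_total_sum(user_tags, '@alice'))  # 3
print(user_total_sum(user_tags, '@carol'))  # 0
print(flat_counts_items(user_tags))         # {'#apple': 2, '#banana': 2}
```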