Written by Nick Parlante, Jonathan Kula, and Brahm Capoor
February 17th, 2020
To get some practice with dictionaries and nesting collections, implement the following functions:
def int_counts(ints)
Given a list of integers, create and return an int-count dict: that
is, each unique integer in ints is a key in the dictionary and the
corresponding value is the number of times that integer appeared in
the list.
Test out your solution here.
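One possible approach, shown as a minimal sketch of the dict-count pattern rather than an official solution. The variable names inside the function are illustrative choices.

```python
def int_counts(ints):
    """Return a dict mapping each unique int in ints to how often it appears."""
    counts = {}
    for num in ints:
        # First time we see num: start its count at 0.
        if num not in counts:
            counts[num] = 0
        counts[num] += 1
    return counts


print(int_counts([1, 2, 1, 3, 1]))  # {1: 3, 2: 1, 3: 1}
```

The "if not in, set to 0, then add 1" shape is the same one the later problems reuse for counting words and hashtags.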
def first_list(strs)
Given a list of strings, create and return a dictionary whose keys
are the unique first characters of the strings and whose values are
lists of words beginning with those characters, in the same order
that they appear in strs.
Test out your solution here.
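A sketch of one way to build the nested lists. Skipping empty strings is an assumption of this sketch, since the problem statement does not say how to treat them.

```python
def first_list(strs):
    """Group strings by first character, preserving their order in strs."""
    firsts = {}
    for s in strs:
        if len(s) == 0:
            continue  # assumption: an empty string has no first character
        ch = s[0]
        # First time we see ch: start with an empty list.
        if ch not in firsts:
            firsts[ch] = []
        firsts[ch].append(s)
    return firsts


print(first_list(['apple', 'avocado', 'banana', 'axe']))
# {'a': ['apple', 'avocado', 'axe'], 'b': ['banana']}
```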
def suffix_list(strs)
Given a list of strings, create and return a dictionary whose keys
are the suffixes of those strings and whose values are lists of
words ending with those suffixes, in the same order that they appear
in strs. A suffix is defined as the last 2 characters of a string,
and a string that is less than 2 characters long has no suffix.
Test out your solution here.
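The same grouping pattern works here, with the key taken from the end of the string. This is a sketch under the definition above: strings shorter than 2 characters are skipped because they have no suffix.

```python
def suffix_list(strs):
    """Group strings by their last 2 characters, preserving order."""
    suffixes = {}
    for s in strs:
        if len(s) < 2:
            continue  # no suffix, so this string is not added anywhere
        suffix = s[-2:]  # slice holding the last 2 characters
        if suffix not in suffixes:
            suffixes[suffix] = []
        suffixes[suffix].append(s)
    return suffixes


print(suffix_list(['running', 'jumping', 'a', 'king', 'cat']))
# {'ng': ['running', 'jumping', 'king'], 'at': ['cat']}
```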
Your job is to write a full Python program, including a
main function, for a file called
special-count.py. This program should implement the
following behavior:
If special-count.py is called with just a filename
(e.g. python3 special-count.py myfile.txt) then it
should read in the file named myfile.txt (you may assume that this
file is formatted as single words separated by newlines) and
produce counts of words that share the same consonants in order.
For example, if we had the following text in
myfile.txt:
great
grate
greet
teeny
tiny
bump
Your program should produce the following output. Note that the output is in sorted order.
$ python3 special-count.py myfile.txt
bmp -> 1
grt -> 3
tny -> 2
If special-count.py is called with the additional flag
-vowels (e.g.
python3 special-count.py -vowels myfile.txt) then it
should produce the same output, only grouping words by their vowels
rather than their consonants. So, using the same file as before
produces the following output:
$ python3 special-count.py -vowels myfile.txt
ae -> 1
ea -> 1
ee -> 2
i -> 1
u -> 1
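The grouping logic behind both modes can be sketched as below, leaving out the file reading and the main function. The helper names word_key and count_keys are hypothetical, not part of the handout. Treating 'y' as a consonant matches the 'tny' line in the sample output; lowercasing each word first is an assumption of this sketch.

```python
VOWELS = 'aeiou'  # 'y' counts as a consonant, matching 'tny' in the sample


def word_key(word, use_vowels):
    """Return the vowels (or consonants) of word, in their original order."""
    key = ''
    for ch in word.lower():  # assumption: grouping ignores case
        is_vowel = ch in VOWELS
        if is_vowel == use_vowels:
            key += ch
    return key


def count_keys(words, use_vowels):
    """Dict-count over the key computed for each word."""
    counts = {}
    for word in words:
        key = word_key(word, use_vowels)
        if key not in counts:
            counts[key] = 0
        counts[key] += 1
    return counts


words = ['great', 'grate', 'greet', 'teeny', 'tiny', 'bump']
counts = count_keys(words, False)
for key in sorted(counts):
    print(key, '->', counts[key])
# bmp -> 1
# grt -> 3
# tny -> 2
```

Passing True as the second argument switches to the vowel grouping, so a main function only needs to check whether -vowels is present and pass the matching boolean.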
In this problem, you'll write a program that reads through a large collection of tweets and stores data to keep track of how hashtags occur in tweets. This is a great example of how Python can be used in data analysis tasks. Begin by downloading the PyCharm project here.
For the purposes of this problem, each tweet is represented as a
single line of text in a file. Each line consists of the poster's
username (prefixed by a '@' symbol), followed by a colon and then the
text of the tweet. Each character in this file can be a character from
any language, or an emoji, although you don't need to do anything
special to deal with these characters. One such file in the PyCharm
project we provide is small-tweets.txt, which is
reproduced here:
@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters: https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate: https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: šššš„š„š„ https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education #vt #realestate https://t.co/iqxXtruqrt
We provide 3 such files for you in the PyCharm Project:
small-tweets.txt, big-tweets.txt and
huge-tweets.txt.
user_tags Dictionary
Central to this program is a user_tags dictionary, in
which each key is a Twitter user's name like
'@BarackObama'. The value for each key in this dictionary
is a second, nested dictionary which counts how frequently that
particular user has used particular hashtags. For example, a very
simple user_tags dictionary might be:
{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}
We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary (since it uses the dict-count algorithm we've seen a bunch in class). Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.
Given the dictionary above, what updates would we make to it in each of the following cases?
'@BarackObama: #Obamacare signups now!'.
'@kanyewest: šššš„š„š„ https://t.co/KmvxIwKkU6'.
add_tweet
The add_tweet() function is the core of this whole program, and is
responsible for performing the update to a
user_tags dictionary described above. The tests shown
below represent a sequence, expressed as a series of Doctests. For
each call, you can see the dictionary that is passed in, and the
dictionary that is returned on the next line. The first test passes in
the empty dictionary ({}) and gets back a dictionary with
1 user and 2 tags. The 2nd test then takes that returned dictionary as
its input, and so on. Each call adds more data to the
user_tags dictionary.
You're all string parsing experts by now, so we won't make you do that
work anymore. We've provided you with two functions entitled
parse_tags and parse_user, both of which
take as a parameter the tweet in question and return a list of tags in
the tweet and the username that posted the tweet, respectively.
def add_tweet(user_tags, tweet):
    """
    Given a user_tags dict and a tweet, parse out the user and tags,
    and add those counts to the user_tags dict which is returned.
    If no user exists in the tweet, return the user_tags dict unchanged.
    Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
    the parts out of the tweet.
    >>> add_tweet({}, '@alice: #apple #banana')
    {'@alice': {'#apple': 1, '#banana': 1}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@alice: #banana')
    {'@alice': {'#apple': 1, '#banana': 2}}
    >>> add_tweet({'@alice': {'#apple': 1, '#banana': 2}}, '@bob: #apple')
    {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
    """
parse_tweets
Use add_tweet in a loop to build up and return a
user_tags dict. This should look mostly like other
file-reading functions you've written, and your job is to make sure
you understand how to follow the pattern of creating and updating a
dictionary suggested by the add_tweet function. Restated,
the responsibility of add_tweet is to update a
dictionary, and parse_tweets must create and maintain
that dictionary as it is updated.
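The create-then-update pattern can be sketched as follows. The signature parse_tweets(filename) is an assumption, since the handout does not show it, and the add_tweet here is a compact, simplified stand-in included only so the example runs on its own.

```python
def add_tweet(user_tags, tweet):
    """Simplified stand-in for add_tweet (assumption: splits on whitespace)."""
    colon = tweet.find(':')
    if not tweet.startswith('@') or colon == -1:
        return user_tags
    user = tweet[:colon]
    if user not in user_tags:
        user_tags[user] = {}
    counts = user_tags[user]
    for word in tweet[colon + 1:].split():
        if word.startswith('#') and len(word) > 1:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return user_tags


def parse_tweets(filename):
    """Read one tweet per line and accumulate everything into one dict."""
    user_tags = {}  # created once, before the loop
    with open(filename, encoding='utf-8') as f:
        for line in f:
            tweet = line.strip()
            if tweet:  # assumption: blank lines are skipped
                user_tags = add_tweet(user_tags, tweet)
    return user_tags
```

The dictionary is created before the loop and returned after it; each iteration only hands it to add_tweet and keeps the result.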
We provide a main function that calls the
parse_tweets function you implemented in a variety of
ways. To use it, run the program from the terminal. Run with just 1
argument (a data filename), it reads in all the data from that file
and prints out a summary of each user and all their tags and counts:
$ python3 tweets.py small-tweets.txt
@BarackObama
#BigBlockOfCheeseDay -> 1
#SOTU -> 3
@GonzalezSarahA
#education -> 1
#vt -> 1
#realestate -> 1
When run with the '-users' argument,
main prints out all the usernames:
$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA
When run with the '-user' argument followed by a
username, the program prints out the data for just that user.
$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
#BigBlockOfCheeseDay -> 1
#SOTU -> 3
You probably won't get to this extension in section, but if you have time, implement this additional function, which you can then leverage to answer some interesting questions about hashtag use.
flat_counts
It's natural to be curious about how often tags are used across users.
This function takes in a user_tags dictionary and
computes a new "flat" count dictionary:
def flat_counts(user_tags):
    """
    Given a user_tags dict, sum up the tag counts across all users,
    return a "flat" counts dict with a key for each tag,
    and its value is the sum of that tag's count across users.
    >>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
    {'#apple': 2, '#banana': 2}
    """
main will call that function with the
-flat argument, like so:
$ python3 tweets.py -flat small-tweets.txt
flat
#BigBlockOfCheeseDay -> 1
#MAGA -> 2
#SOTU -> 3
.
.
.