Section #5: Strings & File Reading

October 18th, 2020

Written by Brahm Capoor, Juliette Woodrow, Peter Maldonado, Kara Eng, Tori Qiu and Parth Sarin

Strings

String Slicing

s = 'PythonTime'

How would you slice into this string to obtain the following results?

• 'ython'
• 'Py'
• 'Tim'
• 'Time'
• 'T'
• 'PythonTime'

Remember, strings in Python are 0-indexed. In addition, the slice s[1:8] is inclusive of the first index and exclusive of the second (that is, it will get the string beginning at index 1 and up to, but not including, index 8, i.e. 'ythonTi').
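As a quick illustration of the inclusive-start, exclusive-end rule, here is a minimal sketch. It uses the example slice from the paragraph above plus a few slices on a different, made-up string (the variable t and its contents are assumptions for illustration only, so the exercise answers are not given away).

```python
s = 'PythonTime'

# The start index is included, the end index is not.
print(s[1:8])         # characters at indices 1 through 7

# A different string, used only for illustration.
t = 'abcdef'
first_two = t[:2]     # omitting the start means "from index 0"
last_three = t[3:]    # omitting the end means "through the last character"
middle = t[2:4]       # indices 2 and 3 only
print(first_two, last_three, middle)
```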

String Construction

Implement the following functions:

1. only_one_first_char(s): removes all occurrences of the first character of s except the first character itself. For example, only_one_first_char('recurrence') returns 'recuence'. You may assume s has at least one character.
2. make_gerund(s): adds 'ing' to the end of the given string s and returns this new word. If s already ends with 'ing', add 'ly' to the end of s instead. You may assume that s is at least 3 characters long.
3. put_in_middle(outer, inner): returns a string where inner has been inserted into the middle of the string outer. To find the middle of a string, take the length of the string and divide it by 2 using integer division. The first half of the string should be all characters leading up to, but not including, the character at this index. The second half should start with the character at this index and include the rest of the characters in the string.
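The midpoint rule in item 3 can be sketched as follows. This is one possible approach, not necessarily the intended solution, and the strings in the usage lines are made up for illustration.

```python
def put_in_middle(outer, inner):
    """Return outer with inner inserted at its midpoint."""
    mid = len(outer) // 2        # integer division gives the middle index
    first_half = outer[:mid]     # up to, but not including, index mid
    second_half = outer[mid:]    # index mid through the end
    return first_half + inner + second_half


print(put_in_middle('abcd', 'XY'))
print(put_in_middle('abc', '-'))
```

For an odd-length outer string, integer division rounds down, so the second half ends up one character longer than the first.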

Word Puzzles

In these problems, we'll investigate properties of words in the English language. In each problem, we'll define a special rule and write a function to determine whether a word obeys or violates that rule. For these problems, you can assume that word will be a string containing uppercase alphabetic characters only.

Palindromes

We say that a word is a palindrome if it reads the same forwards as backwards. For example, "Abba" is a palindrome because it is the same word read forwards and backwards. Here are some more examples:

• Racecar
• Kayak
• Mr. Owl ate my metal worm

Here are some examples of palindromes in other languages:

• حلب قلعة تحت تعلق ب (Dates hang underneath a castle in Halab)
• 여보, 안경 안보여 (Honey, I can't see my glasses)
• कड़क (a loud thunderous sound)

Write a function is_palindrome(word) that returns True if a word is a palindrome and False otherwise.

Tridromes

We say that a word is a tridrome if the first three letters of the word are the same as the last three letters of the word (and appear in the same order). All tridromes must be at least 6 letters long. For example, ENTERTAINMENT, MURMUR, and UNDERGROUND are tridromes. Write a function is_tridrome(word) that returns True if a word is a tridrome and False otherwise.
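A minimal sketch of the tridrome rule, assuming word is already uppercase as stated earlier. It is one way to express the check with slices, not the only one.

```python
def is_tridrome(word):
    """Return True if word is at least 6 letters long and its first
    three letters match its last three letters."""
    if len(word) < 6:
        return False
    return word[:3] == word[-3:]   # a negative index counts from the end


print(is_tridrome('MURMUR'))
print(is_tridrome('PYTHON'))
```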

Peaceful Words

We say that a word is peaceful if its letters are in alphabetical order. For example, ALMOST, CHIPS, DIRTY, FIRST, and HOST are all peaceful words. Write a function is_peaceful(word) that returns True if a word is peaceful and False otherwise. You may assume you have access to a constant ALPHABET which is a string of the uppercase letters in the alphabet, in sequential order, i.e., ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
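One way to use the ALPHABET constant is to compare the positions of neighbouring letters, as in the sketch below. It assumes that a repeated letter (two equal neighbours) still counts as being in alphabetical order, which the problem statement does not specify.

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'


def is_peaceful(word):
    """Return True if every letter comes no earlier in the alphabet
    than the letter before it."""
    for i in range(len(word) - 1):
        # find() returns the position of a letter within ALPHABET
        if ALPHABET.find(word[i]) > ALPHABET.find(word[i + 1]):
            return False
    return True


print(is_peaceful('ALMOST'))
print(is_peaceful('PYTHON'))
```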

Stacatto Words

We say that a word is a stacatto word if all of the letters in even positions are vowels (i.e., the second, fourth, sixth, etc. letters are vowels). For this problem, the vowels are A, E, I, O, U, and Y. For example, AUTOMATIC, CAFETERIA, HESITATE, LEGITIMATE, and POPULATE are stacatto words. Write a function is_stacatto(word) that returns True if a word is a stacatto word and False otherwise.
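Because strings are 0-indexed, the second, fourth, sixth, etc. letters sit at indices 1, 3, 5, and so on. The sketch below shows one way to step through just those indices; the VOWELS constant is a name introduced here for illustration.

```python
VOWELS = 'AEIOUY'   # hypothetical constant holding this problem's vowels


def is_stacatto(word):
    """Return True if every letter in an even position (second,
    fourth, ...) is a vowel."""
    # range(1, len(word), 2) yields the indices 1, 3, 5, ...
    for i in range(1, len(word), 2):
        if word[i] not in VOWELS:
            return False
    return True


print(is_stacatto('HESITATE'))
print(is_stacatto('PYTHON'))
```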

No More Counting Dollars, We'll Be Counting Words

Suppose you're given a file that contains all the words in the English language, where each one is on a different line. Write the following functions, using the functions you wrote in the previous problem:

1. count_tridromes(filename) which returns the number of English words that are tridromes.
2. count_peaceful(filename) which returns the number of English words that are peaceful.
3. count_stacatto(filename) which returns the number of English words that are stacatto words.

A few things to note:

• You're given the name of a file that contains English words, so you have to open the file and process the words.
• The words are not all uppercase. Convert them to uppercase before you call the function that you wrote in the last problem.
• Lines will end with a newline character, \n. You can remove this character from a string using the strip function, i.e. s.strip().
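The three notes above can be sketched as a single loop. The example below shows count_tridromes only, and includes a stand-in is_tridrome so that it runs on its own; that stand-in is an assumption, and your own version from the previous problem would take its place.

```python
def is_tridrome(word):
    """Stand-in for the function written in the previous problem."""
    return len(word) >= 6 and word[:3] == word[-3:]


def count_tridromes(filename):
    """Count the lines of filename whose word is a tridrome."""
    count = 0
    with open(filename) as f:
        for line in f:
            # strip() removes the trailing newline; upper() matches the
            # uppercase-only assumption of the word functions.
            word = line.strip().upper()
            if is_tridrome(word):
                count += 1
    return count
```

The other two counting functions would follow the same pattern with a different word function in the if statement.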

You can actually run this program! We've provided a file of all the words in the English language called words.txt in the section project.

String Parsing

Introduction to String Parsing

Implement the following functions:

• exclaim(s): Given a string s, look for the first exclamation mark. If there is a substring of 1 or more alphabetic characters immediately to the left of the exclamation mark, return this substring including the exclamation mark. Otherwise, return the empty string. For example, exclaim('xx Hello! yy') returns 'Hello!'.
• vowels(s): Given a string s, look for the first colon. If there is a substring of 1 or more vowels immediately to the right of the colon, return this substring without the colon. Otherwise, return the empty string. For example, vowels('xy:aieee?') returns 'aieee'.
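As an illustration of the find-then-scan pattern these problems rely on, here is one possible sketch of exclaim. It is not necessarily the intended solution, and vowels can be approached the same way by scanning to the right instead of the left.

```python
def exclaim(s):
    """Return the alphabetic run immediately left of the first '!',
    including the '!', or '' if there is no such run."""
    bang = s.find('!')
    if bang == -1:
        return ''
    # Move start left while the character just before it is alphabetic.
    start = bang
    while start > 0 and s[start - 1].isalpha():
        start -= 1
    if start == bang:            # no alphabetic characters were found
        return ''
    return s[start:bang + 1]


print(exclaim('xx Hello! yy'))
```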

Finding the smallest unique positive integer

Implement the following function:

 def find_smallest_int(filename) 

which takes as a parameter a filename string representing a file with a single integer on each line, and returns the smallest unique positive integer in the file. An integer is positive if it is greater than 0, and unique if it occurs exactly once in the file. For example, suppose filename.txt looks like this:

            
42
1
13
12
1
-8
20



Calling find_smallest_int('filename.txt') would return 12.

You may assume that each line of the file contains exactly one integer, although it may not be positive, and that there is at least one positive integer in the file.
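One possible sketch counts how often each integer occurs and then picks the smallest positive one that occurred once. Two choices here are assumptions beyond the problem statement: blank lines are skipped defensively, and None is returned if no unique positive integer exists.

```python
def find_smallest_int(filename):
    """Return the smallest positive integer that occurs exactly once
    in filename, or None if there is no such integer (an assumption)."""
    counts = {}
    with open(filename) as f:
        for line in f:
            text = line.strip()
            if text == '':       # defensive: ignore blank lines
                continue
            num = int(text)
            counts[num] = counts.get(num, 0) + 1

    smallest = None
    for num, count in counts.items():
        if num > 0 and count == 1:
            if smallest is None or num < smallest:
                smallest = num
    return smallest
```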

Putting it all Together

Extracting Email Hostnames

Now, we're going to turn our attention to a parsing task we'd be more likely to see in the real world: parsing email addresses. For the purposes of this problem, we'll be using a simplified format of an email address as follows:

username@hostname

where hostname is a string with at least 4 characters. It consists of alphabetic characters and at least one period. In addition, the username can be any length, including 0 characters. Some examples of invalid email addresses are:

              
jillian@website      # invalid email address, needs at least one period.
sheridan@email1.com  # invalid, since 1 isn't a letter or period
sam@a.b              # invalid, since the hostname is less than 4 characters long.



Suppose you have a file called emails.txt that looks like this:

              
Please forward this email to ingrid@stanford.edu for me. Thanks!
Can someone tell me who owns the parth@yahoo.com email address?
The email jwoodrow@gmail.com keeps sending me spam mail.
Please forward this email to justin@stanford.edu for me. Thanks!
Omg @ye is my favorite!
Do you think a@b.c is spam?
This one isn't spam: a@d.tv
Hello, world!
Why am I getting emails from trey@spam.com?



which has at most one email address per line of the file. Your job is to write the following function:

def extract_all_hostnames(filename)

which takes in a string representing a file's name and returns a list of all the unique hostnames in the file. For example, calling the function with the parameter emails.txt would have the following result:

              
>>> extract_all_hostnames('emails.txt')
['d.tv', 'gmail.com', 'spam.com', 'stanford.edu', 'yahoo.com']



In writing this function, think about how best to decompose it into functions that are responsible for subparts of the problem. For example, consider implementing a function which extracts a hostname from a single line and how you might use it.
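The decomposition suggested above might look like the sketch below. The helper name extract_hostname is hypothetical, and returning the list in sorted order is an assumption inferred from the sample output rather than something the problem states.

```python
def extract_hostname(line):
    """Return the hostname of the email address in line, or '' if the
    line has no valid one (hypothetical helper)."""
    at = line.find('@')
    if at == -1:
        return ''
    # Scan forward over letters and periods after the '@'.
    end = at + 1
    while end < len(line) and (line[end].isalpha() or line[end] == '.'):
        end += 1
    host = line[at + 1:end]
    if len(host) >= 4 and '.' in host:
        return host
    return ''


def extract_all_hostnames(filename):
    """Return the unique hostnames in filename (sorted by assumption)."""
    hostnames = []
    with open(filename) as f:
        for line in f:
            host = extract_hostname(line)
            if host != '' and host not in hostnames:
                hostnames.append(host)
    return sorted(hostnames)
```

Note that this helper would treat a period directly after an address as part of the hostname.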

A much better email parser

In the last problem, you built a program to retrieve email hostnames from a file. Unfortunately, that program was limited in several ways. For example, it could only parse a single email from each line of the file, only retrieved the hostname of each email, and finally wasn't robust to peculiar cases such as punctuation occurring immediately after the hostname.

This time, you'll leverage your skills with nested loops and string parsing to build a more sophisticated program to grab emails from a file. You'll start by writing a program that simply grabs every email address from the file by implementing functions which we specify and whose definitions we provide for you. Then, you'll make your program a little more flexible by having it support a variety of command line arguments which alter its behaviour.

Step 1: Parsing Emails

Detecting Valid Characters

First, we're going to refine our definition of what constitutes an email address. An email address must be formatted in the following way:

username@hostname

Every character in both the username and the hostname must be a letter, a digit, a period, a dash, or an underscore (the '_' character). The username must be at least one character long, and the hostname must be at least 4 characters long, one of which is a period. With this in mind, implement the following useful helper function:

def is_email_char(ch)

that takes in a character, and returns whether that character is a valid part of an email address. This will not be a long function, but will be instrumental in the readability of the more complex functions you write later.
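A minimal sketch of such a helper, assuming ch is always a string of length 1:

```python
def is_email_char(ch):
    """Return True if ch is a letter, a digit, a period, a dash, or an
    underscore. Assumes ch is a single character."""
    return ch.isalpha() or ch.isdigit() or ch in '.-_'


print(is_email_char('a'), is_email_char('@'))
```

The single-character assumption matters: an empty string would pass the `in` test, since the empty string is contained in every string.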

Getting email addresses from a line

Your job here is to implement the following function:

def get_all_emails(line)

which takes as input a string representing a line of text from a file, and returns a list of all the valid email addresses in that line.

Here's some sample output for the get_all_emails function:

            
>>> get_all_emails('xx aa@bb.com 1.2@3.45')
['aa@bb.com', '1.2@3.45']
>>> get_all_emails('_@_ aa-bb@TV.org**meh@meh.com')
['aa-bb@TV.org', 'meh@meh.com']
>>> get_all_emails('abc @ @ 123')
[]
>>> get_all_emails('')
[]



Some words of wisdom:

• You might find it helpful to first find an @ character in the line, and then scan backwards and forwards to find the other characters in the email address. As a reminder, the str.find() function accepts an optional second parameter which specifies the index at which to begin searching.
• Leveraging the is_email_char() function you wrote in the previous section will be very helpful here.
• Remember, the username in an email address must be at least one character long and the hostname must be at least 4 characters long, with one period.
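The hints above can be sketched as follows. This is one possible structure under the stated rules, not the sample solution, and it includes its own copy of is_email_char so that it runs on its own.

```python
def is_email_char(ch):
    """Return True if the single character ch may appear in an email."""
    return ch.isalpha() or ch.isdigit() or ch in '.-_'


def get_all_emails(line):
    """Return a list of the valid email addresses found in line."""
    emails = []
    search_from = 0
    while True:
        # The second argument to find() says where to start searching.
        at = line.find('@', search_from)
        if at == -1:
            break
        # Scan backwards from the '@' to find the start of the username.
        start = at
        while start > 0 and is_email_char(line[start - 1]):
            start -= 1
        # Scan forwards from the '@' to find the end of the hostname.
        end = at + 1
        while end < len(line) and is_email_char(line[end]):
            end += 1
        username = line[start:at]
        hostname = line[at + 1:end]
        if len(username) >= 1 and len(hostname) >= 4 and '.' in hostname:
            emails.append(username + '@' + hostname)
        search_from = at + 1     # continue after this '@'
    return emails
```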

Getting all the email addresses from a file

Finally, implement the following function:

def get_emails_from_file(filename)

that takes as input a filename for your function to read through and returns a list of all the email addresses in the file. For example, if the file emails.txt looks like this:

              
Hello john@example.com this is alice@microsoft.com
And a.7@d_e.org and a@a.com
jwoodrow@stanford.edu is not nick's email



then the function would behave as below:

              
>>> get_emails_from_file('emails.txt')
['a.7@d_e.org', 'a@a.com', 'alice@microsoft.com', 'jwoodrow@stanford.edu', 'john@example.com']


Putting it all together

We've written a main function for you that puts all of these together, so you don't need to worry about modifying it for this section.

              
import sys

def main():
    args = sys.argv[1:]
    if len(args) == 1:
        emails = get_emails_from_file(args[0])
        for email in emails:
            print(email)
    # some other bookkeeping here



You can use your program as demonstrated below:

            
$ python3 emails.py emails.txt
a.7@d_e.org
a@a.com
alice@microsoft.com
jwoodrow@stanford.edu
john@example.com
$ python3 emails.py big-emails.txt
--@
--@and.com
--@bill
--@come
--@oh
--@oh.com
--@the.com
....lots and lots of emails....
you@thinking.com
your@acceptance.com
your@walk



Step 2: Command Line Arguments

Now that you have a basic version of your program working, you'll turn your attention to making it more flexible and powerful by implementing various optional command line options for the user:

• -max: The -max command line option allows you to specify the maximum number of emails you'd like to grab from each line. For example, if one of the lines in the file is jwoodrow@stanford.edu and nick@stanford.edu and chris@stanford.edu, but your program is called as below, only jwoodrow@stanford.edu and nick@stanford.edu should be printed to the terminal.
                
$ python3 emails.py emails.txt
... emails from other lines in the file ...
jwoodrow@stanford.edu
nick@stanford.edu
chris@stanford.edu
... emails from other lines in the file ...
$ python3 emails.py -max 2 emails.txt
... emails from other lines in the file ...
jwoodrow@stanford.edu
nick@stanford.edu
... emails from other lines in the file ...



• -host: The -host command line option allows you to specify that you would only like to grab emails with a particular hostname. For example, calling the program as below will only print stanford.edu emails in the shell.
                
$ python3 emails.py emails.txt
carol@avengers.com
jwoodrow@stanford.edu
julia@gmail.com
$ python3 emails.py -host stanford.edu emails.txt
jwoodrow@stanford.edu



You can assume that a user will use either the -max option, or the -host option, but not both.

Elegantly supporting both these options is primarily a challenge in decomposition and style - there is no one 'correct' way to do it. You are free to make whatever modifications you want to the program's functions, their parameters and return values. As a reference, the sample solution modifies the main, get_emails_from_file, and get_all_emails functions, although you are welcome to pursue an alternative strategy.
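As one illustration of how the options might be separated from the filename, here is a sketch of a helper that interprets the argument list. The name parse_args, its return shape, and the choice to return None for unrecognized arguments are all assumptions made for this example; the sample solution may be organized differently.

```python
def parse_args(args):
    """Interpret a list such as sys.argv[1:].

    Returns (filename, max_per_line, host), where the unused options
    are None, or returns None if the arguments are not recognized.
    Hypothetical helper for illustration only.
    """
    max_per_line = None
    host = None
    if len(args) == 1:
        filename = args[0]
    elif len(args) == 3 and args[0] == '-max':
        max_per_line = int(args[1])   # command line arguments are strings
        filename = args[2]
    elif len(args) == 3 and args[0] == '-host':
        host = args[1]
        filename = args[2]
    else:
        return None
    return filename, max_per_line, host


print(parse_args(['-max', '2', 'emails.txt']))
```

With a helper like this, main could pass max_per_line and host down to the functions that read the file and scan each line, which would then decide how many emails to keep or which hostnames to accept.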