#!/usr/bin/env python
# encoding: utf-8

"""
Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2013
Christopher Potts

Assignment 7

Distributed 2013-11-05
Due 2013-11-12

NOTE: Please submit a modified version of this file, including
comments.  Python should be able to run through your file without
errors.

**********************************************************************
IMPORTANT: ONLY THE FIRST PROBLEM IS REQUIRED. AFTER THAT, YOU JUST
NEED TO DO AT LEAST 6 POINTS' WORTH OF PROBLEMS.

IF YOUR PROBLEM IS ONE OF THE CHOICES, THEN YOU HAVE THE OPTION OF
TURNING IN YOUR PREVIOUS CODE (PERHAPS UPDATED) FOR CREDIT. (THAT IS,
THIS CAN CONTRIBUTE TO YOUR 6-POINT TOTAL.)

TO MAKE GRADING EASIER FOR ME, PLEASE DELETE THE PROBLEMS YOU DON'T DO,
SO THAT I DON'T GET CONFUSED ABOUT WHAT TO LOOK AT.
**********************************************************************
"""

import re
from collections import defaultdict  # needed by email_counts (Problem 11) and the counting sketch below

"""===================================================================
1. Begin designing your final project by answering these three 
questions, devoting 3-5 sentences to each. [4 points in total]

a. What is the central goal of your project?

b. Where will you (and others) be using the code from this project?

c. What do you still need to learn or figure out in order to complete 
the project?

Some notes: 

* This is meant to get you started, so it's okay if your idea is 
general at this point.  However, this will be most productive if you
use it as an opportunity to try to seriously map out your project's
goals.

* You should have some things to say under (c).  If you don't, you're 
probably not being ambitious enough.

* It's fine if your project is very specific (e.g., write code for an 
experiment in a paper I'm writing) but the best projects will be more 
general than that (e.g., a library for processing and analyzing
reading-time data).

* I encourage you to build a project that makes use of existing
code packages like NLTK, matplotlib, scipy, etc.  This is by no
means a requirement, but the best programmers know when to use 
existing libraries rather than writing their own!"""


"""===================================================================
2. Multiplication using only summing [2 points; thanks Katherine!]

Write a function that finds the product of two numbers without using 
any built-in multiplication functions from any packages or libraries.

Extra credit: 1 point to anyone whose code is faster than mine. I've
already registered my code by sending it to Katherine."""

def multiply(num1, num2):
    pass
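
# A minimal sketch of one possible approach (an illustration, not the
# registered solution): for non-negative integers, the product is just
# repeated addition, which sum() can do without any multiplication.
def multiply_by_summing_example(num1, num2):
    """Assumes num1 and num2 are non-negative integers."""
    return sum(num1 for _ in range(num2))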

"""===================================================================
3. Finding emoticons [2 points; thanks Bonnie!]

Write a tokenizer that is sensitive to emoticons. You can break this 
down into two parts: 

* First, define the function emoticon_finder(s) so that it takes a 
string as input and returns a list of emoticons in the string. The 
finder should be able to handle a diverse set of emoticons. For 
guidance, see the list of Western emoticons here: 

http://en.wikipedia.org/wiki/List_of_emoticons 

Emoticons may have any number of spaces in them, but no other kinds of 
whitespace.

* Second, define emoticon_aware_tokenizer(s) to tokenize a string that might 
contain emoticons. The function should delete punctuation when it is 
attached to a word, but should not delete any emoticons.

To test your function, you could try to match as many lines as you
can in

http://www.infochimps.com/datasets/twitter-census-smileys

while matching none in our words-english.txt file from earlier
assignments and none in

import string
string.punctuation

Extra credit: 1 point to anyone whose regular expression achieves a
higher F1 score (the harmonic mean of precision and recall) on the above
data. I've already registered my regular expression by sending it 
to Katherine. For details on the scoring:

http://en.wikipedia.org/wiki/Precision_and_recall
"""

def emoticon_finder(s):
    """Takes a string as input and returns a list of all the emoticons in the string."""
    pass

def emoticon_aware_tokenizer(s):
    """Takes a string as input and returns a list of words and emoticons in the string."""
    pass
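
# A minimal sketch of the kind of pattern emoticon_finder might build on
# (an assumption, not the registered regex): optional eyebrows, eyes, an
# optional nose, and a mouth, covering emoticons like :-) :) ;( 8D =P
EMOTICON_EXAMPLE_RE = re.compile(r"""
    [<>]?                       # optional hat/eyebrows
    [:;=8]                      # eyes
    [\-o\*']?                   # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
    """, re.VERBOSE)
# e.g. EMOTICON_EXAMPLE_RE.findall("great :-) thanks! ;D") => [':-)', ';D']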
    

"""===================================================================
4. Generalized word-beauty experiments [2 points; thanks Hsin-Chang!]

Modify wordbeauty_cmu_counts from assignment 5 so that it extracts
all of the "most beautiful" words from wordbeauty.csv, ignores all of 
the "regular" words in there, and instead extracts a random sample of 
words from the CMU dictionary to use in the experiment. Requirements
for the random sample:

* Disjoint from the "most beautiful" words.
* Same size (in terms of number of words) as the "most beautiful" 
words you can get pronunciations for from your CMU pickle.

Your code must not break compatibility with wordbeauty_counts2freqcsv.
This means that the counts dictionary you create has to have
exactly the same structure as before.

Note: the random library provides a function for extracting a fixed
length sample with no repeats:

http://docs.python.org/2/library/random.html

Your sampling code should not use any kind of for or while loop."""

def wordbeauty_cmu_counts(src_filename='wordbeauty.csv',
                          cmu_filename='cmudict.0.7a.pickle'): 
    # You'll be rewriting this function to sample regular words
    # on its own, ignoring the regular words in wordbeauty.csv.
    pass
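
# A minimal sketch (with hypothetical argument names) of the sampling step
# the note above describes: random.sample draws a fixed-size sample with
# no repeats and with no explicit for/while loop.
def sample_regular_words_example(cmu_words, beautiful_words):
    """Returns a sample of CMU words disjoint from the beautiful words
    and of the same size."""
    import random
    candidates = list(set(cmu_words) - set(beautiful_words))
    return random.sample(candidates, len(beautiful_words))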


"""===================================================================
5. Lemma counts [2 points; thanks Phil!]

Return to the Brown corpus (using brown.txt as an example) and provide 
counts for every word-tag pair. The program should take a file 
containing a section of the Brown corpus as input. The output should 
be a defaultdict mapping tuples of the form (word, POS-tag) to their 
counts within the portion of the corpus in the input.

Additional requirement: POS-taggers differ with regard to what 
character they use for their delimiter. / is probably the most 
popular, but _ is a close runner-up.  Allow the user to optionally
specify this delimiter by using the keyword argument delimiter."""

def wordtag_distribution(filename='brown.txt', delimiter='/'):
    pass
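
# A minimal sketch (assuming tokens of the form "word/TAG", with the tag
# after the last delimiter) of the counting pattern described above:
def wordtag_count_example(tokens, delimiter='/'):
    counts = defaultdict(int)
    for token in tokens:
        word, tag = token.rsplit(delimiter, 1)
        counts[(word, tag)] += 1
    return counts
# e.g. wordtag_count_example(['the/AT', 'dog/NN', 'the/AT'])
# => {('the', 'AT'): 2, ('dog', 'NN'): 1} (as a defaultdict)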


"""===================================================================
6. Most common tags [2 points; thanks Phil!]

Using the result from wordtag_distribution, write a function that 
allows the user to input a list of words and returns a dictionary 
mapping each word in the input to a list of that word's most common 
POS tags. 

The output should be a dictionary mapping each word from the input to 
a list of tuples, each tuple containing a POS tag and a count for the 
number of occurrences with the given tag. The lists should be ordered 
from most common to least common tags, using the key parameter of the 
function sorted(). Users may optionally supply a lower limit for the
counts that appear in the output."""

def most_frequent_tags(counts, cutoff=0, filename='brown.txt'):
    pass
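
# A minimal sketch (hypothetical helper) of ordering (tag, count) pairs from
# most to least common using the key parameter of sorted():
def sort_tag_counts_example(tag_count_pairs):
    return sorted(tag_count_pairs, key=lambda pair: pair[1], reverse=True)
# e.g. sort_tag_counts_example([('NN', 3), ('VB', 10), ('JJ', 1)])
# => [('VB', 10), ('NN', 3), ('JJ', 1)]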

"""Extra credit [1 point]: take the output from most_frequent_tags and 
write a function tag_frequency_csv that writes it to a CSV file. Each 
line should have the format 

Word,POS tag,Count,Frequency 

where Frequency is Count/total-tokens-of-Word."""

def tag_frequency_csv(counts,output_filename):
    pass
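
# A minimal csv-writing sketch (hypothetical row values; 'wb' mode follows
# the Python 2 csv module used in this course):
def csv_writer_example(output_filename='example.csv'):
    import csv
    with open(output_filename, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(['Word', 'POS tag', 'Count', 'Frequency'])
        writer.writerow(['dog', 'NN', 3, 0.75])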


"""===================================================================
7. Getting Brown corpus tag descriptions [2 points; thanks Simon!] 

Download

http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html

to your home machine and make sure it is accessible. This page lists
the tags used in the Brown corpus, each with a description and examples. 
They are listed in table format.  The tags are strings of characters 
without whitespace, the descriptions are words or phrases joined by 
commas or /, and the examples are space-separated words.

The following HTML tags are used to create a table:

    <table> creates a table, and </table> ends table creation.
    <tr> creates a new row in the table, and </tr> ends row creation.
    <td> creates a new cell in the row, and </td> ends cell creation.
    Other parameters may be included inside the tags, and they are case-insensitive
    (e.g. <TR ALIGN=top> is a legitimate row creation tag).
    Other tags, enclosed in < >, may be used within a cell for formatting.

Write a function which reads in the webpage and returns a dictionary 
which maps from each tag to a tuple containing the description and a 
list of examples; e.g.

{...
    'ABL': ('determiner/pronoun, pre-qualifier', ['quite', 'such', 'rather'])
...}

Note: The functions table_rowiter and stripstrings have been 
provided to help you. table_rowiter is an iterator that yields a string 
(in HTML) for each row. stripstrings removes padding, collapses runs of 
extra spaces, and deletes literal '...' from a list of strings.

The first row of the table contains headers, and the second row 
contains no cells. Your code should skip rows with fewer cells than 
the table has columns."""

def brown_tag_scraper(filename="brown.html"):
    #Open the webpage and read contents:
    with open(filename, 'rb') as webpage:  
        contents = webpage.read()
        
        #Regular expression for table
        re_table = re.compile(r'\<table.*?\>(.*?)\</table\>', re.I)
        
        #Get the table. In the case of multiple tables, use the first one.
        myTable = re_table.search(contents.replace('\n','')).group()
        
    #Get a row iterator. This iterator returns a string containing the HTML
    #representation for each row, row-by-row.
    rowiter = table_rowiter(myTable)
    
    #Preallocate dictionary
    tags = dict()    
    
    #The first row is the header:
    
    #Iterate through the rows:
    
    #Return tags:
    return tags

def row2list(rowHTML):
    """Takes in a string representation of a table row in HTML, and returns a
    list of the cell entries in that row.
    
    You should use a regular expression to extract a list of cell entries in
    the given row, then use stripstrings() to tidy up the strings in this list."""
    pass
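
# A minimal sketch of the cell-extraction step that row2list describes
# (an illustration only; the exact pattern is up to you):
def row2list_example(rowHTML):
    re_cell = re.compile(r'<td.*?>(.*?)</td>', re.I)
    return stripstrings(re_cell.findall(rowHTML))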
    
def table_rowiter(tableStr):
    """Takes in a string representing an HTML table and returns an iterator which
    yields string representations of rows of that table, in HTML."""    
    #Regular expression for table rows
    re_tableRow = re.compile(r'\<tr.*?\>(.*?)\</tr\>', re.I)    
    #Get an iterator for the table rows
    rows = re_tableRow.finditer(tableStr)    
    #Iterate through the rows:
    for row in rows:
        #Yield the current row cells (with padding removed)
        yield row.group()
        
def stripstrings(strList):
    """Returns a copy of the list of provided strings, with padding stripped"""    
    return [re.sub(r'\s{2,}', ' ', s.replace('...','').strip()) for s in strList]


"""===================================================================
8. Exploring tags from the Brown corpus [2 points; thanks Simon!] 

Write a function which enables you to explore the dictionary of Brown 
corpus tags you obtained from brown_tag_scraper. 

This function should take in three optional parameters: a tag 
substring, a  description substring, and/or an example word. Any of 
these may be string representations of regular expressions.

In each case, if a single parameter is provided, the function should 
return the full information for a tag whose name / description / 
examples (as appropriate) contain the given substring/word. If multiple 
different parameters are provided, the function should return the 
full information for a tag which contains all provided values in the 
relevant places.

The default search values of '' should match anything."""

def brown_tag_explorer(browntags, tagSearch='', descSearch='', exSearch=''):
    """Searches through the provided dictionary of brown corpus tag information 
    for tags which match the given tagSearch (in tag name), descSearch (in tag
    description) and exSearch (in tag examples). Case is ignored."""
    
    #Create regular expressions for matching each search string:
    
    #Preallocate the list of tag hits to be those tags which match tagSearch:
    
    #Preallocate a dictionary of hits
    hits = dict()
    
    #Add to the return dictionary entries which match exSearch and descSearch

    #Return hits:
    return hits
    

def re_filterlist(regex, l):
    """Given a list l of strings and a regex, return a list of entries in l which match regex."""
    
    #Preallocate matches
    matches = list()
    
    #Iterate through l:

    #Return matches:    
    return matches


"""===================================================================
9. Rhyming [2 points; thanks Bonnie!]

Write a basic rhyming dictionary using the CMU pronouncing dictionary. 

The user inputs a string as the function's argument, and the function 
returns a list of words that, for at least one of their pronunciations, 
rhyme with the input word's first pronunciation. 

For the purposes of this problem, two words rhyme if they have an 
identical set of phonemes from the primary-stressed vowel to the end of 
the word. You can assume that the user will input a word that is in 
the dictionary.

You can modify cmu2pickle from problem 4 on Assignment 5 so that it 
does not strip off the digits that mark stress, and then use the 
resulting pickled dictionary to get your pronunciations. The 
dictionary should look like this:

{...
'ABATE': [['AH0', 'B', 'EY1', 'T']],
'ABATED': [['AH0', 'B', 'EY1', 'T', 'IH0', 'D']],
'ABATEMENT': [['AH0', 'B', 'EY1', 'T', 'M', 'AH0', 'N', 'T']],
...}"""

def cmu2pickle_new(src_filename='cmudict.0.7a.txt',
                   output_filename="cmudict.0.7a.pickle.new"):
    """Get cmu2pickle from Assignment 5, and modify it so that
    it does not delete the digits that indicate stress."""
    pass

def rhyming_dictionary(s):
    """Takes a string and returns a list of rhyming words."""
    pass
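
# A minimal sketch (illustration only) of pulling out the part of a
# pronunciation that matters for rhyming: everything from the
# primary-stressed vowel (marked with a '1') to the end of the word.
def rhyme_part_example(phonemes):
    for i, phone in enumerate(phonemes):
        if phone.endswith('1'):
            return phonemes[i:]
    return phonemes  # no primary stress found; fall back to the whole word
# e.g. rhyme_part_example(['AH0', 'B', 'EY1', 'T']) => ['EY1', 'T']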


"""===================================================================
10. PDF crawler [2 points; thanks Masoud!]

Complete pdf_crawler so that it does the following:

* It downloads all of the PDF pages linked from link, and puts them into
the user's directory output_dirname.

* For each HTML page H linked from link, it follows H and calls 
pdf_crawler on H.

* It continues this crawling to a depth supplied by the user: depth=0
just does the user's page; depth=1 follows the links on the user's
page and downloads those files, but doesn't go further; and so forth.

* It is respectful in that it uses the time module to rest between
downloads, so that no one's server gets slammed:

http://docs.python.org/2/library/time.html

* It avoids downloading files it already has.

For experimentation, please use the link supplied by default.
This takes you to a directory in our class space. You can see
the contents of this directory here:

http://www.stanford.edu/class/linguist278/data/crawler/

You can also start at one of these other pages if you prefer:

http://www.stanford.edu/class/linguist278/data/crawler/compling.html
http://www.stanford.edu/class/linguist278/data/crawler/phonetics.html
http://www.stanford.edu/class/linguist278/data/crawler/phonology.html
http://www.stanford.edu/class/linguist278/data/crawler/psycholinguistics.html
http://www.stanford.edu/class/linguist278/data/crawler/semprag.html
http://www.stanford.edu/class/linguist278/data/crawler/sink.html
http://www.stanford.edu/class/linguist278/data/crawler/socio.html
http://www.stanford.edu/class/linguist278/data/crawler/start.html
http://www.stanford.edu/class/linguist278/data/crawler/syntax.html
"""

def pdf_crawler(link="http://www.stanford.edu/class/linguist278/data/crawler/start.html", 
                output_dirname=".", 
                depth=0):
    """Crawls webpages looking for PDFs to download. Branches
    out from the original to the specified depth."""
    pass
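
# A minimal sketch (hypothetical helper, not the full crawler) of the
# polite-download step described above: skip files we already have and
# rest between requests, using Python 2's urllib and the time module.
def polite_download_example(url, output_filename, pause=1.0):
    import os, time, urllib
    if not os.path.exists(output_filename):
        urllib.urlretrieve(url, output_filename)
        time.sleep(pause)  # rest so that no one's server gets slammed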


"""===================================================================
11. Processing an email corpus [4 points; thanks Li Sha!]

The Enron Email Corpus is a large database of over 600,000 emails 
generated by 158 employees (mostly senior management) of the Enron 
Corporation. This dataset was acquired and made public by the Federal 
Energy Regulatory Commission during its investigation of the company, 
and was later purchased by a computer scientist at MIT after the 
company's bankruptcy. It is so far the largest collection of real 
emails that is publicly available for research. You can download this 
dataset from:

https://www.cs.cmu.edu/~enron/

Since the whole dataset contains more than 0.5M emails, we only use 
the files in one folder for this homework. Specifically, we use the 
_sent_mail folder under the allen-p folder.

The goal is to create a csv file that records the sender name, the 
receiver email accounts, and the number of emails exchanged between 
them. To achieve this goal, you might want to consider the following:

a. Write a function that captures the sender and receiver information
from each file. Since all the files in this folder share the same
sender, you only need to extract the receiver information. The tricky
part is that sometimes one email is sent to multiple receivers. In this
case, you can be imaginative about how to deal with it, but your
approach has to make sense.

b. Write a file iterator that iterates through all the files
in the folder and counts the number of emails exchanged between
each receiver and the sender. You can store the information
in a dictionary.

c. Create a csv file that matches the sender and receivers,
with the header being the sender name, each row name being the
receiver names, and the associated values being the number of
email exchanges.

d. The receivers need to be sorted by their in-degree in descending
order. Viewing the data as a directed graph, the in-degree of a
receiver is simply the number of emails that receiver gets.

def receiver_lst(dirname):
    receivers = []     
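
# A minimal sketch (assuming headers of the form "To: a@enron.com, b@enron.com"
# on a single line) of pulling the receiver addresses out of one email's text:
def receivers_from_text_example(email_text):
    match = re.search(r'^To:\s*(.+)$', email_text, re.M)
    if not match:
        return []
    return [addr.strip() for addr in match.group(1).split(',')]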

def email_counts(receivers):
    # Creates a default dictionary to store information
    counts = defaultdict(int)    

def email_csv(counts, output_filename='email-exchange.csv'):
    # Create a csv writer with output_filename and a header
    pass
    # Get a sorted representation of the counts:
    pass
    # Fill writer with corresponding stats:
    pass


"""===================================================================
12. Location data [6 points; thanks Laura!]

This problem asks you to create a labeled map programmatically using 
the Basemap toolkit from matplotlib:

http://matplotlib.org/basemap/users/examples.html

The problem is challenging because it involves learning how to use 
Basemap, processing CSV data, and converting location data.

The source data is this CSV file:

http://www.stanford.edu/class/linguist278/data/Places_locations2.csv

The output should be an image with the points on this map labeled
using the City,State,Country labels from the file."""
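
# A minimal Basemap sketch (illustration only; the projection choice and
# label placement are assumptions, not requirements) showing the basic
# pattern: project lon/lat pairs, scatter them, and add text labels.
def basemap_label_example(lons, lats, labels, output_filename='labeled_map.png'):
    """lons, lats, and labels are assumed to be parallel lists."""
    import matplotlib.pyplot as plt
    from mpl_toolkits.basemap import Basemap
    m = Basemap(projection='robin', lon_0=0)
    m.drawcoastlines()
    x, y = m(lons, lats)              # project longitude/latitude to map coordinates
    m.scatter(x, y, marker='o', color='r', zorder=5)
    for xi, yi, label in zip(x, y, labels):
        plt.text(xi, yi, label, fontsize=8)
    plt.savefig(output_filename)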

