Well, I decided I wouldn't teach much new this week. 😀
There were a couple of reasons for that. With just having the Wednesday class – and thinking that a weekly assignment really is very useful for making people struggle and learn – well, it seemed like I needed material for an assignment today, and that couldn't exactly be how to use the command line. But, also, based on the survey feedback, I thought it might be a good idea to put on the brakes a little and to try to consolidate a bit more what we have already done.
Really, you already know enough to be able to write a lot of programs to do a lot of things in Python. In coming weeks, we will certainly cover the command line and IDEs, libraries for data analysis, plotting, web scraping, and even machine learning. Nevertheless, one of the other main skills you need is simply practice in writing bigger programs and getting them to work without becoming 😕 confused. So, let's practice that for today, and you can keep going as homework….
Now you know about importing libraries … you can do anything in Python! You just need to know what libraries are out there and how to use them!
We have already written functions, with def that take zero or more arguments. We give the arguments names in the function definition, so that we can refer to them in the body of the function with those variable names. But for calling the function, we just call it by passing some arguments to it, and we just line them up with each other (first with first, second with second, …) as positional arguments.
However, if a function can take five arguments, then it becomes really hard for the function user to remember the order they go in. Moreover, if a function has five arguments, often some of the arguments have sensible default values, and you might want to make it optional for the user to specify them, falling back on the sensible default otherwise. You can do both these things with keyword arguments. Here's how.
def greet(name, greeting='Greetings', followup='We come in peace!'):
    print(greeting + ' ' + name + '! ' + followup)
x = 'hello'
greet('Chris')
greet('Chantal', followup='How are you?')
greet('Luis', followup='How\'s it going?', greeting='Hey')
Note that positional arguments must come first, and that keyword arguments with defaults are optional and can come in any order. There's even more you can do with keyword arguments – you can make them compulsory or have them pre-populate a dict. I won't go through that here, though you can get details and more details.
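Here is a minimal sketch of those two extras. The functions order and describe are made up for illustration. A bare * in the parameter list makes every argument after it keyword-only, so one without a default becomes compulsory, and **extras gathers any leftover keyword arguments into a dict.

```python
def order(item, *, size, extra_shot=False):
    # Everything after the bare * must be passed by keyword.
    # size has no default, so the caller has to supply it.
    suffix = ' with an extra shot' if extra_shot else ''
    return size + ' ' + item + suffix


def describe(name, **extras):
    # Keyword arguments not named in the definition end up in the dict extras.
    return name, extras


print(order('latte', size='large'))
print(describe('Ana', role='student', year=2))

try:
    order('latte', 'large')  # size passed positionally: not allowed
except TypeError as err:
    print('TypeError:', err)
```

The last call fails on purpose: Python refuses to line up 'large' with size, because size can only be given by name.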
You may not use keyword arguments all that much in functions you write, but you will have to get used to them in the methods of major libraries we use. (Note: Remember that methods are things that come after a variable followed by a dot like str.lower() or re.search(). We haven't yet covered how to define methods, but they're just like functions but connected with a class or objects of a class.)
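To get a feel for keyword arguments in code you didn't write, here is a small sketch that uses only built-ins. The word list is invented for illustration.

```python
words = ['pear', 'Apple', 'fig', 'Banana']

# sorted() takes two optional keyword arguments, key and reverse.
by_length = sorted(words, key=len)
reverse_alpha = sorted(words, key=str.lower, reverse=True)

# str.split() takes an optional maxsplit, which can be given by keyword.
first, rest = 'one two three'.split(' ', maxsplit=1)

# print() has keyword arguments sep and end.
print(*by_length, sep=', ')
print(reverse_alpha)
print(first, rest, sep=' | ')
```

In each case you could leave the keyword arguments out and get the default behavior. Naming them is what makes the call readable.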
Python allows you to make graphs and charts! With enough knowledge of Python libraries, you can make all the different kinds of charts that you might otherwise make with R, Excel, Google Sheets, or whatever. Here's a very brief intro on how to do that. We'll do more with a high-level chart library another class.
The basic and common tool for charts in Python is called matplotlib. Ugly name. Ugly library. It's powerful, but low level and hard to use. Here's what François Chollet, a deep learning scientist at Google Brain says about it: link. There are alternatives, such as ggplot, bokeh, and seaborn, and we'll look at them later. One of them was written by Michael Waskom while he was a Ph.D. student in Neuroscience at Stanford! Nevertheless, matplotlib is the lowest common denominator, and it's everywhere in the Python ecosystem. Indeed, some of the prettier higher level libraries are written on top of matplotlib. You sort of have to be vaguely familiar with it.
matplotlib in IPython/Jupyter notebooks. ❗️❗️❗️ This bit is really important ❗️❗️❗️
# The below line with % is IPython notebook magic.
# You must do this to have plotting work right in IPython notebooks!!!
# You must do it before you import matplotlib!
# 'notebook' is good, but there are other options. With none you get graphs in a separate window.
# %matplotlib
# %matplotlib inline
%matplotlib notebook
import matplotlib.pyplot as plt
IPython has some special commands, called magics, that start with %. In this case, you have to set up the graphics environment to support matplotlib graphics. If you don't, maybe things won't work at all, maybe they'll appear to work for a bit and then everything will crash and burn 🔥 – and you'll need to shut down your notebook and reload it. You have been warned.
plt.scatter([3,4,0,5,2], [11,19,8,15,12])
We have just passed in two lists of numbers. Internally matplotlib does everything with numpy objects such as arrays and you can also pass those in. numpy is central to the Python scientific and mathematical computing ecosystem. We'll also cover it later. But for now we will just call methods with lists and dicts.
Once you have a plot, it becomes your active plotting environment. Other graphics commands you give will add stuff to that plot. Try one:
plt.title('My first matplotlib chart!')
At some point you'll want to start a new chart. You can do that by deactivating the current chart by pressing the power button icon in the top right.
In code, you can get rid of any charts that are still around, and maybe still active, with plt.close("all").
plt.close("all")
Try starting at: Nicolas Rougier's matplotlib tutorial or the pyplot tutorial or the scipy lecture notes on matplotlib.
Most people couldn't possibly write code to produce a good plot by themselves. What you do is use Google or go to something like Nicolas Rougier's page, find a plot that looks roughly like what you want and then fiddle with matplotlib commands until you get what you want (or fail trying). It's the same as what people do in R (in my experience).
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), list(ages.values()))
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), list(ages.values()), align='center')
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), list(ages.values()), align='center')
plt.xticks(range(len(ages)), list(ages.keys()))
plt.title('Ages of people')
We can put the columns in some order by using an OrderedDict rather than a plain dict. We can specify how to order the columns either using fancy stuff that we haven't covered (either list comprehensions or lambda functions) or using the operator module from the standard library. We'll use the latter here.
import operator
import collections
# oages = collections.OrderedDict(sorted(ages.items()))
oages = collections.OrderedDict(sorted(ages.items(), key=operator.itemgetter(1)))
plt.bar(range(len(oages)), list(oages.values()), align='center')
plt.xticks(range(len(oages)), list(oages.keys()))
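To see what the sorting step does on its own, here is a small sketch without any plotting. It also shows the lambda alternative mentioned above, which does the same job as operator.itemgetter(1). The names by_age and by_age_lambda are just for illustration.

```python
import collections
import operator

ages = {'sue': 21, 'billy': 33, 'niawen': 11, 'neha': 43}

# ages.items() yields (name, age) pairs; itemgetter(1) picks out the age.
by_age = sorted(ages.items(), key=operator.itemgetter(1))
# A lambda function that returns element 1 of each pair is equivalent.
by_age_lambda = sorted(ages.items(), key=lambda pair: pair[1])

oages = collections.OrderedDict(by_age)
print(list(oages.keys()))    # names, youngest first
print(list(oages.values()))  # ages, ascending
```

From Python 3.7 on, a plain dict also remembers insertion order, so dict(by_age) would work too. OrderedDict just makes the intent explicit.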
Since our charts below will have a lot of bars, it will be useful to draw them sideways. We can turn things sideways by instead using barh() and now specifying the labels on the Y axis.
plt.barh(range(len(ages)), list(ages.values()), align='center')
plt.yticks(range(len(ages)), list(ages.keys()))
plt.xlabel('Age')
plt.title('People\'s ages')
One of my favorite linguists is Bruce Hayes. Here's a picture of him. He's always been a favorite linguist of mine. I remember really well his reading poetry and chanting folk songs while teaching about metrical structure at the 1991 Linguistic Institute at Santa Cruz. 
However, over the years, I've become even fonder of him with his really interesting work in exploring empirical and even computational approaches to phonology. We're not really going to do any heavy duty phonology here, but this assignment is inspired by Bruce's work on Hungarian vowel harmony.
Many languages have systems of vowel harmony where affixes on words must contain certain vowels based on the vowels that precede or follow them. For example, there are verb suffixes in Hungarian like -ok/ek/ük, where the choice between back, short-front and long-front vowel words depends on the preceding vowel:
Vowel harmony is something that you find not only in Hungarian but in many other languages, from nearby but unrelated ones like Turkish to far away ones like many Australian languages, such as Warlpiri. It's a somewhat strange phenomenon. A language loses information density by having vowel harmony (you can distinguish fewer words of a certain length) but gains in distinguishability (you get multiple chances to hear a sound right) and in having a harmonious sound.
Vowel harmony often doesn't apply everywhere. Hungarian also has suffixes that do not show vowel harmony. An example is -ért, the causal-final case:
Stems need not show harmony, but tend to. Bruce has been interested in the idea that there tend to be soft phonotactic constraints in languages – not hard rules, but tendencies – and in Hungarian vowel harmony. Here are a couple of papers: Comparative Phonotactics, Natural and Unnatural Constraints in Hungarian Vowel Harmony, and you can find a lot of other interesting stuff on his webpage.
We're going to see how strong the overall preference for vowel harmony in words is in Hungarian.
First, let's learn a bit about Hungarian vowels. We're just going to use Hungarian orthography here, rather than converting things into phonological representations with IPA. Fortunately, Hungarian orthography is fairly phonemic. Below is a regex for the set of all Hungarian vowels.
They divide in matched pairs as short and long vowels. Long vowels all have an acute (or two) above them. There are then back vowels and front vowels. Finally short and long [i] and [í] are sort of special. They are central vowels.
vowels = '[aeioöuüáéíóőúű]'
short_vowels = '[aeioöuü]'
long_vowels = '[áéíóőúű]'
front_vowels = '[eöüéőű]'
back_vowels = '[aouáóú]'
central_vowels = '[ií]'
short_front_vowels = '[eöü]' # f
long_front_vowels = '[éőű]' # F
short_back_vowels = '[aou]' # b
long_back_vowels = '[áóú]' # B
short_central_vowels = '[i]' # c
long_central_vowels = '[í]' # C
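As a rough sketch of how these classes can be used, here is a hypothetical helper, vowel_pattern, that turns a single word into its string of class codes. The helper name, the VOWEL_CLASSES list, and the three test words are my own choices for illustration; the regexes and one-letter codes are the ones defined above.

```python
import re

# Same classes as above, paired with the one-letter codes from the comments.
VOWEL_CLASSES = [
    ('[eöü]', 'f'), ('[éőű]', 'F'),
    ('[aou]', 'b'), ('[áóú]', 'B'),
    ('[i]', 'c'), ('[í]', 'C'),
]


def vowel_pattern(word):
    # Lowercase, delete everything that is not a vowel, then recode each vowel.
    word = re.sub('[^aeioöuüáéíóőúű]', '', word.lower())
    for regex, code in VOWEL_CLASSES:
        word = re.sub(regex, code, word)
    return word


for w in ['város', 'köszönöm', 'Budapest']:
    print(w, vowel_pattern(w))
```

The recoding is safe to do one class at a time because none of the code letters (f, F, b, B, c, C) is itself a vowel, so an earlier substitution can never be picked up by a later one.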
Q1: What we want to do is some counts of words and see what patterns we observe with different patterns of vowels in these classes. First, we'll need some Hungarian text. There's some Hungarian text, one sentence per line in the file 1984-hungarian.txt. (It's the translation of the novel "1984".)
Let's write a program that does these things:
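1. Opens the file 1984-hungarian.txt
2. Removes punctuation from each line
3. Lowercases the line
4. Splits the line into words
5. Deletes everything that is not a vowel from each word, using the regex sub() function
6. Skips any word that is left empty
7. Replaces each vowel with the letter for its class (f, F, b, B, c, C)
8. Counts each vowel pattern, in a separate dict for each pattern length
9. Counts how often each vowel class occurs overall
10. Closes the file
# Write a program to do all that here!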
%matplotlib notebook
import matplotlib.pyplot as plt
import re
import pprint
counts = {}
vowel_counts = {}
# #1
file = open('1984-hungarian.txt', encoding='utf-8')
# The hyphen goes last in the character class so that it is taken literally
punct = re.compile('[,.?!;:-]')
# Putting ^ after the opening [ negates the class: anything that is not a vowel
nonvowels = vowels[:1] + '^' + vowels[1:]
consonant_re = re.compile(nonvowels)
for line in file:
    line = line.strip()
    # #2
    line = punct.sub('', line)
    # #3
    line = line.lower()
    # #4
    words = line.split()
    for word in words:
        # #5
        word = consonant_re.sub('', word)
        # #6
        if word == '':
            continue
        # #7
        word = re.sub(short_front_vowels, 'f', word)
        word = re.sub(long_front_vowels, 'F', word)
        word = re.sub(short_back_vowels, 'b', word)
        word = re.sub(long_back_vowels, 'B', word)
        word = re.sub(short_central_vowels, 'c', word)
        word = re.sub(long_central_vowels, 'C', word)
        # #8 (named pattern_counts so that we don't shadow the built-in dict)
        n_vowels = len(word)
        if n_vowels not in counts:
            counts[n_vowels] = {}
        pattern_counts = counts[n_vowels]
        pattern_counts[word] = pattern_counts.get(word, 0) + 1
        # #9
        for c in word:
            vowel_counts[c] = vowel_counts.get(c, 0) + 1
# #10
file.close()
Then we will want to be able to:
Q2: Try printing out some of the counts below and look at the data
# Code to print some of the counts for word vowel patterns and overall vowel frequency
pprint.pprint(counts[1])
# Code to print some of the counts for word vowel patterns and overall vowel frequency
pprint.pprint(vowel_counts)
vowel_total = sum(vowel_counts.values())
vowel_percents = {}
for v in vowel_counts.keys():
    vowel_percents[v] = vowel_counts[v] / vowel_total * 100
pprint.pprint(vowel_percents)
print('=== 2 vowel words ===')
pprint.pprint(counts[2])
print('=== 3 vowel words ===')
pprint.pprint(counts[3])
def table_of_observed_expected(dic):
    print('This compares observed versus expected counts of vowels')
    print('pattern observed expected ratio')
    print('-------------------------------')
    num_k_syl = sum(dic.values())
    for k in sorted(dic.keys()):
        expected = num_k_syl
        for v in k:
            expected = expected * (vowel_counts[v] / vowel_total)
        # This is the fancier way to print things nicely with .format()!
        print('{}: {:>6} {:>8.2f} {:>8.2f}'.format(
            k, dic[k], expected, dic[k] / expected))
table_of_observed_expected(counts[2])
table_of_observed_expected(counts[1])
print()
table_of_observed_expected(counts[3])
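To make the arithmetic behind the expected column concrete, here is a toy version with invented numbers (they are not from the novel). The names toy_patterns, toy_vowel_counts, and expected_count are hypothetical and chosen so they don't clash with the real variables. The expected count assumes each vowel in a word is chosen independently: the number of words of that length times the overall probability of each vowel class in the pattern.

```python
# Invented counts of two-vowel patterns, for illustration only.
toy_patterns = {'bb': 50, 'bf': 5, 'fb': 5, 'ff': 20}

# Overall frequency of each vowel class, tallied from the patterns.
toy_vowel_counts = {}
for pattern, n in toy_patterns.items():
    for v in pattern:
        toy_vowel_counts[v] = toy_vowel_counts.get(v, 0) + n
toy_vowel_total = sum(toy_vowel_counts.values())

n_words = sum(toy_patterns.values())


def expected_count(pattern):
    # Independence assumption: multiply the probabilities of the vowel classes.
    expected = n_words
    for v in pattern:
        expected = expected * (toy_vowel_counts[v] / toy_vowel_total)
    return expected


for pattern in sorted(toy_patterns):
    exp = expected_count(pattern)
    print('{}: {:>6} {:>8.2f} {:>8.2f}'.format(
        pattern, toy_patterns[pattern], exp, toy_patterns[pattern] / exp))
```

A ratio above 1 means the pattern turns up more often than chance would predict, and a ratio below 1 means less often. In this toy data the harmonic patterns bb and ff come out above 1 and the mixed ones well below.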
Q3: Now define a function bar_graph that will produce a horizontal bar graph for a single dict.
# Code to produce a bar graph
def bar_graph(data):
    # data is a dict from label to count; the name avoids shadowing the built-in dict
    plt.barh(range(len(data)), list(data.values()), align='center')
    plt.yticks(range(len(data)), list(data.keys()))
    plt.show()
You should then be able to call it on the dict you made for all 2 vowel words, and again for all 3 vowel words. And for the overall vowel distribution. You'll need to substitute in the right variable names for my placeholders x, etc.
bar_graph(counts[2])
bar_graph(counts[3])
bar_graph(vowel_counts)
Q4: It'd be handy to be able to sort the keys by their values. Try to make a plot that sorts the bars by their value (pattern frequency) using OrderedDict and the operator library as shown at the beginning!
import collections
import operator
def bar_graph_sorted(data):
    # Sort the (label, count) pairs by count before plotting
    odict = collections.OrderedDict(sorted(data.items(), key=operator.itemgetter(1)))
    plt.barh(range(len(odict)), list(odict.values()), align='center')
    plt.yticks(range(len(odict)), list(odict.keys()))
    plt.show()
We can then call it to display some graphs. Remember to substitute in your variable names.
bar_graph_sorted(vowel_counts)
bar_graph_sorted(counts[1])
bar_graph_sorted(counts[2])
bar_graph_sorted(counts[3])
Q5: Final question: What observations can you make about the frequencies of different vowel patterns in Hungarian words? Maybe you could write about a 10 line paragraph. Not an essay!
For the overall vowel counts, short vowels are much more common than long vowels - about 3 times as common. For two and three vowel words, all the most common patterns by count are patterns that involve all front or all back vowels (bb, bB, etc.). It should be remembered that it is not too surprising that these patterns are more common than ones with central vowels in them, since there is only 1 central vowel of each length versus 3 front and 3 back vowels of each length. Given that, we note that central vowels freely occur with both front and back vowels with a similar frequency. What is uncommon is words that mix front and back vowels. It's not that they're impossible - they certainly occur. But their occurrence is marked. If we work out expected counts for vowel patterns based on overall vowel frequency, we see that patterns that are all (back or central) or all (front or central) usually occur twice or more as often as expected, whereas patterns that combine a front and a back vowel commonly occur 1/3 or less often than expected. There are some other patterns in the data, too: It seems to be uncommon for the last syllable to contain a central vowel. And there are a few patterns that combine back and front vowels that do actually occur more often than expected, such as "BfF" and "bcf". It's not clear why that is.