Linguistics 278: Class 6 and Assignment 4

Well, I decided I wouldn't teach much new this week. 😀

There were a couple of reasons for that. With just having the Wednesday class – and thinking that a weekly assignment really is very useful for making people struggle and learn – well, it seemed like I needed material for an assignment today, and that couldn't exactly be how to use the command line. But, also, based on the survey feedback, I thought it might be a good idea to put on the brakes a little and to try to consolidate a bit more what we have already done.

Really, you already know enough to be able to write a lot of programs to do a lot of things in Python. In coming weeks, we will certainly cover the command line and IDEs, libraries for data analysis, plotting, web scraping, and even machine learning. Nevertheless, one of the other main skills you need is simply practice in writing bigger programs and getting them to work without becoming 😕 confused. So, let's practice that for today, and you can keep going as homework…

A couple of small new things … not exactly hard concepts, but good to know

Now you know about importing libraries … you can do anything in Python! You just need to know what libraries are out there and how to use them!

Keyword arguments

We have already written functions with def that take zero or more arguments. We give the arguments names in the function definition, so that we can refer to them in the body of the function with those variable names. But when calling the function, we just pass some arguments to it, and they are lined up with the names in order (first with first, second with second, …) as positional arguments.

However, if a function can take 5 arguments, then it becomes really hard for the function user to remember the order they go in. Moreover, if a function has five arguments, often some of the arguments have sensible default values and you might want to make it optional for the user to specify them, otherwise using the sensible default. You can do both these things with keyword arguments. Here's how.

In [9]:
def greet(name, greeting='Greetings', followup='We come in peace!'):
    print(greeting + ' ' + name + '! ' + followup)
In [8]:
greet('Chris')
Greetings Chris! We come in peace!
In [12]:
greet('Chantal', followup='How are you?')
Greetings Chantal! How are you?
In [7]:
greet('Luis', followup='How\'s it going?', greeting='Hey')
Hey Luis! How's it going?

Note that positional arguments must come first, and that keyword arguments with defaults are optional and can come in any order. There's even more you can do with keyword arguments – you can make them compulsory or have them pre-populate a dict. I won't go through that here, though you can get details and more details.
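For the curious, here is a minimal sketch of those two extras. The function names greet_strict and describe are made up for illustration. A bare * in the parameter list makes everything after it keyword-only (and compulsory if it has no default), while a parameter written with ** collects any other keyword arguments into a dict.

```python
def greet_strict(name, *, greeting):
    # Parameters after the bare * can only be passed by keyword.
    # 'greeting' has no default, so the caller must always supply it.
    return greeting + ' ' + name + '!'


def describe(name, **extras):
    # Keyword arguments not named in the definition end up in the dict 'extras'.
    parts = [name]
    for key in sorted(extras):
        parts.append(key + '=' + str(extras[key]))
    return ', '.join(parts)


print(greet_strict('Chris', greeting='Hey'))
print(describe('Chantal', age=21, city='Palo Alto'))
```

Calling greet_strict('Chris', 'Hey') without naming greeting raises a TypeError, which is the point of making it keyword-only.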

You may not use keyword arguments all that much in functions you write, but you will have to get used to them in the functions and methods of the major libraries we use. (Note: Remember that methods are things that come after a variable followed by a dot, like str.lower(). A call like re.search() looks the same, but there the thing before the dot is a module, and search is an ordinary function that lives inside it. We haven't yet covered how to define methods, but they're just like functions but connected with a class or objects of a class.)
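To get a feel for keyword arguments in library calls, here is a small sketch that uses only built-ins and the re module. The example strings are made up; the keyword names (key, reverse, maxsplit, count) are the real ones.

```python
import re

words = ['harmony', 'a', 'vowel']

# sorted() takes its optional settings as keyword arguments.
by_length = sorted(words, key=len, reverse=True)

# str.split() has an optional maxsplit; naming it makes the call easier to read.
first_rest = 'one two three'.split(maxsplit=1)

# re.sub() replaces every match unless you limit it with count.
one_gone = re.sub('[aeiou]', '', 'banana', count=1)

print(by_length)
print(first_rest)
print(one_gone)
```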

Making charts

Python allows you to make graphs and charts! With enough knowledge of Python libraries, you can make all the different kinds of charts that you might otherwise make with R, Excel, Google Sheets, or whatever. Here's a very brief intro on how to do that. We'll do more with a high-level chart library in another class.

The basic and common tool for charts in Python is called matplotlib. Ugly name. Ugly library. It's powerful, but low level and hard to use. Here's what François Chollet, a deep learning scientist at Google Brain, says about it: link. There are alternatives, such as ggplot, bokeh, and seaborn, and we'll look at them later. One of them, seaborn, was written by Michael Waskom while he was a Ph.D. student in Neuroscience at Stanford! Nevertheless, matplotlib is the lowest common denominator, and it's everywhere in the Python ecosystem. Indeed, some of the prettier higher level libraries are written on top of matplotlib. You sort of have to be vaguely familiar with it.

Starting matplotlib in IPython/Jupyter notebooks. ❗️❗️❗️ This bit is really important ❗️❗️❗️

In [13]:
# The below line with % is IPython notebook magic. 
# You must do this to have plotting work right in IPython notebooks!!!
# You must do it before you import matplotlib!  
# 'notebook' is good, but there are other options. With none you get graphs in a separate window.
# %matplotlib
# %matplotlib inline
%matplotlib notebook

import matplotlib.pyplot as plt

IPython has some special commands, called magics, that start with %. In this case, you have to set up the graphics environment to support matplotlib graphics. If you don't, maybe things won't work at all, maybe they'll appear to work for a bit and then everything will crash and burn 🔥 – and you'll need to shut down your notebook and reload it. You have been warned.

A very simple scatter plot

In [53]:
plt.scatter([3,4,0,5,2], [11,19,8,15,12])
Out[53]:
<matplotlib.collections.PathCollection at 0x1108b8dd8>

We have just passed in two lists of numbers. Internally matplotlib does everything with numpy objects such as arrays, and you can also pass those in. numpy is central to the Python scientific and mathematical computing ecosystem. We'll also cover it later. But for now we will just call methods with lists and dicts.

Once you have a plot, it becomes your active plotting environment. Other graphics commands you give will add stuff to that plot. Try one:

In [54]:
plt.title('My first matplotlib chart!') 
Out[54]:
<matplotlib.text.Text at 0x110581550>

At some point you'll want to start a new chart. You can do that by deactivating the current chart by pressing the power button icon in the top right.

In code, you can get rid of the current chart, and any others that are still around and maybe active, with plt.close("all").

In [55]:
plt.close("all")

Finding how to do more

Try starting at: Nicolas Rougier's matplotlib tutorial or the pyplot tutorial or the scipy lecture notes on matplotlib.

Most people couldn't possibly write code to produce a good plot by themselves. What you do is use Google or go to something like Nicolas Rougier's page, find a plot that looks roughly like what you want and then fiddle with matplotlib commands until you get what you want (or fail trying). It's the same as what people do in R (in my experience).

Bar charts … in easy steps

In [22]:
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), ages.values())
Out[22]:
<Container object of 4 artists>
In [24]:
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), ages.values(), align='center')
Out[24]:
<Container object of 4 artists>
In [27]:
ages = {'sue':21, 'billy': 33, 'niawen':11, 'neha':43}
plt.bar(range(len(ages)), ages.values(), align='center')
plt.xticks(range(len(ages)), list(ages.keys()))
plt.title('Ages of people')
Out[27]:
<matplotlib.text.Text at 0x1102cf780>

Making it prettier with OrderedDict and the operator library

We can put the columns in some order by using an OrderedDict rather than a plain dict. We can specify how to order the columns either using fancy stuff that we haven't covered (either list comprehensions or lambda functions) or using the operator library, which is part of Python's standard library. We'll use the latter here.
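Before plotting, it may help to see the sorting step on its own. This is just a sketch using the same ages dict as above, with no charting involved; the variable names by_age, by_name and oldest_first are made up for illustration.

```python
import operator

ages = {'sue': 21, 'billy': 33, 'niawen': 11, 'neha': 43}

# ages.items() gives (name, age) pairs; itemgetter(1) picks out the age,
# so the pairs get sorted by age rather than by name.
by_age = sorted(ages.items(), key=operator.itemgetter(1))
print(by_age)

# Without a key, the pairs are compared name first, giving alphabetical order.
by_name = sorted(ages.items())
print(by_name)

# reverse=True flips the order, for example to put the biggest value first.
oldest_first = sorted(ages.items(), key=operator.itemgetter(1), reverse=True)
print(oldest_first)
```

Handing any of these sorted lists of pairs to collections.OrderedDict gives a dict whose keys come out in that order.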

In [31]:
import operator
import collections

# oages = collections.OrderedDict(sorted(ages.items()))
oages = collections.OrderedDict(sorted(ages.items(), key=operator.itemgetter(1)))
plt.bar(range(len(oages)), oages.values(), align='center')
plt.xticks(range(len(oages)), list(oages.keys()))
Out[31]:
([<matplotlib.axis.XTick at 0x10f635dd8>,
  <matplotlib.axis.XTick at 0x11073c6a0>,
  <matplotlib.axis.XTick at 0x11072b630>,
  <matplotlib.axis.XTick at 0x1109d2780>],
 <a list of 4 Text xticklabel objects>)

Since our charts below will have a lot of bars, it will be useful to draw them sideways. We can turn things sideways by instead using barh() and now specifying the labels on the Y axis.

In [33]:
plt.barh(range(len(ages)), ages.values(), align='center')
plt.yticks(range(len(ages)), list(ages.keys()))
plt.xlabel('Age')
plt.title('People\'s ages')
Out[33]:
<matplotlib.text.Text at 0x110a13d30>

Assignment 4

Lengthy introduction

One of my favorite linguists is Bruce Hayes. Here's a picture of him. He's always been a favorite of mine: I remember really well his reading poetry and chanting folk songs while teaching about metrical structure at the 1991 Linguistic Institute at Santa Cruz.

However, over the years, I've become even fonder of him with his really interesting work in exploring empirical and even computational approaches to phonology. We're not really going to do any heavy duty phonology here, but this assignment is inspired by Bruce's work on Hungarian vowel harmony.

Many languages have systems of vowel harmony where affixes on words must contain certain vowels based on the vowels that precede or follow them. For example, there are verb suffixes in Hungarian like -ok/-ek/-ök, where the choice between the back, front unrounded, and front rounded variants depends on the preceding vowel:

  • lát becomes látok 'I see'
  • szeret becomes szeretek 'I like'
  • ül becomes ülök 'I sit'

Vowel harmony is something that you find not only in Hungarian but in many other languages, from nearby but unrelated ones like Turkish to far away ones like many Australian languages, such as Warlpiri. It's a somewhat strange phenomenon. A language loses information density by having vowel harmony (you can distinguish fewer words of a certain length) but gains in distinguishability (you get multiple chances to hear a sound right) and in having a harmonious sound.

Vowel harmony doesn't always happen. Hungarian also has suffixes that do not show vowel harmony. An example is -ért, the causal-final case:

  • az + ért = azért 'for that reason, therefore'
  • ez + ért = ezért 'for this reason, herefore'

Stems need not show harmony, but tend to. Bruce has been interested in the idea that there tend to be soft phonotactic constraints in languages – not hard rules, but tendencies – and in Hungarian vowel harmony. Here are a couple of papers: Comparative Phonotactics, Natural and Unnatural Constraints in Hungarian Vowel Harmony, and you can find a lot of other interesting stuff on his webpage.

We're going to see how strong the overall preference for vowel harmony in words is in Hungarian.

First, let's learn a bit about Hungarian vowels. We're just going to use Hungarian orthography here, rather than converting things into phonological representations with IPA. Fortunately, Hungarian orthography is fairly phonemic. Below is a regex for the set of all Hungarian vowels.

They divide in matched pairs as short and long vowels. Long vowels all have an acute (or two) above them. There are then back vowels and front vowels. Finally, short [i] and long [í] are sort of special. They are central vowels.

In [3]:
vowels = '[aeioöuüáéíóőúű]' 

short_vowels = '[aeioöuü]'
long_vowels = '[áéíóőúű]'
front_vowels = '[eöüéőű]'
back_vowels = '[aouáóú]'
central_vowels = '[ií]'

short_front_vowels = '[eöü]'  # f
long_front_vowels = '[éőű]'   # F
short_back_vowels = '[aou]'   # b
long_back_vowels = '[áóú]'    # B
short_central_vowels = '[i]'  # c
long_central_vowels = '[í]'   # C
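To see these character classes at work, here is a small sketch that pulls the vowels out of the example words from above and checks whether they all come from one class. The helper name classify and its labels are made up for illustration; the regexes are copied from the cell above.

```python
import re

vowels = '[aeioöuüáéíóőúű]'
front_vowels = '[eöüéőű]'
back_vowels = '[aouáóú]'


def classify(word):
    # Keep only the vowels of the word, in order.
    only_vowels = ''.join(re.findall(vowels, word))
    # A class followed by + matches one or more characters from that class,
    # and fullmatch() insists that the whole string fits the pattern.
    if re.fullmatch(back_vowels + '+', only_vowels):
        return only_vowels, 'all back'
    if re.fullmatch(front_vowels + '+', only_vowels):
        return only_vowels, 'all front'
    return only_vowels, 'mixed or central'


for word in ['látok', 'szeretek', 'ülök', 'azért']:
    print(word, classify(word))
```

The three harmonizing verb forms come out as all back or all front, while azért, with its non-harmonizing suffix, comes out as mixed.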

Q1: What we want to do is some counts of words and see what patterns we observe with different patterns of vowels in these classes. First, we'll need some Hungarian text. There's some Hungarian text, one sentence per line in the file 1984-hungarian.txt. (It's the translation of the novel "1984".)

Let's write a program that does these things:

  1. Opens the file 1984-hungarian.txt
  2. Reads each line and deletes punctuation, including at least [,-.?!;:]
    • Remember that you can replace a whole bunch of things that match a regex with the sub() function
  3. lowercases the string
  4. splits it into words (on whitespace)
  5. deletes all the consonants in each word
  6. some words have no vowels, so ignore them and go on (this includes numbers like "1984" and the word "s" which means "and")
  7. recodes the vowels into the six vowel classes shown above (f, F, b, B, c, C)
  8. counts how often each vowel pattern occurs for a word. Organize these counts by the number of vowels, so things are easy to print out for two vowel, three vowel, etc. words. Keep them in some data object.
  9. Also just count how many occurrences of each vowel class there are. It'd also be useful to know the overall frequency of different vowels when examining whether patterns of vowels are distinctively common or not. Keep these counts in some data object.
  10. Close the file when done
In [56]:
# Write a program to do all that here!
%matplotlib notebook

import matplotlib.pyplot as plt
import re
import pprint

counts = {}
vowel_counts = {}

# #1
file = open('1984-hungarian.txt', encoding='utf-8') 
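# The hyphen is escaped so that it is a literal '-' and not a range like a-z
punct = re.compile(r'[,\-.?!;:]')
# Turn '[...]' into '[^...]', which matches anything that is NOT a vowel
nonvowels = vowels[:1] + '^' + vowels[1:]
consonant_re = re.compile(nonvowels)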

for line in file:
    line = line.strip()
    # #2
    line = punct.sub('', line)
    # #3
    line = line.lower()
    # #4
    words = line.split()
    for word in words:
        # #5
        word = consonant_re.sub('', word)
        # #6
        if word == '':
            continue
        # #7
        word = re.sub(short_front_vowels, 'f', word)
        word = re.sub(long_front_vowels, 'F', word)
        word = re.sub(short_back_vowels, 'b', word)
        word = re.sub(long_back_vowels, 'B', word)
        word = re.sub(short_central_vowels, 'c', word)
        word = re.sub(long_central_vowels, 'C', word)
        # #8
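        num_vowels = len(word)
        if num_vowels not in counts:
            counts[num_vowels] = {}
        # Avoid calling a variable 'dict': that hides Python's built-in dict
        pattern_counts = counts[num_vowels]
        pattern_counts[word] = pattern_counts.get(word, 0) + 1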
        # #9
        for c in word:
            vowel_counts[c] = vowel_counts.get(c, 0) + 1
            
# #10
file.close()
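As an aside, here is an alternative sketch of the same counting idea. It is not the approach used in the cell above: it swaps the regex substitutions for a lookup dict and uses collections.Counter and collections.defaultdict to do the bookkeeping. The names vowel_class, vowel_pattern, counts_by_length and class_counts are made up, and sample_lines is an invented stand-in for the file, not text from the novel.

```python
import collections

# Map every vowel to its class letter, using the same six classes as above.
vowel_class = {}
for letters, code in [('eöü', 'f'), ('éőű', 'F'), ('aou', 'b'),
                      ('áóú', 'B'), ('i', 'c'), ('í', 'C')]:
    for letter in letters:
        vowel_class[letter] = code


def vowel_pattern(word):
    # Keep only the characters that are vowels, recoded to their class letter.
    # Consonants, digits and punctuation simply drop out.
    pattern = ''
    for ch in word.lower():
        if ch in vowel_class:
            pattern += vowel_class[ch]
    return pattern


# Stand-in for the file: two made-up lines.
sample_lines = ['Látok, szeretek.', 'Azért ülök!']

counts_by_length = collections.defaultdict(collections.Counter)
class_counts = collections.Counter()
for line in sample_lines:
    for word in line.split():
        pattern = vowel_pattern(word)
        if pattern == '':
            continue
        counts_by_length[len(pattern)][pattern] += 1
        class_counts.update(pattern)   # counts each class letter in the pattern

print(dict(counts_by_length))
print(class_counts)
```

A Counter starts every key at zero, so there is no need for the get(word, 0) + 1 idiom, and the defaultdict creates the inner Counter the first time a new word length turns up.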
    

Then we will want to be able to:

  1. Print out a table of the counts of each vowel pattern for a certain length of words (two vowel words, three vowel words, etc.). For example, we might find that the text had 300 2 syllable words with just short central vowels, which would have the pattern 'cc' but only 20 with long central vowels and the pattern 'CC'
  2. Print out the overall distribution of vowels by class
  3. Draw charts!

Q2: Try printing out some of the counts below and look at the data

In [57]:
# Code to print some of the counts for word vowel patterns and overall vowel frequency
pprint.pprint(counts[1])
{'B': 1887, 'C': 187, 'F': 2598, 'b': 14391, 'c': 2013, 'f': 5839}
In [58]:
# Code to print some of the counts for word vowel patterns and overall vowel frequency
pprint.pprint(vowel_counts)
vowel_total = sum(vowel_counts.values())
vowel_percents = {}
for v in vowel_counts.keys():
    vowel_percents[v] = vowel_counts[v] / vowel_total * 100
pprint.pprint(vowel_percents)
{'B': 19521, 'C': 2150, 'F': 19503, 'b': 67810, 'c': 18559, 'f': 56932}
{'B': 10.581921669602927,
 'C': 1.1654695758232823,
 'F': 10.57216424989836,
 'b': 36.75836834259385,
 'c': 10.060441794281068,
 'f': 30.861634367800516}
In [59]:
print('=== 2 vowel words ===')
pprint.pprint(counts[2])
print('=== 3 vowel words ===')
pprint.pprint(counts[3])
=== 2 vowel words ===
{'BB': 369,
 'BC': 23,
 'BF': 53,
 'Bb': 1380,
 'Bc': 221,
 'Bf': 32,
 'CB': 46,
 'CF': 23,
 'Cb': 85,
 'Cc': 14,
 'Cf': 129,
 'FB': 175,
 'FC': 3,
 'FF': 329,
 'Fb': 149,
 'Fc': 284,
 'Ff': 1322,
 'bB': 1692,
 'bC': 55,
 'bF': 87,
 'bb': 4993,
 'bc': 650,
 'bf': 370,
 'cB': 260,
 'cC': 6,
 'cF': 306,
 'cb': 1300,
 'cc': 251,
 'cf': 975,
 'fB': 147,
 'fC': 9,
 'fF': 1619,
 'fb': 142,
 'fc': 742,
 'ff': 3464}
=== 3 vowel words ===
{'BBB': 45,
 'BBb': 215,
 'BBc': 17,
 'BCB': 15,
 'BCb': 19,
 'BCc': 1,
 'BFB': 7,
 'BFF': 9,
 'BFb': 18,
 'BFc': 3,
 'BFf': 26,
 'BbB': 262,
 'BbC': 1,
 'BbF': 17,
 'Bbb': 477,
 'Bbc': 81,
 'Bbf': 11,
 'BcB': 6,
 'BcF': 3,
 'Bcb': 47,
 'Bcc': 6,
 'Bcf': 29,
 'BfF': 84,
 'Bfc': 15,
 'Bff': 63,
 'CBB': 6,
 'CBb': 30,
 'CBc': 1,
 'CCF': 2,
 'CCf': 1,
 'CFF': 16,
 'CFc': 4,
 'CFf': 25,
 'CbB': 4,
 'Cbb': 13,
 'Cbc': 8,
 'Ccb': 4,
 'Ccf': 2,
 'CfF': 8,
 'Cfc': 6,
 'Cff': 25,
 'FBB': 9,
 'FBF': 1,
 'FBb': 109,
 'FCF': 7,
 'FCb': 3,
 'FCf': 6,
 'FFB': 1,
 'FFF': 45,
 'FFb': 4,
 'FFc': 17,
 'FFf': 215,
 'FbB': 38,
 'FbF': 3,
 'Fbb': 59,
 'Fbc': 11,
 'FcB': 2,
 'FcF': 10,
 'Fcb': 19,
 'Fcc': 1,
 'Fcf': 21,
 'FfB': 2,
 'FfC': 1,
 'FfF': 146,
 'Ffb': 1,
 'Ffc': 188,
 'Fff': 715,
 'bBB': 211,
 'bBC': 2,
 'bBF': 21,
 'bBb': 771,
 'bBc': 102,
 'bBf': 6,
 'bCB': 13,
 'bCb': 46,
 'bFB': 6,
 'bFF': 3,
 'bFb': 22,
 'bFf': 61,
 'bbB': 654,
 'bbC': 1,
 'bbF': 37,
 'bbb': 2000,
 'bbc': 599,
 'bbf': 140,
 'bcB': 23,
 'bcF': 19,
 'bcb': 350,
 'bcf': 304,
 'bfB': 9,
 'bfC': 7,
 'bfF': 43,
 'bfb': 33,
 'bfc': 46,
 'bff': 323,
 'cBB': 45,
 'cBF': 3,
 'cBb': 222,
 'cBc': 25,
 'cCB': 2,
 'cCb': 2,
 'cFF': 38,
 'cFb': 4,
 'cFc': 7,
 'cFf': 118,
 'cbB': 135,
 'cbF': 12,
 'cbb': 387,
 'cbc': 44,
 'cbf': 12,
 'ccB': 10,
 'ccC': 1,
 'ccF': 7,
 'ccb': 30,
 'ccc': 1,
 'ccf': 24,
 'cfB': 1,
 'cfC': 2,
 'cfF': 138,
 'cfb': 22,
 'cfc': 164,
 'cff': 385,
 'fBB': 45,
 'fBF': 3,
 'fBb': 171,
 'fBc': 6,
 'fCB': 13,
 'fCF': 21,
 'fCb': 17,
 'fCc': 6,
 'fCf': 34,
 'fFB': 7,
 'fFC': 7,
 'fFF': 187,
 'fFb': 4,
 'fFc': 124,
 'fFf': 942,
 'fbB': 139,
 'fbC': 3,
 'fbb': 360,
 'fbc': 43,
 'fcB': 10,
 'fcF': 23,
 'fcb': 43,
 'fcc': 10,
 'fcf': 158,
 'ffB': 19,
 'ffC': 4,
 'ffF': 669,
 'ffb': 18,
 'ffc': 334,
 'fff': 2126}
In [61]:
def table_of_observed_expected(dic):
    print('This compares observed versus expected counts of vowels')
    print('pattern observed expected ratio')
    print('-------------------------------')
    num_k_syl = sum(dic.values())
    for k in sorted(dic.keys()):
        expected = num_k_syl
        for v in k:
            expected = expected * (vowel_counts[v] / vowel_total)
        # This is the fancier way to print things nicely with .format()!
        print('{}:  {:>6}  {:>8.2f}  {:>8.2f}'.format(
                k, dic[k], expected, dic[k] / expected))
    
table_of_observed_expected(counts[2])
This compares observed versus expected counts of vowels
pattern observed expected ratio
-------------------------------
BB:     369    243.05      1.52
BC:      23     26.77      0.86
BF:      53    242.82      0.22
Bb:    1380    844.27      1.63
Bc:     221    231.07      0.96
Bf:      32    708.83      0.05
CB:      46     26.77      1.72
CF:      23     26.74      0.86
Cb:      85     92.99      0.91
Cc:      14     25.45      0.55
Cf:     129     78.07      1.65
FB:     175    242.82      0.72
FC:       3     26.74      0.11
FF:     329    242.60      1.36
Fb:     149    843.49      0.18
Fc:     284    230.86      1.23
Ff:    1322    708.18      1.87
bB:    1692    844.27      2.00
bC:      55     92.99      0.59
bF:      87    843.49      0.10
bb:    4993   2932.73      1.70
bc:     650    802.66      0.81
bf:     370   2462.27      0.15
cB:     260    231.07      1.13
cC:       6     25.45      0.24
cF:     306    230.86      1.33
cb:    1300    802.66      1.62
cc:     251    219.68      1.14
cf:     975    673.90      1.45
fB:     147    708.83      0.21
fC:       9     78.07      0.12
fF:    1619    708.18      2.29
fb:     142   2462.27      0.06
fc:     742    673.90      1.10
ff:    3464   2067.27      1.68
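To make the "expected" column concrete, here is the arithmetic for the single pattern 'bb' worked by hand, as a sketch. The numbers are copied from the outputs above (21705 is the sum of all the 2 vowel pattern counts); the variable names are made up. The expected count assumes that each vowel in a word is chosen independently with its overall frequency.

```python
# Counts copied from the output above.
vowel_counts = {'B': 19521, 'C': 2150, 'F': 19503,
                'b': 67810, 'c': 18559, 'f': 56932}
vowel_total = sum(vowel_counts.values())   # 184475 vowels in all
num_two_vowel_words = 21705                # sum of the 2 vowel pattern counts
observed_bb = 4993

# If vowels were chosen independently of each other, the chance of 'bb'
# would be P(b) * P(b), so we expect that fraction of all 2 vowel words.
p_b = vowel_counts['b'] / vowel_total
expected_bb = num_two_vowel_words * p_b * p_b

# Same format string as in table_of_observed_expected():
# {:>6} right-aligns in 6 columns, {:>8.2f} right-aligns with 2 decimals.
row = '{}:  {:>6}  {:>8.2f}  {:>8.2f}'.format(
    'bb', observed_bb, expected_bb, observed_bb / expected_bb)
print(row)
```

A ratio above 1 means the pattern turns up more often than independence would predict, and a ratio below 1 means it turns up less often.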
In [62]:
table_of_observed_expected(counts[1])
print()
table_of_observed_expected(counts[3])
This compares observed versus expected counts of vowels
pattern observed expected ratio
-------------------------------
B:    1887   2848.12      0.66
C:     187    313.69      0.60
F:    2598   2845.50      0.91
b:   14391   9893.51      1.45
c:    2013   2707.77      0.74
f:    5839   8306.41      0.70

This compares observed versus expected counts of vowels
pattern observed expected ratio
-------------------------------
BBB:      45     19.56      2.30
BBb:     215     67.94      3.16
BBc:      17     18.59      0.91
BCB:      15      2.15      6.96
BCb:      19      7.48      2.54
BCc:       1      2.05      0.49
BFB:       7     19.54      0.36
BFF:       9     19.52      0.46
BFb:      18     67.87      0.27
BFc:       3     18.58      0.16
BFf:      26     56.99      0.46
BbB:     262     67.94      3.86
BbC:       1      7.48      0.13
BbF:      17     67.87      0.25
Bbb:     477    235.99      2.02
Bbc:      81     64.59      1.25
Bbf:      11    198.13      0.06
BcB:       6     18.59      0.32
BcF:       3     18.58      0.16
Bcb:      47     64.59      0.73
Bcc:       6     17.68      0.34
Bcf:      29     54.23      0.53
BfF:      84     56.99      1.47
Bfc:      15     54.23      0.28
Bff:      63    166.35      0.38
CBB:       6      2.15      2.79
CBb:      30      7.48      4.01
CBc:       1      2.05      0.49
CCF:       2      0.24      8.44
CCf:       1      0.69      1.45
CFF:      16      2.15      7.44
CFc:       4      2.05      1.96
CFf:      25      6.28      3.98
CbB:       4      7.48      0.53
Cbb:      13     25.99      0.50
Cbc:       8      7.11      1.12
Ccb:       4      7.11      0.56
Ccf:       2      5.97      0.33
CfF:       8      6.28      1.27
Cfc:       6      5.97      1.00
Cff:      25     18.32      1.36
FBB:       9     19.54      0.46
FBF:       1     19.52      0.05
FBb:     109     67.87      1.61
FCF:       7      2.15      3.26
FCb:       3      7.48      0.40
FCf:       6      6.28      0.96
FFB:       1     19.52      0.05
FFF:      45     19.50      2.31
FFb:       4     67.81      0.06
FFc:      17     18.56      0.92
FFf:     215     56.93      3.78
FbB:      38     67.87      0.56
FbF:       3     67.81      0.04
Fbb:      59    235.77      0.25
Fbc:      11     64.53      0.17
FcB:       2     18.58      0.11
FcF:      10     18.56      0.54
Fcb:      19     64.53      0.29
Fcc:       1     17.66      0.06
Fcf:      21     54.18      0.39
FfB:       2     56.99      0.04
FfC:       1      6.28      0.16
FfF:     146     56.93      2.56
Ffb:       1    197.95      0.01
Ffc:     188     54.18      3.47
Fff:     715    166.19      4.30
bBB:     211     67.94      3.11
bBC:       2      7.48      0.27
bBF:      21     67.87      0.31
bBb:     771    235.99      3.27
bBc:     102     64.59      1.58
bBf:       6    198.13      0.03
bCB:      13      7.48      1.74
bCb:      46     25.99      1.77
bFB:       6     67.87      0.09
bFF:       3     67.81      0.04
bFb:      22    235.77      0.09
bFf:      61    197.95      0.31
bbB:     654    235.99      2.77
bbC:       1     25.99      0.04
bbF:      37    235.77      0.16
bbb:    2000    819.76      2.44
bbc:     599    224.36      2.67
bbf:     140    688.25      0.20
bcB:      23     64.59      0.36
bcF:      19     64.53      0.29
bcb:     350    224.36      1.56
bcf:     304    188.37      1.61
bfB:       9    198.13      0.05
bfC:       7     21.82      0.32
bfF:      43    197.95      0.22
bfb:      33    688.25      0.05
bfc:      46    188.37      0.24
bff:     323    577.84      0.56
cBB:      45     18.59      2.42
cBF:       3     18.58      0.16
cBb:     222     64.59      3.44
cBc:      25     17.68      1.41
cCB:       2      2.05      0.98
cCb:       2      7.11      0.28
cFF:      38     18.56      2.05
cFb:       4     64.53      0.06
cFc:       7     17.66      0.40
cFf:     118     54.18      2.18
cbB:     135     64.59      2.09
cbF:      12     64.53      0.19
cbb:     387    224.36      1.72
cbc:      44     61.41      0.72
cbf:      12    188.37      0.06
ccB:      10     17.68      0.57
ccC:       1      1.95      0.51
ccF:       7     17.66      0.40
ccb:      30     61.41      0.49
ccc:       1     16.81      0.06
ccf:      24     51.55      0.47
cfB:       1     54.23      0.02
cfC:       2      5.97      0.33
cfF:     138     54.18      2.55
cfb:      22    188.37      0.12
cfc:     164     51.55      3.18
cff:     385    158.15      2.43
fBB:      45     57.04      0.79
fBF:       3     56.99      0.05
fBb:     171    198.13      0.86
fBc:       6     54.23      0.11
fCB:      13      6.28      2.07
fCF:      21      6.28      3.35
fCb:      17     21.82      0.78
fCc:       6      5.97      1.00
fCf:      34     18.32      1.86
fFB:       7     56.99      0.12
fFC:       7      6.28      1.12
fFF:     187     56.93      3.28
fFb:       4    197.95      0.02
fFc:     124     54.18      2.29
fFf:     942    166.19      5.67
fbB:     139    198.13      0.70
fbC:       3     21.82      0.14
fbb:     360    688.25      0.52
fbc:      43    188.37      0.23
fcB:      10     54.23      0.18
fcF:      23     54.18      0.42
fcb:      43    188.37      0.23
fcc:      10     51.55      0.19
fcf:     158    158.15      1.00
ffB:      19    166.35      0.11
ffC:       4     18.32      0.22
ffF:     669    166.19      4.03
ffb:      18    577.84      0.03
ffc:     334    158.15      2.11
fff:    2126    485.15      4.38

Q3: Now define a function bar_graph that will produce a horizontal bar graph for a single dict.

In [63]:
# Code to produce a bar graph

def bar_graph(d):
    # The parameter is called d so that it doesn't hide Python's built-in dict
    plt.barh(range(len(d)), list(d.values()), align='center')
    plt.yticks(range(len(d)), list(d.keys()))
    plt.show()

You should then be able to call it on the dict you made for all 2 vowel words, and again for all 3 vowel words. And for the overall vowel distribution. You'll need to substitute in the right variable names for my placeholders x, etc.

In [43]:
bar_graph(counts[2])
In [44]:
bar_graph(counts[3])
In [39]:
bar_graph(vowel_counts)

Q4: It'd be handy to be able to sort the keys by their values. Try to make a plot that sorts the bars by their value (pattern frequency) using OrderedDict and the operator library as shown at the beginning!

In [64]:
import collections
import operator

def bar_graph_sorted(d):
    # Sort the (pattern, count) pairs by count, smallest first
    odict = collections.OrderedDict(sorted(d.items(), key=operator.itemgetter(1)))
    plt.barh(range(len(odict)), list(odict.values()), align='center')
    plt.yticks(range(len(odict)), list(odict.keys()))
    plt.show()

We can then call it to display some graphs. Remember to substitute in your variable names.

In [49]:
bar_graph_sorted(vowel_counts)
In [50]:
bar_graph_sorted(counts[1])
In [51]:
bar_graph_sorted(counts[2])
In [52]:
bar_graph_sorted(counts[3])

Q5: Final question: What observations can you make about the frequencies of different vowel patterns in Hungarian words? Maybe you could write a paragraph of about 10 lines. Not an essay!

For the overall vowel counts, short vowels are much more common than long vowels - about 3.5 times as common. For two and three vowel words, all the most common patterns by count are patterns that involve all front or all back vowels (bb, bB, etc.). It should be remembered that it is not too surprising that these patterns are more common than ones with central vowels in them, since there is only 1 central vowel of each length versus 3 front and 3 back vowels of each length. Given that, we note that central vowels freely occur with both front and back vowels with a similar frequency. What is uncommon is words that mix front and back vowels. It's not that they're impossible - they certainly occur. But their occurrence is marked. If we work out expected counts for vowel patterns based on overall vowel frequency, we see that patterns that are all (back or central) or all (front or central) usually occur twice or more as often as expected, whereas patterns that combine a front and a back vowel commonly occur 1/3 or less often than expected. There are some other patterns in the data, too: It seems to be uncommon for the last syllable to contain a long central vowel. And there are a few patterns that combine back and front vowels that do actually occur more often than expected, such as "BfF" and "bcf". It's not clear why that is.
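The short-versus-long comparison can be checked directly from the overall vowel counts printed earlier. This is just a sketch of the arithmetic, with the counts copied from that output and made-up variable names.

```python
vowel_counts = {'B': 19521, 'C': 2150, 'F': 19503,
                'b': 67810, 'c': 18559, 'f': 56932}

short_total = 0
long_total = 0
for code in vowel_counts:
    # Lowercase codes (f, b, c) are short vowels; uppercase (F, B, C) are long.
    if code.islower():
        short_total += vowel_counts[code]
    else:
        long_total += vowel_counts[code]

ratio = short_total / long_total
print(short_total, long_total, round(ratio, 2))
```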

In [ ]: