Written by Lisa Yan and Chris Piech
The Federalist Papers were a collection of 85 essays advocating ratification of the US Constitution. The pseudonymous author "Publius" was actually three people: Alexander Hamilton, James Madison, and John Jay.
Using probability, we can determine who wrote each of the essays in the Federalist Papers by analyzing the probability of the words in the essay and comparing them against the word distributions in known writings from Hamilton, Madison, and Jay. This approach is known more generally as the "Bag of Words" model: we ignore sentence structure and word ordering and compare only word frequencies.
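The word frequencies themselves are straightforward to compute. A minimal sketch of a bag-of-words count (using Python's `collections.Counter`; this is an illustration, not the course's helper code):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count word
    frequencies, ignoring order and sentence structure."""
    words = text.lower().split()
    return Counter(words)

counts = bag_of_words("To be or not to be")
print(counts["to"])  # 2
print(counts["be"])  # 2
```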
In this demo, we seek to decide whether James Madison or Alexander Hamilton was the author of Federalist No. 53, the fifty-third of The Federalist Papers (unknown.txt). We have two known writing samples from Madison (madison.txt) and Hamilton (hamilton.txt), from which we can generate author-specific word frequencies. We then model the unknown document as a multinomial, where each author has some probability of generating each word in the document, and these probabilities can differ by author. Given the document word frequencies, if the author is more likely to be Madison than Hamilton, we report Madison as the author.
Sources:
madison.txt: Federalist No. 10
hamilton.txt: Federalist No. 11
unknown.txt: Federalist No. 53

import csv
import operator
import math
# The below functions are not shown
from helper import makeWordProbMap, makeWordCountMap
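Since the helper implementations are not shown, here is a hypothetical sketch of what they might look like, assuming whitespace tokenization of lowercased text (the real helpers may tokenize differently):

```python
from collections import Counter

def makeWordCountMap(textfile):
    """Hypothetical sketch: map each word in the file to its count,
    and also return the total number of words."""
    with open(textfile) as f:
        words = f.read().lower().split()
    return Counter(words), len(words)

def makeWordProbMap(textfile):
    """Hypothetical sketch: map each word to its relative frequency,
    i.e., an estimate of P(word|writer) from that writer's sample."""
    countMap, nWords = makeWordCountMap(textfile)
    return {word: count / nWords for word, count in countMap.items()}
```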
Do once each for the two writers, Madison and Hamilton:
Create a probability lookup wordProbMap that stores $P(word|writer)$.
makeWordProbMap(textfile): creates a map of word -> probability.
Use getWordProb(wordProbMap, word) to return $P(word|writer)$, where $writer$ is the author corresponding to wordProbMap.

# Calculate all the ps and qs
madisonWordProb = makeWordProbMap("madison.txt")
hamiltonWordProb = makeWordProbMap("hamilton.txt")
"""
To use when retrieving words from our two word maps
Return probability of a given word.
If the word was not found, return some small probability epsilon.
"""
EPSILON = 0.000001
def getWordProb(wordProbMap, word):
if word in wordProbMap:
return wordProbMap[word]
return EPSILON
print("P(congress|madison) =", getWordProb(madisonWordProb, "congress"))
print("P(congress|hamilton) =", getWordProb(hamiltonWordProb, "congress"))
print("P(lisa|madison) =", getWordProb(madisonWordProb, "lisa"))
print("P(the|madison) =", getWordProb(madisonWordProb, "the"))
print("P(the|hamilton) =", getWordProb(hamiltonWordProb, "the"))
makeWordCountMap(textfile): returns a map of word -> count, along with the total number of words in the file.

unknownDocCount, nDocWords = makeWordCountMap('unknown.txt')
print('# words in unknown.txt:', nDocWords)
print('# unique words in unknown.txt:', len(unknownDocCount))
print("# of times \"congress\" appears in unknown.txt:", unknownDocCount["congress"])
print("# of times \"the\" appears in unknown.txt:", unknownDocCount["the"])
Bayes' Theorem says:
$P(writer|unknownDoc) = \dfrac{P(unknownDoc|writer)P(writer)}{P(unknownDoc)}$
However, since we compare the two writers by the ratio of their posteriors, many terms cancel: the denominator $P(unknownDoc)$ is the same for both writers, and if we assume equal priors $P(Madison) = P(Hamilton)$, the priors cancel as well. We are left with a ratio of likelihoods:
$\dfrac{P(unknownDoc|Madison)}{P(unknownDoc|Hamilton)} > 1 \rightarrow \text{Madison wrote document}$
The distribution of word counts in an unknown document (conditioned on knowing the writer) is a Multinomial RV. Since the multinomial coefficients are identical in both numerator and denominator, these also cancel.
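Concretely, writing $n$ for the document length, $c_i$ for the count of word $i$, and $p_{\text{M},i}$, $p_{\text{H},i}$ for the two writers' word probabilities, the shared multinomial coefficient cancels in the ratio:

$\dfrac{P(unknownDoc|Madison)}{P(unknownDoc|Hamilton)} = \dfrac{\binom{n}{c_1,\dots,c_m}\prod_{i=1}^m p_{\text{M},i}^{c_i}}{\binom{n}{c_1,\dots,c_m}\prod_{i=1}^m p_{\text{H},i}^{c_i}} = \prod_{i=1}^m \left(\dfrac{p_{\text{M},i}}{p_{\text{H},i}}\right)^{c_i}$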
Ultimately, we can compute a ratio of the product of probabilities of observing each word given each author wrote it:
$P(unknownDoc|Madison) \propto \prod_{i=1}^m \left( p_{\text{M}, i}^{\text{\# appearances of word }i \text{ in unknown}} \right)$
def calcProbDoc(wordProbMap, countMap):
    prob = 1
    for i, word_i in enumerate(countMap):
        c_i = countMap[word_i]
        p_i = getWordProb(wordProbMap, word_i)
        # Show the per-word contribution for the first 10 words
        if i < 10:
            print(word_i, "appeared", c_i, "times. prob:", math.pow(p_i, c_i))
        prob *= math.pow(p_i, c_i)
    print("------")
    return prob
print('P(doc|madison) is proportional to:')
pMadison = calcProbDoc(madisonWordProb, unknownDocCount)
print('P(doc|hamilton) is proportional to:')
pHamilton = calcProbDoc(hamiltonWordProb, unknownDocCount)
print('madison: \t\t',pMadison)
print('hamilton: \t\t', pHamilton)
print('madison/hamilton:\t',pMadison/pHamilton)
Multiplying many small probabilities leads to underflow.
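A quick illustration of the underflow, using a hypothetical document of 360 words in which each word has probability $10^{-6}$:

```python
import math

p_word = 1e-6   # per-word probability
n_words = 360   # document length

# The direct product is 1e-2160, far below the smallest positive
# double, so it underflows to exactly 0.0.
prob = p_word ** n_words
print(prob)     # 0.0

# The log-space version stays perfectly representable.
logprob = n_words * math.log(p_word)
print(logprob)  # about -4973.6
```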
A tractable version computes the sum of log probabilities.
An equivalent comparison would then be as follows:
$\log{P(unknownDoc|Madison)} - \log{P(unknownDoc|Hamilton)} > 0 \rightarrow \text{Madison wrote document},$
where, up to an additive constant (the log multinomial coefficient, which is identical for both writers),
$\log{P(unknownDoc|Madison)} = \sum_{i=1}^m \left( (\text{\# appearances of word }i \text{ in unknown}) \log( p_{\text{M}, i}) \right) + \text{const}$
def calcLogProbDoc(wordProbMap, countMap):
    logprob = 0
    for word_i in countMap:
        c_i = countMap[word_i]
        p_i = getWordProb(wordProbMap, word_i)
        # Sum log-probabilities instead of multiplying probabilities
        logprob += c_i * math.log(p_i)
    return logprob
logpMadison = calcLogProbDoc(madisonWordProb, unknownDocCount)
logpHamilton = calcLogProbDoc(hamiltonWordProb, unknownDocCount)
print('log madison: \t\t',logpMadison)
print('log hamilton: \t\t', logpHamilton)
print('log madison/hamilton:\t',logpMadison - logpHamilton)
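Assuming equal priors $P(Madison) = P(Hamilton)$, the log-likelihood difference can also be turned into a posterior probability for a writer. A sketch (the function name and example log-likelihood values below are made up for illustration):

```python
import math

def posterior_from_log_likelihoods(logpA, logpB):
    """P(A|doc) under equal priors, computed stably from
    log-likelihoods. Subtracting the larger log-likelihood
    before exponentiating avoids overflow/underflow."""
    m = max(logpA, logpB)
    pA = math.exp(logpA - m)
    pB = math.exp(logpB - m)
    return pA / (pA + pB)

# Example with made-up log-likelihoods:
print(posterior_from_log_likelihoods(-3171.0, -3176.0))  # ~0.993
```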
From Wikipedia: https://en.wikipedia.org/wiki/Federalist_No._53