Government Document Classification

After stepping down as Microsoft's second CEO, Steve Ballmer pursued several new projects that reflected his more personal interests. One of these was founding USAFacts, a non-profit organization seeking to provide "a data-driven portrait of the American population and government's impact on society". One area USAFacts has sought to explore is the analysis of legislative actions, or bills. Thousands of bills are introduced each year across legislative bodies; in the last Congress (the 116th) alone, more than 14,000 bills were introduced.

For the past two years, the USAFacts team has manually built a dataset counting the legislative actions our government takes across a variety of topical areas, such as healthcare, immigration, and energy and environment. This process is labor-intensive and costly, and it ultimately limits how many documents the team can categorize.

In this project, I looked for an effective way to classify legislative documents with a trained NLP model. I found that a fastText model achieved an accuracy of 84% on a custom-scraped dataset of 10,000 documents, outperforming the other models I tried, including a hierarchical attention network and a convolutional neural network.
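To make the fastText approach concrete, here is a minimal sketch of how labeled bills could be prepared for fastText's supervised mode, which expects one example per line with the label prefixed by `__label__`. The sample bills, topic names, helper functions, and file name are all hypothetical and chosen for illustration; this is not the pipeline or data used in the project.

```python
import re
from pathlib import Path


def to_fasttext_line(topic: str, text: str) -> str:
    """Format one document as a fastText supervised training line."""
    # Labels must be a single token, so spaces become underscores.
    label = "__label__" + re.sub(r"\s+", "_", topic.strip().lower())
    # Collapse newlines and repeated spaces: one example per line.
    body = " ".join(text.split())
    return f"{label} {body}"


def write_training_file(examples, path):
    """Write (topic, text) pairs to a fastText-formatted training file."""
    lines = [to_fasttext_line(topic, text) for topic, text in examples]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def train(path):
    """Train a supervised classifier (requires the `fasttext` package)."""
    import fasttext  # imported lazily so the formatting code runs without it
    return fasttext.train_supervised(input=str(path))


# Hypothetical examples, invented for illustration only.
examples = [
    ("Healthcare", "A bill to expand access to rural health clinics."),
    ("Energy and Environment", "A bill to fund\nsolar energy research."),
]

for topic, text in examples:
    print(to_fasttext_line(topic, text))
```

With the `fasttext` package installed, calling `write_training_file(examples, "bills.train")` followed by `train("bills.train")` would return a model whose `predict` method takes the text of a bill and returns the most likely label. A real training set would of course need far more than two examples.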