Tag Archives: R

Wordclouds and Space

Wordclouds are widely popular, yet their use of space is generally arbitrary, or at least does not express any meaningful relationship that could provide insight into the data. Drew Conway recently not only suggested an alternative take on wordclouds, but went right ahead and implemented it in R. The core idea is to visually compare two bodies of text and to represent both the frequencies and the differences in word use spatially. Word size represents frequency as before, and the spread along the x-axis represents the difference in use between the two texts.

I used the same code to compare the commencement speeches from Steve Jobs and Oprah Winfrey.

Obviously there is a lot of overlapping vocabulary. What’s important to note in reading the graph is that in this case the word “life” is used often by both speakers, but somewhat more often by Steve Jobs.

There are certainly limits and drawbacks, both methodological (it only represents words used by both parties) and visual (for example, overlapping words and large bodies of text). But the core idea, namely to take advantage of spatial arrangement as a device that adds meaning to a visualization, is inspiring.
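To make the layout idea concrete, here is a minimal sketch in base R of the data behind such a comparison cloud. It is not Drew Conway's code: the two texts are invented toy strings, and the names `word_freq`, `cloud`, `size` and `x` are hypothetical. It also assumes that raw count differences are good enough for positioning; for texts of very different length, relative frequencies would probably be the better choice.

```r
# Toy texts, invented for illustration (not the actual speeches)
text_a <- "stay hungry stay foolish and love your life and your work"
text_b <- "love your life and trust your life story"

# Split a text into lower-case words and count each word
word_freq <- function(txt) {
  words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  table(words[nzchar(words)])
}

freq_a <- word_freq(text_a)
freq_b <- word_freq(text_b)

# Keep only the words used in both texts (the limitation noted above)
shared <- intersect(names(freq_a), names(freq_b))

cloud <- data.frame(
  word   = shared,
  freq_a = as.integer(freq_a[shared]),
  freq_b = as.integer(freq_b[shared])
)

# Combined frequency drives the word size
cloud$size <- cloud$freq_a + cloud$freq_b
# Difference drives the x position: negative leans towards text A,
# positive towards text B, zero sits in the middle
cloud$x <- cloud$freq_b - cloud$freq_a

print(cloud)
```

From here, the words could be drawn with base R's `text()`, using `x` for the horizontal position and `size` to scale `cex`.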

Learning the Grammar of Graphics with R

During the summer I usually spend a lot of hours locked up at an altitude of 30,000+ feet, and this year I took ggplot2: Elegant Graphics for Data Analysis as reading material. ggplot2 is a data visualization package for the R statistical analysis platform. It is loosely based on “The Grammar of Graphics” by Leland Wilkinson, and thus takes a different approach from traditional graphics packages by very explicitly mapping the data to aesthetic attributes (e.g. colors) and geometric objects (e.g. points).
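As a small illustration of that mapping, here is a sketch using the `mtcars` data set that ships with R. The choice of data set and columns is mine, purely for demonstration. `aes()` declares which columns map to which aesthetic attributes, and `geom_point()` picks the geometric object.

```r
library(ggplot2)

# mtcars ships with R: wt = weight, mpg = miles per gallon, cyl = cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()  # geometric object: one point per row of the data

# print(p) renders the plot on the active graphics device
```

Swapping `geom_point()` for another geom changes how the data are drawn without touching the mapping itself, which is the main difference from traditional plotting functions.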

Here is my first attempt to use the ggplot2 package. I was interested in the change of the mean population center of the US between 1790 and 2000, similar to the map that is put out by the Census Bureau, but specifically looking at the initially African, then African-American population.

I downloaded census data and county outlines from the National Historical Geographic Information System website, merged the data for each census year on the county level, and calculated the weighted mean for each census year. (Data are
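The merge and weighted-mean steps described above could look roughly like the following sketch. It is not the code used for the map: the county identifiers, coordinates and population counts are made up, the column names are hypothetical, and it assumes that each county can be reduced to a single centroid and that a plain population-weighted average of longitude and latitude is an acceptable approximation of the mean center.

```r
# Hypothetical county centroids (made-up coordinates)
centroids <- data.frame(
  county = c("A", "B"),
  lon    = c(-77, -80),
  lat    = c(38, 34)
)

# Hypothetical population counts per county and census year
pops <- data.frame(
  county = c("A", "B", "A", "B"),
  year   = c(1790, 1790, 1800, 1800),
  pop    = c(100, 300, 100, 100)
)

# Merge on the county level
counties <- merge(pops, centroids, by = "county")

# Population-weighted mean of the coordinates for one census year
mean_center <- function(d) {
  data.frame(
    year = d$year[1],
    lon  = weighted.mean(d$lon, d$pop),
    lat  = weighted.mean(d$lat, d$pop)
  )
}

# One mean center per census year
centers <- do.call(rbind, lapply(split(counties, counties$year), mean_center))
print(centers)
```

The resulting data frame has one row per census year, so it can be handed to ggplot2 with longitude and latitude mapped to `x` and `y`.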