Important Dates/Information

Due date: Friday, February 9, 2018 11:59PM . No late submission allowed! Please submit to blackboard.

Update to submission guidelines: Submit both Rscript, Rdata and a pdf of the answers to the exercises below. Please submit all files in such format: _lab#_mmdd.extension

The .R and .Rdata files will be the same as before. The PDF will now contain the question, your answer, and any plots. No code should be in the PDF. Naming convention should remain the same.

Failure to submit in such format will be penalized by 1 pt each (so you might lose a total of 3 points). Example: If your name is John Doe and today’s date is January 25, your submission files would be: Doe_lab3_0125.R, Doe_lab3_0125.Rdata, Doe_lab3_0125.pdf



What you should know and have done prior to this

  • How to google for help
  • Lab 4 (material from class and the assignment)
  • All the material given before lab 4

Learning Goals

  1. Gather data from published tables 1.1 Reading data dictionaries
  2. Aggregate data from disperate sources
  3. Basic plotting with ggplot
  4. Interpreting linear model output

Public Data

Having learned some of the basics of linear models and different types of regressions, we are going to take a step back this week and study application examples.

The goal of today’s lab will be to practice dealing with ‘real’ data, so let’s start with what we want to predict: how many creatives live in an area?

In quite a surprise to me, the USDA studies proprotions of populations who are employed in jobs which require creative thinking and luckily for us, they publish the data here. They also publish a variety of demographic data which are also available at a county level. You can find other county-level sets here.

For our example, download the creative-class dataset from the link above and download the “educational attainment” dataset from the second link. Note these both download as XLS files. This is an instance where I find it easier to open excel, convert to CSV, and save again rather than importing from xls into R (it tends to be tedious).

creative.df <- read.csv("./creativeclass200711.csv")
education.df <- read.csv("./Education.csv")

Note here that these datasets both have strange headers we need to deal with (head(education.df_) if you don’t believe me). There are a few ways to do this (notably, we could subset the data once it’s imported) but in this case I am going to make use of a feature of the read.csv function which lets us skip lines. By specifying skip=XYZ we get rid of all the mumbo jumbo at the top and end up with much cleaner data. Note XYZ should be changed to the number of lines we want to skip.

creative.df <- read.csv("./creativeclass200711.csv",skip=1,stringsAsFactors = F)
education.df <- read.csv("./Education.csv",skip=4,stringsAsFactors = F)

The above links to the data also contain documention about what the header columns actually mean and how they are calculated.


In this case, if I want to try to predict how some part of the education dataset bears on some part of the creative dataset, it can be helpful to combine them into one table. In the example, I wish to add the column 1970 from education to a new dataframe along with the Creative 2007 11 Pooled share column from creative. We are very lucky in this data as we have the FIPS codes for each county. In this case, we can simply match the values by the numerical FIPS code and be on our way. It can’t go without mention how much of a rabbit hole you might fall into if you have to start matching columns by text (remember the Texas question from 2 weeks ago?). merge lets us merge data by row if they match a certain criteria.Note we have to specify the by.x and by.y differently because the columns we want to match eachother have different names.

df.combination <- merge(x=education.df,y=creative.df[,c(1,10)],by.x="FIPS.Code",by.y="FIPS")

Now we have the combination dataset! Another important thing to look at is how I subset creative when putting it in. The output will contain all columns of both dataframes so being that I only wanted to add one, I let merge only deal with the data frame with the index column, and the column of data.


The questions are using the education and creative datasets above. There are also poverty, unemployement, and population datasets available at the second link in the Public Data section above. Import the remaining datsets before begining the questions

  1. From all the data you’ve imported, which are the most/least complete. Do you trust the data why or why not? (A full answer here will be at least a paragraph. An answer should also be irrespective of the source, ie you can’t say it’s trustworthy because it’s from the government) (1pt)

  2. R base plots are a little boring sometimes. Install the package ggplot2. Create density plots of the fraction of creatives in a county, and the total number of creatives in a county. In the lab folder, there is a ggplot cheatsheet which I highly reccomend. Comment on why you might use a fraction of creatives vs total number in a model. (1 pt)

  3. Using ggplot, plot the number of creatives in a county versus three other variables. The variables must each come from a different dataset (education, population, poverty, unemployeent). Try to find ones which show interesting (but not spurious) correlations. This can be done in one plot, or 3 plots. (1 pt)

  4. Of all the data we have, build a linear model to predict the number of creatives using 10 variables. The variables should come from at least 3 of the datsets. Try to pick interesting variables which you believe might movitave the number of creatives in a county. For each variables given, please explain why you included it. Cross validation is recommended. Any other transformations, or data manipulations must be justified. Comment on the results and include at minimum the following: what variables were most important (why?), what variables were unimportant (why?), what interactions did you observe, how good is the model fit? (up to 5 points)

  5. How does your model hold up on weird outliers? Los Alamos County (New Mexico) is home to LANL. They have an extremely high rate of folks with PhDs and an extremely high average income. Comment on how your model performs there. Fairfax County (Virginia) is also quite rich. How does your model perform there? Robeson County (NC) is one of the poorest in the country. Comment on model performance in that county. The point of this question is to see how your model handles corner cases. (1 point)

  6. BONUS QUESTION. Make a US map of the fractions of creatives per county using ggplot. It should look something like what I’ve included below.