Anyone working with data should become comfortable with a good text editor
A text editor is quite distinct from a word processor as it works with plain text
The choice of text editor is largely a matter of personal preference
Wikipedia has a nice comparison of different text editors
I use Aquamacs, which is a native Mac version of GNU Emacs, for most of my work
As we discussed in section, for many (most?) epidemiological data sets, it’s probably easier to copy and paste from online sources and then do the cleaning in a text editor than to try to write general-use code
It becomes worth your while to invest in writing code if you’re going to do a lot of this sort of work
Epidemiological reporting data are typically time-referenced
We want to plot outbreak data as they unfold in time
This means we need to be able to work with dates
R has a variety of facilities for handling dates
Some people like the lubridate package for working with time/date data
The base-R function as.Date() works well for most reporting data
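As a minimal sketch of how as.Date() handles a day-month-year format like the one used in the reports, the example below parses a few made-up date strings (the `raw` vector is hypothetical, not taken from any report). Once parsed, dates can be subtracted to get the number of days between them.

```r
## toy date strings in day-month-year order (made up for illustration)
raw <- c("27-04-2017", "04-05-2017", "19-05-2017")

## the format string tells as.Date() how to read each piece
d <- as.Date(raw, format = "%d-%m-%Y")
d                     # printed in ISO 8601 form, e.g. "2017-04-27"

## differences between successive dates, in days
diff(d)
as.numeric(diff(d))   # plain numbers of days: 7 15

## a string that does not match the format yields NA rather than an error
as.Date("2017/04/27", format = "%d-%m-%Y")
```

An NA from as.Date() is usually the first sign that the format string does not match the way the source wrote its dates.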
WHO reports outbreak data in a pretty haphazard way
The International Society for Infectious Diseases' ProMED Mail system compiles the data from the occasional WHO reports, and these data can be used quite easily
The reporting is somewhat irregular – there are often gaps of a week or two
The ProMED Mail tables include average daily cases, but we can calculate that ourselves as well
To calculate incidence from the cumulative number of cases, use the R function diff()
To calculate the average number of cases per day in a reporting interval, simply divide the difference by number of days between reports
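The arithmetic is easiest to see on a small example. The sketch below uses hypothetical cumulative counts and report dates (`cum_cases` and `rep_dates` are invented for illustration) with deliberately irregular gaps between reports.

```r
## hypothetical cumulative case counts on irregular report dates
rep_dates <- as.Date(c("2017-05-01", "2017-05-08", "2017-05-10", "2017-05-24"))
cum_cases <- c(100, 240, 300, 860)

## new cases in each reporting interval
new_cases <- diff(cum_cases)               # 140 60 560

## length of each reporting interval in days
gap_days  <- as.numeric(diff(rep_dates))   # 7 2 14

## average new cases per day within each interval
per_day   <- new_cases / gap_days          # 20 30 40

## diff() returns one fewer value than its input, so pair with rep_dates[-1]
data.frame(end_of_interval = rep_dates[-1], new_cases, gap_days, per_day)
```

Note that the largest interval total (560) does not have a dramatically higher daily rate once the two-week gap is accounted for, which is why dividing by the interval length matters when reporting is irregular.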
yemen <- read.csv("https://web.stanford.edu/class/earthsys214/data/yemen-cases-2017.txt", skip=2, header=TRUE)
## cumulative cases
plot(as.Date(yemen$Date, "%d-%m-%Y"), yemen$Cases, type="l", lwd=3, col= "#660066", ylab="Total Reported Cases", xlab="Date")
# incidence
dates <- as.Date(yemen$Date, "%d-%m-%Y")
inc <- diff(yemen$Cases)/as.numeric(diff(dates))
plot(dates[-1],inc, type="h", lwd=5, col= "#660066", ylab="New Cases", xlab="Date")
We can fit a curve to incidence data using natural splines from the splines library
With five degrees of freedom, ns() places four interior knots at the 20th, 40th, 60th, and 80th percentiles of the dates, with boundary knots at the first and last dates, which seems sensible
library(splines)
sinc <- lm(inc ~ ns(dates[-1], df=5))
plot(dates[-1],inc, type="h", lwd=5, col= "#660066", ylab="New Cases", xlab="Date")
## predict() with no new data returns the fitted values at the observed dates
lines(dates[-1], predict(sinc), lwd=3, col="black")
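To see where ns() puts its knots for a given df, one option is to inspect the attributes of the basis it returns. The sketch below uses a hypothetical covariate `x` running from 0 to 100 so that the quantiles are easy to read. As far as I know, with the default `intercept = FALSE`, ns() uses df - 1 interior knots at evenly spaced quantiles and puts the boundary knots at the range of the data.

```r
library(splines)

## a hypothetical covariate: the integers 0 to 100
x <- 0:100

## natural spline basis with five degrees of freedom
basis <- ns(x, df = 5)

attr(basis, "knots")           # interior knots: 20, 40, 60, 80
attr(basis, "Boundary.knots")  # boundary knots: 0, 100
ncol(basis)                    # 5 basis columns, one per degree of freedom
```

Running the same check on the basis built from the actual dates shows which calendar dates the knots fall on.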
The standard package for reading HTML tables into R is XML
The htmltab package handles cells that span multiple columns (which are common in HTML tables)
The XML package also does not work with https (and all Wikipedia pages have moved to https since 2015)
There is a useful vignette that delves more into the details of htmltab
The key is basically to find the anchor point for the html table that allows htmltab to parse the cells of the table
library(htmltab)
url <- "https://www.cdc.gov/std/stats16/tables/1.htm"
cnames <- c("Year", "Syphilis_All_Cases","Syphilis_All_Rate", "Syphilis_Primary_Secondary_Cases", "Syphilis_Primary_Secondary_Rate", "Syphilis_Early_Latent_Cases", "Syphilis_Early_Latent_Rate",
"Syphilis_Late_Latent_Cases", "Syphilis_Late_Latent_Rate", "Syphilis_Congenital_Cases", "Syphilis_Congenital_Rate", "Chlamydia_Cases","Chlamydia_Rate", "Gonorrhea_Cases", "Gonorrhea_Rate", "Chancroid_Cases", "Chancroid_Rate")
stitable <- htmltab(doc = url, which = "//th[text() = 'Year*']/ancestor::table", colNames=cnames, rm_nodata_rows=FALSE, rm_nodata_cols=FALSE)
## remove commas and convert from text to numeric
## note that gsub() works on vectors, so need to cycle through the cols of the data frame
for(i in 1:17){
tmp <- stitable[,i]
stitable[,i] <- as.numeric(gsub(",", "", tmp))
}
## the way I actually do this is to use apply(), which will convert to matrix -- need to change back to data.frame
## for some reason, R Markdown won't compile when I use this approach...
# dropcomma <- function(x) gsub(",", "", x)
## this will generate a warning but you actually want NAs introduced!
#stitable <- as.data.frame(apply(stitable,2,dropcomma))
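An alternative to both the loop and apply() is lapply(), which works column by column and, when assigned back with `[]`, keeps the object a data frame. This is a sketch on a toy data frame (`toy` and `to_number` are hypothetical names), not the approach used in these notes.

```r
## toy data frame standing in for a scraped table: every column arrives as text
toy <- data.frame(Year  = c("1941", "1942", "1943"),
                  Cases = c("1,234", "12,345", ""),
                  stringsAsFactors = FALSE)

## strip commas, then convert to numeric; blank cells become NA
to_number <- function(x) as.numeric(gsub(",", "", x))

## assigning into toy[] replaces the columns but keeps the data frame structure
toy[] <- lapply(toy, to_number)
str(toy)
```

Because lapply() never passes through a matrix, there is no need to convert back to a data frame afterwards.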
## cases are divided by 1e05, so the axis is in units of 100,000 cases
with(stitable, plot(Year, Gonorrhea_Cases/1e05, type="l", lwd=3, col="yellow4",
     xlab="Year", ylab="Gonorrhea Cases (100,000s)"))
title("The Disco Surge")
Set rm_nodata_rows=FALSE and rm_nodata_cols=FALSE to ensure that all rows and columns load, because the “Chlamydia Rate” column (inexplicably) has blanks for 1941-1983