Epidemic Stats

On Text Editors
Dates in R
Epidemic Curves From Reporting Data
Curve Fitting
Download Data from Web

On Text Editors

Anyone working with data should become comfortable with a good text editor
A text editor is quite distinct from a word processor as it works with plain text
The choice of text editor is largely a matter of personal preference
Wikipedia does a nice job comparing different text editors
I use Aquamacs, which is a native Mac version of GNU Emacs, for most of my work
Emacs isn’t for everyone, but it actually isn’t that hard to dive in
- Reference cards help
- And you never know what you may be able to do in Emacs
As we discussed in section, for many (most?) epidemiological data sets, it’s probably easier to copy-and-paste from online sources and then just do the cleaning in a text editor than to try to write general-use code
It becomes worth your while to invest in writing code if you’re going to do a lot of this sort of work

Dates in `R`

Epidemiological reporting data are typically time-referenced
We want to plot outbreak data as they unfold in time
This means we need to be able to work with dates
R has a variety of facilities for handling dates
Some people like the lubridate package for working with time/date data
The base-R command as.Date() works well for most reporting data
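
As a minimal sketch of what as.Date() does with reporting-style dates, the lines below parse a few day-month-year strings and compute the gaps between them. The dates are made up for illustration and are not taken from any data file used here.

```r
## hypothetical report dates in day-month-year form (illustration only)
raw <- c("27-04-2017", "07-05-2017", "13-05-2017")

## the format string tells as.Date() how to read the text
d <- as.Date(raw, format = "%d-%m-%Y")
class(d)               # "Date"

## differences between successive dates are time differences in days
gaps <- as.numeric(diff(d))
gaps                   # 10 6

## dates can be reformatted for axis labels or tables
## (month abbreviations depend on the locale)
format(d, "%d %b %Y")
```

Once the strings are Date objects, plot() puts them on a proper time axis and diff() gives the number of days between reports.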

Epidemic Curves From Reporting Data

WHO reports outbreak data in a pretty haphazard way
The International Society for Infectious Diseases' ProMED Mail system compiles the data from the occasional WHO reports, and these data can be used quite easily
The reporting is somewhat irregular – there are often gaps of a week or two
The ProMED Mail tables include average daily cases, but we can calculate that ourselves as well
To calculate incidence from the cumulative number of cases, use the R function diff()
To calculate the average number of cases per day in a reporting interval, simply divide the difference by the number of days between reports

yemen <- read.csv("https://web.stanford.edu/class/earthsys214/data/yemen-cases-2017.txt", skip=2, header=TRUE)
## cumulative cases
plot(as.Date(yemen$Date, "%d-%m-%Y"), yemen$Cases, type="l", lwd=3, col= "#660066", ylab="Total Reported Cases", xlab="Date")

# incidence
dates <- as.Date(yemen$Date, "%d-%m-%Y")
inc <- diff(yemen$Cases)/as.numeric(diff(dates))
plot(dates[-1],inc, type="h",  lwd=5, col= "#660066", ylab="New Cases", xlab="Date")
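
To make the arithmetic concrete, here is the same calculation on a tiny made-up table. The data frame toy and its numbers are hypothetical, chosen only so the results are easy to check by hand.

```r
## made-up cumulative counts with irregular reporting gaps (illustration only)
toy <- data.frame(
  Date  = c("01-05-2017", "04-05-2017", "11-05-2017", "13-05-2017"),
  Cases = c(100, 160, 300, 380),
  stringsAsFactors = FALSE
)
toy_dates <- as.Date(toy$Date, "%d-%m-%Y")

new_cases <- diff(toy$Cases)               # 60 140 80 new cases per interval
gap_days  <- as.numeric(diff(toy_dates))   # 3 7 2 days between reports
daily_avg <- new_cases / gap_days          # 20 20 40 average cases per day

## each value belongs to the report that closes the interval, hence [-1]
data.frame(report = toy_dates[-1], new_cases, gap_days, daily_avg)
```

Note that diff() returns one fewer value than it is given, which is why the plots use dates[-1] on the horizontal axis.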

Curve Fitting

We can fit a curve to incidence data using natural splines from the splines library
A spline is a curve made up of cubic polynomials between knots
- it is also continuous and has continuous first and second derivatives at each knot
We use natural splines which are constrained to be linear for values below the lowest and above the highest knots
- this ends up making the predictions more stable at the extremes
Use five degrees of freedom, which makes ns() place four interior knots at the 20th, 40th, 60th, and 80th percentiles of the data, with boundary knots at the minimum and maximum, which seems sensible

library(splines)
sinc <- lm(inc ~ ns(dates[-1], df=5))
plot(dates[-1],inc, type="h",  lwd=5, col= "#660066", ylab="New Cases", xlab="Date")
## fitted values at the observed dates; no newdata is needed to predict at the points used in the fit
lines(dates[-1], predict(sinc), lwd=3, col="black")
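
One way to check where ns() puts its knots for a given df is to inspect the attributes of the basis it returns. The sketch below does this on a toy predictor, 0:100, which is an assumption for illustration and not the outbreak dates.

```r
library(splines)

## toy predictor (illustration only): the integers 0 to 100
x <- 0:100
b <- ns(x, df = 5)

dim(b)                      # 101 rows, 5 basis columns

## with df = 5 and no intercept column, ns() chooses df - 1 = 4 interior knots
## at evenly spaced quantiles of x
attr(b, "knots")            # interior knots: 20 40 60 80

## the boundary knots default to the range of x;
## the fitted curve is linear beyond them
attr(b, "Boundary.knots")   # 0 100
```

The same attributes can be read off the basis built from the real dates, which shows the knot positions as numeric dates (days since 1970-01-01).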

Download Data from Web

Standard package is XML
The htmltab package parses cells that span multiple columns (which are common in HTML tables)
XML also does not work with https (and all Wikipedia pages have moved to https since 2015)
There is a useful vignette that delves more into the details of htmltab
The key is basically to find the anchor point for the html table that allows htmltab to parse the cells of the table

library(htmltab)
url <- "https://www.cdc.gov/std/stats16/tables/1.htm"
cnames <- c("Year", "Syphilis_All_Cases","Syphilis_All_Rate", "Syphilis_Primary_Secondary_Cases", "Syphilis_Primary_Secondary_Rate", "Syphilis_Early_Latent_Cases", "Syphilis_Early_Latent_Rate",
"Syphilis_Late_Latent_Cases", "Syphilis_Late_Latent_Rate", "Syphilis_Congenital_Cases", "Syphilis_Congenital_Rate", "Chlamydia_Cases","Chlamydia_Rate", "Gonorrhea_Cases", "Gonorrhea_Rate", "Chancroid_Cases", "Chancroid_Rate") 
stitable <- htmltab(doc = url, which = "//th[text() = 'Year*']/ancestor::table", colNames=cnames, rm_nodata_rows=FALSE, rm_nodata_cols=FALSE)
## remove commas and convert from text to numeric
## note that gsub() works on vectors, so need to cycle through the cols of the data frame
for(i in 1:17){
  tmp <- stitable[,i]
  stitable[,i] <- as.numeric(gsub(",", "", tmp))
}
## the way I actually do this is to use apply(), which will convert to matrix -- need to change back to data.frame
## for some reason, R Markdown won't compile when I use this approach...
# dropcomma <- function(x) gsub(",", "", x)
## this will generate a warning but you actually want NAs introduced!
#stitable <- as.data.frame(apply(stitable,2,dropcomma))
## dividing by 1e05 puts the axis in units of 100,000 cases
with(stitable, plot(Year, Gonorrhea_Cases/1e05, type="l", lwd=3, col="yellow4",
                    xlab="Year", ylab="Gonorrhea Cases (100,000s)"))
title("The Disco Surge")

For this table, I had to use rm_nodata_rows=FALSE and rm_nodata_cols=FALSE to ensure all columns load, because the “Chlamydia Rate” column (inexplicably) has blanks for 1941-1983
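
The comma-stripping step can also be written without an explicit loop and without the detour through a matrix that apply() takes. This is a sketch on a made-up stand-in for the scraped table: the data frame toy, its column names, and its values are all hypothetical.

```r
## toy stand-in for a scraped table: every column arrives as text (made-up values)
toy <- data.frame(
  Year  = c("1941", "1942", "1943"),
  Cases = c("1,234", "12,345", "123,456"),
  Rate  = c("", "", "10.5"),
  stringsAsFactors = FALSE
)

## strip commas, then convert text to numbers; blanks become NA
dropcomma <- function(x) as.numeric(gsub(",", "", x))

## lapply() works column by column; assigning into toy[] keeps the data frame
## (and its column names) instead of returning a list or a matrix
toy[] <- lapply(toy, dropcomma)

str(toy)
```

Because the result is still a data frame, nothing has to be converted back afterwards.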

Epidemic Stats

James Holland Jones

02/09/2018
