The goal of this lab is to become familiar with the grammar of graphics philosophy implemented in the R
package ggplot2
.
Work through this lab by running all the R code to your computer and making sure that you understand the input and the output. We encourage you to work through this lab with a partner.
We will be going through some of the topics from the ggplot2
documentation website, docs.ggplot2.org. You should keep that website open as a reference throughout this tutorial.
Another useful resource, the complete guide to the Grammar of Graphics philosophy, is available through Springer website
Another useful ggplot2 tutorial is provided by Data Carpentry.
In order to learn how to use ggplot2
, we will use a few datasets available through the nutshell
, parathyroidSE
, EnsDb.Hsapiens.v86
packages.
The code below checks if all packages required for today’s lab are available, and installs any missing ones. Note that (in the ‘Rmd’ file) the chunk declaration includes the following options: message = FALSE, warning = FALSE, results = "hide"
. These R Markdown settings let you suppress messages and warnings which would otherwise appear below the chunk. Another way to avoid the lengthy messages during package loading, is to wrap R commands inside suppressMessages()
function, e.g. suppressMessages(library("foo"))
.
pkgs_needed <- c("tidyverse", "Biostrings",
"parathyroidSE", "EnsDb.Hsapiens.v86", "nutshell")
letsinstall <- setdiff(pkgs_needed, installed.packages())
if (length(letsinstall) > 0) {
BiocManager::install(letsinstall)
}
library("dplyr")
library("ggplot2")
library("Biostrings")
First, we will use a data set about births in the USA in 2006. You can read more about this in the “R in a Nutshell, 2nd edition” book which is freely available as a PDF file online. This is convenient since you can follow the same methods in that book (except translating into the grammar of graphics to make plots).
First, we load the dataset and add two new variables: one to encode whether the day falls on a weekend or whether it is a weekday, and a three-level, discretized health score based on the APGAR 5 score.
We also fix an ideosyncratic encoding of not available values by 99 with proper NA
s.
Finally, we define a subsampled version of the dataset births_small
to speed up some of the plotting (feel free to try out the computations and plots we see subsequently with the full dataset births
.)
data("births2006.smpl", package = "nutshell")
births <- mutate(births2006.smpl,
WEEKEND = ifelse(DOB_WK %in% c(1, 7), "Weekend", "Weekday"),
HEALTH = c("CritLow", "Low", "Normal")[ 1 +
findInterval(APGAR5, c(3.5, 6.5)) ],
ESTGEST = replace(ESTGEST, ESTGEST==99, NA))
set.seed(1)
births_small <- births[ sample(nrow(births), 40000), ]
head(births_small)
## DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5 DMEDUC
## 113458 3 3 27 4 NA F NA NULL
## 159017 3 2 25 4 40 M 10 4 years of high school
## 244793 1 5 34 3 30 M 9 3 years of college
## 388096 3 2 31 3 NA M NA NULL
## 86183 3 5 42 7 18 M 9 NULL
## 383899 10 3 28 4 NA M 3 NULL
## UPREVIS ESTGEST DMETH_REC DPLURAL DBWT WEEKEND HEALTH
## 113458 10 NA C-section 1 Single 2657 Weekday <NA>
## 159017 15 37 C-section 1 Single 2353 Weekday Normal
## 244793 13 41 Vaginal 1 Single 3770 Weekday Normal
## 388096 15 NA Vaginal 1 Single 2722 Weekday <NA>
## 86183 15 39 Vaginal 1 Single 3117 Weekday Normal
## 383899 99 39 Vaginal 1 Single NA Weekday CritLow
To find out about what data each column of births2006.smpl
dataset, holds, you can run the line below in your console:
help("births2006.smpl", package = "nutshell")
geom
’sggplot2 is a plotting system for R, based on the grammar of graphics. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics 1
You can think of plots generated with ggplot2
as sentences. The function ggplot()
starts a sentence (initializes a plot) which usually contains a noun/subject (a dataset with information to be plotted).
The +
operator is used to add further clauses/fragments containing “verbs”, “adjectives”, “adverbs” etc. to the sentence. This allows you to construct sophisticated plots from a few elementary building blocks, just like forming compound sentences from simple phrases.
In ggplot
world, verbs are geometric objects, or geom
’s. For example, geom_bar()
draws bars, geom_point()
draws points. There can be multiple geom
s in a sentence, and these are then also called layers.
For each geom
, you need to define required aesthetics using the function aes()
. The aesthetics of a geom
determine how the attributes of the geometric object (e.g. the x- and y- coordinates of the points, their colors and shapes, etc.) are mapped to the columns of the supplied data.frame. The aesthetics mapping can be defined when initializing a plot (with ggplot(data = dataframe_name, aes(...))
), which makes it apply to all geom
’s. Otherwise, one can specify aes()
mapping for each geom
separately (e.g. geom_bar(aes(...))
), in which case it applies to that particular geom
only.
It is also possible to use a different dataset for geom
than the one supplied to ggplot()
. To do this you simply supply a new data.frame as a data
argument of geom
(e.g. geom_point(data = new_dataframe_name, aes(...))
) which would overwrite the dataset used to this geom
.
For example, let’s look at the number of births on each day of the week in our sample data. First set up our plot with the data births
.
ppp <- ggplot(births_small)
Note that this doesn’t actually plot anything (just like if you wrote junk <- 7
doesn’t output anything until you run junk
. What happens here when you run ppp
?
The number ofbirths per day of the week can be shown easily in a barplot, so let’s use that. To create a geometric object layer for a barplot we use the function geom_bar()
. In order for it to know what part of the data we want to actually plot, we need to give an aesthetic. We can do this by declaring that the aesthetic for the x-axis is the day of the week of the birth.
The column DOB_WK
gives the day of the week that each birth happened as a numeric value 1 (Sunday) through 7 (Saturday). We can tell R that this is actually a factor variable by putting the variable name in the function factor
. Putting this all together, we get geom_bar(aes(x = factor(DOB_WK))
. Finally, to add this layer to the initialized graph, we add the geom
to ppp
with the +
operator.
ppp <- ggplot(births_small)
ppp + geom_bar(aes(x = factor(DOB_WK)))
Doctors are able to delay birth with anti-contraction medications or labor suppressants called tocolytics. That might be one reason we see fewer births on day 1 of the week (Sunday) and day 7 (Saturday).
We can get further information from this plot if we add more aesthetics. For example, maybe we can fill each bar with different colors corresponding to what type of birth it was (“C-section”, “Vaginal”, or “Unknown”). We can do this by just including another aes
in the geometric object. Start with the same initialization, but add a geometric object with the x-axis and also fill color defined.
ppp <- ggplot(births_small)
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC))
When we made that plot, we used the default value for the position
argument of the geom_bar
function. We could have equivalently written the code as follows.
ppp <- ggplot(births_small)
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "stack")
Another option is to use position = "dodge"
. Note that this is an argument to geom_bar
and not to aes
.
ppp <- ggplot(births_small)
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "dodge")
Now we see that about 1/2 as many C-sections were on weekends as there were on a typical weekday, whereas there were about 3/4 as many vaginal births on weekends as there were on weekdays.
Let’s continue looking at the birth data on a day to day basis. We might conjecture that older women are less likely to take a tocolytic since they are more likely to have had other complications already. One way we can do this is to “facet” the graph and display day of the week versus women’s age.
First, let’s make a histogram of the women’s ages to get an idea of what the distribution looks like.
ppp <- ggplot(births_small)
ppp + geom_histogram(aes(x = MAGER), binwidth = 1)
We used the argument binwidth
to set the width of the bins in this histogram (here binwidth = 1
corresponds to 1 year).
Using the grammar of graphics, it is easy to facet this single graph to make multiple graphs along another dimension of the data. In this case, we’re interested in breaking this up along the dimension of day of birth DOB_WK
. We will add this facet with the command facet_grid
or facet_wrap
. In facet_grid
, the argument we use is a formula with the rows (of the tabular display) on the left hand side and the columns (of the tabular display) on the right hand side (RHS).
A formula is created in R with the tilde operator ~
. A dot in the formula is used to indicate there should be no faceting on this dimension (either row or column). The formula can also be provided as a string instead of a classical formula object. In facet_wrap
, we have the same sort of argument, but we only include a RHS of the formula. We’ll use both of them in an example so you can see the difference.
Now let’s look at this faceting on that variable. Again, we will use the +
operator. Here, we also see that we can save the geometric objects in the plot and just add facets at the end.
ppph <- ggplot(births_small) +
geom_histogram(aes(x = MAGER, fill = DMETH_REC), binwidth = 1)
ppph + facet_wrap( ~ DOB_WK)
ppph + facet_grid(DOB_WK ~ SEX)
What is the difference between facet_wrap
and facet_grid
?
Here is an interesting perspective of the data (we use dplyr::filter
to exclude the record where the delivery method is Unknown).
ggplot(dplyr::filter(births, !(DMETH_REC %in% "Unknown"))) +
geom_histogram(aes(x = MAGER, fill = factor(TBO_REC)), binwidth = 1) +
facet_grid(WEEKEND ~ DMETH_REC, scale="free_y", drop = TRUE) +
geom_vline(xintercept = seq(15, 45, by=5), alpha=0.2, color="white") +
labs(title = "Births in USA 2006", fill="Birth\nOrder")
It’s often useful to transform your data before plotting, and that’s what statistical transformations do.
Here we look at the histogram of mother’s ages as before, but we also add a density estimate to it.
ggplot(births_small, aes(x = MAGER)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "grey", col = "black") +
stat_density(col = "red", fill = NA)
Maybe we want to compare this to a lognormal distribution’s density.
ggplot(births_small, aes(x = MAGER)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill="grey", col="black") +
stat_density(col="red", fill=NA) +
stat_function(fun = dlnorm, col = "blue",
args = list(mean = mean(log(births_small$MAGER)),
sd = sd(log(births_small$MAGER))))
You can add any function using the stat_function
.
In this example, we will plot the birth weight in grams (DBWT) versus weight gain (WTGAIN) by mother (which seems to be in pounds). Unfortunately, we need to be a little more careful about this when there are NAs in the data.
ppp2 <- ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)),
aes(x = WTGAIN, y = DBWT)) +
labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")
ppp2 + geom_point()
When we plot this data by itself, it is not clear what is going on – there are many points plotted on top of each other and it just looks messy.
One way to transform this data with a stat_bin2d()
in ggplot2
is by binning the points as below.
ppp2 + stat_bin2d()
See that the color axis reports counts. That means we count the total number of observations to fall within a box, i.e. we are effectively generating a 2D density plot. This plot seems to indicate that the joint distribution of weight gain by mother and birth weight in grams is unimodal.
Instead of just making counts though, we can compute other quantities (evaluate functions other than count) within each of the 2d bins on other two columns of births_small
. For example, let’s look at the median number of weeks of gestation for each of the bins.
ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = median) +
labs(title = "median number of weeks of gestation")
## Warning: Removed 59 rows containing non-finite values (stat_summary2d).
If we want to do a regression, we can include this automatically. By default, ggplot2
uses “locally weighted scatterplot smoothing” (loess) regression for data sets with < 1000 points and “generalized additive models” (GAM) for >= 1000 points.
# here we force a linear model fit
ppp2 + geom_point() + stat_smooth(method = lm)
By changing the color aesthetics of the modelled line, we can fit separate models to the corresponding subsets of the data and compare them. For example, let’s look at a smoothed fit (GAM) of birthweight versus weight gain by mother separated out by baby boys and girls.
ppp2 + geom_point() + stat_smooth(aes(col = SEX))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Finally, let’s look at hexagonal bins as a visually more attractive alternative to the rectangular bins used so far
ppp2 + geom_hex(bins = 30) + stat_smooth(aes(col = SEX))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Scales control the mapping between data and aesthetics. Going back to the 2d distribution of birthweight versus weight gain by mothers, it is difficult to see what is going on except that there is a dense region and a less dense region. If we take the square root of the number of births per region, then we can see that there is a smooth transition between the high density area and the low density area.
# we repeat again the code above so you don't gave to scroll to look at it.
ppp2 <- ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)),
aes(x = WTGAIN, y = DBWT)) +
labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")
ppp2 + stat_bin2d() + scale_fill_gradient(trans = "sqrt")
See what happens when you change the trans to log10
.
Sometimes, we might want to change the scale of the vertical axis. According to Wikipedia’s page on “Fetal Viability”, there is a 50% chance of viability at 24 weeks of gestation. We will repeat the plot above with birth weight on a log scale, so we get better separation of the mean weeks of gestation.
ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = mean) +
scale_y_log10(limits = 10^c(2, 4)) +
scale_fill_gradient2(midpoint = 24) +
labs(title="mean number of weeks of gestation")
## Warning: Removed 59 rows containing non-finite values (stat_summary2d).
Now let’s look at only the quadruplets in the full dataset (there are 39 such observations). We want to include lots of variables such as number of prenatal visits (UPREVIS), the mother’s age (MAGER), the estimated weeks of gestation (ESTGEST), the delivery method (DMETH_REC), and the mother’s education level (DMEDUC).
ppp3 <- ggplot(dplyr::filter(births, DPLURAL == "4 Quadruplet"),
aes(x = UPREVIS, y = MAGER)) +
geom_point(aes(size = ESTGEST, shape = DMETH_REC, col = DMEDUC)) +
stat_smooth(aes(col = DMETH_REC), method = "lm")
ppp3
## Warning: Removed 4 rows containing missing values (geom_point).
ppp3 + scale_size(range=c(3, 6)) + scale_color_brewer(palette = "Set1") +
scale_shape(solid = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).
Note that, while shapes are discrete, colors and sizes can both be scaled continuously.
Remember that in the the power calculations section from the last lab, we defined a statistic estimating the deviation of the observed to expected frequencies of nucleotides. The code for generating the statistics assuming it came from the null distribution ( i.e. when all 4 nucleotides are evenly likely).
oestat = function(o, e){
sum( (e-o)^2/e )
}
set.seed(1)
B = 10000
# here we pick an arbitrary length / not the same as for Celegans
n = 2847
expected = rep(n/4 ,4)
oenull = replicate(
B, oestat(e=expected, o=rmultinom(1,size = n, prob = rep(1/4,4))))
Now, we can estimate the null distribution of this statistics, by plotting a histogram of the generated values:
ggplot(data.frame(null_stats = oenull)) +
geom_histogram(aes(x = null_stats), bins = 100, boundary=0)
We compare the distribution of this statistic to \(Chi^2_3\) (df = 1 * (4-1) = 3) using a q-q plot:
ggplot(data.frame(stat = oenull), aes(sample = stat)) +
stat_qq(distribution = stats::qchisq, dparams = list(df = 3)) +
stat_qq_line(distribution = stats::qchisq, dparams = list(df = 3))
We will use the parathyroidGenesSE
package in R
. Load the data and read the experimental information and the abstract.
library("parathyroidSE")
library("EnsDb.Hsapiens.v86")
data("parathyroidGenesSE", package = "parathyroidSE")
metadata(parathyroidGenesSE)$MIAME
## Experiment data
## Experimenter name: Felix Haglund
## Laboratory: Science for Life Laboratory Stockholm
## Contact information: Mikael Huss
## Title: DPN and Tamoxifen treatments of parathyroid adenoma cells
## URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211
## PMIDs: 23024189
##
## Abstract: A 251 word abstract is available. Use 'abstract' method.
abstract(metadata(parathyroidGenesSE)$MIAME)
## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."
Parathyroid adenoma (http://en.wikipedia.org/wiki/Parathyroid_adenoma) is a benign tumor of the parathyroid gland. The abstract tells us that some interesting genes to look at are the following:
Let’s put them in a table.
genes <- read.csv(textConnection(
"name, group
ESR1, estrogen
ESR2, estrogen
CASR, parathyroid
VDR, parathyroid
JUN, parathyroid
CALR, parathyroid
ORAI2, parathyroid"),
stringsAsFactors = FALSE, strip.white = TRUE)
In the parathyroidGenesSE
object, the features are labeled with Ensembl gene identifiers, so let’s use a Bioconductor package to find the corresponding IDs.
ens <- ensembldb::select(EnsDb.Hsapiens.v86,
keys = list(GenenameFilter(genes$name),
TxBiotypeFilter("protein_coding")),
columns = c("GENEID", "GENENAME"))
## Warning: 'GenenameFilter' is deprecated.
## Use 'GeneNameFilter' instead.
## See help("Deprecated")
## Note: ordering of the results might not match ordering of keys!
ens <-
dplyr::filter(ens, GENEID %in% rownames(parathyroidGenesSE)) %>%
mutate(group = genes$group[match(GENENAME, genes$name)])
ens
## GENEID GENENAME TXBIOTYPE group
## 1 ENSG00000179218 CALR protein_coding parathyroid
## 2 ENSG00000036828 CASR protein_coding parathyroid
## 3 ENSG00000091831 ESR1 protein_coding estrogen
## 4 ENSG00000140009 ESR2 protein_coding estrogen
## 5 ENSG00000177606 JUN protein_coding parathyroid
## 6 ENSG00000160991 ORAI2 protein_coding parathyroid
## 7 ENSG00000111424 VDR protein_coding parathyroid
Make the table of gene counts, add the patient info:
countData <- assay( parathyroidGenesSE )
gene.counts <- t(countData[ens$GENEID, ])
colnames(gene.counts) <- ens$GENENAME
dat <- cbind(data.frame(colData( parathyroidGenesSE)), data.frame(gene.counts))
head(dat)
## run experiment patient treatment time submission study sample
## 1 SRR479052 SRX140503 1 Control 24h SRA051611 SRP012167 SRS308865
## 2 SRR479053 SRX140504 1 Control 48h SRA051611 SRP012167 SRS308866
## 3 SRR479054 SRX140505 1 DPN 24h SRA051611 SRP012167 SRS308867
## 4 SRR479055 SRX140506 1 DPN 48h SRA051611 SRP012167 SRS308868
## 5 SRR479056 SRX140507 1 OHT 24h SRA051611 SRP012167 SRS308869
## 6 SRR479057 SRX140508 1 OHT 48h SRA051611 SRP012167 SRS308870
## CALR CASR ESR1 ESR2 JUN ORAI2 VDR
## 1 5055 14525 2 1 1799 632 1167
## 2 6161 16870 3 3 1153 1424 1653
## 3 3333 7798 1 1 1086 309 688
## 4 6407 14299 2 1 1227 974 1407
## 5 3810 9235 4 1 1258 326 702
## 6 4885 12994 2 3 906 814 1235
Plot one of the estrogen related gene’s counts (ESR1) with plot aesthetics and faceting to separate patients, treatments and times.
ggplot(dat, aes(col = patient, x = treatment, y = ESR1)) +
geom_point(size = 3) +
facet_grid( . ~ time)
Another useful ggplot2 tutorial is provided by Data Carpentry. Check that for more tips, tricks and hacks!
From the plot of the parathyroid data, answer the following.
Quiz question 1 : How many patients are there?
Quiz question 2 : How many time points are there?
Quiz question 3 : There were 3 treatments: “Control”, “DPN”, and “OHT”. How many measurements were taken from patient 2 under the DPN treatment?
Make your own plot of VDR versus CASR. (That is CASR, not CALR).
Quiz question 4 : Which patient has the highest recorded level of CASR?
Quiz question 5 : Which of the following pairs of patients seem to be well separated in this plot (i.e., for which two patients you can draw a line on the plot that perfectly separates them)?
Quiz question 6 : You need to know which pairs of patients are well separated with respect to the genes VDR and CASR (i.e., you can draw a line on the plot that perfectly separates the patients). Which of the following methods can help you visualize this?
Quiz question 7 : Which patient looks different from the other three when you plot VDR versus ORAI2?
Quiz question 8 : Plot ORAI2 versus JUN. Which can you separate best?
Quiz question 9 : Plot CASR versus (ESR1 + ESR2). Fit a separate linear model for each treatment (Control, DPN, OHT). Which linear models are increasing?
Quiz question 10 : What is the maximum number of shapes that you are allowed to use in ggplot2 by default?
Quiz question 11 : Write the name of the function that you could use to make more than the maximum number of default shapes allowed. Hint: this function has “values” as one of the arguments ____(…, values = (…)).
Quiz question 12 : What do Themes do in ggplot2?