Lab 1: Biostrings in R

In this lab, we’ll learn how to manipulate strings in R, mostly using the Biostrings package.

Getting started

If you’ve not done so already, install the Biostrings package with the commands

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")

Load the package with

require(Biostrings)

Biostrings can be quoted or unquoted. You will want to refer to the documentation for this package which is available at http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf. We suggest keeping this document open throughout the lab.

We will play around with some of functions providede in Biostrings just to get familiar with what they do, and how they work.

Generating DNA alphabets

R provides functions that generate upper and lower case alphabets.

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

LETTERS

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

One way to generate from “ACGT” is to do

sample(LETTERS[c(1,3,7,20)], size=10, replace=TRUE)

##  [1] "C" "G" "T" "A" "T" "T" "G" "G" "A" "A"

Biostrings makes this slightly easier with the DNA_ALPHABET variable.

DNA_ALPHABET

##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."

seq = sample(DNA_ALPHABET[1:4], size=10, replace=TRUE)
seq

##  [1] "A" "G" "C" "T" "C" "G" "T" "C" "T" "T"

Convert seq into a string with

seq = paste(seq, collapse="")
seq

## [1] "AGCTCGTCTT"

The XString class

The XString class allows us to create, store, and work with different types of strings. We are only allowed to work with subclasses of XString: BString, DNAString, RNAString, and AAString.

Let’s create a BString object:

bstring = BString("I am a BString object")
bstring

##   21-letter "BString" instance
## seq: I am a BString object

Obtain the length of bstring with R’s “length” function:

length(bstring)

## [1] 21

Now make a DNAString object:

dnastring = DNAString("TTGAAA-CTC-N")
dnastring

##   12-letter "DNAString" instance
## seq: TTGAAA-CTC-N

As before, we can compute the length of dnastring with

length(dnastring)

## [1] 12

According to the manual, the differences with a BString object are:

only letters from the IUPAC extended genetic alphabet and the gap letter (-) are allowed and
each letter in the argument passed to the DNAString function is encoded in a special way before it’s stored in the DNAString object.

We can look at the slots of any XString object with the str() function:

str(dnastring)

## Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. ..@ xp                    :<externalptr> 
##   .. .. ..@ .link_to_cached_object:<environment: 0x632f648> 
##   ..@ offset         : int 0
##   ..@ length         : int 12
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

slotNames(dnastring)

## [1] "shared"          "offset"          "length"          "elementMetadata"
## [5] "metadata"

Some useful functions:

alphabetFrequency(dnastring)

## A C G T M R W S Y K V H D B N - + . 
## 3 2 1 3 0 0 0 0 0 0 0 0 0 0 1 2 0 0

alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)

##          A          C          G          T      other 
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000

letterFrequency(dnastring, "A", as.prob=TRUE)

##    A 
## 0.25

reverseComplement(dnastring)

##   12-letter "DNAString" instance
## seq: N-GAG-TTTCAA

We can access the individual letters in each string using the [] operator.

bstring[3]

##   1-letter "BString" instance
## seq: a

dnastring[2]

##   1-letter "DNAString" instance
## seq: T

If we want a substring, we specify a range of indices.

dnastring[7:12]

##   6-letter "DNAString" instance
## seq: -CTC-N

Remember that R uses 1-based indexing, meaning that the first element in a string or any vector gets index 1. While this method of obtaining substrings is easy and intuitive, we DO NOT recommend it, especially for large strings. Biostrings provides a function “subseq” that takes advantage of the XString internals.

subseq(bstring, start=3, end=3)

##   1-letter "BString" instance
## seq: a

subseq(bstring, 3, 3)

##   1-letter "BString" instance
## seq: a

subseq(dnastring, 7, 12)

##   6-letter "DNAString" instance
## seq: -CTC-N

We can convert any XString object to a character vector with

toString(bstring)

## [1] "I am a BString object"

toString(dnastring)

## [1] "TTGAAA-CTC-N"

The results are character vectors of length 1. Check this with

length(toString(bstring))

## [1] 1

Comparing XStrings

We can compare XStrings with characters and other XStrings.

bb = subseq(bstring, 3, 6)
bb == "am a"

## [1] TRUE

bb == BString("am a")

## [1] TRUE

Try not to compare different XString classes with one another.

bstring == dnastring

## Error in bstring == dnastring: comparison between a "BString" instance and a "DNAString" instance is not supported

bstring != dnastring

## Error in e1 == e2: comparison between a "BString" instance and a "DNAString" instance is not supported

The XStringViews class

This is a useful way to view multiple subsequences of a XString.

view = Views(dnastring, start=3:0, end=5:8)
view

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     3   5     3 [GAA]
## [2]     2   6     5 [TGAAA]
## [3]     1   7     7 [TTGAAA-]
## [4]     0   8     9 [ TTGAAA-C]

To check the number of views, use

length(view)

## [1] 4

We can select a subset of the views using the [] operator.

view[4:2]

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     0   8     9 [ TTGAAA-C]
## [2]     1   7     7 [TTGAAA-]
## [3]     2   6     5 [TGAAA]

The returned object is still a XStringViews object, even if we select only one element. To extract any given view as a XString object, use the [[]] operator.

view[[2]]

##   5-letter "DNAString" instance
## seq: TGAAA

To see all views but the 3rd one, use

view[-3]

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     3   5     3 [GAA]
## [2]     2   6     5 [TGAAA]
## [3]     0   8     9 [ TTGAAA-C]

XStringSet

The XStringSet objects provide a convenient way of storing several XString objects together.

set = NULL
for (i in 1:4)
  set = c(set, paste(sample(DNA_ALPHABET[1:4], 10, replace=T), collapse=""))
set

## [1] "AGACCACTCC" "GCATGTAGCT" "GTGGTACGGC" "TCAAACGGCT"

length(set)

## [1] 4

Now convert into a set of DNAStrings:

set = DNAStringSet(set)
set

##   A DNAStringSet instance of length 4
##     width seq
## [1]    10 AGACCACTCC
## [2]    10 GCATGTAGCT
## [3]    10 GTGGTACGGC
## [4]    10 TCAAACGGCT

length(set)

## [1] 4

The functions “reverseComplement”, “alphabetFrequency”, “width”, and “subseq” work on each of the XStrings in a XStringSet.

reverseComplement(set)

##   A DNAStringSet instance of length 4
##     width seq
## [1]    10 GGAGTGGTCT
## [2]    10 AGCTACATGC
## [3]    10 GCCGTACCAC
## [4]    10 AGCCGTTTGA

alphabetFrequency(set, baseOnly=T, as.prob=T)

##        A   C   G   T other
## [1,] 0.3 0.5 0.1 0.1     0
## [2,] 0.2 0.2 0.3 0.3     0
## [3,] 0.1 0.2 0.5 0.2     0
## [4,] 0.3 0.3 0.2 0.2     0

letterFrequency(set, "A", as.prob=T)

##        A
## [1,] 0.3
## [2,] 0.2
## [3,] 0.1
## [4,] 0.3

width(set)

## [1] 10 10 10 10

subseq(set, 1, 3)

##   A DNAStringSet instance of length 4
##     width seq
## [1]     3 AGA
## [2]     3 GCA
## [3]     3 GTG
## [4]     3 TCA

You can name each XString in the set:

names(set) = paste("name", 1:4, sep="")

Exercise

We’ll look at the Escherichia coli APEC O1 genome (NC 008563). This can be found in the BSgenome package (along with genomes from other model organisms). Download and install the package data:

biocLite("BSgenome.Ecoli.NCBI.20080805")

We can now load the genome with

require(BSgenome.Ecoli.NCBI.20080805)
eco = Ecoli$NC_008563

Sample the genome by generating 1000 random views of random widths (50 - 100).
Use the alphabetFrequency and reverseComplement functions on these views.
Repeat 1-2 but this time with only 100 samples.
Compare the results of alphabetFrequency. How do both of them compare with the entire genome?

Generating the views:

views1000 = Views(eco, start=sample(length(eco), 1000, replace=T), width=sample(50:100, 1000, replace=T))
views100 = Views(eco, start=sample(length(eco), 100, replace=T), width=sample(50:100, 100, replace=T))

Computing the base frequencies:

alphabetFrequency(views1000, baseOnly=T, as.prob=T, collapse=T)

##            A            C            G            T        other 
## 0.2436190835 0.2540654784 0.2542674062 0.2480211082 0.0000269237

alphabetFrequency(views100, baseOnly=T, as.prob=T, collapse=T)

##         A         C         G         T     other 
## 0.2511641 0.2521227 0.2515749 0.2451383 0.0000000

alphabetFrequency(eco, baseOnly=T, as.prob=T)

##            A            C            G            T        other 
## 2.471704e-01 2.529128e-01 2.525602e-01 2.473315e-01 2.518681e-05

IRanges

Just like Views, can be used to look at subsequences. For example, a common task is to specify a list of starting positions on a chromosome, and then look at subsequences of a given length with that start point.

ir1 = IRanges(start=1:10, width=10:1)
ir2 = IRanges(start=1:10, end=11)
ir3 = IRanges(end=11, width=10:1)

Lets look at the structure of an IRanges object.

str(ir1)

## Formal class 'IRanges' [package "IRanges"] with 6 slots
##   ..@ start          : int [1:10] 1 2 3 4 5 6 7 8 9 10
##   ..@ width          : int [1:10] 10 9 8 7 6 5 4 3 2 1
##   ..@ NAMES          : NULL
##   ..@ elementType    : chr "integer"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

We can extract the starts, ends, and widths with

start(ir1)

##  [1]  1  2  3  4  5  6  7  8  9 10

end(ir1)

##  [1] 10 10 10 10 10 10 10 10 10 10

width(ir1)

##  [1] 10  9  8  7  6  5  4  3  2  1

As in Views, we can subset an IRanges object to get another IRanges object.

ir1[1:4]

## IRanges of length 4
##     start end width
## [1]     1  10    10
## [2]     2  10     9
## [3]     3  10     8
## [4]     4  10     7

We can subset using logical expressions too.

ir1[start(ir1) <= 6]

## IRanges of length 6
##     start end width
## [1]     1  10    10
## [2]     2  10     9
## [3]     3  10     8
## [4]     4  10     7
## [5]     5  10     6
## [6]     6  10     5

We’ll make a function to plot an IRanges object, and we will use basic R plotting functions.

library(ggplot2)

plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
  height = 1
  if (is(xlim, "Ranges"))
    xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
  bins = disjointBins(IRanges(start(x), end(x)+1))  
  
  plot.new()
  plot.window(xlim, c(0, max(bins) * (height + sep)))
  ybottom = bins * (sep + height) - height
  rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = "blue")
  title(main)
  axis(1)
}
plotRanges(ir1)

We can also use ggplot to make the plots.

library(ggplot2)

plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
  height = 1
  if (is(xlim, "Ranges"))
    xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
  bins = disjointBins(IRanges(start(x), end(x)+1))  
 
  ybottom = bins*(sep+height) - height
  df = data.frame(ybottom = ybottom, xleft = start(x) -.5, 
  xright = end(x) + .5, ytop = ybottom + height)
  ggplot(df) + geom_rect(aes(xmax = xright, xmin = xleft, ymax = ytop, ymin = ybottom)) + 
    labs(title = main)
}

Notice that our ranges overlap. What if we want it so that the ranges do not overlap?

reduce(ir1)

## IRanges of length 1
##     start end width
## [1]     1  10    10

plotRanges(reduce(ir1))

The result is a NormalIRanges object. An IRanges object is normal when its ranges are: 1. not empty (i.e. they have a non-null width); 2. not overlapping; 3. ordered from left to right; 4. not even adjacent (i.e. there must be a nonempty gap between 2 consecutive ranges).

CpG islands in real data

GC content is the precentage of the genome (or DNA fragment) that is “G” or “C”. To compute the GC content, we count the occurrences of the “G” and “C” alphabets, and divide by the length of the string in question.

We will be using data from chr8 of the human genome version 19 from the UCSC genome repository. (If you have plenty of space and a larger computer, you can prepare the data yourself or take a different chromosome by using the Bioconductor data package BSgenome.Hsapiens.UCSC.hg19, otherwise you will be able to load just a subset of chr8).

A genomic window is defined as a CpG island if its GC content>50% and observed-to-expected CG ratio>0.6. CpG islands are said to mark important regions in the genome because over 65% of gene promoter regions can be found in with CpG islands.

We want to look at this for the Human Chromosome 8. You can get the data either from installing the data from bioconductor (about 800Mb, in a nice format for later reference) which we would like you to do. You can also get the data from downloading from the class website if you haven’t already gotten the data (about 140Mb). If you do not have time to do this right now, you can skip ahead and load the data seqChr8Islands and seqChr8NonIslands by downloading this file and using load("cpg.RData").

If you haven’t already downloaded the bioconductor package, you can get it with this:

source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")

If you already have the genome installed, run:

require(BSgenome.Hsapiens.UCSC.hg19)

## Loading required package: BSgenome.Hsapiens.UCSC.hg19

seqChr8 = unmasked(Hsapiens$chr8)

If you don’t have that installed, then you can download the necessary file for this lab which is 140Mb. We don’t want everyone downloading this from the class website though as that will be a lot of traffic. Having said that, you can download the seqChr8.fasta file to your working directory, and run this:

seqChr8 = readDNAStringSet("seqChr8.fasta")[[1]]

Next, we’re going to download the locations of the CpG islands in this version of the genome from R. Irizarray’s website: http://rafalab.jhsph.edu/CGI/model-based-cpg-islands-hg19.txt. The file model-based-cpg-islands-hg19.txt is also on our data website. Download it here: model-based-cpg-islands-gh19.txt. Save it in your working directory and you can read it into R with the following commands:

cpglocs=read.table("model-based-cpg-islands-hg19.txt",header=T)

Then we retain only the start and stop positions for chromosome 8:

cpglocs8=cpglocs[which(cpglocs[,1]=="chr8"),2:3]
#Look at the dimensions of the cpglocs object,
dim(cpglocs8)

## [1] 2855    2

#Look at the first and last few rows:
cpglocs8[1:5,]

##        start    end
## 56961  32437  33708
## 56962  40354  40656
## 56963  44536  46203
## 56964  99457 100721
## 56965 155954 156761

cpglocs8[2850:2855,]

##           start       end
## 59810 146125639 146125885
## 59811 146126089 146128165
## 59812 146175867 146176338
## 59813 146228006 146228907
## 59814 146232962 146233086
## 59815 146276730 146278360

# We create a new matrix, what does it contain?
nonilocs8=matrix(0,ncol=2,nrow=2856)
nonilocs8[1,1]=10000
nonilocs8[1,2]=cpglocs8[1,1]-1
nonilocs8[2:2856,1]=cpglocs8[1:2855,2]+1
nonilocs8[2:2855,2]=cpglocs8[2:2855,1]-1
nonilocs8[2856,2]=146304021

Now we’ll create DNAStringSets that contain the islands and non-islands.

seqChr8Islands = DNAStringSet(seqChr8, start=cpglocs8[,1], end=cpglocs8[,2])
seqChr8NonIslands = DNAStringSet(seqChr8, start=nonilocs8[,1], end=nonilocs8[,2])

Look at the frequencies of the CG digrams in both sets:

freqIslands = vcountPattern("CG", seqChr8Islands) / width(seqChr8Islands)
freqNonIslands = vcountPattern("CG", seqChr8NonIslands) / width(seqChr8NonIslands)
freqCombined = rbind(data.frame(freq = freqIslands, type = "islands"), 
  data.frame(freq = freqNonIslands, type = "non-islands"))
b <- seq(0, 0.25, length = 50)
hist(freqNonIslands, col = rgb(0, 0, 1, 0.5), xlim = c(0, 0.25), breaks = b, 
    main = "Frequencies of CG Digrams")
hist(freqIslands, col = rgb(1, 0, 0, 0.5), add = T, breaks = b)
legend("topright", c("in NonIsland regions", "in Island regions"), fill = c(rgb(0, 
    0, 1, 0.5), rgb(1, 0, 0, 0.5)))

Plotting the above using ggplot can be done with the followinng calls

ggplot(freqCombined) + 
geom_histogram(aes(fill = type, x = freq), alpha = .7, position = "identity", 
  binwidth = .005) + 
labs(title = "Frequency of CG digrams")

Exercise: Display GC content in a running window along the sequence of Staphylococcus Aureus. For this excercise we read in a fasta file sequence from a file:

staph = readDNAStringSet("./data/staphsequence.ffn.txt", "fasta")
staph[1:3]

##   A DNAStringSet instance of length 3
##     width seq                                          names               
## [1]  1362 ATGTCGGAAAAAGAAATTTGG...AAGAAATAAGAAATGTATAA lcl|NC_002952.2_c...
## [2]  1134 ATGATGGAATTCACTATTAAA...TACCAATCAGAACTTACTAA lcl|NC_002952.2_c...
## [3]   246 GTGATTATTTTGGTTCAAGAA...TTCATCAAGGTGAACAATGA lcl|NC_002952.2_c...

This shows the first 3 sequences in the set. If you type in ‘staph’ it will print out all the sequences. Now, suppose we want to find the average GC content in the sequence windows of width 100 bases.

letterFrequency(staph[[1]], letters="ACGT", OR=0)

##   A   C   G   T 
## 522 219 229 392

GCstaph = data.frame(ID=names(staph), GC=rowSums(alphabetFrequency(staph)[, c(2,3)]/width(staph))*100)

Now we would like to display the GC content in a sliding window.

window = 100
# compute the GC content in a sliding window (as a fraction) for a sequence no. 364
gc = rowSums(letterFrequencyInSlidingView(staph[[364]], window, c("G", "C")))/window
plot(gc, type = 'l')

Now, we would like to add a smoothing line to observe the overall trends.

plot(1:length(gc), gc)
lines(lowess(x = 1:length(gc), y= gc, f = 0.10), col = 12, lwd = 2)