In this lab, we’ll learn how to manipulate strings in R, mostly using the Biostrings package.
If you’ve not done so already, install the Biostrings
package with the commands
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
Load the package with
require(Biostrings)
Biostrings can be quoted or unquoted. You will want to refer to the documentation for this package which is available at http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf. We suggest keeping this document open throughout the lab.
We will play around with some of functions providede in Biostrings just to get familiar with what they do, and how they work.
R provides functions that generate upper and lower case alphabets.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
One way to generate from “ACGT” is to do
sample(LETTERS[c(1,3,7,20)], size=10, replace=TRUE)
## [1] "C" "G" "T" "A" "T" "T" "G" "G" "A" "A"
Biostrings makes this slightly easier with the DNA_ALPHABET variable.
DNA_ALPHABET
## [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."
seq = sample(DNA_ALPHABET[1:4], size=10, replace=TRUE)
seq
## [1] "A" "G" "C" "T" "C" "G" "T" "C" "T" "T"
Convert seq into a string with
seq = paste(seq, collapse="")
seq
## [1] "AGCTCGTCTT"
The XString
class allows us to create, store, and work with different types of strings. We are only allowed to work with subclasses of XString
: BString
, DNAString
, RNAString
, and AAString
.
Let’s create a BString object:
bstring = BString("I am a BString object")
bstring
## 21-letter "BString" instance
## seq: I am a BString object
Obtain the length of bstring with R’s “length” function:
length(bstring)
## [1] 21
Now make a DNAString object:
dnastring = DNAString("TTGAAA-CTC-N")
dnastring
## 12-letter "DNAString" instance
## seq: TTGAAA-CTC-N
As before, we can compute the length of dnastring with
length(dnastring)
## [1] 12
According to the manual, the differences with a BString object are:
We can look at the slots of any XString object with the str() function:
str(dnastring)
## Formal class 'DNAString' [package "Biostrings"] with 5 slots
## ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. ..@ xp :<externalptr>
## .. .. ..@ .link_to_cached_object:<environment: 0x632f648>
## ..@ offset : int 0
## ..@ length : int 12
## ..@ elementMetadata: NULL
## ..@ metadata : list()
slotNames(dnastring)
## [1] "shared" "offset" "length" "elementMetadata"
## [5] "metadata"
Some useful functions:
alphabetFrequency(dnastring)
## A C G T M R W S Y K V H D B N - + .
## 3 2 1 3 0 0 0 0 0 0 0 0 0 0 1 2 0 0
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
## A C G T other
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000
letterFrequency(dnastring, "A", as.prob=TRUE)
## A
## 0.25
reverseComplement(dnastring)
## 12-letter "DNAString" instance
## seq: N-GAG-TTTCAA
We can access the individual letters in each string using the [] operator.
bstring[3]
## 1-letter "BString" instance
## seq: a
dnastring[2]
## 1-letter "DNAString" instance
## seq: T
If we want a substring, we specify a range of indices.
dnastring[7:12]
## 6-letter "DNAString" instance
## seq: -CTC-N
Remember that R uses 1-based indexing, meaning that the first element in a string or any vector gets index 1. While this method of obtaining substrings is easy and intuitive, we DO NOT recommend it, especially for large strings. Biostrings provides a function “subseq” that takes advantage of the XString internals.
subseq(bstring, start=3, end=3)
## 1-letter "BString" instance
## seq: a
subseq(bstring, 3, 3)
## 1-letter "BString" instance
## seq: a
subseq(dnastring, 7, 12)
## 6-letter "DNAString" instance
## seq: -CTC-N
We can convert any XString object to a character vector with
toString(bstring)
## [1] "I am a BString object"
toString(dnastring)
## [1] "TTGAAA-CTC-N"
The results are character vectors of length 1. Check this with
length(toString(bstring))
## [1] 1
We can compare XStrings with characters and other XStrings.
bb = subseq(bstring, 3, 6)
bb == "am a"
## [1] TRUE
bb == BString("am a")
## [1] TRUE
Try not to compare different XString classes with one another.
bstring == dnastring
## Error in bstring == dnastring: comparison between a "BString" instance and a "DNAString" instance is not supported
bstring != dnastring
## Error in e1 == e2: comparison between a "BString" instance and a "DNAString" instance is not supported
This is a useful way to view multiple subsequences of a XString.
view = Views(dnastring, start=3:0, end=5:8)
view
## Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
## start end width
## [1] 3 5 3 [GAA]
## [2] 2 6 5 [TGAAA]
## [3] 1 7 7 [TTGAAA-]
## [4] 0 8 9 [ TTGAAA-C]
To check the number of views, use
length(view)
## [1] 4
We can select a subset of the views using the [] operator.
view[4:2]
## Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
## start end width
## [1] 0 8 9 [ TTGAAA-C]
## [2] 1 7 7 [TTGAAA-]
## [3] 2 6 5 [TGAAA]
The returned object is still a XStringViews object, even if we select only one element. To extract any given view as a XString object, use the [[]] operator.
view[[2]]
## 5-letter "DNAString" instance
## seq: TGAAA
To see all views but the 3rd one, use
view[-3]
## Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
## start end width
## [1] 3 5 3 [GAA]
## [2] 2 6 5 [TGAAA]
## [3] 0 8 9 [ TTGAAA-C]
The XStringSet objects provide a convenient way of storing several XString objects together.
set = NULL
for (i in 1:4)
set = c(set, paste(sample(DNA_ALPHABET[1:4], 10, replace=T), collapse=""))
set
## [1] "AGACCACTCC" "GCATGTAGCT" "GTGGTACGGC" "TCAAACGGCT"
length(set)
## [1] 4
Now convert into a set of DNAStrings:
set = DNAStringSet(set)
set
## A DNAStringSet instance of length 4
## width seq
## [1] 10 AGACCACTCC
## [2] 10 GCATGTAGCT
## [3] 10 GTGGTACGGC
## [4] 10 TCAAACGGCT
length(set)
## [1] 4
The functions “reverseComplement”, “alphabetFrequency”, “width”, and “subseq” work on each of the XStrings in a XStringSet.
reverseComplement(set)
## A DNAStringSet instance of length 4
## width seq
## [1] 10 GGAGTGGTCT
## [2] 10 AGCTACATGC
## [3] 10 GCCGTACCAC
## [4] 10 AGCCGTTTGA
alphabetFrequency(set, baseOnly=T, as.prob=T)
## A C G T other
## [1,] 0.3 0.5 0.1 0.1 0
## [2,] 0.2 0.2 0.3 0.3 0
## [3,] 0.1 0.2 0.5 0.2 0
## [4,] 0.3 0.3 0.2 0.2 0
letterFrequency(set, "A", as.prob=T)
## A
## [1,] 0.3
## [2,] 0.2
## [3,] 0.1
## [4,] 0.3
width(set)
## [1] 10 10 10 10
subseq(set, 1, 3)
## A DNAStringSet instance of length 4
## width seq
## [1] 3 AGA
## [2] 3 GCA
## [3] 3 GTG
## [4] 3 TCA
You can name each XString in the set:
names(set) = paste("name", 1:4, sep="")
We’ll look at the Escherichia coli APEC O1 genome (NC 008563). This can be found in the BSgenome package (along with genomes from other model organisms). Download and install the package data:
biocLite("BSgenome.Ecoli.NCBI.20080805")
We can now load the genome with
require(BSgenome.Ecoli.NCBI.20080805)
eco = Ecoli$NC_008563
Sample the genome by generating 1000 random views of random widths (50 - 100).
Use the alphabetFrequency and reverseComplement functions on these views.
Repeat 1-2 but this time with only 100 samples.
Compare the results of alphabetFrequency. How do both of them compare with the entire genome?
Generating the views:
views1000 = Views(eco, start=sample(length(eco), 1000, replace=T), width=sample(50:100, 1000, replace=T))
views100 = Views(eco, start=sample(length(eco), 100, replace=T), width=sample(50:100, 100, replace=T))
Computing the base frequencies:
alphabetFrequency(views1000, baseOnly=T, as.prob=T, collapse=T)
## A C G T other
## 0.2436190835 0.2540654784 0.2542674062 0.2480211082 0.0000269237
alphabetFrequency(views100, baseOnly=T, as.prob=T, collapse=T)
## A C G T other
## 0.2511641 0.2521227 0.2515749 0.2451383 0.0000000
alphabetFrequency(eco, baseOnly=T, as.prob=T)
## A C G T other
## 2.471704e-01 2.529128e-01 2.525602e-01 2.473315e-01 2.518681e-05
Just like Views, can be used to look at subsequences. For example, a common task is to specify a list of starting positions on a chromosome, and then look at subsequences of a given length with that start point.
ir1 = IRanges(start=1:10, width=10:1)
ir2 = IRanges(start=1:10, end=11)
ir3 = IRanges(end=11, width=10:1)
Lets look at the structure of an IRanges object.
str(ir1)
## Formal class 'IRanges' [package "IRanges"] with 6 slots
## ..@ start : int [1:10] 1 2 3 4 5 6 7 8 9 10
## ..@ width : int [1:10] 10 9 8 7 6 5 4 3 2 1
## ..@ NAMES : NULL
## ..@ elementType : chr "integer"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
We can extract the starts, ends, and widths with
start(ir1)
## [1] 1 2 3 4 5 6 7 8 9 10
end(ir1)
## [1] 10 10 10 10 10 10 10 10 10 10
width(ir1)
## [1] 10 9 8 7 6 5 4 3 2 1
As in Views, we can subset an IRanges object to get another IRanges object.
ir1[1:4]
## IRanges of length 4
## start end width
## [1] 1 10 10
## [2] 2 10 9
## [3] 3 10 8
## [4] 4 10 7
We can subset using logical expressions too.
ir1[start(ir1) <= 6]
## IRanges of length 6
## start end width
## [1] 1 10 10
## [2] 2 10 9
## [3] 3 10 8
## [4] 4 10 7
## [5] 5 10 6
## [6] 6 10 5
We’ll make a function to plot an IRanges object, and we will use basic R plotting functions.
library(ggplot2)
plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
height = 1
if (is(xlim, "Ranges"))
xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
bins = disjointBins(IRanges(start(x), end(x)+1))
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
ybottom = bins * (sep + height) - height
rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = "blue")
title(main)
axis(1)
}
plotRanges(ir1)
We can also use ggplot to make the plots.
library(ggplot2)
plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
height = 1
if (is(xlim, "Ranges"))
xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
bins = disjointBins(IRanges(start(x), end(x)+1))
ybottom = bins*(sep+height) - height
df = data.frame(ybottom = ybottom, xleft = start(x) -.5,
xright = end(x) + .5, ytop = ybottom + height)
ggplot(df) + geom_rect(aes(xmax = xright, xmin = xleft, ymax = ytop, ymin = ybottom)) +
labs(title = main)
}
Notice that our ranges overlap. What if we want it so that the ranges do not overlap?
reduce(ir1)
## IRanges of length 1
## start end width
## [1] 1 10 10
plotRanges(reduce(ir1))
The result is a NormalIRanges object. An IRanges object is normal when its ranges are: 1. not empty (i.e. they have a non-null width); 2. not overlapping; 3. ordered from left to right; 4. not even adjacent (i.e. there must be a nonempty gap between 2 consecutive ranges).
GC content is the precentage of the genome (or DNA fragment) that is “G” or “C”. To compute the GC content, we count the occurrences of the “G” and “C” alphabets, and divide by the length of the string in question.
We will be using data from chr8 of the human genome version 19 from the UCSC genome repository. (If you have plenty of space and a larger computer, you can prepare the data yourself or take a different chromosome by using the Bioconductor data package BSgenome.Hsapiens.UCSC.hg19, otherwise you will be able to load just a subset of chr8).
A genomic window is defined as a CpG island if its GC content>50% and observed-to-expected CG ratio>0.6. CpG islands are said to mark important regions in the genome because over 65% of gene promoter regions can be found in with CpG islands.
We want to look at this for the Human Chromosome 8. You can get the data either from installing the data from bioconductor (about 800Mb, in a nice format for later reference) which we would like you to do. You can also get the data from downloading from the class website if you haven’t already gotten the data (about 140Mb). If you do not have time to do this right now, you can skip ahead and load the data seqChr8Islands
and seqChr8NonIslands
by downloading this file and using load("cpg.RData")
.
If you haven’t already downloaded the bioconductor package, you can get it with this:
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")
If you already have the genome installed, run:
require(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome.Hsapiens.UCSC.hg19
seqChr8 = unmasked(Hsapiens$chr8)
If you don’t have that installed, then you can download the necessary file for this lab which is 140Mb. We don’t want everyone downloading this from the class website though as that will be a lot of traffic. Having said that, you can download the seqChr8.fasta file to your working directory, and run this:
seqChr8 = readDNAStringSet("seqChr8.fasta")[[1]]
Next, we’re going to download the locations of the CpG islands in this version of the genome from R. Irizarray’s website: http://rafalab.jhsph.edu/CGI/model-based-cpg-islands-hg19.txt
. The file model-based-cpg-islands-hg19.txt is also on our data website. Download it here: model-based-cpg-islands-gh19.txt. Save it in your working directory and you can read it into R with the following commands:
cpglocs=read.table("model-based-cpg-islands-hg19.txt",header=T)
Then we retain only the start and stop positions for chromosome 8:
cpglocs8=cpglocs[which(cpglocs[,1]=="chr8"),2:3]
#Look at the dimensions of the cpglocs object,
dim(cpglocs8)
## [1] 2855 2
#Look at the first and last few rows:
cpglocs8[1:5,]
## start end
## 56961 32437 33708
## 56962 40354 40656
## 56963 44536 46203
## 56964 99457 100721
## 56965 155954 156761
cpglocs8[2850:2855,]
## start end
## 59810 146125639 146125885
## 59811 146126089 146128165
## 59812 146175867 146176338
## 59813 146228006 146228907
## 59814 146232962 146233086
## 59815 146276730 146278360
# We create a new matrix, what does it contain?
nonilocs8=matrix(0,ncol=2,nrow=2856)
nonilocs8[1,1]=10000
nonilocs8[1,2]=cpglocs8[1,1]-1
nonilocs8[2:2856,1]=cpglocs8[1:2855,2]+1
nonilocs8[2:2855,2]=cpglocs8[2:2855,1]-1
nonilocs8[2856,2]=146304021
Now we’ll create DNAStringSets that contain the islands and non-islands.
seqChr8Islands = DNAStringSet(seqChr8, start=cpglocs8[,1], end=cpglocs8[,2])
seqChr8NonIslands = DNAStringSet(seqChr8, start=nonilocs8[,1], end=nonilocs8[,2])
Look at the frequencies of the CG digrams in both sets:
freqIslands = vcountPattern("CG", seqChr8Islands) / width(seqChr8Islands)
freqNonIslands = vcountPattern("CG", seqChr8NonIslands) / width(seqChr8NonIslands)
freqCombined = rbind(data.frame(freq = freqIslands, type = "islands"),
data.frame(freq = freqNonIslands, type = "non-islands"))
b <- seq(0, 0.25, length = 50)
hist(freqNonIslands, col = rgb(0, 0, 1, 0.5), xlim = c(0, 0.25), breaks = b,
main = "Frequencies of CG Digrams")
hist(freqIslands, col = rgb(1, 0, 0, 0.5), add = T, breaks = b)
legend("topright", c("in NonIsland regions", "in Island regions"), fill = c(rgb(0,
0, 1, 0.5), rgb(1, 0, 0, 0.5)))
Plotting the above using ggplot can be done with the followinng calls
ggplot(freqCombined) +
geom_histogram(aes(fill = type, x = freq), alpha = .7, position = "identity",
binwidth = .005) +
labs(title = "Frequency of CG digrams")
Exercise: Display GC content in a running window along the sequence of Staphylococcus Aureus. For this excercise we read in a fasta file sequence from a file:
staph = readDNAStringSet("./data/staphsequence.ffn.txt", "fasta")
staph[1:3]
## A DNAStringSet instance of length 3
## width seq names
## [1] 1362 ATGTCGGAAAAAGAAATTTGG...AAGAAATAAGAAATGTATAA lcl|NC_002952.2_c...
## [2] 1134 ATGATGGAATTCACTATTAAA...TACCAATCAGAACTTACTAA lcl|NC_002952.2_c...
## [3] 246 GTGATTATTTTGGTTCAAGAA...TTCATCAAGGTGAACAATGA lcl|NC_002952.2_c...
This shows the first 3 sequences in the set. If you type in ‘staph’ it will print out all the sequences. Now, suppose we want to find the average GC content in the sequence windows of width 100 bases.
letterFrequency(staph[[1]], letters="ACGT", OR=0)
## A C G T
## 522 219 229 392
GCstaph = data.frame(ID=names(staph), GC=rowSums(alphabetFrequency(staph)[, c(2,3)]/width(staph))*100)
Now we would like to display the GC content in a sliding window.
window = 100
# compute the GC content in a sliding window (as a fraction) for a sequence no. 364
gc = rowSums(letterFrequencyInSlidingView(staph[[364]], window, c("G", "C")))/window
plot(gc, type = 'l')
Now, we would like to add a smoothing line to observe the overall trends.
plot(1:length(gc), gc)
lines(lowess(x = 1:length(gc), y= gc, f = 0.10), col = 12, lwd = 2)
Answer the questions on OHMS “Lab 2: Biostrings”. Go to https://web.stanford.edu/class/bios221/cgi-bin/index.cgi/ to answer some questions. You may NOT work with other students to answer the questions on OHMS. You will need a Stanford ID to log in to OHMS.