HW1 Stats 366 - Stats 166


Due by 11am, April 13th, 2012 (timestamp used as proof).

Send a complete pdf file to ahead@stanford.edu

Reading and revision

  • Review slides from this week.
  • What are ten words you looked up on wikipedia for clarification?

Do the Lab, make sure to provide a copy of your R commands and the output. Provide comments for both the functions you used and the input. If you can use Sweave, it is highly recommended, in any case, even if you use word, please hand in a pdf file.


Here is a simple example of good comments

GCcontent <- function(myseq)
 {  #####A function to compute GC content of the input sequence myseq
   mytable <- count(myseq, 1) # make a table with the count of As, Cs, Ts, and Gs
   mylength <- length(myseq) # find the length of the whole sequence
   myCs <- mytable[[2]] # number of Cs in the sequence
   myGs <- mytable[[3]] # number of Gs in the sequence
   myCG <- (myCs + myGs)/mylength

About Sweave (for the hardcore programmers who like latex only)

Some links to Sweave documentation:

LAB component

  1. Look up on the NCBI nucleotide page the genome of Mycobacterium tuberculosis (reference strain H37Rv). Explain the term fasta.
  2. Using either seqinr or geneR do not download the full genome of Mycobacterium tuberculosis H37Rv (this is large), but just the first contig (positions 1-341957).
  3. What is the length of the full genome of Mycobacterium tuberculosis H37Rv accession number AL123456 ?
  4. What are the last 12 nucleotides of the genome?
  5. From within R write a fasta file called mycontig1.fasta with the first 341957 base pairs.
  6. Check that you can read in the file with the read.fasta function.
  7. What is the GC content of the first 341957 base pairs?
  8. How many CG digrams are there?
  9. Is this close to the expected number under an independence model? (ie supposing that the occurrence of a G following a C is independent)
  10. Do a chisquare test to see if the difference is significant.
  11. Follow the instructions in chapter 2 of the little book of r to make a plot of GC content of a moving window of width 500 along the first contig (positions 1-341957).
  12. Download the genome with NCBI reference: NC_016893.1, call this seqyale.
  13. Put the frequencies (counts/total) of a,c,g,t in a vector of length 4 called pest.
  14. Simulate a new sequence of the same length as seqyale called simuseq by drawing independently from a multinomial distribution with probabilities \((p_a, p_t, p_c, p_g)=\) pest. (do not use a loop)
  15. Make a comparative plot of the frequencies of AT digrams in the simulated data simuseq and the seqyale data along the genome.