In this lab, we'll learn how to simulate data in R using
random number generators, including different kinds of
**mixture variables** that we control.

Unlike previous labs, where the homework was done via OHMS, this lab will require you to submit short answers, plots (as polished as possible!), and some code. The questions are numbered 0), 1), 2), …

**Formatting** We prefer that you write a markdown file with your
answers (name the file MyName_Lab3.Rmd) using RStudio; the knit
button will generate an html file, making it easy to produce
polished output. Then put these files (an Rmd, a figure folder,
and an html) in your dropbox folder on coursework.

When you want a random number generator to produce the same results each time, you must set a starting point (a seed). This matters whenever you want to reproduce the results of a simulation or algorithm, and it is especially important in debugging.

Compare for instance the following output:

```
vecpoisson=rpois(100,5)
mean(vecpoisson)
```

```
## [1] 4.74
```

If you do this a lot, you get many different values. In fact, if you collected all of those values together, you would obtain what we call the sampling distribution of the mean of 100 Poisson random variables.
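A minimal sketch of that sampling distribution (base R only; `replicate()` is covered in more detail below):

```r
# Sketch: repeat the experiment 1,000 times and look at the spread of the means.
set.seed(1)
means <- replicate(1000, mean(rpois(100, 5)))
summary(means)   # the means cluster around 5, the true Poisson mean
hist(means, breaks=30, main="Sampling distribution of the mean")
```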

0) Why don't you get the same answer every time?

Now try:

```
set.seed(198911)
vecpoisson=rpois(100,5)
mean(vecpoisson)
```

```
## [1] 4.8
```

```
set.seed(198911)
vecpoisson=rpois(100,5)
mean(vecpoisson)
```

```
## [1] 4.8
```

Recall from probability that a sum of independent exponentials has a gamma distribution. In this section, we will confirm that by simulation and cover some helpful functions in R. In general, we want to avoid `for` loops in R, since they are slower than working with functions such as `apply()`.

We will generate 5 samples from an exponential with rate parameter 0.1 and sum them together: `sum(rexp(n=nexps, rate=rate))`. The function `replicate()` allows us to do this many times with very little code. Here, we do this sum 50,000 times to get an idea of the distribution.

```
reps <- 50000
nexps <- 5
rate <- 0.1
set.seed(0)
system.time(
x1 <- replicate(reps, sum(rexp(n=nexps, rate=rate)))
) # replicate
```

```
## user system elapsed
## 0.304 0.003 0.307
```

```
head(x1)
```

```
## [1] 38.01 42.64 85.13 58.21 16.67 81.43
```

To confirm that this is correct, we can make a histogram or a qqplot of the simulation and compare it to what we know is the truth.

```
require(ggplot2)
ggplot(data.frame(x1), aes(x1)) +
geom_histogram(aes(y=..density..)) +
stat_function(fun=function(x)dgamma(x, shape=nexps, scale=1/rate),
color="red", size=2)
```
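The text above also mentions a qqplot; a minimal base-R sketch of that check (assuming `x1`, `nexps`, and `rate` from the chunks above) compares the simulated quantiles to the theoretical gamma quantiles:

```r
# Q-Q plot of the simulated sums against the Gamma(shape=nexps, scale=1/rate)
# quantiles; the points should fall near the identity line if the fit is good.
nexps <- 5; rate <- 0.1
set.seed(0)
x1 <- replicate(50000, sum(rexp(n=nexps, rate=rate)))
qqplot(qgamma(ppoints(length(x1)), shape=nexps, scale=1/rate), x1,
       xlab="Theoretical gamma quantiles", ylab="Simulated quantiles")
abline(0, 1, col="red")
```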

There are different ways to do this. Look at `help(replicate)` and you will see that there are related functions `sapply()`, `lapply()`, and `vapply()`. These are cousins of the functions `apply()` and `tapply()`.

Here is how you might do the same thing with `sapply()`. You can plot the result using the same commands as above.

```
set.seed(0)
system.time(x1 <- sapply(1:reps, function(i){sum(rexp(n=nexps, rate=rate))})) # simple apply
```

```
## user system elapsed
## 0.277 0.017 0.295
```

```
head(x1)
```

```
## [1] 38.01 42.64 85.13 58.21 16.67 81.43
```

```
set.seed(0)
system.time(x1 <- lapply(1:reps, function(i){sum(rexp(n=nexps, rate=rate))})) # list apply
```

```
## user system elapsed
## 0.231 0.001 0.232
```

```
head(x1)
```

```
## [[1]]
## [1] 38.01
##
## [[2]]
## [1] 42.64
##
## [[3]]
## [1] 85.13
##
## [[4]]
## [1] 58.21
##
## [[5]]
## [1] 16.67
##
## [[6]]
## [1] 81.43
```

When we apply a very simple function (e.g., a sum), the fastest way is often to make a matrix of all the simulations and then apply that function to the matrix appropriately. The functions `rowSums()` and `colSums()` are particularly efficient at this.

```
set.seed(0)
system.time(x1 <- apply(matrix(rexp(n=nexps*reps, rate=rate), nrow=nexps),2,sum)) # apply on a matrix
```

```
## user system elapsed
## 0.114 0.003 0.117
```

```
head(x1)
```

```
## [1] 38.01 42.64 85.13 58.21 16.67 81.43
```

```
set.seed(0)
system.time(x1 <- colSums(matrix(rexp(n=nexps*reps, rate=rate), nrow=nexps))) # using colSums
```

```
## user system elapsed
## 0.018 0.001 0.018
```

```
head(x1)
```

```
## [1] 38.01 42.64 85.13 58.21 16.67 81.43
```

If you have a multi-core processor, you can sometimes speed things up by taking advantage of parallelization. The package `parallel` provides the function `mclapply()`, which acts very similarly to `lapply()`. You can set `mc.cores` to control how many cores are used.
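A sketch of using more cores, assuming your machine has them (`detectCores()` is also in the `parallel` package):

```r
require(parallel)
ncores <- detectCores()  # how many cores this machine reports
# Same simulation as before, spread across all reported cores.
x1 <- mclapply(1:50000, function(i) sum(rexp(n=5, rate=0.1)),
               mc.cores=ncores)
```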

```
require(parallel)
```

```
set.seed(0)
system.time(x1 <- mclapply(1:reps, function(i){sum(rexp(n=nexps, rate=rate))})) # multi-cluster apply
```

```
## user system elapsed
## 0.385 0.095 0.250
```

```
head(x1)
```

```
## [[1]]
## [1] 28.92
##
## [[2]]
## [1] 67.89
##
## [[3]]
## [1] 26.47
##
## [[4]]
## [1] 37.11
##
## [[5]]
## [1] 31.21
##
## [[6]]
## [1] 24.9
```

Do not use a `for` loop for any of these questions. You must show your code in order to get credit.
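As a pattern (not an answer to any particular question), a Monte Carlo estimate of a probability usually looks like: replicate the experiment many times, compute the statistic each time, then take the mean of the event indicator. A generic sketch:

```r
# Generic pattern: estimate P(statistic > threshold) without a for loop.
set.seed(1)
sims <- replicate(10000, mean(rexp(10)))  # example statistic: mean of 10 exponentials
mean(sims > 1.5)                          # fraction of replicates where the event occurred
```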

**Question 1**

Let \(U_1, U_2, U_3\) all come from a uniform(0,1) distribution. Let \(M = \max(U_1, U_2, U_3)\). Estimate (to 3 significant digits) the probability \(\mathbb{P}(M > 0.75)\).

**Question 2**

Let \(Z_{(n)}\) be the maximum of \(n\) standard normal observations. Estimate what \(n\) should be so that \(\mathbb{P}(Z_{(n)} > 4) = 0.25\).

**Question 3**

Let \(X_1, \dots, X_n\) be Poisson random variables with parameter \(\lambda = 0.5\) and where \(n=21\). Estimate the probability that the sample mean is greater than the sample median.

**Question 4**

Let \(U\) come from a uniform(0,1) distribution and \(Z\) come from a standard normal distribution, independently. Let \(X = pZ + (1-p)U\). Estimate the value of \(p\) for which \(X\) has variance 0.4.

We will warm up by generating some random normal variables. Generate 1000 samples from the \(N(0,1)\) distribution:

```
samples = rnorm(1000, 0, 1)
```

**Question 5**
Check that these are from \(N(0,1)\) using a quantile-quantile plot (Q-Q plot). Use the stat_qq() function in the ggplot2 package.

Now load data that has been sampled from a “mystery” distribution. You can download the RData file here.

```
load("mystery_samples.RData")
head(samples)
```

```
## [1] 27 29 31 26 30 18
```

**Question 6**
Which distribution do you think the samples come from? Hint: look at the histogram; the distplot() function from the vcd package might be useful. You might also want to check the mean and variance.

Take a sample of size 1,000,000 from a Normal with mean 0 and variance 1 and a sample of size 1,500,000 from a normal with mean 3 and variance 1. Then plot the histogram with a binwidth of 0.25:

```
require(ggplot2)
sampa=rnorm(1000000,0,1)
sampb=rnorm(1500000,3,1)
combined = c(sampa, sampb)
plt = ggplot(data.frame(combined), aes(x=combined)) + stat_bin(binwidth=0.25, position="identity")
plt
```

**Question 7**

- a) How many modes are there? Is this surprising?
- b) Now plot the density using geom_density().
- c) How is density related to hist?

Try making 3,000,000 points from a mixture of two normal groups with the means separated by 1 and the standard deviation of the second group twice that of the first. The first group should be twice the size of the second.

```
pop1 = rnorm(2000000)
pop2 = rnorm(1000000, 1, 2)
combined = c(pop1, pop2)
plt = ggplot(data.frame(data=c(combined, pop1, pop2),
                        labels=rep(c("combined", "pop1", "pop2"), c(3e6, 2e6, 1e6))),
             aes(x=data)) +
  stat_bin(aes(fill=labels), position="identity", binwidth=0.25, alpha=0.5) +
  theme_bw()
plt
```

**Question 8**

- a) What does this histogram show?
- b) Plot the densities of pop1 and pop2 on the same axes.

We want to generate a random DNA string with an *independent* background multinomial of \((p_A,p_C,p_G,p_T)=(0.2,0.3,0.2,0.3)\). Load the Biostrings package.

```
require(Biostrings)
```

**Question 9**
Show how to do this using sample().

Now we want to generate according to a Markov Chain with the following transition matrix:

```
trans = matrix(c(0.2, 0.2, 0.19, 0.1, 0.27, 0.34, 0.31, 0.34, 0.39,
                 0.24, 0.35, 0.36, 0.14, 0.22, 0.15, 0.20), nrow=4,
               dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T")))
trans
```

```
##      A    C    G    T
## A 0.20 0.27 0.39 0.14
## C 0.20 0.34 0.24 0.22
## G 0.19 0.31 0.35 0.15
## T 0.10 0.34 0.36 0.20
```

Begin by generating a nucleotide from the stationary distribution \((p_A,p_C,p_G,p_T)=(0.18,0.325,0.325,0.18)\). From here, generate the subsequent letters according to the chain's transition matrix.

**Question 10**
Generate a string of size 50 from this. Because we will be indexing into the transition matrix, it will be convenient to generate the indices instead of the alphabets.

Generate two clusters with 100,000 points each:

```
require(MASS)
Sigma=matrix(c(5,3,3,2),2,2)
ex1=mvrnorm(100000,rep(0,2),Sigma)
Sigma=matrix(c(9,-5,-5,5),2,2)
ex2=mvrnorm(n=100000, rep(3, 2), Sigma)
```

**Question 11**
Plot both clusters on the same axes, along with 2d contours of their densities. You will want to use geom_point() and geom_density2d().

We show how to compute the probability of simple events using simulation.

Suppose we rolled two fair dice. What is the probability that their sum is at least 7? We will approach this by simulating many throws of two fair dice, and then computing the fraction of those trials whose sum is at least 7. It will be convenient to write a function that simulates the trials and returns TRUE if the sum is at least 7 (we call this an event), and FALSE otherwise.

```
isEvent = function(numDice, numSides, targetValue, numTrials){
  apply(matrix(sample(1:numSides, numDice*numTrials, replace=TRUE),
               nrow=numDice), 2, sum) >= targetValue
}
```

Now that we have our function, we are ready to do the Monte Carlo simulation. It is good practice to set the random seed for reproducibility and debugging.

```
set.seed(0)
#try 5 trials
outcomes = isEvent(2, 6, 7, 5)
mean(outcomes)
```

```
## [1] 1
```

This is far from the theoretical answer of \(\frac{21}{36}=0.58333\). Now try with 10,000 trials:

```
set.seed(0)
outcomes = isEvent(2, 6, 7, 10000)
mean(outcomes)
```

```
## [1] 0.5843
```

Ah, much better. The computation required for each trial here was trivial, but in general Monte Carlo simulations can be quite expensive, so we show how to parallelize the above computation in R using the `parallel` package.

```
require(parallel)
```

There is quite a bit to this package; we'll illustrate how to use pvec() for the dice example. We'll modify isEvent() so that it's fit for use with the pvec() function.

```
isEventPar = function(numDice, numSides, targetValue, trialIndices){
  sapply(trialIndices, function(i) {
    sum(sample(1:numSides, numDice, replace=TRUE)) >= targetValue
  })
}
```

We call pvec() with:

```
set.seed(0)
outcomes = pvec(1:10000, function(x) isEventPar(2, 6, 7, x))
mean(outcomes)
```

```
## [1] 0.5931
```

Note that the estimate (0.5931) differs from our earlier serial answer (0.5843) even though we set the same seed: with the default RNG, the random number streams seen by parallel workers are not the same as the serial stream, so serial and parallel runs are not directly comparable.
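If you need reproducible results in parallel, base R provides the L'Ecuyer-CMRG generator, which gives each worker its own reproducible stream (a sketch; see `?RNGkind` and the `mc.set.seed` argument of `mclapply()`):

```r
require(parallel)
RNGkind("L'Ecuyer-CMRG")  # RNG designed for reproducible parallel streams
set.seed(0)
res1 <- unlist(mclapply(1:4, function(i) rnorm(1), mc.set.seed=TRUE))
set.seed(0)
res2 <- unlist(mclapply(1:4, function(i) rnorm(1), mc.set.seed=TRUE))
identical(res1, res2)  # TRUE: the same seed reproduces the parallel streams
```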

**Question 12**
For a coin with probability \(p\) of landing heads, the expected number of tosses before we get the first head is \(\frac{1}{p}\). Verify this with simulation for \(p=0.5\).

Generate the means of the Poisson distribution by sampling from a gamma distribution. Note that `rpois()` is vectorized over its mean, so no loop is needed:

```
lambdas = rgamma(100, shape=2, scale=3)
samples = rpois(100, lambdas)
```

**Question 13**
Using the goodfit() and rootogram() functions in the vcd package, what distribution do you think fits the gamma-poisson mixture best? (Hint: Beware of the Chisquare test when some counts are smaller than 5.)

The power of a statistical test is the probability that the test rejects the null hypothesis if the alternative is true. There is rarely a closed form for the power, so we resort to simulation. An important question in many clinical trials is how many subjects (samples) do we need to achieve a certain amount of power?

Suppose we want to find out how many samples are needed to distinguish between the means of two normal distributions, \(N(1, 0.5)\) and \(N(2, 0.5)\) with a power of at least 0.8 at the 0.05 significance level.

We'll take \(n\) samples from each population, and compute the statistic \(\frac{\bar{X}_1-\bar{X}_2}{\sqrt{(0.5^2+0.5^2)/n}}\). Under the null hypothesis that the two means are the same, this statistic has a \(N(0, 1)\) distribution, and the \(p\)-value is \(2P\left(N(0,1)\geq\left|\frac{\bar{X}_1-\bar{X}_2}{\sqrt{(0.5^2+0.5^2)/n}}\right|\right)\).

```
compute_power = function(n, sigma, numTrials){
  sampa = matrix(rnorm(n*numTrials, 1, sigma), ncol=numTrials)
  sampb = matrix(rnorm(n*numTrials, 2, sigma), ncol=numTrials)
  statistics = (apply(sampa, 2, mean) - apply(sampb, 2, mean)) / sqrt(2*sigma^2/n)
  return(mean(abs(statistics) >= qnorm(0.975)))
}
```

Let's try 3 and 4 samples:

```
set.seed(0)
compute_power(3, 0.5, 10000)
```

```
## [1] 0.6818
```

```
compute_power(4, 0.5, 10000)
```

```
## [1] 0.8091
```

So it looks like 4 samples will do it.
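As a sanity check under slightly different assumptions, base R's `power.t.test()` gives a closed-form answer for the two-sample t-test; it will not match our z-statistic simulation exactly, but it should be in the same ballpark:

```r
# Closed-form power calculation for a two-sample t-test (stats package).
power.t.test(delta=1, sd=0.5, power=0.8, sig.level=0.05)
```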

**Question 14**
Would it take fewer or more samples if the standard deviation were 5 instead of 0.5?
Verify your answer with simulation.