# Situation of resampling in contemporary statistics

This will be explained as a flow chart showing where Fisher's initial hypotheses come in and how the data are studied under the conjectured distribution, usually multivariate normal. Several books on Multivariate Analysis follow this paradigm and I will not go into the details, but it is worth bearing in mind that if one does know from previous studies, for instance, that the data are multivariate normal, only a simple estimation problem is left: all the information in the data is summarised by the mean and the variance-covariance matrix, and no more is to be said.

# The underlying principle

Suppose we are interested in the estimation of an unknown parameter (this does not mean that we are in a parametric context). The two main questions asked at that stage are:
• Question 1: What estimator should be used?
• Question 2: Having chosen an estimator, how accurate is it?

The second question has to be answered through information on the distribution or at least the variance of the estimator.

Of course there are answers in very simple contexts: for instance, when the parameter of interest is the mean, the estimator has a known standard deviation: the estimated standard error, sometimes noted

However, no such formula is available for the sample median, for instance.

In maximum likelihood theory, Question 1 is answered by using the MLE, and Question 2 can then be answered with an approximate standard error of

The bootstrap is a more general way to answer Question 2, with the following aspects:

• Less or no parametric modelling.
• More computation (a factor of 100 to 1000).
• Automatic, whatever the situation (can be complex).

If we had several samples from the unknown (true) distribution, then we could consider the variations of the estimator:

Such a situation never occurs in practice, so we replace these new samples by a resampling procedure based on the only information we have about the distribution, namely the empirical one: this is what is called bootstrapping.

# The bootstrap: Some Examples

## A binomial example

Suppose we don't know any probability or statistics, and we are told that
|         | Heart attacks | No attack | Subjects |
|---------|---------------|-----------|----------|
| Aspirin | 104           | 10933     | 11037    |
| Placebo | 189           | 10845     | 11034    |
The question: Is the true ratio, the parameter of interest, smaller than 1?

Or is the difference between the rates in the two populations strictly different from zero?

We could run a simulation experiment to find out.

>> [zeros(1,10) ones(1,2)]
ans =
0     0     0     0     0     0     0     0     0     0     1     1
>> [zeros(1,10) ones(1,2)]'
ans =
0
0
0
0
0
0
0
0
0
0
1
1
>> [zeros(1,10) ; ones(1,2)]
???  All rows in the bracketed expression must have the same
number of columns.
>> sample1=[zeros(1,109) ones(1,1)]';
>> sample2=[zeros(1,108) ones(1,2)]';
>> orig=sample1
>>[n,p]=size(orig)
n =     110
p =    1
>> thetab=zeros(1,1000);


File myrand.m for the bootstrap index creation:

function out=myrand(m,n,range)
%Returns an m-by-n matrix of integers
%drawn uniformly from 1 to range,
%with replacement

out=floor(rand(m,n)*range)+1;


File bsample.m for the bootstrap resample creation:

function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a
%matrix with nrow observations and ncol variables
[n,p]=size(orig);
indices=myrand(1,n,n);
out=orig(indices,:);



### Computing the bootstrap distribution for one sample

>> for (b =(1:1000))
res.bsample1=bsample(sample1);
thetab(b)=sum(res.bsample1==1);
end
>>hist(thetab)


This is what the histogram looks like:

Here is the complete data set computation:

>> sample1=[zeros(1,10933),ones(1,104)]';
>> sample2=[zeros(1,10845),ones(1,189)]';
>> thetab=zeros(1,1000);
>> for (b =(1:1000))
res.bsample1=bsample(sample1);
thetab(b)=sum(res.bsample1==1);
end


## Comparing to a hypothetical parameter

Suppose that we are trying to test

>> for (b =(1:1000))
res.bsample1=bsample(sample1);
thetab(b)=sum(res.bsample1==1)-2;
end
>>hist(thetab)
>> mean(thetab)
ans =
-0.9530
>> var(thetab)
ans =
1.0398
>> sum(thetab>0)
ans =
96


This is what the histogram looks like:

### Computing the bootstrap distribution for two samples

>>thetab=zeros(1,1000);
>> for (b =(1:1000))
res.bsample1=bsample(sample1);
res.bsample2=bsample(sample2);
thetab(b)=sum(res.bsample2==1)-sum(res.bsample1==1);
end
>>hist(thetab)


This is what the histogram looks like:
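The same two-sample computation can be written compactly in Python (an illustrative sketch, not the course's MATLAB code; the variable names are mine):

```python
import random

rng = random.Random(0)
# Aspirin: 104 heart attacks out of 11037; placebo: 189 out of 11034.
aspirin = [1] * 104 + [0] * 10933
placebo = [1] * 189 + [0] * 10845

def resample_count(x):
    """Number of heart attacks in one bootstrap resample of x."""
    return sum(rng.choices(x, k=len(x)))

# Bootstrap distribution of the difference in attack counts.
thetab = [resample_count(placebo) - resample_count(aspirin)
          for _ in range(200)]
frac_nonpositive = sum(t <= 0 for t in thetab) / len(thetab)
```

With an observed difference of 85 attacks, essentially none of the resampled differences is non-positive, which is the bootstrap's way of saying the difference between the two groups is real.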

## Without the computer

Sample one could be considered as the realization of a Binomial random variable, from some unknown Binomial distribution, for which the best estimate, given by maximum likelihood, would be:

The second sample would be considered as coming from another Binomial, in the most general case

Theoretically, what can we say about the distribution of ?

# Some notation

From an original sample

draw a new sample of observations from the original sample with replacement, each observation having the same probability of being drawn (). A bootstrap sample is often denoted

If we are interested in the behaviour of a random variable , then we can consider the sequence of new values obtained by computing it on new bootstrap samples.

Practically speaking, this requires generating integers between 1 and n, each integer having the same probability.

Here is an example of a line of matlab that does just that:

indices=randint(1,n,n)+1;


If we use S, we won't need to generate the new observations one by one; the following command generates an n-vector drawn with replacement from the vector of indices (1...n).

sample(n,n,replace=T)


An approximation of the distribution of the estimate is provided by the distribution of

denotes the bootstrap distribution of , often approximated by

1. Compute the original estimate from the original data.
2. For b=1 to B do: %B is the number of bootstrap samples
    1. Create a resample
    2. Compute
3. Compare to .
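The steps above can be sketched generically in Python (a hypothetical helper, not the course's MATLAB code):

```python
import random

def bootstrap(data, statistic, B=1000, seed=0):
    """Generic bootstrap: draw B resamples with replacement and
    return the statistic computed on each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]

# Example: bootstrap distribution of the mean of a small sample.
data = [94, 38, 23, 197, 99, 16, 141]
reps = bootstrap(data, lambda x: sum(x) / len(x))
```

Comparing the spread of `reps` to the original estimate is step 3 of the recipe above.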

## Accuracy of the sample mean

Using the linearity of the mean and the fact that the sample is iid we have

where is the usual estimate of the variance obtained from the sample.

If we were given true samples, and their associated estimates , we could compute the usual variance estimate for this sample of values, namely:

where
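As a numerical check (a Python sketch, not part of the notes), the bootstrap standard error of the sample mean should come out close to the usual formula s/sqrt(n), up to the plug-in factor sqrt((n-1)/n):

```python
import random, statistics

def boot_se_mean(data, B=2000, seed=1):
    """Bootstrap standard error of the sample mean."""
    rng = random.Random(seed)
    n = len(data)
    means = [statistics.fmean(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.stdev(means)

data = [94, 38, 23, 197, 99, 16, 141]
formula = statistics.stdev(data) / len(data) ** 0.5   # s / sqrt(n)
boot = boot_se_mean(data)
```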

Here are some computations for the mouse data (page 11 of the text):

treat=[94 38 23 197 99 16 141]'
treat =
    94
    38
    23
   197
    99
    16
   141
>> median(treat)
ans =
    94
>> mean(treat)
ans =
   86.8571
>> var(treat)
ans =
   4.4578e+03
>> var(treat)/7
ans =
  636.8299
>> sqrt(637)
ans =
   25.2389
thetab=zeros(1,1000);
for (b =(1:1000))
thetab(b)=median(bsample(treat));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   37.7768
>> mean(thetab)
ans =
   80.5110

This is what the histogram looks like:

Control group:

control=[52 104 146 10 51 30 40 27 46]';
>> median(control)
ans =
    46
>> mean(control)
ans =
   56.2222
>> var(control)
ans =
   1.8042e+03
>> var(control)/length(control)
ans =
  200.4660
>> sqrt(200.4660)
ans =
   14.1586
thetab=zeros(1,1000);
for (b =(1:1000))
thetab(b)=median(bsample(control));
end
hist(thetab)
>> sqrt(var(thetab))
ans =
   11.9218
>> mean(thetab)
ans =
   45.4370

This is what the histogram looks like:

Comparing the two medians, could we use the estimates of the standard errors to decide whether the difference between the two medians is significant?

## The combinatorics of the bootstrap distribution

As we noted in class, and as the histograms show, the main feature of the bootstrap distribution of the median is that it can take on very few values; in the case of the treatment group, for instance, . The simple bootstrap will always present this discrete character, even when the underlying distribution is known to be continuous. There are ways to fix this, and in many cases it won't matter, but it is an important feature.

### How many different bootstrap samples are there?

By different samples we mean samples that differ as sets, i.e. there is no difference between the sample , because the observations are exchangeable or the statistic of interest is a symmetric function of the sample: .
Definition:
The sequence of random variables is said to be exchangeable if the distribution of the vector is the same as that of , for any permutation of elements.

Suppose we condition on the sample of distinct observations ; there are as many different samples as there are ways of choosing objects out of a set of possible contenders, repetitions being allowed.

At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as, say, ; because of the exchangeability/symmetry property we can recode it as the vector counting the number of occurrences of each of the observations. In this recoding we have , and the set of all bootstrap resamples is the -dimensional simplex

Here is the argument I used in class to explain how big this set is. Each component of the vector is considered to be a box; there are boxes to contain balls in all, and we want to count the number of ways of distributing the n balls among the boxes. Write down the balls together with the separating bars that delimit the boxes: there are positions from which to choose the bars' positions. For instance, the vector above corresponds to: oo||o|oo| . Thus

Stirling's formula ( ) gives an approximation,

here is the function file approxcom.m

function out=approxcom(n)
out=round((pi*n)^(-.5)*2^(2*n-1));

that produces the following table of the number of resamples:
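The table itself did not survive conversion, but the exact counts and the approximation are easy to compare (a Python sketch mirroring approxcom.m):

```python
from math import comb, pi

def approxcom(n):
    """Stirling approximation to C(2n-1, n-1), as in approxcom.m."""
    return round((pi * n) ** -0.5 * 2 ** (2 * n - 1))

# Exact count versus approximation for a few sample sizes.
table = {n: (comb(2 * n - 1, n - 1), approxcom(n)) for n in (5, 10, 15, 20)}
```

For n = 5 the exact count is C(9, 4) = 126 against an approximate 129; the relative error shrinks as n grows.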

Are all these samples equally likely? Thinking about the probability of drawing the resample consisting of the same observation repeated times, by choosing the same index times in the integer uniform generation, should persuade you that this sample appears only once in draws, whereas the sample with each observation appearing once can be obtained in of the ways.

### Which is the most likely bootstrap sample?

The most likely resample is the original sample itself; the easiest way to see this is to consider:

### The multinomial distribution

In fact, when we draw bootstrap resamples we are just drawing from the multinomial distribution a vector , with each of the categories equally likely, , so that the probability of a possible vector is

This will be largest when all the 's are , thus the most likely sample in the bootstrap resampling is the original sample itself. Here is the table of the most likely values:
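The missing table can be recomputed from the multinomial formula (a Python sketch; the weight-vector notation follows the text):

```python
from math import factorial, prod

def resample_prob(w):
    """Multinomial probability of the weight vector w = (w_1,...,w_n):
    n! / (w_1! ... w_n!) * n^(-n), each index equally likely."""
    n = sum(w)
    return factorial(n) / prod(factorial(k) for k in w) / n ** n

n = 7
p_original = resample_prob([1] * n)              # each observation once
p_repeat   = resample_prob([n] + [0] * (n - 1))  # one observation n times
```

The original sample (all weights 1) is indeed far more likely than any single observation repeated n times.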

As long as the statistic is a reasonably smooth function of the observations, we can see that the discreteness of the bootstrap distribution is not a problem.

# Complete Enumeration

Theoretically, we could give a complete enumeration of the bootstrap sampling distribution; all we need to know is how to compute the statistic for all the bootstrap resamples. There is a way of going through all resamples as defined by the simplex ; it is called a Gray code. As an example, consider the law school data used by Efron (1982). One can only hope to enumerate completely for moderate sample sizes ( today). For larger sample sizes partial enumeration through carefully spaced points is discussed in Diaconis and Holmes (1994a). Another idea is to use a small dose of randomness, not as much as Monte Carlo, by doing a random walk between close points so as to still be able to use updating procedures; this is detailed in the case of exploring the tails of a bootstrap distribution in Diaconis and Holmes (1994b).

## The original Gray code

Let be the set of binary -tuples. This may be identified with the vertices of the usual -cube or with the set of all subsets of an element set. The original Gray code gives an ordered list of with the property that successive values differ only in a single place. For example, when such a list is

It is easy to give a recursive description of such a list, starting from the list for (namely 0, 1). Given a list of length , form by putting a zero before each entry in , and a one before each entry in . Concatenate these two lists by writing down the first followed by the second in reverse order. Thus from we get , and then the list displayed above for . For , the list becomes:
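The recursive construction can be written directly (a Python sketch; the notes give no code for this):

```python
def gray_list(n):
    """Reflected binary Gray code on n bits: prefix 0 to the list for
    n-1, prefix 1 to that same list reversed, and concatenate."""
    if n == 1:
        return ['0', '1']
    prev = gray_list(n - 1)
    return ['0' + s for s in prev] + ['1' + s for s in reversed(prev)]

codes = gray_list(3)
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

Successive words in the list differ in exactly one bit, which is the defining Gray-code property.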

Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.

One way to express this is as follows : let be the binary representation of the integer , and let be the string of rank in the Gray code list. Then and . For example, when , the list above shows the string of rank is ; now . So . Thus from a given string in the Gray code and the rank one can compute the successor. There is a parsimonious implementation of this in the algorithm given in the appendix. Proofs of these results can be found in Wilf (1989).
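The rank-to-word relation has a well-known closed form, gray(i) = i XOR (i >> 1), which can stand in for the successor rule described above (a Python sketch; the appendix algorithm itself is not reproduced here):

```python
def gray(i):
    """Gray code word of rank i: i XOR (i shifted right by one bit)."""
    return i ^ (i >> 1)

ranks = [gray(i) for i in range(8)]   # [0, 1, 3, 2, 6, 7, 5, 4]
```

Reading the ranks in binary reproduces the 3-bit list of the previous subsection.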

## Gray Codes for the Bootstrap

Bickel and Freedman (1981) carried out exact enumeration of the bootstrap distribution for the mean using the fast Fourier transform, Bailey (1992) used a similar approach for simple functionals of means. Fisher and Hall (1991) suggest exact enumeration procedures that we will compare to the Gray code approach in Section B below.

Let be the original data supposed to be independent and identically distributed from an unknown distribution on a space . The bootstrap proceeds by supposing that replacement of by , the empirical distribution, can provide insight on sampling variability problems.

Practically, one proceeds by repeatedly choosing from the points with replacement. This leads to bootstrap replications . There are such possible replications; however, these are not all different, and grouping together replications that generate the same subset, we can characterize each resample by its weight vector , where is the number of times appears in the replication. Thus .

Let the space of compositions of into at most parts be

Thus . We proceed by running through all compositions in a systematic way. Note that the uniform distribution on induces a multinomial distribution on

To form the exhaustive bootstrap distribution of a statistic one need only compute each of the statistics and associate a weight with it. The shift from to gives substantial savings. For the law school data, while .

Efficient updating avoids multiplying such large numbers by factors of . This is what makes the computation feasible. Gray codes generate compositions by changing two coordinates of the vector by one up and one down. This means that can be easily updated by multiplying and dividing by the new coordinates. Similar procedures, discussed in Section C below allow efficient changes in the statistics of interest.

### Gray codes for compositions.

Following earlier work by Nijenhuis, Wilf and Knuth, Klingsberg (1982) gave methods of generating Gray codes for compositions. We will discuss this construction briefly here, details can be found in Klingsberg (1982) and Wilf (1989).

For , the algorithm produces the 10 compositions of in the following order:

The easiest way to understand the generation of such a list is recursive: the construction of the -compositions of can be done through that of the compositions of .

For any , the 2-composition is just provided by the list , which is of length . So the 2-compositions out of 3 are

the 2-compositions out of 2 are

and 2-compositions out of 1 are

Finally there is only one 2-composition of :

The 3-out of 3 list is obtained by appending a to the list, a 1 to the list, a 2 to the list and a 3 to the list. These four lists are then concatenated by writing the first, the second in reverse order, the third in its original order followed by the fourth in reverse order. This is actually written:

and more generally

The same procedure leads to the 35 compositions of in the following order:

The lists generated in this way have the property that two successive compositions differ only by in two coordinates.

Klingsberg (1982) provides a simple nonrecursive algorithm that generates the successor of any composition in this Gray code. This is crucial for the implementation in the present paper. It requires keeping track of the whereabouts of the first two non-zero elements and an updating counter. Both a and a version of the algorithm are provided in the appendix.

We conclude this subsection by discussing a different algorithm due to Nijenhuis and Wilf (1978, pp. 40-46) which runs through the compositions in lexicographic order (reading from right to left). This algorithm was suggested for bootstrapping by Fisher and Hall (1991).

algorithm to run through compositions .

(1) Set .

(2) Let first with . Set .

(3) Stop when .

For example, the following list gives all 35 compositions in in the order produced by the N.W. algorithm
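For checking counts, here is a plain recursive enumeration of the weak compositions in Python (it runs in lexicographic order of the first coordinate, not Klingsberg's Gray-code order nor the exact N.W. order):

```python
from math import comb

def compositions(n, k):
    """All weak compositions of n into k nonnegative parts."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

comps = list(compositions(4, 4))   # the 35 compositions of 4 into 4 parts
```

The count C(n+k-1, k-1) = C(7, 3) = 35 matches the stars-and-bars formula of the previous section.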

# Balanced Bootstraps

Intermediate between the complete enumeration of all samples and the drawing of a few samples at random lies the Balanced Bootstrap; see Gleason (1988, American Statistician, 263-266) for a simple algorithm.

The idea is that, just as each observation has the same chance of appearing when we count all samples, we would like each observation to appear overall the same number of times after all bootstrap resamples have been drawn.

We can ensure this by taking a list of repetitions of each of the observations, repeating it times, taking a permutation of these values, and then taking the first as the first sample, and so on.
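A minimal Python sketch of this balanced scheme (following Gleason's idea as described above; the names are mine):

```python
import random

def balanced_indices(n, B, seed=0):
    """Gleason-style balanced bootstrap: list each of the n indices
    B times, permute the whole list, then cut it into B resamples of
    size n, so every observation appears exactly B times overall."""
    rng = random.Random(seed)
    pool = list(range(n)) * B
    rng.shuffle(pool)
    return [pool[b * n:(b + 1) * n] for b in range(B)]

samples = balanced_indices(7, 100)
```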

# Monte Carlo

This is the method used for drawing a sample at random from the empirical distribution. I will start by giving a history and general remarks about Monte Carlo methods for those who have never studied them before.

## What is a Monte Carlo Method?

There is not necessarily a random component in the original problem one wants to solve; usually it is a problem for which there is no analytical solution. An unknown (deterministic) parameter is expressed as a parameter of some random distribution, which is then simulated. The oldest well-known example is the estimation of by Buffon, in his needle-on-the-floorboards experiment, where, supposing a needle of the same length as the width between cracks, we have:

In physics and statistics, many of the problems Monte Carlo is used on take the form of estimating an integral unknown in closed form:

1. The crude, or mean-value, Monte Carlo method thus proposes to generate numbers uniformly from and take their average to estimate :

2. The hit-or-miss Monte Carlo method generates random points in a bounded rectangle and counts the number of 'hits' or points that are in the region whose area we want to evaluate.
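Both methods are easy to try on a toy integral, say the area under x^2 on [0,1], whose true value is 1/3 (a Python sketch, not from the notes):

```python
import random

rng = random.Random(0)
f = lambda x: x * x        # integrand; true integral over [0,1] is 1/3
n = 100_000

# Crude (mean-value) Monte Carlo: average f over uniform draws.
crude = sum(f(rng.random()) for _ in range(n)) / n

# Hit-or-miss: fraction of uniform points in [0,1]^2 under the curve.
hits = sum(rng.random() < f(rng.random()) for _ in range(n)) / n
```

Both estimates land near 1/3, but hit-or-miss spends two random draws per point, foreshadowing the variance comparison below.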

Which estimate is better?
This is similar to comparing statistical estimators in general.

There are certain desirable properties that we want estimators to have. Consistency, which ensures that as the sample size increases the estimates converge to the right answer, is ensured here by the properties of Riemann sums. Other properties of interest are:

• Unbiasedness.
• Minimal Variance.
Both of the above methods are unbiased; that is, when repeated many times, their average values are centred on the actual value .

So the choice between them lies in finding the one with the smaller variance. The heuristic I developed in class to see that hit-or-miss has the higher variance is based on the idea that the extra variance comes from the added randomness of generating both coordinates at random, instead of just the abscissae as in crude Monte Carlo.

More precisely, the variance of crude Monte Carlo is

and that of hit-or-miss Monte Carlo, which is just a Binomial(n, ), is:

The difference between these two variances is always positive:

Most improvements to Monte Carlo methods are variance-reduction techniques.

### Antithetic Resampling

Suppose we have two random variables that provide estimators for , and , that they have the same variance but that they are negatively correlated; then will provide a better estimate for because its variance will be smaller.

This is the idea in antithetic resampling (see Hall, 1989). Suppose we are interested in a real-valued parameter, and that we have ordered our original sample . For each resample and statistic we associate , obtained by taking a special permutation of the 's chosen to make the correlation between and as negative as possible. If the statistic is a smooth function of the mean, for instance, then the 'reversal permutation' that maps into , into , etc. is the best: the small sample values are exchanged with the larger observations, and the average of these two estimates gives an estimate with smaller variance.
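A sketch of the reversal permutation for the mean, in Python (illustrative; the seed and data choices are mine):

```python
import random, statistics

rng = random.Random(0)
data = sorted([94, 38, 23, 197, 99, 16, 141])  # pairing needs the sorted sample
n = len(data)

def antithetic_pair():
    """One bootstrap mean and its antithetic partner: index i is
    mapped to n-1-i, swapping small values for large ones."""
    idx = [rng.randrange(n) for _ in range(n)]
    t1 = statistics.fmean(data[i] for i in idx)
    t2 = statistics.fmean(data[n - 1 - i] for i in idx)
    return t1, t2

pairs = [antithetic_pair() for _ in range(2000)]
avg = [(a + b) / 2 for a, b in pairs]   # averaged antithetic estimates
plain = [a for a, _ in pairs]           # ordinary bootstrap means
```

On this data the averaged antithetic estimates have far smaller variance than the plain bootstrap means, as the negative correlation argument predicts.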

### Importance Sampling

This is often used in simulation, and is a method to work around the small-area problem. If we want to evaluate tail probabilities, or very small areas, we may have very few hits of our random number generator in that area. However, we can modify the random number generator to make that area more likely, as long as we take this into account when we do the summation. Importance sampling is based on the equalities:
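A standard illustration (a Python sketch with my own choice of target and proposal): estimate the normal tail probability P(Z > 4), which plain sampling almost never hits, by drawing from a shifted exponential and reweighting by the density ratio.

```python
import math, random

rng = random.Random(0)

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Proposal g(z) = exp(-(z - 4)) on (4, inf); weight is phi(z)/g(z).
n = 100_000
total = 0.0
for _ in range(n):
    z = 4.0 - math.log(1.0 - rng.random())   # draw from g by inversion
    total += phi(z) / math.exp(-(z - 4.0))
est = total / n                              # estimates P(Z > 4)
```

Every draw lands in the tail, so the estimator is accurate with a modest sample, where naive sampling would see roughly three tail hits per hundred thousand tries.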

# Theoretical Underpinnings

## Statistical Functionals

(Reference: Erich Lehmann, 1998, pp. 381-438.)

We often speak of the asymptotic properties of the sample mean . These refer to the sequence . These functions are the same in some sense for all sample sizes. The notion of a statistical functional makes this clearer.

Suppose we are interested in real-valued parameters. We often have a situation where the parameter of interest is a function of the distribution function ; such parameters are called statistical functionals. Examples:

Goodness of fit statistics:

is estimated by:

Ratio of two means.

We use the sample cdf

as the nonparametric estimate of the unknown distribution .

The usual estimates for these functionals are obtained by simply plugging in the empirical distribution function for the unknown theoretical one.

Thus taking into account that for any function we have:

the plug-in estimate for the mean is the usual sample estimate; for the variance, it is the biased estimate:
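The two variance estimates differ only by the factor (n-1)/n, which is easy to verify numerically (a Python check, not part of the notes):

```python
import statistics

data = [94, 38, 23, 197, 99, 16, 141]
n = len(data)
m = sum(data) / n

plug_in  = sum((x - m) ** 2 for x in data) / n   # plug-in (biased) variance
unbiased = statistics.variance(data)             # usual n-1 denominator

assert abs(plug_in - unbiased * (n - 1) / n) < 1e-9
```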

### Notions of Convergence

Convergence in Law
A sequence of cumulative distribution functions is said to converge in distribution to iff at all continuity points of .

We say that if the random variable has cdf and the rv has cdf , converges in law to , and we write

This does not mean that and are arbitrarily close, think of the random variables and .

Convergence in Probability

Note:
If where is a limit distribution and then .

### Why is the empirical cdf a good estimator of F?

This number has a binomial distribution with probability , so that the theoretical expectation is and . From this and de Moivre's central limit theorem for the binomial, we can show, for fixed real ,

This shows consistency pointwise.

Because of the result noted above, this also ensures that ; this is actually true uniformly in because Kolmogorov's statistic is pivotal:

has a distribution that does not depend on . In fact , as . This is easy to show for continuous , but it is also true for larger classes of cdfs (Serfling, Billingsley).

Definition: A statistic is said to be pivotal if its distribution does not depend on any unknown parameters.

Example: Student's statistic.

### Generalized Statistical Functionals

When we want to evaluate an estimator, construct confidence intervals, etc., we are usually interested in quantities that are functions of the unknown distribution , the empirical , and the sample size. Here are some examples:
1. The sampling distribution of the error:
2. The bias:
3. The standard error:
For each of these examples, what the bootstrap proposes is to replace by the empirical .

The bootstrap is said to work if

Typically, we will be in the situation where

If the above is true the bootstrap works if and only if:

We will not give any proofs in this class. (If you want to do a theoretical project, I can guide you to cases where this can be done.)

Examples: Estimating the bias of the median:

Bootstrapping a probability:

as in the case of finding an error distribution: . We can see that this amounts to finding the volume of a set under a known distribution, and Monte Carlo is the natural route. So in this new notation we see that bootstrapping is a two-step process:
• Estimate by .
• Approximate by its Monte Carlo approximation , with B as large as possible. When , .

Note: is not an unknown parameter.

# The jackknife

Suppose the bias is of the form:

even if the numerators are unknowns depending on the real distribution .

This method is called the jackknife.

This was the first time that the sample itself was manipulated: each observation is dropped in turn from the sample; the new distribution function is called when observation has been dropped, the functional estimate gives the estimator , and their average is denoted

We can use the jackknife for estimating the bias by:

We can show that if the bias is of the order as in , then the jackknife estimate is in

So the order of the bias has been decreased from to

Example:
Suppose for which the plug in estimate is:

### Example: Patch Data

Jackknife

%------------------------------------------
function out=jackbias(theta,orig)
%Estimate the bias using the jackknife
%theta has to be a character string containing
%a valid function name
[n,p]=size(orig);
lot=feval(theta,orig(2:n,:));
k=length(lot);
lo=zeros(n,k);
lo(1,:)=lot;
lo(n,:)=feval(theta,orig(1:(n-1),:));
for i=(2:(n-1))
lo(i,:)=feval(theta,orig([1:(i-1),(i+1):n],:));
end
%Jackknife bias estimate: (n-1) times (average of the
%leave-one-out estimates minus the original estimate)
out=(n-1)*(mean(lo)-feval(theta,orig));
%-------------------------------
function out=ratio(yszs)
%Computes the ratio of mean of first
%column of mean of second column
out=mean(yszs(:,1))/mean(yszs(:,2));
%-------------------------------------
>>z=oldpatch-placebo ; y=newpatch-oldpatch
>> yz=[y,z]
-1200        8406
2601        2342
-2705        8187
1982        8459
-1290        4795
351        3516
-638        4796
-2719       10238
>>ratio(yz)
-0.0713
>> jackbias('ratio',yz)
ans =    0.0080


Bootstrap Simulations

function out=bootbias(theta,orig,B)
thetabs=zeros(1,B);
for b=(1:B)
bs=bsample(orig);
thetabs(b)=feval(theta,bs);
end
theta0=feval(theta,orig);
out=mean(thetabs)-theta0;
%-----------------------------------
>> bootbias('ratio',yz,1000)
ans =    0.0085
%-----------------------------------
function out=bootmultin(orig,B)
[n,p]=size(orig);
out=zeros(B,n);
for b=1:B
inds=randint(1,n,n)+1;
for i=1:n
out(b,inds(i))=out(b,inds(i))+1;
end;
end
%-----------------------------------


# Cross Validation

Separate diagnostic data sets. After having iterated through several exploratory methods, varied the projections and looked for the 'best fit', there seems to be only one honest method of verifying whether one is overfitting noise or whether there really is a latent variable or a good prediction available, and that is to have another set of data of exactly the same type.

The best thing to do is, at the beginning of the study, to take a random sub-sample, without any particular stratification, and to put it aside for the confirmatory stage. Many scientists are mean with their data and have only just enough to model, but nowadays the expense of an extra 25% or so should be made, especially when the consequences of the study are medical; this is what Tukey and Mallows call a careful separate diagnostic.

## Cross-Validation when there is a response variable

When the above prescription is not followed and one of the variables has the status of a variable to be explained, it is possible, at computational expense (but who cares?), to redo the analysis leaving out part of the data and comparing with the reference set.

For instance, in Discriminant Analysis:
For each observation, do the analysis without it, and check whether or not it is well classified; this gives an unbiased estimate of the misclassification percentage. Cross-validation can thus be used when one variable has the particular status of being explained.

And in regression:
We want to estimate the prediction error:

This can be done by cross validation, writing:

However, it has also been used at the diagnostic stage in principal components, and in classification and regression trees, where it helps choose the size of an 'optimal tree'.
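The leave-one-out idea for simple least-squares regression can be sketched in Python (illustrative data and names; not the course code):

```python
import statistics

def loocv_error(x, y):
    """Leave-one-out cross-validation estimate of squared prediction
    error for the least-squares line y ~ a + b*x."""
    n = len(x)
    errs = []
    for i in range(n):
        xs = [x[j] for j in range(n) if j != i]
        ys = [y[j] for j in range(n) if j != i]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        b = (sum((u - mx) * (v - my) for u, v in zip(xs, ys))
             / sum((u - mx) ** 2 for u in xs))
        a = my - b * mx
        errs.append((y[i] - (a + b * x[i])) ** 2)  # error on held-out point
    return statistics.fmean(errs)

x = [1, 2, 3, 4, 5, 6, 7]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8]
err = loocv_error(x, y)
```

Each point is predicted from a fit that never saw it, which is what makes the error estimate honest.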

# Confidence Intervals

We call the unknown parameter and our estimate . Suppose that we had an ideal (unrealistic) situation in which we knew the distribution of ; we would be interested especially in its quantiles: denote the quantile by and the quantile by .

By definition we have and , giving a confidence interval:

1. Would we get the same answer without centering with regards to ?

What we called and were the ideal quantiles from which we build the confidence interval: . Estimated by , using the th quantile of , the 's do NOT cancel out.

2. These intervals would not be the same if we took the distribution of to mimic simply that of the 's (those will be seen later as the percentile method estimates).

## Properties and Improvements

As can be predicted, the bigger , the number of bootstrap resamples, the better the Monte Carlo approximations. Thus much recent work has been devoted to making the most of the computations. However, as we have seen, improving the MC approximation is not the only route to improving the bootstrap's performance; this can also be done:
• By choosing better quantities to bootstrap than the original estimate, through various transforms, some of which require a double-level bootstrap.
• By using techniques inspired by Monte Carlo methods: importance sampling, antithetic sampling, balanced sampling.
• By transforming the results obtained after an ordinary bootstrap process: bias correction (Efron, 1982), accelerated bias correction (Efron, 1987), for instance.

Preliminaries: Pivotal quantities
Construction of confidence intervals is (ideally) based on a pivotal quantity , that is, a function of the sample and the parameter whose distribution is independent of the parameter , the sample, or any other unknown parameter. Thus, for instance, no knowledge is needed about the parameters and of a normal variable to construct confidence intervals for both of these parameters, because two pivotal quantities are available:

Let denote the largest quantile of the distribution of the pivot , and the parameter space, set of all possible values for then

is a confidence set whose level is . Thus if , and denote the relevant quantiles for the distributions cited above the corresponding confidence intervals will be

and

Such an ideal situation is of course rare, and one has to make do with approximate pivots (sometimes called roots). Therefore errors will be made on the level () of the confidence sets. The level error is sensitive to how far is from being a pivot.

We will present several approaches for finding roots. One approach is to studentize the root, that is, divide it by an estimate of its standard deviation.

## Studentized Confidence Intervals

Suppose that we are interested in a parameter , and that is an estimate based on the n-sample .

We will suppose that the asymptotic variance of is with a corresponding estimate .

The studentized bootstrap or bootstrap-t method (Efron, 1982; Hall, 1988) is based on approximating the distribution function of by , the distribution of .

Denote by the th quantile of the distribution of ,

Thus the equitailed bootstrap-t confidence interval of coverage is :

One way of finding an estimate is to use a second-level bootstrap; this is rather expensive computation-wise, so one could prefer, as an intermediate solution, a jackknife estimate.
1. Compute the original estimate of $\theta$: THETA(0), and an estimate of its standard deviation: SIGMA(0).
2. For b=1 to B do:
   1. Generate a bootstrap sample $X^{*b}$.
   2. Compute the estimate of $\theta$ based on $X^{*b}$: THETA(b).
   3. Use $X^{*b}$ as original data and compute an estimate of the standard deviation of THETA(b)
      (requires another set of iterations): SIGMA(b).
   4. Compute R(b) = (THETA(b)-THETA(0))/SIGMA(b).
3. Compute CNA and CNCA, the $\alpha$ and $(1-\alpha)$'th quantiles of the sample formed by all the R's.
4. LO = THETA(0) - CNCA*SIGMA(0)
5. HI = THETA(0) - CNA*SIGMA(0)
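The jackknife estimate mentioned above as the cheaper inner loop can be sketched as follows (a minimal numpy version; the function name and test data are mine, not from the notes):

```python
# Sketch: jackknife estimate of the standard error of a statistic theta,
# usable in step 2.3 in place of a second-level bootstrap.
import numpy as np

def jackknife_se(x, theta):
    """Jackknife SE: (n-1)/n times the spread of leave-one-out replicates."""
    n = len(x)
    reps = np.array([theta(np.delete(x, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))

rng = np.random.default_rng(1)
x = rng.normal(size=30)
se_mean = jackknife_se(x, np.mean)
```

As a sanity check, for the sample mean the jackknife SE reduces exactly to the usual $s/\sqrt{n}$.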

### Correlation Coefficient Example

function out=corr(orig)
%Correlation coefficient
c1=corrcoef(orig);
out=c1(1,2);
%------------------------------------------
function interv=boott(orig,theta,B,sdB,alpha)
%Studentized bootstrap confidence intervals
%
theta0=feval(theta,orig);
[n,p]=size(orig);
thetab=zeros(B,1);
sdstar=zeros(B,1);
thetas=zeros(B,1);
for b=1:B
   indices=myrand(1,n,n);
   samp=orig(indices,:);
   thetab(b)=feval(theta,samp);
   %Compute the bootstrap se, se*
   sdstar(b)=feval('btsd',samp,theta,n,sdB);
   %Studentized statistic
   thetas(b)=(thetab(b)-theta0)/sdstar(b);
end
se=sqrt(var(thetab));
Pct=100*(alpha/2);
lower=prctile(thetas,Pct);
upper=prctile(thetas,100-Pct);
interv=[(theta0-upper*se) (theta0 - lower*se)];
%----------------------------------------------
function out=btsd(orig,theta,n,NB)
%Compute the bootstrap estimate
%of the std error of the estimator
%defined by the function theta
%NB number of bootstrap simulations
thetab=zeros(NB,1);
for b=1:NB
   indices=myrand(1,n,n);
   samp=orig(indices,:);
   thetab(b)=feval(theta,samp);
end
out=sqrt(var(thetab));
%----------------------------------------------
>> boott(law15,'corr',1000,30,.05)
-0.4737    1.0137
%----------------------------------------------
>> boott(law15,'corr',2000,30,.05)
-0.2899    0.9801
%----------------------


### Transformations of the parameter: Matlab example

function out=transcorr(orig)
%transformed correlation coefficient
c1=corrcoef(orig);
rho=c1(1,2);
out=.5*log((1+rho)/(1-rho));
>> transcorr(law15)
1.0362
>> tanh(1.03)
0.7739
>> boott(law15,'transcorr',100,30,.05)
-0.7841    1.7940
>> tanh(ans)
-0.6550    0.9462
>> boott(law15,'transcorr',1000,30,.05)
0.0473    1.7618
>> tanh(ans)
0.0473    0.9427
>> transcorr(law15)
1.0362
>> 2/sqrt(12)
0.5774
>> 1.0362 - 0.5774
0.4588
>> 1.0362 + 0.5774
1.6136
>> tanh([.4588 1.6136])
0.4291    0.9237
%%%%%%%%%%%%%%%True confidence Intervals%
>> prctile(res100k,[5 95])
ans =
0.5307    0.9041


#### Confidence Intervals: The percentile method

We denote by $\hat{G}$ the bootstrap distribution of $\hat{\theta}^*$, approximated by the empirical distribution of the $B$ bootstrap replicates $\hat{\theta}^{*1},\dots,\hat{\theta}^{*B}$.

The percentile method consists in taking the confidence interval for $\theta$ as being

$$\left[\hat{G}^{-1}(\alpha),\ \hat{G}^{-1}(1-\alpha)\right].$$

Theoretically speaking this is equivalent to replacing the unknown distribution $F$ by the estimate $\hat{F}_n$.
The algorithm:
1. Input the level $1-2\alpha$ of the confidence interval.
2. For b=1 to B do: /* B is the number of bootstrap samples */
3.   For i=1 to n do: /* generation of the bootstrap sample $X^{*b}$ */
4.   Compute $\hat{\theta}^{*b}$.
5. Combine the $\hat{\theta}^{*b}$ into a new data set THETA.
6. Compute the $\alpha$ and $(1-\alpha)$'th quantiles of THETA.
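The percentile algorithm can be sketched in a few lines of Python (numpy only; the median is my choice of statistic here, the notes keep the estimator generic):

```python
# Sketch of the percentile method: resample, recompute the statistic B times,
# and read the interval off the quantiles of the bootstrap replicates.
import numpy as np

def percentile_ci(x, theta, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: alpha/2 and 1-alpha/2 quantiles
    of the B bootstrap replicates of theta."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.array([theta(x[rng.integers(0, n, n)]) for _ in range(B)])
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(2)
x = rng.exponential(size=100)
lo, hi = percentile_ci(x, np.median)
```

Note that, unlike the bootstrap-t, no inner loop for a standard error is needed, which is why the percentile method is so much cheaper.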

## Variance Stabilizing Transformations

We may want to assess the accuracy of an estimate $g(X)$, or just its expectation.

If $g$ is linear, $g(x) = a + bx$, then as we know finding its expectation is trivial, and we can also find its variance (recalling that $E(a+bX) = a + b\,E(X)$ and $\operatorname{var}(a+bX) = b^2 \operatorname{var}(X)$).

So when $g$ is not linear we have to bring ourselves back to that case by using a Taylor expansion: we hope to find a region where $X$ dwells with high probability and where $g$ is approximately linear.

This gives as a first order approximation to the expectation of $g(X)$: $E\,g(X) \approx g(\mu)$, and as a second order approximation $E\,g(X) \approx g(\mu) + \tfrac{1}{2} g''(\mu)\,\sigma^2$, with the variance approximation $\operatorname{var} g(X) \approx g'(\mu)^2 \sigma^2$.

This is called the delta method. How good is such an approximation? That depends on two things:
1) how nonlinear $g$ is in a neighborhood of $\mu$; 2) how big $\sigma$ is.
We are just interested in the neighborhood of $\mu$ because that is where $X$ lies most of the time (remember Chebyshev).

This method provides a generalisation of the central limit theorem:

$$\sqrt{n}\,\bigl(g(\bar{X}) - g(\mu)\bigr) \longrightarrow N\bigl(0,\ g'(\mu)^2\sigma^2\bigr)$$

as long as $g'(\mu)$ exists and is non zero.

By Taylor's theorem:

$$g(\bar{X}) = g(\mu) + g'(\mu)(\bar{X}-\mu) + o_P(\lvert\bar{X}-\mu\rvert).$$

Examples: Even though such transformed statistics are non normal, the limit holds because the transformations will be locally linear.
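A quick simulation (my own sanity check, not from the notes) compares the delta-method standard deviation $|g'(\mu)|\,\sigma/\sqrt{n}$ with the truth for $g = \log$ and exponential data:

```python
# Check: for Xbar of n exponentials with mean mu (so sigma = mu), the delta
# method predicts sd(log(Xbar)) ~ g'(mu) * sigma / sqrt(n) = 1/sqrt(n).. * ..
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 20000
mu = sigma = 2.0                          # exponential with mean 2 has sd 2
xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
sd_sim = np.log(xbars).std()              # simulated sd of g(Xbar), g = log
sd_delta = (1 / mu) * sigma / np.sqrt(n)  # delta method, using g'(mu) = 1/mu
```

With these values the delta prediction is 0.1 and the simulated value agrees to about the third decimal.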

Binomial: We're interested in estimating $p^2$; should we

1. Do $n$ binomial trials with probability of success $p^2$, or
2. Do $n$ binomial trials with probability of success $p$ and then estimate the parameter by $\hat{p}^2$?

Comparison: the first estimator has variance $p^2(1-p^2)/n$, while by the delta method the second has approximate variance $(2p)^2\,p(1-p)/n = 4p^3(1-p)/n$. For large $n$, $\hat{p}^2$ will be more accurate iff

$$4p^3(1-p) < p^2(1-p^2), \qquad \text{that is, iff } p < 1/3.$$

Otherwise it is the direct estimator that is the most efficient.
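The comparison can be checked by simulation (a sketch under my own choices $p = 0.2 < 1/3$ and $n = 200$):

```python
# Compare the two estimators of p^2: direct Bernoulli(p^2) trials versus
# squaring the proportion estimated from Bernoulli(p) trials.
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.2, 200, 20000
direct = rng.binomial(n, p * p, size=reps) / n       # phat of p^2 directly
squared = (rng.binomial(n, p, size=reps) / n) ** 2   # (phat of p)^2
# Delta method: var(direct) ~ p^2(1-p^2)/n, var(squared) ~ 4 p^3 (1-p)/n,
# so for p < 1/3 the squared estimator should show the smaller variance.
```

Swapping in a value of $p$ above $1/3$ reverses the ordering of the two empirical variances, as the inequality predicts.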

The delta method provides a route towards variance stabilization, i.e. making the variance less dependent on the parameter $\theta$ (and thus making the distribution closer to being pivotal).

We would like to find transformations $g$ such that $\operatorname{var} g(T) \approx c$, where $c$ does not depend on $\theta$.

Since $\operatorname{var} g(T) \approx g'(\theta)^2\,\sigma^2(\theta)$, by taking the transformation $g$ to be such that $g'(\theta) = \sqrt{c}/\sigma(\theta)$, we have a constant variance.

Poisson case:
$X \sim \mathcal{P}(\lambda)$, so $E X = \lambda$ and $\operatorname{var} X = \lambda$.

For $g(x) = \sqrt{x}$, we have $g'(\lambda)^2\,\lambda = \lambda/(4\lambda) = 1/4$, so $\operatorname{var} \sqrt{X} \approx 1/4$ and $\sqrt{X} - \sqrt{\lambda}$ is asymptotically pivotal.
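A small simulation (mine, not from the notes) illustrates the stabilization: the variance of $\sqrt{X}$ stays near $1/4$ across very different values of $\lambda$:

```python
# The raw Poisson variance grows like lambda, but after the square root
# transformation the variance is roughly constant (about 1/4).
import numpy as np

rng = np.random.default_rng(5)
variances = [np.sqrt(rng.poisson(lam, size=100000)).var()
             for lam in (5, 20, 80)]
# Each entry should be close to 0.25, with the approximation improving
# as lambda grows (the delta method is asymptotic).
```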

Chisquare:
Let $Y \sim \chi^2_n$, where $Y = \sum_{i=1}^n Z_i^2$ with the $Z_i$ i.i.d. $N(0,1)$. Then $E Y = n$ and $\operatorname{var} Y = 2n$, so the variance is again proportional to the mean and, as in the Poisson case, the variance stabilizing transformation is $g(y) = \sqrt{y}$.

### Correlation Coefficient

For the correlation coefficient case, we actually need a more sophisticated version. Supposing we are interested in $g(X,Y)$,

a function of two random variables we know about, we need the two-dimensional Taylor series. Note $\mu = (\mu_X, \mu_Y)$. Then

$$g(X,Y) \approx g(\mu) + (X-\mu_X)\,\frac{\partial g}{\partial x}(\mu) + (Y-\mu_Y)\,\frac{\partial g}{\partial y}(\mu);$$

as a first approximation, this gives us straight away $E\,g(X,Y) \approx g(\mu_X,\mu_Y)$ for the expectation, and with a little more work for the variance we have:

$$\operatorname{var} g(X,Y) \approx \left(\frac{\partial g}{\partial x}\right)^{\!2}\operatorname{var} X + \left(\frac{\partial g}{\partial y}\right)^{\!2}\operatorname{var} Y + 2\,\frac{\partial g}{\partial x}\,\frac{\partial g}{\partial y}\operatorname{cov}(X,Y).$$

An improved estimate of the expectation can be obtained by taking the second order Taylor expansion:

$$g(X,Y) \approx g(\mu) + (X-\mu_X)\frac{\partial g}{\partial x} + (Y-\mu_Y)\frac{\partial g}{\partial y} + \frac{1}{2}\left[(X-\mu_X)^2\frac{\partial^2 g}{\partial x^2} + 2(X-\mu_X)(Y-\mu_Y)\frac{\partial^2 g}{\partial x\,\partial y} + (Y-\mu_Y)^2\frac{\partial^2 g}{\partial y^2}\right].$$

From this and from the linearity of the expectation:

$$E\,g(X,Y) \approx g(\mu) + \frac{1}{2}\left[\operatorname{var}(X)\,\frac{\partial^2 g}{\partial x^2} + 2\operatorname{cov}(X,Y)\,\frac{\partial^2 g}{\partial x\,\partial y} + \operatorname{var}(Y)\,\frac{\partial^2 g}{\partial y^2}\right].$$

For $k$ variables it is similar.

Note : If you don't remember about Taylor series you can get the computer to remind you by using Maple.

Type maple, then:

>help(taylor);
>taylor(f(x),x=mu,3);
for multivariate:
>mtaylor(f(x,y),[x=a,y=b],3);


To simplify the formulas, suppose the variances of both $X$ and $Y$ are 1 and the means are 0; then a simplified version of the argument in Lehmann (page 317) gives, for a bivariate normal distribution with correlation coefficient $\rho$,

$$\sqrt{n}\,(\hat{\rho} - \rho) \longrightarrow N\bigl(0,\ (1-\rho^2)^2\bigr).$$

The variance stabilizing transformation is given by the condition, with $\sigma(\rho) = 1-\rho^2$:

$$g'(\rho) = \frac{1}{1-\rho^2},$$

so that

$$g(\rho) = \frac{1}{2}\log\frac{1+\rho}{1-\rho} = \operatorname{arctanh}(\rho).$$

This is called Fisher's $z$-transformation.
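A short sketch (the function name and numerical tolerances are mine) of the resulting interval for a correlation of about $0.776$ with $n = 15$, as in the law15 transcript above:

```python
# Fisher z interval: z = arctanh(r) is roughly N(arctanh(rho), 1/(n-3))
# under bivariate normality, so build the interval on the z scale and map
# it back with tanh.
import numpy as np

def fisher_z_ci(r, n, alpha_half_quantile=1.96):
    """Back-transformed interval tanh(arctanh(r) +/- q / sqrt(n-3))."""
    z = np.arctanh(r)
    half = alpha_half_quantile / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

lo, hi = fisher_z_ci(0.776, 15)
```

The transcript above uses $\pm 2/\sqrt{12}$ on the $z$ scale and gets (0.4291, 0.9237); with the normal quantile 1.96 instead of 2 the interval is very slightly narrower.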

# Bootstrap World

Susan Holmes 2002-04-25