TABLE OF CONTENTS
[Methods] [Results] [Discussion] [Literature Cited] [Tables]
[Figures]
^{a}Department of Human Genetics, University of Chicago, Chicago, Illinois 60637 ^{b}E-mail: mfp@uchicago.edu ^{c}Department of Computational and Molecular Biology, University of Southern California, 1050 Childs Way, Los Angeles, California 90089 |
Abstract.—Considerable interest is focused on the use of polymorphism data to identify regions of the genome that underlie recent adaptations. These searches are guided by a simple model of positive selection, in which a mutation is favored as soon as it arises. This assumption may not be realistic, as environmental changes and range expansions may lead previously neutral or deleterious alleles to become beneficial. We examine what effect this mode of selection has on patterns of variation at linked neutral sites by implementing a new coalescent model of positive directional selection on standing variation. In this model, a neutral allele arises and drifts in the population, then at frequency f becomes beneficial, and eventually reaches fixation. Depending on the value of f, this scenario can lead to a large variance in allele frequency spectra and in levels of linkage disequilibrium at linked, neutral sites. In particular, for intermediate f, the beneficial substitution often leads to a loss of rare alleles—a pattern that differs markedly from the signature of directional selection currently relied on by researchers. These findings highlight the importance of an accurate characterization of the effects of positive selection, if we are to reliably identify recent adaptations from polymorphism data.
Key words.—Hitchhiking mapping, linkage disequilibrium, natural selection, selective sweep, standing variation.
A major goal of evolutionary genetics is to identify the loci that underlie adaptations. In some cases, this challenge has been met by classical mapping approaches, now facilitated by the availibility of genomic tools (e.g., Colosimo et al. 2005). An alternative, of particular relevance for humans and nonmodel organisms, is to analyze genetic variation within and between species, with the goal of identifying genomic regions that appear to have evolved under natural selection rather than by drift alone. For example, a popular class of methods contrasts polymorphism and divergence at synonymous and nonsynonymous sites to detect nonneutral evolution of amino-acid sequences (e.g., Yang and Bielawski 2000; Nielsen et al. 2005). This approach has been successful in identifying several proteins that appear to be evolving under repeated positive selection (cf. Swanson 2003), but it lacks power to detect adaptations consisting of few substitutions to a given protein. Nor, in its current form, is it applicable to noncoding regions.
A single beneficial substitution can be detected in polymorphism data, so long as it occurred recently. The fixation of a favorable allele in a population distorts patterns of variation at linked sites, thereby leaving a distinguishing signature that lasts up to about 10^{4} generations in humans or about 10^{6} in Drosophila melanogaster (Przeworski 2002). In principle, targets of positive selection can therefore be identified by searching polymorphism data for regions that harbor this signature (Nair et al. 2003; Wright et al. 2005).
Several recent papers have applied this hitchhiking mapping approach to humans and to species of Drosophila (e.g., Harr et al. 2002; Payseur et al. 2002; Glinka et al. 2003; Kauer et al. 2003; Kayser et al. 2003; Akey et al. 2004; Schofl and Schlotterer 2004; Storz et al. 2004). Early modern humans, D. melanogaster, and D. simulans share a similar demographic history: they have an African origin and are thought to have only recently become cosmopolitan (cf. Aquadro et al. 2001). This expansion from their ancestral range may have been accompanied by adaptations to novel climates, diets, and diseases. The motivation behind these recent papers is to search for signatures of adaptations in non-African populations that must have experienced a recent change of habitat.
While promising, the approach will only be reliable if the signature of natural selection is accurately characterized. Currently, our understanding of the effects of directional positive selection stems from the simple model of a random-mating population of constant size, in which an allele is beneficial as soon as it arises and is rapidly driven to fixation in the population (hereafter referred to as a “standard selective sweep”). Under these assumptions, the substitution of a favorable allele at one site results in a reduction in variability and a skew toward rare and high-frequency derived variants at linked, neutral sites (Maynard Smith and Haigh 1974; Simonsen et al. 1995; Fay and Wu 2000; Kim and Stephan 2002; Przeworski 2002). This distortionary effect is the footprint sought in hitchhiking mapping.
Whether the characterization of positive selection is reliable, however, remains an open question. In that respect, it is troubling that in one of the rare cases with independent evidence for a recent selective sweep, the expected signature was not observed (Hamblin and Di Rienzo 2000). At the Duffy locus in humans, homozygosity for a null allele is known to confer resistance to vivax malaria. This null allele is fixed or near fixation in sub-Saharan African populations and virtually absent elsewhere. Yet, in four out of five sub-Saharan African ethnicities surveyed for variation, allele frequencies at linked sites do not harbor a high proportion of rare alleles, nor is diversity always sharply reduced (Hamblin and Di Rienzo 2000; Hamblin et al. 2002). As suggested by the authors, this may be due to departures from demographic assumptions. For example, it is known that the fixation of a beneficial allele in a structured population leads to a distinct signature from that expected under the standard sweep model (Slatkin and Wiehe 1998; Santiago and Caballero 2005).
A second possibility is that, contrary to what is assumed in the standard model, selection does not always act on a new allele. A striking example is provided by threespine sticklebacks that moved from marine environments to colonize freshwater lakes and streams in the past 20,000 years. These isolated populations show a parallel reduction in body armor plates, adaptations that appear to have been accomplished by changes at the same major locus, EDA (Colosimo et al. 2005). Moreover, the allele associated with the adaptive phenotype is found at 0.2% and 3.8% frequency in two marine populations. Thus, directional selection for reduced plating appears to have acted on standing variation present at nonnegligible frequency in the ancestral environment (Colosimo et al. 2005). A similar set of circumstances might underlie the high frequency of lactose tolerance in dispersed pastoral populations of humans (Bersaglieri et al. 2004).
Such a scenario may be plausible for Duffy, as the advent of agriculture (10,000 years ago) is thought to have drastically increased malarial pressures (Hamblin and Di Rienzo 2000). In fact, it may apply to many of the adaptations that motivate the hitchhiking mapping approach. Early modern humans, for instance, are only thought to have left the African continent 100,000 years ago, or about 4000 generations ago, reaching Australia by about 50,000 years ago and the Americas only in the past 20,000 years (cf. Lewin and Foley 2004). Given their small census population size, there may have been little time for new beneficial mutations to arise. Instead, responses to novel selective pressures may have been brought about by selection on existing polymorphisms.
If so, it becomes important to examine whether selection on standing variation leaves a similar footprint than does the standard sweep, as otherwise we may fail to recognize many genomic regions of interest. To predict the footprint of selection on standing variation, we first need to know how many independent mutations are likely to underlie a given adaptation (Hermisson and Pennings 2005). Assuming a simple demographic model, only two fitness classes at the target of selection (favored and unfavored), and no interference among favored alleles, Hermisson and Pennings (2005) concluded that a single origin of the favored allele is more likely so long as selection is strong and 4N 0.1, where N is the diploid effective population size and is the per generation mutation rate to the favored class. In humans, the population mutation rate, 4Nμ (where μ is the neutral mutation rate per generation), is thought to be approximately 0.001 per base pair (Li and Sadler 1991; Przeworski et al. 2000), so this condition is equivalent to the assumption that mutations at fewer than 100 bp lead to the favored type. In D. melanogaster, 4Nμ is estimated to be about 0.01 per base pair (e.g., Moriyama and Powell 1996), which corresponds to a requirement that mutations at fewer than 10 bp be beneficial. Thus, for both species, the assumption of a single origin for the favored allele seems sensible.
In this case, the signature of selection on linked, neutral variation depends crucially on the frequency, f, at which the allele that eventually reaches fixation is first favored. If f < 1/(2Ns) and selection is strong (where s is the selection coefficient of the favored allele), the effect of a favorable substitution will be much as expected under the standard sweep model (Stephan et al. 1992). Thus, so long as natural selection acts on a new or rare allele, linked, neutral sites will tend to harbor an excess of rare alleles and low diversity. This result also helps to understand a model in which an allele is originally deleterious, then favored. If purifying selection was strong enough to keep the variant at very low frequency in the ancestral environment, the predictions will again resemble those of a standard selective sweep (Orr and Betancourt 2001). In particular, if the environmental shift causes the selection coefficient to change its sign, but not its absolute value, the predictions will be identical (Hermisson and Pennings 2005).
If the allele was initially very weakly deleterious or neutral, however, the frequency f may not have been very low; this is also the case if the allele was brought in by gene flow from an environment in which it was not deleterious (Roper et al. 2004; Colosimo et al. 2005). In cases where f was appreciable, the signature of selection may not resemble that of a standard sweep. Innan and Kim (2004) examined this problem in the context of plant domestication. To mimic the process of artificial selection imposed by early farmers, they considered a recent population bottleneck (7500 generations ago) followed by 20- to 100-fold population growth. They modeled extremely strong selection, occurring at the beginning of the bottleneck, and acting on a single allele that was previously neutral. Using standard tests of neutrality based on diversity levels and allele frequencies, they found that larger f values lead to decreased power to detect the fixation of a beneficial allele in polymorphism data.
We revisit this question for parameters applicable to natural selection. To do so, we implement a novel coalescent model of directional selection on standing variation and characterize both allele frequencies and patterns of linkage disequilibrium at linked neutral sites. Like Innan and Kim (2004), we assume that the favored allele has a unique origin.
We consider the following scenario: at time t_{m}, a neutral allele A arises by mutation and drifts in the population until time t_{s}, when, due to a change in the environment, it becomes beneficial. Allele A eventually fixes at time T, at which point all chromosomes carry the favored allele. Selection is genic, and in the new environment A has selection coefficient s. The frequency of A at time t_{s} is denoted by f.
To examine the effect of the substitution of allele A on patterns of variation, we model the genealogical history of a sample from a linked, neutrally evolving region. To do so, we assume that the population mates randomly and is of constant size N diploids. We also assume that allele A was the only mutation event at that site (in the genealogical history of the neutral region). Mutations in the neutral region arise according to the infinite-sites model. Recombination is modeled as crossing-over with no gene conversion and occurs at a constant rate r per base pair.
Going backward in time (so 0 < T < t_{s} < t_{m}), there are three phases. Before allele A arose (t > t_{m}) or after A fixed (t < T), there is only one allele, a, at the site. The history can therefore be described by the standard neutral coalescent (cf. Hudson 1990). In the two other phases, there are two alleles: when t_{s} < t < t_{m}, a and A are selectively equivalent, while for T < t < t_{s}, they are not. During these phases, the lineages ancestral to the sample from the neutral region can be thought of as evolving in a structured population, where the allelic classes (a and A) define subpopulations and recombination between ancestral lineages of each class acts as migration (Hudson and Kaplan 1988; Barton 1998; Nordborg 2001).
Using this analogy of a structured coalescent, we can simulate the genealogical history of a sample from the neutrally evolving region by generating the frequency of the selected allele through time (hereafter “the trajectory”), then generating an ancestral recombination graph conditional on this trajectory (see Fig. 1 ). This general approach was pioneered by Kaplan et al. (1989) and since used in other studies (e.g., Przeworski 2002; Ray et al. 2003; Coop and Griffiths 2004; Innan and Kim 2004).
We implement the approach by modifying the coalescent program described in Przeworski (2002). The only change pertains to the trajectory of allele A. In Przeworski (2002), allele A is favored from introduction to fixation and a deterministic approximation is used to model the trajectory. Here, A is initially neutral, then beneficial, and the trajectory of A is modeled stochastically (as described below). Thus, while we present results for a fixed time of fixation, T, the times t_{m} and t_{s} are random and will therefore vary from run to run.
We error-checked the program by writing an independent code (which uses a birth-death approximation to the diffusion process) and comparing the results. The latter is implemented as a version of the program SELSIM (Spencer and Coop 2004) and is available at pritch.bsd.uchicago.edu/software.html.
The frequency of an allele A in the population can be modeled by a diffusion process X(t) on (0,1), with generator where σ^{2}(x) = x(1 − x) is the infinitesimal variance and μ(x) the infinitesimal mean of the diffusion process (cf. Ewens 2004). In our model, there are two diffusion processes: a neutral one, X_{N}(t), and a selected one, X_{S}(t). These have infinitesimal means μ_{N}(x) = 0 and μ_{S}(x) = 2Nsx(1 − x), respectively.
We consider these processes conditional on their reaching one of two absorbing states: zero (i.e., loss of A from the population) and one (i.e., fixation). The conditional diffusion process has the same σ^{2}(x) as the usual diffusion, but the infinitesimal mean includes an additional term, which effectively gives the appropriate push toward the boundary on which we have conditioned.
Our approach also relies on the reversibility of the diffusion process (cf. Griffiths 2003). Specifically, we use the fact that the diffusion process looking backward in time from the present (i.e., toward the introduction of the allele) has the same distribution as a process forward in time conditional on absorption at zero. This conditional process X_{N}^{*}(t) is the same as X_{N}(t) but with μ_{N}(x) replaced by μ_{N}^{*}(x) = −x (Ewens 2004). Likewise, because we are only interested in beneficial alleles that eventually reach fixation, we consider the diffusion process conditional on the selected allele reaching a frequency of one. This conditional process X_{S}^{+}(t) has an infinitesimal mean μ_{S}^{+}(x) = 2Nsx(1 − x)/tanh(2Nsx) (Ewens 2004).
To generate a trajectory for allele A, we use a variable-sized jump random walk to approximate to the diffusion process. Given a current frequency x, at time intervals Δt, the frequency x jumps to either: with equal probability. The term μ(x) is replaced by the conditional infinitesimal mean of the phase in question (i.e., neutral or selective). This process has the correct diffusion limit, that is, the correct infinitesimal mean and variance are obtained and all higher moments are zero, as the time interval Δt 0 (Karlin and Taylor 1981). Hence, for small Δt, it provides a good approximation to the diffusion process. We verified this for our choice of Δt = 1/(4N) by comparison to analytical expectations and to alternative methods of simulation (results not shown).
There are two steps in our implementation: (1) simulation of the trajectory of a neutral allele from frequency f to loss, with μ(x) = μ_{N}^{*}(x) in the jumps described above; by the reversibility property described above, we can flip this trajectory to model the allele A from introduction to frequency f; and (2) simulation of the trajectory of a selected allele from f to fixation, with μ(x) = μ_{S}^{+}(x) in the jumps. We then concatenate the results of (1) and (2) to obtain one trajectory from introduction to fixation.
Our approach ensures that A is initially selected when at frequency f, without assuming that A is selected when it first reaches frequency f (which would be unrealistic). Moreover, it is computationally efficient, because we only generate trajectories where A eventually fixes in the population.
We are interested in contrasting two models: the standard selective sweep, in which an allele is favored from introduction to fixation, and a model of directional selection on standing variation. For the latter, we consider the following scenario: a neutral allele A arises and drifts in the population until time t_{s}, when it becomes favored; it eventually reaches fixation in the population at time T (see Methods for details). The frequency of allele A at time t_{s}, f, is the salient parameter in the comparison.
To characterize the effects of these two models on genetic variation, we simulate samples from a linked, neutrally evolving region using a structured coalescent approach. Specifically, we generate a trajectory of allele A from introduction to fixation, then condition on this particular realization of the genealogical process to generate an ancestral recombination graph for our sample (Fig. 1 ). The trajectory of allele A is modeled stochastically, using a new approach (see Methods). Under the standard sweep model, f = 1/(2N), while under the model of directional selection on standing variation, f 1/(2N).
Irrespective of the value of f, mean diversity levels are most distorted near the selected site and tend toward their neutral expectation with increasing genetic distance. This is illustrated in Figure 2A , using parameters that may be applicable to humans (e.g., Frisse et al. 2001). We present three summaries of diversity: θ_{W} (Watterson 1975), θ_{H} (Fay and Wu 2000), and π (Tajima 1989). Under a neutral equilibrium model, these statistics provide an unbiased estimate of θ, the population mutation rate (θ = 4Nμ, where μ is the mutation rate per generation per base pair). For these parameters, a standard sweep leads to a reduction in the mean levels of variation throughout the 100-kb region (relative to the neutral expectation of θ = 0.001 per base pair).
A very similar picture is expected so long as f < 1/(2Ns) and selection is strong (Stephan et al. 1992). As an example, in Figure 2A the expected levels of variation are indistinguishable for f = 1/(2N) = 5 × 10^{−5} and f = 1/(2Ns) = 10^{−3}. As f increases, the substitution of a favored allele has a weaker effect on diversity at linked neutral sites (Innan and Kim 2004); for f = 0.20, the effect is hardly detectable. If the effective population size, N, is larger, the difference between the standard sweep and a model of directional selection on standing variation is more readily apparent. For instance, using parameters that may be realistic for D. melanogaster (e.g., Andolfatto and Przeworski 2000), a model with f = 0.05 leads to a very slight reduction in diversity levels relative to the expectation at neutral equilibrium (see Fig. 2B ).
This finding might suggest that directional selection on standing variation behaves like the standard sweep, but with a weaker footprint. This turns out not to be true. Figure 3 plots π and θ_{W} as a function of the distance from the selected site for four simulated datasets generated under models of directional selection where f = 0.05 and where f = 1/(2N), as well as under the neutral equilibrium. As can be seen, some cases of selection on standing variation resemble a standard sweep with a weaker footprint (example 3); others look like the neutral equilibrium case, with high diversity very close to the selected site (example 4); and yet others look unlike either the standard sweep or neutrality, with valleys of low diversity in some segments and neutral equilibrium levels in others (examples 1, 2).
Moreover, while the fixation of a new beneficial mutation has more effect on π than θ_{W} in all four examples, this is not the case when f = 0.05. We illustrate this difference between the two selection models in more detail by presenting five randomly generated examples of allele frequencies in a linked neutral region (Fig. 4 ). Close to the selected site, a standard sweep tends to produce an excess of rare and high-frequency alleles relative to the neutral equilibrium model (e.g., Maynard Smith and Haigh 1974; Kaplan et al. 1989; Simonsen et al. 1995; Fay and Wu 2000). As seen in examples 2 and 5, this also happens under a model where f 1/(2N). What distinguishes the model of directional selection on standing variation from the standard sweep model is the appreciable number of cases where there is a relative excess of intermediate-frequency alleles (examples 1 and 4 in Fig. 4 ). This suggests that for intermediate values of f, directional selection leads to a much larger variance in frequency spectra than expected under a standard sweep model.
To quantify this observation, we estimate the variance and central 95% probability interval of Tajima's D (Tajima 1989), a commonly used summary of the folded allele frequency spectrum based on the (approximately) normalized difference between π and θ_{W} (Table 1 ). Under a neutral equilibrium model, the mean of this statistic is roughly zero, while a negative (positive) value reflects an excess of rare (intermediate-frequency) alleles. We first consider the case where the beneficial allele has just reached fixation. As expected, the standard selective sweep leads to sharply reduced values of D. In contrast, for f = 0.05, the mean is only slightly reduced from zero, but both tails of the distribution of D are greatly increased (Table 1 ). If the time since fixation is instead 2000 generations, or approximately 50,000 years in humans, then both models lead to more negative D-values. However, the variance in outcomes remains much larger under a model of directional selection on standing variation and the 95th percentile still includes sharply positive values (Table 1 ).
The same qualitative behavior was noted for a model of plant domestication, in which selection occurs during a recent bottleneck (Innan and Kim 2004). The authors considered the power to reject a neutral null model using two summaries of the allele frequency spectrum, Tajima's D and a statistic akin to Fay and Wu's H (Fay and Wu 2000), as test statistics. They noted that “the two tests work towards both tails because selection makes patterns of polymorphism variable” (Innan and Kim 2004, p. 10670). Together, their results and ours suggest that an increased variance in frequency spectra is a general feature of directional selection on standing variation.
This finding can be understood in a coalescent framework. Under a standard selective sweep model and in the absence of recombination between selected and neutral loci, the genealogy of a sample from the neutral region will be star shaped and reflected in data by a high proportion of rare alleles. If a low level of recombination does occur, it will tend to be when the allele A is at intermediate frequency in the population, before many coalescent events have taken place. The resulting genealogy will be lopsided, with one or few long external branches, so that the sample will contain a high proportion of rare and high-frequency-derived alleles (Barton 1998; Fay and Wu 2000). In contrast, under a model of selection on standing variation, recombination will also take place when A is at low frequency (Fig. 1 ). The increased opportunity for recombination in the initial stage stems from the fact that A spends more time in the population drifting at low frequencies. Recombination events that occur while A is rare will result in more balanced genealogies, that is, in a higher number of intermediate frequency alleles in the sample (Fig. 1 ). Viewed another way, if recombination occurs while A is rare, then more than one haplotype will carry the allele by time t_{s}. A subset of these will increase in frequency along with A, and, after fixation, the alleles that distinguish them will tend to be at intermediate frequencies in the sample. In summary, the frequency spectrum observed at linked, neutral sites depends on how many recombination events occurred during the sojourn of the favored allele in the population and when they occurred, that is, for a fixed strength of selection, on f and the recombination rate.
In this respect, it is worth nothing that any scenario that leads to a relatively long sojourn time of the favored allele, especially at low frequency, will increase the opportunity for recombination between allelic classes, thereby weakening and distorting the footprint of an adaptive substitution relative to the standard sweep model. For example, the fixation of a beneficial allele has a lesser effect on diversity at linked neutral sites when the allele is recessive rather than codominant, as assumed in the standard model (Teshima and Przeworski 2005). Importantly, population structure can also increase the sojourn time of the favored allele (Cherry 2003; Whitlock 2003).
The timing of recombination events during the selective sweep will also influence levels of linkage disequilibrium (LD). To examine this aspect of the polymorphism data, we estimate the population recombination rate, ρ = 4N_{e}r (Hudson 1987). Under the neutral equilibrium model, estimates of ρ can be thought of as the amount of recombination needed in the population per generation to generate approximately the observed LD. More generally, they can be thought of as an index of the strength of allelic associations, with smaller -values corresponding to greater LD and vice versa (Andolfatto and Przeworski 2000).
In Table 2 , we present the mean, variance, and range of two estimates of ρ, for parameters applicable to humans and ρ = 10. The first, _{W00}, is the maximum likelihood estimate of ρ given two summaries of the data: the number of distinct haplotypes in the sample and the minimum number of recombination events, as estimated by the four-gamete test (Wall 2000). The second estimate, _{H01}, is a composite likelihood estimator based on haploype configurations at pairs of polymorphic sites (Hudson 2001). Under the equilibrium neutral model, both estimators are close to unbiased (Table 2 ). By comparison, the standard selective sweep model leads to a sharp decrease in _{W00}, or equivalently, a marked increase in levels of LD (Przeworski 2002; Kim and Nielsen 2004). This decrease in estimated ρ is also observed by considering the median of _{H01}, but not its mean, as the latter is distorted by occasional extreme values.
On average, the fixation of a beneficial allele has less of an effect on estimates of ρ (i.e., a weaker effect on LD) when f = 0.05 compared to the standard selective sweep model. As an illustration, for these parameters the median _{W00} is equal to 1.0 instead of 0.0 and the median _{H01} = 2.6 instead of 1.6. Moreover, there is a greater variance in levels of LD: Var(_{W00}) = 2.34 versus 0.90 for f = 0.05 and f = 1/(2N), respectively. (This effect is not seen with _{H01}, probably because the estimator is biased upward when the sample contains a high proportion of rare alleles and this case arises more often under a standard selective sweep model; J. D. Wall, unpubl. obs.).
In spite of the increased variance in estimates of _{W00} under a model where f = 0.05, the upper 97.5 percentile of _{W00} is still substantially lower than expected under a neutral equilibrium model. This suggests that using _{W00} as a test statistic may provide substantial power to detect a beneficial substitution, even when f 1/(2N). Unfortunately, the use of _{W00} as a test statistic would require an accurate estimate of ρ, which is rarely available. Moreover, the effect of a beneficial substitution on LD dissipates rapidly (Table 2 ; Przeworski 2002; Kim and Nielsen 2004).
As an alternative summary of LD patterns, we tabulated both the number of distinct haplotypes and the haplotype homozygosity for the simulations described in Table 2 . In all cases, selective sweeps with f = 1/(2N) had less haplotype diversity than comparable simulations with f = 0.05, even when corrected for differences in levels of diversity and allele frequencies (results not shown). These results are broadly consistent with those described for estimates of ρ, in that the standard sweep model seems to lead to a greater increase in levels of LD than does a model where f = 0.05. It remains to be investigated whether this observation generalizes to other LD summaries and, in particular, whether there exists an aspect of LD that might help to distinguish selection on standing variation versus new mutations.
As these results demonstrate, the signature of directional selection on an existing polymorphism depends crucially on the value of f. Unfortunately, there is little empirical evidence to suggest what values may be plausible for natural populations. To gain some sense of what to expect, we considered a model in which all beneficial substitutions result from directional selection on standing variation in a neutral equilibrium population, and asked what the distribution of f would be under those conditions (for a related derivation see Hermisson and Pennings 2005).
We first assume that positive selection acts on the derived allele. In our model, a neutral allele is chosen at random from the neutral equilibrium frequency spectrum. Denote the derived allele frequency by x, confined to [1/2N, 1 − 1/2N]. The probability density function (pdf) for x is proportional to 1/x (cf. Ewens 2004). Since _{1/2N}^{(2N−1)/2N}dx/x = ln(2N − 1), the pdf of the derived allele frequency is (x) = [x ln(2N − 1)]^{−1}. The neutral allele is therefore drawn from (x), and subsequently favored with genic selective advantage s. Given an allele at frequency x that has selective coefficient s, the probability of fixation is approximately (1 − e^{−4Nsx})/(1 − e^{−4Ns}) (cf. Ewens 2004). Conditional on fixation, the pdf for the frequency f, P(x), is therefore proportional to and the probability of fixation of a randomly chosen neutral allele is approximately Figure 5 presents (x) and P(x). A comparison of the two indicates a shift toward higher-frequency alleles in P(x) relative to (x). This is easily understood: for a fixed selection coefficient, rare alleles are less likely to fix. Thus, conditional on fixation, the proportion of rare alleles is smaller. As the strength of selection increases, the effect on conditioning on fixation weakens, and the two distributions become more similar.
The implication is that, if directional selection frequently acts on standing variation and selection in natural populations is strong, the distribution of f will resemble that of neutral variation in the ancestral population. This simple model therefore predicts that directional selection in teosinte, the wild progenitor of maize, will tend to produce the standard signature of a selective sweep, because the allele frequency spectrum is skewed toward rare alleles (Wright et al. 2005). In contrast, in European human populations, there appears to be an excess of intermediate-frequency alleles in putatively neutral regions (e.g., Frisse et al. 2001; Akey et al. 2004). Assuming the same were true at the onset of selection, adaptations to new habitats may therefore have been more likely to involve alleles at appreciable frequency, resulting in a subset of genomic regions with reduced diversity and an excess of intermediate-frequency alleles.
Thus far, we have assumed that the selection favors the derived allele. Depending on how selection coefficients change with the environment, it may be more realistic to assume that the allele that becomes beneficial, A, is equally likely to be derived or ancestral. The pdf of the ancestral allele frequency is [(1 − x)ln(2N − 1)]^{−1}. The frequency distribution of a randomly chosen neutral allele is therefore and P(x) is proportional to While (x) is symmetric around 0.5, P(x) is shifted toward higher frequency values. Therefore, in this scenario, selection will mostly act on alleles whose frequency is close to one and that are ancestral. When it does, the trajectory of A from one to f back to one will induce a brief and very mild bottleneck of the a allelic class. In the presence of recombination, this analogy between a sweep and a population size reduction is not perfect (Barton 1998), but it is indicates that the beneficial substitution of an ancestral allele will have essentially no effect on patterns of variation at linked sites. Moreover, whatever selective signature is present will be difficult to distinguish from the effects of population history, especially in species (such as humans) that are believed to have experienced recent bottlenecks. Thus, even if selection is equally likely to favor ancestral and derived alleles, only those episodes that involve derived alleles will be readily detectable.
Identifying selective targets from polymorphism data requires an accurate characterization of the effects of natural selection. To date, these efforts have been guided by a simplistic model that, among many other assumptions, posits that adaptations come about through directional selection on a single, newly arising allele. Yet, both empirical data and theoretical considerations suggest that selection may often act on multiple independent mutations or on a single allele already present in the population. Focusing on the latter scenario, we find that directional selection in a recombining region can lead to patterns of polymorphism that differ markedly from the signature expected under the standard model, when the frequency at which the allele becomes beneficial exceeds 1/(2Ns).
It is unclear how often this condition is met, but considering this scenario may help to explicate patterns of variation at candidate loci. As an illustration, Akey et al. (2004) used a false discovery rate approach to identify genic regions for which neutral evolution is rejected under a range of different demographic assumptions. The authors identified eight loci as significant in European-Americans and none in African-Americans, leading them to suggest that European populations have experienced recent directional selection in response to novel habitats. Interestingly, three of the eight candidates have sharply positive values of Tajima's D (ABO, ACE2, and IL1A; Akey et al. 2004). Of these, ABO and IL1A also exhibit unusually high levels of diversity, perhaps consistent with the action of balancing selection. But at ACE2, the angiotensin I–converting enzyme 2, not only is Tajima's D strongly positive (+1.854) but diversity is reduced (the ratio of European-American to African-American diversity is 11th of the 132 genes surveyed). These two aspects of the data are difficult to reconcile with either standard selective sweep or balancing selection models, but are consistent with the predictions of directional selection on standing variation. Similarly, one of the first papers to apply the hitchhiking mapping approach to D. melanogaster identified two regions of the X chromosome as putative targets of a recent selective sweep in European populations (Harr et al. 2002). Levels of polymorphism were reduced in these regions, but the Tajima's D-value was positive for more than half of the fragments sequenced. One explanation may be that positive selection acted on an allele already segregating at appreciable frequency in the ancestral population. These examples are only suggestive, but they highlight the potential explanatory power of a more realistic model of selection. To allow this model of directional selection on standing variation to be further characterized or used for inference, we have implemented it within the framework of the SELSIM program (Spencer and Coop 2004) and made this version available at pritch.bsd.uchicago.edu/software.html.
As a next step, it would be helpful to characterize the effect on polymorphism data of directional selection on multiple, independent mutations at a locus. Although considerable uncertainty surrounds the exact number of independent origins, this scenario appears to apply to pyrimethamine resistance in the malaria parasite (Roper et al. 2004 and references therein). Similarly, the low plated phenotype in freshwater populations of threespine sticklebacks seems to be due to at least two independent mutations at the major locus (Colosimo et al. 2005). More generally, assuming no interference between alleles, this mode of selection on standing variation is likely whenever selection is weak or the population mutation rate to the favored class, 4N, is high (Hermisson and Pennings 2005). The result of Hermisson and Pennings therefore predicts that adaptations in species with larger effective population size, N, are more likely to involve multiple alleles of independent origins. It also suggests that the signature of selection at a locus will depend on the genetic architecture of the phenotype; for example, for a fixed selection coefficient, an adaptive loss of function is expected to leave a distinct signature from a gain of function, because more mutations can bring it about (Hermisson and Pennings 2005). These conjectures can be tested once the signature of selection on multiple alleles is characterized.
In addition to considering more general modes of selection, it may be important to include more realistic demographic assumptions (e.g., Slatkin and Wiehe 1998; Beaumont and Balding 2004). Several studies have highlighted the difficulty of distinguishing selective from demographic effects (Nielsen 2001; Przeworski 2002; Lazzaro and Clark 2003). This remains a crucial point: for instance, patterns at both ACE2 and the candidate loci identified by Harr et al. (2002) could reflect complex demographic processes not considered by the authors rather than adaptation at a linked site. Equally important, however, is an understanding of how alternative demographic assumptions might affect the signature of selection (e.g., Slatkin and Wiehe 1998; Beaumont and Balding 2004). Ultimately, the goal is not to distinguish between models of selection versus complex demography but between models of selection and demography versus demography alone. In particular, to identify adaptations in response to climate changes and other biotic shifts, we need to characterize the footprint of positive selection in populations that are structured and vary in size over time. Otherwise, we run the risk of mistaking neutrally evolving loci for targets of directional selection and missing actual targets for which the signature of adaptation is distorted beyond recognition.
We thank B. Griffiths for helpful discussions and N. Barton, M. Nachmann, and the reviewers for their comments on the manuscript. GC is supported by National Institutes of Health grant HG002772 to J. K. Pritchard, JDW by an Alfred P. Sloan Fellowship, and MP by National Institutes of Health grant GM72861 and an Alfred P. Sloan Fellowship.
TABLE 1.The mean, variance, and range of Tajima's D under a neutral equilibrium model, a standard selective sweep model, and a model of directional positive selection where f = 0.05. A total of 10^{4} simulations were run for 100 chromosomes, with N = 10^{4} and s = 0.05. The selected site is 10 kb from the neutrally evolving region of 10 kb. With the neutral locus, the population mutation rate θ = 10^{−3} per base pair and the population recombination rate ρ = 4Nr = 10^{−3} per base pair. T is the time since the fixation of the beneficial allele. Note the large variance in D values when f = 0.05
TABLE 2.The mean, variance, and range of estimators of ρ under a neutral equilibrium model, a standard selective sweep model, and a model of directional positive selection where f = 0.05. The parameters values are the same as in Table 1 . Presented are two estimators: _{W00} (Wall 2000) and _{H01} (Hudson 2001). The likelihoods for _{W00} were estimated on a grid of 171 values, ranging from zero to 100, and those for _{H01} on a grid of 197 values ranging from zero to 200
FIG. 1.A possible genealogy for six chromosomes at a neutral locus linked to a site where a beneficial allele, A, has reached fixation. In this example, A has just fixed in the population (at time T = 0), so all lineages carry the favored allele. Going backwards in time, A is favored from T to t_{s} then neutrally evolving from t_{s} (when it is at frequency f) to t_{m}. The trajectories for the selected and neutral phases are shown in black and gray, respectively. The coalescent genealogy for the six chromosomes is depicted with dashed lines, while recombination events between allelic classes are indicated with slanted arrows. Most coalescent events occur when allele A is at low frequency. Because A is neutrally evolving from t_{s} to t_{m}, its sojourn time is longer than it would be under a standard sweep, thus providing more opportunity for recombination. Note that, in this example, the most recent common ancestor has not been reached by 2500 generations ago
FIG. 2.Mean diversity levels as a function of distance from the selected site for different values of f, the frequency at which the allele is first favored. Diversity levels are summarized by the mean π (dashed), θ_{W} (gray) and θ_{H} (black). Under the neutral equilibrium model, all three statistics are unbiased estimators of θ, the population mutation rate. (A) Plausible parameters for humans. A total of 10^{4} simulations were run for 100 chromosomes, with N = 10^{4}, s = 0.05, and θ = ρ = 10^{−3} per base pair (ρ = 4Nr; see Methods for other parameter definitions). The time since the fixation of the beneficial allele is zero. Under the neutral equilibrium model, E(π) = E(θ_{W}) = E(θ_{H}) = 1 per kilobase. (B) Plausible parameters for Drosophila melanogaster. A total of 10^{4} simulations were run for 100 chromosomes, with N = 10^{6}, s = 0.01, θ = 0.01 per base pair, and ρ = 0.1 per base pair. The time since the fixation of the beneficial allele is zero. Under the neutral equilibrium model, E(π) = E(θ_{W}) = E(θ_{H}) = 1 per 100 bp
FIG. 3.Diversity levels in a 5-kb sliding window (incremented every 1 kb) along the sequence under a model of directional selection with f = 0.05, a standard selective sweep model, and a neutral equilibrium model (from top to bottom row, respectively). Four simulated examples are provided for each model (see Methods for details of simulations). The position along the sequence (in kilobases) is shown on the x-axis, while the values of π (solid line) and θ_{W} (dashed line) per kilobase are shown on the y-axis. Under the neutral equilibrium model, E(π) = E(θ_{W}) = 1 per kilobase. Each simulation was run with 100 chromosomes, N = 10^{4} and s = 0.05. The selected site is at position 0 and θ = ρ = 0.001 per base pair. The time since the fixation of the beneficial allele is zero
FIG. 4.The allele frequency spectrum under a model of directional selection with f = 0.05, a standard selective sweep model, and a neutral equilibrium model (from top to bottom row, respectively). Five simulated examples are provided for each model (see Methods for details of simulations). On the x-axis are the allele frequencies (in 10 bins) and on the y-axis the proportion of sites with a given allele frequency. Each simulation was run with 100 chromosomes; N = 10^{4} and s = 0.05. The selected site is 10 kb from the neutrally evolving region and θ = ρ = 10. The time since the fixation of the beneficial allele is zero
FIG. 5.The distribution of allele frequencies conditional on fixation in the population. We assume that selection acts on a derived, previously neutral allele. Shown in dark gray is the allele frequency spectrum for neutral alleles in a random-mating population of constant size. In light gray is the distribution of allele frequencies conditional on fixation of the beneficial allele. In this example, N = 10^{4} and 4Ns = 200. As can be seen, conditional on fixation, the distribution of allele frequencies has a lower proportion of rare alleles