Rosenberg lab - abstracts

Abstracts of Rosenberg lab publications

[249] L Agranat-Tamir, M Fuchs, B Gittenberger, NA Rosenberg, KV Seetharaman (2026) Combinatorial comparison of general galled trees, time-consistent galled trees, and simplex time-consistent galled trees. Advances in Applied Mathematics 180: 103131.

Rooted binary phylogenetic networks are extensions of rooted binary trees, adding reticulation nodes that are designed to represent evolutionary processes that involve hybridization events. Enumerative combinatorics studies have counted leaf-labeled phylogenetic networks in a variety of classes, finding that when the number of reticulations is fixed, the time-consistent galled trees are asymptotically less numerous than each of several network classes that had been previously examined. Here we provide enumerative results on two additional network classes: general galled trees and simplex time-consistent galled trees. We show that for a fixed number of galls, as the number of leaves goes to infinity, the asymptotic count of general galled trees is identical to that of time-consistent galled trees, whereas the count of simplex time-consistent galled trees is smaller. If the number of galls is not restricted, then the asymptotic approximations all differ: simplex time-consistent galled trees are less numerous than time-consistent galled trees, which are in turn less numerous than general galled trees. We also report a variety of additional results: recursions to count the studied networks with small numbers of leaves and a fixed number of galls, as well as enumerative results for unlabeled networks in the classes that we investigate.

[248] L Agranat-Tamir, M Fuchs, B Gittenberger, NA Rosenberg (2026) Enumerative combinatorics of unlabeled and labeled time-consistent galled trees. Theoretical Computer Science 1082: 116075.

In mathematical phylogenetics, the time-consistent galled trees provide a simple class of rooted binary network structures that can be used to represent a variety of different biological phenomena. We study the enumerative combinatorics of unlabeled and labeled time-consistent galled trees. We present a new derivation via the symbolic method of the number of unlabeled time-consistent galled trees with a fixed number of leaves and a fixed number of galls. We also derive new generating functions and asymptotics for labeled time-consistent galled trees.

[247] E Heinrich Mora, NA Rosenberg (2026) An nth-cousin mating model and the n-anacci numbers. Fibonacci Quarterly 64: 270-280. [PDF]

In seeking to understand the size of inbred pedigrees, J. Lachance (J. Theor. Biol. 261, 238-247, 2009) studied a population model in which, for a fixed value of n, each mating occurs between nth cousins. We explain a connection between the second-cousin case of the model (n=2) and the Fibonacci sequence, and more generally, between the nth-cousin case and the n-anacci sequence. For a model with nth-cousin mating (n ≥ 1), we obtain the generating function describing the size of the pedigree t generations back from the present, tabulating the numbers of lineal genealogical ancestors in specified generations, and we use it to evaluate the asymptotic growth of the pedigree size. In particular, we show that the pedigree growth rate asymptotically follows the growth rate of the n-anacci sequence: the golden ratio φ = (1 + √5)/2 ≈ 1.61803 for second-cousin mating (n=2), the tribonacci growth constant 1.83928... for third-cousin mating (n=3), and the tetranacci growth constant 1.92756... for fourth-cousin mating (n=4). The growth rate approaches 2 as n increases. The computations explain the appearance of familiar numerical sequences and constants in a pedigree model. They also recall similar appearances of such sequences and constants in studies of population biology more generally.
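
The growth constants quoted in this abstract can be checked numerically. The sketch below is not taken from the paper; it assumes the standard characterization of the n-anacci growth constant as the positive root of x^n = x^(n-1) + ... + x + 1, and the function name nanacci_growth_constant is hypothetical.

```python
def nanacci_growth_constant(n, tol=1e-12):
    """Positive root of x**n = x**(n-1) + ... + x + 1, found by bisection.

    Assumption: this polynomial characterization is the standard one for
    the n-anacci sequence; it is not quoted from the abstract itself.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    def g(x):
        # g changes sign exactly once on [1, 2]: g(1) = 1 - n <= 0, g(2) = 1 > 0
        return x**n - sum(x**k for k in range(n))

    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for n in (2, 3, 4, 10):
        print(n, nanacci_growth_constant(n))
```

Running it for increasing n illustrates the statement that the growth rate approaches 2.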

[246] EH Dickey, NA Rosenberg (2026) Labeled histories and maximally probable labeled topologies with multifurcation. Discrete Applied Mathematics 391: 192-203. [PDF]

In mathematical phylogenetics, labeled histories describe the sequences by which sets of labeled lineages coalesce to a shared ancestral lineage. We study labeled histories for at-most-r-furcating trees. Consider a rooted leaf-labeled tree in which internal nodes each have i offspring, and i is permitted to range from 2 to r across internal nodes, for a specified value of r. For labeled topologies with n leaves, we enumerate the total number of labeled histories with at-most-r-furcation. We enumerate the labeled histories possessed by a specific at-most-r-furcating labeled topology. We then demonstrate that the maximally probable at-most-r-furcating unlabeled topology on n ≥ 2 leaves (the unlabeled topology whose labelings have the largest number of labeled histories) is the maximally probable strictly bifurcating unlabeled topology on n leaves. Finally, we enumerate labeled histories for at-most-r-furcating labeled topologies in a setting that permits simultaneous branchings. We similarly reduce the problem of identifying the maximally probable at-most-r-furcating unlabeled topology on n ≥ 2 leaves, allowing simultaneity, to that of identifying the maximally probable strictly bifurcating unlabeled topology on n leaves, with simultaneity; we conjecture the shape of this bifurcating unlabeled topology. The computations contribute to the study of multifurcation, which arises in various biological processes, and they connect to analogous mathematical settings involving precedence-constrained scheduling.

[245] X Liu, NA Rosenberg, S Ramachandran (2026) Clumppling 2.0: a clustering alignment program for population structure analyses. Human Population Genetics and Genomics 6: 0004. [PDF] [Supplement]

We previously introduced Clumppling to address the "alignment problem" for multiple mixed-membership unsupervised clustering results in population structure analyses, where clusters represent latent genetic ancestries. This problem stems from three challenges — label-switching, multi-modality, and varying numbers of clusters — which Clumppling resolves in three steps: aligning results with the same number of clusters, detecting distinct solutions or "modes," and aligning modes across different numbers of clusters. Here, we present Clumppling 2.0, an update with features for visualizing the emergence of clusters, comparing aligned results from different models, and incorporating modularity of algorithmic steps. We outline the Clumppling 2.0 workflow, highlighting its improved algorithmic flexibility and visual interpretability through a graph of alignment patterns. We then demonstrate its utility on human genetic datasets that include individuals from admixed populations.

[244] S Ramachandran, NA Rosenberg (2026) Reflections on the Human Genome Diversity Project: a conversation with Marcus W. Feldman, Henry T. Greely, and Mary-Claire King. Genetics 232: iyaf273. [PDF] [Supplement]

The Human Genome Diversity Project (HGDP) began in 1991 as an initiative to study genetic variation from human populations worldwide. In 2002, the HGDP reported the HGDP-CEPH Human Genome Diversity Cell Line Panel, a global panel of 1064 cell lines that is maintained at the Centre d'Etude du Polymorphisme Humain (CEPH) and that has served as a major resource fundamental to the last 25 years of progress in human population genetics. HGDP-CEPH data have been central to research on topics such as human genetic diversity, human population structure, human migrations, the development of population-genetic statistics and software, and the potential value of inclusion of diverse sets of human populations in biomedical research. In this article, two researchers who participated in early analyses of genotypes from the HGDP-CEPH panel in the early 2000s speak with three researchers who played key roles in developing the Human Genome Diversity Project from its origin in 1991. The conversation reflects on the successes and challenges of the effort to launch the HGDP and on its scientific contributions.

[243] BK Moon, NA Rosenberg (2026) Integer sequences for diversity statistics. Journal of Integer Sequences 29: 26.1.5. [PDF]

Consider a discrete set of objects and a sample of size N taken with replacement from the set, producing a list of counts of the objects that corresponds to a partition of N. Two statistics that are commonly used for measuring the "diversity" of the sample are the Gini-Simpson index and the Shannon index. We study the number of possible values that these indices can take across all possible partitions of the sample size N as N increases. The two statistics are highly correlated over the set of partitions of N. However, the number of possible values that the Shannon index can take (A383683) far exceeds the number of possible values of the Gini-Simpson index (A069999), with the latter growing quadratically and the former growing faster than every polynomial.
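
For small N, the two counts can be reproduced by brute force. The sketch below is an illustration rather than the authors' code, and its function names are hypothetical. It relies on two observations: the Gini-Simpson index 1 - Σ(c_i/N)² is determined by the sum of squared counts, and the Shannon index equals log N - (1/N) log ∏ c_i^c_i, so its distinct values can be counted exactly through the integer ∏ c_i^c_i.

```python
from math import prod


def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest


def count_gini_simpson_values(N):
    # Distinct Gini-Simpson values correspond to distinct sums of squares.
    return len({sum(c * c for c in p) for p in partitions(N)})


def count_shannon_values(N):
    # Distinct Shannon values correspond to distinct products of c**c,
    # which avoids comparing floating-point logarithms.
    return len({prod(c ** c for c in p) for p in partitions(N)})


if __name__ == "__main__":
    for N in range(1, 13):
        print(N, count_gini_simpson_values(N), count_shannon_values(N))
```

Enumerating partitions is exponential in N, so this is only practical for small sample sizes.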

[242] CE Shiff, NA Rosenberg (2026) Enumeration of rooted binary perfect phylogenies. Discrete Applied Mathematics 380: 538-561. [PDF]

Rooted binary perfect phylogenies provide a generalization of rooted binary unlabeled trees. In a rooted binary perfect phylogeny, each leaf is assigned a positive integer value that corresponds in a biological setting to the count of the number of indistinguishable lineages associated with the leaf. For the rooted binary unlabeled trees, these integers equal 1. We enumerate rooted binary perfect phylogenies with n ≥ 1 leaves and sample size s, s ≥ n: the rooted binary unlabeled trees with n leaves in which a sample of size s ≥ n is distributed across the n leaves. (1) First, we recursively enumerate rooted binary perfect phylogenies with sample size s, summing over all possible n, 1 ≤ n ≤ s. We obtain an equation for the generating function, showing that asymptotically, the number of rooted binary perfect phylogenies with sample size s grows with ≈ 0.3519(3.2599^s)s^{-3/2}, faster than the rooted binary unlabeled trees, which grow with ≈ 0.3188(2.4833^s)s^{-3/2}. (2) Next, we recursively enumerate rooted binary perfect phylogenies with a specific number of leaves n and sample size s ≥ n. We report closed-form counts of the rooted binary perfect phylogenies with sample size s ≥ n and n=2, 3, and 4 leaves. We provide a recurrence for the generating function describing, for each number of leaves n, the number of rooted binary perfect phylogenies with n leaves and sample size s, as well as an asymptotic normal distribution for the number of leaves in a randomly chosen perfect phylogeny with sample size s. (3) We find a generating function for the number of rooted binary perfect phylogenies with the n-leaf caterpillar shape, growing with s. We also find a generating function and exact count floor(2^s/3) for the number of rooted binary perfect phylogenies with sample size s and any caterpillar shape. A bivariate generating function counting rooted binary perfect phylogenies with n leaves, sample size s, and a caterpillar shape produces an asymptotic normal distribution for the number of leaves in a randomly chosen caterpillar perfect phylogeny with sample size s. (4) Finally, we provide initial results recursively enumerating rooted binary perfect phylogenies with any specific unlabeled tree shape and sample size s. The enumerations further characterize the rooted binary perfect phylogenies, which include the rooted binary unlabeled trees, and which can provide a set of structures useful for various biological contexts.

[241] X Liu, Z Ahsan, NA Rosenberg (2025) Using mathematical constraints to explain narrow ranges for allele-sharing dissimilarities. Theoretical Population Biology 166: 116-137. [PDF]

Allele-sharing dissimilarity (ASD) statistics are measures of genetic differentiation for pairs of individuals or populations. Given the allele-frequency distributions of two populations — possibly the same population — the expected value of an ASD statistic is computed by evaluating the expectation of the pairwise dissimilarity between two individuals drawn at random, each from its associated allele-frequency distribution. For each of two ASD statistics, which we term D₁ and D₂, we investigate the extent to which the expected ASD is constrained by allele frequencies in the two populations; in other words, how is the magnitude of the measure bounded as a function of the frequency of the most frequent allelic type? We first consider dissimilarity of a population with itself, obtaining bounds on expected ASD in terms of the frequency of the most frequent allelic type in the population. We then examine pairs of populations that might or might not possess the same most frequent allelic type. Across the unit interval for the frequency of the most frequent allelic type, the expected allele-sharing dissimilarity has a range that is more restricted than the [0,1] interval. The mathematical constraints on expected ASD assist in explaining a pattern observed empirically in human populations, namely that when averaging across loci, allele-sharing dissimilarities between pairs of individuals often tend to vary only within a relatively narrow range.

[240] E Lappo, NA Rosenberg (2025) Coalescent theory of the ψ directionality index. G3: Genes, Genomes, Genetics 15: jkaf202. [PDF]

The ψ directionality index was introduced by Peter and Slatkin (Evolution 67: 3274-3289, 2013) to infer the direction of range expansions from single-nucleotide polymorphism variation. Computed from the joint site frequency spectrum for two populations, ψ uses shared genetic variants to measure the difference in the amount of genetic drift experienced by the populations, associating excess drift with greater distance from the origin of the range expansion. Although ψ has been successfully applied in natural populations, its statistical properties have not been well understood. In this article, we define Ψ as a random variable originating from a coalescent process in a two-population demography. For samples consisting of a pair of diploid genomes, one from each of two populations, we derive expressions for moments E[Ψ^k] for standard parametrizations of bottlenecks during a founder event. For the expectation E[Ψ], we identify parameter combinations that represent distinct demographic scenarios yet yield the same value of E[Ψ]. We also show that the variance V[Ψ] increases with the time since the bottleneck and bottleneck severity, but does not depend on the size of the ancestral population; the ancestral population size affects ψ computed from many biallelic loci only through its contribution to the total number of loci available for the computation. Finally, we analyze the values of E[Ψ] computed from existing demographic models of Drosophila melanogaster and compare them with empirically computed ψ. Our work builds the foundation for theoretical treatments of the ψ index and can help in evaluating its behavior in empirical applications.

[239] L Devroye, MR Doboli, NA Rosenberg, S Wagner (2025) Tree height and the asymptotic mean of the Colijn-Plazzotta rank of unlabeled binary rooted trees. Bulletin of Mathematical Biology 87: 172. [PDF]

The Colijn-Plazzotta ranking is a bijective encoding of the unlabeled binary rooted trees with positive integers. We show that the rank f(t) of a tree t is closely related to its height h, the maximal path length from a leaf to the root. We consider the rank f(τ_n) of a random n-leaf tree τ_n under each of three models: (i) uniformly random unlabeled unordered binary rooted trees, or unlabeled topologies; (ii) uniformly random leaf-labeled binary trees, or labeled topologies under the uniform model; and (iii) random binary search trees, or labeled topologies under the Yule-Harding model. Relying on the close relationship between tree rank and tree height, we obtain results concerning the asymptotic properties of log log f(τ_n). In particular, we find E{log₂ log f(τ_n)} ~ 2 √(π n) for uniformly random unlabeled ordered binary rooted trees and uniformly random leaf-labeled binary trees, and for a constant α ≈ 4.31107, E{log₂ log f(τ_n)} ~ α log n for leaf-labeled binary trees under the Yule-Harding model. We show that the mean of f(τ_n) itself under the three models is largely determined by the rank c_{n-1} of the highest-ranked tree, the caterpillar, obtaining an asymptotic relationship with π_n c_{n-1}, where π_n is a model-specific function of n. The results resolve open problems, providing a new class of results on an encoding useful in mathematical phylogenetics.

[238] ML Morrison, KS Xue, NA Rosenberg (2025) Quantifying compositional variability in microbial communities with FAVA. Proceedings of the National Academy of Sciences USA 122: e2413211122. [PDF] [Supplement]

Microbial communities vary across space, time, and individual hosts, generating a need for statistical methods capable of quantifying variability across multiple microbiome samples at once. To understand heterogeneity across microbiome samples from different host individuals, sampling times, spatial locations, or experimental replicates, we present FAVA (F_ST-based Assessment of Variability across vectors of relative Abundances), a framework for characterizing compositional variability across two or more microbiome samples. FAVA quantifies variability across many samples of taxonomic or functional relative abundances in a single index ranging between 0 and 1, equaling 0 when all samples are identical and 1 when each sample is entirely composed of a single taxon (and at least two distinct taxa are present across samples). Its definition relies on the population-genetic statistic F_ST, with samples playing the role of "populations" and taxa playing the role of "alleles." Its mathematical properties allow users to compare datasets with different numbers of samples and taxonomic categories. We introduce extensions that incorporate phylogenetic similarity among taxa and spatial or temporal distances between samples. We demonstrate FAVA in two examples. First, we use FAVA to measure how the taxonomic and functional variability of gastrointestinal microbiomes across individuals from seven ruminant species changes along the gastrointestinal tract. Second, we use FAVA to quantify the increase in temporal variability of gut microbiomes in healthy humans following an antibiotic course and to measure the duration of the antibiotic's influence on temporal microbiome variability. We have implemented this tool in an R package, FAVA, for use in pipelines for the analysis of microbial relative abundances.
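
The boundary behavior described in this abstract (0 for identical samples, 1 when each sample consists of a single taxon) can be illustrated with a small sketch. It is not the FAVA R package or its API. It assumes the unweighted form F_ST = (H_T - H_S)/H_T, where H_S is the mean Gini-Simpson index of the samples and H_T is the Gini-Simpson index of the mean abundance vector. The function names are hypothetical.

```python
def gini_simpson(p):
    """Gini-Simpson index 1 - sum(p_i^2) of a relative-abundance vector."""
    return 1.0 - sum(x * x for x in p)


def fst_variability(samples):
    """F_ST-style variability across rows of relative abundances.

    Assumption: unweighted F_ST = (H_T - H_S) / H_T; weighting and the
    phylogenetic extensions mentioned in the abstract are not modeled.
    """
    k = len(samples)
    n_taxa = len(samples[0])
    mean = [sum(s[j] for s in samples) / k for j in range(n_taxa)]
    h_total = gini_simpson(mean)
    h_within = sum(gini_simpson(s) for s in samples) / k
    if h_total == 0:
        raise ValueError("undefined when only one taxon is present overall")
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    identical = [[0.5, 0.5], [0.5, 0.5]]
    disjoint = [[1.0, 0.0], [0.0, 1.0]]
    print(fst_variability(identical), fst_variability(disjoint))
```

The guard against a zero denominator corresponds to the abstract's condition that at least two distinct taxa be present across samples.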

[237] EH Dickey, NA Rosenberg (2025) Labeled histories with multifurcation and simultaneity. Philosophical Transactions of the Royal Society B Biological Sciences 380: 20230307. [PDF]

In mathematical models of phylogenetic trees evolving in time, a labelled history for a rooted labelled bifurcating tree is a temporal sequence of the branchings that give rise to the tree. That is, given a leaf-labelled tree with n leaves and n-1 internal nodes, a labelled history is an identification between the internal nodes and the set {1,2,...,n-1}, such that the label assigned to a given node is strictly greater than the labels assigned to its descendants. We generalize the concept of labelled histories to r-furcating trees. Consider a rooted labelled tree in which each internal node has exactly r children, r ≥ 2. We first generalize the enumeration of labelled histories for a bifurcating tree (r=2) to enumerate labelled histories for an r-furcating tree with arbitrary r ≥ 2. We formulate a conjecture for the rooted unlabelled r-furcating tree shape on n internal nodes whose labelled topologies have the most labelled histories. Finally, we enumerate labelled histories for r-furcating trees in a setting that allows for simultaneous branchings. These results advance mathematical phylogenetic modelling by extending computations concerning fundamental features of bifurcating phylogenetic trees to a more general class of multifurcating trees.
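
For the bifurcating case (r=2) that this abstract generalizes, the number of labelled histories of a given labelled topology follows a classical formula: (n-1)! divided by the product, over internal nodes, of the number of internal nodes in the subtree rooted at that node. The sketch below covers only that case, not the r-furcating or simultaneous-branching results of the paper. The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
from math import factorial


def labelled_histories(tree):
    """Count labelled histories of a rooted bifurcating labelled tree.

    A tree is either a leaf label (any non-tuple value) or a pair
    (left, right) of subtrees. This encoding is a hypothetical choice.
    """
    sizes = []  # internal-node count of the subtree below each internal node

    def visit(t):
        if not isinstance(t, tuple):
            return 0
        left, right = t
        m = visit(left) + visit(right) + 1
        sizes.append(m)
        return m

    total_internal = visit(tree)
    denominator = 1
    for m in sizes:
        denominator *= m
    return factorial(total_internal) // denominator


def total_labelled_histories(n):
    """Total over all labelled topologies with n leaves: n!(n-1)!/2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


if __name__ == "__main__":
    caterpillar = ((("A", "B"), "C"), "D")
    balanced = (("A", "B"), ("C", "D"))
    print(labelled_histories(caterpillar), labelled_histories(balanced))
    print(total_labelled_histories(4))
```

With four leaves there are 12 caterpillar labellings and 3 balanced labellings, so the per-topology counts sum to the total given by the second function.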

[236] NA Rosenberg, T Stadler, M Steel (2025) "A mathematical theory of evolution": phylogenetic models dating back 100 years. Philosophical Transactions of the Royal Society of London B: Biological Sciences 380: 20230297. [PDF]

(No abstract)

[235] NA Rosenberg, J Van Cleve (2025) Editorial. Theoretical Population Biology 161: 50-51.

(No abstract)

[234] EE Armstrong*, JA Mooney*, KA Solari, BY Kim, GS Barsh, VB Grant, G Greenbaum, CB Kaelin, K Panchenko, JK Pickrell, NA Rosenberg, OA Ryder, T Yokoyama, U Ramakrishnan, DA Petrov, EA Hadly (2024) Unraveling the genomic diversity and admixture history of captive tigers in the United States. Proceedings of the National Academy of Sciences USA 121: e2402924121. [PDF] [Supplementary Appendix]

Genomic studies of endangered species have primarily focused on describing diversity patterns and resolving phylogenetic relationships, with the overarching goal of informing conservation efforts. However, few studies have investigated genomic diversity housed in captive populations. For tigers (Panthera tigris), captive individuals vastly outnumber those in the wild, but their diversity remains largely unexplored. Privately owned captive tiger populations have remained an enigma in the conservation community, with some believing that these individuals are severely inbred, while others believe they may be a source of now-extinct diversity. Here, we present a large-scale genetic study of the private (non-zoo) captive tiger population in the United States, also known as "Generic" tigers. We find that the Generic tiger population has an admixture fingerprint comprising all six extant wild tiger subspecies. Of the 138 Generic individuals sequenced for the purpose of this study, no individual had ancestry from only one subspecies. We show that the Generic tiger population has a comparable amount of genetic diversity relative to most wild subspecies, few private variants, and fewer deleterious mutations. We observe inbreeding coefficients similar to wild populations, although there are some individuals within both the Generic and wild populations that are substantially inbred. Additionally, we develop a reference panel for tigers that can be used with imputation to accurately distinguish individuals and assign ancestry with ultralow coverage (0.25x) data. By providing a cost-effective alternative to whole-genome sequencing (WGS), the reference panel provides a resource to assist in tiger conservation efforts for both ex- and in situ populations.

[233] NA Rosenberg (2024) Review of Tree Balance Indices: A Comprehensive Survey by M Fischer, L Herbst, S Kersting, L Kühn, K Wicke. SIAM Review 66: 395-397.

(No abstract)

[232] L Agranat-Tamir, M Fuchs, B Gittenberger, NA Rosenberg (2024) Asymptotic enumeration of rooted binary unlabeled galled trees with a fixed number of galls. In C. Mailler, S. Wild, eds. Proceedings of the 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2024). Leibniz International Proceedings in Informatics (LIPIcs) 302: 27. Schloss Dagstuhl — Leibniz-Zentrum für Informatik. [PDF]

Galled trees appear in problems concerning admixture, horizontal gene transfer, hybridization, and recombination. Building on a recursive enumerative construction, we study the asymptotic behavior of the number of rooted binary unlabeled (normal) galled trees as the number of leaves n increases, maintaining a fixed number of galls g. We find that the exponential growth with n of the number of rooted binary unlabeled normal galled trees with g galls has the same value irrespective of the value of g ≥ 0. The subexponential growth, however, depends on g; it follows c_g n^{2g-3/2}, where c_g is a constant dependent on g. Although for each g, the exponential growth is approximately 2.4833ⁿ, summing across all g, the exponential growth is instead approximated by the much larger 4.8230ⁿ.

[231] M Doboli, H-K Hwang, NA Rosenberg (2024) Periodic behavior of the minimal Colijn-Plazzotta rank for trees with a fixed number of leaves. In C. Mailler, S. Wild, eds. Proceedings of the 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2024). Leibniz International Proceedings in Informatics (LIPIcs) 302: 18. Schloss Dagstuhl — Leibniz-Zentrum für Informatik. [PDF]

The Colijn-Plazzotta ranking is a certain bijection between the unlabeled binary rooted trees and the positive integers, such that the integer associated with a tree is determined from the integers associated with the two immediate subtrees of its root. Letting a_n denote the minimal Colijn-Plazzotta rank among all trees with a specified number of leaves n, the sequence {a_n} begins 1, 2, 3, 4, 6, 7, 10, 11, 20, 22, 28, 29, 53, 56, 66, 67 (OEIS A354970). Here we show that a_n ~ 2 [2^{P(log₂ n)}]ⁿ, where P varies as a periodic function dependent on {log₂ n} and satisfies 1.24602 < 2^{P(log₂ n)} < 1.33429.
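
The opening terms of {a_n} can be reproduced with a short recursion. The sketch assumes the standard Colijn-Plazzotta rule, which this abstract does not spell out: a single leaf has rank 1, and a tree whose root subtrees have ranks x ≥ y has rank x(x-1)/2 + y + 1. Because this rule is increasing in both x and y, the minimal rank for n leaves is attained by combining minimal-rank subtrees. The function names are hypothetical.

```python
def cp_rank(x, y):
    """Colijn-Plazzotta rank of a tree whose root subtrees have ranks x and y.

    Assumption: the standard rule hi*(hi-1)/2 + lo + 1, with leaf rank 1.
    """
    hi, lo = max(x, y), min(x, y)
    return hi * (hi - 1) // 2 + lo + 1


def minimal_cp_ranks(max_leaves):
    """Return [a_1, ..., a_max_leaves], the minimal rank for each leaf count."""
    a = [0, 1]  # a[0] is a placeholder; a[1] = 1 is the single-leaf tree
    for n in range(2, max_leaves + 1):
        # Try every split of n leaves into i and n - i leaves.
        a.append(min(cp_rank(a[i], a[n - i]) for i in range(1, n // 2 + 1)))
    return a[1:]


def caterpillar_ranks(max_leaves):
    """Ranks of the caterpillar trees with 1, ..., max_leaves leaves."""
    c = [1]
    for _ in range(2, max_leaves + 1):
        c.append(cp_rank(c[-1], 1))
    return c


if __name__ == "__main__":
    print(minimal_cp_ranks(16))
    print(caterpillar_ranks(6))
```

The caterpillar ranks grow doubly exponentially, in contrast with the exponential growth of the minimal ranks described in the abstract.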

[230] L Agranat-Tamir, JA Mooney, NA Rosenberg (2024) Counting the genetic ancestors from source populations in members of an admixed population. Genetics 226: iyae011. [PDF] [Supplement]

In a genetically admixed population, admixed individuals possess genealogical and genetic ancestry from multiple source groups. Under a mechanistic model of admixture, we study the number of distinct ancestors from the source populations that the admixture represents. Combining a mechanistic admixture model with a recombination model that describes the probability that a genealogical ancestor is a genetic ancestor, for a member of a genetically admixed population, we count genetic ancestors from the source populations—those genealogical ancestors from the source populations who contribute to the genome of the modern admixed individual. We compare patterns in the numbers of genealogical and genetic ancestors across the generations. To illustrate the enumeration of genetic ancestors from source populations in an admixed group, we apply the model to the African-American population, extending recent results on the numbers of African and European genealogical ancestors that contribute to the pedigree of an African-American chosen at random, so that we also evaluate the numbers of African and European genetic ancestors who contribute to random African-American genomes. The model suggests that the autosomal genome of a random African-American born in the interval 1960-1965 contains genetic contributions from a mean of 162 African (standard deviation 47, interquartile range 127-192) and 32 European ancestors (standard deviation 14, interquartile range 21-43). The enumeration of genetic ancestors can potentially be performed in other diploid species in which admixture and recombination models can be specified.

[229] L Agranat-Tamir, S Mathur, NA Rosenberg (2024) Enumeration of rooted binary unlabeled galled trees. Bulletin of Mathematical Biology 86: 45. [PDF]

Rooted binary galled trees generalize rooted binary trees to allow a restricted class of cycles, known as galls. We build upon the Wedderburn-Etherington enumeration of rooted binary unlabeled trees with n leaves to enumerate rooted binary unlabeled galled trees with n leaves, also enumerating rooted binary unlabeled galled trees with n leaves and g galls, 0 ≤ g ≤ floor[(n-1)/2]. The enumerations rely on a recursive decomposition that considers subtrees descended from the nodes of a gall, adopting a restriction on galls that amounts to considering only the rooted binary normal unlabeled galled trees in our enumeration. We write an implicit expression for the generating function encoding the numbers of trees for all n. We show that the number of rooted binary unlabeled galled trees grows with 0.0779(4.8230ⁿ)n^{-3/2}, exceeding the growth 0.3188(2.4833ⁿ)n^{-3/2} of the number of rooted binary unlabeled trees without galls. However, the growth of the number of galled trees with only one gall has the same exponential order 2.4833 as the number with no galls, exceeding it only in the subexponential term, 0.3910n^{1/2} compared to 0.3188n^{-3/2}. For a fixed number of leaves n, the number of galls g that produces the largest number of rooted binary unlabeled galled trees lies intermediate between the minimum of g=0 and the maximum of g= floor[(n-1)/2]. We discuss implications in mathematical phylogenetics.
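
The gall-free baseline in this abstract, the Wedderburn-Etherington count of rooted binary unlabeled trees, can be computed with the classical recursion and compared with the quoted asymptotic 0.3188(2.4833ⁿ)n^{-3/2}. The sketch covers only the case g=0; it does not implement the galled-tree recursion of the paper, and the function names are hypothetical.

```python
def wedderburn_etherington(max_leaves):
    """Return u with u[n] = number of rooted binary unlabeled trees on n leaves."""
    u = [0, 1]  # u[0] is a placeholder; u[1] = 1
    for n in range(2, max_leaves + 1):
        # Root subtrees of unequal size i < n - i.
        total = sum(u[i] * u[n - i] for i in range(1, (n - 1) // 2 + 1))
        if n % 2 == 0:
            # Equal-size root subtrees: unordered pairs with repetition.
            h = u[n // 2]
            total += h * (h + 1) // 2
        u.append(total)
    return u


def asymptotic_estimate(n):
    """Approximation quoted in the abstract, with constants rounded to 4 digits."""
    return 0.3188 * 2.4833 ** n * n ** -1.5


if __name__ == "__main__":
    u = wedderburn_etherington(200)
    for n in (10, 50, 200):
        print(n, u[n] / asymptotic_estimate(n))
```

The ratio printed in the last loop should drift toward 1 as n grows, up to the rounding of the two constants.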

[228] TD Gress, NA Rosenberg (2024) Mathematical constraints on a family of biodiversity measures via connections with Rényi entropy. BioSystems 237: 105153.

The Hill numbers are statistics for biodiversity measurement in ecological studies, closely related to the Rényi and Shannon entropies from information theory. Recent developments in the mathematics of diversity in the setting of population genetics have produced mathematical constraints that characterize how standard measures depend on the highest-frequency class in a discrete probability distribution. Here, we apply these constraints to diversity statistics in ecology, focusing on the Hill numbers and the Rényi and Shannon entropies. The mathematical bounds can shift perspectives on the diversities of communities, in that when upper and lower bounds on Hill numbers are evaluated in a classic butterfly example, Hill numbers that are initially larger in one community switch positions — so that associated normalized Hill numbers are instead smaller than those of the other community. The new bounds hence add to the tools available for interpreting a commonly used family of statistics for ecological data.

[227] DJ Cotter, AL Severson, JTL Kang, HN Godrej, S Carmi, NA Rosenberg (2024) Modeling the effects of consanguinity on autosomal and X-chromosomal runs of homozygosity and identity-by-descent sharing. G3: Genes, Genomes, Genetics 14: jkad264. [PDF] [Supplement]

Runs of homozygosity (ROH) and identity-by-descent (IBD) sharing can be studied in diploid coalescent models by noting that ROH and IBD-sharing at a genomic site are predicted to be inversely related to coalescence times—which in turn can be mathematically obtained in terms of parameters describing consanguinity rates. Comparing autosomal and X-chromosomal coalescent models, we consider ROH and IBD-sharing in relation to consanguinity that proceeds via multiple forms of first-cousin mating. We predict that across populations with different levels of consanguinity, (1) in a manner that is qualitatively parallel to the increase of autosomal IBD-sharing with autosomal ROH, X-chromosomal IBD-sharing increases with X-chromosomal ROH, owing to the dependence of both quantities on consanguinity levels; (2) even in the absence of consanguinity, X-chromosomal ROH and IBD-sharing levels exceed corresponding values for the autosomes, owing to the smaller population size and lower coalescence time for the X chromosome than for autosomes; (3) with matrilateral consanguinity, the relative increase in ROH and IBD-sharing on the X chromosome compared to the autosomes is greater than in the absence of consanguinity. Examining genome-wide SNPs in human populations for which consanguinity levels have been estimated, we find that autosomal and X-chromosomal ROH and IBD-sharing levels generally accord with the predictions. We find that each 1% increase in autosomal ROH is associated with an increase of 2.1% in X-chromosomal ROH, and each 1% increase in autosomal IBD-sharing is associated with an increase of 1.6% in X-chromosomal IBD-sharing. For each calculation, particularly for ROH, the estimate is reasonably close to the increase of 2% predicted by the population-size difference between autosomes and X chromosomes. The results support the utility of coalescent models for understanding patterns of genomic sharing and their dependence on sex-biased processes.

[226] E Lappo, NA Rosenberg (2024) Solving the Arizona search problem by imputation. iScience 27: 108831. [PDF] [Supplement]

An "Arizona search" is an evaluation of the numbers of pairs of profiles in a forensic-genetic database that possess partial or complete genotypic matches; such a search assists in establishing the extent to which a set of loci provides unique identifications. In forensic genetics, however, the potential for performing Arizona searches is constrained by the limited availability of actual forensic profiles for research purposes. Here, we use genotype imputation to circumvent this problem. From a database of genomes, we impute genotypes of forensic short-tandem-repeat (STR) loci from neighboring single-nucleotide polymorphisms (SNPs), searching for partial STR matches using the imputed profiles. We compare the distributions of the numbers of partial matches in imputed and actual profiles, finding close agreement. Despite limited potential for performing Arizona searches with actual forensic STR profiles, the questions that such searches seek to answer can be posed with imputation-based Arizona searches in increasingly large SNP databases.

[225] X Liu, NM Kopelman, NA Rosenberg (2024) Clumppling: cluster matching and permutation program with integer linear programming. Bioinformatics 40: btad751. [PDF] [Supplement]

Motivation. In the mixed-membership unsupervised clustering analyses commonly used in population genetics, multiple replicate data analyses can differ in their clustering solutions. Combinatorial algorithms assist in aligning clustering outputs from multiple replicates so that clustering solutions can be interpreted and combined across replicates. Although several algorithms have been introduced, challenges exist in achieving optimal alignments and performing alignments in reasonable computation time. Results. We present Clumppling, a method for aligning replicate solutions in mixed-membership unsupervised clustering. The method uses integer linear programming for finding optimal alignments, embedding the cluster alignment problem in standard combinatorial optimization frameworks. In example analyses, we find that it achieves solutions with preferred values of a desired objective function relative to those achieved by Pong and that it proceeds with less computation time than Clumpak. It is also the first method to permit alignments across replicates with multiple arbitrary values of the number of clusters K. Availability and implementation. Clumppling is available at https://github.com/PopGenClustering/Clumppling.
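
The alignment problem described above can be illustrated with a small sketch. It does not use integer linear programming and is not Clumppling's implementation: it searches all column permutations by brute force, which is feasible only for small K. The function name `align_clusters`, the squared-difference cost, and the toy matrices are assumptions made for illustration.

```python
from itertools import permutations


def align_clusters(reference, replicate):
    """Find the column permutation of `replicate` that best matches `reference`.

    Both arguments are lists of membership-coefficient rows (one row per
    individual, one column per cluster). The cost is the sum of squared
    differences, an illustrative choice rather than the published objective.
    """
    k = len(reference[0])
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        # perm[j] is the replicate column matched to reference column j.
        cost = sum(
            (ref_row[j] - rep_row[perm[j]]) ** 2
            for ref_row, rep_row in zip(reference, replicate)
            for j in range(k)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Toy replicate: the same solution as the reference with its columns relabeled.
reference = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.1, 0.9]]
replicate = [[0.1, 0.0, 0.9], [0.7, 0.1, 0.2], [0.1, 0.9, 0.0]]
perm, cost = align_clusters(reference, replicate)
print(perm, cost)
```

Brute force examines K! permutations, which is one reason to pose the matching as a combinatorial optimization problem when K is larger.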

[224] E Lappo, NA Rosenberg (2024) A lattice structure for ancestral configurations arising from the relationship between gene trees and species trees. Discrete Applied Mathematics 343: 65-81. [PDF]

To a given gene tree topology G and species tree topology S with leaves labeled bijectively from a fixed set X, one can associate a set of ancestral configurations, each of which encodes a set of gene lineages that can be found at a given node of the species tree. We introduce a lattice structure on ancestral configurations, studying the directed graphs that provide graphical representations of lattices of ancestral configurations. For a matching gene tree topology and species tree topology G=S, we present a method for defining the digraph of ancestral configurations from the tree topology by using iterated cartesian products of graphs. We show that a specific set of paths on the digraph of ancestral configurations is in bijection with the set of labeled histories — a well-known phylogenetic object that enumerates possible temporal orderings of the coalescences of a tree. For each of a series of tree families, we obtain closed-form expressions for the number of labeled histories by using this bijection to count paths on associated digraphs. Finally, we prove that our lattice construction extends to nonmatching tree pairs, and we use it to characterize pairs (G,S) having the maximal number of ancestral configurations for a fixed G. We discuss how the construction provides new methods for performing enumerations of combinatorial aspects of gene and species trees.

[223] ARP Maranca, NA Rosenberg (2024) Bijections between the multifurcating unlabeled rooted trees and the positive integers. Advances in Applied Mathematics 153: 102612. [PDF]

Colijn and Plazzotta (2018) [1] described a bijective scheme for associating the unlabeled bifurcating rooted trees with the positive integers. In mathematical and biological applications of unlabeled rooted trees, however, nodes of rooted trees are sometimes multifurcating rather than bifurcating. Building on the bijection between the unlabeled bifurcating rooted trees and the positive integers, we describe bijective schemes for associating the unlabeled multifurcating rooted trees with the positive integers. We devise bijections with the positive integers for a set of trees in which each non-leaf node has exactly k child nodes, and for a set of trees in which each non-leaf node has at most k child nodes. The calculations make use of Macaulay's binomial expansion formula. The generalization to multifurcating trees can assist with the use of unlabeled trees for applications in evolutionary biology, such as the measurement of phylogenetic patterns of genetic lineages in pathogens.
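
For orientation, the bifurcating scheme that this work builds on can be sketched as follows. This covers only the bifurcating case of Colijn and Plazzotta, not the multifurcating bijections of the paper. The representation (a leaf is `None`, an internal node is a pair of subtrees) and the names `cp_rank` and `cp_tree` are assumptions for illustration.

```python
def cp_rank(tree):
    """Colijn-Plazzotta rank of an unlabeled bifurcating rooted tree.

    A leaf (None) has rank 1. A tree whose two subtrees have ranks a >= b
    has rank a*(a-1)/2 + b + 1.
    """
    if tree is None:
        return 1
    left, right = tree
    a, b = sorted((cp_rank(left), cp_rank(right)), reverse=True)
    return a * (a - 1) // 2 + b + 1


def cp_tree(n):
    """Inverse map: the tree whose Colijn-Plazzotta rank is n (n >= 1)."""
    if n == 1:
        return None
    # Find the unique a with a*(a-1)/2 + 1 < n <= a*(a+1)/2 + 1.
    a = 1
    while a * (a + 1) // 2 + 1 < n:
        a += 1
    b = n - 1 - a * (a - 1) // 2
    return (cp_tree(a), cp_tree(b))


cherry = (None, None)
print(cp_rank(cherry))            # two-leaf tree
print(cp_rank((cherry, cherry)))  # balanced four-leaf tree
```

Applying `cp_rank` to `cp_tree(n)` returns `n`, which is one way to check the bijection on an initial range of integers.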

[222] F Disanto, M Fuchs, C-Y Huang, AR Paningbatan, NA Rosenberg (2024) The distributions under two species-tree models of the total number of ancestral configurations for matching gene trees and species trees. Advances in Applied Mathematics 152: 102594.

Given a gene-tree labeled topology G and a species tree S, the ancestral configurations at an internal node k of S represent the combinatorially different sets of gene lineages that can be present at k when all possible realizations of G in S are considered. Ancestral configurations have been introduced as a data structure for evaluating the conditional probability of a gene-tree labeled topology given a species tree, and their enumeration assists in describing the complexity of this computation. In the case that the gene-tree labeled topology G=t matches that of the species tree S, by techniques of analytic combinatorics, we study distributional properties of the total number of ancestral configurations measured across the different nodes of a random labeled topology t selected under the uniform and the Yule probability models. Under both of these probabilistic scenarios, we show that the total number T_n of ancestral configurations of a random labeled topology of n taxa asymptotically follows a lognormal distribution. Over uniformly distributed labeled topologies, the asymptotic growth of the mean and variance of T_n are found to satisfy E_U[T_n] ~ 2.449 · 1.333ⁿ and V_U[T_n] ~ 5.050 · 1.822ⁿ, respectively. Under the Yule model, which assigns higher probabilities to more balanced topologies, we obtain the mean E_Y[T_n] ~ 1.425ⁿ and V_Y[T_n] ~ 2.045ⁿ.

[221] MC King, NA Rosenberg (2023) A mathematical connection between single-elimination sports tournaments and evolutionary trees. Mathematics Magazine 96: 484-497. [PDF]

How many ways are there to arrange the sequence of games in a single-elimination sports tournament? We consider the connection between this enumeration problem and the enumeration of "labeled histories," or sequences of asynchronous branching events, in mathematical phylogenetics. The possibility of playing multiple games simultaneously in different arenas suggests an extension of the enumeration of labeled histories to scenarios in which multiple branching events occur simultaneously. We provide a recursive result enumerating game sequences and labeled histories in which simultaneity is allowed. For a March Madness basketball tournament of 68 labeled teams, the number of possible sequences of games is ~1.91x10⁷⁸ if arbitrarily many arenas are available, but only ~3.60x10⁶⁸ if all games must be played sequentially in the same arena.

[220] X Liu, Z Ahsan, TK Martheswaran, NA Rosenberg (2023) When is the allele-sharing dissimilarity between two populations exceeded by the allele-sharing dissimilarity of a population with itself? Statistical Applications in Genetics and Molecular Biology 22: 2023004. [PDF]

Allele-sharing statistics for a genetic locus measure the dissimilarity between two populations as a mean of the dissimilarity between random pairs of individuals, one from each population. Owing to within-population variation in genotype, allele-sharing dissimilarities can have the property that they have a nonzero value when computed between a population and itself. We consider the mathematical properties of allele-sharing dissimilarities in a pair of populations, treating the allele frequencies in the two populations parametrically. Examining two formulations of allele-sharing dissimilarity, we obtain the distributions of within-population and between-population dissimilarities for pairs of individuals. We then mathematically explore the scenarios in which, for certain allele-frequency distributions, the within-population dissimilarity – the mean dissimilarity between randomly chosen members of a population – can exceed the dissimilarity between two populations. Such scenarios assist in explaining observations in population-genetic data that members of a population can be empirically more genetically dissimilar from each other on average than they are from members of another population. For a population pair, however, the mathematical analysis finds that at least one of the two populations always possesses smaller within-population dissimilarity than the value of the between-population dissimilarity. We illustrate the mathematical results with an application to human population-genetic data.
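
A simplified numerical sketch can illustrate the phenomenon. It assumes a biallelic locus and measures dissimilarity as the probability that two randomly drawn alleles differ, which is an allele-level simplification and not necessarily either of the two formulations analyzed in the paper. The function names and the example frequencies are assumptions.

```python
def between(p, q):
    """Probability that an allele drawn from a population with frequency p
    differs from an allele drawn from a population with frequency q."""
    return p * (1 - q) + q * (1 - p)


def within(p):
    """Dissimilarity of a population with itself (two independent draws)."""
    return between(p, p)


p, q = 0.6, 0.9
print(within(p), within(q), between(p, q))

# Grid check: at least one population has within-population dissimilarity
# no larger than the between-population value.
grid = [i / 50 for i in range(51)]
always_one_smaller = all(
    min(within(a), within(b)) <= between(a, b) + 1e-12
    for a in grid
    for b in grid
)
print(always_one_smaller)
```

With p = 0.6 and q = 0.9, the first population is more dissimilar from itself than from the second population, while the second population is not. Under this simplified measure, the difference between the between-population and within-population values factors as (q - p)(1 - 2p), and the two such factors for the two populations cannot both be negative.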

[219] E Lappo, NA Rosenberg, MW Feldman (2023) Cultural transmission of move choice in chess. Proceedings of the Royal Society B: Biological Sciences 290: 37964528. [PDF] [Supplement]

The study of cultural evolution benefits from detailed analysis of cultural transmission in specific human domains. Chess provides a platform for understanding the transmission of knowledge due to its active community of players, precise behaviours and long-term records of high-quality data. In this paper, we perform an analysis of chess in the context of cultural evolution, describing multiple cultural factors that affect move choice. We then build a population-level statistical model of move choice in chess, based on the Dirichlet-multinomial likelihood, to analyse cultural transmission over decades of recorded games played by leading players. For moves made in specific positions, we evaluate the relative effects of frequency-dependent bias, success bias and prestige bias on the dynamics of move frequencies. We observe that negative frequency-dependent bias plays a role in the dynamics of certain moves, and that other moves are compatible with transmission under prestige bias or success bias. These apparent biases may reflect recent changes, namely the introduction of computer chess engines and online tournament broadcasts. Our analysis of chess provides insights into broader questions concerning how social learning biases affect cultural evolution.

[218] J Kim, NA Rosenberg (2023) Record-matching of STR profiles with fragmentary genomic SNP data. European Journal of Human Genetics 31: 1283-1290. [PDF] [Supplement]

In many forensic settings, identity of a DNA sample is sought from poor-quality DNA, for which the typical STR loci tabulated in forensic databases are not possible to reliably genotype. Genome-wide SNPs, however, can potentially be genotyped from such samples via next-generation sequencing, so that queries can in principle compare SNP genotypes from DNA samples of interest to STR genotype profiles that represent proposed matches. We use genetic record-matching to evaluate the possibility of testing SNP profiles obtained from poor-quality DNA samples to identify exact and relatedness matches to STR profiles. Using simulations based on whole-genome sequences, we show that in some settings, similar match accuracies to those seen with full coverage of the genome are obtained by genetic record-matching for SNP data that represent 5-10% genomic coverage. Thus, if even a fraction of random genomic SNPs can be genotyped by next-generation sequencing, then the potential may exist to test the resulting genotype profiles for matches to profiles consisting exclusively of nonoverlapping STR loci. The result has implications in relation to criminal justice, mass disasters, missing-person cases, studies of ancient DNA, and genomic privacy.

[217] ML Morrison, NA Rosenberg (2023) Mathematical bounds on Shannon entropy given the abundance of the ith most abundant taxon. Journal of Mathematical Biology 87: 76. [PDF] [Supplement]

The measurement of diversity is a central component of studies in ecology and evolution, with broad uses spanning multiple biological scales. Studies of diversity conducted in population genetics and ecology make use of analogous concepts and even employ equivalent mathematical formulas. For the Shannon entropy statistic, recent developments in the mathematics of diversity in population genetics have produced mathematical constraints on the statistic in relation to the frequency of the most frequent allele. These results have characterized the ways in which standard measures depend on the highest-frequency class in a discrete probability distribution. Here, we extend mathematical constraints on the Shannon entropy in relation to entries in specific positions in a vector of species abundances, listed in decreasing order. We illustrate the new mathematical results using abundance data from examples involving coral reefs and sponge microbiomes. The new results update the understanding of the relationship of a standard measure to the abundance vectors from which it is calculated, potentially contributing to improved interpretation of numerical measurements of biodiversity.

[216] X Liu, NM Kopelman, NA Rosenberg (2023) A Dirichlet model of alignment cost in mixed-membership unsupervised clustering. Journal of Computational and Graphical Statistics 32: 1145-1159. [PDF] [Supplement]

Mixed-membership unsupervised clustering is widely used to extract informative patterns from data in many application areas. For a shared dataset, the stochasticity and unsupervised nature of clustering algorithms can cause difficulties in comparing clustering results produced by different algorithms, or even multiple runs of the same algorithm, as outcomes can differ owing to permutation of the cluster labels or genuine differences in clustering results. Here, with a focus on inference of individual genetic ancestry in population-genetic studies, we study the cost of misalignment of mixed-membership unsupervised clustering replicates under a theoretical model of cluster memberships. Using Dirichlet distributions to model membership coefficient vectors, we provide theoretical results quantifying the alignment cost as a function of the Dirichlet parameters and the Hamming permutation difference between replicates. For fixed Dirichlet parameters, the alignment cost is seen to increase with the Hamming distance between permutations. Datasets with low variance across individuals of membership coefficients for specific clusters generally produce high misalignment costs — so that a single optimal permutation has far lower cost than suboptimal permutations. Higher variability in data, as represented by greater variance of membership coefficients, generally results in alignment costs that are similar between the optimal permutation and suboptimal permutations. We demonstrate the application of the theoretical results to data simulated under the Dirichlet model, as well as to membership estimates from inference of human-genetic ancestry. The results can contribute to improving cluster alignment algorithms that seek to find optimal permutations of replicates. Supplementary materials for this article are available online.

[215] JA Mooney, L Agranat-Tamir, JK Pritchard, NA Rosenberg (2023) On the number of genealogical ancestors tracing to the source groups of an admixed population. Genetics 224: iyad079. [PDF] [Supplement]

Members of genetically admixed populations possess ancestry from multiple source groups, and studies of human genetic admixture frequently estimate ancestry components corresponding to fractions of individual genomes that trace to specific ancestral populations. However, the same numerical ancestry fraction can represent a wide array of admixture scenarios within an individual's genealogy. Using a mechanistic model of admixture, we consider admixture genealogically: how many ancestors from the source populations does the admixture represent? We consider African-Americans, for whom continent-level estimates produce a 75-85% value for African ancestry on average and 15-25% for European ancestry. Genetic studies together with key features of African-American demographic history suggest ranges for parameters of a simple three-epoch model. Considering parameter sets compatible with estimates of current ancestry levels, we infer that if all genealogical lines of a random African-American born during 1960-1965 are traced back until they reach members of source populations, the mean over parameter sets of the expected number of genealogical lines terminating with African individuals is 314 (interquartile range 240-376), and the mean of the expected number terminating in Europeans is 51 (interquartile range 32-69). Across discrete generations, the peak number of African genealogical ancestors occurs in birth cohorts from the early 1700s, and the probability exceeds 50% that at least one European ancestor was born more recently than 1835. Our genealogical perspective can contribute to further understanding the admixture processes that underlie admixed populations. For African-Americans, the results provide insight both on how many of the ancestors of a typical African-American might have been forcibly displaced in the Transatlantic Slave Trade and on how many separate European admixture events might exist in a typical African-American genealogy.

[214] R Laurent, ZA Szpiech, SS da Costa, V Thouzeau, CA Fortes-Lima, F Dessarps-Freichey, L Lémée, J Utgé, NA Rosenberg, M Baptista, P Verdu (2023) A genetic and linguistic analysis of the admixture histories of the islands of Cabo Verde. eLife 12: e79827. [PDF]

From the 15th to the 19th century, the Trans-Atlantic Slave-Trade (TAST) influenced the genetic and cultural diversity of numerous populations. We explore genomic and linguistic data from the nine islands of Cabo Verde, the earliest European colony of the era in Africa, a major Slave-Trade platform between the 16th and 19th centuries, and a previously uninhabited location ideal for investigating early admixture events between Europeans and Africans. Using local-ancestry inference approaches, we find that genetic admixture in Cabo Verde occurred primarily between Iberian and certain Senegambian populations, although forced and voluntary migrations to the archipelago involved numerous other populations. Inter-individual genetic and linguistic variation recapitulates the geographic distribution of individuals' birth-places across Cabo Verdean islands, following an isolation-by-distance model with reduced genetic and linguistic effective dispersals within the archipelago, and suggesting that Kriolu language variants have developed together with genetic divergences at very reduced geographical scales. Furthermore, based on approximate Bayesian computation inferences of highly complex admixture histories, we find that admixture occurred early on each island, long before the 18th-century massive TAST deportations triggered by the expansion of the plantation economy in Africa and the Americas, and after this era mostly during the abolition of the TAST and of slavery in European colonial empires. Our results illustrate how shifting socio-cultural relationships between enslaved and non-enslaved communities during and after the TAST shaped enslaved-African descendants' genomic diversity and structure on both sides of the Atlantic.

[213] DJ Cotter, EF Hofgard, J Novembre, ZA Szpiech, NA Rosenberg (2023) A rarefaction approach for measuring population differences in rare and common variation. Genetics 224: iyad070. [PDF] [Supplement]

In studying allele-frequency variation across populations, it is often convenient to classify an allelic type as "rare," with nonzero frequency less than or equal to a specified threshold, "common," with a frequency above the threshold, or entirely unobserved in a population. When sample sizes differ across populations, however, especially if the threshold separating "rare" and "common" corresponds to a small number of observed copies of an allelic type, discreteness effects can lead a sample from one population to possess substantially more rare allelic types than a sample from another population, even if the two populations have extremely similar underlying allele-frequency distributions across loci. We introduce a rarefaction-based sample-size correction for use in comparing rare and common variation across multiple populations whose sample sizes potentially differ. We use our approach to examine rare and common variation in worldwide human populations, finding that the sample-size correction introduces subtle differences relative to analyses that use the full available sample sizes. We introduce several ways in which the rarefaction approach can be applied: we explore the dependence of allele classifications on subsample sizes, we permit more than two classes of allelic types of nonzero frequency, and we analyze rare and common variation in sliding windows along the genome. The results can assist in clarifying similarities and differences in allele-frequency patterns across populations.
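
The idea of a rarefaction-based correction can be sketched with hypergeometric subsampling. This is an illustration under stated assumptions, not the published procedure: the threshold `t` is taken as a maximum copy count for the "rare" class, and the function name `class_probabilities` is hypothetical.

```python
from math import comb


def class_probabilities(c, N, g, t):
    """Probabilities that an allelic type is unobserved, rare, or common
    in a random subsample.

    c: copies of the allelic type in the full sample
    N: full sample size (number of allele copies)
    g: subsample size drawn without replacement (g <= N)
    t: largest copy count still classified as "rare" (an assumed convention)
    """
    total = comb(N, g)
    probs = {"unobserved": 0.0, "rare": 0.0, "common": 0.0}
    for j in range(min(c, g) + 1):
        # Hypergeometric probability of seeing exactly j copies in the subsample.
        p = comb(c, j) * comb(N - c, g - j) / total
        if j == 0:
            probs["unobserved"] += p
        elif j <= t:
            probs["rare"] += p
        else:
            probs["common"] += p
    return probs


singleton = class_probabilities(c=1, N=10, g=5, t=1)
print(singleton)
```

Summing these class probabilities over allelic types gives expected counts in each class at a common subsample size g, so that populations with different sample sizes can be compared on the same footing.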

[212] S Mathur, NA Rosenberg (2023) All galls are divided into three or more parts: recursive enumeration of labeled histories for galled trees. Algorithms for Molecular Biology 18: 1. [PDF]

Objective: In mathematical phylogenetics, a labeled rooted binary tree topology can possess any of a number of labeled histories, each of which represents a possible temporal ordering of its coalescences. Labeled histories appear frequently in calculations that describe the combinatorics of phylogenetic trees. Here, we generalize the concept of labeled histories from rooted phylogenetic trees to rooted phylogenetic networks, specifically for the class of rooted phylogenetic networks known as rooted galled trees. Results: Extending a recursive algorithm for enumerating the labeled histories of a labeled tree topology, we present a method to enumerate the labeled histories associated with a labeled rooted galled tree. The method relies on a recursive decomposition by which each gall in a galled tree possesses three or more descendant subtrees. We exhaustively provide the numbers of labeled histories for all small galled trees, finding that each gall reduces the number of labeled histories relative to a specified galled tree that does not contain it. Conclusion: The results expand the set of structures for which labeled histories can be enumerated, extending a well-known calculation for phylogenetic trees to a class of phylogenetic networks.
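
The tree case that the galled-tree method extends can be sketched with the standard recursion for binary trees: the internal nodes of the two subtrees at a node can be interleaved in any order that preserves each subtree's own ordering. The sketch below handles trees only, not galls. The nested-tuple representation and the name `labeled_histories` are assumptions for illustration.

```python
from math import comb


def labeled_histories(tree):
    """Return (internal_nodes, histories) for a rooted binary tree.

    A leaf is any non-tuple label; an internal node is a pair (left, right).
    If the subtrees have a and b internal nodes, their coalescences can be
    interleaved in C(a + b, a) ways.
    """
    if not isinstance(tree, tuple):
        return 0, 1
    left, right = tree
    n_left, h_left = labeled_histories(left)
    n_right, h_right = labeled_histories(right)
    histories = comb(n_left + n_right, n_left) * h_left * h_right
    return n_left + n_right + 1, histories


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(labeled_histories(balanced)[1])
print(labeled_histories(caterpillar)[1])
```

The balanced four-leaf topology has two labeled histories because either cherry can coalesce first, whereas the caterpillar topology has only one.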

[211] F Disanto, M Fuchs, AR Paningbatan, NA Rosenberg (2022) The distributions under two species-tree models of the number of root ancestral configurations for matching gene trees and species trees. Annals of Applied Probability 32: 4426-4458. [PDF]

For a pair consisting of a gene tree and a species tree, the ancestral configurations at a species-tree internal node are the distinct sets of gene lineages that can be present at that node. The enumeration of root ancestral configurations (ancestral configurations at the species-tree root) assists in describing the complexity of gene-tree probability calculations in evolutionary biology. Assuming that the gene tree and species tree match in topology, we study the distribution of the number of root ancestral configurations of a random labeled tree topology under the uniform and Yule-Harding models. We employ analytic combinatorics, considering ancestral configurations in the context of additive tree parameters and using singularity analysis to evaluate asymptotic growth of the coefficients of generating functions. For both models, we obtain asymptotic lognormal distributions for the number of root ancestral configurations. For Yule-Harding random trees, we also obtain the asymptotic mean (~1.425ⁿ) and variance (~2.045ⁿ) of the number of root ancestral configurations, paralleling previous results for the uniform model (mean (4/3)ⁿ, variance ~1.822ⁿ). A methodological innovation is that to obtain the Yule-Harding asymptotic variance, singularity analysis is conducted from the Riccati differential equation that the generating function satisfies, without possessing the generating function itself.

[210] E Lappo, NA Rosenberg (2022) Approximations to the expectations and variances of ratios of tree properties under the coalescent. G3: Genes, Genomes, Genetics 12: jkac205. [PDF]

Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios X_n/Y_n, for all 15 pairs among the variables {H_n, L_n, E_n, I_n, B_n, T_k}, considering n leaves and 2 ≤ k ≤ n. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as n increases. Although E_n has expectation 2 and H_n has expectation 2 in the limit as n → ∞, the approximation to the limiting expectation for E_n/H_n is not 1, instead equaling π²/3-2 ≈ 1.28987. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.
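
A rough numerical sketch shows why the limiting approximation for E_n/H_n need not equal 1. It uses the generic second-order Taylor (delta-method) formula for the expectation of a ratio, which may differ in detail from the expressions derived in the paper. Two assumptions are made here: time is measured in coalescent units in which T_k is exponential with mean 2/(k(k-1)), and the covariance of E_n and H_n is set to zero as a simplification for large n.

```python
import math


def ratio_expectation_taylor(mean_x, mean_y, var_y, cov_xy):
    """Second-order Taylor approximation to E[X/Y]."""
    return mean_x / mean_y - cov_xy / mean_y**2 + var_y * mean_x / mean_y**3


def height_moments(n):
    """Mean and variance of tree height H_n, the sum of independent T_k."""
    mean = sum(2 / (k * (k - 1)) for k in range(2, n + 1))
    var = sum(4 / (k * (k - 1)) ** 2 for k in range(2, n + 1))
    return mean, var


n = 100_000
mean_h, var_h = height_moments(n)
# E[E_n] = 2 as stated in the abstract; cov_xy = 0.0 is an assumed simplification.
approx = ratio_expectation_taylor(2.0, mean_h, var_h, 0.0)
print(approx, math.pi**2 / 3 - 2)
```

The first-order term alone gives a ratio near 1; the excess comes from the variance of H_n, which stays positive as n grows.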

[209] NA Rosenberg (2022) Mendel of mathematics. Notices of the American Mathematical Society 69: 1564-1565. [PDF]

(No abstract)

[208] ML Morrison, N Alcala, NA Rosenberg (2022) FSTruct: an F_ST-based tool for measuring ancestry variation in inference of population structure. Molecular Ecology Resources 22: 2614-2626. [PDF] [Supplement]

In model-based inference of population structure from individual-level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across predefined groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector and does not depend on its mean. We apply the approach, which makes use of a normalized F_ST statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more predefined groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.

[207] DJ Cotter, AL Severson, S Carmi, NA Rosenberg (2022) Limiting distribution of X-chromosomal coalescence times under first-cousin consanguineous mating. Theoretical Population Biology 147: 1-15.

By providing additional opportunities for coalescence within families, the presence of consanguineous unions in a population reduces coalescence times relative to non-consanguineous populations. First-cousin consanguinity can take one of six forms differing in the configuration of sexes in the pedigree of the male and female cousins who join in a consanguineous union: patrilateral parallel, patrilateral cross, matrilateral parallel, matrilateral cross, bilateral parallel, and bilateral cross. Considering populations with each of the six types of first-cousin consanguinity individually and a population with a mixture of the four unilateral types, we examine coalescent models of consanguinity. We previously computed, for first-cousin consanguinity models, the mean coalescence time for X-chromosomal loci and the limiting distribution of coalescence times for autosomal loci. Here, we use the separation-of-time-scales approach to obtain the limiting distribution of coalescence times for X-chromosomal loci. This limiting distribution has an instantaneous coalescence probability that depends on the probability that a union is consanguineous; lineages that do not coalesce instantaneously coalesce according to an exponential distribution. We study the effects on the coalescence time distribution of the type of first-cousin consanguinity, showing that patrilateral-parallel and patrilateral-cross consanguinity have no effect on X-chromosomal coalescence time distributions and that matrilateral-parallel consanguinity decreases coalescence times to a greater extent than does matrilateral-cross consanguinity.

[206] RS Mehta, M Steel, NA Rosenberg (2022) The probability of joint monophyly of samples of gene lineages for all species in an arbitrary species tree. Journal of Computational Biology 27: 679-703.

Monophyly is a feature of a set of genetic lineages in which every lineage in the set is more closely related to all other members of the set than it is to any lineage outside the set. Multiple sets of lineages that are separately monophyletic are said to be reciprocally monophyletic, or jointly monophyletic. The prevalence of reciprocal monophyly, or joint monophyly (JM), has been used to evaluate phylogenetic and phylogeographic hypotheses, as well as to delimit species. These applications often make use of a probability of JM under models of gene lineage evolution. Studies in coalescent theory have computed this JM probability for small numbers of separate groups in arbitrary species trees and for arbitrary numbers of separate groups in trivial species trees. In this study, generalizing existing results on monophyly probabilities under the multispecies coalescent, we derive the probability of JM for arbitrary numbers of separate groups in arbitrary species trees. We illustrate how our result collapses to previously examined cases. We also study the effect of tree height, sample size, and number of species on the probability of JM. We obtain relatively simple lower and upper bounds on the JM probability. Our results expand the scope of JM calculations beyond small numbers of species, subsuming past formulas that have been used in simpler cases.

[205] X Liu, NA Rosenberg, G Greenbaum (2022) Extracting hierarchical features of cultural variation using network-based clustering. Evolutionary Human Sciences 4: e18. [PDF] [Supplement] [Supplementary code] [Supplementary data]

High-dimensional datasets on cultural characters contribute to uncovering insights about factors that influence cultural evolution. Because cultural variation in part reflects descent processes with a hierarchical structure – including the descent of populations and vertical transmission of cultural traits – methods designed for hierarchically structured data have potential to find applications in the analysis of cultural variation. We adapt a network-based hierarchical clustering method for use in analysing cultural variation. Given a set of entities, the method constructs a similarity network, hierarchically depicting community structure among them. We illustrate the approach using four datasets: pronunciation variation in the US mid-Atlantic region, folklore variation in worldwide cultures, phonemic variation across worldwide languages and temporal variation in first names in the US. In these examples, the method provides insights into processes that affect cultural variation, uncovering geographic and other influences on observed patterns and cultural characters that make important contributions to them.

[204] JA Palacios, A Bhaskar, F Disanto, NA Rosenberg (2022) Enumeration of binary trees compatible with a perfect phylogeny. Journal of Mathematical Biology 84: 54. [PDF]

Evolutionary models used for describing molecular sequence variation suppose that at a non-recombining genomic segment, sequences share ancestry that can be represented as a genealogy — a rooted, binary, timed tree, with tips corresponding to individual sequences. Under the infinitely-many-sites mutation model, mutations are randomly superimposed along the branches of the genealogy, so that every mutation occurs at a chromosomal site that has not previously mutated; if a mutation occurs at an interior branch, then all individuals descending from that branch carry the mutation. The implication is that observed patterns of molecular variation from this model impose combinatorial constraints on the hidden state space of genealogies. In particular, observed molecular variation can be represented in the form of a perfect phylogeny, a tree structure that fully encodes the mutational differences among sequences. For a sample of n sequences, a perfect phylogeny might not possess n distinct leaves, and hence might be compatible with many possible binary tree structures that could describe the evolutionary relationships among the n sequences. Here, we investigate enumerative properties of the set of binary ranked and unranked tree shapes that are compatible with a perfect phylogeny, and hence, the binary ranked and unranked tree shapes conditioned on an observed pattern of mutations under the infinitely-many-sites mutation model. We provide a recursive enumeration of these shapes. We consider both perfect phylogenies that can be represented as binary and those that are multifurcating. The results have implications for computational aspects of the statistical inference of evolutionary parameters that underlie sets of molecular sequences.

[203] N Alcala, NA Rosenberg (2022) Mathematical constraints on F_ST: multiallelic markers in arbitrarily many populations. Philosophical Transactions of the Royal Society B: Biological Sciences 377: 20200414. [PDF] [Supplement]

Interpretations of values of the F_ST measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that F_ST values computed from a biallelic locus in a set of multiple populations and F_ST values computed from a multiallelic locus in a pair of populations are mathematically constrained as a function of the frequency of the allele that is most frequent across populations. We generalize from these cases to report here the mathematical constraint on F_ST given the frequency M of the most frequent allele at a multiallelic locus in a set of multiple populations. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of F_ST and M helps in disentangling the separate influences of mutation and migration on F_ST. Finally, we show that our results explain a puzzling pattern of microsatellite differentiation: the lower F_ST in an interspecific comparison between humans and chimpanzees than in the comparison of chimpanzee populations. We discuss the implications of our results for the use of F_ST.

[202] MD Edge, S Ramachandran, NA Rosenberg (2022) Celebrating 50 years since Lewontin's apportionment of human diversity. Philosophical Transactions of the Royal Society B: Biological Sciences 377: 20200405. [PDF] (introduction to special issue)

(No abstract)

[201] NA Rosenberg, MF Boni (2022) Mathematical epidemiology for a later age. Theoretical Population Biology 144: 81-83. [PDF] (editorial)

(No abstract)

[200] AL Severson, BF Byrd, EK Mallott, AC Owings, M DeGiorgio, A de Flamingh, C Nijmeh, MV Arellano, A Leventhal, NA Rosenberg, RS Malhi (2022) Ancient and modern genomics of the Ohlone Indigenous population of California. Proceedings of the National Academy of Sciences USA 119: e2111533119. [PDF] [Supplement]

Traditional knowledge, along with archaeological and linguistic evidence, documents that California supports culturally and linguistically diverse Indigenous populations. Studies that have included ancient genomes in this region, however, have focused primarily on broad-scale migration history of the North American continent, with relatively little attention to local population dynamics. Here, in a partnership involving researchers and the Muwekma Ohlone tribe, we analyze genomic data from ancient and present-day individuals from the San Francisco Bay Area in California: 12 ancient individuals dated to 1905 to 1826 and 601 to 184 calibrated years before the present (cal BP) from two archaeological sites and eight present-day members of the Muwekma Ohlone tribe, whose ancestral lands include these two sites. We find that when compared to other ancient and modern individuals throughout the Americas, the 12 ancient individuals from the San Francisco Bay Area cluster with ancient individuals from Southern California. At a finer scale of analysis, we find that the 12 ancient individuals from the San Francisco Bay Area have distinct ancestry from the other groups and that this ancestry has a component of continuity over time with the eight present-day Muwekma Ohlone individuals. These results add to our understanding of Indigenous population history in the San Francisco Bay Area, in California, and in western North America more broadly.

[199] NA Rosenberg (2022) The 2022 Feldman Prize. Theoretical Population Biology 143: 105-106. [PDF]

(No abstract)

[198] E Alimpiev, NA Rosenberg (2022) A compendium of covariances and correlation coefficients of coalescent tree properties. Theoretical Population Biology 143: 1-13. [PDF]

Gene genealogies are frequently studied by measuring properties such as their height (H), length (L), sum of external branches (E), sum of internal branches (I), and mean of their two basal branches (B), and the coalescence times that contribute to the other genealogical features (T). These tree properties and their relationships can provide insight into the effects of population-genetic processes on genealogies and genetic sequences. Here, under the coalescent model, we study the 15 correlations among pairs of features of genealogical trees: H_n, L_n, E_n, I_n, B_n, and T_k for a sample of size k, with 2 ≤ k ≤ n. We report high correlations among H_n, L_n, I_n, and B_n, with all pairwise correlations of these quantities having values greater than or equal to √(6) [6ζ(3) + 6 - π²]/(π √(18+9π²-π⁴)) ≈ 0.84930 in the limit as n → ∞, where ζ is the Riemann zeta function. Although E_n has expectation 2 for all n and H_n has expectation 2 in the limit, their limiting correlation is 0. The results contribute toward understanding features of the shapes of coalescent trees.
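As a quick numerical check of the limiting constant quoted in this abstract, the sketch below evaluates the closed-form expression directly. The function names are hypothetical, and ζ(3) is approximated by a truncated series rather than taken from a library; this illustrates only the arithmetic, not the authors' derivation.

```python
import math

def zeta3(terms: int = 200_000) -> float:
    """Approximate the Riemann zeta function at 3 by a partial sum of 1/k^3."""
    return sum(1.0 / k**3 for k in range(1, terms + 1))

def limiting_correlation_bound() -> float:
    """Evaluate sqrt(6) [6 zeta(3) + 6 - pi^2] / (pi sqrt(18 + 9 pi^2 - pi^4))."""
    pi2 = math.pi ** 2
    numerator = math.sqrt(6) * (6 * zeta3() + 6 - pi2)
    denominator = math.pi * math.sqrt(18 + 9 * pi2 - pi2 ** 2)
    return numerator / denominator

if __name__ == "__main__":
    # Rounded to five decimals for comparison with the value in the abstract.
    print(round(limiting_correlation_bound(), 5))
```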

[197] MC King, NA Rosenberg (2021) A simple derivation of the mean of the Sackin index of tree balance under the uniform model on rooted binary labeled trees. Mathematical Biosciences 342: 108688. [PDF]

(No abstract)

[196] E Alimpiev, NA Rosenberg (2021) Enumeration of coalescent histories for caterpillar species trees and p-pseudocaterpillar gene trees. Advances in Applied Mathematics 131: 102265. [PDF]

For a fixed set X containing n taxon labels, an ordered pair consisting of a gene tree topology G and a species tree topology S bijectively labeled with the labels of X possesses a set of coalescent histories: mappings from the set of internal nodes of G to the set of edges of S describing possible lists of edges in S on which the coalescences in G take place. Enumerations of coalescent histories for gene trees and species trees have produced suggestive results regarding the pairs that, for a fixed n, have the largest number of coalescent histories. We define a class of 2-cherry binary tree topologies that we term p-pseudocaterpillars, examining coalescent histories for non-matching pairs in the case in which S has a caterpillar shape and G has a p-pseudocaterpillar shape. Using a construction that associates coalescent histories for (G,S) with a class of "roadblocked" monotonic paths, we identify the p-pseudocaterpillar labeled gene tree topology that, for a fixed caterpillar labeled species tree topology, gives rise to the largest number of coalescent histories. The shape that maximizes the number of coalescent histories places the "second" cherry of the p-pseudocaterpillar equidistantly from the root of the "first" cherry and from the tree root. A symmetry in the numbers of coalescent histories for p-pseudocaterpillar gene trees and caterpillar species trees is seen to exist around the maximizing value of the parameter p. The results provide insight into the factors that influence the number of coalescent histories possible for a given gene tree and species tree.

[195] DJ Cotter, AL Severson, NA Rosenberg (2021) The effect of consanguinity on coalescence times on the X chromosome. Theoretical Population Biology 140: 32-43.

Consanguineous unions increase the frequency at which identical genomic segments are inherited along separate paths of descent, decreasing coalescence times for pairs of alleles drawn from an individual who is the offspring of a consanguineous pair. For an autosomal locus, it has recently been shown that the mean time to the most recent common ancestor (TMRCA) for two alleles in the same individual and the mean TMRCA for two alleles in two separate individuals both decrease with increasing consanguinity in a population. Here, we extend this analysis to the X chromosome, considering X-chromosomal coalescence times under a coalescent model with diploid, male–female mating pairs. We examine four possible first-cousin mating schemes that are equivalent in their effects on autosomes, but that have differing effects on the X chromosome: patrilateral-parallel, patrilateral-cross, matrilateral-parallel, and matrilateral-cross. In each mating model, we calculate mean TMRCA for X-chromosomal alleles sampled either within or between individuals. We describe a consanguinity effect on X-chromosomal TMRCA that differs from the autosomal pattern under matrilateral but not under patrilateral first-cousin mating. For matrilateral first cousins, the effect of consanguinity in reducing TMRCA is stronger on the X chromosome than on the autosomes, with an increased effect of parallel-cousin mating compared to cross-cousin mating. The theoretical computations support the utility of the model in understanding patterns of genomic sharing on the X chromosome.

[194] AL Severson, S Carmi, NA Rosenberg (2021) Variance and limiting distribution of coalescence times in a diploid model of a consanguineous population. Theoretical Population Biology 139: 50-65.

Recent modeling studies interested in runs of homozygosity (ROH) and identity by descent (IBD) have sought to connect these properties of genomic sharing to pairwise coalescence times. Here, we examine a variety of features of pairwise coalescence times in models that consider consanguinity. In particular, we extend a recent diploid analysis of mean coalescence times for lineage pairs within and between individuals in a consanguineous population to derive the variance of coalescence times, studying its dependence on the frequency of consanguinity and the kinship coefficient of consanguineous relationships. We also introduce a separation-of-time-scales approach that treats consanguinity models analogously to mathematically similar phenomena such as partial selfing, using this approach to obtain coalescence-time distributions. This approach shows that the consanguinity model behaves similarly to a standard coalescent, scaling population size by a factor 1-3c, where c represents the kinship coefficient of a randomly chosen mating pair. It provides the explanation for an earlier result describing mean coalescence time in the consanguinity model in terms of c. The results extend the potential to make predictions about ROH and IBD in relation to demographic parameters of diploid populations.

[193] J Kim, MD Edge, A Goldberg, NA Rosenberg (2021) Skin deep: the decoupling of genetic admixture levels from phenotypes that differed between source populations. American Journal of Physical Anthropology 175: 406-421.

Objectives: In genetic admixture processes, source groups for an admixed population possess distinct patterns of genotype and phenotype at the onset of admixture. Particularly in the context of recent and ongoing admixture, such differences are sometimes taken to serve as markers of ancestry for individuals; that is, phenotypes initially associated with the ancestral background in one source population are assumed to continue to reflect ancestry in that population. Such phenotypes might possess ongoing significance in social categorizations of individuals, owing in part to perceived continuing correlations with ancestry. However, genotypes or phenotypes initially associated with ancestry in one specific source population have been seen to decouple from overall admixture levels, so that they no longer serve as proxies for genetic ancestry. Here, we aim to develop an understanding of the joint dynamics of admixture levels and phenotype distributions in an admixed population. Methods: We devise a mechanistic model, consisting of an admixture model, a quantitative trait model, and a mating model. We analyze the behavior of the mechanistic model in relation to the model parameters. Results: We find that it is possible for the decoupling of genetic ancestry and phenotype to proceed quickly, and that it occurs faster if the phenotype is driven by fewer loci. Positive assortative mating attenuates the process of dissociation relative to a scenario in which mating is random with respect to genetic admixture and with respect to phenotype. Conclusions: The mechanistic framework suggests that in an admixed population, a trait that initially differed between source populations might serve as a reliable proxy for ancestry for only a short time, especially if the trait is determined by few loci. It follows that a social categorization based on such a trait is increasingly uninformative about genetic ancestry and about other traits that differed between source populations at the onset of admixture.

[192] G Greenbaum, MW Feldman, NA Rosenberg, J Kim (2021) Designing gene drives to limit spillover to non-target populations. PLoS Genetics 17: e1009278. [PDF] [Supplement]

The prospect of utilizing CRISPR-based gene-drive technology for controlling populations has generated much excitement. However, the potential for spillovers of gene-drive alleles from the target population to non-target populations has raised concerns. Here, using mathematical models, we investigate the possibility of limiting spillovers to non-target populations by designing differential-targeting gene drives, in which the expected equilibrium gene-drive allele frequencies are high in the target population but low in the non-target population. We find that achieving differential targeting is possible with certain configurations of gene-drive parameters, but, in most cases, only under relatively low migration rates between populations. Under high migration, differential targeting is possible only in a narrow region of the parameter space. Because fixation of the gene drive in the non-target population could severely disrupt ecosystems, we outline possible ways to avoid this outcome. We apply our model to two potential applications of gene drives — field trials for malaria-vector gene drives and control of invasive species on islands. We discuss theoretical predictions of key requirements for differential targeting and their practical implications.

[191] NA Rosenberg (2021) Population models, mathematical epidemiology, and the COVID-19 pandemic. Theoretical Population Biology 137: 1. [PDF]

(No abstract)

[190] A Harpak, N Garud, NA Rosenberg, DA Petrov, M Combs, PS Pennings, J Munshi-South (2021) Genetic adaptation in New York City rats. Genome Biology and Evolution 13: evaa247. [PDF] [Supplement]

Brown rats (Rattus norvegicus) thrive in urban environments by navigating the anthropocentric environment and taking advantage of human resources and by-products. From the human perspective, rats are a chronic problem that causes billions of dollars in damage to agriculture, health, and infrastructure. Did genetic adaptation play a role in the spread of rats in cities? To approach this question, we collected whole-genome sequences from 29 brown rats from New York City (NYC) and scanned for genetic signatures of adaptation. We tested for 1) high-frequency, extended haplotypes that could indicate selective sweeps and 2) loci of extreme genetic differentiation between the NYC sample and a sample from the presumed ancestral range of brown rats in northeast China. We found candidate selective sweeps near or inside genes associated with metabolism, diet, the nervous system, and locomotory behavior. Patterns of differentiation between NYC and Chinese rats at putative sweep loci suggest that many sweeps began after the split from the ancestral population. Together, our results suggest several hypotheses on adaptation in rats living in proximity to humans.

[189] NA Rosenberg (2021) On the Colijn-Plazzotta numbering scheme for unlabeled binary rooted trees. Discrete Applied Mathematics 291: 88-98. [PDF]

Colijn and Plazzotta (2018) introduced a scheme for bijectively associating the unlabeled binary rooted trees with the positive integers. First, the rank 1 is associated with the 1-leaf tree. Proceeding recursively, the ordered pair (k₁,k₂), k₁ ≥ k₂ ≥ 1, is then associated with the tree whose left subtree has rank k₁ and whose right subtree has rank k₂. Following dictionary order on ordered pairs, the tree whose left and right subtrees have the ordered pair of ranks (k₁,k₂) is assigned rank k₁(k₁-1)/2 + 1 + k₂. With this ranking, given a number of leaves n, we determine recursions for a_n, the smallest rank assigned to some tree with n leaves, and b_n, the largest rank assigned to some tree with n leaves. The smallest rank is assigned to the maximally balanced tree, and the largest rank is assigned to the caterpillar. For n equal to a power of 2, the value of a_n is seen to increase exponentially with 2αⁿ for a constant α ≈ 1.24602; more generally, we show it is bounded by a_n < 1.5ⁿ. The value of b_n is seen to increase with 2β^(2ⁿ) for a constant β ≈ 1.05653. The great difference in the rates of increase for a_n and b_n indicates that as the rank is incremented, the number of leaves for the tree associated with that rank quickly traverses a wide range of values. We interpret the results in relation to applications in evolutionary biology.
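The ranking rule in this abstract can be illustrated with a short sketch. The tree encoding (a leaf is the empty tuple, an internal node is a 2-tuple) and all function names are assumptions made here for illustration. The smallest and largest ranks for n leaves are found by brute force over all ways of splitting the leaves between the two subtrees, which is not necessarily the recursion derived in the paper.

```python
from functools import lru_cache

LEAF = ()  # hypothetical encoding: a leaf is the empty tuple

def cp_pair_rank(k1: int, k2: int) -> int:
    """Rank of the tree whose subtrees have ranks k1 >= k2 >= 1."""
    return k1 * (k1 - 1) // 2 + 1 + k2

def cp_rank(tree) -> int:
    """Rank of an unlabeled rooted binary tree given as nested 2-tuples."""
    if tree == LEAF:
        return 1
    left, right = tree
    k2, k1 = sorted((cp_rank(left), cp_rank(right)))  # enforce k1 >= k2
    return cp_pair_rank(k1, k2)

@lru_cache(maxsize=None)
def extreme_ranks(n: int) -> tuple:
    """Return (smallest, largest) rank over all trees with n leaves.

    The pair rank is nondecreasing in both subtree ranks, so for each split
    of the leaves it suffices to combine the extreme ranks of the two parts.
    """
    if n == 1:
        return (1, 1)
    smallest, largest = [], []
    for i in range(1, n // 2 + 1):
        a_i, b_i = extreme_ranks(i)
        a_j, b_j = extreme_ranks(n - i)
        smallest.append(cp_pair_rank(max(a_i, a_j), min(a_i, a_j)))
        largest.append(cp_pair_rank(max(b_i, b_j), min(b_i, b_j)))
    return (min(smallest), max(largest))

if __name__ == "__main__":
    caterpillar4 = (((LEAF, LEAF), LEAF), LEAF)
    balanced4 = ((LEAF, LEAF), (LEAF, LEAF))
    print(cp_rank(balanced4), cp_rank(caterpillar4), extreme_ranks(4))
```

For four leaves, the balanced tree and the caterpillar attain the two extremes returned by `extreme_ranks(4)`, in line with the statement that the maximally balanced tree has the smallest rank and the caterpillar the largest.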

[188] NA Rosenberg (2020) A population-genetic perspective on the similarities and differences among worldwide human populations. Human Biology 92: 135-152.

Recent studies have produced a variety of advances in the investigation of genetic similarities and differences among human populations. In this reprinted article, originally published in Human Biology in 2011 (vol. 83, no. 6, pp. 659-684), I pose a series of questions about human population-genetic similarities and differences, and I then answer these questions by numerical computation with a single shared population-genetic data set. The collection of answers obtained provides an introductory perspective for understanding key results on the features of worldwide human genetic variation. A new foreword discusses the original article in light of the research that has followed.

[187] SM Boca, L Huang, NA Rosenberg (2020) On the heterozygosity of an admixed population. Journal of Mathematical Biology 81: 1217-1250.

In this study, we consider admixed populations through their expected heterozygosity, a measure of genetic diversity. A population is termed admixed if its members possess recent ancestry from two or more separate sources. As a result of the fusion of source populations with different genetic variants, admixed populations can exhibit high levels of genetic diversity, reflecting contributions of their multiple ancestral groups. For a model of an admixed population derived from K source populations, we obtain a relationship between its heterozygosity and its proportions of admixture from the various source populations. We show that the heterozygosity of the admixed population is at least as great as that of the least heterozygous source population, and that it potentially exceeds the heterozygosities of all of the source populations. The admixture proportions that maximize the heterozygosity possible for an admixed population formed from a specified set of source populations are also obtained under specific conditions. We examine the special case of K=2 source populations in detail, characterizing the maximal admixture in terms of the heterozygosities of the two source populations and the value of F_ST between them. In this case, the heterozygosity of the admixed population exceeds the maximal heterozygosity of the source groups if the divergence between them, measured by F_ST, is large enough, namely above a certain bound that is a function of the heterozygosities of the source groups. We present applications to simulated data as well as to data from human admixture scenarios, providing results useful for interpreting the properties of genetic variability in admixed populations.
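A minimal sketch of the quantities discussed in this abstract, under two assumptions made here: the admixed population's allele frequencies are the admixture-weighted average of the source frequencies, and expected heterozygosity is one minus the sum of squared allele frequencies. The function names and example frequencies are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def admixed_frequencies(sources, gammas):
    """Allele frequencies of the admixed population (assumed linear mixing).

    sources: list of K allele-frequency vectors, one per source population.
    gammas:  list of K admixture proportions summing to 1.
    """
    n_alleles = len(sources[0])
    return [sum(g * src[i] for g, src in zip(gammas, sources))
            for i in range(n_alleles)]

if __name__ == "__main__":
    # Two source populations at a biallelic locus (illustrative values only).
    sources = [[0.9, 0.1], [0.2, 0.8]]
    gammas = [0.5, 0.5]
    h_sources = [heterozygosity(s) for s in sources]
    h_admixed = heterozygosity(admixed_frequencies(sources, gammas))
    print(h_sources, h_admixed)
```

In this example the admixed heterozygosity exceeds that of both sources, one instance of the behavior the abstract describes for sufficiently divergent source populations.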

[186] J Kim, NA Rosenberg, JA Palacios (2020) Distance metrics for ranked evolutionary trees. Proceedings of the National Academy of Sciences USA 117: 28876-28886. [PDF] [Supplement]

Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.

[185] AL Fortier, J Kim, NA Rosenberg (2020) Human-genetic ancestry inference and false positives in forensic familial searching. G3: Genes, Genomes, Genetics 10: 2893-2902. [PDF] [Supplement]

In forensic familial search methods, a query DNA profile is tested against a database to determine if the query profile represents a close relative of a database entrant. One challenge for familial search is that the calculations may require specification of allele frequencies for the unknown population from which the query profile has originated. The choice of allele frequencies affects the rate at which non-relatives are erroneously classified as relatives, and allele-frequency misspecification can substantially inflate false positive rates compared to use of allele frequencies drawn from the same population as the query profile. Here, we use ancestry inference on the query profile to circumvent the high false positive rates that result from highly misspecified allele frequencies. In particular, we perform ancestry inference on the query profile and make use of allele frequencies based on its inferred genetic ancestry. In a test for sibling matches on profiles that represent unrelated individuals, we demonstrate that false positive rates for familial search with use of ancestry inference to specify the allele frequencies are similar to those seen when allele frequencies align with the population of origin of a profile. Because ancestry inference is possible to perform on query profiles, the extreme allele-frequency misspecifications that produce the highest false positive rates can be avoided. We discuss the implications of the results in the context of concerns about the forensic use of familial searching.

[184] A Goldberg, A Rastogi, NA Rosenberg (2020) Assortative mating by population of origin in a mechanistic model of admixture. Theoretical Population Biology 134: 129-146.

Populations whose mating pairs have levels of similarity in phenotypes or genotypes that differ systematically from the level expected under random mating are described as experiencing assortative mating. Excess similarity in mating pairs is termed positive assortative mating, and excess dissimilarity is negative assortative mating. In humans, empirical studies suggest that mating pairs from various admixed populations (whose ancestry derives from two or more source populations) possess correlated ancestry components that indicate the occurrence of positive assortative mating on the basis of ancestry. Generalizing a two-sex mechanistic admixture model, we devise a model of one form of ancestry-assortative mating that occurs through preferential mating based on source population. Under the model, we study the moments of the admixture fraction distribution for different assumptions about mating preferences, including both positive and negative assortative mating by population. We demonstrate that whereas the mean admixture under assortative mating is equivalent to that of a corresponding randomly mating population, the variance of admixture depends on the level and direction of assortative mating. We consider two special cases of assortative mating by population: first, a single admixture event, and second, constant contributions to the admixed population over time. In contrast to standard settings in which positive assortment increases variation within a population, certain assortative mating scenarios allow the variance of admixture to decrease relative to a corresponding randomly mating population: with the three populations we consider, the variance-increasing effect of positive assortative mating within a population might be overwhelmed by a variance-decreasing effect emerging from mating preferences involving other pairs of populations. The effect of assortative mating is smaller on the X chromosome than on the autosomes because inheritance of the X in males depends only on the mother’s ancestry, not on the mating pair. Because the variance of admixture is informative about the timing of admixture and possibly about sex-biased admixture contributions, the effects of assortative mating are important to consider in inferring features of population history from distributions of admixture values. Our model provides a framework to quantitatively study assortative mating under flexible scenarios of admixture over time.

[183] RS Mehta, NA Rosenberg (2020) Modelling anti-vaccine sentiment as a cultural pathogen. Evolutionary Human Sciences 2: e21. [PDF] [Supplement]

Culturally transmitted traits that have deleterious effects on health-related traits can be regarded as cultural pathogens. A cultural pathogen can produce coupled dynamics with its associated health-related traits, so that understanding the dynamics of a health-related trait benefits from consideration of the dynamics of the associated cultural pathogen. Here, we treat anti-vaccine sentiment as a cultural pathogen, modelling its 'infection' dynamics with the infection dynamics of the associated vaccine-preventable disease. In a coupled susceptible-infected-resistant (SIR) model, consisting of an SIR model for the anti-vaccine sentiment and an interacting SIR model for the infectious disease, we explore the effect of anti-vaccine sentiment on disease dynamics. We find that disease endemism is contingent on the presence of the sentiment, and that presence of sentiment can enable diseases to become endemic when they would otherwise have disappeared. Furthermore, the sentiment dynamics can create situations in which the disease suddenly returns after a long period of dormancy. We study the effect of assortative sentiment-based interactions on the dynamics of sentiment and disease, identifying a tradeoff whereby assortative meeting aids the spread of a disease but hinders the spread of sentiment. Our results can contribute to finding strategies that reduce the impact of a cultural pathogen on disease, illuminating the value of cultural evolutionary modelling in the analysis of disease dynamics.

[182] IM Arbisser, NA Rosenberg (2020) F_ST and the triangle inequality for biallelic markers. Theoretical Population Biology 133: 117-129.

The population differentiation statistic F_ST, introduced by Sewall Wright, is often treated as a pairwise distance measure between populations. As was known to Wright, however, F_ST is not a true metric because allele frequencies exist for which it does not satisfy the triangle inequality. We prove that a stronger result holds: for biallelic markers whose allele frequencies differ across three populations, F_ST never satisfies the triangle inequality. We study the deviation from the triangle inequality as a function of the allele frequencies of three populations, identifying the frequency vector at which the deviation is maximal. We also examine the implications of the failure of the triangle inequality for four-point conditions for placement of groups of four populations on evolutionary trees. Next, we study the extent to which F_ST fails to satisfy the triangle inequality in human genomic data, finding that some loci produce deviations near the maximum. We provide results describing the consequences of the theory for various types of data analysis, including multidimensional scaling and inference of neighbor-joining trees from pairwise F_ST matrices.
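The failure of the triangle inequality can be seen numerically with a small sketch. It assumes one common pairwise definition, F_ST = (H_T - H_S)/H_T for two equally weighted populations at a biallelic locus; the paper's exact formulation may differ, and the function names and example frequencies are hypothetical.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Pairwise F_ST = (H_T - H_S) / H_T for two equally weighted populations.

    p1 and p2 are the frequencies of one allele at a biallelic locus.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)            # heterozygosity of the pooled pair
    if h_t == 0:
        return 0.0                           # assumed convention: locus fixed in both
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_t - h_s) / h_t

def triangle_deviation(p1: float, p2: float, p3: float) -> float:
    """Largest pairwise F_ST minus the sum of the other two.

    A positive value means the triangle inequality fails for this triple.
    """
    d = sorted([fst_biallelic(p1, p2), fst_biallelic(p1, p3), fst_biallelic(p2, p3)])
    return d[2] - (d[0] + d[1])

if __name__ == "__main__":
    for triple in [(0.1, 0.5, 0.9), (0.4, 0.5, 0.6), (0.2, 0.3, 0.9)]:
        print(triple, triangle_deviation(*triple))
```

Each example triple has three distinct frequencies and yields a positive deviation, consistent with the result stated in the abstract.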

[181] NA Rosenberg (2020) Fifty years of Theoretical Population Biology. Theoretical Population Biology 133: 1-12. [PDF] [Supplement]

The year 2020 marks the 50th anniversary of Theoretical Population Biology. This special issue examines the past and continuing contributions of the journal. We identify some of the most important developments that have taken place in the pages of TPB, connecting them to current research and to the numerous forms of significance achieved by theory in population biology.

[180] NM Kopelman, L Stone, DG Hernandez, D Gefel, AB Singleton, E Heyer, MW Feldman, J Hillel, NA Rosenberg (2020) High-resolution inference of genetic relationships among Jewish populations. European Journal of Human Genetics 28: 804-814.

Recent studies have used genome-wide single-nucleotide polymorphisms (SNPs) to investigate relationships among various Jewish populations and their non-Jewish historical neighbors, often focusing on small subsets of populations from a limited geographic range or relatively small samples within populations. Here, building on the significant progress that has emerged from genomic SNP studies in the placement of Jewish populations in relation to non-Jewish populations, we focus on population structure among Jewish populations. In particular, we examine Jewish population-genetic structure in samples that span much of the historical range of Jewish populations in Europe, the Middle East, North Africa, and South Asia. Combining 429 newly genotyped samples from 29 Jewish and 3 non-Jewish populations with previously reported genotypes on Jewish and non-Jewish populations, we investigate variation in 2789 individuals from 114 populations at 486,592 genome-wide autosomal SNPs. Using multidimensional scaling analysis, unsupervised model-based clustering, and population trees, we find that, genetically, most Jewish samples fall into four major clusters that largely represent four culturally defined groupings, namely the Ashkenazi, Mizrahi, North African, and Sephardi subdivisions of the Jewish population. We detect high-resolution population structure, including separation of the Ashkenazi and Sephardi groups and distinctions among populations within the Mizrahi and North African groups. Our results refine knowledge of Jewish population-genetic structure and contribute to a growing understanding of the distinctive genetic ancestry evident in closely related but historically separate Jewish communities.

[179] A Kim, NA Rosenberg, JH Degnan (2020) Probabilities of unranked and ranked anomaly zones under birth-death models. Molecular Biology and Evolution 37: 1480-1494.

A labeled gene tree topology that is more probable than the labeled gene tree topology matching a species tree is called "anomalous." Species trees that can generate such anomalous gene trees are said to be in the "anomaly zone." Here, probabilities of "unranked" and "ranked" gene tree topologies under the multispecies coalescent are considered. A ranked tree depicts not only the topological relationship among gene lineages, as an unranked tree does, but also the sequence in which the lineages coalesce. In this article, we study how the parameters of a species tree simulated under a constant-rate birth–death process can affect the probability that the species tree lies in the anomaly zone. We find that with more than five taxa, it is possible for species trees to have both anomalous unranked and ranked gene trees. The probability of being in either type of anomaly zone increases with more taxa. The probability of anomalous gene trees also increases with higher speciation rates. We observe that the probabilities of unranked anomaly zones are higher and grow much faster than those of ranked anomaly zones as the speciation rate increases. Our simulation shows that the most probable ranked gene tree is likely to have the same unranked topology as the species tree. We design the software PRANC, which computes probabilities of ranked gene tree topologies given a species tree under the coalescent model.

[178] NA Rosenberg, DM Zulman (2020) Measures of care fragmentation: mathematical insights from population genetics. Health Services Research 55: 318-327.

OBJECTIVE: To identify novel properties of health care fragmentation measures, drawing on insights from mathematically equivalent measures of genetic diversity. STUDY DESIGN: We describe mathematical relationships between two measures: (a) Breslau's Usual Provider of Care (UPC), the proportion of care with the most frequently visited provider, analogous to the "frequency of the most frequent allele" at a genetic locus; and (b) Bice-Boxerman's Continuity of Care Index (COCI), a measure of care dispersion across multiple providers, analogous to "Nei's estimator of homozygosity" in genetics. PRINCIPAL FINDINGS: Just as the frequency of the most frequent allele places a tight constraint on homozygosity, the proportion of care with the most frequently visited provider (UPC) places lower and upper bounds on dispersion of care (COCI), and vice versa. This property presents the possibility of a normalized COCI given UPC (NCGU) measure, which reflects a bounded range of care dispersion dependent on the number of visits with the most frequently visited provider. Mathematical aspects of UPC and COCI also suggest thresholds for the minimal number of patient visits to use when studying fragmentation. CONCLUSIONS: Applying knowledge from population genetics elucidated relationships between care fragmentation measures and produced novel insights for care fragmentation studies.
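
As a rough illustration of the two measures named in this abstract, the following Python sketch computes UPC and COCI from one patient's visit counts per provider. The function names and the example visit counts are hypothetical. The COCI expression used, (sum of n_i^2 - N) / (N(N - 1)), is the standard Bice-Boxerman form and is assumed here rather than taken from the abstract.

```python
from typing import Sequence


def upc(visits: Sequence[int]) -> float:
    """Usual Provider of Care: share of visits made to the most-visited provider."""
    total = sum(visits)
    if total == 0:
        raise ValueError("at least one visit is required")
    return max(visits) / total


def coci(visits: Sequence[int]) -> float:
    """Bice-Boxerman Continuity of Care Index: (sum n_i^2 - N) / (N (N - 1))."""
    total = sum(visits)
    if total < 2:
        raise ValueError("COCI is undefined for fewer than two visits")
    return (sum(n * n for n in visits) - total) / (total * (total - 1))


# Hypothetical patient: 3 visits to one provider, 1 visit to each of two others.
example_visits = [3, 1, 1]
print(upc(example_visits), coci(example_visits))
```

In the genetic analogy, the visit counts play the role of allele counts at a locus, so the same two functions return the frequency of the most frequent allele and an unbiased-style homozygosity estimate.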

[177] NA Rosenberg (2020) The 2020 Feldman Prize. Theoretical Population Biology 131: 1. [PDF]

(No abstract)

[176] ZM Himwich, NA Rosenberg (2020) Roadblocked monotonic paths and the enumeration of coalescent histories for non-matching caterpillar gene trees and species trees. Advances in Applied Mathematics 113: 101939.

Given a gene tree topology and a species tree topology, a coalescent history represents a possible mapping of the list of gene tree coalescences to associated branches of a species tree on which those coalescences take place. Enumerative properties of coalescent histories have been of interest in the analysis of relationships between gene trees and species trees. The simplest enumerative result identifies a bijection between coalescent histories for a matching caterpillar gene tree and species tree with monotonic paths that do not cross the diagonal of a square lattice, establishing that the associated number of coalescent histories for n-taxon matching caterpillar trees (n ≥ 2) is the Catalan number C_{n-1} = (1/n) \binom{2n-2}{n-1}. Here, we show that a similar bijection applies for non-matching caterpillars, connecting coalescent histories for a non-matching caterpillar gene tree and species tree to a class of roadblocked monotonic paths. The result provides a simplified algorithm for enumerating coalescent histories in the non-matching caterpillar case. It enables a rapid proof of a known result that given a caterpillar species tree, no non-matching caterpillar gene tree has a number of coalescent histories exceeding that of the matching gene tree. Additional results on coalescent histories can be obtained by a bijection between permissible roadblocked monotonic paths and Dyck paths. We study the number of coalescent histories for non-matching caterpillar gene trees that differ from the species tree by nearest-neighbor-interchange and subtree-prune-and-regraft moves, characterizing the non-matching caterpillar with the largest number of coalescent histories. We discuss the implications of the results for the study of the combinatorics of gene trees and species trees.
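
The matching-caterpillar count quoted in this abstract can be checked numerically. The Python sketch below counts monotonic lattice paths that stay on or below the diagonal by dynamic programming and compares the count with the Catalan number. The function names are hypothetical, and the sketch covers only the classical matching case, not the roadblocked paths that the paper introduces.

```python
from math import comb


def catalan(m: int) -> int:
    """Catalan number C_m = binom(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)


def count_paths_below_diagonal(m: int) -> int:
    """Count right/up paths from (0, 0) to (m, m) that satisfy y <= x throughout."""
    # paths[x][y] holds the number of admissible paths from (0, 0) to (x, y);
    # cells above the diagonal are never filled and stay at 0.
    paths = [[0] * (m + 1) for _ in range(m + 1)]
    paths[0][0] = 1
    for x in range(m + 1):
        for y in range(x + 1):  # visit only points on or below the diagonal
            if x == 0 and y == 0:
                continue
            from_left = paths[x - 1][y] if x > 0 else 0
            from_below = paths[x][y - 1] if y > 0 else 0
            paths[x][y] = from_left + from_below
    return paths[m][m]


# For n taxa, the relevant lattice has side n - 1.
for n in range(2, 9):
    assert count_paths_below_diagonal(n - 1) == catalan(n - 1)
    print(n, catalan(n - 1))
```

With m = n - 1, the expression binom(2m, m) / (m + 1) is the same quantity as (1/n) binom(2n-2, n-1) in the abstract.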

[175] JTL Kang, NA Rosenberg (2019) Mathematical properties of linkage disequilibrium statistics defined by normalization of the coefficient D=p_AB-p_Ap_B. Human Heredity 84: 127-143. [PDF]

Background: Many statistics for measuring linkage disequilibrium (LD) take the form of a normalization of the LD coefficient D. Different normalizations produce statistics with different ranges, interpretations, and arguments favoring their use. Methods: Here, to compare the mathematical properties of these normalizations, we consider 5 of these normalized statistics, describing their upper bounds, the mean values of their maxima over the set of possible allele frequency pairs, and the size of the allele frequency regions accessible given specified values of the statistics. Results: We produce detailed characterizations of these properties for the statistics d and ρ, analogous to computations previously performed for r². We examine the relationships among the statistics, uncovering conditions under which some of them have close connections. Conclusion: The results contribute insight into LD measurement, particularly the understanding of differences in the features of different LD measures when computed on the same data.
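
To make the idea of normalizing D concrete, the following Python sketch computes D = p_AB - p_A p_B together with two widely used normalizations, Lewontin's D' and r². The function names and the example frequencies are hypothetical, and the sketch does not attempt the statistics d and ρ, whose definitions are not given in the abstract.

```python
def ld_coefficient(p_ab: float, p_a: float, p_b: float) -> float:
    """LD coefficient D = p_AB - p_A * p_B."""
    return p_ab - p_a * p_b


def _check_polymorphic(p_a: float, p_b: float) -> None:
    # Both normalizations divide by terms that vanish at fixed loci.
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic (0 < p < 1)")


def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))."""
    _check_polymorphic(p_a, p_b)
    d = ld_coefficient(p_ab, p_a, p_b)
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Signed D' = D / D_max, where D_max depends on the sign of D."""
    _check_polymorphic(p_a, p_b)
    d = ld_coefficient(p_ab, p_a, p_b)
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    return d / d_max


# Hypothetical haplotype and allele frequencies.
print(ld_coefficient(0.3, 0.6, 0.4), d_prime(0.3, 0.6, 0.4), r_squared(0.3, 0.6, 0.4))
```

The two statistics share the numerator D and differ only in the denominator, which is why they can take different values at the same allele frequencies.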

[174] G Greenbaum, A Rubin, AR Templeton, NA Rosenberg (2019) Network-based hierarchical population structure analysis for large genomic datasets. Genome Research 29: 2020-2033. [PDF] [Supplement]

Analysis of population structure in natural populations using genetic data is a common practice in ecological and evolutionary studies. With large genomic data sets of populations now appearing more frequently across the taxonomic spectrum, it is becoming increasingly possible to reveal many hierarchical levels of structure, including fine-scale genetic clusters. To analyze these data sets, methods need to be appropriately suited to the challenges of extracting multilevel structure from whole-genome data. Here, we present a network-based approach for constructing population structure representations from genetic data. The use of community-detection algorithms from network theory generates a natural hierarchical perspective on the representation that the method produces. The method is computationally efficient, and it requires relatively few assumptions regarding the biological processes that underlie the data. We show the approach by analyzing population structure in the model plant species Arabidopsis thaliana and in human populations. These examples illustrate how network-based approaches for population structure analysis are well-suited to extracting valuable ecological and evolutionary information in the era of large genomic data sets.

[173] G Greenbaum, WM Getz, NA Rosenberg, MW Feldman, E Hovers, O Kolodny (2019) Disease transmission and introgression can explain the long-lasting contact zone of modern humans and Neanderthals. Nature Communications 10: 5003. [PDF] [Supplement]

Neanderthals and modern humans both occupied the Levant for tens of thousands of years prior to the spread of modern humans into the rest of Eurasia and their replacement of the Neanderthals. That the inter-species boundary remained geographically localized for so long is a puzzle, particularly in light of the rapidity of its subsequent movement. Here, we propose that infectious-disease dynamics can explain the localization and persistence of the inter-species boundary. We further propose, and support with dynamical-systems models, that introgression-based transmission of alleles related to the immune system would have gradually diminished this barrier to pervasive inter-species interaction, leading to the eventual release of the inter-species boundary from its geographic localization. Asymmetries between the species in the characteristics of their associated ‘pathogen packages’ could have generated feedback that allowed modern humans to overcome disease burden earlier than Neanderthals, giving them an advantage in their subsequent spread into Eurasia.

[172] RS Mehta, NA Rosenberg (2019) The probability of reciprocal monophyly of gene lineages in three and four species. Theoretical Population Biology 129: 133-147. [PDF]

Reciprocal monophyly, a feature of a genealogy in which multiple groups of descendant lineages each consist of all of the descendants of their respective most recent common ancestors, has been an important concept in studies of species delimitation, phylogeography, population history reconstruction, systematics, and conservation. Computations involving the probability that reciprocal monophyly is observed in a genealogy have played a key role in criteria for defining taxonomic groups and inferring divergence times. The probability of reciprocal monophyly under a coalescent model of population divergence has been studied in detail for groups of gene lineages for pairs of species. Here, we extend this computation to generate corresponding probabilities for sets of gene lineages from three and four species. We study the effects of model parameters on the probability of reciprocal monophyly, finding that it is driven primarily by species tree height, with lesser but still substantial influences of internal branch lengths and sample sizes. We also provide an example application of our results to data from maize and teosinte.

[171] L Altenberg, N Creanza, L Fogarty, L Hadany, O Kolodny, KN Laland, L Lehmann, SP Otto, NA Rosenberg, J Van Cleve, J Wakeley (2019) Some topics in theoretical population genetics: editorial commentaries on a selection of Marc Feldman's TPB papers. Theoretical Population Biology 129: 4-8. [PDF]

This article consists of commentaries on a selected group of papers of Marc Feldman published in Theoretical Population Biology from 1970 to the present. The papers describe a diverse set of population-genetic models, covering topics such as cultural evolution, social evolution, and the evolution of recombination. The commentaries highlight Marc Feldman's role in providing mathematically rigorous formulations to explore qualitative hypotheses, in many cases generating surprising conclusions.

[170] N Alcala, A Goldberg, U Ramakrishnan, NA Rosenberg (2019) Coalescent theory of migration network motifs. Molecular Biology and Evolution 36: 2358-2374. [PDF] [Supplement]

Natural populations display a variety of spatial arrangements, each potentially with a distinctive impact on genetic diversity and genetic differentiation among subpopulations. Although the spatial arrangement of populations can lead to intricate migration networks, theoretical developments have focused mainly on a small subset of such networks, emphasizing the island-migration and stepping-stone models. In this study, we investigate all small network motifs: the set of all possible migration networks among populations subdivided into at most four subpopulations. For each motif, we use coalescent theory to derive expectations for three quantities that describe genetic variation: nucleotide diversity, F_ST, and half-time to equilibrium diversity. We describe the impact of network properties on these quantities, finding that motifs with a high mean node degree have the largest nucleotide diversity and the longest time to equilibrium, whereas motifs with low density have the largest F_ST. In addition, we show that the motifs whose pattern of variation is most strongly influenced by loss of a connection or a subpopulation are those that can be split easily into disconnected components. We illustrate our results using two example data sets — sky island birds of genus Sholicola and Indian tigers — identifying disturbance scenarios that produce the greatest reduction in genetic diversity; for tigers, we also compare the benefits of two assisted gene flow scenarios. Our results have consequences for understanding the effect of geography on genetic diversity, and they can assist in designing strategies to alter population migration networks toward maximizing genetic variation in the context of conservation of endangered species.

[169] RS Mehta, AF Feder, SM Boca, NA Rosenberg (2019) The relationship between haplotype-based F_ST and haplotype length. Genetics 213: 281-295. [PDF] [Supplementary Figure 1] [Supplementary Figure 2] [Supplementary Figure 3] [Supplementary Figure 4]

The population-genetic statistic F_ST is widely used to describe allele frequency distributions in subdivided populations. The increasing availability of DNA sequence data has recently enabled computations of F_ST from sequence-based "haplotype loci." At the same time, theoretical work has revealed that F_ST has a strong dependence on the underlying genetic diversity of a locus from which it is computed, with high diversity constraining values of F_ST to be low. In the case of haplotype loci, for which two haplotypes that are distinct over a specified length along a chromosome are treated as distinct alleles, genetic diversity is influenced by haplotype length: longer haplotype loci have the potential for greater genetic diversity. Here, we study the dependence of F_ST on haplotype length. Using a model in which a haplotype locus is sequentially incremented by one biallelic locus at a time, we show that increasing the length of the haplotype locus can either increase or decrease the value of F_ST, and usually decreases it. We compute F_ST on haplotype loci in human populations, finding a close correspondence between the observed values and our theoretical predictions. We conclude that effects of haplotype length are valuable to consider when interpreting F_ST calculated on haplotypic data.

[168] AL Severson*, LH Uricchio*, IM Arbisser*, EC Glassberg, NA Rosenberg (2019) Analysis of author gender in TPB, 1991-2018. Theoretical Population Biology 127: 1-6. [PDF] [Supplement]

(No abstract)

[167] N Alcala, AE Launer, MF Westphal, R Seymour, EM Cole*, NA Rosenberg* (2019) Use of stochastic patch occupancy models in the California red-legged frog for Bayesian inference regarding past events and future persistence. Conservation Biology 33: 685-696. [PDF] [Supplement] [Data and Software]

Assessing causes of population decline is critically important to management of threatened species. Stochastic patch occupancy models (SPOMs) are popular tools for examining spatial and temporal dynamics of populations when presence-absence data in multiple habitat patches are available. We developed a Bayesian Markov chain method that extends existing SPOMs by focusing on past environmental changes that may have altered occupancy patterns prior to the beginning of data collection. Using occupancy data on 3 creeks, we applied the method to assess 2 hypothesized causes of population decline — in situ die-off and residual impact of past source population loss — in the California red-legged frog. Despite having no data for the 20-30 years between the hypothetical event leading to population decline and the first data collected, we were able to discriminate among hypotheses, finding evidence that in situ die-off increased in 2 of the creeks. Although the creeks had comparable numbers of occupied segments, owing to different extinction-colonization dynamics, our model predicted an 8-fold difference in persistence probabilities of their populations to 2030. Adding a source population led to a greater predicted persistence probability than did decreasing the in situ die-off, emphasizing that reversing the deleterious impacts of a disturbance may not be the most efficient management strategy. We expect our method will be useful for studying dynamics and evaluating management strategies of many species.

[166] AL Severson, S Carmi, NA Rosenberg (2019) The effect of consanguinity on between-individual identity-by-descent sharing. Genetics 212: 305-316. [PDF]

Consanguineous unions increase the rate at which identical genomic segments are paired within individuals to produce runs of homozygosity (ROH). The extent to which such unions affect identity-by-descent (IBD) genomic sharing between rather than within individuals in a population, however, is not immediately evident from within-individual ROH levels. Using the fact that the time to the most recent common ancestor (T_MRCA) for a pair of genomes at a specific locus is inversely related to the extent of IBD sharing between the genomes in the neighborhood of the locus, we study IBD sharing for a pair of genomes sampled either within the same individual or in different individuals. We develop a coalescent model for a set of mating pairs in a diploid population, treating the fraction of consanguineous unions as a parameter. Considering mating models that include unions between sibs, first cousins, and nth cousins, we determine the effect of the consanguinity rate on the mean T_MRCA for pairs of lineages sampled either within the same individual or in different individuals. The results indicate that consanguinity not only increases ROH sharing between the two genomes within an individual, it also increases IBD sharing between individuals in the population, the magnitude of the effect increasing with the kinship coefficient of the type of consanguineous union. Considering computations of ROH and between-individual IBD in Jewish populations whose consanguinity rates have been estimated from demographic data, we find that, in accord with the theoretical results, increases in consanguinity and ROH levels inflate levels of IBD sharing between individuals in a population. The results contribute more generally to the interpretation of runs of homozygosity, IBD sharing between individuals, and the relationship between ROH and IBD.

[165] N Alcala, NA Rosenberg (2019) G_ST', Jost's D, and F_ST are similarly constrained by allele frequencies: a mathematical, simulation, and empirical study. Molecular Ecology 28: 1624-1636. [PDF] [Supplement]

Statistics G_ST' and Jost's D have been proposed for replacing F_ST as measures of genetic differentiation. A principal argument in favour of these statistics is the independence of their maximal values with respect to the subpopulation heterozygosity H_S, a property not shared by F_ST. Nevertheless, it has been unclear if these alternative differentiation measures are constrained by other aspects of the allele frequencies. Here, for biallelic markers, we study the mathematical properties of the maximal values of G_ST' and Jost's D, comparing them to those of F_ST. We show that G_ST' and D exhibit the same peculiar frequency-dependence phenomena as F_ST, including a maximal value as a function of the frequency of the most frequent allele that lies well below one. Although the functions describing G_ST', D, and F_ST in terms of the frequency of the most frequent allele are different, the allele frequencies that maximize them are identical. Moreover, we show using coalescent simulations that when taking into account the specific maximal values of the three statistics, their behaviours become similar across a large range of migration rates. We use our results to explain two empirical patterns: the similar values of the three statistics among North American wolves, and the low D values compared to G_ST' and F_ST in Atlantic salmon. The results suggest that the three statistics are often predictably similar, so that they can make quite similar contributions to data analysis. When they are not similar, the difference can be understood in relation to features of genetic diversity.
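
The three statistics compared in this abstract can be computed side by side for a single biallelic marker. The Python sketch below assumes K equally weighted subpopulations and uses parametric definitions with no sampling correction: F_ST = (H_T - H_S) / H_T, Hedrick's G_ST' = F_ST (K - 1 + H_S) / ((K - 1)(1 - H_S)), and Jost's D = (K / (K - 1)) (H_T - H_S) / (1 - H_S). These definitions, the function name, and the example frequencies are assumptions for illustration and may differ in detail from the estimators used in the paper.

```python
from typing import Dict, Sequence


def differentiation_stats(allele_freqs: Sequence[float]) -> Dict[str, float]:
    """F_ST, G_ST' and Jost's D for one biallelic marker.

    allele_freqs[k] is the frequency of one of the two alleles in subpopulation k;
    subpopulations are weighted equally.
    """
    k = len(allele_freqs)
    if k < 2:
        raise ValueError("at least two subpopulations are required")
    # Mean within-subpopulation heterozygosity.
    h_s = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / k
    # Heterozygosity of the pooled population.
    p_bar = sum(allele_freqs) / k
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        raise ValueError("the marker is monomorphic in the total population")
    f_st = (h_t - h_s) / h_t
    g_st_prime = f_st * (k - 1 + h_s) / ((k - 1) * (1.0 - h_s))
    jost_d = (k / (k - 1)) * (h_t - h_s) / (1.0 - h_s)
    return {"F_ST": f_st, "G_ST'": g_st_prime, "D": jost_d}


# Hypothetical allele frequencies in two subpopulations.
print(differentiation_stats([0.8, 0.2]))
```

When the two subpopulations are fixed for different alleles, all three statistics equal 1; at intermediate frequencies they separate, which is the regime the abstract's comparison addresses.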

[164] F Disanto, NA Rosenberg (2019) Enumeration of compact coalescent histories for matching gene trees and species trees. Journal of Mathematical Biology 78: 155-188. [PDF]

Compact coalescent histories are combinatorial structures that describe for a given gene tree G and species tree S possibilities for the numbers of coalescences of G that take place on the various branches of S. They have been introduced as a data structure for evaluating probabilities of gene tree topologies conditioning on species trees, reducing computation time compared to standard coalescent histories. When gene trees and species trees have a matching labeled topology G=S=t, the compact coalescent histories of t are encoded by particular integer labelings of the branches of t, each integer specifying the number of coalescent events of G present in a branch of S. For matching gene trees and species trees, we investigate enumerative properties of compact coalescent histories. We report a recursion for the number of compact coalescent histories for matching gene trees and species trees, using it to study the numbers of compact coalescent histories for small trees. We show that the number of compact coalescent histories equals the number of coalescent histories if and only if the labeled topology is a caterpillar or a bicaterpillar. The number of compact coalescent histories is seen to increase with tree imbalance: we prove that as the number of taxa n increases, the exponential growth of the number of compact coalescent histories follows 4ⁿ in the case of caterpillar or bicaterpillar labeled topologies and approximately 3.3302ⁿ and 2.8565ⁿ for lodgepole and balanced topologies, respectively. We prove that the mean number of compact coalescent histories of a labeled topology of size n selected uniformly at random grows with 3.3750ⁿ. Our results contribute to the analysis of the computational complexity of algorithms for computing gene tree probabilities, and to the combinatorial study of gene trees and species trees more generally.

[163] NA Rosenberg, MD Edge, JK Pritchard, MW Feldman (2019) Interpreting polygenic scores, polygenic adaptation, and human phenotypic differences. Evolution, Medicine, and Public Health 2019: 26-34. [PDF]

Recent analyses of polygenic scores have opened new discussions concerning the genetic basis and evolutionary significance of differences among populations in distributions of phenotypes. Here, we highlight limitations in research on polygenic scores, polygenic adaptation and population differences. We show how genetic contributions to traits, as estimated by polygenic scores, combine with environmental contributions so that differences among populations in trait distributions need not reflect corresponding differences in genetic propensity. Under a null model in which phenotypes are selectively neutral, genetic propensity differences contributing to phenotypic differences among populations are predicted to be small. We illustrate this null hypothesis in relation to health disparities between African Americans and European Americans, discussing alternative hypotheses with selective and environmental effects. Close attention to the limitations of research on polygenic phenomena is important for the interpretation of their relationship to human population differences.

[162] J Kim, F Disanto, NM Kopelman, NA Rosenberg (2019) Mathematical and simulation-based analysis of the behavior of admixed taxa in the neighbor-joining algorithm. Bulletin of Mathematical Biology 81: 452-493. [PDF]

The neighbor-joining algorithm for phylogenetic inference (NJ) has been seen to have three specific properties when applied to distance matrices that contain an admixed taxon: (1) antecedence of clustering, in which the admixed taxon agglomerates with one of its source taxa before the two source taxa agglomerate with each other; (2) intermediacy of distances, in which the distance on an inferred NJ tree between an admixed taxon and either of its source taxa is smaller than the distance between the two source taxa; and (3) intermediacy of path lengths, in which the number of edges separating the admixed taxon and either of its source taxa is less than or equal to the number of edges between the source taxa. We examine the behavior of neighbor-joining on distance matrices containing an admixed group, investigating the occurrence of antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. We first mathematically predict the frequency with which the properties are satisfied for a labeled unrooted binary tree selected uniformly at random in the absence of admixture. We then introduce a taxon constructed by a linear admixture of distances from two source taxa, examining three admixture scenarios by simulation: a model in which distance matrices are chosen at random, a model in which an admixed taxon is added to a set of taxa that reflect treelike evolution, and a model that introduces a perturbation of the treelike scenario. In contrast to previous conjectures, we observe that the three properties are sometimes violated by distance matrices that include an admixed taxon. However, we also find that they are satisfied more often than is expected by chance when the distance matrix contains an admixed taxon, especially when evolution among the non-admixed taxa is treelike. The results contribute to a deeper understanding of the nature of evolutionary trees constructed from data that do not necessarily reflect a treelike evolutionary process.

[161] F Disanto, NA Rosenberg (2019) On the number of non-equivalent ancestral configurations for matching gene trees and species trees. Bulletin of Mathematical Biology 81: 384-407. [PDF]

An ancestral configuration is one of the combinatorially distinct sets of gene lineages that, for a given gene tree, can reach a given node of a specified species tree. Ancestral configurations have appeared in recursive algebraic computations of the conditional probability that a gene tree topology is produced under the multispecies coalescent model for a given species tree. For matching gene trees and species trees, we study the number of ancestral configurations, considered up to an equivalence relation introduced by Wu (Evolution 66:762-775, 2012) to reduce the complexity of the recursive probability computation. We examine the largest number of non-equivalent ancestral configurations possible for a given tree size n. Whereas the smallest number of non-equivalent ancestral configurations increases polynomially with n, we show that the largest number increases with kⁿ, where k is a constant that satisfies 3^(1/3) ≤ k < 1.503. Under a uniform distribution on the set of binary labeled trees with a given size n, the mean number of non-equivalent ancestral configurations grows exponentially with n. The results refine an earlier analysis of the number of ancestral configurations considered without applying the equivalence relation, showing that use of the equivalence relation does not alter the exponential of the increase with tree size.

[160] NA Rosenberg (2019) Enumeration of lonely pairs of gene trees and species trees by means of antipodal cherries. Advances in Applied Mathematics 102: 1-17. [PDF]

In mathematical phylogenetics, given a rooted binary leaf-labeled gene tree topology G and a rooted binary leaf-labeled species tree topology S with the same leaf labels, a coalescent history represents a possible mapping of the list of gene tree coalescences to the associated branches of the species tree on which those coalescences take place. For certain families of ordered pairs (G,S), the number of coalescent histories increases exponentially or even faster than exponentially with the number of leaves n. Other pairs have only a single coalescent history. We term a pair (G,S) lonely if it has only one coalescent history. Here, we characterize the set of all lonely pairs (G,S). Further, we characterize the set of pairs of rooted binary unlabeled tree shapes at least one of the labelings of which is lonely. We provide formulas for counting lonely pairs and pairs of unlabeled tree shapes with at least one lonely labeling. The lonely pairs provide a set of examples of pairs for which the number of compact coalescent histories — which condense coalescent histories into a set of equivalence classes — is equal to the number of coalescent histories. Application of the condition that characterizes lonely pairs can also be used to reduce computation time for the enumeration of coalescent histories.

[159] J Kim, MD Edge, BFB Algee-Hewitt, JZ Li, NA Rosenberg (2018) Statistical detection of relatives typed with disjoint forensic and biomedical loci. Cell 175: 848-858. [PDF] [Supplement] [Code]

In familial searching in forensic genetics, a query DNA profile is tested against a database to determine whether it represents a relative of a database entrant. We examine the potential for using linkage disequilibrium to identify pairs of profiles as belonging to relatives when the query and database rely on nonoverlapping genetic markers. Considering data on individuals genotyped with both microsatellites used in forensic applications and genome-wide SNPs, we find that ~30-32% of parent-offspring pairs and ~35-36% of sib pairs can be identified from the SNPs of one member of the pair and the microsatellites of the other. The method suggests the possibility of performing familial searches of microsatellite databases using query SNP profiles, or vice versa. It also reveals that privacy concerns arising from computations across multiple databases that share no genetic markers in common entail risks, not only for database entrants, but for their close relatives as well.

[158] AJ Aw, NA Rosenberg (2018) Bounding measures of genetic similarity and diversity using majorization. Journal of Mathematical Biology 77: 711-737. [PDF]

The homozygosity and the frequency of the most frequent allele at a polymorphic genetic locus have a close mathematical relationship, so that each quantity places a tight constraint on the other. We use the theory of majorization to provide a simplified derivation of the bounds on homozygosity J in terms of the frequency M of the most frequent allele. The method not only enables simpler derivations of known bounds on J in terms of M, it also produces analogous bounds on entropy statistics for genetic diversity and on homozygosity-like statistics that range in their emphasis on the most frequent allele in relation to other alleles. We illustrate the constraints on the statistics using data from human populations. The approach suggests the potential of the majorization method as a tool for deriving inequalities that characterize mathematical relationships between statistics in population genetics.
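
The constraint between J and M described in this abstract can be illustrated with a short Python sketch. It computes homozygosity J as the sum of squared allele frequencies, takes M as the largest frequency, and checks J against the elementary lower bound M² and an upper bound obtained by placing as many alleles as possible at frequency M and the remaining mass in one further allele. That construction is assumed here to give the tight upper bound when the number of alleles is unrestricted; the function names and example frequencies are hypothetical.

```python
from typing import Sequence


def homozygosity(freqs: Sequence[float]) -> float:
    """J = sum of squared allele frequencies."""
    return sum(p * p for p in freqs)


def upper_bound_j(m: float) -> float:
    """Largest J compatible with most-frequent-allele frequency m (assumed construction)."""
    if not 0.0 < m <= 1.0:
        raise ValueError("m must lie in (0, 1]")
    # Number of alleles that can sit at frequency m; the small offset guards
    # against floating-point error when 1/m is an integer.
    k = int(1.0 / m + 1e-9)
    remainder = max(0.0, 1.0 - k * m)
    return k * m * m + remainder * remainder


# Hypothetical allele frequencies at one locus.
freqs = [0.4, 0.4, 0.2]
j, m = homozygosity(freqs), max(freqs)
assert m * m <= j <= upper_bound_j(m) + 1e-12
print(j, m, upper_bound_j(m))
```

The example frequencies sit exactly on the upper bound, since two alleles are at frequency M and the remainder is carried by a single allele.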

[157] A Goldberg, LH Uricchio, NA Rosenberg (2018) Natural selection in human populations. Oxford Bibliographies in Evolutionary Biology doi:10.1093/OBO/9780199941728-0112. [PDF]

(No abstract)

[156] NA Rosenberg (2018) Variance-partitioning and classification in human population genetics. In RG Winther, ed. Phylogenetic Inference, Selection Theory, and History of Science: Selected Papers of A. W. F. Edwards with Commentaries, pp. 399-403. Cambridge: Cambridge University Press. [PDF]

(No abstract)

[155] TJ Pemberton*, P Verdu*, NS Becker, CJ Willer, BS Hewlett, S Le Bomin, A Froment, NA Rosenberg, E Heyer (2018) A genome scan for genes underlying adult body size differences between Central African hunter-gatherers and farmers. Human Genetics 137: 487-509. [PDF] [Supplement]

The evolutionary and biological bases of the Central African "pygmy" phenotype, a characteristic of rainforest hunter-gatherers defined by reduced body size compared with neighboring farmers, remain largely unknown. Here, we perform a joint investigation in Central African hunter-gatherers and farmers of adult standing height, sitting height, leg length, and body mass index (BMI), considering 358 hunter-gatherers and 169 farmers with genotypes for 153,798 SNPs. In addition to reduced standing heights, hunter-gatherers have shorter sitting heights and leg lengths and higher sitting/standing height ratios than farmers and lower BMI for males. Standing height, sitting height, and leg length are strongly correlated with inferred levels of farmer genetic ancestry, whereas BMI is only weakly correlated, perhaps reflecting greater contributions of non-genetic factors to body weight than to height. Single- and multi-marker association tests identify one region and eight genes associated with hunter-gatherer/farmer status, and 24 genes associated with the height-related traits. Many of these genes have putative functions consistent with roles in determining their associated traits and the pygmy phenotype, and they include three associated with standing height in non-Africans (PRKG1, DSCAM, MAGI2). We find evidence that European height-associated SNPs or variants in linkage disequilibrium with them contribute to standing- and sitting-height determination in Central Africans, but not to the differential status of hunter-gatherers and farmers. These findings provide new insights into the biological basis of the pygmy phenotype, and they highlight the potential of cross-population studies for exploring the genetic basis of phenotypes that vary naturally across populations.

[154] IM Arbisser, EM Jewett, NA Rosenberg (2018) On the joint distribution of tree height and tree length under the coalescent. Theoretical Population Biology 122: 46-56. [PDF]

Many statistics that examine genetic variation depend on the underlying shapes of genealogical trees. Under the coalescent model, we investigate the joint distribution of two quantities that describe genealogical tree shape: tree height and tree length. We derive a recursive formula for their exact joint distribution under a demographic model of a constant-sized population. We obtain approximations for the mean and variance of the ratio of tree height to tree length, using them to show that this ratio converges in probability to 0 as the sample size increases. We find that as the sample size increases, the correlation coefficient for tree height and length approaches (π²-6)/[π √(2π²-18)]. Using simulations, we examine the joint distribution of height and length under demographic models with population growth and population subdivision. We interpret the joint distribution in relation to problems of interest in data analysis, including inference of the time to the most recent common ancestor. The results assist in understanding the influences of demographic histories on two fundamental features of tree shape.
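
The limiting correlation quoted above is a closed-form constant, so it can be evaluated directly. The short Python sketch below does only that; the function name is hypothetical, and nothing here reproduces the recursive formula for the joint distribution.

```python
import math

def limiting_height_length_correlation():
    """Evaluate (pi^2 - 6) / [pi * sqrt(2*pi^2 - 18)], the large-sample limit
    of the correlation between tree height and tree length given above."""
    pi_sq = math.pi ** 2
    return (pi_sq - 6) / (math.pi * math.sqrt(2 * pi_sq - 18))

rho = limiting_height_length_correlation()
print(f"{rho:.4f}")
```

The value is roughly 0.93, so under the constant-size model height and length remain strongly but not perfectly correlated as the sample size grows.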

[153] NA Rosenberg (2018) The 2018 Marcus W. Feldman Prize in Theoretical Population Biology. Theoretical Population Biology 119: 1-2. [PDF]

(No abstract)

[152] F Disanto, NA Rosenberg (2017) Enumeration of ancestral configurations for matching gene trees and species trees. Journal of Computational Biology 24: 831-850. [PDF]

Given a gene tree and a species tree, ancestral configurations represent the combinatorially distinct sets of gene lineages that can reach a given node of the species tree. They have been introduced as a data structure for use in the recursive computation of the conditional probability under the multispecies coalescent model of a gene tree topology given a species tree, the cost of this computation being affected by the number of ancestral configurations of the gene tree in the species tree. For matching gene trees and species trees, we obtain enumerative results on ancestral configurations. We study ancestral configurations in balanced and unbalanced families of trees determined by a given seed tree, showing that for seed trees with more than one taxon, the number of ancestral configurations increases for both families exponentially in the number of taxa n. For fixed n, the maximal number of ancestral configurations tabulated at the species tree root node and the largest number of labeled histories possible for a labeled topology occur for trees with precisely the same unlabeled shape. For ancestral configurations at the root, the maximum increases with k₀ⁿ, where k₀ ≈ 1.5028 is a quadratic recurrence constant. Under a uniform distribution over the set of labeled trees of given size, the mean number of root ancestral configurations grows with √(3/2) (4/3)ⁿ and the variance with ∼ 1.4048(1.8215)ⁿ. The results provide a contribution to the combinatorial study of gene trees and species trees.
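
To give a feel for the growth rates quoted above, the Python sketch below evaluates the stated asymptotic expression for the mean and numerically estimates a quadratic recurrence constant. The abstract does not say which recurrence defines k₀. The sketch assumes it is the constant associated with x₀ = 1, x₍ₕ₊₁₎ = xₕ² + 1, whose limit of xₕ^(1/2ʰ) is close to 1.5028; treat that identification, and the function names, as assumptions.

```python
import math

def quadratic_recurrence_constant(iterations=8):
    """Estimate lim x_h ** (1 / 2**h) for x_0 = 1, x_{h+1} = x_h**2 + 1.

    The choice of recurrence is an assumption; the abstract only calls
    k_0 "a quadratic recurrence constant".
    """
    x = 1
    for _ in range(iterations):
        x = x * x + 1                      # exact integer arithmetic
    return math.exp(math.log(x) / 2 ** iterations)

def mean_root_configurations_asymptotic(n):
    """Leading-order growth sqrt(3/2) * (4/3)**n of the mean, as stated above."""
    return math.sqrt(3 / 2) * (4 / 3) ** n

k0 = quadratic_recurrence_constant()
print(f"{k0:.4f}")
print(mean_root_configurations_asymptotic(20))
```

Comparing k₀ⁿ with (4/3)ⁿ for the same n shows how much faster the maximum can grow than the mean under the uniform distribution.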

[151] P Verdu*, EM Jewett*, TJ Pemberton, NA Rosenberg, M Baptista (2017) Parallel trajectories of genetic and linguistic admixture in a genetically admixed creole population. Current Biology 27: 2529-2535. [PDF] [Supplement]

Joint analyses of genes and languages, both of which are transmitted in populations by descent with modification (genes vertically by Mendel's laws, language via combinations of vertical, oblique, and horizontal processes [1-4]), provide an informative approach for human evolutionary studies [5-10]. Although gene-language analyses have employed extensive data on individual genetic variation [11-23], their linguistic data have not considered corresponding long-recognized [24] variability in individual speech patterns, or idiolects. Genetically admixed populations that speak creole languages show high genetic and idiolectal variation: genetic variation owing to heterogeneity in ancestry within admixed groups [25,26] and idiolectal variation owing to recent language formation from differentiated sources [27-31]. To examine cotransmission of genetic and linguistic variation within populations, we collected genetic markers and speech recordings in the admixed creole-speaking population of Cape Verde, whose Kriolu language traces to West African languages and Portuguese [29,32-35] and whose genetic ancestry has individual variation in European and continental African contributions [36-39]. In parallel with the combined Portuguese and West African origin of Kriolu, we find that genetic admixture in Cape Verde varies on an axis separating Iberian and Senegambian populations. We observe, analogously to vertical genetic transmission, transmission of idiolect from parents to offspring, as idiolect is predicted by parental birthplace, even after controlling for shared parent-child birthplaces. Further, African genetic admixture correlates with an index tabulating idiolectal features with likely African origins. These results suggest that Cape Verdean genetic and linguistic admixture have followed parallel evolutionary trajectories, with cotransmission of genetic and linguistic variation.

[150] OK Kamneva, J Syring, A Liston, NA Rosenberg (2017) Evaluating allopolyploid origins in strawberries (Fragaria) using haplotypes generated from target capture sequencing. BMC Evolutionary Biology 17: 180. [PDF] [File S1 Table S1] [File S2 Sequences] [File S3 Table S2] [File S4 Figures S1-S11] [File S5 Table S3] [File S6 R code]

Background. Hybridization is observed in many eukaryotic lineages and can lead to the formation of polyploid species. The study of hybridization and polyploidization faces challenges both in data generation and in accounting for population-level phenomena such as coalescence processes in phylogenetic analysis. Genus Fragaria is one example of a set of plant taxa in which a range of ploidy levels is observed across species, but phylogenetic origins are unknown. Results. Here, using 20 diploid and polyploid Fragaria species, we combine approaches from NGS data analysis and phylogenetics to infer evolutionary origins of polyploid strawberries, taking into account coalescence processes. We generate haplotype sequences for 257 low-copy nuclear markers assembled from Illumina target capture sequence data. We then identify putative hybridization events by analyzing gene tree topologies, and further test predicted hybridizations in a coalescence framework. This approach confirms the allopolyploid ancestry of F. chiloensis and F. virginiana, and provides new allopolyploid ancestry hypotheses for F. iturupensis, F. moschata, and F. orientalis. Evidence of gene flow between diploids F. bucharica and F. vesca is also detected, suggesting that it might be appropriate to consider these groups as conspecifics. Conclusions. This study is one of the first in which target capture sequencing followed by computational deconvolution of individual haplotypes is used for tracing origins of polyploid taxa. The study also provides new perspectives on the evolutionary history of Fragaria.

[149] N Alcala, NA Rosenberg (2017) Mathematical constraints on F_ST: biallelic markers in arbitrarily many populations. Genetics 206: 1581-1600. [PDF] [File S1] [File S2]

F_ST is one of the most widely used statistics in population genetics. Recent mathematical studies have identified constraints that challenge interpretations of F_ST as a measure with potential to range from 0 for genetically similar populations to 1 for divergent populations. We generalize results obtained for population pairs to arbitrarily many populations, characterizing the mathematical relationship between F_ST, the frequency M of the more frequent allele at a polymorphic biallelic marker, and the number of subpopulations K. We show that for fixed K, F_ST has a peculiar constraint as a function of M, with a maximum of 1 only if M = i/K, for integers i with ⌈ K/2 ⌉ ≤ i ≤ K-1. For fixed M, as K grows large, the range of F_ST becomes the closed or half-open unit interval. For fixed K, however, some M < (K-1)/K always exists at which the upper bound on F_ST lies below 2 √ 2 - 2 ≈ 0.8284. We use coalescent simulations to show that under weak migration, F_ST depends strongly on M when K is small, but not when K is large. Finally, examining data on human genetic variation, we use our results to explain the generally smaller F_ST values between pairs of continents relative to global F_ST values. We discuss implications for the interpretation and use of F_ST.
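
The condition for F_ST to reach 1 can be made concrete by listing the qualifying allele frequencies for a given K. The Python sketch below enumerates M = i/K for integers ⌈K/2⌉ ≤ i ≤ K-1 and evaluates the constant 2√2 - 2. It does not implement the upper bound on F_ST itself, which the abstract does not state, and the names are hypothetical.

```python
import math
from fractions import Fraction

def frequencies_allowing_fst_one(k):
    """Values M = i/K with ceil(K/2) <= i <= K-1, the only frequencies at
    which the maximum of F_ST equals 1 according to the result above."""
    if k < 2:
        raise ValueError("need at least two subpopulations")
    first = (k + 1) // 2                   # integer form of ceil(K/2)
    return [Fraction(i, k) for i in range(first, k)]

# Constant below which the upper bound dips for some M < (K-1)/K.
FST_DIP_CONSTANT = 2 * math.sqrt(2) - 2

for k in (2, 3, 5, 8):
    print(k, [str(m) for m in frequencies_allowing_fst_one(k)])
print(round(FST_DIP_CONSTANT, 4))
```

The list of qualifying frequencies lengthens as K grows, in line with the statement that the range of F_ST fills out the unit interval for large K.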

[148] MD Edge, BFB Algee-Hewitt, TJ Pemberton, JZ Li, NA Rosenberg (2017) Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proceedings of the National Academy of Sciences USA 114: 5671-5676. [PDF] [Supplement]

Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching — the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people — one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications — we find that 90-98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99-100% when ~30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers — including databases of forensic significance.

[147] A Goldberg, T Günther, NA Rosenberg, M Jakobsson (2017) Robust model-based inference of male-biased admixture during Bronze Age migration from the Pontic-Caspian Steppe. Proceedings of the National Academy of Sciences USA 114: E3875-E3877. [PDF]

(No abstract)

[146] A Goldberg, T Günther, NA Rosenberg, M Jakobsson (2017) Ancient X chromosomes reveal contrasting sex bias in Neolithic and Bronze Age migrations. Proceedings of the National Academy of Sciences USA 114: 2657-2662. [PDF] [Supplement]

Dramatic events in human prehistory, such as the spread of agriculture to Europe from Anatolia and the late Neolithic/Bronze Age migration from the Pontic-Caspian Steppe, can be investigated using patterns of genetic variation among the people who lived in those times. In particular, studies of differing female and male demographic histories on the basis of ancient genomes can provide information about complexities of social structures and cultural interactions in prehistoric populations. We use a mechanistic admixture model to compare the sex-specifically-inherited X chromosome with the autosomes in 20 early Neolithic and 16 late Neolithic/Bronze Age human remains. Contrary to previous hypotheses suggested by the patrilocality of many agricultural populations, we find no evidence of sex-biased admixture during the migration that spread farming across Europe during the early Neolithic. For later migrations from the Pontic Steppe during the late Neolithic/Bronze Age, however, we estimate a dramatic male bias, with approximately five to 14 migrating males for every migrating female. We find evidence of ongoing, primarily male, migration from the steppe to central Europe over a period of multiple generations, with a level of sex bias that excludes a pulse migration during a single generation. The contrasting patterns of sex-specific migration during these two migrations suggest a view of differing cultural histories in which the Neolithic transition was driven by mass migration of both males and females in roughly equal numbers, perhaps whole families, whereas the later Bronze Age migration and cultural shift were instead driven by male migration, potentially connected to new technology and conquest.

[145] OK Kamneva, NA Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 13: 1176934317691935. [PDF] [Supplement]

Hybridization events generate reticulate species relationships, giving rise to species networks rather than species trees. We report a comparative study of consensus, maximum parsimony, and maximum likelihood methods of species network reconstruction using gene trees simulated assuming a known species history. We evaluate the role of the divergence time between species involved in a hybridization event, the relative contributions of the hybridizing species, and the error in gene tree estimation. When gene tree discordance is mostly due to hybridization and not due to incomplete lineage sorting (ILS), most of the methods can detect even highly skewed hybridization events between highly divergent species. For recent divergences between hybridizing species, when the influence of ILS is sufficiently high, likelihood methods outperform parsimony and consensus methods, which erroneously identify extra hybridizations. The more sophisticated likelihood methods, however, are affected by gene tree errors to a greater extent than are consensus and parsimony.

[144] JTL Kang, A Goldberg, MD Edge, DM Behar, NA Rosenberg (2016) Consanguinity rates predict long runs of homozygosity in Jewish populations. Human Heredity 82: 87-102. [PDF]

Objectives: Recent studies have highlighted the potential of analyses of genomic sharing to produce insight into the demographic processes affecting human populations. We study runs of homozygosity (ROH) in 18 Jewish populations, examining these groups in relation to 123 non-Jewish populations sampled worldwide. Methods: By sorting ROH into 3 length classes (short, intermediate, and long), we evaluate the impact of demographic processes on genomic patterns in Jewish populations. Results: We find that the portion of the genome appearing in long ROH (the length class most directly related to recent consanguinity) closely accords with data gathered from interviews during the 1950s on frequencies of consanguineous unions in various Jewish groups. Conclusion: The high correlation between 1950s consanguinity levels and coverage by long ROH explains differences across populations in ROH patterns. The dissection of ROH into length classes and the comparison to consanguinity data assist in understanding a number of additional phenomena, including similarities of Jewish populations to Middle Eastern, European, and Central and South Asian non-Jewish populations in short ROH patterns, relative lengths of identity-by-descent tracts in different Jewish groups, and the "population isolate" status of the Ashkenazi Jews.

[143] LH Uricchio, T Warnow, NA Rosenberg (2016) An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees. BMC Bioinformatics 17: 417. [PDF]

Background. Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. However, few studies have attempted to use analytical theory to quantify "sufficiently large." Results. Using the multispecies coalescent model, we report a general analytical upper bound on the number of gene trees n required such that with probability q, each bipartition of a species tree is represented at least once in a set of n random gene trees. This bound employs a formula that is straightforward to compute, depends only on the minimum internal branch length of the species tree and the number of taxa, and applies irrespective of the species tree topology. Using simulations, we investigate numerical properties of the bound as well as its accuracy under the multispecies coalescent. Conclusions. Our results are helpful for conservatively bounding the number of gene trees required by the ASTRAL inference method, and the approach has potential to be extended to bound other properties of gene tree sets under the model.

[142] F Disanto, NA Rosenberg (2016) Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13: 913-925. [PDF]

Coalescent histories provide lists of species tree branches on which gene tree coalescences can take place, and their enumerative properties assist in understanding the computational complexity of calculations central in the study of gene trees and species trees. Here, we solve an enumerative problem left open by Rosenberg (IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 1253-1262, 2013) concerning the number of coalescent histories for gene trees and species trees with a matching labeled topology that belongs to a generic caterpillar-like family. By bringing a generating function approach to the study of coalescent histories, we prove that for any caterpillar-like family with seed tree t, the sequence (h_n)_{n ≥ 0} describing the number of matching coalescent histories of the nth tree of the family grows asymptotically as a constant multiple of the Catalan numbers. Thus, h_n ∼ β_t c_n, where the asymptotic constant β_t depends on the shape of the seed tree t. The result extends a claim demonstrated only for seed trees with at most eight taxa to arbitrary seed trees, expanding the set of cases for which detailed enumerative properties of coalescent histories can be determined. We introduce a procedure that computes from t the constant β_t as well as the algebraic expression for the generating function of the sequence (h_n)_{n ≥ 0}.

[141] RS Mehta, D Bryant, NA Rosenberg (2016) The probability of monophyly of a sample of gene lineages on a species tree. Proceedings of the National Academy of Sciences USA 113: 8002-8009. [PDF] [Supplement] [Software]

Monophyletic groups — groups that consist of all of the descendants of a most recent common ancestor — arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

[140] T Stadler, JH Degnan, NA Rosenberg (2016) Does gene tree discordance explain the mismatch between macroevolutionary models and empirical patterns of tree shape and branching times? Systematic Biology 65: 628-639. [PDF] [Supplement]

Classic null models for speciation and extinction give rise to phylogenies that differ in distribution from empirical phylogenies. In particular, empirical phylogenies are less balanced and have branching times closer to the root compared to phylogenies predicted by common null models. This difference might be due to null models of the speciation and extinction process being too simplistic, or due to the empirical datasets not being representative of random phylogenies. A third possibility arises because phylogenetic reconstruction methods often infer gene trees rather than species trees, producing an incongruity between models that predict species tree patterns and empirical analyses that consider gene trees. We investigate the extent to which the difference between gene trees and species trees under a combined birth-death and multispecies coalescent model can explain the difference in empirical trees and birth-death species trees. We simulate gene trees embedded in simulated species trees and investigate their difference with respect to tree balance and branching times. We observe that the gene trees are less balanced and typically have branching times closer to the root than the species trees. Empirical trees from TreeBase are also less balanced than our simulated species trees, and model gene trees can explain an imbalance increase of up to 8% compared to species trees. However, we see a much larger imbalance increase in empirical trees, about 100%, meaning that additional features must also be causing imbalance in empirical trees. This simulation study highlights the necessity of revisiting the assumptions made in phylogenetic analyses, as these assumptions, such as equating the gene tree with the species tree, might lead to a biased conclusion.

[139] M DeGiorgio, NA Rosenberg (2016) Consistency and inconsistency of consensus methods for inferring species trees from gene trees in the presence of ancestral population structure. Theoretical Population Biology 110: 12-24. [PDF]

In the last few years, several statistically consistent consensus methods for species tree inference have been devised that are robust to the gene tree discordance caused by incomplete lineage sorting in unstructured ancestral populations. One source of gene tree discordance that has only recently been identified as a potential obstacle for phylogenetic inference is ancestral population structure. In this article, we describe a general model of ancestral population structure, and by relying on a single carefully constructed example scenario, we show that the consensus methods Democratic Vote, STEAC, STAR, R* Consensus, Rooted Triple Consensus, Minimize Deep Coalescences, and Majority-Rule Consensus are statistically inconsistent under the model. We find that among the consensus methods evaluated, the only method that is statistically consistent in the presence of ancestral population structure is GLASS/Maximum Tree. We use simulations to evaluate the behavior of the various consensus methods in a model with ancestral population structure, showing that as the number of gene trees increases, estimates on the basis of GLASS/Maximum Tree approach the true species tree topology irrespective of the level of population structure, whereas estimates based on the remaining methods only approach the true species tree topology if the level of structure is low. However, through simulations using species trees both with and without ancestral population structure, we show that GLASS/Maximum Tree performs unusually poorly on gene trees inferred from alignments with little information. This practical limitation of GLASS/Maximum Tree together with the inconsistency of other methods prompts the need for both further testing of additional existing methods and development of novel methods under conditions that incorporate ancestral population structure.

[138] BFB Algee-Hewitt*, MD Edge*, J Kim, JZ Li, NA Rosenberg (2016) Individual identifiability predicts population identifiability in forensic microsatellite markers. Current Biology 26: 935-942. [PDF] [Supplement]

Highly polymorphic genetic markers with significant potential for distinguishing individual identity are used as a standard tool in forensic testing [1,2]. At the same time, population-genetic studies have suggested that genetically diverse markers with high individual identifiability also confer information about genetic ancestry [3-6]. The dual influence of polymorphism levels on ancestry inference and forensic desirability suggests that forensically useful marker sets with high levels of individual identifiability might also possess substantial ancestry information. We study a standard forensic marker set — the 13 CODIS loci used in the United States and elsewhere [2,7-9] — together with 779 additional microsatellites [10], using direct population structure inference to test whether markers with substantial individual identifiability also produce considerable information about ancestry. Despite having been selected for individual identification and not for ancestry inference [11], the CODIS markers generate nontrivial model-based clustering patterns similar to those of other sets of 13 tetranucleotide microsatellites. Although the CODIS markers have relatively low values of the F_ST divergence statistic, their high heterozygosities produce greater ancestry inference potential than is possessed by less heterozygous marker sets. More generally, we observe that marker sets with greater individual identifiability also tend toward greater population identifiability. We conclude that population identifiability regularly follows as a byproduct of the use of highly polymorphic forensic markers. Our findings have implications for the design of new forensic marker sets and for evaluations of the extent to which individual characteristics beyond identification might be predicted from current and future forensic data.

[137] NA Rosenberg (2016) Admixture models and the breeding systems of H. S. Jennings: a GENETICS connection. Genetics 202: 9-13. [PDF]

(No abstract)

[136] JTL Kang, P Zhang, S Zöllner, NA Rosenberg (2015) Choosing subsamples for sequencing studies by minimizing the average distance to the closest leaf. Genetics 201: 499-511. [PDF]

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal panel — minimizing the average distance to the closest leaf (ADCL) — and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
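
The ADCL criterion can be sketched on a toy example. The Python code below assumes a precomputed matrix of pairwise distances between leaves and scores a candidate panel by the average, over all leaves, of the distance to the nearest panel member (panel members contribute zero). Both that averaging convention and the exhaustive search are simplifying assumptions for illustration, not the algorithm or software used in the paper. All names are hypothetical.

```python
from itertools import combinations

def adcl(dist, panel):
    """Average, over all leaves, of the distance to the closest panel member."""
    n = len(dist)
    return sum(min(dist[i][j] for j in panel) for i in range(n)) / n

def best_panel(dist, size):
    """Brute-force search for the panel of the given size with minimal ADCL.

    Only practical for small inputs; shown to make the objective explicit.
    """
    candidates = combinations(range(len(dist)), size)
    return min(candidates, key=lambda panel: adcl(dist, panel))

# Toy distances: two tight pairs of leaves that are far from each other.
positions = [0, 1, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]

panel = best_panel(dist, 2)
print(panel, adcl(dist, panel))
```

On this toy input the selected panel takes one leaf from each tight pair, which is the behavior one would expect from a criterion that keeps every leaf close to some sequenced individual.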

[135] F Disanto, NA Rosenberg (2015) Coalescent histories for lodgepole species trees. Journal of Computational Biology 22: 918-929. [PDF]

Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the lodgepole species trees (λ_n)_{n ≥ 0}, in which tree λ_n has m=2n+1 taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with m!! in the number of taxa m. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with m taxa, increasing a previous bound of (√π/32)[(5m-12)/(4m-6)] m √m to [√(m-1)/(4√e)]^m. We discuss the implications of our enumerative results for phylogenetic computations.
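
The two lower bounds on the ratio, and the double factorial that drives the growth, are simple enough to evaluate numerically. The Python sketch below transcribes the expressions from the abstract; the function names are hypothetical, and the code does not count coalescent histories.

```python
import math

def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def previous_ratio_bound(m):
    """Earlier lower bound: (sqrt(pi)/32) * [(5m-12)/(4m-6)] * m * sqrt(m)."""
    return (math.sqrt(math.pi) / 32) * ((5 * m - 12) / (4 * m - 6)) * m * math.sqrt(m)

def improved_ratio_bound(m):
    """Lower bound from the lodgepole family: [sqrt(m-1) / (4*sqrt(e))]**m."""
    return (math.sqrt(m - 1) / (4 * math.sqrt(math.e))) ** m

for m in (9, 51, 101, 201):
    print(m, previous_ratio_bound(m), improved_ratio_bound(m))
print(double_factorial(7))
```

The earlier bound grows polynomially in m, whereas the newer one grows faster than exponentially once √(m-1) exceeds 4√e. For small m the newer expression can be the smaller of the two, so the improvement is best read as a statement about large m.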

[134] R Ronen*, G Tesler*, A Akbari*, S Zakov, NA Rosenberg, V Bafna (2015) Predicting carriers of ongoing selective sweeps without knowledge of the favored allele. PLoS Genetics 11: e1005527. [PDF] [Supplement]

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
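
A toy computation may help convey what a per-haplotype frequency score looks like. The abstract does not define the HAF score. The Python sketch below assumes its simplest form: for each haplotype, sum the sample counts of the derived alleles that the haplotype carries. That definition, the 0/1 encoding (1 = derived allele), and the function name are assumptions for illustration; PreCIOSS itself is not reproduced here.

```python
def haf_scores(haplotypes):
    """Assumed HAF score: for each haplotype, the sum over sites where it
    carries the derived allele (coded 1) of that allele's count in the sample."""
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    # Number of sampled haplotypes carrying the derived allele at each site.
    derived_counts = [sum(h[s] for h in haplotypes) for s in range(n_sites)]
    return [
        sum(derived_counts[s] for s in range(n_sites) if h[s] == 1)
        for h in haplotypes
    ]

# Rows are haplotypes, columns are sites (illustrative data only).
sample = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
scores = haf_scores(sample)
print(scores)
```

Under this assumed definition, haplotypes that carry many common derived alleles score highest, which is the intuition behind using the score to flag likely carriers of a sweeping allele.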

[133] A Goldberg, NA Rosenberg (2015) Beyond 2/3 and 1/3: the complex signatures of sex-biased admixture on the X chromosome. Genetics 201: 263-279. [PDF]

Sex-biased demography, in which parameters governing migration and population size differ between females and males, has been studied through comparisons of X chromosomes, which are inherited sex-specifically, and autosomes, which are not. A common form of sex bias in humans is sex-biased admixture, in which at least one of the source populations differs in its proportions of females and males contributing to an admixed population. Studies of sex-biased admixture often examine the mean ancestry for markers on the X chromosome in relation to the autosomes. A simple framework noting that in a population with equally many females and males, two-thirds of X chromosomes appear in females, suggests that the mean X-chromosomal admixture fraction is a linear combination of female and male admixture parameters, with coefficients 2/3 and 1/3, respectively. Extending a mechanistic admixture model to accommodate the X chromosome, we demonstrate that this prediction is not generally true in admixture models, although it holds in the limit for an admixture process occurring as a single event. For a model with constant ongoing admixture, we determine the mean X-chromosomal admixture, comparing admixture on female and male X chromosomes to corresponding autosomal values. Surprisingly, in reanalyzing African-American genetic data to estimate sex-specific contributions from African and European sources, we find that the range of contributions compatible with the excess African ancestry on the X chromosome compared to autosomes has a wide spread, permitting scenarios either without male-biased contributions from Europe or without female-biased contributions from Africa.
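
The "2/3 and 1/3" heuristic mentioned in this abstract can be written out as a short calculation. The sketch below is only an illustration of that simple linear prediction, which the abstract reports does not hold in general; it is not the mechanistic model of the paper. The names s_f and s_m (female and male contribution fractions from one source population) are hypothetical, and the equal-weight autosomal average is an added assumption for comparison.

```python
def naive_mean_admixture(s_f, s_m):
    """Return (autosomal, X-chromosomal) mean admixture under the simple heuristic.

    s_f, s_m: hypothetical female and male contribution fractions from one
    source population, each in [0, 1].
    """
    for s in (s_f, s_m):
        if not 0.0 <= s <= 1.0:
            raise ValueError("contribution fractions must lie in [0, 1]")
    # Assumption: autosomes weight female and male contributions equally.
    autosomal = 0.5 * s_f + 0.5 * s_m
    # Heuristic from the abstract: coefficients 2/3 (female) and 1/3 (male).
    x_chromosomal = (2.0 / 3.0) * s_f + (1.0 / 3.0) * s_m
    return autosomal, x_chromosomal


autosomal, x_chromosomal = naive_mean_admixture(0.9, 0.6)
print(autosomal, x_chromosomal)
```

Under this heuristic, a female-biased contribution (s_f > s_m) yields a larger X-chromosomal value than autosomal value.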

[132] MD Edge, NA Rosenberg (2015) A general model of the relationship between the apportionment of human genetic diversity and the apportionment of human phenotypic diversity. Human Biology 87: 313-337. [PDF]

Models that examine genetic differences between populations alongside a genotype-phenotype map can provide insight about phenotypic variation among groups. We generalize a simple model of a completely heritable, additive, selectively neutral quantitative trait to examine the relationship between single-locus genetic differentiation and phenotypic differentiation on quantitative traits. In agreement with similar efforts using different models, we show that the expected degree to which two groups differ on a neutral quantitative trait is not strongly affected by the number of genetic loci that influence the trait: neutral trait differences are expected to have a magnitude comparable to the genetic differences at a single neutral locus. We discuss this result with respect to population differences in disease phenotypes, arguing that although neutral genetic differences between populations can contribute to specific differences between populations in health outcomes, systematic patterns of difference that run in the same direction for many genetically independent health conditions are unlikely to be explained by neutral genetic differentiation.

[131] NA Rosenberg, JTL Kang (2015) Genetic diversity and societally important disparities. Genetics 201: 1-12. [PDF] [Supplement]

The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.

[130] NM Kopelman, J Mayzel, M Jakobsson, NA Rosenberg, I Mayrose (2015) CLUMPAK: a program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources 15: 1179-1191. [PDF] [Supplement] [Software]

The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present CLUMPAK (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, CLUMPAK identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software CLUMPP. Next, CLUMPAK identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in CLUMPP and simplifying the comparison of clustering results across different K values. CLUMPAK incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. CLUMPAK, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

[129] MD Edge, NA Rosenberg (2015) Implications of the apportionment of human genetic diversity for the apportionment of human phenotypic diversity. Studies in History and Philosophy of Biological and Biomedical Sciences 52: 32-45. [PDF]

Researchers in many fields have considered the meaning of two results about genetic variation for concepts of "race." First, at most genetic loci, apportionments of human genetic diversity find that worldwide populations are genetically similar. Second, when multiple genetic loci are examined, it is possible to distinguish people with ancestry from different geographical regions. These two results raise an important question about human phenotypic diversity: To what extent do populations typically differ on phenotypes determined by multiple genetic loci? It might be expected that such phenotypes follow the pattern of similarity observed at individual loci. Alternatively, because they have a multilocus genetic architecture, they might follow the pattern of greater differentiation suggested by multilocus ancestry inference. To address the question, we extend a well-known classification model of Edwards (2003) by adding a selectively neutral quantitative trait. Using the extended model, we show, in line with previous work in quantitative genetics, that regardless of how many genetic loci influence the trait, one neutral trait is approximately as informative about ancestry as a single genetic locus. The results support the relevance of single-locus genetic-diversity partitioning for predictions about phenotypic diversity.

[128] L Lehmann, NA Rosenberg (2015) Hamilton's rule: game theory meets coalescent theory. Theoretical Population Biology 103: 1. [PDF]

(No abstract)

[127] NR Garud, NA Rosenberg (2015) Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps. Theoretical Population Biology 102: 94-101. [PDF]

Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H₁₂, a statistic designed to identify both hard and soft selective sweeps, and H₂/H₁, a statistic that conditional on high H₁₂ values seeks to distinguish between hard and soft sweeps. A challenge in the use of H₂/H₁ is that its range depends on the associated value of H₁₂, so that equal H₂/H₁ values might provide different levels of support for a soft sweep model at different values of H₁₂. Here, we enhance the H₁₂ and H₂/H₁ haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H₂/H₁ as a function of H₁₂, thereby generating a statistic that normalizes H₂/H₁ to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization.
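
As a rough illustration of the statistics named in this abstract, the sketch below computes H₁, H₁₂, and H₂/H₁ from a list of haplotype frequencies. The definitions used (H₁ as the sum of squared frequencies, H₁₂ as H₁ with the two most frequent haplotypes pooled, H₂ as H₁ without the most frequent haplotype) are the ones commonly given for these statistics and are not spelled out in the abstract itself. The function name is hypothetical, and the normalized statistic is not attempted because the upper bound is not stated here.

```python
def haplotype_homozygosity(freqs):
    """Compute H1, H12 and H2/H1 from haplotype frequencies that sum to 1."""
    p = sorted(freqs, reverse=True)  # most frequent haplotype first
    if not p or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("frequencies must be non-empty and sum to 1")
    p1 = p[0]
    p2 = p[1] if len(p) > 1 else 0.0
    h1 = sum(x * x for x in p)                        # ordinary haplotype homozygosity
    h12 = (p1 + p2) ** 2 + sum(x * x for x in p[2:])  # two most frequent haplotypes pooled
    h2 = h1 - p1 * p1                                 # homozygosity excluding the top haplotype
    return {"H1": h1, "H12": h12, "H2/H1": h2 / h1}


# One dominant haplotype (hard-sweep-like) versus two co-dominant haplotypes (soft-sweep-like).
print(haplotype_homozygosity([0.8, 0.1, 0.1]))
print(haplotype_homozygosity([0.4, 0.4, 0.1, 0.1]))
```

In these two toy inputs, H₁₂ is high in both cases, while H₂/H₁ is much larger for the second input, which is the contrast the abstract describes.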

[126] NA Rosenberg (2015) Theory in population biology, or biologically inspired mathematics? Theoretical Population Biology 102: 1-2. [PDF]

(No abstract)

[125] N Creanza, M Ruhlen, T Pemberton, NA Rosenberg, MW Feldman, S Ramachandran (2015) Comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences USA 112: 1265-1272. [PDF] [Supplementary Appendix] [Supplementary Data S1] [Supplementary Data S2] [Supplementary Data S3]

Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.

[124] EO Buzbas, NA Rosenberg (2015) AABC: approximate approximate Bayesian computation for inference in population-genetic models. Theoretical Population Biology 99: 31-42. [PDF]

Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present "approximate approximate Bayesian computation" (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. The latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.
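
For orientation, the sketch below shows plain rejection ABC, the baseline that this abstract extends; it is not AABC. The toy model (counting successes in 50 trials with a uniform prior on the success probability), the function names, the tolerance, and the number of draws are all assumptions chosen for illustration. AABC would replace the expensive simulator with a statistical approximation built from a small number of simulations; that step is not specified in the abstract and is not attempted here.

```python
import random


def abc_rejection(observed, simulate, prior_sample, distance, eps, n_draws, rng):
    """Keep prior draws whose simulated data fall within eps of the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), observed) <= eps:
            accepted.append(theta)
    return accepted


def simulate_successes(theta, rng, n_trials=50):
    """Toy simulator (assumption): number of successes in n_trials Bernoulli(theta) trials."""
    return sum(rng.random() < theta for _ in range(n_trials))


rng = random.Random(0)
posterior = abc_rejection(
    observed=35,                          # hypothetical observed summary statistic
    simulate=simulate_successes,
    prior_sample=lambda r: r.random(),    # uniform prior on [0, 1]
    distance=lambda a, b: abs(a - b),
    eps=1,
    n_draws=20000,
    rng=rng,
)
posterior_mean = sum(posterior) / len(posterior)
print(len(posterior), round(posterior_mean, 3))
```

The loop makes the cost structure visible: every prior draw requires one call to the simulator, which is why rejection ABC becomes impractical when each simulation is slow.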

[123] F Disanto, NA Rosenberg (2014) On the number of ranked species trees producing anomalous ranked gene trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11: 1229-1238. [PDF]

Analysis of probability distributions conditional on species trees has demonstrated the existence of anomalous ranked gene trees (ARGTs), ranked gene trees that are more probable than the ranked gene tree that accords with the ranked species tree. Here, to improve the characterization of ARGTs, we study enumerative and probabilistic properties of two classes of ranked labeled species trees, focusing on the presence or avoidance of certain subtree patterns associated with the production of ARGTs. We provide exact enumerations and asymptotic estimates for cardinalities of these sets of trees, showing that as the number of species increases without bound, the fraction of all ranked labeled species trees that are ARGT-producing approaches 1. This result extends beyond earlier existence results to provide a probabilistic claim about the frequency of ARGTs.

[122] A Goldberg, P Verdu, NA Rosenberg (2014) Autosomal admixture levels are informative about sex bias in admixed populations. Genetics 198: 1209-1229. [PDF]

Sex-biased admixture has been observed in a wide variety of admixed populations. Genetic variation in sex chromosomes and functions of quantities computed from sex chromosomes and autosomes have often been examined to infer patterns of sex-biased admixture, typically using statistical approaches that do not mechanistically model the complexity of a sex-specific history of admixture. Here, expanding on a model of Verdu and Rosenberg (2011) that did not include sex specificity, we develop a model that mechanistically examines sex-specific admixture histories. Under the model, multiple source populations contribute to an admixed population, potentially with their male and female contributions varying over time. In an admixed population descended from two source groups, we derive the moments of the distribution of the autosomal admixture fraction from a specific source population as a function of sex-specific introgression parameters and time. Considering admixture processes that are constant in time, we demonstrate that surprisingly, although the mean autosomal admixture fraction from a specific source population does not reveal a sex bias in the admixture history, the variance of autosomal admixture is informative about sex bias. Specifically, the long-term variance decreases as the sex bias from a contributing source population increases. This result can be viewed as analogous to the reduction in effective population size for populations with an unequal number of breeding males and females. Our approach suggests that it may be possible to use the effect of sex-biased admixture on autosomal DNA to assist with methods for inference of the history of complex sex-biased admixture processes.
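
The analogy drawn at the end of this abstract refers to a standard population-genetic formula for effective population size with unequal numbers of breeding males and females, N_e = 4 N_m N_f / (N_m + N_f). The sketch below evaluates only that textbook formula, as a reminder of the effect being compared; it does not implement the admixture model, and the function name is hypothetical.

```python
def effective_size(n_males, n_females):
    """Effective population size with separate sexes: 4 * Nm * Nf / (Nm + Nf)."""
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must have a positive number of breeders")
    return 4.0 * n_males * n_females / (n_males + n_females)


# Same census size of 100 breeders, increasingly unequal sex ratio.
for n_m, n_f in [(50, 50), (30, 70), (10, 90)]:
    print(n_m, n_f, effective_size(n_m, n_f))
```

With the census size held at 100, the effective size falls as the sex ratio becomes more unequal, which parallels the decrease in long-term variance with increasing sex bias described in the abstract.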

[121] MD Edge, NA Rosenberg (2014) Upper bounds on F_ST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theoretical Population Biology 97: 20-34. [PDF]

F_ST is one of the most frequently-used indices of genetic differentiation among groups. Though F_ST takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, F_ST is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, F_ST for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele (M) and the homozygosity of the total population (H_T). In the two-subpopulation case, F_ST can equal one only when the frequency of the most frequent allele and the total homozygosity are 1/2. Here, we extend this work by deriving strict bounds on F_ST for two subpopulations when the number of alleles at the locus is specified to be I. We show that restricting to I alleles produces the same upper bound on F_ST over much of the allowable domain for M and H_T, and we derive more restrictive bounds in the windows M ∈ [1/I,1/(I-1)) and H_T ∈ [1/I,I/(I²-1)). These results extend our understanding of the behavior of F_ST in relation to other population-genetic statistics.
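
The quantities in this abstract can be computed directly for a small example. The sketch below assumes two equally weighted subpopulations and uses the homozygosity form F_ST = (H_S - H_T)/(1 - H_T), where H_S is the mean within-subpopulation homozygosity and H_T is the homozygosity of the total population; H_S, the equal weighting, and the function name are assumptions introduced for illustration. It does not implement the bounds derived in the paper.

```python
def fst_two_subpops(p1, p2):
    """Return (F_ST, M, H_T) for allele-frequency vectors of two equally weighted subpopulations."""
    if len(p1) != len(p2):
        raise ValueError("frequency vectors must list the same alleles")
    total = [(a + b) / 2.0 for a, b in zip(p1, p2)]  # allele frequencies in the total population
    h_s = 0.5 * (sum(a * a for a in p1) + sum(b * b for b in p2))
    h_t = sum(x * x for x in total)                  # total homozygosity H_T
    m = max(total)                                   # frequency of the most frequent allele M
    if h_t >= 1.0:
        raise ValueError("F_ST is undefined for a monomorphic locus")
    return (h_s - h_t) / (1.0 - h_t), m, h_t


print(fst_two_subpops([1.0, 0.0], [0.0, 1.0]))  # fixed for different alleles
print(fst_two_subpops([0.9, 0.1], [0.5, 0.5]))
```

The first input reproduces the boundary case stated in the abstract: F_ST equals 1 with M = 1/2 and H_T = 1/2.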

[120] P Verdu, TJ Pemberton, R Laurent, BM Kemp, A Gonzalez-Oliver, C Gorodesky, CE Hughes, MR Shattuck, B Petzelt, J Mitchell, H Harry, T William, R Worl, JS Cybulski, NA Rosenberg, RS Malhi (2014) Patterns of admixture and population structure in native populations of northwest North America. PLoS Genetics 10: e1004530. [PDF] [Supplement]

The initial contact of European populations with indigenous populations of the Americas produced diverse admixture processes across North, Central, and South America. Recent studies have examined the genetic structure of indigenous populations of Latin America and the Caribbean and their admixed descendants, reporting on the genomic impact of the history of admixture with colonizing populations of European and African ancestry. However, relatively little genomic research has been conducted on admixture in indigenous North American populations. In this study, we analyze genomic data at 475,109 single-nucleotide polymorphisms sampled in indigenous peoples of the Pacific Northwest in British Columbia and Southeast Alaska, populations with a well-documented history of contact with European and Asian traders, fishermen, and contract laborers. We find that the indigenous populations of the Pacific Northwest have higher gene diversity than Latin American indigenous populations. Among the Pacific Northwest populations, interior groups provide more evidence for East Asian admixture, whereas coastal groups have higher levels of European admixture. In contrast with many Latin American indigenous populations, the variance of admixture is high in each of the Pacific Northwest indigenous populations, as expected for recent and ongoing admixture processes. The results reveal some similarities but notable differences between admixture patterns in the Pacific Northwest and those in Latin America, contributing to a more detailed understanding of the genomic consequences of European colonization events throughout the Americas.

[119] TJ Pemberton, NA Rosenberg (2014) Population-genetic influences on genomic estimates of the inbreeding coefficient: a global perspective. Human Heredity 77: 37-48. [PDF] [Supplementary Figure 1] [Supplementary Figure 2] [Supplementary Table 1] [Supplementary Table 2] [Supplementary Table 3]

Background/Aims: Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation. Methods: Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 populations worldwide to demographic consanguinity frequency estimates available for 26 populations as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients. Results: We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r=0.349, p=0.040). Conclusions: The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations.

[118] CV Than, NA Rosenberg (2014) Mean deep coalescence cost under exchangeable probability distributions. Discrete Applied Mathematics 174: 11-26. [PDF]

We derive formulas for mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, under probability distributions that satisfy the exchangeability property. We then apply the formulas to study mean deep coalescence cost under two commonly used exchangeable models: the uniform and Yule models. We find that mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, tends to be larger for unbalanced trees than for balanced trees. These results provide a better understanding of the deep coalescence cost, as well as allow for the development of new species tree inference criteria.

[117] M DeGiorgio, J Syring, AJ Eckert, AI Liston, R Cronn, DB Neale, NA Rosenberg (2014) An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines. BMC Evolutionary Biology 14: 67. [PDF] [Supplementary File 1 (.xlsx, accession numbers)] [Supplementary File 2 (.pdf, supplementary analyses)] [Supplementary File 3 (.zip, data)]

BACKGROUND. As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
RESULTS. Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ~47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.
CONCLUSIONS. When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.

[116] EM Jewett, NA Rosenberg (2014) Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology 93: 14-29. [PDF]

Under the coalescent model, the random number n_t of lineages ancestral to a sample is nearly deterministic as a function of time when n_t is moderate to large in value, and it is well approximated by its expectation E[n_t]. In turn, this expectation is well approximated by simple deterministic functions that are easy to compute. Such deterministic functions have been applied to estimate allele age, effective population size, and genetic diversity, and they have been used to study properties of models of infectious disease dynamics. Although a number of simple approximations of E[n_t] have been derived and applied to problems of population-genetic inference, the theoretical accuracy of the resulting approximate formulas and the inferences obtained using these approximations is not known, and the range of problems to which they can be applied is not well understood. Here, we demonstrate general procedures by which the approximation n_t ≈ E[n_t] can be used to reduce the computational complexity of coalescent formulas, and we show that the resulting approximations converge to their true values under simple assumptions. Such approximations provide alternatives to exact formulas that are computationally intractable or numerically unstable when the number of sampled lineages is moderate or large. We also extend an existing class of approximations of E[n_t] to the case of multiple populations of time-varying size with migration among them. Our results facilitate the use of the deterministic approximation n_t ≈ E[n_t] for deriving functionally simple, computationally efficient, and numerically stable approximations of coalescent formulas under complicated demographic scenarios.
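
To make the idea of a deterministic approximation concrete, the sketch below compares one simple deterministic function with a Monte Carlo estimate of E[n_t] for a single population of constant size. The deterministic curve is the solution of dn/dt = -n(n-1)/2 with time in coalescent units; choosing this particular function is an assumption for illustration, and it is not necessarily the approximation analyzed in the paper. The function names, sample size, and time point are hypothetical.

```python
import math
import random


def lineages_deterministic(n0, t):
    """Solution of dn/dt = -n(n-1)/2 with n(0) = n0, time in coalescent units."""
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2.0))


def lineages_simulated(n0, t, rng):
    """One realization of n_t: with k lineages, the next coalescence has rate k(k-1)/2."""
    n, elapsed = n0, 0.0
    while n > 1:
        elapsed += rng.expovariate(n * (n - 1) / 2.0)
        if elapsed > t:
            break
        n -= 1
    return n


rng = random.Random(1)
n0, t, reps = 50, 0.1, 2000
mc_mean = sum(lineages_simulated(n0, t, rng) for _ in range(reps)) / reps
print(round(lineages_deterministic(n0, t), 2), round(mc_mean, 2))
```

The closed-form curve costs a single exponential per evaluation, whereas the simulation needs many replicates, which hints at why such approximations are attractive when the number of sampled lineages is large.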

[115] NA Rosenberg (2014) Editorial: core elements of a TPB paper. Theoretical Population Biology 92: 118-119. [PDF]

(No abstract)

[114] DM Behar*, M Metspalu*, Y Baran, NM Kopelman, B Yunusbayev, A Gladstein, S Tzur, H Sahakyan, A Bahmanimehr, L Yepiskoposyan, K Tambets, EK Khusnutdinova, A Kushniarevich, O Balanovsky, E Balanovsky, L Kovacevic, D Marjanovic, E Mihailov, A Kouvatsi, C Triantaphyllidis, RJ King, O Semino, A Torroni, MF Hammer, E Metspalu, K Skorecki, S Rosset, E Halperin, R Villems, NA Rosenberg (2013) No evidence from genome-wide data of a Khazar origin for the Ashkenazi Jews. Human Biology 85: 859-900. [PDF]

The origin and history of the Ashkenazi Jewish population have long been of great interest, and advances in high-throughput genetic analysis have recently provided a new approach for investigating these topics. We and others have argued on the basis of genome-wide data that the Ashkenazi Jewish population derives its ancestry from a combination of sources tracing to both Europe and the Middle East. It has been claimed, however, through a reanalysis of some of our data, that a large part of the ancestry of the Ashkenazi population originates with the Khazars, a Turkic-speaking group that lived to the north of the Caucasus region ~1,000 years ago. Because the Khazar population has left no obvious modern descendants that could enable a clear test for a contribution to Ashkenazi Jewish ancestry, the Khazar hypothesis has been difficult to examine using genetics. Furthermore, because only limited genetic data have been available from the Caucasus region, and because these data have been concentrated in populations that are genetically close to populations from the Middle East, the attribution of any signal of Ashkenazi-Caucasus genetic similarity to Khazar ancestry rather than shared ancestral Middle Eastern ancestry has been problematic. Here, through integration of genotypes from newly collected samples with data from several of our past studies, we have assembled the largest data set available to date for assessment of Ashkenazi Jewish genetic origins. This data set contains genome-wide single-nucleotide polymorphisms in 1,774 samples from 106 Jewish and non-Jewish populations that span the possible regions of potential Ashkenazi ancestry: Europe, the Middle East, and the region historically associated with the Khazar Khaganate. The data set includes 261 samples from 15 populations from the Caucasus region and the region directly to its north, samples that have not previously been included alongside Ashkenazi Jewish samples in genomic studies. 
Employing a variety of standard techniques for the analysis of population-genetic structure, we found that Ashkenazi Jews share the greatest genetic ancestry with other Jewish populations and, among non-Jewish populations, with groups from Europe and the Middle East. No particular similarity of Ashkenazi Jews to populations from the Caucasus is evident, particularly populations that most closely represent the Khazar region. Thus, analysis of Ashkenazi Jews together with a large sample from the region of the Khazar Khaganate corroborates the earlier results that Ashkenazi Jews derive their ancestry primarily from populations of the Middle East and Europe, that they possess considerable shared ancestry with other Jewish populations, and that there is no indication of a significant genetic contribution either from within or from north of the Caucasus region.

[113] PJ Oefner, G Hölzl, P Shen, I Shpirer, D Gefel, T Lavi, E Woolf, J Cohen, C Cinnioglu, PA Underhill, NA Rosenberg, J Hochrein, JM Granka, J Hillel, MW Feldman (2013) Genetics and the history of the Samaritans: Y-chromosomal microsatellites and genetic affinity between Samaritans and Cohanim. Human Biology 85: 825-857. [PDF]

The Samaritans are a group of some 750 indigenous Middle Eastern people, about half of whom live in Holon, a suburb of Tel Aviv, and the other half near Nablus. The Samaritan population is believed to have numbered more than a million in late Roman times but less than 150 in 1917. The ancestry of the Samaritans has been subject to controversy from late Biblical times to the present. In this study, liquid chromatography/electrospray ionization/quadrupole ion trap mass spectrometry was used to allelotype 13 Y-chromosomal and 15 autosomal microsatellites in a sample of 12 Samaritans chosen to have as low a level of relationship as possible, and 461 Jews and non-Jews. Estimation of genetic distances between the Samaritans and seven Jewish and three non-Jewish populations from Israel, as well as populations from Africa, Pakistan, Turkey, and Europe, revealed that the Samaritans were closely related to Cohanim. This result supports the position of the Samaritans that they are descendants from the tribes of Israel dating to before the Assyrian exile in 722-720 BCE. In concordance with previously published single-nucleotide polymorphism haplotypes, each Samaritan family, with the exception of the Samaritan Cohen lineage, was observed to carry a distinctive Y-chromosome short tandem repeat haplotype that was not more than one mutation removed from the six-marker Cohen modal haplotype.

[112] NA Rosenberg, SP Weitzman (2013) From generation to generation: the genetics of Jewish populations. Human Biology 85: 817-823. [PDF]

(No abstract)

[111] NA Rosenberg (2013) Coalescent histories for caterpillar-like families. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 1253-1262. [PDF]

A coalescent history is an assignment of branches of a gene tree to branches of a species tree on which coalescences in the gene tree occur. The number of coalescent histories for a pair consisting of a labeled gene tree topology and a labeled species tree topology is important in gene tree probability computations, and more generally, in studying evolutionary possibilities for gene trees on species trees. Defining the T_r-caterpillar-like family as a sequence of n-taxon trees constructed by replacing the r-taxon subtree of n-taxon caterpillars by a specific r-taxon labeled topology T_r, we examine the number of coalescent histories for caterpillar-like families with matching gene tree and species tree labeled topologies. For each T_r with size r ≤ 8, we compute the number of coalescent histories for n-taxon trees in the T_r-caterpillar-like family. Next, as n → ∞, we find that the limiting ratio of the numbers of coalescent histories for the T_r family and caterpillars themselves is correlated with the number of labeled histories for T_r. The results support a view that large numbers of coalescent histories occur when a tree has both a relatively balanced subtree and a high tree depth, contributing to deeper understanding of the combinatorics of gene trees and species trees.
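
The number of labeled histories for a labeled topology, which the abstract relates to the limiting ratio, can be computed with a standard formula: (n-1)! divided by the product, over internal nodes, of the number of internal nodes in the subtree rooted at each node. The following is a minimal Python sketch of that formula, not code from the paper. The nested-tuple tree representation and the function name `labeled_histories` are assumptions made for illustration.

```python
from math import factorial

def labeled_histories(tree):
    """Count labeled histories of a rooted binary labeled topology.

    `tree` is a leaf label (any non-tuple value) or a 2-tuple of subtrees.
    This representation is a hypothetical choice for the sketch.
    """
    def walk(node):
        # Returns (number of internal nodes in subtree, product of those
        # counts over every internal node in the subtree).
        if not isinstance(node, tuple):
            return 0, 1
        left, right = node
        n_left, prod_left = walk(left)
        n_right, prod_right = walk(right)
        n_here = n_left + n_right + 1
        return n_here, prod_left * prod_right * n_here

    n_internal, product = walk(tree)
    # A binary tree with n leaves has n - 1 internal nodes.
    return factorial(n_internal) // product

caterpillar4 = ((("A", "B"), "C"), "D")
balanced4 = (("A", "B"), ("C", "D"))
print(labeled_histories(caterpillar4))  # 1
print(labeled_histories(balanced4))     # 2
```

A caterpillar has a single labeled history because its coalescences can occur in only one order, whereas more balanced shapes permit several orderings.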

[110] MD Edge, P Gorroochurn, NA Rosenberg (2013) Windfalls and pitfalls: applications of population genetics to the search for disease genes. Evolution, Medicine, and Public Health 2013: 254-272. [Full-text at journal website] [PDF]

Association mapping can be viewed as an application of population genetics and evolutionary biology to the problem of identifying genes causally connected to phenotypes. However, some population-genetic principles important to the design and analysis of association studies have not been widely understood or have even been generally misunderstood. Some of these principles underlie techniques that can aid in the discovery of genetic variants that influence phenotypes ('windfalls'), whereas others can interfere with study design or interpretation of results ('pitfalls'). Here, considering examples involving genetic variant discovery, linkage disequilibrium, power to detect associations, population stratification and genotype imputation, we address misunderstandings in the application of population genetics to association studies, and we illuminate how some surprising results in association contexts can be easily explained when considered from evolutionary and population-genetic perspectives. Through our examples, we argue that population-genetic thinking — which takes a theoretical view of the evolutionary forces that guide the emergence and propagation of genetic variants — substantially informs the design and interpretation of genetic association studies. In particular, population-genetic thinking sheds light on genetic confounding, on the relationships between association signals of typed markers and causal variants, and on the advantages and disadvantages of particular strategies for measuring genetic variation in association studies.

[109] NA Rosenberg (2013) Discordance of species trees with their most likely gene trees: a unifying principle. Molecular Biology and Evolution 30: 2709-2713. [Full-text at journal website] [PDF]

A labeled gene tree topology that disagrees with a labeled species tree topology is said to be anomalous if it is more probable under a coalescent model for gene lineage evolution than the labeled gene tree topology that matches the species tree. It has previously been shown that as a consequence of short internal branches of the species tree, for every labeled species tree topology with five or more taxa, and for asymmetric four-taxon species tree topologies, an assignment of species tree branch lengths can be made which gives rise to anomalous gene trees (AGTs). Here, I offer an alternative characterization of this result — a labeled species tree topology produces AGTs if and only if it contains two consecutive internal branches in an ancestor-descendant relationship — and I provide a proof that follows from the change in perspective. The reformulation and alternative proof of the existence result for AGTs provide the insight that it is not merely short internal branches that generate AGTs, but instead, short internal branches that are arranged consecutively.

[108] P Zhang, X Zhan, NA Rosenberg, S Zöllner (2013) Genotype imputation reference panel selection using maximal phylogenetic diversity. Genetics 195: 319-330. [PDF]

The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the "most diverse reference panel," defined as the subset with the maximal "phylogenetic diversity," thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.

[107] NA Rosenberg, TJ Pemberton, JZ Li, JW Belmont (2013) Runs of homozygosity and parental relatedness. Genetics in Medicine 15: 753-754. [PDF]

(No abstract)

[106] L Huang, EO Buzbas, NA Rosenberg (2013) Genotype imputation in a coalescent model with infinitely-many-sites mutation. Theoretical Population Biology 87: 62-74. [PDF]

Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that, under this model, imputation accuracy — as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence — increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined.

[105] ZA Szpiech, J Xu, TJ Pemberton, W Peng, S Zöllner, NA Rosenberg, JZ Li (2013) Long runs of homozygosity are enriched for deleterious variation. American Journal of Human Genetics 93: 90-102. [PDF] [Supplement] [Data]

Exome sequencing offers the potential to study the population-genomic variables that underlie patterns of deleterious variation. Runs of homozygosity (ROH) are long stretches of consecutive homozygous genotypes probably reflecting segments shared identically by descent as the result of processes such as consanguinity, population size reduction, and natural selection. The relationship between ROH and patterns of predicted deleterious variation can provide insight into the way in which these processes contribute to the maintenance of deleterious variants. Here, we use exome sequencing to examine ROH in relation to the distribution of deleterious variation in 27 individuals of varying levels of apparent inbreeding from 6 human populations. A significantly greater fraction of all genome-wide predicted damaging homozygotes fall in ROH than would be expected from the corresponding fraction of nondamaging homozygotes in ROH (p < 0.001). This pattern is strongest for long ROH (p < 0.05). ROH, and especially long ROH, harbor disproportionately more deleterious homozygotes than would be expected on the basis of the total ROH coverage of the genome and the genomic distribution of nondamaging homozygotes. The results accord with a hypothesis that recent inbreeding, which generates long ROH, enables rare deleterious variants to exist in homozygous form. Thus, just as inbreeding can elevate the occurrence of rare recessive diseases that represent homozygotes for strongly deleterious mutations, inbreeding magnifies the occurrence of mildly deleterious variants as well.

[104] TJ Pemberton, M DeGiorgio, NA Rosenberg (2013) Population structure in a comprehensive data set on human microsatellite variation. G3: Genes, Genomes, Genetics 3: 891-907. [Full-text at journal website] [PDF] [Supplement] [Data (.zip) - File S1]

Over the past two decades, microsatellite genotypes have provided the data for landmark studies of human population-genetic variation. However, the various microsatellite data sets have been prepared with different procedures and sets of markers, so that it has been difficult to synthesize available data for a comprehensive analysis. Here, we combine eight human population-genetic data sets at the 645 microsatellite loci they share in common, accounting for procedural differences in the production of the different data sets, to assemble a single data set containing 5795 individuals from 267 worldwide populations. We perform a systematic analysis of genetic relatedness, detecting 240 intra-population and 92 inter-population pairs of previously unidentified close relatives and proposing standardized subsets of unrelated individuals for use in future studies. We then augment the human data with a data set of 84 chimpanzees at the 246 loci they share in common with the human samples. Multidimensional scaling and neighbor-joining analyses of these data sets offer new insights into the structure of human populations and enable a comparison of genetic variation patterns in chimpanzees with those in humans. Our combined data sets are the largest of their kind reported to date and provide a resource for use in human population-genetic studies.

[103] CV Than, NA Rosenberg (2013) Mathematical properties of the deep coalescence cost. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 61-72. [PDF]

In the minimizing-deep-coalescences (MDC) approach for species tree inference, a tree that has the minimal deep coalescence cost for reconciling a collection of gene trees is taken as an estimate of the species tree topology. The MDC method possesses the desirable Pareto property, and in practice it is quite accurate and computationally efficient. Here, in order to better understand the MDC method, we investigate some properties of the deep coalescence cost. We prove that the unit neighborhood of either a rooted species tree or a rooted gene tree under the deep coalescence cost is exactly the same as the tree's unit neighborhood under the rooted nearest-neighbor interchange (NNI) distance. Next, for a fixed species tree, we obtain the maximum deep coalescence cost across all gene trees as well as the number of gene trees that achieve the maximum cost. We also study corresponding problems for a fixed gene tree.
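
The deep coalescence cost is commonly defined as the total number of extra gene lineages summed over the branches of the species tree. The sketch below follows that common definition under simplifying assumptions that are not stated in the abstract: one sampled gene lineage per species, identical leaf sets in both trees, and trees written as nested 2-tuples. All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a node."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])

def all_clusters(tree):
    """Return the leaf sets of every node (leaves included) of a tree."""
    if not isinstance(tree, tuple):
        return [leaf_set(tree)]
    return all_clusters(tree[0]) + all_clusters(tree[1]) + [leaf_set(tree)]

def lineages_leaving(gene_tree, species_cluster):
    """Count maximal gene tree clades contained in a species cluster.

    Each such clade corresponds to one gene lineage that exits the top of
    the species tree branch above that cluster.
    """
    if leaf_set(gene_tree) <= species_cluster:
        return 1
    if not isinstance(gene_tree, tuple):
        return 0
    return sum(lineages_leaving(child, species_cluster) for child in gene_tree)

def deep_coalescence_cost(gene_tree, species_tree):
    """Sum of (lineages - 1) over all species tree nodes."""
    return sum(lineages_leaving(gene_tree, cluster) - 1
               for cluster in all_clusters(species_tree))

species = (("A", "B"), "C")
print(deep_coalescence_cost((("A", "B"), "C"), species))  # 0
print(deep_coalescence_cost((("A", "C"), "B"), species))  # 1
```

The recursion recomputes leaf sets repeatedly, which is acceptable for small illustrative trees but would need caching for large ones.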

[102] M Jakobsson, MD Edge, NA Rosenberg (2013) The relationship between F_ST and the frequency of the most frequent allele. Genetics 193: 515-528. [PDF]

F_ST is frequently used as a summary of genetic differentiation among groups. It has been suggested that F_ST depends on the allele frequencies at a locus, as it exhibits a variety of peculiar properties related to genetic diversity: higher values for biallelic single-nucleotide polymorphisms (SNPs) than for multiallelic microsatellites, low values among high-diversity populations viewed as substantially distinct, and low values for populations that differ primarily in their profiles of rare alleles. A full mathematical understanding of the dependence of F_ST on allele frequencies, however, has been elusive. Here, we examine the relationship between F_ST and the frequency of the most frequent allele, demonstrating that the range of values that F_ST can take is restricted considerably by the allele-frequency distribution. For a two-population model, we derive strict bounds on F_ST as a function of the frequency M of the allele with highest mean frequency between the pair of populations. Using these bounds, we show that for a value of M chosen uniformly between 0 and 1 at a multiallelic locus whose number of alleles is left unspecified, the mean maximum F_ST is ~0.3585. Further, F_ST is restricted to values much less than 1 when M is low or high, and the contribution to the maximum F_ST made by the most frequent allele is on average ~0.4485. Using bounds on homozygosity that we have previously derived as functions of M, we describe strict bounds on F_ST in terms of the homozygosity of the total population, finding that the mean maximum F_ST given this homozygosity is 1 - ln 2 ≈ 0.3069. Our results provide a conceptual basis for understanding the dependence of F_ST on allele frequencies and genetic diversity and for interpreting the roles of these quantities in computations of F_ST from population-genetic data.
Further, our analysis suggests that many unusual observations of F_ST, including the relatively low F_ST values in high-diversity human populations from Africa and the relatively low estimates of F_ST for microsatellites compared to SNPs, can be understood not as biological phenomena associated with different groups of populations or classes of markers but rather as consequences of the intrinsic mathematical dependence of F_ST on the properties of allele-frequency distributions.
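
The dependence of F_ST on allele frequencies can be illustrated with a small numerical example. The sketch below assumes the familiar two-population form F_ST = (H_T - H_S) / H_T with equal population weights, where H_S is the mean within-population heterozygosity and H_T is the heterozygosity of the pooled frequencies. The function name and the example frequency vectors are hypothetical, and the estimator used in the paper may differ in detail.

```python
from math import log

def fst_two_populations(p1, p2):
    """F_ST for two equally weighted populations at one locus.

    p1 and p2 are allele-frequency vectors over the same alleles, each
    summing to 1. Undefined (division by zero) if both populations are
    fixed for the same allele.
    """
    mean = [(a + b) / 2 for a, b in zip(p1, p2)]
    h_s = 1 - (sum(a * a for a in p1) + sum(b * b for b in p2)) / 2
    h_t = 1 - sum(m * m for m in mean)
    return (h_t - h_s) / h_t

# Two populations fixed for different alleles: M = 0.5 and F_ST = 1.
fixed = fst_two_populations([1.0, 0.0], [0.0, 1.0])

# Two populations sharing no alleles, each with four equally frequent
# alleles: M = 0.125 and F_ST is only 1/7, despite complete distinctness.
diverse = fst_two_populations([0.25] * 4 + [0.0] * 4, [0.0] * 4 + [0.25] * 4)

print(fixed, diverse)
print(round(1 - log(2), 4))  # numerical value of 1 - ln 2 quoted above
```

The second example shows why high-diversity populations that are substantially distinct can still produce low F_ST values.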

[101] NA Rosenberg (2013) Editorial. Theoretical Population Biology 83: A2-A3. [PDF]

(No abstract)

[100] M DeGiorgio, NA Rosenberg (2013) Geographic sampling scheme as a determinant of the major axis of genetic variation in principal components analysis. Molecular Biology and Evolution 30: 480-488. [PDF]

Principal component (PC) maps, which plot the values of a given PC estimated on the basis of allele frequency variation at the geographic sampling locations of a set of populations, are often used to investigate the properties of past range expansions. Some studies have argued that in a range expansion, the axis of greatest variation (i.e., the first PC) is parallel to the axis of expansion. In contrast, others have identified a pattern in which the axis of greatest variation is perpendicular to the axis of expansion. Here, we seek to understand this difference in outcomes by investigating the effect of the geographic sampling scheme on the direction of the axis of greatest variation under a two-dimensional range expansion model. From datasets simulated using each of two different schemes for the geographic sampling of populations under the model, we create PC maps for the first PC. We find that depending on the geographic sampling scheme, the axis of greatest variation can be either parallel or perpendicular to the axis of expansion. We provide an explanation for this result in terms of intra- and interpopulation coalescence times.

[99] NM Kopelman, L Stone, O Gascuel, NA Rosenberg (2013) The behavior of admixed populations in neighbor-joining inference of population trees. Pacific Symposium on Biocomputing 18: 273-284. [PDF]

Neighbor-joining is one of the most widely used methods for constructing evolutionary trees. This approach from phylogenetics is often employed in population genetics, where distance matrices obtained from allele frequencies are used to produce a representation of population relationships in the form of a tree. In phylogenetics, the utility of neighbor-joining derives partly from a result that for a class of distance matrices including those that are additive or tree-like — generated by summing weights over the edges connecting pairs of taxa in a tree to obtain pairwise distances — application of neighbor-joining recovers exactly the underlying tree. For populations within a species, however, migration and admixture can produce distance matrices that reflect more complex processes than those obtained from the bifurcating trees typical in the multispecies context. Admixed populations — populations descended from recent mixture of groups that have long been separated — have been observed to be located centrally in inferred neighbor-joining trees, with short external branches incident to the path connecting their source populations. Here, using a simple model, we explore mathematically the behavior of an admixed population under neighbor-joining. We show that with an additive distance matrix, a population admixed among two source populations necessarily lies on the path between the sources. Relaxing the additivity requirement, we examine the smallest nontrivial case — four populations, one of which is admixed between two of the other three — showing that the two source populations never merge with each other before one of them merges with the admixed population. Furthermore, the distance on the constructed tree between the admixed population and either source population is always smaller than the distance between the source populations, and the external branch for the admixed population is always incident to the path connecting the sources. 
We define three properties that hold for four taxa and that we hypothesize are satisfied under more general conditions: antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. Our findings can inform interpretations of neighbor-joining trees with admixed groups, and they provide an explanation for patterns observed in trees of human populations.
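
The notion of an additive distance matrix, and the first clustering step of neighbor-joining on such a matrix, can be sketched as follows. The quartet tree, its branch lengths, and all names are hypothetical values chosen for illustration. The example checks additivity with the four-point condition and selects the first pair to join with the standard neighbor-joining Q criterion; it does not model admixture.

```python
from itertools import combinations

# Hypothetical quartet tree ((A,B),(C,D)) with arbitrary branch lengths.
external = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
internal = 5.0
cherries = [{"A", "B"}, {"C", "D"}]

def tree_distance(x, y):
    """Sum the edge weights on the path between two taxa in the tree."""
    total = external[x] + external[y]
    if {x, y} not in cherries:
        total += internal  # the path crosses the internal branch
    return total

taxa = sorted(external)
dist = {frozenset(pair): tree_distance(*pair) for pair in combinations(taxa, 2)}

def d(x, y):
    return dist[frozenset((x, y))]

def satisfies_four_point(a, b, c, e):
    """Four-point condition: the two largest pairwise sums are equal."""
    sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
    return abs(sums[1] - sums[2]) < 1e-9

def nj_q(x, y):
    """Neighbor-joining Q criterion for joining taxa x and y."""
    def row_sum(t):
        return sum(d(t, u) for u in taxa if u != t)
    return (len(taxa) - 2) * d(x, y) - row_sum(x) - row_sum(y)

first_join = min(combinations(taxa, 2), key=lambda pair: nj_q(*pair))
print(satisfies_four_point(*taxa))  # True
print(first_join)                   # ('A', 'B')
```

For this additive matrix, the pair minimizing Q is a cherry of the generating tree, in line with the result that neighbor-joining recovers the underlying tree from additive distances.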

[98] LK Nakhleh, NA Rosenberg, T Warnow (2013) Phylogenomics and population genomics: models, algorithms, and analytical tools. Pacific Symposium on Biocomputing 18: 247-249. [PDF] (session introduction)

(No abstract)

[97] JH Degnan, NA Rosenberg, T Stadler (2012) A characterization of the set of species trees that produce anomalous ranked gene trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9: 1558-1568. [PDF]

Ranked gene trees, which consider both the gene tree topology and the sequence in which gene lineages separate, can potentially provide a new source of information for use in modeling genealogies and performing inference of species trees. Recently, we have calculated the probability distribution of ranked gene trees under the standard multispecies coalescent model for the evolution of gene lineages along the branches of a fixed species tree, demonstrating the existence of anomalous ranked gene trees (ARGTs), in which a ranked gene tree that does not match the ranked species tree can have greater probability under the model than the matching ranked gene tree. Here, we fully characterize the set of unranked species tree topologies that give rise to ARGTs, showing that this set contains all species tree topologies with five or more taxa, with the exceptions of caterpillars and pseudocaterpillars. The results have implications for the use of ranked gene trees in phylogenetic inference.

[96] EO Buzbas (2012) On the article titled "Estimating species trees using approximate Bayesian computation" (Fan and Kubatko, Molecular Phylogenetics and Evolution 59:354-363). Molecular Phylogenetics and Evolution 65: 1014-1016. [PDF]

(No abstract)

[95] C Wang, KB Schroeder, NA Rosenberg (2012) A maximum-likelihood method to correct for allelic dropout in microsatellite data with no replicate genotypes. Genetics 192: 651-669. [Full-text at journal website] [PDF] [Supplement] [Software]

Allelic dropout is a commonly observed source of missing data in microsatellite genotypes, in which one or both allelic copies at a locus fail to be amplified by the polymerase chain reaction. Especially for samples with poor DNA quality, this problem causes a downward bias in estimates of observed heterozygosity and an upward bias in estimates of inbreeding, owing to mistaken classifications of heterozygotes as homozygotes when one of the two copies drops out. One general approach for avoiding allelic dropout involves repeated genotyping of homozygous loci to minimize the effects of experimental error. Existing computational alternatives often require replicate genotyping as well. These approaches, however, are costly and are suitable only when enough DNA is available for repeated genotyping. In this study, we propose a maximum-likelihood approach together with an expectation-maximization algorithm to jointly estimate allelic dropout rates and allele frequencies when only one set of nonreplicated genotypes is available. Our method considers estimates of allelic dropout caused by both sample-specific factors and locus-specific factors, and it allows for deviation from Hardy-Weinberg equilibrium owing to inbreeding. Using the estimated parameters, we correct the bias in the estimation of observed heterozygosity through the use of multiple imputations of alleles in cases where dropout might have occurred. With simulated data, we show that our method can (1) effectively reproduce patterns of missing data and heterozygosity observed in real data; (2) correctly estimate model parameters, including sample-specific dropout rates, locus-specific dropout rates, and the inbreeding coefficient; and (3) successfully correct the downward bias in estimating the observed heterozygosity. We find that our method is fairly robust to violations of model assumptions caused by population structure and by genotyping errors from sources other than allelic dropout. 
Because the data sets imputed under our model can be investigated in additional subsequent analyses, our method will be useful for preparing data for applications in diverse contexts in population genetics and molecular ecology.
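
The downward bias in observed heterozygosity can be seen in a deliberately simplified calculation. The sketch below assumes that each allelic copy drops out independently with a single probability `dropout`, that a genotype is missing only when both copies drop out, and that a genotype with one surviving copy is scored as a homozygote. The model in the paper is richer (sample-specific and locus-specific rates, inbreeding), and the function names here are hypothetical.

```python
def observed_genotype_probs(true_het, dropout):
    """Probabilities of scoring a genotype as het, hom, or missing.

    true_het: true proportion of heterozygotes at the locus.
    dropout:  assumed per-copy dropout probability (single value here).
    """
    het = true_het * (1 - dropout) ** 2  # both copies of a heterozygote amplify
    missing = dropout ** 2               # both copies fail to amplify
    hom = 1 - het - missing              # true homozygotes plus masked heterozygotes
    return het, hom, missing

def apparent_heterozygosity(true_het, dropout):
    """Heterozygote proportion among non-missing genotypes."""
    het, hom, _ = observed_genotype_probs(true_het, dropout)
    return het / (het + hom)

print(apparent_heterozygosity(0.5, 0.0))  # 0.5, no bias without dropout
print(apparent_heterozygosity(0.5, 0.1))  # below 0.5
```

Under these assumptions the apparent heterozygosity equals the true value multiplied by (1 - dropout) / (1 + dropout), so the bias grows with the dropout rate.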

[94] C Wang, S Zöllner, NA Rosenberg (2012) A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics 8: e1002886. [Full-text at journal website] [PDF] [Supplement]

Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in giving rise to human population structure.

[93] TJ Pemberton, F-Y Li, EK Hanson, NU Mehta, S Choi, J Ballantyne, JW Belmont, NA Rosenberg, C Tyler-Smith, PI Patel (2012) Impact of restricted marital practices on genetic variation in an endogamous Gujarati group. American Journal of Physical Anthropology 149: 92-103. [PDF] [Supplement (.docx)] [Data]

Recent studies have examined the influence on patterns of human genetic variation of a variety of cultural practices. In India, centuries-old marriage customs have introduced extensive social structuring into the contemporary population, potentially with significant consequences for genetic variation. Social stratification in India is evident as social classes that are defined by endogamous groups known as castes. Within a caste, there exist endogamous groups known as gols (marriage circles), each of which comprises a small number of exogamous gotra (lineages). Thus, while consanguinity is strictly avoided and some randomness in mate selection occurs within the gol, gene flow is limited with groups outside the gol. Gujarati Patels practice this form of "exogamic endogamy." We have analyzed genetic variation in one such group of Gujarati Patels, the Chha Gaam Patels (CGP), who comprise individuals from six villages. Population structure analysis of 1,200 autosomal loci offers support for the existence of distinctive multilocus genotypes in the CGP with respect to both non-Gujaratis and other Gujaratis, and indicates that CGP individuals are genetically very similar. Analysis of Y-chromosomal and mitochondrial haplotypes provides support for both patrilocal and patrilineal practices within the gol, and a low level of female gene flow into the gol. Our study illustrates how the practice of gol endogamy has introduced fine-scale genetic structure into the population of India, and contributes more generally to an understanding of the way in which marriage practices affect patterns of genetic variation.

[92] EM Jewett*, M Zawistowski*, NA Rosenberg, S Zöllner (2012) A coalescent model for genotype imputation. Genetics 191: 1239-1255. [PDF]

The potential for imputed genotypes to enhance an analysis of genetic data depends largely on the accuracy of imputation, which in turn depends on properties of the reference panel of template haplotypes used to perform the imputation. To provide a basis for exploring how properties of the reference panel affect imputation accuracy theoretically rather than with computationally intensive imputation experiments, we introduce a coalescent model that considers imputation accuracy in terms of population-genetic parameters. Our model allows us to investigate sampling designs in the frequently occurring scenario in which imputation targets and templates are sampled from different populations. In particular, we derive expressions for expected imputation accuracy as a function of reference panel size and divergence time between the reference and target populations. We find that a modestly sized "internal" reference panel from the same population as a target haplotype yields, on average, greater imputation accuracy than a larger "external" panel from a different population, even if the divergence time between the two populations is small. The improvement in accuracy for the internal panel increases with increasing divergence time between the target and reference populations. Thus, in humans, our model predicts that imputation accuracy can be improved by generating small population-specific custom reference panels to augment existing collections such as those of the HapMap or 1000 Genomes Projects. Our approach can be extended to understand additional factors that affect imputation accuracy in complex population-genetic settings, and the results can ultimately facilitate improvements in imputation study designs.

[91] TJ Pemberton, D Absher, MW Feldman, RM Myers, NA Rosenberg, JZ Li (2012) Genomic patterns of homozygosity in worldwide human populations. American Journal of Human Genetics 91: 275-292. [PDF] [Main Supplement] [Supplementary Table 2 (.zip)] [Supplementary Table 3 (.zip)] [Supplementary Table 4 (.zip)] [Supplementary Table 5 (.zip)]

Genome-wide patterns of homozygosity runs and their variation across individuals provide a valuable and often untapped resource for studying human genetic diversity and evolutionary history. Using genotype data at 577,489 autosomal SNPs, we employed a likelihood-based approach to identify runs of homozygosity (ROH) in 1,839 individuals representing 64 worldwide populations, classifying them by length into three classes — short, intermediate, and long — with a model-based clustering algorithm. For each class, the number and total length of ROH per individual show considerable variation across individuals and populations. The total lengths of short and intermediate ROH per individual increase with the distance of a population from East Africa, in agreement with similar patterns previously observed for locus-wise homozygosity and linkage disequilibrium. By contrast, total lengths of long ROH show large interindividual variations that probably reflect recent inbreeding patterns, with higher values occurring more often in populations with known high frequencies of consanguineous unions. Across the genome, distributions of ROH are not uniform, and they have distinctive continental patterns. ROH frequencies across the genome are correlated with local genomic variables such as recombination rate, as well as with signals of recent positive selection. In addition, long ROH are more frequent in genomic regions harboring genes associated with autosomal-dominant diseases than in regions not implicated in Mendelian diseases. These results provide insight into the way in which homozygosity patterns are produced, and they generate baseline homozygosity patterns that can be used to aid homozygosity mapping of genes associated with recessive diseases.

[90] D Bryant, R Bouckaert, J Felsenstein, NA Rosenberg, A RoyChoudhury (2012) Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution 29: 1917-1932. [PDF] [Supplement]

The multispecies coalescent provides an elegant theoretical framework for estimating species trees and species demographics from genetic markers. However, practical applications of the multispecies coalescent model are limited by the need to integrate or sample over all gene trees possible for each genetic marker. Here we describe a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation, effectively integrating over all possible gene trees. The method applies to independent (unlinked) biallelic markers such as well-spaced single nucleotide polymorphisms, and we have implemented it in SNAPP, a Markov chain Monte Carlo sampler for inferring species trees, divergence dates, and population sizes. We report results from simulation experiments and from an analysis of 1997 amplified fragment length polymorphism loci in 69 individuals sampled from six species of Ourisia (New Zealand native foxglove).

[89] LJ Helmkamp, EM Jewett, NA Rosenberg (2012) Improvements to a class of distance matrix methods for inferring species trees from gene trees. Journal of Computational Biology 19: 632-649. [PDF]

Among the methods currently available for inferring species trees from gene trees, the GLASS method of Mossel and Roch (2010), the Shallowest Divergence (SD) method of Maddison and Knowles (2006), the STEAC method of Liu et al. (2009), and a related method that we call Minimum Average Coalescence (MAC) are computationally efficient and provide branch length estimates. Further, GLASS and STEAC have been shown to be consistent estimators of tree topology under a multispecies coalescent model. However, divergence time estimates obtained with these methods are all systematically biased under the model because the pairwise interspecific gene divergence times on which they rely must be more ancient than the species divergence time. Jewett and Rosenberg (2012) derived an expression for the bias of GLASS and used it to propose an improved method that they termed iGLASS. Here, we derive the biases of SD, STEAC, and MAC, and we propose improved analogues of these methods that we call iSD, iSTEAC, and iMAC. We conduct simulations to compare the performance of these methods with their original counterparts and with GLASS and iGLASS, finding that each of them decreases the bias and mean squared error of pairwise divergence time estimates. The new methods can therefore contribute to improvements in the estimation of species trees from information on gene trees.

[88] EM Jewett, NA Rosenberg (2012) iGLASS: an improvement to the GLASS method for estimating species trees from gene trees. Journal of Computational Biology 19: 293-315. [PDF]

Several methods have been designed to infer species trees from gene trees while taking into account gene tree/species tree discordance. Although some of these methods provide consistent species tree topology estimates under a standard model, most either do not estimate branch lengths or are computationally slow. An exception, the GLASS method of Mossel and Roch, is consistent for the species tree topology, estimates branch lengths, and is computationally fast. However, GLASS systematically overestimates divergence times, leading to biased estimates of species tree branch lengths. By assuming a multispecies coalescent model in which multiple lineages are sampled from each of two taxa at L independent loci, we derive the distribution of the waiting time until the first interspecific coalescence occurs between the two taxa, considering all loci and measuring from the divergence time. We then use the mean of this distribution to derive a correction to the GLASS estimator of pairwise divergence times. We show that our improved estimator, which we call iGLASS, consistently estimates the divergence time between a pair of taxa as the number of loci approaches infinity, and that it is an unbiased estimator of divergence times when one lineage is sampled per taxon. We also show that many commonly used clustering methods can be combined with the iGLASS estimator of pairwise divergence times to produce a consistent estimator of the species tree topology. Through simulations, we show that iGLASS can greatly reduce the bias and mean squared error in obtaining estimates of divergence times in a species tree.
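
As a rough illustration of the bias and its correction in the simplest case mentioned above (one lineage sampled per taxon), the following Python sketch simulates pairwise interspecific coalescence times. It assumes that time is measured in coalescent units with a constant ancestral population size, so that each locus coalesces an Exp(1) waiting time before the divergence. The function names and parameter values are illustrative assumptions and are not taken from the paper.

```python
import random

def glass_estimate(tau, num_loci, rng):
    """Minimum interspecific coalescence time across loci (GLASS-style).

    Assumes one lineage per taxon, so each locus coalesces at tau plus an
    Exp(1) waiting time in coalescent units (an assumption of this sketch).
    """
    return min(tau + rng.expovariate(1.0) for _ in range(num_loci))

def corrected_estimate(tau, num_loci, rng):
    """Subtract the mean waiting time to the first interspecific coalescence.

    The minimum of L independent Exp(1) variables is Exp(L), with mean 1/L.
    """
    return glass_estimate(tau, num_loci, rng) - 1.0 / num_loci

def mean_of(estimator, tau, num_loci, reps, seed=1):
    """Average an estimator over independent simulated data sets."""
    rng = random.Random(seed)
    return sum(estimator(tau, num_loci, rng) for _ in range(reps)) / reps

tau, num_loci, reps = 1.0, 10, 20000  # hypothetical values
raw_mean = mean_of(glass_estimate, tau, num_loci, reps)
corrected_mean = mean_of(corrected_estimate, tau, num_loci, reps)
print(f"true tau = {tau}, mean GLASS = {raw_mean:.3f}, "
      f"mean corrected = {corrected_mean:.3f}")
```

In this simplified setting the uncorrected minimum overshoots the divergence time by about 1/L on average, and subtracting that mean removes the bias. The general correction with multiple lineages per taxon is not reproduced here.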

[87] SB Reddy, NA Rosenberg (2012) Refining the relationship between homozygosity and the frequency of the most frequent allele. Journal of Mathematical Biology 64: 87-108. [PDF]

Recent work has established that for an arbitrary genetic locus with its number of alleles unspecified, the homozygosity of the locus confines the frequency of the most frequent allele within a narrow range, and vice versa. Here we extend beyond this limiting case by investigating the relationship between homozygosity and the frequency of the most frequent allele when the number of alleles at the locus is treated as known. Given the homozygosity of a locus with at most K alleles, we find that by taking into account the value of K, the width of the allowed range for the frequency of the most frequent allele decreases from 2/3 - π²/18 ≈ 0.1184 to 1/3 - 1/(3K) - {K/[3(K-1)]} ∑_{k=2}^{K} 1/k². We further show that properties of the relationship between homozygosity and the frequency of the most frequent allele in the unspecified-K case can be obtained from the specified-K case by taking limits as K→∞. The results contribute to a greater understanding of the mathematical properties of fundamental statistics employed in population-genetic analysis.
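
The two width expressions quoted above can be checked numerically. The short Python sketch below evaluates the specified-K expression for several values of K and compares it with the unspecified-K value; the function names are chosen here for illustration.

```python
import math

def width_unspecified_k():
    """Width expression for a locus with an unspecified number of alleles."""
    return 2.0 / 3.0 - math.pi ** 2 / 18.0

def width_specified_k(k):
    """Width expression for a locus with at most k alleles (k >= 2)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    partial_sum = sum(1.0 / j ** 2 for j in range(2, k + 1))
    return 1.0 / 3.0 - 1.0 / (3.0 * k) - (k / (3.0 * (k - 1))) * partial_sum

for k in (2, 3, 5, 10, 100, 10000):
    print(k, round(width_specified_k(k), 4))
print("unspecified K:", round(width_unspecified_k(), 4))
```

For K = 2 the expression evaluates to zero, which agrees with the fact that the homozygosity of a biallelic locus determines the frequency of the more frequent allele exactly. As K grows, the values increase toward 2/3 - π²/18.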

[86] FA San Lucas, NA Rosenberg, P Scheet (2012) Haploscope: a tool for the graphical display of haplotype structure in populations. Genetic Epidemiology 36: 17-21. [PDF]

Patterns of linkage disequilibrium are often depicted pictorially by using tools that rely on visualizations of raw data or pairwise correlations among individual markers. Such approaches can fail to highlight some of the more interesting and complex features of haplotype structure. To enable natural visual comparisons of haplotype structure across subgroups of a population (e.g. isolated subpopulations or cases and controls), we propose an alternative visualization that provides a novel graphical representation of haplotype frequencies. We introduce Haploscope, a tool for visualizing the haplotype cluster frequencies that are produced by statistical models for population haplotype variation. We demonstrate the utility of our technique by examining haplotypes around the LCT gene, an example of recent positive selection, in samples from the Human Genome Diversity Panel. Haploscope, which has flexible options for annotation and inspection of haplotypes, is available for download at http://scheet.org/software.

[85] JH Degnan, NA Rosenberg, T Stadler (2012) The probability distribution of ranked gene trees on a species tree. Mathematical Biosciences 235: 45-55. [PDF]

The properties of random gene tree topologies have recently been studied under a coalescent model that treats a species tree as a fixed parameter. Here we develop the analogous theory for random ranked gene tree topologies, in which both the topology and the sequence of coalescences for a random gene tree are considered. We derive the probability distribution of ranked gene tree topologies conditional on a fixed species tree. We then show that, similar to the unranked case, ranked gene trees that do not match either the ranking or the topology of the species tree can have greater probability than the matching ranked gene tree.

[84] NA Rosenberg (2011) A population-genetic perspective on the similarities and differences among worldwide human populations. Human Biology 83: 659-684. [PDF]

Recent studies have produced a variety of advances in the investigation of genetic similarities and differences among human populations. Here, I pose a series of questions about human population-genetic similarities and differences, and I then answer these questions by numerical computation with a single shared population-genetic data set. The collection of answers obtained provides an introductory perspective for understanding key results on the features of worldwide human genetic variation.

[83] L Huang*, M Jakobsson*, TJ Pemberton, M Ibrahim, T Nyambo, S Omar, JK Pritchard, SA Tishkoff, NA Rosenberg (2011) Haplotype variation and genotype imputation in African populations. Genetic Epidemiology 35: 766-780. [PDF] [Supplement] [Data]

Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations — for example, by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry.

[82] P Verdu, NA Rosenberg (2011) A general mechanistic model for admixture histories of hybrid populations. Genetics 189: 1413-1426. [PDF] [Supplementary Text] [Supplementary Figure 1] [Supplementary Table 1]

Admixed populations have been used for inferring migrations, detecting natural selection, and finding disease genes. These applications often use a simple statistical model of admixture rather than a modeling perspective that incorporates a more realistic history of the admixture process. Here, we develop a general model of admixture that mechanistically accounts for complex historical admixture processes. We consider two source populations contributing to the ancestry of a hybrid population, potentially with variable contributions across generations. For a random individual in the hybrid population at a given point in time, we study the fraction of genetic admixture originating from a specific one of the source populations by computing its moments as functions of time and of introgression parameters. We show that very different admixture processes can produce identical mean admixture proportions, but that such processes produce different values for the variance of the admixture proportion. When introgression parameters from each source population are constant over time, the long-term limit of the expectation of the admixture proportion depends only on the ratio of the introgression parameters. The variance of admixture decreases quickly over time after the source populations stop contributing to the hybrid population, but remains substantial when the contributions are ongoing. Our approach will facilitate the understanding of admixture mechanisms, illustrating how the moments of the distribution of admixture proportions can be informative about the historical admixture processes contributing to the genetic diversity of hybrid populations.
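
The behavior of the mean admixture proportion described above can be sketched with a simple recursion. The Python example below assumes that, in each generation, each parent of a hybrid individual is drawn from source population 1, from the previous hybrid generation, or from source population 2 with probabilities s1, h, and s2 (summing to one), so that the expected admixture fraction from source 1 updates as s1 + h times its previous value. The symbols, function name, and numerical values are assumptions made for this sketch and may differ from the notation of the paper.

```python
def mean_admixture_trajectory(contributions):
    """Expected admixture fraction from source 1 after each generation.

    contributions: list of (s1, s2) pairs, one per generation. The hybrid
    contribution is h = 1 - s1 - s2 (hypothetical parameterization).
    """
    mean = 0.0
    trajectory = []
    for generation, (s1, s2) in enumerate(contributions):
        if s1 < 0 or s2 < 0 or s1 + s2 > 1 + 1e-12:
            raise ValueError("s1 and s2 must be nonnegative and sum to at most 1")
        h = 1.0 - s1 - s2
        if generation == 0 and abs(h) > 1e-12:
            # No hybrid population exists yet at the founding generation.
            raise ValueError("founders must come entirely from the sources")
        mean = s1 + h * mean
        trajectory.append(mean)
    return trajectory

# A single founding pulse followed by isolation (hypothetical values).
single_pulse = mean_admixture_trajectory([(0.3, 0.7)] + [(0.0, 0.0)] * 400)
# Ongoing contributions with constant parameters (hypothetical values).
ongoing = mean_admixture_trajectory([(0.5, 0.5)] + [(0.03, 0.07)] * 400)
print(single_pulse[-1], ongoing[-1])
```

Under these assumptions both histories approach a mean of 0.3, illustrating that distinct processes can share the same expected admixture proportion, and that with constant parameters the limit is s1/(s1 + s2). The variance, which distinguishes such histories, is not modeled in this sketch.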

[81] S Ramachandran, NA Rosenberg (2011) A test of the influence of continental axes of orientation on patterns of human gene flow. American Journal of Physical Anthropology 146: 515-529. [PDF] [Supplementary Figure 1] [Supplementary Figure 2] [Supplementary Tables]

The geographic distribution of genetic variation reflects trends in past population migrations and can be used to make inferences about these migrations. It has been proposed that the east-west orientation of the Eurasian landmass facilitated the rapid spread of ancient technological innovations across Eurasia, while the north-south orientation of the Americas led to a slower diffusion of technology there. If the diffusion of technology was accompanied by gene flow, then this hypothesis predicts that genetic differentiation in the Americas along lines of longitude will be greater than that in Eurasia along lines of latitude. We use 678 microsatellite loci from 68 indigenous populations in Eurasia and the Americas to investigate the spatial axes that underlie population-genetic variation. We find that genetic differentiation increases more rapidly along lines of longitude in the Americas than along lines of latitude in Eurasia. Distance along lines of latitude explains a sizeable portion of genetic distance in Eurasia, whereas distance along lines of longitude does not explain a large proportion of Eurasian genetic variation. Genetic differentiation in the Americas occurs along both latitudinal and longitudinal axes and has a greater magnitude than corresponding differentiation in Eurasia, even when adjusting for the lower level of genetic variation in the American populations. These results support the view that continental orientation has influenced migration patterns and has played an important role in determining both the structure of human genetic variation and the distribution and spread of cultural traits.

[80] M DeGiorgio, JH Degnan, NA Rosenberg (2011) Coalescence-time distributions in a serial founder model of human evolutionary history. Genetics 189: 579-593. [PDF]

Simulation studies have demonstrated that a variety of patterns in worldwide genetic variation are compatible with the trends predicted by a serial founder model, in which populations expand outward from an initial source via a process in which new populations contain only subsets of the genetic diversity present in their parental populations. Here, we provide analytical results for key quantities under the serial founder model, deriving distributions of coalescence times for pairs of lineages sampled either from the same population or from different populations. We use these distributions to obtain expectations for coalescence times and for homozygosity and heterozygosity values. A predicted approximate linear decline in expected heterozygosity with increasing distance from the source population reproduces a pattern that has been observed both in human genetic data and in simulations. Our formulas predict that populations close to the source location have lower between-population gene identity than populations far from the source, also mirroring results obtained from data and simulations. We show that different models that produce similar declining patterns in heterozygosity generate quite distinct patterns in coalescence-time distributions and gene identity measures, thereby providing a basis for distinguishing these models. We interpret the theoretical results in relation to their implications for human population genetics.

[79] SM Boca, NA Rosenberg (2011) Mathematical properties of F_st between admixed populations and their parental source populations. Theoretical Population Biology 80: 208-216. [PDF]

We consider the properties of the F_st measure of genetic divergence between an admixed population and its parental source populations. Among all possible populations admixed among an arbitrary set of parental populations, we show that the value of F_st between an admixed population and a specific source population is maximized when the admixed population is simply the most distant of the other source populations. For the case with only two parental populations, as a function of the admixture fraction, we further demonstrate that this F_st value is monotonic and convex, so that F_st is informative about the admixture fraction. We illustrate our results using example human population-genetic data, showing how they provide a framework in which to interpret the features of F_st in admixed populations.
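
For the two-source case, the monotonic dependence of F_st on the admixture fraction can be illustrated numerically. The Python sketch below uses one common two-population form, F_st = (H_T - H_S)/H_T with equal population weights at a single biallelic locus; this choice of definition, the function names, and the allele frequencies are assumptions of the example and not necessarily the formulation used in the paper.

```python
def fst_two_populations(p_a, p_b):
    """F_st = (H_T - H_S) / H_T for two equally weighted populations."""
    p_bar = (p_a + p_b) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
    if h_total == 0.0:
        return 0.0  # no variation at the locus
    return (h_total - h_within) / h_total

def fst_admixed_vs_source1(gamma, p_source1, p_source2):
    """F_st between source 1 and a population with fraction gamma from source 1."""
    p_admixed = gamma * p_source1 + (1.0 - gamma) * p_source2
    return fst_two_populations(p_source1, p_admixed)

p1, p2 = 0.2, 0.8  # hypothetical source allele frequencies
gammas = [i / 10.0 for i in range(11)]
values = [fst_admixed_vs_source1(g, p1, p2) for g in gammas]
for g, v in zip(gammas, values):
    print(f"gamma = {g:.1f}  F_st = {v:.4f}")
```

In this example F_st falls steadily from its value between the two sources (at gamma = 0) to zero (at gamma = 1), so a given F_st value corresponds to a single admixture fraction.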

[78] ZA Szpiech, NA Rosenberg (2011) On the size distribution of private microsatellite alleles. Theoretical Population Biology 80: 100-113. [PDF]

Private microsatellite alleles tend to be found in the tails rather than in the interior of the allele size distribution. To explain this phenomenon, we have investigated the size distribution of private alleles in a coalescent model of two populations, assuming the symmetric stepwise mutation model as the mode of microsatellite mutation. For the case in which four alleles are sampled, two from each population, we condition on the configuration in which three distinct allele sizes are present, one of which is common to both populations, one of which is private to one population, and the third of which is private to the other population. Conditional on this configuration, we calculate the probability that the two private alleles occupy the two tails of the size distribution. This probability, which increases as a function of mutation rate and divergence time between the two populations, is seen to be greater than the value that would be predicted if there was no relationship between privacy and location in the allele size distribution. In accordance with the prediction of the model, we find that in pairs of human populations, the frequency with which private microsatellite alleles occur in the tails of the allele size distribution increases as a function of genetic differentiation between populations.

[77] Z Yang, M Rosenthal, NA Rosenberg, S Talarico, L Zhang, C Marrs, VO Thomsen, T Lillebaek, AB Andersen (2011) How dormant is Mycobacterium tuberculosis during latency? A study integrating genomics and molecular epidemiology. Infection, Genetics and Evolution 11: 1164-1167. [PDF]

Mycobacterium tuberculosis may survive for decades in the human body in a state termed latent tuberculosis infection (LTBI). We investigated the occurrence during LTBI of insertion/deletion events in a selected set of mononucleotide simple sequence repeats, DNA sequence changes in four M. tuberculosis genes, and large sequence variations in 4750 M. tuberculosis open reading frames. We studied 13 paired M. tuberculosis clinical isolates, with each pair representing a reactivation of LTBI more than three decades after primary infection. Absence of sequence variations between paired isolates in nearly all investigated loci suggests a low likelihood of bacterial replication during LTBI.

[76] EO Buzbas, P Joyce, NA Rosenberg (2011) Inference on the strength of balancing selection for epistatically interacting loci. Theoretical Population Biology 79: 102-113. [PDF]

Existing inference methods for estimating the strength of balancing selection in multi-locus genotypes rely on the assumption that there are no epistatic interactions between loci. Complex systems in which balancing selection is prevalent, such as sets of human immune system genes, are known to contain components that interact epistatically. Therefore, current methods may not produce reliable inference on the strength of selection at these loci. In this paper, we address this problem by presenting statistical methods that can account for epistatic interactions in making inference about balancing selection. A theoretical result due to Fearnhead (2006) is used to build a multi-locus Wright-Fisher model of balancing selection, allowing for epistatic interactions among loci. Antagonistic and synergistic types of interactions are examined. The joint posterior distribution of the selection and mutation parameters is sampled by Markov chain Monte Carlo methods, and the plausibility of models is assessed via Bayes factors. As a component of the inference process, an algorithm to generate multi-locus allele frequencies under balancing selection models with epistasis is also presented. Recent evidence on interactions among a set of human immune system genes is introduced as a motivating biological system for the epistatic model, and data on these genes are used to demonstrate the methods.

[75] CV Than, NA Rosenberg (2011) Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology 18: 1-15. [PDF]

Methods for inferring species trees from sets of gene trees need to account for the possibility of discordance among the gene trees. Assuming that discordance is caused by incomplete lineage sorting, species tree estimates can be obtained by finding those species trees that minimize the number of "deep" coalescence events required for a given collection of gene trees. Efficient algorithms now exist for applying the minimizing-deep-coalescence (MDC) criterion, and simulation experiments have demonstrated its promising performance. However, it has also been noted from simulation results that the MDC criterion is not always guaranteed to infer the correct species tree estimate. In this article, we investigate the consistency of the MDC criterion. Using the multispecies coalescent model, we show that there are indeed anomaly zones for the MDC criterion for asymmetric four-taxon species tree topologies, and for all species tree topologies with five or more taxa.

[74] M DeGiorgio*, I Jankovic*, NA Rosenberg (2010) Unbiased estimation of gene diversity in samples containing related individuals: exact variance and arbitrary ploidy. Genetics 186: 1367-1387. [PDF]

Gene diversity, a commonly used measure of genetic variation, evaluates the proportion of heterozygous individuals expected at a locus in a population, under the assumption of Hardy-Weinberg equilibrium. When using the standard estimator of gene diversity, the inclusion of related or inbred individuals in a sample produces a downward bias. Here, we extend a recently developed estimator shown to be unbiased in a diploid autosomal sample that includes known related or inbred individuals to the general case of arbitrary ploidy. We derive an exact formula for the variance of the new estimator, \tilde{H}, and present an approximation to facilitate evaluation of the variance when each individual is related to at most one other individual in a sample. When examining samples from the human X chromosome, which represent a mixture of haploid and diploid individuals, we find that \tilde{H} performs favorably compared to the standard estimator, both in theoretical computations of mean squared error and in data analysis. We thus propose that \tilde{H} is a useful tool in characterizing gene diversity in samples of arbitrary ploidy that contain related or inbred individuals.
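
For reference, the standard estimator mentioned above is commonly written as n/(n-1) times one minus the sum of squared sample allele frequencies, where n is the number of sampled allele copies. The Python sketch below implements only this standard estimator; the corrected estimator \tilde{H} is not reproduced, and the toy samples are hypothetical.

```python
from collections import Counter

def standard_gene_diversity(alleles):
    """Standard estimator: n/(n-1) * (1 - sum of squared sample frequencies)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two allele copies are required")
    counts = Counter(alleles)
    sum_squares = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_squares)

# Hypothetical sample of four distinct allele copies.
unrelated_sample = ["A", "B", "C", "D"]
# The same sample plus two copies that repeat alleles already present,
# a crude stand-in for alleles shared with a relative.
sample_with_shared_copies = ["A", "B", "C", "D", "A", "B"]

estimate_unrelated = standard_gene_diversity(unrelated_sample)
estimate_shared = standard_gene_diversity(sample_with_shared_copies)
print(estimate_unrelated, estimate_shared)
```

The second sample yields a lower estimate because the repeated copies add no independent information about the population, which is the intuition behind the downward bias described above.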

[73] TJ Pemberton, C Wang, JZ Li, NA Rosenberg (2010) Inference of unexpected genetic relatedness among individuals in HapMap Phase III. American Journal of Human Genetics 87: 457-464. [PDF] [Supplement]

The International Haplotype Map Project (HapMap) has provided an essential database for studies of human population genetics and genome-wide association. Phases I and II of the HapMap project generated genotype data across ~3 million SNP loci in 270 individuals representing four populations. Phase III provides dense genotype data on ~1.5 million SNPs, generated by Illumina and Affymetrix platforms in a larger set of individuals. Release 3 of phase III of the HapMap contains 1397 individuals from 11 populations, including 250 of the original 270 phase I and phase II individuals and 1147 additional individuals. Although some known relationships among the phase III individuals have been described in the data release, the genotype data that are currently available provide an opportunity to empirically ascertain previously unknown relationships. We performed a systematic analysis of genetic relatedness and were able not only to confirm the reported relationships, but also to detect numerous additional, previously unidentified pairs of close relatives in the HapMap sample. The inferred relative pairs make it possible to propose standardized subsets of unrelated individuals for use in future studies in which relatedness needs to be clearly defined.

[72] E Borràs*, M Pineda*, I Blanco, EM Jewett, F Wang, A Teulé, T Caldés, M Urioste, C Martínez-Bouzas, J Brunet, J Balmaña, A Torres, T Ramón y Cajal, J Sanz, L Pérez-Cabornero, S Castellví-Bel, A Alonso, A Lanas, S González, V Moreno, SB Gruber, NA Rosenberg, B Mukherjee, C Lázaro, G Capellá (2010) MLH1 founder mutations with moderate penetrance in Spanish Lynch syndrome families. Cancer Research 70: 7379-7391. [PDF] [Supplementary Figure 1] [Supplementary Table 1] [Supplementary Table 2] [Supplementary Methods]

The variants c.306+5G>A and c.1865T>A (p.Leu622His) of the DNA repair gene MLH1 occur frequently in Spanish Lynch syndrome families. To understand their ancestral history and clinical effect, we performed functional assays and a penetrance analysis and studied their genetic and geographic origins. Detailed family histories were taken from 29 carrier families. Functional analysis included in silico and in vitro assays at the RNA and protein levels. Penetrance was calculated using a modified segregation analysis adjusted for ascertainment. Founder effects were evaluated by haplotype analysis. The identified MLH1 c.306+5G>A and c.1865T>A (p.Leu622His) variants are absent in control populations and segregate with the disease. Tumors from carriers of both variants show microsatellite instability and loss of expression of the MLH1 protein. The c.306+5G>A variant is a pathogenic mutation affecting mRNA processing. The c.1865T>A (p.Leu622His) variant causes defects in MLH1 expression and stability. For both mutations, the estimated penetrance is moderate (age-cumulative colorectal cancer risk by age 70 of 20.1% and 14.1% for c.306+5G>A and of 6.8% and 7.3% for c.1865T>A in men and women carriers, respectively) in the lower range of variability estimated for other pathogenic Spanish MLH1 mutations. A common haplotype was associated with each of the identified mutations, confirming their founder origin. The ages of c.306+5G>A and c.1865T>A mutations were estimated to be 53 to 122 and 12 to 22 generations, respectively. Our results confirm the pathogenicity, moderate penetrance, and founder origin of the MLH1 c.306+5G>A and c.1865T>A mutations. These findings have important implications for genetic counseling and molecular diagnosis of Lynch syndrome.

[71] NA Rosenberg (2010) Review of First Peoples in a New World: Colonizing Ice Age America by DJ Meltzer. Quarterly Review of Biology 85: 380-381. [PDF]

(No abstract)

[70] I Jankovic, BM vonHoldt, NA Rosenberg (2010) Heterozygosity of the Yellowstone wolves. Molecular Ecology 19: 3246-3249. [PDF]

(No abstract)

[69] NA Rosenberg, L Huang*, EM Jewett*, ZA Szpiech*, I Jankovic*, M Boehnke (2010) Genome-wide association studies in diverse populations. Nature Reviews Genetics 11: 356-366. [PDF]

Genome-wide association (GWA) studies have identified a large number of SNPs associated with disease phenotypes. As most GWA studies have been performed in populations of European descent, this Review examines the issues involved in extending the consideration of GWA studies to diverse worldwide populations. Although challenges exist with issues such as imputation, admixture and replication, investigation of a greater diversity of populations could make substantial contributions to the goal of mapping the genetic determinants of complex diseases for the human population as a whole.

[68] NA Rosenberg, JH Degnan (2010) Coalescent histories for discordant gene trees and species trees. Theoretical Population Biology 77: 145-151. [PDF]

Given a gene tree and a species tree, a coalescent history is a list of the branches of the species tree on which coalescences in the gene tree take place. Each pair consisting of a gene tree topology and a species tree topology has some number of possible coalescent histories. Here we show that, for each n ≥ 7, there exist a species tree topology S and a gene tree topology G ≠ S, both with n leaves, for which the number of coalescent histories exceeds the corresponding number of coalescent histories when the species tree topology is S and the gene tree topology is also S. This result has the interpretation that the gene tree topology G discordant with the species tree topology S can be produced by the evolutionary process in more ways than can the gene tree topology that matches the species tree topology, providing further insight into the surprising combinatorial properties of gene trees that arise from their joint consideration with species trees.

[67] M DeGiorgio, JH Degnan (2010) Fast and consistent estimation of species trees using supermatrix rooted triples. Molecular Biology and Evolution 27: 552-569. [PDF]

Concatenated sequence alignments are often used to infer species-level relationships. Previous studies have shown that analysis of concatenated data using maximum likelihood (ML) can produce misleading results when loci have differing gene tree topologies due to incomplete lineage sorting. Here, we develop a polynomial time method that utilizes the modified mincut supertree algorithm to construct an estimated species tree from inferred rooted triples of concatenated alignments. We term this method SuperMatrix Rooted Triple (SMRT) and use the notation SMRT-ML when rooted triples are inferred by ML. We use simulations to investigate the performance of SMRT-ML under Jukes-Cantor and general time-reversible substitution models for four- and five-taxon species trees and also apply the method to an empirical data set of yeast genes. We find that SMRT-ML converges to the correct species tree in many cases in which ML on the full concatenated data set fails to do so. SMRT-ML can be conservative in that its output tree is often partially unresolved for problematic clades. We show analytically that when the species tree is clocklike and mutations occur under the Cavender-Farris-Neyman substitution model, as the number of genes increases, SMRT-ML is increasingly likely to infer the correct species tree even when the most likely gene tree does not match the species tree. SMRT-ML is therefore a computationally efficient and statistically consistent estimator of the species tree when gene trees are distributed according to the multispecies coalescent model.

[66] C Wang, ZA Szpiech, JH Degnan, M Jakobsson, TJ Pemberton, JA Hardy, AB Singleton, NA Rosenberg (2010) Comparing spatial maps of human population-genetic variation using Procrustes analysis. Statistical Applications in Genetics and Molecular Biology 9: 13. [PDF]

Recent applications of principal components analysis (PCA) and multidimensional scaling (MDS) in human population genetics have found that "statistical maps" based on the genotypes in population-genetic samples often resemble geographic maps of the underlying sampling locations. To provide formal tests of these qualitative observations, we describe a Procrustes analysis approach for quantitatively assessing the similarity of population-genetic and geographic maps. We confirm in two scenarios, one using single-nucleotide polymorphism (SNP) data from Europe and one using SNP data worldwide, that a measurably high level of concordance exists between statistical maps of population-genetic variation and geographic maps of sampling locations. Two other examples illustrate the versatility of the Procrustes approach in population-genetic applications, verifying the concordance of SNP analyses using PCA and MDS, and showing that statistical maps of worldwide copy-number variants (CNVs) accord with statistical maps of SNP variation, especially when CNV analysis is limited to samples with the highest-quality data. As statistical maps with PCA and MDS have become increasingly common for use in summarizing population relationships, our examples highlight the potential of Procrustes-based quantitative comparisons for interpreting the results in these maps.
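
A minimal sketch of a Procrustes-style comparison of two two-dimensional maps is given below in Python. It treats points as complex numbers, so that the best rotation plus uniform scaling of one centered map onto the other is a single complex multiplier. This simplified version does not allow reflections and omits any significance test, so it may differ in detail from the statistic used in the paper; the function name and coordinates are hypothetical.

```python
import cmath
import math

def procrustes_similarity(map_a, map_b):
    """Similarity in [0, 1] between two 2D maps with points in matching order.

    Both maps are centered, then map_a is optimally rotated and scaled onto
    map_b. The returned value is sqrt(1 - D), where D is the normalized
    residual sum of squares after the fit.
    """
    if len(map_a) != len(map_b) or len(map_a) < 2:
        raise ValueError("maps must have the same number (>= 2) of points")
    a = [complex(x, y) for x, y in map_a]
    b = [complex(x, y) for x, y in map_b]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a = [z - mean_a for z in a]
    b = [z - mean_b for z in b]
    ss_a = sum(abs(z) ** 2 for z in a)
    ss_b = sum(abs(z) ** 2 for z in b)
    if ss_a == 0.0 or ss_b == 0.0:
        raise ValueError("a map with all points identical cannot be compared")
    cross = sum(za.conjugate() * zb for za, zb in zip(a, b))
    return abs(cross) / math.sqrt(ss_a * ss_b)

# Hypothetical "geographic" coordinates and a rotated, scaled, shifted copy.
geographic = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
transform = 2.5 * cmath.exp(1j * math.radians(40.0))
shifted = [transform * complex(x, y) + complex(5.0, -3.0) for x, y in geographic]
statistical = [(z.real, z.imag) for z in shifted]

similarity_same_shape = procrustes_similarity(geographic, statistical)
similarity_mismatched = procrustes_similarity(geographic, geographic[::-1])
print(similarity_same_shape, similarity_mismatched)
```

A map that differs from another only by rotation, scaling, and translation scores 1, whereas pairing the points in the wrong order gives a lower score.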

[65] JT Mosher, TJ Pemberton, K Harter, C Wang, EO Buzbas, P Dvorak, C Simon, SJ Morrison, NA Rosenberg (2010) Lack of population diversity in commonly used human embryonic stem-cell lines. New England Journal of Medicine 362: 183-185. [PDF] [Supplement]

(No abstract)

[64] TJ Pemberton, CI Sandefur, M Jakobsson, NA Rosenberg (2009) Sequence determinants of human microsatellite variability. BMC Genomics 10: 612. [Full text at journal website] [PDF] [Supplementary table 1 (XLS)] [Supplementary table 2 (XLS)] [Supplementary tables 3-6 (PDF)] [Data]

BACKGROUND. Microsatellite loci are frequently used in genomic studies of DNA sequence repeats and in population studies of genetic variability. To investigate the effect of sequence properties of microsatellites on their level of variability, we have analyzed genotypes at 627 microsatellite loci in 1,048 worldwide individuals from the HGDP-CEPH cell line panel together with the DNA sequences of these microsatellites in the human RefSeq database.
RESULTS. Calibrating PCR fragment lengths in individual genotypes by using the RefSeq sequence enabled us to infer repeat number in the HGDP-CEPH dataset and to calculate the mean number of repeats (as opposed to the mean PCR fragment length), under the assumption that differences in PCR fragment length reflect differences in the numbers of repeats in the embedded repeat sequences. We find the mean and maximum numbers of repeats across individuals to be positively correlated with heterozygosity. The size and composition of the repeat unit of a microsatellite are also important factors in predicting heterozygosity, with tetra-nucleotide repeat units high in G/C content leading to higher heterozygosity. Finally, we find that microsatellites containing more separate sets of repeated motifs generally have higher heterozygosity.
CONCLUSIONS. These results suggest that sequence properties of microsatellites have a significant impact in determining the features of human microsatellite variability.

[63] NM Kopelman, L Stone, C Wang, D Gefel, MW Feldman, J Hillel, NA Rosenberg (2009) Genomic microsatellites identify shared Jewish ancestry intermediate between Middle Eastern and European populations. BMC Genetics 10: 80. [Full text at journal website] [PDF]

BACKGROUND. Genetic studies have often produced conflicting results on the question of whether distant Jewish populations in different geographic locations share greater genetic similarity to each other or instead, to nearby non-Jewish populations. We perform a genome-wide population-genetic study of Jewish populations, analyzing 678 autosomal microsatellite loci in 78 individuals from four Jewish groups together with similar data on 321 individuals from 12 non-Jewish Middle Eastern and European populations.
RESULTS. We find that the Jewish populations show a high level of genetic similarity to each other, clustering together in several types of analysis of population structure. Further, Bayesian clustering, neighbor-joining trees, and multidimensional scaling place the Jewish populations as intermediate between the non-Jewish Middle Eastern and European populations.
CONCLUSION. These results support the view that the Jewish populations largely share a common Middle Eastern ancestry and that over their history they have undergone varying degrees of admixture with non-Jewish populations of European descent.

[62] L Huang, C Wang, NA Rosenberg (2009) The relationship between imputation error and statistical power in genetic association studies in diverse populations. American Journal of Human Genetics 85: 692-698. [PDF]

Genotype-imputation methods provide an essential technique for high-resolution genome-wide association (GWA) studies with millions of single-nucleotide polymorphisms. For optimal design and interpretation of imputation-based GWA studies, it is important to understand the connection between imputation error and power to detect associations at imputed markers. Here, using a 2x3 chi-square test, we describe a relationship between genotype-imputation error rates and the sample-size inflation required for achieving statistical power at an imputed marker equal to that obtained if genotypes at the marker were known with certainty. Surprisingly, typical imputation error rates (~2%-6%) lead to a large increase in the required sample size (~10%-60%), and in some African populations whose genotypes are particularly difficult to impute, the required sample-size increase is as high as ~30%-150%. In most populations, each 1% increase in imputation error leads to an increase of ~5%-13% in the sample size required for maintaining power. These results imply that in GWA sample-size calculations investigators will need to account for a potentially considerable loss of power from even low levels of imputation error and that development of additional genomic resources that decrease imputation error will translate into substantial reduction in the sample sizes needed for imputation-based detection of the variants that underlie complex human diseases.

[61] M DeGiorgio, M Jakobsson, NA Rosenberg (2009) Explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proceedings of the National Academy of Sciences USA 106: 16057-16062. [PDF] [Supplement]

Studies of worldwide human variation have discovered three trends in summary statistics as a function of increasing geographic distance from East Africa: a decrease in heterozygosity, an increase in linkage disequilibrium (LD), and a decrease in the slope of the ancestral allele frequency spectrum. Forward simulations of unlinked loci have shown that the decline in heterozygosity can be described by a serial founder model, in which populations migrate outward from Africa through a process where each of a series of populations is formed from a subset of the previous population in the outward expansion. Here, we extend this approach by developing a retrospective coalescent-based serial founder model that incorporates linked loci. Our model both recovers the observed decline in heterozygosity with increasing distance from Africa and produces the patterns observed in LD and the ancestral allele frequency spectrum. Surprisingly, although migration between neighboring populations and limited admixture between modern and archaic humans can be accommodated in the model while continuing to explain the three trends, a competing model in which a wave of outward modern human migration expands into a series of preexisting archaic populations produces nearly opposite patterns to those observed in the data. We conclude by developing a simpler model to illustrate that the feature that permits the serial founder model but not the archaic persistence model to explain the three trends observed with increasing distance from Africa is its incorporation of a cumulative effect of genetic drift as humans colonized the world.

[60] NA Rosenberg, JM VanLiere (2009) Replication of genetic associations as pseudoreplication due to shared genealogy. Genetic Epidemiology 33: 479-487. [PDF]

The genotypes of individuals in replicate genetic association studies have some level of correlation due to shared descent in the complete pedigree of all living humans. As a result of this genealogical sharing, replicate studies that search for genotype-phenotype associations using linkage disequilibrium between marker loci and disease-susceptibility loci can be considered as pseudoreplicates rather than true replicates. We examine the size of the pseudoreplication effect in association studies simulated from evolutionary models of the history of a population, evaluating the excess probability that both of a pair of studies detect a disease association compared to the probability expected under the assumption that the two studies are independent. Each of nine combinations of a demographic model and a penetrance model leads to a detectable pseudoreplication effect, suggesting that the degree of support that can be attributed to a replicated genetic association result is less than that which can be attributed to a replicated result in a context of true independence.

[59] JH Degnan, M DeGiorgio, D Bryant, NA Rosenberg (2009) Properties of consensus methods for inferring species trees from gene trees. Systematic Biology 58: 35-54. [PDF]

Consensus methods provide a useful strategy for summarizing information from a collection of gene trees. An important application of consensus methods is to combine gene trees to estimate a species tree. To investigate the theoretical properties of consensus trees that would be obtained from large numbers of loci evolving according to a basic evolutionary model, we construct consensus trees from rooted gene trees that occur in proportion to gene-tree probabilities derived from coalescent theory. We consider majority-rule, rooted triple (R*), and greedy consensus trees obtained from known, rooted gene trees, both in the asymptotic case as numbers of gene trees approach infinity and for finite numbers of genes. Our results show that for some combinations of species-tree branch lengths, increasing the number of independent loci can make the rooted majority-rule consensus tree more likely to be at least partially unresolved. However, the probability that the R* consensus tree has the species-tree topology approaches 1 as the number of gene trees approaches infinity. Although the greedy consensus algorithm can be the quickest to converge on the correct species-tree topology when increasing the number of gene trees, it can also be positively misleading. The majority-rule consensus tree is not a misleading estimator of the species-tree topology, and the R* consensus tree is a statistically consistent estimator of the species-tree topology. Our results therefore suggest a method for using multiple loci to infer the species-tree topology, even when it is discordant with the most likely gene tree.
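
The consistency of a rooted-triple approach can be motivated by the standard three-taxon coalescent result, sketched here under the assumption of a species tree ((A,B),C) whose internal branch has length t in coalescent units. The matching triple has probability 1 - (2/3)e^(-t), and each of the two discordant triples has probability (1/3)e^(-t), so the matching triple is the most probable one whenever t > 0. The function name is hypothetical.

```python
import math

def rooted_triple_probabilities(t):
    """Gene-tree triple probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent time units.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }
```

Because the matching triple is strictly favored for any positive t, a method that picks the most frequent triple for each set of three taxa would be expected to recover the species-tree triples as the number of gene trees grows.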

[58] JH Degnan, NA Rosenberg (2009) Gene tree discordance, phylogenetic inference, and the multispecies coalescent. Trends in Ecology and Evolution 24: 332-340. [PDF] [Supplement]

The field of phylogenetics is entering a new era in which trees of historical relationships between species are increasingly inferred from multilocus and genomic data. A major challenge for incorporating such large amounts of data into inference of species trees is that conflicting genealogical histories often exist in different genes throughout the genome. Recent advances in genealogical modeling suggest that resolving close species relationships is not quite as simple as applying more data to the problem. Here we discuss the complexities of genealogical discordance and review the issues that new methods for multilocus species tree inference will need to address to account successfully for naturally occurring genomic variability in evolutionary histories.

[57] KB Schroeder, M Jakobsson, MH Crawford, TG Schurr, SM Boca, DF Conrad, RY Tito, LP Osipova, LA Tarskaia, SI Zhadanov, JD Wall, JK Pritchard, RS Malhi, DG Smith, NA Rosenberg (2009) Haplotypic background of a private allele at high frequency in the Americas. Molecular Biology and Evolution 26: 995-1016. [PDF] [Supplement]

Recently, the observation of a high-frequency private allele, the 9-repeat allele at microsatellite D9S1120, in all sampled Native American and Western Beringian populations has been interpreted as evidence that all modern Native Americans descend primarily from a single founding population. However, this inference assumed that all copies of the 9-repeat allele were identical by descent and that the geographic distribution of this allele had not been influenced by natural selection. To investigate whether these assumptions are satisfied, we genotyped 34 single nucleotide polymorphisms across ~500 kilobases (kb) around D9S1120 in 21 Native American and Western Beringian populations and 54 other worldwide populations. All chromosomes with the 9-repeat allele share the same haplotypic background in the vicinity of D9S1120, suggesting that all sampled copies of the 9-repeat allele are identical by descent. Ninety-one percent of these chromosomes share the same 76.26 kb haplotype, which we call the "American Modal Haplotype" (AMH). Three observations lead us to conclude that the high frequency and widespread distribution of the 9-repeat allele are unlikely to be the result of positive selection: 1) aside from its association with the 9-repeat allele, the AMH does not have a high frequency in the Americas, 2) the AMH is not unusually long for its frequency compared with other haplotypes in the Americas, and 3) in Latin American mestizo populations, the proportion of Native American ancestry at D9S1120 is not unusual compared with that observed at other genomewide microsatellites. Using a new method for estimating the time to the most recent common ancestor (MRCA) of all sampled copies of an allele on the basis of an estimate of the length of the genealogy descended from the MRCA, we calculate the mean time to the MRCA of the 9-repeat allele to be between 7,325 and 39,900 years, depending on the demographic model used. 
The results support the hypothesis that all modern Native Americans and Western Beringians trace a large portion of their ancestry to a single founding population that may have been isolated from other Asian populations prior to expanding into the Americas.

[56] M DeGiorgio, NA Rosenberg (2009) An unbiased estimator of gene diversity in samples containing related individuals. Molecular Biology and Evolution 26: 501-512. [PDF]

Gene diversity is sometimes estimated from samples that contain inbred or related individuals. If inbred or related individuals are included in a sample, then the standard estimator for gene diversity produces a downward bias caused by an inflation of the variance of estimated allele frequencies. We develop an unbiased estimator for gene diversity that relies on kinship coefficients for pairs of individuals with known relationship and that reduces to the standard estimator when all individuals are noninbred and unrelated. Applying our estimator to data simulated based on allele frequencies observed for microsatellite loci in human populations, we find that the new estimator performs favorably compared with the standard estimator in terms of bias and similarly in terms of mean squared error. For human population-genetic data, we find that a close linear relationship previously seen between gene diversity and distance from East Africa is preserved when adjusting for the inclusion of close relatives.
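
A minimal sketch of a kinship-adjusted estimator of this kind follows. It assumes the form (1 - sum of squared sample allele frequencies) / (1 - mean kinship), where the mean kinship is taken over all ordered pairs of individuals including each individual with itself. This form is a reconstruction chosen because it reduces to the standard estimator for noninbred, unrelated diploids; the exact expression in the paper may differ, and the function name is hypothetical.

```python
def gene_diversity_with_relatives(genotypes, kinship):
    """Estimate gene diversity at one locus from diploid genotypes.

    genotypes: list of (allele1, allele2) tuples, one per individual.
    kinship: n x n matrix of kinship coefficients; the diagonal holds
             self-kinship (1/2 for a noninbred individual).
    Assumed form: (1 - sum p_i^2) / (1 - mean kinship).
    """
    n = len(genotypes)
    # Sample allele frequencies: each individual contributes two alleles.
    freqs = {}
    for a1, a2 in genotypes:
        for allele in (a1, a2):
            freqs[allele] = freqs.get(allele, 0.0) + 1.0 / (2 * n)
    raw = 1.0 - sum(p * p for p in freqs.values())
    mean_kinship = sum(sum(row) for row in kinship) / (n * n)
    return raw / (1.0 - mean_kinship)
```

With unrelated, noninbred individuals the mean kinship is 1/(2n), so the correction factor becomes 2n/(2n - 1), the usual small-sample correction. Positive kinship between individuals increases the mean kinship and therefore raises the estimate, offsetting the downward bias.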

[55] L Huang, Y Li, AB Singleton, JA Hardy, G Abecasis, NA Rosenberg, P Scheet (2009) Genotype imputation accuracy across worldwide human populations. American Journal of Human Genetics 84: 235-250. [PDF] [Supplement]

A current approach to mapping complex-disease-susceptibility loci in genome-wide association (GWA) studies involves leveraging the information in a reference database of dense genotype data. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and tested for disease association. This imputation strategy has been successful for GWA studies in populations well represented by existing reference panels. We used genotypes at 513,008 autosomal single-nucleotide polymorphism (SNP) loci in 443 unrelated individuals from 29 worldwide populations to evaluate the "portability" of the HapMap reference panels for imputation in studies of diverse populations. When a single HapMap panel was leveraged for imputation of randomly masked genotypes, European populations had the highest imputation accuracy, followed by populations from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa. For each population, we identified "optimal" mixtures of reference panels that maximized imputation accuracy, and we found that in most populations, mixtures including individuals from at least two HapMap panels produced the highest imputation accuracy. From a separate survey of additional SNPs typed in the same samples, we evaluated imputation accuracy in the scenario in which all genotypes at a given SNP position were unobserved and were imputed on the basis of data from a commercial "SNP chip," again finding that most populations benefited from the use of combinations of two or more HapMap reference panels. Our results can serve as a guide for selecting appropriate reference panels for imputation-based GWA analysis in diverse populations.

[54] S Ramachandran, NA Rosenberg, MW Feldman, J Wakeley (2008) Population differentiation and migration: coalescence times in a two-sex island model for autosomal and X-linked loci. Theoretical Population Biology 74: 281-291. [PDF]

Evolutionists have debated whether population-genetic parameters, such as effective population size and migration rate, differ between males and females. In humans, most analyses of this problem have focused on the Y chromosome and the mitochondrial genome, while the X chromosome has largely been omitted from the discussion. Past studies have compared FST values for the Y chromosome and mitochondrion under a model with migration rates that differ between the sexes but with equal male and female population sizes. In this study we investigate rates of coalescence for X-linked and autosomal lineages in an island model with different population sizes and migration rates for males and females, obtaining the mean time to coalescence for pairs of lineages from the same deme and for pairs of lineages from different demes. We apply our results to microsatellite data from the Human Genome Diversity Panel, and we examine the male and female migration rates implied by observed FST values.

[53] ZA Szpiech, M Jakobsson, NA Rosenberg (2008) ADZE: a rarefaction approach for counting alleles private to combinations of populations. Bioinformatics 24: 2498-2504. [Full text at journal website] [PDF] [Software]

Motivation: Analysis of the distribution of alleles across populations is a useful tool for examining population diversity and relationships. However, sample sizes often differ across populations, sometimes making it difficult to assess allelic distributions across groups.
Results: We introduce a generalized rarefaction approach for counting alleles private to combinations of populations. Our method evaluates the number of alleles found in each of a set of populations but absent in all remaining populations, considering equal-sized subsamples from each population. Applying this method to a worldwide human microsatellite dataset, we observe a high number of alleles private to the combination of African and Oceanian populations. This result supports the possibility of a migration out of Africa into Oceania separate from the migrations responsible for the majority of the ancestry of the modern populations of Asia, and it highlights the utility of our approach to sample size correction in evaluating hypotheses about population history.
Availability: We have implemented our method in the computer program ADZE, which is available for download at http://rosenberglab.bioinformatics.med.umich.edu/adze.html.
Contact: szpiechz@umich.edu
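
The rarefaction idea in this abstract can be sketched with the standard hypergeometric calculation: the probability that an allele with a given count is missing from a random subsample of g gene copies. The code is an illustrative sketch, not the ADZE implementation; the function names and the dictionary-based data layout are assumptions, and every population is assumed to contain at least g gene copies.

```python
from math import comb

def prob_absent(count, n, g):
    """Probability that an allele seen `count` times among n gene copies
    is missing from a random subsample of g copies (requires g <= n)."""
    return comb(n - count, g) / comb(n, g)

def allelic_richness(counts, g):
    """Expected number of distinct alleles in a subsample of size g."""
    n = sum(counts.values())
    return sum(1.0 - prob_absent(c, n, g) for c in counts.values())

def private_richness(pops, combo, g):
    """Expected number of alleles found in every population in `combo`
    and in none of the remaining populations, using subsamples of size g.

    pops: dict mapping population name -> {allele: count}.
    combo: set of population names.
    """
    alleles = set().union(*(counts.keys() for counts in pops.values()))
    total = 0.0
    for allele in alleles:
        term = 1.0
        for name, counts in pops.items():
            n = sum(counts.values())
            absent = prob_absent(counts.get(allele, 0), n, g)
            term *= (1.0 - absent) if name in combo else absent
        total += term
    return total
```

Because every population is evaluated at the same subsample size g, populations with larger samples do not appear richer in private alleles simply because more gene copies were examined.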

[52] NA Rosenberg, M Jakobsson (2008) The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179: 2027-2036. [PDF]

Homozygosity is a commonly used summary of allele-frequency distributions at polymorphic loci. Because high-frequency alleles contribute disproportionately to the homozygosity of a locus, it often occurs that most homozygotes are homozygous for the most frequent allele. To assess the relationship between homozygosity and the highest allele frequency at a locus, for a given homozygosity value, we determine the lower and upper bounds on the frequency of the most frequent allele. These bounds suggest tight constraints on the frequency of the most frequent allele as a function of homozygosity, differing by at most 1/4 and having an average difference of 2/3 - π²/18 ≈ 0.1184. The close connection between homozygosity and the frequency of the most frequent allele --- which we illustrate using allele frequencies from human populations --- has the consequence that when one of these two quantities is known, considerable information is available about the other quantity. This relationship also explains the similar performance of statistical tests of population-genetic models that rely on homozygosity and those that rely on the frequency of the most frequent allele, and it provides a basis for understanding the utility of extended homozygosity statistics in identifying haplotypes that have been elevated to high frequency as a result of positive selection.
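
One way to make these bounds concrete is the following sketch. It assumes closed forms reconstructed from a simple extremal argument rather than quoted from the paper: for homozygosity H (the sum of squared allele frequencies), the upper bound is sqrt(H), and with K = ceil(1/H) the lower bound is (1/K)(1 + sqrt((KH - 1)/(K - 1))). The function name is hypothetical.

```python
import math

def most_frequent_allele_bounds(h):
    """Return (lower, upper) bounds on the frequency of the most
    frequent allele, given homozygosity h with 0 < h <= 1.

    Assumed closed forms:
      upper = sqrt(h)
      lower = (1 + sqrt((K*h - 1) / (K - 1))) / K, with K = ceil(1/h)
    """
    if not 0.0 < h <= 1.0:
        raise ValueError("homozygosity must lie in (0, 1]")
    upper = math.sqrt(h)
    k = math.ceil(1.0 / h)
    if k == 1:  # h == 1: a single allele at frequency 1
        return 1.0, upper
    # Clamp tiny negative values caused by floating-point rounding.
    inner = max(0.0, (k * h - 1.0) / (k - 1.0))
    lower = (1.0 + math.sqrt(inner)) / k
    return lower, upper
```

Under these assumed forms, the gap between the bounds is largest at H = 1/4, where the lower bound is 1/4 and the upper bound is 1/2, and averaging the gap over H in (0, 1) numerically gives a value close to 0.1184, in line with the figures in the abstract.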

[51] JM VanLiere, NA Rosenberg (2008) Mathematical properties of the r² measure of linkage disequilibrium. Theoretical Population Biology 74: 130-137. [PDF]

Statistics for linkage disequilibrium (LD), the non-random association of alleles at two loci, depend on the frequencies of the alleles at the loci under consideration. Here, we examine the r² measure of LD and its mathematical relationship to allele frequencies, quantifying the constraints on its maximum value. Assuming independent uniform distributions for the allele frequencies of two biallelic loci, we find that the mean maximum value of r² is ~0.43051, and that r² can exceed a threshold of 4/5 in only ~14.232% of the allele frequency space. If one locus is assumed to have known allele frequencies --- the situation in an association study in which LD between a known marker locus and an unknown trait locus is of interest --- we find that the mean maximum value of r² is greatest when the known locus has a minor allele frequency of ~0.30131. We find that in 1/4 of the space of allowed values of minor allele frequencies and haplotype frequencies at a pair of loci, the unconstrained maximum r² allowing for the possibility of recombination between the loci exceeds the constrained maximum assuming that no recombination has occurred. Finally, we use r²_max to examine the connection between r² and the D' measure of linkage disequilibrium, finding that r²/r²_max = D'² for ~72.683% of the space of allowed values of (p_a,p_b,p_ab). Our results concerning the properties of r² have the potential to inform the interpretation of unusual LD behavior and to assist in the design of LD-based association-mapping studies.
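
The dependence of the maximum of r² on allele frequencies can be sketched as follows, using the usual definition r² = D² / (p_a(1 - p_a) p_b(1 - p_b)) and the standard limits on D imposed by nonnegative haplotype frequencies. The function names and the midpoint-grid averaging are illustrative choices, not the paper's derivation.

```python
def r2_max(pa, pb):
    """Maximum r^2 for two biallelic loci with allele frequencies pa, pb
    (both strictly between 0 and 1)."""
    qa, qb = 1.0 - pa, 1.0 - pb
    d_pos = min(pa * qb, qa * pb)  # largest allowed positive D
    d_neg = min(pa * pb, qa * qb)  # largest allowed magnitude of negative D
    d = max(d_pos, d_neg)
    return d * d / (pa * qa * pb * qb)

def mean_r2_max(grid=200):
    """Average r2_max over a midpoint grid on the unit square, treating
    the two allele frequencies as independent and uniform."""
    total = 0.0
    for i in range(grid):
        pa = (i + 0.5) / grid
        for j in range(grid):
            pb = (j + 0.5) / grid
            total += r2_max(pa, pb)
    return total / (grid * grid)
```

When the two loci have equal allele frequencies, r2_max equals 1; it falls as the frequencies diverge. The grid average should land near the ~0.43051 reported in the abstract, up to discretization error.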

[50] TJ Pemberton*, M Jakobsson*, DF Conrad, G Coop, JD Wall, JK Pritchard, PI Patel, NA Rosenberg (2008) Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Annals of Human Genetics 72: 535-546. [PDF] [Data]

When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis --- such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.

[49] O François, MGB Blum, M Jakobsson, NA Rosenberg (2008) Demographic history of European populations of Arabidopsis thaliana. PLoS Genetics 4: e1000075. [Full text at journal website] [PDF] [Supplement]

The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east ~10,000 years ago, with a rate of westward spread of ~0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.

[48] JM Macpherson, J Gonzalez, DM Witten, JC Davis, NA Rosenberg, AE Hirsh, DA Petrov (2008) Nonadaptive explanations for signatures of partial selective sweeps in Drosophila. Molecular Biology and Evolution 25: 1025-1042. [PDF]

A beneficial mutation that has nearly but not yet fixed in a population produces a characteristic haplotype configuration, called a partial selective sweep. Whether nonadaptive processes might generate similar haplotype configurations has not been extensively explored. Here, we consider 5 population genetic data sets taken from regions flanking high-frequency transposable elements in North American strains of Drosophila melanogaster, each of which appears to be consistent with the expectations of a partial selective sweep. We use coalescent simulations to explore whether incorporation of the species' demographic history, purifying selection against the element, or suppression of recombination caused by the element could generate putatively adaptive haplotype configurations. Whereas most of the data sets would be rejected as nonneutral under the standard neutral null model, only the data set for which there is strong external evidence in support of an adaptive transposition appears to be nonneutral under the more complex null model and in particular when demography is taken into account. High-frequency, derived mutations from a recently bottlenecked population, such as we study here, are of great interest to evolutionary genetics in the context of scans for adaptive events; we discuss the broader implications of our findings in this context.

[47] NA Rosenberg, R Tao (2008) Discordance of species trees with their most likely gene trees: the case of five taxa. Systematic Biology 57: 131-140. [PDF] [Supplement]

Under a coalescent model for within-species evolution, gene trees may differ from species trees to such an extent that the gene tree topology most likely to evolve along the branches of a species tree can disagree with the species tree topology. Gene tree topologies that are more likely to be produced than the topology that matches that of the species tree are termed anomalous, and the region of branch-length space that gives rise to anomalous gene trees (AGTs) is the anomaly zone. We examine the occurrence of anomalous gene trees for the case of five taxa, the smallest number of taxa for which every species tree topology has a nonempty anomaly zone. Considering all sets of branch lengths that give rise to anomalous gene trees, the largest value possible for the smallest branch length in the species tree is greater in the five-taxon case (0.1934 coalescent time units) than in the previously studied case of four taxa (0.1568). The five-taxon case demonstrates the existence of three phenomena that do not occur in the four-taxon case. First, anomalous gene trees can have the same unlabeled topology as the species tree. Second, the anomaly zone does not necessarily enclose a ball centered at the origin in branch-length space, in which all branches are short. Third, as a branch length increases, it is possible for the number of AGTs to increase rather than decrease or remain constant. These results, which help to describe how the properties of anomalous gene trees increase in complexity as the number of taxa increases, will be useful in formulating strategies for evading the problem of anomalous gene trees during species tree inference from multilocus data.

[46] M Jakobsson*, SW Scholz*, P Scheet*, JR Gibbs, JM VanLiere, H-C Fung, ZA Szpiech, JH Degnan, K Wang, R Guerreiro, JM Bras, JC Schymick, DG Hernandez, BJ Traynor, J Simon-Sanchez, M Matarin, A Britton, J van de Leemput, I Rafferty, M Bucan, HM Cann, JA Hardy, NA Rosenberg, AB Singleton (2008) Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998-1003. [PDF] [Supplement]

Genome-wide patterns of variation across individuals provide a powerful source of data for uncovering the history of migration, range expansion, and adaptation of the human species. However, high-resolution surveys of variation in genotype, haplotype and copy number have generally focused on a small number of population groups. Here we report the analysis of high-quality genotypes at 525,910 single-nucleotide polymorphisms (SNPs) and 396 copy-number-variable loci in a worldwide sample of 29 populations. Analysis of SNP genotypes yields strongly supported fine-scale inferences about population structure. Increasing linkage disequilibrium is observed with increasing geographic distance from Africa, as expected under a serial founder effect for the out-of-Africa spread of human populations. New approaches for haplotype analysis produce inferences about population structure that complement results based on unphased SNPs. Despite a difference from SNPs in the frequency spectrum of the copy-number variants (CNVs) detected --- including a comparatively large number of CNVs in previously unexamined populations from Oceania and the Americas --- the global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. Our results produce new inferences about inter-population variation, support the utility of CNVs in human population-genetic research, and serve as a genomic resource for human-genetic studies in diverse worldwide populations.

[45] K Zhang, NA Rosenberg (2007) On the genealogy of a duplicated microsatellite. Genetics 177: 2109-2122. [PDF]

When a microsatellite locus is duplicated in a diploid organism, a single pair of PCR primers may amplify as many as four distinct alleles. To study the evolution of a duplicated microsatellite, we consider a coalescent model with symmetric stepwise mutation. Conditional on the time of duplication and a mutation rate, both in a model of completely unlinked loci and in a model of completely linked loci, we compute the probabilities for a sampled diploid individual to amplify one, two, three, or four distinct alleles with one pair of microsatellite PCR primers. These probabilities are then studied to examine the nature of their dependence on the duplication time and the mutation rate. The mutation rate is observed to have a stronger effect than the duplication time on the four probabilities, and the unlinked and linked cases are seen to behave similarly. Our results can be useful for helping to interpret genetic variation at microsatellite loci in species with a very recent history of gene and genome duplication.

[44] S Wang*, CM Lewis Jr*, M Jakobsson*, S Ramachandran, N Ray, G Bedoya, W Rojas, MV Parra, JA Molina, C Gallo, G Mazzotti, G Poletti, K Hill, AM Hurtado, D Labuda, W Klitz, R Barrantes, MC Bortolini, FM Salzano, ML Petzl-Erler, LT Tsuneto, E Llop, F Rothhammer, L Excoffier, MW Feldman, NA Rosenberg, A Ruiz-Linares (2007) Genetic variation and population structure in Native Americans. PLoS Genetics 3: 2049-2067. [PDF] [Supplement] [Data] [Readme for datafile]

We examined genetic diversity and population structure in the American landmass using 678 autosomal microsatellite markers genotyped in 422 individuals representing 24 Native American populations sampled from North, Central, and South America. These data were analyzed jointly with similar data available in 54 other indigenous populations worldwide, including an additional five Native American groups. The Native American populations have lower genetic diversity and greater differentiation than populations from other continental regions. We observe gradients both of decreasing genetic diversity as a function of geographic distance from the Bering Strait and of decreasing genetic similarity to Siberians --- signals of the southward dispersal of human populations from the northwestern tip of the Americas. We also observe evidence of: (1) a higher level of diversity and lower level of population structure in western South America compared to eastern South America, (2) a relative lack of differentiation between Mesoamerican and Andean populations, (3) a scenario in which coastal routes were easier for migrating peoples to traverse in comparison with inland routes, and (4) a partial agreement on a local scale between genetic similarity and the linguistic classification of populations. These findings offer new insights into the process of population dispersal and differentiation during the peopling of the Americas.

[43] M Jakobsson, NA Rosenberg (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23: 1801-1806. [Full text at journal website] [PDF] [Software]

Motivation: Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as BAPS, STRUCTURE and TESS, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) "label switching" of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) "genuine multimodality," truly distinct solutions across replicates.
Results: To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program CLUMPP (CLUster Matching and Permutation Program). We illustrate the use of CLUMPP by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds.
Availability: CLUMPP is freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html
Contact: Mattias Jakobsson
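
The label-switching problem described in this abstract can be illustrated with a minimal sketch. It is not CLUMPP's implementation: the function name `align_labels`, the toy membership matrices, and the squared-difference criterion are all assumptions chosen for illustration, and the exhaustive search over all K! column permutations is practical only for a small number of clusters K.

```python
from itertools import permutations


def align_labels(q_ref, q_rep):
    """Relabel the clusters of q_rep so that they best match q_ref.

    q_ref and q_rep are lists of rows (one row per individual), each row
    holding K membership coefficients. Returns the best permutation and
    the relabeled copy of q_rep. Hypothetical helper, not part of CLUMPP.
    """
    k = len(q_ref[0])

    def squared_distance(perm):
        # Sum of squared differences after mapping column perm[j] of the
        # replicate onto column j of the reference.
        return sum(
            (row_ref[j] - row_rep[perm[j]]) ** 2
            for row_ref, row_rep in zip(q_ref, q_rep)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=squared_distance)
    aligned = [[row[best[j]] for j in range(k)] for row in q_rep]
    return best, aligned


# Toy data: the replicate is the reference with its cluster labels shuffled.
q_ref = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
q_rep = [[row[2], row[0], row[1]] for row in q_ref]

perm, aligned = align_labels(q_ref, q_rep)
print(perm)     # (1, 2, 0)
print(aligned)  # identical to q_ref once the labels are matched
```

If replicates still differ substantially after such a relabeling, the difference is not attributable to label switching alone, which corresponds to the second source of disagreement named in the abstract.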

[42] MGB Blum, NA Rosenberg (2007) Estimating the number of ancestral lineages using a maximum likelihood method based on rejection sampling. Genetics 176: 1741-1757. [PDF]

Estimating the number of ancestral lineages of a sample of DNA sequences at time t in the past can be viewed as a variation on the problem of estimating the time to the most recent common ancestor. To estimate the number of ancestral lineages, we develop a maximum-likelihood approach that takes advantage of a prior model of population demography, in addition to the molecular data summarized by the pattern of polymorphic sites. The method relies on a rejection sampling algorithm that is introduced for simulating conditional coalescent trees given a fixed number of ancestral lineages at time t. Computer simulations show that the number of ancestral lineages can be estimated accurately, provided that the number of mutations that occurred since time t is sufficiently large. The method is applied to 986 present-day human sequences located in hypervariable region 1 of the mitochondrion to estimate the number of ancestral lineages of modern humans at the time of potential admixture with the Neanderthal population. Our estimates support a view that the proportion of the modern population consisting of Neanderthal contributions must be relatively small, less than ~5%, if the admixture happened as recently as 30,000 years ago.

[41] NA Rosenberg (2007) Counting coalescent histories. Journal of Computational Biology 14: 360-377. [PDF]

Given a species tree and a gene tree, a valid coalescent history is a list of the branches of the species tree on which coalescences in the gene tree take place. I develop a recursion for the number of valid coalescent histories that exist for an arbitrary gene tree/species tree pair, when one gene lineage is studied per species. The result is obtained by defining a concept of m-extended coalescent histories, enumerating and counting these histories, and taking the special case of m=1. As a sum over valid coalescent histories appears in a formula for the probability that a random gene tree evolving along the branches of a fixed species tree has a specified labeled topology, the enumeration of valid coalescent histories can considerably reduce the effort required for evaluating this formula.

[40] NA Rosenberg, MGB Blum (2007) Sampling properties of homozygosity-based statistics for linkage disequilibrium. Mathematical Biosciences 208: 33-47. [PDF]

Homozygosity-based statistics such as Ohta's identity-in-state (IIS) excess offer the potential to measure linkage disequilibrium for multiallelic loci in small samples. However, previous observations have suggested that for independent loci, in small samples these statistics might produce values that more frequently lie on one side rather than on the other side of zero. Here we investigate the sampling properties of the IIS excess. We find that for any pair of independent polymorphic loci, as sample size n approaches infinity, the sampling distribution of the IIS excess approaches a normal distribution. For large samples, the IIS excess tends towards symmetry around zero, and the probabilities of positive and of negative IIS excess both approach 1/2. Surprisingly, however, we also find that for sufficiently large n, independent loci can be chosen so that the probability of a sample having positive IIS excess is arbitrarily close to either 0 or 1. The results are applied to interpretation of data from human populations, and we conclude that before employing homozygosity-based statistics to measure LD in a particular sample, especially for loci with either very small or very large homozygosities, it is useful to verify that loci with the observed homozygosity values are not likely to produce a large bias in IIS excess in samples of the given size.

[39] M Jakobsson, NA Rosenberg (2007) The probability distribution under a population divergence model of the number of genetic founding lineages of a population or species. Theoretical Population Biology 71: 502-523. [PDF]

The composition of genetic variation in a population or species is shaped by the number of events that led to the founding of the group. We consider a neutral coalescent model of two populations, where a derived population is founded as an offshoot of an ancestral population. For a given locus, using both recursive and nonrecursive approaches, we compute the probability distribution of the number of genetic founding lineages that have given rise to the derived population. This number of genetic founding lineages is defined as the number of ancestral individuals that contributed at the locus to the present-day derived population, and is formulated in terms of interspecific coalescence events. The effects of sample size and divergence time on the probability distribution of the number of founding lineages are studied in detail. For 99.99% of the loci in the derived population to each have one founding lineage, the two populations must be separated for >=9.9N generations. However, only ~0.87N generations must pass since divergence for 99.99% of the loci to have <6 founding lineages. Our results are useful as a prior expectation on the number of founding lineages in scenarios that involve the evolution of one population from the splitting of an ancestral group, such as in the colonization of islands, the formation of polyploid species, and the domestication of crops and livestock from wild ancestors.

[38] L David, NA Rosenberg, U Lavi, MW Feldman, J Hillel (2007) Genetic diversity and population structure inferred from the partially duplicated genome of domesticated carp, Cyprinus carpio L. Genetics Selection Evolution 39: 319-340. [PDF]

Genetic relationships among eight populations of domesticated carp (Cyprinus carpio L.), a species with a partially duplicated genome, were studied using 12 microsatellites and 505 AFLP bands. The populations included three aquacultured carp strains and five ornamental carp (koi) variants. Grass carp (Ctenopharyngodon idella) was used as an outgroup. AFLP-based gene diversity varied from 5% (grass carp) to 32% (koi) and reflected the reasonably well understood histories and breeding practices of the populations. A large fraction of the molecular variance was due to differences between aquacultured and ornamental carps. Further analyses based on microsatellite data, including cluster analysis and neighbor-joining trees, supported the genetic distinctiveness of aquacultured and ornamental carps, despite the recent divergence of the two groups. In contrast to what was observed for AFLP-based diversity, the frequency of heterozygotes based on microsatellites was comparable among all populations. This discrepancy can potentially be explained by duplication of some loci in Cyprinus carpio L., and a model that shows how duplication can increase heterozygosity estimates for microsatellites but not for AFLP loci is discussed. Our analyses in carp can help in understanding the consequences of genotyping duplicated loci and in interpreting discrepancies between dominant and co-dominant markers in species with recent genome duplication.

[37] KB Schroeder, TG Schurr, JC Long, NA Rosenberg, MH Crawford, LA Tarskaia, LP Osipova, SI Zhadanov, DG Smith (2007) A private allele ubiquitous in the Americas. Biology Letters 3: 218-223. [PDF] [Supplement]

The three-wave migration hypothesis of Greenberg et al. has permeated the genetic literature on the peopling of the Americas. Greenberg et al. proposed that Na-Dene, Aleut-Eskimo and Amerind are language phyla which represent separate migrations from Asia to the Americas. We show that a unique allele at autosomal microsatellite locus D9S1120 is present in all sampled North and South American populations, including the Na-Dene and Aleut-Eskimo, and in related Western Beringian groups, at an average frequency of 31.7%. This allele was not observed in any sampled putative Asian source populations or in other worldwide populations. Neither selection nor admixture explains the distribution of this regionally specific marker. The simplest explanation for the ubiquity of this allele across the Americas is that the same founding population contributed a large fraction of ancestry to all modern Native American populations.

[36] NA Rosenberg (2007) Statistical tests for taxonomic distinctiveness from observations of monophyly. Evolution 61: 317-323. [PDF]

The observation of monophyly for a specified set of genealogical lineages is often used to place the lineages into a distinctive taxonomic entity. However, it is sometimes possible that monophyly of the lineages can occur by chance as an outcome of the random branching of lineages within a single taxon. Thus, especially for small samples, an observation of monophyly for a set of lineages --- even if strongly supported statistically --- does not necessarily indicate that the lineages are from a distinctive group. Here I develop a test of the null hypothesis that monophyly is a chance outcome of random branching. I also compute the sample size required so that the probability of chance occurrence of monophyly of a specified set of lineages lies below a prescribed tolerance. Under the null model of random branching, the probability that monophyly of the lineages in an index group occurs by chance is substantial if the sample is highly asymmetric, that is, if only a few of the sampled lineages are from the index group, or if only a few lineages are external to the group. If sample sizes are similar inside and outside the group of interest, however, chance occurrence of monophyly can be rejected at stringent significance levels (P < 10^{-5}) even for quite small samples (~20 total lineages). For a fixed total sample size, rejection of the null hypothesis of random branching in a single taxon occurs at the most stringent level if samples of nearly equal size inside and outside the index group --- with a slightly greater size within the index group --- are used. Similar results apply, with smaller sample sizes needed, when reciprocal monophyly of two groups, rather than monophyly of a single group, is of interest. The results suggest minimal sample sizes required for inferences to be made about taxonomic distinctiveness from observations of monophyly.
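
The sample-size claims in this abstract can be explored numerically with a short sketch. It assumes that the null model of random branching gives the standard Yule-type clade probability, 2(a+b) / [a(a+1) C(a+b, a)], for a specified group of a lineages among a+b sampled lineages. Both that identification and the function name `chance_monophyly` are assumptions made for illustration and are not quoted from the paper.

```python
from fractions import Fraction
from math import comb


def chance_monophyly(a, b):
    """Probability that a specified group of `a` lineages is monophyletic
    by chance among a + b lineages, under the assumed Yule-type formula
    2(a+b) / [a(a+1) C(a+b, a)]. Requires a >= 1 and b >= 1.
    """
    if a < 1 or b < 1:
        raise ValueError("both groups must contain at least one lineage")
    n = a + b
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


# Compare asymmetric and balanced splits of 20 total lineages.
for a, b in [(2, 18), (10, 10), (11, 9), (18, 2)]:
    print(a, b, float(chance_monophyly(a, b)))
```

Under this assumed formula, the balanced splits of 20 lineages fall below 10^{-5}, and the split with 11 lineages in the index group gives a slightly smaller probability than the 10/10 split, in line with the pattern described in the abstract.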

[35] NA Rosenberg, S Mahajan, C Gonzalez-Quevedo, MGB Blum, L Nino-Rosales, V Ninis, P Das, M Hegde, L Molinari, G Zapata, JL Weber, JW Belmont, PI Patel (2006) Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genetics 2: 2052-2061. [PDF] [Supplementary Tables 1-3 (DOC)] [Supplementary Tables 1-3 (PDF)]

Ongoing modernization in India has elevated the prevalence of many complex genetic diseases associated with a western lifestyle and diet to near-epidemic proportions. However, although India comprises more than one sixth of the world's human population, it has largely been omitted from genomic surveys that provide the backdrop for association studies of genetic disease. Here, by genotyping India-born individuals sampled in the United States, we carry out an extensive study of Indian genetic variation. We analyze 1,200 genome-wide polymorphisms in 432 individuals from 15 Indian populations. We find that populations from India, and populations from South Asia more generally, constitute one of the major human subgroups with increased similarity of genetic ancestry. However, only a relatively small amount of genetic differentiation exists among the Indian populations. Although caution is warranted due to the fact that United States-sampled Indian populations do not represent a random sample from India, these results suggest that the frequencies of many genetic variants are distinctive in India compared to other parts of the world and that the effects of population heterogeneity on the production of false positives in association studies may be smaller in Indians (and particularly in Indian-Americans) than might be expected for such a geographically and linguistically diverse subset of the human population.

[34] DF Conrad*, M Jakobsson*, G Coop*, X Wen, JD Wall, NA Rosenberg, JK Pritchard (2006) A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics 38: 1251-1260. [PDF] [Supplement (methods, note, and figures)] [Supplementary Table 1] [Data]

Recent genomic surveys have produced high-resolution haplotype information, but only in a small number of human populations. We report haplotype structure across 12 Mb of DNA sequence in 927 individuals representing 52 populations. The geographic distribution of haplotypes reflects human history, with a loss of haplotype diversity as distance increases from Africa. Although the extent of linkage disequilibrium (LD) varies markedly across populations, considerable sharing of haplotype structure exists, and inferred recombination hotspot locations generally match across groups. The four samples in the International HapMap Project contain the majority of common haplotypes found in most populations: averaging across populations, 83% of common 20-kb haplotypes in a population are also common in the most similar HapMap sample. Consequently, although the portability of tag SNPs based on the HapMap is reduced in low-LD Africans, the HapMap will be helpful for the design of genome-wide association mapping studies in nearly all human populations.

[33] NA Rosenberg (2006) Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Annals of Human Genetics 70: 841-847. [PDF] [Supplement] [Data] [Spreadsheet with recommended subsets (txt format)] [Spreadsheet with recommended subsets (xls format)]

The HGDP-CEPH Human Genome Diversity Cell Line Panel is a widely-used resource for studies of human genetic variation. Here, pairs of close relatives that have been included in the panel are identified. Together with information on atypical and duplicated samples, the inferred relative pairs suggest standardized subsets of the panel for use in future population-genetic studies.

[32] NA Rosenberg (2006) Gene genealogies. Chapter 12 in CW Fox and JB Wolf, eds. Evolutionary Genetics: Concepts and Case Studies. Oxford: Oxford University Press, pp. 173-189. [PDF of final version]

(No abstract)

[31] NA Rosenberg, M Nordborg (2006) A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations. Genetics 173: 1665-1678. [PDF]

In linkage disequilibrium mapping of genetic variants causally associated with phenotypes, spurious associations can potentially be generated by any of a variety of types of population structure. However, mathematical theory of the production of spurious associations has largely been restricted to population structure models that involve the sampling of individuals from a collection of discrete subpopulations. Here, we introduce a general model of spurious association in structured populations, appropriate whether the population structure involves discrete groups, admixture among such groups, or continuous variation across space. Under the assumptions of the model, we find that a single common principle — applicable to both the discrete and admixed settings as well as to spatial populations — gives a necessary and sufficient condition for the occurrence of spurious associations. Using a mathematical connection between the discrete and admixed cases, we show that in admixed populations, spurious associations are less severe than in corresponding mixtures of discrete subpopulations, especially when the variance of admixture across individuals is small. This observation, together with the results of simulations that examine the relative influences of various model parameters, has important implications for the design and analysis of genetic association studies in structured populations.

[30] JH Degnan, NA Rosenberg (2006) Discordance of species trees with their most likely gene trees. PLoS Genetics 2: 762-768. [PDF]

Because of the stochastic way in which lineages sort during speciation, gene trees may differ in topology from each other and from species trees. Surprisingly, assuming that genetic lineages follow a coalescent model of within-species evolution, we find that for any species tree topology with five or more species, there exist branch lengths for which gene tree discordance is so common that the most likely gene tree topology to evolve along the branches of a species tree differs from the species phylogeny. This counterintuitive result implies that in combining data on multiple loci, the straightforward procedure of using the most frequently observed gene tree topology as an estimate of the species tree topology can be asymptotically guaranteed to produce an incorrect estimate. We conclude with suggestions that can aid in overcoming this new obstacle to accurate genomic inference of species phylogenies.

[29] NA Rosenberg (2006) The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Annals of Combinatorics 10: 129-146. [PDF]

The Yule model is a frequently-used evolutionary model that can be utilized to generate random genealogical trees. Under this model, using a backwards counting method differing from the approach previously employed by Heard (Evolution 46: 1818-1826), for a genealogical tree of n lineages, the mean number of nodes with exactly r descendants is computed (2 ≤ r ≤ n-1). The variance of the number of r-pronged nodes is also obtained, as are the mean and variance of the number of r-caterpillars. These results generalize computations of McKenzie and Steel for the case of r=2 (Math. Biosci. 164: 81-92, 2000). For a given n, the two means are largest at r=2, equaling 2n/3 for n ≥ 5. However, for n ≥ 9, the variances are largest at r=3, equaling 23n/420 for n ≥ 7. As n → ∞, the fraction of internal nodes that are r-caterpillars for some r approaches (e²-5)/4 ≈ 0.59726.

[28] NA Rosenberg, S Mahajan, S Ramachandran, C Zhao, JK Pritchard, MW Feldman (2005) Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics 1: 660-671. [PDF] [Data]

Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables (sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample) on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.

[27] NA Rosenberg (2005) Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 12: 1183-1201. [PDF]

Given a set of potential source populations, genotypes of an individual of unknown origin at a collection of markers can be used to predict the correct source population of the individual. For improved efficiency, informative markers can be chosen from a larger set of markers to maximize the accuracy of this prediction. However, selecting the loci that are individually most informative does not necessarily produce the optimal panel. Here, using genotypes from eight species — carp, cat, chicken, dog, fly, grayling, human, and maize — this univariate accumulation procedure is compared to new multivariate "greedy" and "maximin" algorithms for choosing marker panels. The procedures generally suggest similar panels, although the greedy method often recommends inclusion of loci that are not chosen by the other algorithms. In seven of the eight species, when applied to five or more markers, all methods achieve at least 94% assignment accuracy on simulated individuals, with one species — dog — producing this level of accuracy with only three markers, and the eighth species — human — requiring ~13-16 markers. The new algorithms produce substantial improvements over use of randomly selected markers; where differences among the methods are noticeable, the greedy algorithm leads to slightly higher probabilities of correct assignment. Although none of the approaches necessarily chooses the panel with optimal performance, the algorithms all likely select panels with performance near enough to the maximum that they all are suitable for practical use.

[26] S Ramachandran, O Deshpande, CC Roseman, NA Rosenberg, MW Feldman, LL Cavalli-Sforza (2005) Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proceedings of the National Academy of Sciences USA 102: 15942-15947. [PDF] [Supplementary Figure 6] [Supplementary Table 2] [Supplementary text] [Data]

Equilibrium models of isolation by distance predict an increase in genetic differentiation with geographic distance. Here we find a linear relationship between genetic and geographic distance in a worldwide sample of human populations, with major deviations from the fitted line explicable by admixture or extreme isolation. A close relationship is shown to exist between the correlation of geographic distance and genetic differentiation (as measured by Fst) and the geographic pattern of heterozygosity across populations. Considering a worldwide set of geographic locations as possible sources of the human expansion, we find that heterozygosities in the globally distributed populations of the data set are best explained by an expansion originating in Africa and that no geographic origin outside of Africa accounts as well for the observed patterns of genetic diversity. Although the relationship between Fst and geographic distance has been interpreted in the past as the result of an equilibrium model of drift and dispersal, simulation shows that the geographic pattern of heterozygosities in this data set is consistent with a model of a serial founder effect starting at a single origin. Given this serial-founder scenario, the relationship between genetic and geographic distance allows us to derive bounds for the effects of drift and natural selection on human genetic variation.

[25] NA Rosenberg (2005) A sharp minimum on the mean number of steps taken in adaptive walks. Journal of Theoretical Biology 237: 17-22. [PDF]

It was recently conjectured by H.A. Orr [2003. A minimum on the mean number of steps taken in adaptive walks. J. Theor. Biol. 220, 241-247] that from a random initial point on a random fitness landscape of alphabetic sequences with one-mutation adjacency, chosen from a larger class of landscapes, no adaptive algorithm can arrive at a local optimum in fewer than on average e-1 steps. Here, using an example in which the mean number of steps to a local optimum equals (A-1)/A, where A is the number of distinct "letters" in the "alphabet" from which sequences are constructed, it is shown that as originally stated, the conjecture does not hold. It is also demonstrated that (A-1)/A is a sharp minimum on the mean number of steps taken in adaptive walks on fitness landscapes of alphabetic sequences with one-mutation adjacency. As the example that achieves the new lower bound has properties that are not often considered as potential attributes for fitness landscapes --- non-identically distributed fitnesses and negative fitness correlations for adjacent points --- a weaker set of conditions characteristic of more commonly studied fitness landscapes is proposed under which the lower bound on the mean length of adaptive walks is conjectured to equal e-1.

[24] M Nordborg, TT Hu, Y Ishino, J Jhaveri, C Toomajian, H Zheng, E Bakker, P Calabrese, J Gladstone, R Goyal, M Jakobsson, S Kim, Y Morozov, B Padhukasahasram, V Plagnol, NA Rosenberg, C Shah, JD Wall, J Wang, K Zhao, T Kalbfleisch, V Schulz, M Kreitman, J Bergelson (2005) The pattern of polymorphism in Arabidopsis thaliana. PLoS Biology 3: 1289-1299. [PDF]

We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species. Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional genomics.

[23] H Innan, K Zhang, P Marjoram, S Tavaré, NA Rosenberg (2005) Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics 169: 1763-1777. [PDF] [Software]

Several tests of neutral evolution employ the observed number of segregating sites and properties of the haplotype frequency distribution as summary statistics and use simulations to obtain rejection probabilities. Here we develop a "haplotype configuration test" of neutrality (HCT) based on the full haplotype frequency distribution. To enable exact computation of rejection probabilities for small samples, we derive a recursion under the standard coalescent model for the joint distribution of the haplotype frequencies and the number of segregating sites. For larger samples, we consider simulation-based approaches. The utility of the HCT is demonstrated in simulations of alternative models and in application to data from Drosophila melanogaster.

[22] NA Rosenberg, PP Calabrese (2004) Polyploid and multilocus extensions of the Wahlund inequality. Theoretical Population Biology 66: 381-391. [PDF]

Wahlund's inequality informally states that if a structured and an unstructured population have the same allele frequencies at a locus, the structured population contains more homozygotes. We show that this inequality holds generally for ploidy level P, that is, the structured population has more P-polyhomozygotes. Further, for M randomly chosen loci (M greater than or equal to 2), the structured population is also expected to contain more M-multihomozygotes than an unstructured population with the same single-locus homozygosities. The extended inequalities suggest multilocus identity coefficients analogous to FST. Using microsatellite genotypes from human populations, we demonstrate that the multilocus Wahlund inequality can explain a positive bias in "identity-in-state excess."
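
A small numerical sketch can make the single-locus inequality concrete for ploidy P. It assumes random union of gametes within each subpopulation, so that the frequency of P-polyhomozygotes is the sum of p^P over alleles. The function names, the allele frequencies, and the subpopulation weights are hypothetical values chosen for illustration.

```python
def homozygosity(freqs, ploidy=2):
    """Expected frequency of individuals carrying `ploidy` identical
    alleles, assuming random union of gametes: sum of p ** ploidy."""
    return sum(p ** ploidy for p in freqs)


def wahlund_compare(subpop_freqs, weights, ploidy=2):
    """Return (structured, unstructured) polyhomozygosity.

    structured:   weighted mean of within-subpopulation homozygosities
    unstructured: homozygosity of one population with the pooled
                  (weighted mean) allele frequencies
    """
    structured = sum(
        w * homozygosity(f, ploidy) for w, f in zip(weights, subpop_freqs)
    )
    n_alleles = len(subpop_freqs[0])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, subpop_freqs))
        for i in range(n_alleles)
    ]
    return structured, homozygosity(pooled, ploidy)


# Two equally weighted subpopulations with different allele frequencies.
freqs = [[0.9, 0.1], [0.2, 0.8]]
weights = [0.5, 0.5]
for ploidy in (2, 3, 4):
    print(ploidy, wahlund_compare(freqs, weights, ploidy))
```

In each case the structured value is at least as large as the unstructured one, which follows from the convexity of p^P on the interval from 0 to 1 for P of at least 2.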

[21] MM Tanaka, NA Rosenberg, PM Small (2004) The control of copy number of IS6110 in Mycobacterium tuberculosis. Molecular Biology and Evolution 21: 2195-2201. [PDF]

Insertion sequence (IS) elements are bacterial genes that are able to transpose to different locations in the genome. These elements are often used in molecular epidemiology as genetic markers that track the spread of pathogens. Transposable elements have frequently been described as "selfish DNA" because they facilitate their own transposition, causing damage when they insert into coding regions, while contributing little if anything to the bacterial host. According to this hypothesis, the expansion of copy number of insertion sequences is opposed by negative selection against high copy numbers. From an alternative point of view, we might expect IS elements to intrinsically regulate transposition within cells, thereby limiting damage to their bacterial host. Here, we report evidence that the copy number of IS6110 in Mycobacterium tuberculosis is controlled by selection against the element. We first construct 12 different models of marker change resulting from a combination of possible transposition functions and selective regimes. We then compute the Akaike Information Criterion for each model to identify the models that best explain data consisting of serial isolates of M. tuberculosis genotyped with IS6110. We find that the best performing models all include selection against the accumulation of copies. Specifically, our analysis points to the interaction of separate copies of the element causing lethal effects. We discuss the implications of these findings for genome evolution and molecular epidemiology.

[20] NA Rosenberg (2004) Review of Probability Models for DNA Sequence Evolution by R Durrett. Journal of the American Statistical Association 99: 560-561. [PDF]

(No abstract)

[19] S Ramachandran, NA Rosenberg, LA Zhivotovsky, MW Feldman (2004) Robustness of the inference of human population structure: a comparison of X-chromosomal and autosomal microsatellites. Human Genomics 1: 87-97. [PDF]

In this paper, data on 20 X-chromosomal microsatellite polymorphisms from the HGDP-CEPH cell line panel are used to infer human population structure. Inferences from these data are compared to those obtained from autosomal microsatellites. Some of the major features of the structure seen with 377 autosomal markers are generally visible with the X-linked markers, although the latter provide less resolution. Differences between the X-chromosomal and autosomal results can be explained without requiring major differences in demographic parameters between males and females. The dependence of the partitioning on the number of individuals sampled from each region and on the number of markers used is discussed.

[18] NA Rosenberg (2004) Distruct: a program for the graphical display of population structure. Molecular Ecology Notes 4: 137-138. [PDF] [Software]

In analysis of multilocus genotypes from structured populations, individual coefficients of membership in subpopulations are often estimated using programs such as structure. Distruct provides a general method for visualizing these estimated membership coefficients. Subpopulations are represented as colours, and individuals are depicted as bars partitioned into coloured segments that correspond to membership coefficients in the subgroups. Distruct, available at http://www.cmb.usc.edu/~noahr/distruct.html, can also be used to display subpopulation assignment probabilities when individuals are assumed to have ancestry in only one group.

[17] NA Rosenberg, LM Li, R Ward, JK Pritchard (2003) Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics 73: 1402-1422. [PDF] [Supplement] [Microsatellite data] [SNP data] [SNP data readme] [Solution to Problem 11039 required in appendix of paper (American Mathematical Monthly 112: 572-573, 2005)] [Software]

Inference of individual ancestry is useful in various applications, such as admixture mapping and structured-association mapping. Using information-theoretic principles, we introduce a general measure, the informativeness for assignment (I_n), applicable to any number of potential source populations, for determining the amount of information that multiallelic markers provide about individual ancestry. In a worldwide human microsatellite data set, we identify markers of highest informativeness for inference of regional ancestry and for inference of population ancestry within regions; these markers, which are listed in online-only tables in our article, can be useful both in testing for and in controlling the influence of ancestry on case-control genetic association studies. Markers that are informative in one collection of source populations are generally informative in others. Informativeness of random dinucleotides, the most informative class of microsatellites, is five to eight times that of random single-nucleotide polymorphisms (SNPs), but 2%-12% of SNPs have higher informativeness than the median for dinucleotides. Our results can aid in decisions about the type, quantity, and specific choice of markers for use in studies of ancestry.
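
A minimal sketch of an information-theoretic informativeness calculation follows. It assumes that I_n can be computed as the mutual information between the source population of an allele copy and its allelic type, with equal prior weight on each of the K candidate populations; the paper should be consulted for the exact definition. The function name and the example frequencies are hypothetical.

```python
import math
from typing import List, Sequence


def informativeness_for_assignment(freqs: List[Sequence[float]]) -> float:
    """Mutual information (natural log) between population label and allele.

    freqs[i][j] is the frequency of allele j in population i; each of the
    K populations receives prior weight 1/K (an assumption of this sketch).
    """
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        p_j = sum(pop[j] for pop in freqs) / k  # average frequency of allele j
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for pop in freqs:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


# Hypothetical markers in two source populations.
fixed_difference = [[1.0, 0.0], [0.0, 1.0]]  # fully diagnostic marker
no_difference = [[0.5, 0.5], [0.5, 0.5]]     # uninformative marker
partial = [[0.8, 0.2], [0.4, 0.6]]

for label, marker in [("fixed", fixed_difference), ("none", no_difference), ("partial", partial)]:
    print(label, round(informativeness_for_assignment(marker), 4))
```

Under these assumptions the measure ranges from 0, when allele frequencies are identical in all populations, to ln K, when no allele is shared between populations.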

[16] NA Rosenberg, AE Hirsh (2003) On the use of star-shaped genealogies in inference of coalescence times. Genetics 164: 1677-1682. [PDF]

Genealogies from rapidly growing populations have approximate "star" shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (TMRCA) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least 10^5, the "pairwise comparison estimator" of TMRCA that derives from the star genealogy assumption has bias of 10-50%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The "tree-length estimator" of TMRCA is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.

[15] NA Rosenberg (2003) The shapes of neutral gene genealogies in two species: probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution 57: 1465-1477. [PDF]

The genealogies of samples of orthologous regions from multiple species can be classified by their shapes. Using a neutral coalescent model of two species, I give exact probabilities of each of four possible genealogical shapes — reciprocal monophyly, two types of paraphyly, and polyphyly. After the divergence that forms two species, each of which has population size N, polyphyly is the most likely genealogical shape for the lineages of the two species. At ~1.300N generations after divergence, paraphyly becomes most likely, and reciprocal monophyly becomes most likely at ~1.665N generations. For a given species, the time at which 99% of its loci acquire monophyletic genealogies is ~5.298N generations, assuming all loci in its sister species are monophyletic. The probability that all lineages of two species are reciprocally monophyletic given that a sample from the two species has a reciprocally monophyletic genealogy increases rapidly with sample size, as does the probability that the most recent common ancestor (MRCA) for a sample is also the MRCA for all lineages from the two species. The results have potential applications for the testing of evolutionary hypotheses.

[14] NA Rosenberg, JK Pritchard, JL Weber, HM Cann, KK Kidd, LA Zhivotovsky, MW Feldman (2003) Response to comment on "Genetic structure of human populations." Science 300: 1877. [PDF] [Data]

Our higher within-group variance component estimate in relation to comparable past studies is due to our use of allelic indicator variables, inclusion of tetranucleotide loci, and analysis of a sample that contained proportionately fewer geographically well-separated populations. The 83.4% estimate of Excoffier and Hamilton employs a subset of groups that are nearly maximally differentiated within regions, and it can therefore be regarded as a lower bound.

[13] NA Rosenberg, AG Tsolaki, MM Tanaka (2003) Estimating change rates of genetic markers using serial samples: applications to the transposon IS6110 in Mycobacterium tuberculosis. Theoretical Population Biology 63: 347-363. [PDF]

In infectious disease epidemiology, it is useful to know how quickly genetic markers of pathogenic agents evolve while inside hosts. We propose a modular framework with which these genotype change rates can be estimated. The estimation scheme requires a model of the underlying process of genetic change, a detection scheme that filters this process into observable quantities, and a monitoring scheme that describes the timing of observations. We study a linear "birth-shift-death" model for change in transposable element genotypes, obtaining maximum-likelihood estimators for various detection and monitoring schemes. The method is applied to serial genotypes of the transposon IS6110 in Mycobacterium tuberculosis. The estimated birth rate of 0.0161 (events per copy of the transposon per year) and death rate of 0.0108 are both significantly larger than the estimated shift rate of 0.0018. The sum of these estimates, which corresponds to a "half-life" of 2.4 years for a typical strain that has 10 copies of the element, substantially exceeds a previous estimate of 0.0135 total changes per copy per year. We consider experimental design issues that enable the precision of estimates to be improved. We also discuss extensions to other markers and implications for molecular epidemiology.
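
The quoted half-life can be reproduced from the three rate estimates under one simple reading: the genotype of a strain changes at a constant total rate equal to the copy number times the summed per-copy rates, and the half-life is the time at which the probability of no change falls to one half. That reading, and the variable names below, are assumptions of this sketch rather than a restatement of the paper's derivation.

```python
import math

# Estimated rates from the abstract (events per copy of the transposon per year).
birth_rate = 0.0161
death_rate = 0.0108
shift_rate = 0.0018
copies = 10  # typical copy number used in the abstract

total_rate_per_copy = birth_rate + death_rate + shift_rate  # 0.0287
strain_rate = copies * total_rate_per_copy                  # changes per strain per year

# Time at which exp(-strain_rate * t) = 1/2.
half_life_years = math.log(2) / strain_rate

print(f"total rate per copy: {total_rate_per_copy:.4f}")
print(f"half-life: {half_life_years:.1f} years")
```

The summed per-copy rate of 0.0287 is roughly twice the earlier estimate of 0.0135 that the abstract mentions.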

[12] LA Zhivotovsky, NA Rosenberg, MW Feldman (2003) Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. American Journal of Human Genetics 72: 1171-1186. [PDF] [Data]

We study data on variation in 52 worldwide populations at 377 autosomal short tandem repeat loci, to infer a demographic history of human populations. Variation at di-, tri-, and tetranucleotide repeat loci is distributed differently, although each class of markers exhibits a decrease of within-population genetic variation in the following order: sub-Saharan Africa, Eurasia, East Asia, Oceania, and America. There is a similar decrease in the frequency of private alleles. With multidimensional scaling, populations belonging to the same major geographic region cluster together, and some regions permit a finer resolution of populations. When a stepwise mutation model is used, a population tree based on T_D estimates of divergence time suggests that the branches leading to the present sub-Saharan African populations of hunter-gatherers were the first to diverge from a common ancestral population (~71-142 thousand years ago). The branches corresponding to sub-Saharan African farming populations and those that left Africa diverge next, with subsequent splits of branches for Eurasia, Oceania, East Asia, and America. African hunter-gatherer populations and populations of Oceania and America exhibit no statistically significant signature of growth. The features of population subdivision and growth are discussed in the context of the ancient expansion of modern humans.

[11] NA Rosenberg, JK Pritchard, JL Weber, HM Cann, KK Kidd, LA Zhivotovsky, MW Feldman (2002) Genetic structure of human populations. Science 298: 2381-2385. [Full Text at Science website] [PDF] [Supplement] [Data in structure and NEXUS formats] [Software for drawing figures] [Español]

We studied human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 populations. Within-population differences among individuals account for 93 to 95% of genetic variation; differences among major groups constitute only 3 to 5%. Nevertheless, without using prior information about the origins of individuals, we identified six main genetic clusters, five of which correspond to major geographic regions, and subclusters that often correspond to individual populations. General agreement of genetic and predefined populations suggests that self-reported ancestry can facilitate assessments of epidemiological risks but does not obviate the need to use genetic information in genetic association studies.

[10] NA Rosenberg, D Nettle (2002) Joining forces to uncover human evolutionary history. Trends in Ecology and Evolution 17: 301-302. [PDF]

(No abstract)

[9] NA Rosenberg, M Nordborg (2002) Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms. Nature Reviews Genetics 3: 380-390. [PDF]

Improvements in genotyping technologies have led to the increased use of genetic polymorphism for inference about population phenomena, such as migration and selection. Such inference presents a challenge, because polymorphism data reflect a unique, complex, non-repeatable evolutionary history. Traditional analysis methods do not take this into account. A stochastic process known as "the coalescent" presents a coherent statistical framework for analysis of genetic polymorphisms.

[8] NA Rosenberg (2002) The probability of topological concordance of gene trees and species trees. Theoretical Population Biology 61: 225-247. [PDF]

The concordance of gene trees and species trees is reconsidered in detail, allowing for samples of arbitrary size to be taken from the species. A sense of concordance for gene tree and species tree topologies is clarified, such that if the "collapsed gene tree" produced by a gene tree has the same topology as the species tree, the gene tree is said to be topologically concordant with the species tree. The term speciodendric is introduced to refer to genes whose trees are topologically concordant with species trees. For a given three-species topology, probabilities of each of the three possible collapsed gene tree topologies are given, as are probabilities of monophyletic concordance and concordance in the sense of N. Takahata (1989), Genetics 122, 957-966. Increasing the sample size is found to increase the probability of topological concordance, but a limit exists on how much the topological concordance probability can be increased. Suggested sample sizes beyond which this probability can be increased only minimally are given. The results are discussed in terms of implications for molecular studies of phylogenetics and speciation.

[7] NA Rosenberg, MW Feldman (2002) The relationship between coalescence times and population divergence times. Chapter 9 in M Slatkin and M Veuille, eds. Modern Developments in Theoretical Population Genetics. Oxford: Oxford University Press, pp. 130-164. [PDF of final version]

The divergence time of two populations is the amount of time that has elapsed since the populations arose from an ancestral group, while the coalescence time of a set of copies of a gene is the amount of time that has elapsed since the most recent common ancestor of the gene copies lived. We briefly review the methods that have been used to infer divergence times and coalescence times from genetic data. We then consider the relationship between divergence times and coalescence times in a population genetic model that includes divergence followed by migration between two descendant populations, paying particular attention to the fact that migration can cause coalescence to occur more recently than divergence. Insights gained from the model and its special cases are applied to four examples: the divergences of humans and chimpanzees, modern humans and Neanderthals, Africans and non-Africans, and Native Americans and Asians. For each example, we discuss the connection between hypothesized divergence times and estimated coalescence times.

[6] NA Rosenberg, T Burke, MW Feldman, P Friedlin, MAM Groenen, J Hillel, A Mäki-Tanila, M Tixier-Boichard, A Vignal, K Wimmers, S Weigend (2001) Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159: 699-713. [PDF] [Data] [Photo]

We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8-10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12-15 highly variable markers and only 15-20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units useful in problems of conservation.

[5] MM Tanaka, NA Rosenberg (2001) Optimal estimation of transposition rates of insertion sequences for molecular epidemiology. Statistics in Medicine 20: 2409-2420. [PDF]

Outbreaks of infectious disease can be confirmed by identifying clusters of DNA fingerprints among bacterial isolates from infected individuals. This procedure makes assumptions about the underlying properties of the genetic marker used for fingerprinting. In particular, it requires that each fingerprint changes sufficiently slowly within an individual that isolates from separate individuals infected by the same strain will exhibit similar or identical fingerprints. We propose a model for the probability that an individual's fingerprint will change over a given period of time. We use this model together with published data in order to estimate the fingerprint change rate for IS6110 in human tuberculosis, obtaining a value of 0.0139 changes per copy per year. Although we focus on insertion sequences (IS), our method applies to other fingerprinting techniques such as pulsed-field gel electrophoresis (PFGE). We suggest sampling intervals that produce the least error in estimates of the fingerprint change rate, as well as sample sizes that achieve specified levels of error in the estimate.

[4] NA Rosenberg, E Woolf, JK Pritchard, T Schaap, D Gefel, I Shpirer, U Lavi, B Bonné-Tamir, J Hillel, MW Feldman (2001) Distinctive genetic signatures in the Libyan Jews. Proceedings of the National Academy of Sciences, USA 98: 858-863. [PDF] [Data]

Unlinked autosomal microsatellites in six Jewish and two non-Jewish populations were genotyped, and the relationships among these populations were explored. Based on considerations of clustering, pairwise population differentiation, and genetic distance, we found that the Libyan Jewish group retains genetic signatures distinguishable from those of the other populations, in agreement with some historical records on the relative isolation of this community. Our methods also identified evidence of some similarity between Ethiopian and Yemenite Jews, reflecting possible migration in the Red Sea region. We suggest that high-resolution statistical methods that use individual multilocus genotypes may make it practical to distinguish related populations of extremely recent common ancestry.

[3] L Jin, ML Baskett, LL Cavalli-Sforza, LA Zhivotovsky, MW Feldman, NA Rosenberg (2000) Microsatellite evolution in modern humans: a comparison of two data sets from the same populations. Annals of Human Genetics 64: 117-134. [PDF] [Data]

We genotyped 64 dinucleotide microsatellite repeats in individuals from populations that represent all inhabited continents. Microsatellite summary statistics are reported for these data, as well as for a data set that includes 28 out of 30 loci studied by Bowcock (1994) in the same individuals. For both data sets, diversity statistics such as heterozygosity, number of alleles per locus, and number of private alleles per locus produced the highest values in Africans, intermediate values in Europeans and Asians, and low values in Americans. Evolutionary trees of populations based on genetic distances separated groups from different continents. Corresponding trees were topologically similar for the two data sets, with the exception that the (δμ)² genetic distance reliably distinguished groups from different continents for the larger data set, but not for the smaller one. Consistent with our results from diversity statistics and from evolutionary trees, population growth statistics S_k and β, which seem particularly useful for indicating recent and ancient population size changes, confirm a model of human evolution in which human populations expand in size and through space following the departure of a small group from Africa.

[2] JK Pritchard, M Stephens, NA Rosenberg, P Donnelly (2000) Association mapping in structured populations. American Journal of Human Genetics 67: 170-181. [PDF]

The use in association studies of the forthcoming dense genome-wide collection of SNPs has been heralded as a potential breakthrough in studying the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favour of family-based tests of association, such as the TDT, but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings, and may substantially outperform it if there are conflicting associations in different subpopulations.

[1] JK Pritchard, NA Rosenberg (1999) Use of unlinked genetic markers to detect population stratification in association studies. American Journal of Human Genetics 65: 220-228. [PDF]

We examine the issue of population stratification in association mapping studies. In case-control studies of association, population subdivision or recent admixture of populations can lead to spurious associations between a phenotype and unlinked candidate loci. Using a model of sampling from a structured population, we show that if population stratification exists, it can be detected using unlinked marker loci. We show that the case-control study design using unrelated control individuals is a valid approach for association mapping, provided that marker loci unlinked to the candidate locus are included in the study in order to test for stratification. We suggest guidelines for how many unlinked marker loci should be used.