
Informal pointers for choosing $K$; is the structure real?

There are a couple of informal pointers which might be helpful in selecting $K$. The first is that it is often the case that ${\rm Pr}(K)$ is very small (effectively zero) for $K$ less than the appropriate value, and then more-or-less plateaus for larger $K$, as in the example of Data Set 2A shown above. In this sort of situation, where several values of $K$ give similar estimates of log ${\rm Pr}(X\vert K)$, it seems that the smallest of these is often ``correct''.
It is a bit difficult to provide a firm rule for what we mean by ``more-or-less plateaus''. For small data sets, this might mean that the values of log ${\rm Pr}(X\vert K)$ are within 5-10 of each other, but our colleague Daniel Falush writes that ``in very big datasets, the difference between $K=3$ and $K=4$ may be 50, but if the difference between $K=3$ and $K=2$ is 5,000, then I would definitely choose $K=3$.'' I think that a sensible way to think about this is in terms of model choice. That is, we may not always be able to know the TRUE value of $K$, but we should aim for the smallest value of $K$ that captures the major structure in the data.
The second pointer is that if there really are separate populations, there is typically a lot of information about the value of $\alpha$, and once the Markov chain converges, $\alpha$ will normally settle down to be relatively constant (usually with a range of perhaps 0.2 or less in examples I have looked at). However, if there isn't any real structure, $\alpha$ will usually vary greatly during the course of the run. A corollary of this is that when there is no population structure, you will typically see that the proportion of the sample assigned to each population is roughly symmetric ($\sim 1/K$ in each population), and most individuals will be fairly admixed.
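The first pointer can be turned into a rough helper for comparing runs. The sketch below is only one possible reading of ``more-or-less plateaus'': it treats a value of $K$ as being on the plateau when its log ${\rm Pr}(X\vert K)$ is within a fraction of the total spread of the best value, and returns the smallest such $K$. The function name `smallest_plateau_k`, the `rel_tol` parameter, and the example numbers are all assumptions made for illustration; they are not part of structure and are not taken from any real data set.

```python
def smallest_plateau_k(log_prob_by_k, rel_tol=0.05):
    """Return the smallest K whose log Pr(X|K) is close to the best value.

    log_prob_by_k maps K -> estimated log Pr(X|K) from separate runs.
    rel_tol is a hypothetical tuning knob: the plateau width is taken
    to be rel_tol times the spread between the best and worst values.
    """
    best = max(log_prob_by_k.values())
    worst = min(log_prob_by_k.values())
    width = rel_tol * (best - worst)
    # Keep every K whose estimate falls within the plateau width of the best.
    on_plateau = [k for k, lp in log_prob_by_k.items() if best - lp <= width]
    return min(on_plateau)


# Illustrative numbers only, patterned on the quoted remark: the gap
# between K=3 and K=4 is 50, while the gap between K=2 and K=3 is 5,000.
big_data_set = {2: -105000.0, 3: -100000.0, 4: -99950.0}
print(smallest_plateau_k(big_data_set))
```

Scaling the tolerance by the overall spread is a design choice intended to mimic the point that a difference of 50 can be negligible in a very big data set; for small data sets a fixed tolerance of 5-10 may be closer to what the text describes. Either way, the result is a prompt for judgment rather than a decision rule.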
If some individuals are strongly assigned to one population or another, and if the proportions assigned to each group are asymmetric, then this is a strong indication that you have real population structure.
Suppose that you have a situation with two clear populations, but you are trying to decide whether one of these is further subdivided (i.e., the value of ${\rm Pr}(X\vert K=3)$ is similar to, or perhaps a little larger than, ${\rm Pr}(X\vert K=2)$). Then one thing you could try is to run structure using only the individuals in the population that you suspect might be subdivided, and see whether there is a strong signal as described above.
In summary, you should be skeptical about population structure inferred on the basis of small differences in ${\rm Pr}(K)$ if (1) there is no clear biological interpretation for the assignments, and (2) the assignments are roughly symmetric to all populations and no individuals are strongly assigned.
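Condition (2) of the summary can be checked mechanically once the estimated membership proportions for each individual are available. The sketch below assumes these have already been read into a list of rows, one per individual, with one proportion per population; the function name `assignment_summary` and the thresholds `strong` and `sym_tol` are hypothetical choices for illustration, not values recommended by the text.

```python
def assignment_summary(q, strong=0.9, sym_tol=0.1):
    """Summarise membership proportions for n individuals and K populations.

    q is a list of rows, each row holding one individual's estimated
    membership proportion in each of the K populations.
    strong and sym_tol are assumed thresholds, chosen for illustration.
    """
    n = len(q)
    k = len(q[0])
    # Proportion of the sample assigned to each population.
    proportions = [sum(row[j] for row in q) / n for j in range(k)]
    # Individuals whose largest membership proportion is at least `strong`.
    strongly_assigned = sum(1 for row in q if max(row) >= strong)
    # "Roughly symmetric" here means every proportion is near 1/K.
    symmetric = all(abs(p - 1.0 / k) <= sym_tol for p in proportions)
    return {
        "proportions": proportions,
        "strongly_assigned": strongly_assigned,
        "looks_unstructured": symmetric and strongly_assigned == 0,
    }


# Made-up inputs: one fairly admixed sample, one with clear assignments.
admixed = [[0.50, 0.50], [0.45, 0.55], [0.55, 0.45]]
structured = [[0.98, 0.02], [0.97, 0.03], [0.95, 0.05], [0.03, 0.97]]
print(assignment_summary(admixed)["looks_unstructured"])
print(assignment_summary(structured)["looks_unstructured"])
```

A result flagged as unstructured only covers the second condition; whether the assignments have a clear biological interpretation still has to be judged separately.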
William Wen 2002-07-18