
Formatting errors.

We have implemented reasonably careful error checking to make sure that the data set is in the correct format, and the program will attempt to indicate the nature of any problems it finds. The front end requires a return at the end of each row and does not allow returns within rows; the command-line version of structure treats returns in the same way as spaces or tabs. One problem that can arise is that editing programs used to assemble the data before importing them into structure can introduce hidden formatting characters, often at the ends of lines or at the end of the file. The front end can remove many of these automatically, but this type of problem may be responsible for errors when the data file appears to be in the right format. If you are importing data to a UNIX system, the dos2unix utility can be helpful for cleaning these up.
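Where dos2unix is not available, a similar clean-up can be sketched in a few lines of Python. This is only an illustration and not part of structure: the function name clean_text and the list of characters it handles (a UTF-8 byte-order mark, DOS or old Mac line endings, the DOS end-of-file marker, non-breaking spaces, trailing whitespace) are assumptions about what an editor might leave behind, and the input is assumed to be ASCII or UTF-8.

```python
def clean_text(raw: bytes) -> str:
    """Return file contents with some common hidden characters removed.

    Hypothetical helper for illustration; assumes ASCII or UTF-8 input.
    """
    text = raw.decode("utf-8-sig")  # also drops a leading UTF-8 byte-order mark
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # DOS / old Mac returns
    text = text.replace("\x1a", "")      # DOS end-of-file marker (Ctrl-Z)
    text = text.replace("\u00a0", " ")   # non-breaking spaces pasted from editors
    lines = [line.rstrip() for line in text.split("\n")]  # trailing blanks/tabs
    while lines and lines[-1] == "":     # empty lines at the end of the file
        lines.pop()
    return "\n".join(lines) + "\n"
```

To use it, read the data file in binary mode, pass the bytes to clean_text, and write the result to a new file rather than overwriting the original, so that the two versions can be compared.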

Table 2: Format of the data file, in two-row format. Most of these components are optional (see text for details). $M_l$ is an identifier for marker $l$. $D_{l,l+1}$ is the distance between markers $l$ and $l+1$. $ID^{(i)}$ is the label for individual $i$. $g^{(i)}$ is the geographic origin of individual $i$ (PopData). $f^{(i)}$ is a flag used to incorporate learning samples (PopFlag). $\phi^{(i)}$ can store a phenotype for individual $i$. $y_1^{(i)},...,y_n^{(i)}$ are for storing extra data (ignored by the program). $(x^{(i,1)}_l, x^{(i,2)}_l)$ stores the genotype of individual $i$ at locus $l$. $p_l^{(i)}$ is the phase information for marker $l$ in individual $i$.
| Label | Pop | Flag | Phen | ExtraCols | Loc 1 | Loc 2 | Loc 3 | .... | Loc $L$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | | | $M_1$ | $M_2$ | $M_3$ | .... | $M_L$ |
| | | | | | -1 | $D_{1,2}$ | $D_{2,3}$ | .... | $D_{L-1,L}$ |
| $ID^{(1)}$ | $g^{(1)}$ | $f^{(1)}$ | $\phi^{(1)}$ | $y_1^{(1)},...,y_n^{(1)}$ | $x^{(1,1)}_1$ | $x^{(1,1)}_2$ | $x^{(1,1)}_3$ | .... | $x^{(1,1)}_L$ |
| $ID^{(1)}$ | $g^{(1)}$ | $f^{(1)}$ | $\phi^{(1)}$ | $y_1^{(1)},...,y_n^{(1)}$ | $x^{(1,2)}_1$ | $x^{(1,2)}_2$ | $x^{(1,2)}_3$ | .... | $x^{(1,2)}_L$ |
| | | | | | $p_1^{(1)}$ | $p_2^{(1)}$ | $p_3^{(1)}$ | .... | $p_L^{(1)}$ |
| $ID^{(2)}$ | $g^{(2)}$ | $f^{(2)}$ | $\phi^{(2)}$ | $y_1^{(2)},...,y_n^{(2)}$ | $x^{(2,1)}_1$ | $x^{(2,1)}_2$ | $x^{(2,1)}_3$ | .... | $x^{(2,1)}_L$ |
| $ID^{(2)}$ | $g^{(2)}$ | $f^{(2)}$ | $\phi^{(2)}$ | $y_1^{(2)},...,y_n^{(2)}$ | $x^{(2,2)}_1$ | $x^{(2,2)}_2$ | $x^{(2,2)}_3$ | .... | $x^{(2,2)}_L$ |
| | | | | | $p_1^{(2)}$ | $p_2^{(2)}$ | $p_3^{(2)}$ | .... | $p_L^{(2)}$ |
| .... | | | | | | | | | |
| $ID^{(i)}$ | $g^{(i)}$ | $f^{(i)}$ | $\phi^{(i)}$ | $y_1^{(i)},...,y_n^{(i)}$ | $x^{(i,1)}_1$ | $x^{(i,1)}_2$ | $x^{(i,1)}_3$ | .... | $x^{(i,1)}_L$ |
| $ID^{(i)}$ | $g^{(i)}$ | $f^{(i)}$ | $\phi^{(i)}$ | $y_1^{(i)},...,y_n^{(i)}$ | $x^{(i,2)}_1$ | $x^{(i,2)}_2$ | $x^{(i,2)}_3$ | .... | $x^{(i,2)}_L$ |
| | | | | | $p_1^{(i)}$ | $p_2^{(i)}$ | $p_3^{(i)}$ | .... | $p_L^{(i)}$ |
| .... | | | | | | | | | |
| $ID^{(N)}$ | $g^{(N)}$ | $f^{(N)}$ | $\phi^{(N)}$ | $y_1^{(N)},...,y_n^{(N)}$ | $x^{(N,1)}_1$ | $x^{(N,1)}_2$ | $x^{(N,1)}_3$ | .... | $x^{(N,1)}_L$ |
| $ID^{(N)}$ | $g^{(N)}$ | $f^{(N)}$ | $\phi^{(N)}$ | $y_1^{(N)},...,y_n^{(N)}$ | $x^{(N,2)}_1$ | $x^{(N,2)}_2$ | $x^{(N,2)}_3$ | .... | $x^{(N,2)}_L$ |
| | | | | | $p_1^{(N)}$ | $p_2^{(N)}$ | $p_3^{(N)}$ | .... | $p_L^{(N)}$ |
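
Before running the program, the two-row layout of Table 2 can be checked roughly with a short script. The following Python sketch is an assumption-laden illustration, not structure's own error checking: the function name check_two_row_format is hypothetical, the file is assumed to include the Label column and no phase rows, and the caller must say how many header rows (marker names, distances) precede the individuals.

```python
def check_two_row_format(lines, n_header_rows=0):
    """Return a list of problem descriptions (empty if none were found).

    Hypothetical checker: assumes a Label column and no phase-information rows.
    Row numbers in messages count data rows only (headers excluded).
    """
    # Split on any whitespace, since spaces and tabs are both field separators.
    rows = [line.split() for line in lines[n_header_rows:] if line.strip()]
    problems = []
    if len(rows) % 2 != 0:
        problems.append(f"odd number of data rows ({len(rows)}); expected two per individual")
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(f"rows have differing numbers of fields: {sorted(widths)}")
    # Consecutive rows (1,2), (3,4), ... should carry the same individual label.
    for k in range(0, len(rows) - 1, 2):
        if rows[k][0] != rows[k + 1][0]:
            problems.append(f"data rows {k + 1} and {k + 2} have different labels")
    return problems
```

For example, a file with one row of marker names followed by two rows per individual would be checked with check_two_row_format(lines, n_header_rows=1); an empty list means that none of these particular problems was detected, not that the file is guaranteed to be valid.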



Jonathan Pritchard 2003-07-10