Welcome to the Kingsley Lab `gasAcu1-4` assembly hub

The gasAcu1-4 assembly is a new version of the threespine stickleback reference genome. It differs only slightly from the 2017 Hi-C guided assembly generated by Peichel, Sullivan, Liachko, and White: the changes improve the subtelomeric Pitx1 locus and the mitochondrial genome.

The subtelomeric region of chrVII is of great biological interest due to its well-documented role in controlling pelvic spine development (Shapiro et al. 2004). However, because it is highly repetitive, it is extremely difficult to assemble. Many important sequences, including the key gene Pitx1, are missing entirely from existing genome assemblies, while other sequences in the region are scattered in small unassembled scaffolds. We address this issue by including the sequence from Salmon River BAC clones (GenBank GU130435) (Chan et al. 2010) as chrP and removing overlapping fragmented sequences from chrUn and the end of chrVII. Note that chrP is derived from a marine population, while the rest of the genome is from a freshwater population (Bear Paw [BEPA] Lake), so all analyses involving this chromosome must be interpreted with that difference in mind.

The mitochondrial genome was previously split into two fragments buried within chrUn, while a separate mitochondrial genome sequence from Northern Japan was added as chrM. We corrected these issues by removing the duplicated Bear Paw Lake mitochondrial genome from chrUn and using it to replace the exogenous chrM sequence, resulting in a single copy of the mitochondrial genome derived entirely from Bear Paw Lake.

Aside from these changes, the Peichel et al. 2017 reference genome was not modified.

Methods

Please see the individual track description pages for details on the data within each track, or consult Roberts Kingman et al. 2021.

Acknowledgements

We thank A. Hinrichs, H. Clawson, K. Smith, and D. Karolchik for their contributions to the UCSC Stickleback Genome Browser annotations for gasAcu1, which were lifted to gasAcu1-4 for this study.

Data Availability

A copy of the files underlying this assembly hub can also be downloaded from FigShare project # 162634.

Contact

Please email David Kingsley, the Kingsley Lab, or Krishna Veeramah with questions about the hub or the associated manuscript (Roberts Kingman et al. 2021).

-----

Additional Information

What is a genome browser and how do I make it work?

We live at a remarkable time when the complete DNA sequence of many living organisms is being decoded. To facilitate the study of genome-wide information, several groups have developed powerful software tools that make it possible to search, visualize, interpret, and download the genetic instruction set of sequenced species. For basic information about how the browser works, see Karolchik et al. 2011, or click on the Help menu available in the blue bar at the top of the browser window.

How do I find a particular gene in the stickleback genome?

You can enter a gene symbol, a keyword, or a chromosome coordinate range in the search box at the top of the browser. Examples: ectodysplasin, EDA, chrIV:12,800,220-12,810,446
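If you want to reuse a coordinate range from the search box in your own scripts, it helps to split it into its parts first. The following is a minimal Python sketch, not part of the browser itself; the function name parse_position is hypothetical, and the length calculation assumes the 1-based, inclusive coordinates that the browser's position box displays.

```python
import re


def parse_position(position):
    """Split a 'chrom:start-end' string (commas allowed) into its parts."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", position)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom = match.group(1)
    # Remove the thousands separators before converting to integers.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


chrom, start, end = parse_position("chrIV:12,800,220-12,810,446")
# Both ends are included, so the span is end - start + 1 bases.
print(chrom, start, end, end - start + 1)
```

For the example range above, this prints the chromosome name, the two integer coordinates, and a span of 10227 bases.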

What are those complicated looking names found in the gene prediction tracks?

Ensembl has a very useful computational pipeline for predicting likely genes and gene products in large-scale genome assemblies (Curwen et al. 2004). The predictions for sticklebacks take into account sequence motifs and information from >300,000 sequenced ESTs from a variety of stickleback cDNA libraries (Kingsley et al. 2004). Each of the resulting gene predictions is assigned a unique ID made up of the following letters and numbers: ENS for Ensembl; GAC for Gasterosteus aculeatus; the single letter G, T, or P to denote a predicted gene, transcript, or protein; and then a string of numbers unique to each prediction (e.g., ENSGACT00000026917). The Ensembl designations work as search terms in the browser. All Ensembl genes can also be viewed and intersected with other data using the "Ensembl genes" track found beneath the browser window.
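The ID structure described above can be checked mechanically when you are sorting through lists of identifiers. Here is a minimal Python sketch based only on that description; the names ENSEMBL_GAC, FEATURE_TYPES, and describe_ensembl_id are hypothetical, and the pattern does not assume any particular number of digits.

```python
import re

# ENS (Ensembl) + GAC (Gasterosteus aculeatus) + one of G/T/P + digits.
ENSEMBL_GAC = re.compile(r"ENSGAC([GTP])(\d+)")
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}


def describe_ensembl_id(identifier):
    """Return (feature type, numeric part), or None if the ID does not fit."""
    match = ENSEMBL_GAC.fullmatch(identifier)
    if match is None:
        return None
    return FEATURE_TYPES[match.group(1)], match.group(2)


print(describe_ensembl_id("ENSGACT00000026917"))  # a transcript ID
print(describe_ensembl_id("EDA"))                 # None: a gene symbol
```

The numeric part is kept as a string so that its leading zeros are preserved.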

How can I find informative genetic markers for studying genomic regions of interest?

If you want to design new microsatellite markers to tag a specific gene or genomic region, you can visualize promising locations where simple repeat sequences have been identified in the reference stickleback genome. If not already visible, display the "Simple Repeats" and "Microsatellites" tracks by clicking the "Repeats" section beneath the genome browser window. DNA primers made to unique sequences flanking di- and tri-nucleotide repeats have a high likelihood of revealing size polymorphism in other individuals and populations (Peichel et al. 2001).
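To illustrate what a di- or tri-nucleotide repeat looks like in sequence, the following Python sketch scans a DNA string for simple tandem repeats. It is only an illustration and is not the method used to build the repeat tracks; the function name find_microsatellites, the default threshold of six copies, and the test sequence are all assumptions made for this example.

```python
import re


def find_microsatellites(sequence, min_repeats=6):
    """Return (start, end, unit, copies) for di-/tri-nucleotide repeats.

    Coordinates are 0-based with an exclusive end, like Python slices.
    """
    # A 2- or 3-base unit followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip single-base runs such as AAAAAA
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# Made-up sequence: 8 copies of CA and 6 copies of CAG with short flanks.
example = "GGT" + "CA" * 8 + "TTC" + "CAG" * 6 + "AT"
print(find_microsatellites(example))
# [(3, 19, 'CA', 8), (22, 40, 'CAG', 6)]
```

Primers would then be designed to unique sequence on either side of each reported interval.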

Finally, you can also use the browser to view the positions of millions of genomic variants predicted from re-sequencing data in 227 natural stickleback populations from the Pacific and Atlantic ocean basins (Jones et al. 2012b and Roberts Kingman et al. 2021).

Toggle the "227 Genomes Variants" track beneath the browser window to full view. The 227 visual genotype tracks that appear show the predicted value of these polymorphic loci in the 227 different marine and freshwater individuals that were sequenced. If you are zoomed in far enough in any genomic region, you can hover your pointer over any colored variant to see its exact genome position, the value of the corresponding position in the reference Bear Paw (BEPA) genome, and the alternative allele detected in reads from other genomes.

Can I download the genome-wide variant information that comes from large-scale stickleback re-sequencing?

Single nucleotide polymorphisms (SNPs) and other short sequence variants identified by Roberts Kingman et al. 2021 are contained in the track called "227 Genomes Variants". You can use the Table Browser tool to download all variants in particular regions, as well as the corresponding genotype calls in different populations. Click on the "Table Browser" heading found under the "Tools" section in the blue banner at the top of the browser window. In the Table Browser page that opens, select clade: "kingsleyAssemblyHub", genome: "G. aculeatus (BEPA with SALR chrP)", assembly: "gasAcu1-4", group: "All Tracks", track: "227 Genomes Variants", table: "hub_3964941_227genomes_variants", and enter the coordinates of your favorite region (or choose the entire gasAcu1-4 genome). By also clicking the intersection button in the Table Browser, you can intersect the sequence variant information with many other tracks available in the browser. For example, to list the subset of variants occurring within predicted genes, start a table search with the "227 Genomes Variants" track as described above, click the intersect button, and choose track: "Ensembl genes". Additional controls in the Table Browser make it possible to restrict your search to particular fields within a track, get summary statistics for your search, and choose different formats for exporting the data. A copy of the Variant Call Format (VCF) file underlying this track is available at FigShare.
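If you download the VCF file and want to pull out the variants in one region without the Table Browser, a few lines of Python are enough for a first look. The sketch below relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...) and 1-based positions; the function name variants_in_region is hypothetical, and the records shown are made up for illustration rather than taken from the real data.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Yield (chrom, pos, ref, alt) for records with start <= POS <= end."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        record_chrom, pos = fields[0], int(fields[1])
        ref, alt = fields[3], fields[4]
        if record_chrom == chrom and start <= pos <= end:
            yield record_chrom, pos, ref, alt


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chrIV\t12800300\t.\tA\tG\t50\tPASS\t.\n",
    "chrIV\t12900000\t.\tC\tT\t50\tPASS\t.\n",
    "chrI\t12800300\t.\tG\tA\t50\tPASS\t.\n",
]
for record in variants_in_region(example_vcf, "chrIV", 12800220, 12810446):
    print(record)
```

With a real file, you would pass an open file handle in place of example_vcf. For a genome-wide file, an indexed lookup with a dedicated VCF tool will be much faster than this line-by-line scan.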

How do I find genomic regions that show repeated differentiation between sequenced marine and freshwater populations?

In 2005, Colosimo et al. showed that repeated evolution of armor plate changes in many different freshwater sticklebacks occurred via an ancient shared haplotype at the Ectodysplasin locus, which encodes a key developmental signaling molecule. The repeated use of ancient haplotypes gives rise to a distinctive pattern of allele sharing: all fish that share the same armor phenotypes also share similar distinctive sequences at the key locus controlling the armor trait, a pattern that is dramatically different than the phylogeographic patterns seen at other neutral loci (Colosimo et al. 2005). Jones et al. 2012b looked for this distinctive pattern throughout the genome using two different computational methods. You can visualize the other repeatedly used regions found by marine-freshwater Cluster Separation Scores (CSS) by turning on the corresponding tracks listed under "Jones et al. 2012" controls beneath the browser window. Click on the underlined text above any track button to get a brief summary of the type of information presented in the track, and consult the Jones et al. 2012b paper for detailed information about the CSS methods. The default window in the browser opens to the region surrounding the prototypical Ectodysplasin gene, and illustrates the characteristic sequence patterns now being used to recover many other loci that also underlie repeated evolution in natural environments.

An Excel table listing key regions recovered by the CSS approach is available as part of the supplementary information in Jones et al. 2012b. These 81 regions can be viewed in bed format by turning on the track "CSS_0.02" in the "Jones et al. 2012" Track Group at the bottom of the page. You can also do your own custom searches and downloads of CSS data using the "Table Browser" tool. Navigate to "Tools>Table Browser" in the blue banner at the top of the browser. In the new Table Browser window that opens up, select group: "Jones et al. 2012" and set track: to any particular CSS analysis you want to look at in greater detail. You can intersect these tracks with any other tracks in the browser, get summary statistics for your search, and download data in a variety of other formats by choosing the appropriate buttons at the bottom of the Table Browser.
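Regions exported in BED format can be summarized with a short script. The following Python sketch assumes only the standard BED layout (chromosome, start, end in the first three columns, with a 0-based start and an exclusive end, unlike the 1-based ranges typed into the browser's search box); the function names read_bed and total_length are hypothetical, and the intervals are made up rather than taken from the CSS tracks.

```python
def read_bed(lines):
    """Return a list of (chrom, start, end) from BED-formatted lines."""
    regions = []
    for line in lines:
        # Skip blank lines, comments, and optional track/browser headers.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def total_length(regions):
    """Sum interval lengths; BED ends are exclusive, so length = end - start."""
    return sum(end - start for _chrom, start, end in regions)


# Made-up intervals for illustration only.
example_bed = [
    "track name=example\n",
    "chrIV\t100\t250\n",
    "chrIV\t400\t1000\n",
    "chrI\t0\t50\n",
]
regions = read_bed(example_bed)
print(len(regions), total_length(regions))  # 3 800
```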

What is chromosome "Un"?

Genome sequencing projects are jigsaw puzzles with millions of pieces. The primary sequence data are short sequence reads made up of strings of As, Cs, Gs, and Ts. Many of these sequences overlap with each other, and can be aligned to produce longer continuous strings of sequence, called "contigs". Adjacent "contigs" can be joined into larger "scaffolds", based on bridging clones whose end sequences map to different contigs. Finally, the larger "scaffolds" can be roughly positioned on chromosomes by following the inheritance pattern of genetic sequences in families. However, the process of connecting, lining up, and ordering all pieces is still incomplete. Most of the large sequence scaffolds have been tied to previously mapped chromosome linkage groups. In some of these cases, there was only a single informative genetic marker within a scaffold, or no recombination between internal markers within the scaffold in the small genetic mapping cross used. In these cases, the scaffold has been correctly assigned to a particular linkage group, but the orientation of the scaffold within that linkage group was arbitrary and may be changed by further mapping. In addition, many of the smaller scaffolds in the genome assembly did not contain any mapped genetic markers. These scaffolds have been concatenated together into an artificial "Unmapped" chromosome, so that they can still be searched and analyzed in the browser. As sequence and mapping information continue to grow in the future, more and more Unmapped scaffolds will be linked to sequences on other chromosomes and the large-scale order and orientation of all scaffolds will continue to improve.

Is anything missing from the stickleback genome?

Yes! Sequence reads in highly repetitive regions are difficult to align and assemble. For example, the Pitx1 gene maps near a chromosome end, and the reads in this highly repetitive subtelomeric region failed to incorporate with the rest of the stickleback assembly (Chan et al. 2010). We think that this problem affects a relatively small number of genes, since more than 97% of cloned RNAs from stickleback tissues (ESTs) do have corresponding genes in the reference assembly. If you don't find a stickleback ortholog of your favorite gene in the reference assembly, you can also try searching against raw reads in the trace archives, setting the database to "Gasterosteus aculeatus".

Although Roberts Kingman et al. predict millions of genome-wide sequence variants, many additional polymorphisms have undoubtedly been missed because of the requirement that all predicted variants be supported by a minimum number of reads. This criterion helps minimize false positives in high-throughput sequencing data, and works well for finding regions of shared divergence between many different marine and freshwater populations. However, the same method will under-recover variants that are unique to single individuals or populations.

Finally, any automated annotation of genes and transcripts across an entire genome is likely to have many local prediction errors, including undetected genes, incorrect start or termination sites, missed exons, concatenation of exons from separate genes into single predictions, and poor recovery of small genes or non-coding RNAs. Comparison of automated gene predictions with the actual expressed sequences identified in stickleback tissues or other organisms may help sort out some of these prediction errors. However, the gold standard for confirming interesting observations will always be actual experiments. The genome browser can serve as a convenient starting point for exploring massive amounts of genetic information about the rapidly evolving stickleback species complex. We hope you will find it useful as you get started on your own questions and research!
