Using novel sequence assembly strategies to resolve the transformer gene family
Transformer (Trf; formerly 185/333) is an immune-related multigene family found exclusively in sea urchins. Experiments have shown Trf to be unusually diversified in Strongylocentrotus purpuratus (SpTrf), with its genome encoding an estimated 80-120 alleles. However, the published S. purpuratus genome assembly contained only 2 Trf genes, with successive revisions raising that number to 6. This major discrepancy has plagued the study of SpTrf, making experimental findings difficult to interpret.
In this thesis I develop bioinformatic tools and methods designed to resolve problematic genomic loci with demonstrable confidence. These tools include novel algorithms that efficiently index and query large raw sequencing datasets, allowing them to be run on a desktop computer with minimal memory use. I then use these methods to resolve Trf in the local sea urchin Heliocidaris tuberculata (HtTrf), and find that its genome encodes only four such genes in a single cluster. In the process, I uncover an experimental artefact in the methods used to characterize SpTrf diversity that would dramatically inflate perceived diversity when applied to HtTrf. Combined, these findings imply that SpTrf may have been substantially mischaracterized as a hypervariable gene family.
To revise and expand the current understanding of Trf, I characterize Trf genes in a wide variety of sea urchin species, most of which were previously unstudied. As part of this process, I resolve SpTrf to be far less numerous (15 alleles in a single individual) than empirically estimated, but still misrepresented in genome assemblies. The survey of species reveals a clear phylogenetic division in which all 19 echinidean species possess Trf while all non-echinideans lack it, implying the relatively recent de novo evolution of Trf in a common ancestor of echinideans. In each echinidean species, Trf is a multigene family, but sequence similarity among paralogs is unexpectedly high. This lack of divergence appears to be explained by the arrangement of Trf genes in genomic clusters, where evidence of frequent duplications, deletions and gene conversion events can be observed. These observations are consistent with Trf being a multigene family that evolves by concerted evolution, where gene conversion acts to homogenize paralogs.
Overall, this thesis serves to identify and resolve the mischaracterization of Trf genes, culminating in a comprehensive investigation into Trf diversity and evolution. It also represents a case study in how multigene families can be misrepresented in sequence assembly and how targeted analysis methods can provide a high-confidence resolution.