Advanced bioinformatics approaches for hybrid de novo whole-genome assembly
Over the last two decades, there has been an explosion of genomic data, giving researchers the opportunity to view the mechanics of life in much greater detail. This rush of data has also spurred the development of a multitude of tools to both generate and analyse it. While this has undoubtedly facilitated advances hitherto only dreamed of, it has also created a maze of possible tool combinations for achieving particular goals, as in the case of genome assembly. Genomes of model and novel organisms are being sequenced at an increasing rate, whether for conservation, industrial application, or pest control. Choices of sequencing technology, such as Illumina short reads or Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) long reads, combined with dozens of different algorithmic choices for initial assembly, draft polishing, and post-processing steps, have created a daunting mountain of decisions that must be made before commencing a new genome assembly project, many of which cannot be based on any clear rationale or criteria.
This is particularly true for eukaryotic organisms, as most comprehensive, cross-technology benchmarking work in this area to date has focused on bacterial genomes. In order to provide a resource for researchers in the field, I undertook a comprehensive benchmark of assembly and polishing tools for eukaryotic genome construction, using two organisms as gold standards (Caenorhabditis elegans and Drosophila melanogaster) and three different next-generation and third-generation sequencing technologies (Illumina, ONT, and PacBio) at varying levels of input sequencing coverage. The draft genomes resulting from these tools were assessed against best-practice community metrics, such as contiguity, gene completeness, and accuracy relative to the reference, while the tools themselves were compared for computational performance, namely speed and memory requirements. From these results, I determined a set of recommended tools for genome assembly and polishing for eukaryotic organisms.
To further assist researchers in their endeavours to assemble novel eukaryotic genomes, I have transformed the workflow of algorithms and packages required for the benchmarking work into a Singularity container pipeline, Pyro. This pipeline takes approximate initial information about the organism under consideration, such as genome size estimates, as well as the sequencing technology used and the computational resources available, and produces a number of high-quality draft genomes for evaluation. It is also configurable, allowing more advanced users to assemble genomes with their assemblers and polishers of choice, and it provides comprehensive analytics about the draft genome upon completion.
The production of high-quality, chromosome-level genome assemblies for Australian insect pests has been a high priority for many researchers, in order to further facilitate biological control of their populations, better detect possible invasions across borders, and more deeply understand their evolution and genetics. I have applied the Pyro pipeline to generate an initial draft assembly for an Australian fruit fly pest, Bactrocera jarvisi, and have improved the output assembly with Hi-C sequencing to produce an annotated assembly with chromosome-level resolution. I then conducted a comparative genomic analysis of this genome against its close Bactrocera relatives to construct a phylogenetic tree from highly conserved orthologues. Synteny analysis between Bactrocera jarvisi and two other dipteran species, Bactrocera tryoni and Drosophila melanogaster, reveals substantial levels of chromosomal conservation between Drosophila and Bactrocera, opening the door to further functional exploitation of results from the model higher dipteran Drosophila melanogaster in bactroceran studies.
This thesis helps to further the field of genomics by providing a benchmark of computational tools for eukaryotic genome assembly, as well as a novel configurable pipeline of assembly and polishing tools, Pyro, in order to facilitate other researchers’ efforts in the area and thereby help accelerate progress in the field. To the best of my knowledge, the benchmarking study is the first to focus exclusively on eukaryotic genomes. I have also generated the first chromosome-level genome assembly of the insect pest Bactrocera jarvisi, confirming its phylogeny within Bactrocera species, and have reported substantial conservation of chromosomal organisation across taxa, providing a firm basis for future bactroceran research.