Protocols for dimension reduction of transcriptomic data
Thesis posted on 2022-03-29, 01:09, authored by Timothy John Peters
The elucidation of the complex aetiology of human disease has been accelerated by recent advances in systems biology. Generation of high-dimensional datasets through gene expression profiling is an inevitable component of this research. Bioinformaticians are presented with this unabridged data by scientists seeking biological insights. Their role is that of a software engineer as well as a scientist; they are needed to facilitate the analysis by building software that performs dimension reduction. The desired outcome of dimension reduction is to find a handful of genes whose expression values reliably diagnose unlabelled samples. This thesis discusses the issues faced in bioinformatic classification and feature selection, and culminates in the development of a protocol to generate groups of genes that illuminate the nature of human disease. -- The literature review describes how microarray technology facilitates gene expression profiling, and charts the journey from hybridization to a normalised dataset. It then follows the development of dimension reduction techniques over the last 20 to 30 years. Moving from early techniques, it covers the three main strains of data mining algorithms: Discriminant Analysis, Decision Trees and the shrinkage family. -- This thesis contains three articles (two of which have been published), each describing a statistical concept that needs consideration whenever dimension reduction is performed. By way of example, supervised learning of the transcriptomes of lymphoma patients is carried out in each study. The first shows that the common practice of scoring features individually and ranking them by these scores is too superficial a method of assessing their degrees of biological relevance. We show the need to assess the gestalt discriminatory power of feature sets, the implications of this power in algorithm design and optimisation, and the intuitive relationship of this concept to biological phenomena.
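The point about scoring features individually can be illustrated with a toy example. The sketch below is not the two-step cross-entropy method described in the thesis; it assumes hypothetical binarised expression values for three genes (g1, g2, g3) and, as a stand-in measure, scores a feature set by the resubstitution accuracy of a majority-vote lookup table. Here g1 and g2 look uninformative on their own yet separate the classes perfectly together, while the individually best gene g3 adds nothing to g1.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical binarised expression (0 = low, 1 = high) for genes g1, g2, g3.
X = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
    (0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1),
]
# Class labels; in this toy data the label equals g1 XOR g2.
y = [0, 1, 1, 0, 0, 1, 1, 0]


def subset_accuracy(rows, labels, features):
    """Resubstitution accuracy of a majority-vote lookup table on a feature subset."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[j] for j in features)
        groups[key][label] += 1
    # Within each distinct value combination, the majority label is predicted.
    hits = sum(max(counts.values()) for counts in groups.values())
    return hits / len(labels)


# Score each gene on its own, then each pair of genes jointly.
singles = {j: subset_accuracy(X, y, [j]) for j in range(3)}
pairs = {p: subset_accuracy(X, y, p) for p in combinations(range(3), 2)}

# Pair obtained by taking the two individually best genes (ties go to the lower index).
ranked_pair = tuple(sorted(sorted(singles, key=lambda j: (-singles[j], j))[:2]))
# Pair obtained by scoring feature sets jointly.
best_pair = max(pairs, key=pairs.get)

print("individual scores:", singles)
print("top-2 by ranking:", ranked_pair, "accuracy", pairs[ranked_pair])
print("best joint pair: ", best_pair, "accuracy", pairs[best_pair])
```

Resubstitution accuracy is used only to keep the example short; any real assessment would use held-out data or cross-validation.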
-- The second article describes the need for regularisation of the linear model. We discuss how the searches for a workable compromise between model bias and variance within each of the three main data-mining strains are performed in quite different ways, yet possess a common theoretical background and yield similar predictive results on validation procedures. We show that a simple forward selection technique that adds features to a model based on the maximisation of the penalised margin width of its regularised Support Vector Machine formation performs competitively against, and in some cases outperforms, Random Forest and Least Angle Regression with respect to classifying unlabelled data points. -- No feature selection technique has proven to be superior to all others. Since there currently seems to be no 'silver bullet' method for extracting the most telling biomarkers from a transcriptome, we develop and test a novel ensemble feature selection method in the third and most ambitious article. We rigorously build an inventory spanning all three major machine learning families, through sound selection of both regularisation and penalty parameters, and construct a validation architecture that involves bootstrapping for stability purposes. Testing this selection suite across a range of high-dimensional datasets, including some publicly available ones, lends further weight to a broad range of previous statistical and biological findings. We find significant overlap between the features found using our method and the ones identified as putative biomarkers in the original studies accompanying the publicly available datasets, some of whose implication in disease has more recently been further explicated. We also apply our method to an in-house lymphoma gene expression dataset, independently identifying features that are already identified in other studies as biomarkers for the disease subtype in question, as well as discovering some novel putative ones.
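The general idea of combining several selectors with bootstrapping for stability can be sketched as follows. This is a minimal illustration under stated assumptions, not the ensemble method developed in the thesis: the three scoring functions (mean_gap, median_gap, midpoint_accuracy) are simple hypothetical stand-ins for the three machine learning families, the data are synthetic with one informative feature at index 0, and the top-2 cut-off and 0.8 consensus threshold are arbitrary choices.

```python
import random
from statistics import mean, median


def mean_gap(a, b):
    """Absolute difference between the class means of one feature."""
    return abs(mean(a) - mean(b))


def median_gap(a, b):
    """Absolute difference between the class medians of one feature."""
    return abs(median(a) - median(b))


def midpoint_accuracy(a, b):
    """Accuracy of thresholding at the midpoint of the two class means."""
    lo, hi = (a, b) if mean(a) <= mean(b) else (b, a)
    cut = (mean(lo) + mean(hi)) / 2
    correct = sum(v <= cut for v in lo) + sum(v > cut for v in hi)
    return correct / (len(a) + len(b))


# Stand-in selectors (assumption): not the algorithms used in the thesis.
SELECTORS = [mean_gap, median_gap, midpoint_accuracy]


def top_k(class0, class1, score, k):
    """Indices of the k best-scoring features; ties go to the lower index."""
    n_features = len(class0[0])
    scores = [
        score([row[j] for row in class0], [row[j] for row in class1])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: (-scores[j], j))[:k]


def selection_frequency(class0, class1, k=2, n_boot=50, seed=0):
    """Fraction of (bootstrap sample, selector) runs that place each feature in the top k."""
    rng = random.Random(seed)
    counts = [0] * len(class0[0])
    for _ in range(n_boot):
        # Stratified bootstrap: resample with replacement within each class.
        boot0 = rng.choices(class0, k=len(class0))
        boot1 = rng.choices(class1, k=len(class1))
        for score in SELECTORS:
            for j in top_k(boot0, boot1, score, k):
                counts[j] += 1
    return [c / (n_boot * len(SELECTORS)) for c in counts]


# Synthetic data: feature 0 is shifted by 10 in class 1, features 1 to 4 are noise.
data_rng = random.Random(1)


def sample(shift):
    return [shift + data_rng.uniform(-1, 1)] + [data_rng.uniform(-1, 1) for _ in range(4)]


class0 = [sample(0.0) for _ in range(10)]
class1 = [sample(10.0) for _ in range(10)]

freq = selection_frequency(class0, class1)
# Consensus set: features chosen in at least 80% of runs (threshold is an assumption).
consensus = [j for j, f in enumerate(freq) if f >= 0.8]
print("selection frequencies:", freq)
print("consensus features:", consensus)
```

Features that are picked consistently across resamples and across selectors are reported; those that appear only sporadically are treated as unstable and dropped.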
-- Supplementary material and appendices detail studies ancillary to the core of the thesis and may provide starting points for further research.
Table of Contents: 1. Biological background and introduction -- 2. The evolution of dimension reduction techniques -- 3. Two-step cross-entropy feature selection for microarrays - power through complementarity -- 4. Cancer microarray feature selection using support vector machines: comparing regularisation techniques -- 5. Relieving feature selection AECS and pains; a consensus approach to identifying biomarkers -- 6. Practical considerations, conclusions and future directions.
Notes: Thesis by publication. Bibliography: p. -162
Awarding Institution: Macquarie University
Degree Type: Thesis PhD
Degree: Thesis (PhD), Macquarie University, Faculty of Science, Dept. of Statistics
Department, Centre or School: Department of Statistics
Year of Award: 2012
Principal Supervisor: David Bulger
Rights: Copyright disclaimer: http://www.copyright.mq.edu.au Copyright Timothy John Peters 2012.
Extent: 1 online resource (xvi, 162 p.) col. ill
Former Identifiers: mq:26698; http://hdl.handle.net/1959.14/224712; 1911929
Data mining; Lymphomas -- Mathematical models; Data reduction; dimension reduction; Gene expression -- Mathematical models; Systems biology; Bioinformatics; Cancer -- Mathematical models; Lymphomas; feature selection; Gene expression; bioinformatics; lymphoma; gene expression profiling; Data mining -- Statistical methods; Gene expression -- Statistical methods; Cancer