The cross-entropy method and multiple change-point detection in genomic sequences
thesis posted on 28.03.2022, 19:40 by Madawa Priyadarshana Weerasinghe Jayawardana Rathambalage
Heterogeneity in observed data is a common feature that statisticians must deal with when analyzing data. Estimating these changes in an observed process not only helps to better model the underlying phenomena, but also supports more informed decision-making. In health informatics, when analyzing the genomes of patients with complex diseases, detecting such changes is a pivotal step in finding disease-causing genes or active regions of the genome that have functional importance in characterizing these diseases. Change-point analysis methods are among the best approaches for locating important genomic variations. Detecting these variations helps researchers and practitioners to assess disease progression, prognosis and the efficacy of treatments. Thus, at the patient level, it helps to provide improved personalized medicine. The overall research aim of this thesis is to introduce the Cross-Entropy (CE) method, a model-based stochastic optimization procedure that falls under the branch of evolutionary computing techniques, to establish both the number of change-points and their locations in biological sequences. In particular, we focus on analyzing array comparative genomic hybridization (aCGH) data and DNA read count data obtained through next generation sequencing (NGS) methods. Several variants of the CE method are proposed in this work to detect change-point locations in both continuous and discrete (count) data. Different model selection criteria are used in the CE method to estimate the optimal number of change-points. Evolutionary computing methods are known to be computationally demanding due to the nature of their implementation. In this thesis we propose two alternative solutions to ameliorate this efficiency issue of the general CE algorithm. First, we develop a multi-core parallel implementation of the CE algorithm in the R statistical computing environment.
Second, for the first time in the literature, we combine two powerful sequential detection techniques with the CE method to further increase its efficiency. We further explore the feasibility of incorporating auxiliary information into the change-point detection process of the CE method using generalized additive models for location, scale and shape (GAMLSS). A series of extensive simulations, reported across multiple publications, was performed to establish the procedures and to ascertain their efficacy. We apply the proposed variants of the CE method to both aCGH and DNA read count data obtained through NGS methods to detect copy number variations. The methods discussed in this thesis are freely available as an R package named “breakpoint” at the website http://cran.r-project.org/web/packages/breakpoint/index.html. This thesis contains four peer-reviewed publications, which include a book chapter, a journal article and two conference papers. It further includes details of an R package developed to detect multiple change-points in continuous and count data based on the methods developed in this thesis.
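To make the general idea concrete, the following is a minimal sketch of how a CE-style search for change-point locations in continuous data might look. It is an illustration under stated assumptions and not the implementation in the “breakpoint” package: the function names (`segment_cost`, `sample_breaks`, `ce_change_points`), the Gaussian sampling distributions, the least-squares score for a piecewise-constant mean, and all tuning values are hypothetical choices made here. The number of change-points `k` is taken as given, so the model selection step described above is not covered.

```python
import random
import statistics


def segment_cost(data, breaks):
    """Residual sum of squares of a piecewise-constant fit with the given breaks."""
    edges = [0] + list(breaks) + [len(data)]
    cost = 0.0
    for start, end in zip(edges[:-1], edges[1:]):
        segment = data[start:end]
        mean = sum(segment) / len(segment)
        cost += sum((x - mean) ** 2 for x in segment)
    return cost


def sample_breaks(means, sds, n, rng, min_gap=2):
    """Draw one sorted set of break positions; redraw until segments are long enough."""
    while True:
        draw = sorted(int(round(rng.gauss(m, s))) for m, s in zip(means, sds))
        edges = [0] + draw + [n]
        if all(b - a >= min_gap for a, b in zip(edges[:-1], edges[1:])):
            return draw


def ce_change_points(data, k, n_samples=200, elite_frac=0.1, max_iter=50, seed=0):
    """Assumed CE loop: sample candidates, keep the elite, refit the sampler."""
    rng = random.Random(seed)
    n = len(data)
    # Start from evenly spaced, wide sampling distributions (an assumption).
    means = [n * (i + 1) / (k + 1) for i in range(k)]
    sds = [n / 4.0] * k
    n_elite = max(2, int(n_samples * elite_frac))
    best = None
    for _ in range(max_iter):
        samples = [sample_breaks(means, sds, n, rng) for _ in range(n_samples)]
        samples.sort(key=lambda b: segment_cost(data, b))
        elite = samples[:n_elite]
        if best is None or segment_cost(data, elite[0]) < segment_cost(data, best):
            best = elite[0]
        # Update the sampling parameters from the elite candidates only.
        means = [statistics.fmean([b[i] for b in elite]) for i in range(k)]
        sds = [statistics.pstdev([b[i] for b in elite]) for i in range(k)]
        if max(sds) < 0.5:  # sampler has concentrated; stop early
            break
    return best


# Synthetic sequence with mean shifts at positions 15 and 45 (deterministic noise).
levels = [0.0] * 15 + [3.0] * 30 + [0.0] * 15
signal = [lv + 0.1 * ((i * 7) % 5 - 2) for i, lv in enumerate(levels)]
estimated = ce_change_points(signal, k=2)
print(estimated)
```

In this sketch the cost of each iteration is dominated by scoring the sampled candidates, and each candidate is scored independently of the others, which is the kind of structure a multi-core implementation can exploit.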