The cross-entropy method and multiple change-point detection in genomic sequences

Rathambalage, Madawa Priyadarshana Weerasinghe Jayawardana

doi:10.25949/19439210.v1

thesis

posted on 2022-03-28, 19:40 authored by Madawa Priyadarshana Weerasinghe Jayawardana Rathambalage

Heterogeneity in observed data is a common feature that statisticians must deal with when analyzing data. Estimating these changes in an observed process not only helps to better model the underlying phenomena, but also supports more informed decision-making. In health informatics, when analyzing the genomes of patients with complex diseases, it is a pivotal step in finding disease-causing genes or active regions of the genome that have functional importance in characterizing these diseases. Change-point analysis methods are among the best approaches for locating important genomic variations. Detecting these variations helps researchers and practitioners to assess disease progression, prognosis and the efficacy of treatments. Thus, at the patient level, it helps to provide improved personalized medicine. The overall research aim of this thesis is to introduce the Cross-Entropy (CE) method, a model-based stochastic optimization procedure that falls under the branch of evolutionary computing techniques, to establish both the number of change-points and their locations in biological sequences. In particular, we focus on analyzing array comparative genomic hybridization (aCGH) data and DNA read count data obtained through next generation sequencing (NGS) methods. Several variants of the CE method are proposed in this work to detect change-point locations in both continuous and discrete (count) data. Different model selection criteria are used in the CE method to estimate the optimal number of change-points. Evolutionary computing methods are known to consume more computational resources due to the nature of their implementation. In this thesis we propose two alternative solutions to ameliorate this efficiency issue of the general CE algorithm. First, we develop a multi-core parallel implementation of the CE algorithm in the R statistical computing environment. Later, for the first time in the literature, we combine two powerful sequential detection techniques with the CE method to further increase its efficiency. We further explore the feasibility of incorporating auxiliary information into the change-point detection process of the CE method using the generalized additive model for location, scale and shape (GAMLSS). A series of extensive simulations was performed across multiple publications to establish the procedures and to ascertain their efficacy. We apply the proposed variants of the CE method to both aCGH and DNA read count data obtained through NGS methods to detect copy number variations. The methods discussed in this thesis are freely available as an R package named “breakpoint” at http://cran.r-project.org/web/packages/breakpoint/index.html. This thesis contains four peer-reviewed publications: a book chapter, a journal article and two conference papers. It further includes details of an R package developed to detect multiple change-points in continuous and count data based on the methods developed in this thesis.
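To make the idea in the abstract concrete, the following is a minimal sketch of a CE-style search for change-points in the mean of a continuous sequence. It is not the thesis's algorithm and not the API of the “breakpoint” package: the rounded normal sampling distribution, the residual-sum-of-squares cost, the particular BIC form, the default parameter values and all function names (segment_cost, draw_candidate, ce_changepoints, select_changepoints) are assumptions chosen for illustration.

```python
import math
import random


def segment_cost(data, cps):
    """Residual sum of squares of a piecewise-constant fit with breaks at cps."""
    bounds = [0] + list(cps) + [len(data)]
    rss = 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = data[start:end]
        mean = sum(seg) / len(seg)
        rss += sum((x - mean) ** 2 for x in seg)
    return rss


def draw_candidate(mu, sigma, n, rng):
    """Draw sorted, distinct integer locations in 1..n-1 (assumed sampling scheme)."""
    while True:
        cps = sorted(
            min(n - 1, max(1, int(math.floor(rng.gauss(m, s) + 0.5))))
            for m, s in zip(mu, sigma)
        )
        if len(set(cps)) == len(cps):  # reject draws with repeated locations
            return cps


def ce_changepoints(data, k, n_samples=200, elite_frac=0.1,
                    max_iter=50, tol=0.5, seed=0):
    """CE search for k change-point locations; returns (locations, cost)."""
    n = len(data)
    if k == 0:
        return [], segment_cost(data, [])
    rng = random.Random(seed)
    mu = [n * (i + 1) / (k + 1) for i in range(k)]  # evenly spaced start
    sigma = [n / 4.0] * k                           # wide initial spread
    n_elite = max(2, int(elite_frac * n_samples))
    best_cps, best_cost = None, float("inf")
    for _ in range(max_iter):
        candidates = [draw_candidate(mu, sigma, n, rng) for _ in range(n_samples)]
        scored = sorted((segment_cost(data, c), c) for c in candidates)
        if scored[0][0] < best_cost:
            best_cost, best_cps = scored[0]
        # Update the sampling distribution from the elite (lowest-cost) samples.
        elite = [c for _, c in scored[:n_elite]]
        mu = [sum(c[j] for c in elite) / n_elite for j in range(k)]
        sigma = [math.sqrt(sum((c[j] - mu[j]) ** 2 for c in elite) / n_elite)
                 for j in range(k)]
        if max(sigma) < tol:  # sampling distribution has degenerated: stop
            break
    return best_cps, best_cost


def select_changepoints(data, k_max=3, **kwargs):
    """Pick the number of change-points by a Gaussian BIC (assumed criterion)."""
    n = len(data)
    best = None
    for k in range(k_max + 1):
        cps, rss = ce_changepoints(data, k, **kwargs)
        # k + 1 segment means, k locations and one variance are penalized.
        bic = n * math.log(rss / n) + (2 * k + 2) * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, cps)
    return best[1]


# Toy data: three segments with means 0, 3, 0 and true breaks at 60 and 100.
rng = random.Random(1)
data = [rng.gauss(m, 0.5)
        for m, length in ((0.0, 60), (3.0, 40), (0.0, 50))
        for _ in range(length)]
cps, rss = ce_changepoints(data, k=2)
selected = select_changepoints(data, k_max=3)
print(cps, selected)
```

In this sketch the locations are searched for a fixed number of change-points, and the number itself is then chosen by comparing a penalized fit across candidate values, mirroring the two tasks the abstract describes. Count data would need a different cost, for example a Poisson or negative binomial likelihood.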

History

Alternative Title

CE method and multiple change-point detection.

Table of Contents

1. Introduction and thesis outline -- 2. Methods -- 3. Multiple break-points detection in array CGH data via the cross-entropy method -- 4. Hybrid algorithms for multiple change-point detection in biological sequences -- 5. The cross entropy method for detecting multiple change points in DNA read count data -- 6. GAMLSS and extended cross-entropy method to detect multiple change-points in DNA read count data -- 7. Breakpoint: an R package to detect multiple change-points via the CE method -- 8. Discussion and future directions.

Notes

Includes bibliographic references. Thesis by publication. Spine title: The CE method and multiple change-point detection.

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Degree

PhD, Macquarie University, Faculty of Science and Engineering, Department of Statistics

Department, Centre or School

Department of Statistics

Year of Award

2015

Principal Supervisor

Georgy Sofronov

Additional Supervisor 1

David Bulger

Rights

Copyright Madawa Priyadarshana, W.J.R. 2015. Copyright disclaimer: http://www.copyright.mq.edu.au

Language

English

Extent

1 online resource (xxi, 188 pages), illustrations (some colour)

Former Identifiers

mq:44227 http://hdl.handle.net/1959.14/1067413

Keywords

cross-entropy method; Genomics -- Statistical methods; Stochastic sequences; Mathematical optimization; Genomics; change-point detection; next generation sequencing; aCGH data; DNA sequences; Cross-entropy method

Licence

In Copyright
