A Markov chain Monte Carlo algorithm for change-point detection in nanopore sequencing data
Understanding the genetic makeup of organisms is a central goal in bioinformatics. DNA sequencing, the process of determining the order of the nucleotide bases in DNA, can now be performed quickly and cheaply with commercially available devices no bigger than a USB stick. The latest DNA sequencers use nanopore technologies to capture long, repetitive DNA structures with great success; however, the reported read accuracy still needs improvement. One main source of error arises during basecalling, when the raw nanopore signals output by the sequencers are translated into sequences of nucleotide bases. Basecalling is difficult because the nanopore signals must not only be segmented but also grouped into four types, each representing one base. In this thesis, we propose a novel algorithm that combines change-point detection methods with Markov chain Monte Carlo (MCMC) sampling techniques. We use real and simulated data to demonstrate the effectiveness of the proposed algorithm and compare it with other change-point detection packages.
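To make the idea of sampling change points with MCMC concrete, the following is a minimal sketch and not the algorithm proposed in this thesis. It assumes a toy setting that the abstract does not specify: a signal with exactly one change point, piecewise-constant means, Gaussian noise with known standard deviation, a uniform prior on the change-point location, and a symmetric random-walk Metropolis proposal. The segment means are replaced by their sample means, which is a plug-in simplification. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def log_likelihood(signal, tau, sigma):
    """Gaussian log-likelihood (up to a constant) of a two-segment model.

    The change point tau splits the signal into signal[:tau] and signal[tau:].
    Each segment mean is set to its sample mean (plug-in simplification).
    """
    total = 0.0
    for segment in (signal[:tau], signal[tau:]):
        mean = sum(segment) / len(segment)
        total -= sum((x - mean) ** 2 for x in segment) / (2 * sigma ** 2)
    return total


def sample_change_point(signal, sigma, n_iter=5000, step=5, seed=0):
    """Random-walk Metropolis sampler over a single change-point location."""
    rng = random.Random(seed)
    n = len(signal)
    tau = n // 2  # arbitrary starting location
    current = log_likelihood(signal, tau, sigma)
    samples = []
    for _ in range(n_iter):
        # Symmetric proposal: shift tau by a uniform integer in [-step, step].
        proposal = tau + rng.randint(-step, step)
        # Proposals outside 1..n-1 have zero prior mass and are rejected.
        if 1 <= proposal <= n - 1:
            candidate = log_likelihood(signal, proposal, sigma)
            accept_prob = math.exp(min(0.0, candidate - current))
            if rng.random() < accept_prob:
                tau, current = proposal, candidate
        samples.append(tau)
    return samples


# Simulated signal (assumed values): 60 points at level 0, then 40 at level 4.
data_rng = random.Random(1)
sigma = 0.5
signal = [data_rng.gauss(0.0, sigma) for _ in range(60)]
signal += [data_rng.gauss(4.0, sigma) for _ in range(40)]

samples = sample_change_point(signal, sigma)
kept = samples[1000:]  # discard the first 1000 draws as burn-in
estimate = Counter(kept).most_common(1)[0][0]
print("most frequently sampled change point:", estimate)
```

Basecalling as described above is harder than this toy case: the number of change points is unknown, and each segment must additionally be assigned to one of four types, so a practical sampler would also need moves that add, remove, and relabel segments.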