Biomarker discovery using bioinformatics methods
Thesis posted on 2022-03-28, 21:46, authored by Md Tawhidul Islam
A biomarker is a biochemical indicator of a biological state that may serve as an indicator or predictor of disease. Biomarkers are used to measure the presence, risk or progression of a disease, or the effect of its treatment, rather than to measure the disease itself. They also act as a basis for the selection of lead candidates for clinical trials. Scientists have been searching for biomarkers for decades, and methods of discovery have developed as technology has advanced. Advances in genomics and proteomics have made it easier to interrogate hundreds or thousands of potential markers at a time and have produced unprecedented growth in the volume of new data in biomarker research, drug discovery and patient care. However, the success and progress of such work depend heavily on prior knowledge of, and experience with, the potential markers of interest. The diverse data generated by high-throughput biotechnology are an ideal starting point for gaining knowledge in systems bioinformatics, but this information is only useful if it is easily accessible. Most of it is presented in free-text format that is not readily available for automated computational analysis.

In this thesis we present a novel knowledge aggregation approach based on statistical methods, user-defined structural rules, machine learning, text mining and Natural Language Processing (NLP) techniques to automatically extract biomarker-related information from the scientific literature. Our knowledge aggregation approach combines two major tasks, namely Information Extraction and Relationship Extraction.

The thesis therefore first presents an automatic information retrieval, summarization and extraction tool (mExtract). Built on statistical and pattern-matching NLP techniques, our intelligent agent system mExtract is capable of retrieving the most relevant documents from the web based on user queries. Once the documents are retrieved, the system uses its underlying techniques to extract biomarker-specific information (i.e., protein, gene, genome, disease) from the text by finding the focal topic of the document and extracting the most relevant properties of that topic; it also generates a summary of the topic.

Secondly, we present our extended system, the Biomarker Information Extraction Tool (BIET), which is capable of extracting biomarker relationships among diseases, genes and proteins. For a given set of oncology-related texts (i.e., abstracts), BIET extracts the biomarker relationship "is biomarker of" (disease, gene/protein) from the texts. Built on state-of-the-art statistical models and machine learning techniques, BIET consists of three major components: Semantic Category Recognition, which identifies the evaluative sentences among other sentences by recognizing words and phrases in the text that belong to semantic categories of interest for biomedical entities; Assertion Classification, which determines whether a statement refers to a biomarker entity (protein, gene and disease) relationship; and Semantic Relationship Classification, which identifies the biomarker relationship among the biomedical entities.

The diverse applications presented in this thesis demonstrate that our new knowledge aggregation approach is practical; effective, in the sense that it uses a series of statistical models that rely heavily on local lexical and syntactic context and achieve competitive results compared with more complex NLP solutions; and versatile, as it is easily extendable to similar or more complex relation extraction tasks. It represents an important contribution to bioinformatics and to the fields of biomedical research in which it is applied.
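
As a rough illustration of how the three BIET components could be chained, the Python sketch below runs semantic category recognition, assertion classification and semantic relationship classification over a short text. It is not the thesis implementation: the thesis describes statistical models and machine learning, whereas this sketch substitutes toy dictionary lookups and keyword rules. The lexicons, cue words, and the names `Relation`, `recognize_entities`, `asserts_biomarker` and `extract_relations` are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicons and cue words; NOT taken from the thesis.
DISEASES = {"breast cancer", "prostate cancer"}
GENES_PROTEINS = {"HER2", "PSA", "BRCA1"}
BIOMARKER_CUES = ("biomarker", "marker")
NEGATION_CUES = {"not", "no", "failed"}


@dataclass(frozen=True)
class Relation:
    """One 'is biomarker of' (disease, gene/protein) pair."""
    disease: str
    gene_or_protein: str


def recognize_entities(sentence):
    """Stage 1 stand-in: semantic category recognition by dictionary lookup."""
    lowered = sentence.lower()
    diseases = sorted(d for d in DISEASES if d in lowered)
    genes = sorted(
        g for g in GENES_PROTEINS
        if re.search(rf"\b{re.escape(g)}\b", sentence)
    )
    return diseases, genes


def asserts_biomarker(sentence):
    """Stage 2 stand-in: assertion classification by cue and negation words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    has_cue = any(t.startswith(BIOMARKER_CUES) for t in tokens)
    negated = any(t in NEGATION_CUES for t in tokens)
    return has_cue and not negated


def extract_relations(text):
    """Stage 3 stand-in: pair every disease with every gene/protein
    found in a sentence that positively asserts a biomarker relationship."""
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        diseases, genes = recognize_entities(sentence)
        if not (diseases and genes):
            continue  # nothing to relate in this sentence
        if not asserts_biomarker(sentence):
            continue  # entities present, but no positive assertion
        for disease in diseases:
            for gene in genes:
                relations.append(Relation(disease, gene))
    return relations


if __name__ == "__main__":
    sample = (
        "HER2 is an established biomarker of breast cancer. "
        "PSA is not a reliable biomarker of breast cancer. "
        "BRCA1 was sequenced in 40 samples."
    )
    for relation in extract_relations(sample):
        print(relation)
```

In this sketch only the first sample sentence yields a relation: the second is rejected by the negation rule and the third mentions no disease. A statistical classifier trained on local lexical and syntactic features, as the abstract describes, would take the place of each keyword rule.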