Advanced bioinformatics approaches for proteomics data analysis
Mass spectrometry (MS) coupled with liquid chromatography (LC-MS/MS) has become a well-established technique for identifying large numbers of proteins in complex biological samples. More recently, data-independent acquisition (DIA) has been developed for MS-based proteomics and is emerging as a highly accurate and reproducible method for quantitative proteomics. Data extraction and analysis in DIA-based studies require reference spectral data as a prerequisite, generated using MS in data-dependent acquisition (DDA) mode. This requirement has been partially addressed by the recent availability of spectral data in public repositories. However, incorporating these datasets to create a complete reference proteome for DIA-based proteomics studies remains a challenge. The overall objective of this thesis is to establish advanced bioinformatics approaches that facilitate and maximise the utilisation of DDA data, in order to improve the identification and quantitation of proteins from individual DIA experiments. Generating DIA reference libraries, and making them available and mutually compatible, is laborious and requires a significant expenditure of computational and experimental resources. In the first phase, I developed and deployed a platform to integrate spectral data stored in proteomics databases, using the R programming language and the R Shiny package. This open-source, web-based interactive user interface, 'iSwathX', provides fully automated processing of reference assay libraries by normalising and combining the spectra from different DDA-based datasets to generate extended libraries. The interface also provides novel functions to analyse multiple DDA libraries simultaneously for quick and efficient data analysis. In the next phase, I extended the integrated libraries approach to design cross-sample libraries in order to examine the complex human plasma proteome. 
Plasma-based DIA studies have long been limited by the lack of comprehensive, in-depth DDA libraries, and the libraries that exist are dominated by high-abundance proteins. In this study, I designed an integrated cross-sample library by incorporating DDA data from cell samples in addition to plasma samples. This greatly increased the library size and the search space for DIA data extraction and analysis. As a result, I was able to identify and quantify a larger number of proteins from human plasma, which could potentially lead to the discovery of new disease biomarkers. Separately, I developed a strategy for utilising proteomics data from homologous species to generate a cross-species reference proteome. For this, I created libraries from the plasma proteomes of domestic animals and applied them to study not only these domestic bovids but also a wild bovid species. The cross-species libraries were also assessed for studying the proteomes of distantly related species whose genome sequences are not known. This analysis approach led to the identification and quantification of proteins from multiple species through comparative proteomics analysis. In conclusion, this thesis demonstrates the development and application of novel bioinformatics approaches that support extensive and dynamic analysis of protein data generated by different mass spectrometry techniques. Additionally, the enhanced use of DIA-MS methods in large-scale, diverse proteomics studies yielded novel biological findings. The methods presented in this thesis can potentially accelerate the discovery of previously inaccessible proteomics data, leading to new insights for biomedical and therapeutic studies as well as conservation and biodiversity studies.