Bioinformatics analysis of proteins and proteomes
Thesis posted on 2022-03-28, 12:53, authored by Islam Mohammad
The advancement of next-generation proteomics methodologies has led to an explosion in proteomics data. However, the analysis and interpretation of these data remain a challenge, as several proteins remain unannotated and uncharacterised in many organisms. Despite the large volume of mass spectrometry (MS) data in various datasets, over 10% of human proteins are still considered "missing". Bioinformatics techniques can provide comprehensive annotations for entire proteomes, yielding valuable information on the putative functions of proteins that can be validated and/or supplemented with experimental data. This thesis aims to tackle some of these challenges. The first aim was to develop a generic in silico bioinformatics pipeline to identify homologues and map putative functional signatures, gene ontology terms and biochemical pathways for the proteins of novel organisms or for "missing" proteins. This pipeline was used to identify homologues for 2,587 proteins and functional annotations for 2,486 proteins from the black Périgord truffle (Tuber melanosporum Vittad.), followed by MS-based shotgun proteomics to validate 836 proteins. The same pipeline was then used to annotate the human "missing" protein sequences on each human chromosome, with the results made available through the ProtAnnotator web portal; mammalian homologues were found for 2,538 of these proteins (66.2%, based on September 2013 data). ProtAnnotator also functionally annotated 1,945 (50.8%) "missing" human proteins. ProtAnnotator 2.0 automates the process and provides an update to the annotation of the truffle proteome. The lack of coherence between proteomics data submitted to various databases and processed by different search engines has limited their integration in the quest to uncover human "missing" proteins. To this end, a scheme was devised for comparing proteomics data from different sources on the basis of proteotypicity and search engine scores, together with guidelines on spectral quality analysis.
Finally, ProtAnnotator and the proteomics data integration strategy above were combined to create a novel integrated web platform, MissingProteinPedia (MPP), to define, collate and make usable all available data (including single proteotypic MS spectra) from various databases and web platforms for human "missing" proteins. The MPP platform comprises a freely available web interface for data mining, collaboration and validation of MS and publication data. MPP permits protein-level identification of proteins that have very short tryptic peptides, such as interleukin-9; proteins that are traditionally known but lack proteomic or antibody data; and proteins that are carefully identified by our integrated computational workflow followed by expert spectral analysis. The tools developed in this thesis provide data integration to accelerate the annotation of novel proteomes and the discovery of human "missing" proteins.