Molecular similarity and diversity analysis of bioactive small molecules using chemoinformatics approaches
Thesis posted on 2022-03-28, 13:59, authored by Varun Khanna
The search for pharmaceutically interesting compounds using computational methods is the core idea in chemoinformatics. With the advent of combinatorial synthesis and high-throughput screening (HTS), researchers and the drug industry are now able to screen millions of compounds each day. However, improvements in screening capabilities have failed to yield a proportionate increase in novel chemotypes. Given the number of compounds in PubChem, one of the most popular chemistry databases, it is impractical to experimentally screen all compounds against a potential target. This thesis aims to study the property space occupied by therapeutic compounds of economic importance obtained from public datasets, using chemoinformatics tools and computational technologies. With this objective in mind, a comprehensive review of current chemoinformatics research, with a particular emphasis on drug discovery, was carried out. In addition, the most commonly used, freely available small molecule databases and algorithms for small molecule analysis were reviewed. Further, recent developments in computational library design techniques were summarized in a separate review article. For web-based analysis and visualization of small molecules, I have developed the chemoinformatics analysis module for the Customary Medicinal Knowledgebase (CMKb; http://www.biolinfo.org/cmkb), which has served as a prototype for integrating the use of medicinal plants among Australian Aboriginals with bioactives in order to identify potential lead compounds. To examine the similarity of current drug molecules to human metabolites and toxics, a preliminary comparative study based on several computed physicochemical properties and functional groups was carried out. We established that results from searching against complete datasets were comparable to results obtained from clustered data.
We then used a multi-criteria approach to analyse physicochemical properties, scaffold architecture and fragment occurrence among large public datasets of biological interest viz. drugs, metabolites, toxics, natural products, lead compounds and the ChEMBL dataset. Fragments are often dependent on each other; therefore, fragment co-occurrences were further assessed by association analysis. Going beyond the general datasets, a nematode-specific anthelmintic dataset was also analysed. Machine learning methods were used to screen potential anthelmintic compounds from public collections, and novel anthelmintics have been identified. From our preliminary analysis, it was established that although the physicochemical property spaces occupied by drugs, human metabolites and toxics were distinct, present-day drugs are more akin to toxic compounds than to metabolites. This result is in accordance with the high attrition rates in drug discovery projects. Furthermore, we concluded that empirical rules such as Lipinski's "rule of five" can be supplemented to include toxicity information. Following the preliminary study on physicochemical properties, our subsequent comprehensive analysis corroborated the earlier finding that metabolites are the least similar to current-day drugs. However, in scaffold analysis we found that over 42.0% of the non-redundant metabolite scaffolds are represented among drugs, which suggests that drugs and metabolites differ largely in side chains and linkers but share much of the scaffold space. Additionally, a robust statistical technique known as association analysis was explored for the first time in chemoinformatics to carry out efficient mining and fragment co-occurrence analysis.
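
To illustrate what fragment co-occurrence analysis by association analysis involves, the following minimal sketch computes the standard support, confidence and lift measures for pairs of fragments. It is an assumption-laden illustration, not the method used in the thesis: the function name `pair_association`, the `min_support` threshold and the toy fragment sets are all hypothetical, and each molecule is assumed to be already reduced to a set of fragment labels.

```python
from collections import Counter
from itertools import combinations


def pair_association(molecules, min_support=0.5):
    """Return {(a, b): (support, confidence, lift)} for fragment pairs.

    molecules: iterable of fragment-label collections, one per molecule.
    confidence is P(b | a); lift is confidence divided by P(b).
    """
    molecules = list(molecules)
    n = len(molecules)
    item_counts = Counter()
    pair_counts = Counter()
    for fragments in molecules:
        unique = sorted(set(fragments))  # count each fragment once per molecule
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # drop pairs that co-occur too rarely
        confidence = count / item_counts[a]
        lift = confidence / (item_counts[b] / n)
        rules[(a, b)] = (support, confidence, lift)
    return rules


# Hypothetical toy data: fragment labels per molecule.
molecules = [
    {"benzene", "amide"},
    {"benzene", "amide", "hydroxyl"},
    {"benzene", "hydroxyl"},
    {"amide", "carboxyl"},
]
rules = pair_association(molecules, min_support=0.5)
for pair, (support, confidence, lift) in sorted(rules.items()):
    print(pair, round(support, 3), round(confidence, 3), round(lift, 3))
```

A lift above 1 indicates that two fragments occur together more often than expected if they were independent, while a lift below 1 indicates that they occur together less often. Larger datasets would normally call for a frequent-itemset algorithm such as Apriori rather than this exhaustive pair count.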