Abstract: Giardiasis is a waterborne diarrheal disease of humans and animals caused by Giardia lamblia, a widespread eukaryotic parasite. Current treatment of Giardia lamblia infections relies on nitroimidazoles, benzimidazoles, quinacrine, nitazoxanide, furazolidone, and paromomycin, but these drugs have serious limitations, which makes research into new anti-giardial compounds vital. Screening of chemical libraries has revealed Class II fructose-1,6-bisphosphate aldolase (FBPA), in which Zn2+ functions in electrophilic catalysis, as an excellent potential drug target for inhibiting Giardia lamblia growth. FBPA is essential for G. lamblia survival. In the current study, three state-of-the-art supervised classifiers (J48, Random Forest, and Naïve Bayes) were used to build predictive models of the FBPA assay in order to analyze growth inhibition of Giardia trophozoites. Virtual screening was employed to prioritize molecules and ease the drug discovery process by focusing experimental screening on a small subset of the database. The high-throughput screening datasets for inhibition of Giardia lamblia growth were studied with a machine learning approach in order to model the activity of FBPA and aid the discovery of anti-giardial compounds. Highly accurate classifiers were developed based on molecular descriptors of the compounds. J48 was found to be more accurate than Random Forest and Naïve Bayes, as assessed by ROC curve analysis.
1. Biological assay data: Dataset AID2451 was obtained from the PubChem database, maintained by the National Center for Biotechnology Information (NCBI). The dataset consists of 287,744 tested compounds, segregated into 2,033 active, 273,075 inactive, and 12,878 inconclusive compounds. Compounds with a PubChem activity score from 40 to 100 were labelled active, while compounds with a score of 0 were labelled inactive. The remaining compounds, with activity scores from 1 to 39, were inconclusive and were excluded from this study because of the uncertainty in their bioactivity. Hence, only the active and inactive compounds were downloaded in Structure Data Format (SDF) for analysis.
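The labelling rule above can be written as a small function. This is only a minimal sketch: the thresholds come from the text, while the function names and the compound identifiers in the demo are hypothetical and are not part of the PubChem API.

```python
from typing import Dict, Optional

ACTIVE_MIN = 40   # lowest activity score treated as active (from the text)
ACTIVE_MAX = 100  # highest activity score treated as active (from the text)


def label_from_score(score: int) -> Optional[str]:
    """Map a PubChem activity score to a class label, or None if excluded."""
    if ACTIVE_MIN <= score <= ACTIVE_MAX:
        return "active"
    if score == 0:
        return "inactive"
    # Scores 1-39 (inconclusive) and anything outside 0-100 are dropped.
    return None


def build_labelled_set(scores: Dict[str, int]) -> Dict[str, str]:
    """Keep only compounds that receive a label (keys are hypothetical IDs)."""
    labelled = {}
    for compound_id, score in scores.items():
        label = label_from_score(score)
        if label is not None:
            labelled[compound_id] = label
    return labelled


if __name__ == "__main__":
    demo = {"cid_a": 85, "cid_b": 0, "cid_c": 17, "cid_d": 40}
    print(build_labelled_set(demo))
```

In the demo, the compound with score 17 is left out, mirroring the exclusion of inconclusive compounds.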
2. Molecular descriptors and data pre-processing: 2D molecular descriptors of the active and inactive datasets were generated using PowerMV, a readily available Windows software environment for descriptor generation, molecular viewing, similarity search, and statistical analysis. The inactive dataset was split into smaller SDF files using a Perl script provided by MayaChemTools to avoid the out-of-memory error raised by PowerMV. Each of the inactive and active files was loaded into PowerMV, and a set of 179 descriptors was computed for each. The descriptors comprised 147 pharmacophore fingerprints, 24 weighted Burden numbers, and 8 property descriptors. All descriptors generated were in binary format, with each bit being '1' if a feature is present or '0' if it is absent. To make the dataset more efficient to process, any attribute having only one value (either 0 or 1) throughout the dataset was removed. A custom script was run to randomly split the dataset into an 80% train-cum-validation set and a 20% independent test set. Finally, 5-fold cross-validation was used for the training and test sets.
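The pre-processing steps described above (dropping single-valued attributes, the 80/20 random split, and 5-fold cross-validation) can be sketched in plain Python. The custom script used in the project is not shown on this page, so all function names here are hypothetical. The split is assumed to be a simple unstratified shuffle, and the folds are assumed to be drawn from the train-cum-validation portion.

```python
import random


def drop_constant_columns(rows):
    """Remove every column whose value is identical in all rows.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(items, k=5):
    """Yield (training, validation) pairs for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


if __name__ == "__main__":
    descriptors = [[1, 0, 1], [1, 1, 1], [1, 0, 0]]  # toy bit vectors
    reduced, kept = drop_constant_columns(descriptors)
    print(kept, reduced)  # column 0 is constant and is removed
    train, test = train_test_split(range(10))
    for training, validation in k_folds(train, k=5):
        print(len(training), len(validation))
```

A fixed seed keeps the split reproducible. A stratified split may be preferable for a dataset this imbalanced, but the text does not say which was used.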
3. Machine learning of dataset: Classification is one of the main tasks of inductive learning in machine learning. Many effective inductive learning techniques have been developed, such as Naïve Bayes, decision trees, and neural networks. The three classifiers used in this research are Naïve Bayes, J48, and Random Forest. Naïve Bayes assumes that the descriptors are statistically independent of one another given the class label and estimates the conditional probability of each descriptor given the class. Classification is performed by computing the posterior probability of each class for a particular instance of descriptors and predicting the class with the highest posterior probability. J48 is an implementation of the C4.5 decision tree learning algorithm. It generates a tree data structure that can be used to classify new instances. Lastly, Random Forest is a combination of tree predictors, whereby multiple classification trees are generated, each depending on an independent and identically distributed random input vector.
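The Naïve Bayes decision rule described above can be illustrated for binary descriptors with a short, self-contained sketch. This is not Weka's implementation: the class name is hypothetical, and the Laplace smoothing is an added assumption that keeps the probabilities away from 0 and 1.

```python
import math


class BernoulliNaiveBayes:
    """Naive Bayes for 0/1 features with Laplace smoothing (an assumption)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.p_one = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # P(feature_j = 1 | class c), smoothed so it is never 0 or 1.
            self.p_one[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def log_posterior(self, x):
        """Unnormalised log posterior of each class for one instance."""
        scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            # Independence assumption: per-feature log likelihoods add up.
            for xj, p in zip(x, self.p_one[c]):
                s += math.log(p if xj == 1 else 1.0 - p)
            scores[c] = s
        return scores

    def predict(self, x):
        """Return the class with the highest posterior probability."""
        scores = self.log_posterior(x)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    X = [[1, 0], [1, 1], [0, 0], [0, 1]]  # toy bit-vector descriptors
    y = ["active", "active", "inactive", "inactive"]
    model = BernoulliNaiveBayes().fit(X, y)
    print(model.predict([1, 0]))
```

Working in log space avoids numerical underflow when many descriptor probabilities are multiplied together.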
4. Building Classification models: In the current study, two cost-sensitive meta-learners were used to handle misclassification costs: MetaCost and CostSensitiveClassifier (CSC). MetaCost relabels each training instance with its minimum-expected-cost class and then applies the error-based learner to the relabelled training set, creating reliable probability estimates on the training examples in the process. It was used for J48 with the unpruned option set to true. CostSensitiveClassifier was employed for Naïve Bayes and Random Forest, both with default options. CSC offers two methods of introducing cost-sensitivity: reweighting training instances according to the total cost assigned to each class, and predicting the class with the minimum expected misclassification cost. Weka, data mining software developed at the University of Waikato, was used to run the classification calculations.
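The two cost-sensitive strategies mentioned above can be illustrated with a small sketch. This is a simplified illustration, not Weka's code: the function names are hypothetical, the cost values in the demo are made up, and Weka additionally normalises the instance weights, which is omitted here.

```python
def expected_costs(probs, cost):
    """Expected cost of predicting each class.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n = len(probs)
    return [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]


def min_cost_class(probs, cost):
    """Index of the class with the minimum expected misclassification cost."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


def class_weights(cost):
    """Per-class instance weight: total cost of misclassifying that class."""
    return [sum(row) for row in cost]


if __name__ == "__main__":
    # Class 0 = inactive, class 1 = active. Missing an active compound is
    # assumed to be ten times worse than a false alarm (illustrative only).
    cost = [[0, 1],
            [10, 0]]
    probs = [0.8, 0.2]  # the classifier thinks "inactive" is more likely
    print(expected_costs(probs, cost))
    print(min_cost_class(probs, cost))
    print(class_weights(cost))
```

With these illustrative costs, the instance is assigned to the active class even though its estimated probability of being active is only 0.2. This is the effect a cost matrix is meant to have on a heavily imbalanced screen.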
5. Performance measures: Several performance measures were used to evaluate the effectiveness of the models. A set of summary statistics, together with the ROC plot, was used to evaluate the results. Sensitivity is the proportion of actual positive instances correctly predicted as positive, while Specificity is the proportion of actual negative instances predicted as negative. Accuracy is the overall proportion of correct predictions, while the Balanced Classification Rate (BCR) is the mean of Sensitivity and Specificity, which gives both classes equal weight regardless of class imbalance.
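These measures can be computed directly from the four confusion-matrix counts. The sketch below uses the common definition of BCR as the mean of sensitivity and specificity. The function name and the counts in the demo are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the four measures from confusion-matrix counts.

    tp / fn : actual positives predicted as positive / negative
    tn / fp : actual negatives predicted as negative / positive
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    bcr = 0.5 * (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "bcr": bcr,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in classification_metrics(80, 20, 900, 100).items():
        print(f"{name}: {value:.3f}")
```

On an imbalanced dataset, accuracy is dominated by the majority class, so BCR is often the more informative single number.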
 Sheng VS, Ling CX: Thresholding for making classifiers cost-sensitive. In: Proceedings of the 21st national conference on Artificial intelligence – Volume 1. Boston, Massachusetts: AAAI Press; 2006: 476-481.
 Periwal V, Kishtapuram S, Consortium OSDD, Scaria V: Computational models for in-vitro anti-tubercular activity of molecules based on high-throughput chemical biology screening datasets. BMC Pharmacology 2012, 12(1):1.
Important Remarks: This page includes only the project abstract and methodology. Research results can be obtained upon request, subject to approval from the project supervisors. [Project Duration: March 2013 – May 2013]
Acknowledgement: This work was fully guided by Dr. Vinod Scaria and Ms. Salma Jamal (project supervisors), with support from the Institute of Genomics and Integrative Biology, India.