International Journal of Advanced Computer Research (IJACR)

ISSN (Print):2249-7277    ISSN (Online):2277-7970
Volume-10 Issue-47 March-2020
Paper Title : Traditional machine learning and big data analytics in virtual screening: a comparative study
Author Name : Sahar K. Hussin, Yasser M. Omar, Salah M. Abdelmageid and Mahmoud I. Marie
Abstract :

The volume of data that must be processed keeps growing, which makes high-performance computing (HPC) and big data analytics necessary. Drug discovery research is in the same position: it has no choice but to use HPC and large-scale data processing systems to reach its targets in a reasonable time. Virtual screening (VS) is one of the most computationally expensive tasks in drug discovery, yet it plays an essential role in new drug design. This research investigates machine learning and big data analytics in VS. It considers ligand-based and structure-based approaches that rank the compounds in molecular databases by their likely activity against a specific target protein. Machine learning algorithms, including random forests, naive Bayesian classifiers, neural networks, decision trees, support vector machines, and deep-learning strategies, have been developed for both ligand-based screening and structure-based docking. The paper also reviews previous research on the use of machine learning and big data analytics frameworks in VS. It outlines current progress in applying traditional machine learning methods and big data analytics applications to datasets distributed across multiple nodes. It compares how machine learning approaches perform in large-scale ligand-based screening and explores how these approaches can improve virtual screening classification on large repositories. Finally, challenges and solutions for virtual screening datasets in machine learning and big data analytics are discussed.
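
As an illustration of what ranking a molecular database in a ligand-based setting can look like, the following is a minimal sketch and not the authors' method. It assumes each molecule is already encoded as a binary fingerprint, stored here as the set of its "on" bit positions, and ranks a library by Tanimoto similarity to one known active compound. The molecule names, fingerprints, and the helper names `tanimoto` and `rank_library` are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_library(query: frozenset, library: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: each set lists the "on" bit positions of a fingerprint.
known_active = frozenset({1, 4, 7, 9, 12})
library = {
    "mol_A": frozenset({1, 4, 7, 9, 13}),
    "mol_B": frozenset({2, 3, 5}),
    "mol_C": frozenset({1, 4, 7, 9, 12}),
}

ranking = rank_library(known_active, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In a machine learning setting, the similarity score would typically be replaced by the predicted activity probability of a trained classifier, while the ranking step stays the same.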

Keywords : Drug discovery, Virtual screening, Descriptors, Machine learning, Big data analytics frameworks.
Cite this article : Hussin SK, Omar YM, Abdelmageid SM, Marie MI. Traditional machine learning and big data analytics in virtual screening: a comparative study. International Journal of Advanced Computer Research. 2020; 10(47):72-88. DOI:10.19101/IJACR.2019.940150.