Development of machine learning models for predicting mutation likelihood in SARS-CoV-2 Mᴾᴿᴼ and the NSP10-NSP16 complex using molecular dynamics simulation data
Publisher
Rhodes University
Faculty of Science, Biochemistry, Microbiology and Bioinformatics
Abstract
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, highlights the critical need for innovative methods to understand virus evolution and develop effective treatments. Mutations in SARS-CoV-2 proteins can increase virulence, prevent virus detection, and reduce the efficacy of treatments and vaccines. While SARS-CoV-2 mutation research generally focuses on the spike protein, some non-structural proteins (NSPs) warrant attention, such as NSP10, NSP16 and the main protease (Mpro), also known as the 3C-like protease (3CLpro). These proteins are essential to viral replication and immune evasion, making them valuable targets for antiviral therapies. This study begins with an extension of the residue mutation predictions performed in Barozi et al. (2024), where the artificial neural network (ANN) and random forest (RF) models we had developed in Python were fine-tuned and additional support vector machine (SVM) models were produced. All models were trained using the original Mpro dataset from Barozi et al. (2024), achieving moderate performance with an average accuracy of up to 76% on test subsets. In an attempt to improve the mutation prediction performance, an alternative dataset using raw Mpro molecular dynamics (MD) trajectory coordinates was processed using convolutional neural networks (CNNs). However, the CNNs performed worse than the models trained on the processed Mpro trajectory data. Finally, the generalisability of the ANN, RF and SVM models when applied to other SARS-CoV-2 protein data was investigated using the NSP10-NSP16 complex. To obtain a comparable dataset, MD simulations of the NSP10-NSP16 complex were conducted with and without the SAM ligand. Stable trajectories were analysed through dynamic residue network (DRN) analysis, root mean square fluctuation (RMSF), solvent accessible surface area (SASA), B-factor, and BLOcks SUbstitution Matrix (BLOSUM) metrics to create machine learning (ML) input datasets for NSP10 and NSP16.
These datasets were tested using the Mpro-trained models, resulting in a decline in performance compared to the Mpro test sets, indicating limited transferability. This study identified critical limitations of ML-based residue mutation prediction, including small datasets, class imbalances, and structural instabilities during MD simulations. However, it established a foundation for further research by demonstrating the importance of feature selection and the potential of ML models to predict viral residue mutations.
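As an illustration of one of the per-residue features named in the abstract, the following minimal sketch computes RMSF from a trajectory using the standard definition (the square root of the time-averaged squared deviation of each residue from its mean position). It is not the thesis pipeline: the function name `rmsf`, the input layout (a list of frames, each a list of one `(x, y, z)` coordinate per residue), and the toy coordinates are assumptions made for the example, and the frames are assumed to be already aligned to a common reference.

```python
import math


def rmsf(trajectory):
    """Per-residue root mean square fluctuation.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
    one per residue (hypothetical layout, assumed pre-aligned).
    """
    n_frames = len(trajectory)
    n_residues = len(trajectory[0])
    values = []
    for i in range(n_residues):
        # Mean position of residue i over all frames.
        mean = [
            sum(frame[i][k] for frame in trajectory) / n_frames
            for k in range(3)
        ]
        # Time-averaged squared deviation from that mean position.
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


# Toy two-frame, two-residue trajectory (illustrative values, not study data).
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
values = rmsf(toy)
print(values)
```

In the toy input the first residue never moves while the second oscillates about its mean, so the second receives the larger RMSF. In an ML input table of the kind described above, such a value would be one column per residue alongside the other listed metrics.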