Applying Machine Learning Algorithms for the Classification of Drug Discovery Data
Updated: 05/13/2024
1. Overview
2. Research Question
3. Nonstructural Protein 5B Dataset Selection
4. PaDELPy Molecular Fingerprints Calculation
5. Data Class Pipeline Implementation
6. Applying Random Forest Classifier Model
7. Imbalanced Target Activity Classes
8. Detecting Machine Learning Model Overfitting
9. The Lazy Predict Library for Machine Learning Algorithms
10. Conclusions
1. Overview
Drug Discovery is the process by which new medications are identified and developed. It involves a series of steps aimed at finding compounds that have the potential to treat or prevent diseases. These steps typically include:
· Target identification: This involves identifying a specific molecule, such as a protein or enzyme, that plays a key role in a disease process and could be targeted by a drug.
· Target validation: Once a potential target is identified, researchers conduct experiments to confirm that modulating this target can indeed affect the disease.
· Lead compound identification: Researchers search for compounds, often through high-throughput screening or computer-aided drug design, that have the potential to interact with the target and modify its activity.
· Lead optimization: The most promising compounds are further refined and optimized to improve their efficacy, safety, and pharmacokinetic properties.
· Preclinical studies: Before a drug candidate can be tested in humans, it must undergo extensive testing in laboratory and animal models to evaluate its safety and efficacy.
· Clinical trials: If a drug candidate shows promise in preclinical studies, it can advance to clinical trials, which involve testing the drug in human volunteers to assess its safety and effectiveness.
· Regulatory approval: After successful completion of clinical trials, the drug developer submits an application to regulatory agencies, such as the Food and Drug Administration (FDA) in the United States, for approval to market the drug.
· Post-marketing surveillance: Once a drug is approved and on the market, ongoing monitoring is conducted to ensure its safety and effectiveness in real-world settings.
Machine Learning (ML) can play a significant role in various stages of the drug discovery process, enhancing efficiency, accuracy, and speed. Here are some ways in which machine learning can help:
· Target identification and validation: ML algorithms can analyze large-scale biological data, such as genomics, proteomics, and metabolomics data, to identify potential drug targets and validate their relevance to specific diseases. By analyzing patterns in these data sets, ML can help uncover novel insights into disease mechanisms and identify promising therapeutic targets.
· Drug repurposing: ML algorithms can analyze vast amounts of data from existing drugs, including their chemical structures, biological activities, and clinical outcomes, to identify potential candidates for repurposing. By repurposing existing drugs for new indications, drug discovery can be accelerated and costs reduced.
· Compound screening: High-throughput screening (HTS) assays generate massive amounts of data on the biological activity of compounds. ML algorithms can analyze these data to identify compounds with desired properties more efficiently than traditional methods. Furthermore, ML can help prioritize compounds for further testing based on predicted efficacy, safety, and other factors.
· Lead optimization: ML algorithms can assist in the optimization of lead compounds by predicting their physicochemical properties, pharmacokinetics, and potential toxicity. By analyzing structure-activity relationships (SAR) in large compound libraries, ML can guide the design of new compounds with improved potency and selectivity.
· Clinical trial optimization: ML can help optimize clinical trial design and patient selection by analyzing patient data, including genomics, demographics, and clinical outcomes. By identifying biomarkers and patient subpopulations that are most likely to respond to treatment, ML can improve the efficiency and success rates of clinical trials.
· Drug safety and pharmacovigilance: ML algorithms can analyze real-world data, such as electronic health records and adverse event reports, to identify potential safety concerns associated with drugs. By detecting adverse drug reactions earlier and more accurately, ML can improve drug safety monitoring and pharmacovigilance efforts.
Overall, ML has the potential to revolutionize the drug discovery process by enabling the analysis of large and complex data sets, uncovering hidden patterns and relationships, and accelerating the identification and development of new therapeutic agents.
This paper is based on the information presented in the “Calculating molecular fingerprints using padelpy” webpage and the Jupyter Notebook file padelpy.ipynb. Some important updates are provided using required best practices for ML algorithms and project workflows. Some of the latest best practices for ML algorithms applied in genomics Life Sciences have been published on Medium.com in the series “Machine Learning Applications in Genomics Life Sciences” by Ernest Bonat, Ph.D.:
· “DNA Sequences Preprocessing Using PySpark Library”. Ernest Bonat, Ph.D., Feb 24, 2024.
· “ELT Package Development with 3-Tier Architecture for Data Engineering”. Ernest Bonat, Ph.D., Jan 23, 2024.
· “Advanced DNA Sequence Text Classification Using Natural Language Processing”. Ernest Bonat, Ph.D., Dec 22, 2023.
· “Fast DNA Sequence Data Loading — List vs. NumPy Array”. Ernest Bonat, Ph.D., Aug 9, 2023.
· “DNA Sequence String Conversion to Label Encoder for Machine Learning Algorithms”. Ernest Bonat, Ph.D., Apr 18, 2023.
· “Advanced DNA Sequences Preprocessing for Deep Learning Networks”. Ernest Bonat, Ph.D., Mar 29, 2023.
· “Building Machine Learning Clustering Models for Gene Expression RNA-Seq Data”. Ernest Bonat, Ph.D., Dec 28, 2022.
· “RNA-Seq Gene Expression Classification Using Machine Learning Algorithms”. Ernest Bonat, Ph.D., Aug 6, 2022.
· “Web Deployment of Genomics Machine Learning Models Using Flask Web Framework”. Ernest Bonat, Ph.D., Jun 18, 2022.
2. Research Question
The main task of this research is to train a classification Machine Learning model to predict a compound’s bioactivity (active or inactive against a specific target), and then use the trained model to screen potential drug candidates. Hepatitis C virus (HCV) infection will be used as an example. Hepatitis C is a liver infection caused by HCV and is spread through contact with blood from an infected person. Today, most people become infected with the hepatitis C virus by sharing needles or other equipment used to prepare and inject drugs. This research represents a supervised Machine Learning binary classification project (1 = active compound, 0 = inactive compound).
3. Nonstructural Protein 5B Dataset Selection
The dataset used can be downloaded from the following link: https://raw.githubusercontent.com/dataprofessor/data/master/HCV_NS5B_Curated.csv. This dataset was originally obtained from the ChEMBL database for nonstructural protein 5B (NS5B). ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
Nonstructural protein 5B (NS5B) is an RNA-dependent RNA polymerase (RdRp) that plays a critical role in HCV replication. The function of this enzyme is to catalyze the polymerization of ribonucleoside triphosphates (rNTP) during viral RNA replication. There are two main subclasses of NS5B polymerase inhibitors: (1) nucleotide analogues that mimic the natural substrate and induce chain termination when incorporated into the new RNA and (2) non-nucleotide inhibitors that bind to the allosteric sites on the enzyme and impair its function.
By loading the dataset into a pandas dataframe,
import pandas as pd

https_hcv_ns5b_curated = 'https://raw.githubusercontent.com/dataprofessor/data/master/HCV_NS5B_Curated.csv'
df_hcv_ns5b = pd.read_csv(https_hcv_ns5b_curated)
df_hcv_ns5b.info()
the metadata info can be determined.
4. PaDELPy Molecular Fingerprints Calculation
As you can see, the ‘CANONICAL_SMILES’ column contains text data. This data type cannot be used directly with ML algorithms. To encode this data numerically, the PaDELPy library was used to calculate the molecular fingerprints. PaDELPy is a Python wrapper for the PaDEL-Descriptor software. The ‘padeldescriptor()’ function from this wrapper will be employed. To generate the output fingerprints file, this function requires two inputs: the ‘molecule.smi’ file and an XML fingerprint descriptor type file.
To create the ‘molecule.smi’ file, two columns are required from the dataframe: ‘CANONICAL_SMILES’ and ‘CMPD_CHEMBLID’. The generic function ‘create_molecule_smi_file()’ below creates the ‘molecule.smi’ file for any such dataframe.
@staticmethod
def create_molecule_smi_file(df, molecule_smi_file_path):
    """create the molecule smi file
    args:
        df (dataframe): bioactive molecule dataframe
        molecule_smi_file_path (string): molecule smi file path
    """
    try:
        # keep only the SMILES strings and the compound ChEMBL IDs
        df = pd.concat([df['CANONICAL_SMILES'], df['CMPD_CHEMBLID']], axis=1)
        # PaDEL expects a tab-separated file with no index and no header
        df.to_csv(molecule_smi_file_path, sep='\t', index=False, header=False)
    except Exception:
        # assumes: import traceback as tb
        tb.print_exc()
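A hypothetical call (shown here as a plain function, since the source does not name the class that hosts this static method) could be:
# create 'molecule.smi' from the HCV NS5B dataframe loaded earlier
create_molecule_smi_file(df_hcv_ns5b, 'molecule.smi')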
The XML fingerprint descriptor type files can be downloaded from the GitHub link https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip. Here is a sorted list of the twelve of them.
['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']
In our case, we’ll be using the ‘SubstructureFingerprinter.xml’ file shown below.
<Root>
<Group name="2D">
<Descriptor name="AcidicGroupCount" value="false"/>
<Descriptor name="ALOGP" value="false"/>
<Descriptor name="AminoAcidCount" value="false"/>
<Descriptor name="APol" value="false"/>
<Descriptor name="AromaticAtomsCount" value="false"/>
<Descriptor name="AromaticBondsCount" value="false"/>
<Descriptor name="AtomCount" value="false"/>
<Descriptor name="Autocorrelation" value="false"/>
<Descriptor name="BaryszMatrix" value="false"/>
<Descriptor name="BasicGroupCount" value="false"/>
<Descriptor name="BCUT" value="false"/>
<Descriptor name="BondCount" value="false"/>
<Descriptor name="BPol" value="false"/>
<Descriptor name="BurdenModifiedEigenvalues" value="false"/>
<Descriptor name="CarbonTypes" value="false"/>
<Descriptor name="ChiChain" value="false"/>
<Descriptor name="ChiCluster" value="false"/>
<Descriptor name="ChiPathCluster" value="false"/>
<Descriptor name="ChiPath" value="false"/>
<Descriptor name="Constitutional" value="false"/>
<Descriptor name="Crippen" value="false"/>
<Descriptor name="DetourMatrix" value="false"/>
<Descriptor name="EccentricConnectivityIndex" value="false"/>
<Descriptor name="EStateAtomType" value="false"/>
<Descriptor name="ExtendedTopochemicalAtom" value="false"/>
<Descriptor name="FMF" value="false"/>
<Descriptor name="FragmentComplexity" value="false"/>
<Descriptor name="HBondAcceptorCount" value="false"/>
<Descriptor name="HBondDonorCount" value="false"/>
<Descriptor name="HybridizationRatio" value="false"/>
<Descriptor name="InformationContent" value="false"/>
<Descriptor name="IPMolecularLearning" value="false"/>
<Descriptor name="KappaShapeIndices" value="false"/>
<Descriptor name="KierHallSmarts" value="false"/>
<Descriptor name="LargestChain" value="false"/>
<Descriptor name="LargestPiSystem" value="false"/>
<Descriptor name="LongestAliphaticChain" value="false"/>
<Descriptor name="MannholdLogP" value="false"/>
<Descriptor name="McGowanVolume" value="false"/>
<Descriptor name="MDE" value="false"/>
<Descriptor name="MLFER" value="false"/>
<Descriptor name="PathCount" value="false"/>
<Descriptor name="PetitjeanNumber" value="false"/>
<Descriptor name="RingCount" value="false"/>
<Descriptor name="RotatableBondsCount" value="false"/>
<Descriptor name="RuleOfFive" value="false"/>
<Descriptor name="Topological" value="false"/>
<Descriptor name="TopologicalCharge" value="false"/>
<Descriptor name="TopologicalDistanceMatrix" value="false"/>
<Descriptor name="TPSA" value="false"/>
<Descriptor name="VABC" value="false"/>
<Descriptor name="VAdjMa" value="false"/>
<Descriptor name="WalkCount" value="false"/>
<Descriptor name="Weight" value="false"/>
<Descriptor name="WeightedPath" value="false"/>
<Descriptor name="WienerNumbers" value="false"/>
<Descriptor name="XLogP" value="false"/>
<Descriptor name="ZagrebIndex" value="false"/>
</Group>
<Group name="3D">
<Descriptor name="Autocorrelation3D" value="false"/>
<Descriptor name="CPSA" value="false"/>
<Descriptor name="GravitationalIndex" value="false"/>
<Descriptor name="LengthOverBreadth" value="false"/>
<Descriptor name="MomentOfInertia" value="false"/>
<Descriptor name="PetitjeanShapeIndex" value="false"/>
<Descriptor name="RDF" value="false"/>
<Descriptor name="WHIM" value="false"/>
</Group>
<Group name="Fingerprint">
<Descriptor name="Fingerprinter" value="false"/>
<Descriptor name="ExtendedFingerprinter" value="false"/>
<Descriptor name="EStateFingerprinter" value="false"/>
<Descriptor name="GraphOnlyFingerprinter" value="false"/>
<Descriptor name="MACCSFingerprinter" value="false"/>
<Descriptor name="PubchemFingerprinter" value="false"/>
<Descriptor name="SubstructureFingerprinter" value="true"/>
<Descriptor name="SubstructureFingerprintCount" value="false"/>
<Descriptor name="KlekotaRothFingerprinter" value="false"/>
<Descriptor name="KlekotaRothFingerprintCount" value="false"/>
<Descriptor name="AtomPairs2DFingerprinter" value="false"/>
<Descriptor name="AtomPairs2DFingerprintCount" value="false"/>
</Group>
</Root>
Now that the ‘molecule.smi’ and ‘SubstructureFingerprinter.xml’ files are defined, the ‘generate_fingerprint_output_file()’ function can be applied to generate the final molecule fingerprint output file ‘substructure.csv’.
@staticmethod
def generate_fingerprint_output_file(smi_file_path, fingerprint_output_file, fingerprint_descriptor_type):
    """generate molecule fingerprint output file
    args:
        smi_file_path (string): molecule smi file path
        fingerprint_output_file (string): molecule fingerprint output file path
        fingerprint_descriptor_type (string): xml fingerprint descriptor type file path
    returns:
        None
    """
    try:
        # assumes: from padelpy import padeldescriptor
        padeldescriptor(
            mol_dir=smi_file_path,
            d_file=fingerprint_output_file,
            descriptortypes=fingerprint_descriptor_type,
            detectaromaticity=True,
            standardizenitro=True,
            standardizetautomers=True,
            threads=2,
            removesalt=True,
            log=True,
            fingerprints=True
        )
    except Exception:
        tb.print_exc()
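A hypothetical call (the file paths are assumed to be in the working directory) could be:
# generate 'substructure.csv' from 'molecule.smi' using the substructure fingerprint definition
generate_fingerprint_output_file('molecule.smi', 'substructure.csv', 'SubstructureFingerprinter.xml')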
Here is an example of the ‘substructure.csv’ file data.
The ‘padeldescriptor()’ function inside ‘generate_fingerprint_output_file()’ requires some time to run. It generates a log file that can be used to verify and confirm the completion of the descriptor calculations.
5. Data Class Pipeline Implementation
Data pipelines automate the process of data extraction, transformation, and loading (ETL). They reduce manual intervention and save time. Bioinformatics deals with large datasets from various sources such as genomic sequencing, proteomic analyses, and clinical trials. Data pipelines help in processing this raw data efficiently, cleaning it, and preparing it for analysis, including ML algorithms. In general, data pipelines are designed and developed by Data Engineers using ETL packages and technologies.
Now that the fingerprint descriptor ‘substructure.csv’ file has been generated, we can build the final CSV file to be used with ML algorithms. I would like to provide a simple Python pipeline class to demonstrate how to develop one for our drug discovery dataset. In this case, the input CSV file will be ‘substructure.csv’, and the output file will be ‘hcv_ns5b_substructure_final.csv’. Here is the code for the ‘DrugDiscoveryDataPipeline()’ class.
class DrugDiscoveryDataPipeline(object):
    # the input file is assumed to have the molecule name in the first column
    # and the activity label ('active'/'inactive') in the last column

    def __init__(self, input_data_path, output_data_path):
        self.input_data_path = input_data_path
        self.output_data_path = output_data_path
        self.data = None
        self.X = None
        self.y = None
        self.X_y_concat = None

    def data_input(self):
        # read the input CSV file into a dataframe
        try:
            self.data = pd.read_csv(self.input_data_path)
        except Exception:
            tb.print_exc()

    def X_select(self):
        # select the X features: drop the first (molecule name) and last (activity) columns
        try:
            self.X = self.data.iloc[:, 1:-1]
        except Exception:
            tb.print_exc()

    def y_select(self):
        # select the y target: the last (activity) column
        try:
            self.y = self.data.iloc[:, -1]
        except Exception:
            tb.print_exc()

    def y_encoder(self):
        # encode the activity labels as integers
        try:
            self.y = self.y.map({'active': 1, 'inactive': 0})
        except Exception:
            tb.print_exc()

    def X_y_concatenate(self):
        # concatenate the X features and the encoded y target
        try:
            self.X_y_concat = pd.concat([self.X, self.y], axis=1)
        except Exception:
            tb.print_exc()

    def data_output(self):
        # write the final dataframe to the output CSV file
        try:
            self.X_y_concat.to_csv(self.output_data_path, index=False)
        except Exception:
            tb.print_exc()

    def run_data_pipeline(self):
        # execute the pipeline steps in order
        self.data_input()
        self.X_select()
        self.y_select()
        self.y_encoder()
        self.X_y_concatenate()
        self.data_output()
The code below demonstrates how the ‘run_data_pipeline()’ function executes the data pipeline process. At the end of this process, the ‘hcv_ns5b_substructure_final.csv’ file is generated and ready to be used with ML algorithms.
input_data_path = r"\folder_path\substructure.csv"
output_data_path = r"\folder_path\hcv_ns5b_substructure_final.csv"
data_pipeline = DrugDiscoveryDataPipeline(input_data_path, output_data_path)
data_pipeline.run_data_pipeline()
The static function ‘df_concat_column()’ below can be used to concatenate a list of dataframe columns and generate a CSV file.
@staticmethod
def df_concat_column(df_list_column, file_name, file_extension):
    """concatenate dataframe columns and create a CSV file
    args:
        df_list_column (list): column list to be concatenated
        file_name (string): file name
        file_extension (string): file extension
    returns:
        dataframe: concatenated dataframe
    """
    df_concat = None
    try:
        df_concat = pd.concat(df_list_column, axis=1)
        # assumes a 'config' module defining PROJECT_FOLDER and FIELD_DELIMITER
        file_name_path = os.path.join(config.PROJECT_FOLDER, file_extension, file_name)
        df_concat.to_csv(file_name_path, sep=config.FIELD_DELIMITER, index=False, header=False)
    except Exception:
        tb.print_exc()
    return df_concat
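A hypothetical call (the dataframe variable names are illustrative) could be:
# concatenate the fingerprint columns and the encoded activity column into one CSV file
df_final = df_concat_column([df_fingerprints, df_activity], 'hcv_ns5b_substructure_final.csv', 'csv')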
As you can see, the Python code should be encapsulated in generic functions (or class objects) with docstring comments and error handling. These are simple software development and testing best practices. Data Scientists, could you please stop writing Python ‘spaghetti code’ everywhere in Jupyter Notebook files? Take some time to read the paper “Refactoring Python Code for Machine Learning Projects. Python ‘Spaghetti Code’ Everywhere!”.
6. Applying Random Forest Classifier Model
As you can see, this ML project is a simple supervised learning project with a binary classification dataset. The dataframe’s ‘info()’ method shows the following results.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578 entries, 0 to 577
Columns: 308 entries, SubFP1 to Activity
dtypes: int64(308)
memory usage: 1.4 MB
There are 578 rows and 308 columns in this dataset. We have two concerns: not enough rows, for sure, and too many columns. As a Data Scientist, by default I would not apply any ML algorithm to this specific dataset as-is. Let’s apply some best practices and experienced ML ‘magic’.
As you can see, the molecular descriptor data are very sparse, with values consisting mostly of 0s and 1s. The first idea that comes to mind is to remove the columns with low variance. However, that doesn’t necessarily mean we should do so. Everything in ML needs to be tested to achieve the most accurate model as the final product. Let’s determine whether it’s necessary in this case.
Why is it a very smart idea to start with Random Forest model selection (‘RandomForestRegressor()’, ‘RandomForestClassifier()’) for many ML regression and classification projects? I really want you to research this question and find the answer yourself — it’s very simple and important for Data Scientists.
The ‘remove_low_variance()’ function below removes the dataframe columns whose variance falls below a threshold value.
@staticmethod
def remove_low_variance(df, threshold=0.1):
    """remove dataframe low variance columns
    args:
        df (dataframe): dataframe (X features)
        threshold (float, optional): variance threshold, defaults to 0.1
    returns:
        dataframe: updated dataframe
    """
    try:
        # assumes: from sklearn.feature_selection import VarianceThreshold
        variance_threshold = VarianceThreshold(threshold)
        variance_threshold.fit(df)
        # keep only the columns whose variance is above the threshold
        df = df[df.columns[variance_threshold.get_support(indices=True)]]
    except Exception:
        tb.print_exc()
    return df
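The source does not show the training and evaluation code behind the metrics reported in this section. A minimal sketch, assuming an 80/20 stratified train-test split and default ‘RandomForestClassifier()’ hyperparameters (the random_state values are illustrative), could look like this:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# load the final pipeline output file
df = pd.read_csv('hcv_ns5b_substructure_final.csv')
X = df.drop(columns=['Activity'])
y = df['Activity']

# for the low-variance experiment, X = remove_low_variance(X, threshold=0.1) would be applied here

# 80/20 stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=50)

# train the Random Forest classifier and evaluate it on the test set
rf_classifier = RandomForestClassifier(random_state=50)
rf_classifier.fit(X_train, y_train)
y_predict = rf_classifier.predict(X_test)
print('classification accuracy score:')
print(round(accuracy_score(y_test, y_predict) * 100, 2))
print('classification confusion matrix:')
print(confusion_matrix(y_test, y_predict))
print('classification report:')
print(classification_report(y_test, y_predict))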
After applying the ‘remove_low_variance()’ function, the X features’ shape becomes (578, 18), so 289 columns were removed. Here are the results after running the ‘RandomForestClassifier()’ classifier model.
classification accuracy score:
83.62
classification confusion matrix:
[[26 15]
[ 4 71]]
classification report:
precision recall f1-score support
0 0.87 0.63 0.73 41
1 0.83 0.95 0.88 75
accuracy 0.84 116
macro avg 0.85 0.79 0.81 116
weighted avg 0.84 0.84 0.83 116
An 83.62% accuracy score is a poor result for removing the X features’ low-variance columns. Let’s see what happens when the ‘remove_low_variance()’ function is not applied.
classification accuracy score:
88.79
classification confusion matrix:
[[31 10]
[ 3 72]]
classification report:
precision recall f1-score support
0 0.91 0.76 0.83 41
1 0.88 0.96 0.92 75
accuracy 0.89 116
macro avg 0.89 0.86 0.87 116
weighted avg 0.89 0.89 0.89 116
Much better results were achieved, with an accuracy score of 88.79%. Therefore, it’s unnecessary to remove the low-variance columns here. This is a very good example of Data Scientists making false assumptions about dataset values without any validation at all. In ML, it is very important to validate any suggestion by running real tests and confirming it in practice.
7. Imbalanced Target Activity Classes
The figure below shows how the compound activity classes are imbalanced. Imbalanced classes refer to a situation in ML where the distribution of classes in the training dataset is skewed, meaning that one class (the majority class) has significantly more instances than the other class or classes (the minority class or classes). Algorithms trained on imbalanced data may exhibit biases, have difficulty generalizing to new data, and produce inaccurate predictions, especially for the minority class.
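A quick way to inspect the class distribution (using the final dataframe loaded in the earlier sketch) is:
# count the number of compounds in each activity class
print(df['Activity'].value_counts())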
After applying the Synthetic Minority Oversampling Technique (SMOTE), the activity classes were balanced, as shown below.
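A minimal sketch of the SMOTE step, using the imbalanced-learn library (the source does not show this code, and the random_state is illustrative), could be:
from imblearn.over_sampling import SMOTE

# oversample the minority class so that both activity classes have the same number of instances
smote = SMOTE(random_state=50)
X_balanced, y_balanced = smote.fit_resample(X, y)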
Let’s run the ‘RandomForestClassifier()’ classifier model again to see the results.
classification accuracy score:
96.36
classification confusion matrix:
[[81 4]
[ 2 78]]
classification report:
precision recall f1-score support
0 0.98 0.95 0.96 85
1 0.95 0.97 0.96 80
accuracy 0.96 165
macro avg 0.96 0.96 0.96 165
weighted avg 0.96 0.96 0.96 165
A 96.36% accuracy score is an excellent result. Here we go again: the Random Forest algorithm remains one of the best in many ML projects. There’s no need to jump right away to Deep Learning algorithms. It’s still not clear to me why many Data Scientists think that Deep Learning algorithms are the best first solution for building any ML model today. In many companies’ real production use cases, they’re not!
8. Detecting Machine Learning Model Overfitting
Machine Learning model overfitting occurs when a model learns the training data too well, capturing noise or random fluctuations in the data rather than the underlying patterns or relationships. This results in a model that performs very well on the training data but fails to generalize to new, unseen data.
One simple way to detect ML model overfitting is to compare the model’s accuracy on a validation set with its accuracy on a held-out test set. Let’s split our dataset into train-validation-test sets (80%-10%-10%) and run the ‘RandomForestClassifier()’ model again.
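A minimal sketch of the 80%-10%-10% split, implemented with two calls to ‘train_test_split()’ on the balanced data (the random_state values are illustrative), could be:
from sklearn.model_selection import train_test_split

# split off 20% of the data, then divide it half-and-half into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X_balanced, y_balanced, test_size=0.2, stratify=y_balanced, random_state=50)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=50)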
Validation Metrics
classification accuracy score:
95.12
classification confusion matrix:
[[37 1]
[ 3 41]]
classification report:
precision recall f1-score support
0 0.93 0.97 0.95 38
1 0.98 0.93 0.95 44
accuracy 0.95 82
macro avg 0.95 0.95 0.95 82
weighted avg 0.95 0.95 0.95 82
Test Metrics
classification accuracy score:
95.18
classification confusion matrix:
[[44 3]
[ 1 35]]
classification report:
precision recall f1-score support
0 0.98 0.94 0.96 47
1 0.92 0.97 0.95 36
accuracy 0.95 83
macro avg 0.95 0.95 0.95 83
weighted avg 0.95 0.95 0.95 83
The validation and test accuracy scores are very close, so our Random Forest classification model is not overfitted. It’s that simple!
9. The Lazy Predict Library for Machine Learning Algorithms
Lazy Predict is an open-source Python library, released under the MIT license, that trains a collection of standard ML models with default parameters and compares their performance. It is easy to install and use, which makes it a great tool for quickly benchmarking candidate models. The results of running the Lazy Predict classifiers are shown below. As you can see, tree-based and boosting ML classification models provide the best accuracy scores for the drug discovery activity compounds.
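A minimal sketch of the Lazy Predict run, using the train and test sets from the split above (assuming the lazypredict package is installed), could be:
from lazypredict.Supervised import LazyClassifier

# fit and score a collection of standard classifiers with default hyperparameters
lazy_classifier = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = lazy_classifier.fit(X_train, X_test, y_train, y_test)
print(models)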
10. Conclusions
1. The PaDELPy library proves to be a good solution for calculating molecular fingerprint descriptors for Machine Learning algorithms.
2. The Random Forest and Extreme Gradient Boosting (XGBoost) classifier models provide a high accuracy score for predicting active and inactive compounds in the drug discovery process.
3. Machine Learning tree-based and boosting algorithms perform better for classifying drug discovery datasets.
4. Data Scientists should not make unvalidated assumptions, whether ideas or opinions, in Machine Learning project development. Everything in Machine Learning project development needs to be tested to achieve the most accurate model as the final product.