RNA-Seq Gene Expression Classification Using Machine Learning Algorithms

10 min readAug 9, 2022

1. Overview
2. Fast RNA-Seq Gene Expression Dataset Load
3. RNA-Seq Gene Expression Dataset Imbalanced Class Analysis
4. Applying Principal Component Analysis (PCA) to RNA-Seq Gene Expression Dataset
5. Applying Machine Learning Classification Algorithms to RNA-Seq Gene Expression Dataset
6. PCA Quantity Components Selection using Classification Accuracy Scores Variation
7. Conclusions

Updated: Sep/25/2023

1. Overview

From the National Human Genome Research Institute (NHGRI) site we can see some Artificial Intelligence/Machine Learning application examples in genomics.

Examining people’s faces with facial analysis AI programs to accurately identify genetic disorders.
Using Machine Learning techniques to identify the primary kind of cancer from a liquid biopsy.
Predicting how a certain kind of cancer will progress in a patient.
Identifying disease-causing genomic variants compared to benign variants using Machine Learning.
Using Deep Learning to improve the function of gene editing tools such as CRISPR.

Let’s define gene expression. Gene expression is the process by which the information encoded in a gene is used to either make RNA molecules that code for proteins or to make non-coding RNA molecules that serve other functions. Gene expression acts as an “on/off switch” to control when and where RNA molecules and proteins are made and as a “volume control” to determine how much of those products are made. The process of gene expression is carefully regulated, changing substantially under different conditions. The RNA and protein products of many genes serve to regulate the expression of other genes.

The paper “A Review of Deep Learning Applications In Human Genomics Using Next-Generation Sequencing Data” mentioned that “Machine Learning (ML) has been deliberated as a core technology in artificial intelligence (AI), which enables the use of algorithms and makes critical predictions based on data learning and not simply following instructions. It has broad technology applications; however, standard ML methods are too narrow to deal with complex, natural, highly dimensional raw data, such as those of genomics. Alternatively, the Deep Learning (DL) approach is a promising and exciting field currently employed in genomics.” Let’s prove that this statement is completely false.”

This blog paper will cover how to apply traditional standard classification Machine Learning algorithms to an RNA-Seq gene expression dataset. In general, these types of datasets are multi-dimensional with thousands of unlabeled columns. Because of that specific Machine Learning implementation is necessary to develop high performance production models. The required Machine Learning workflow steps will be provide using the most popular classification models today.

This research paper is a continuation of the Machine Learning models development and deployment for genomic datasets analysis. Below is the latest.

“Apply Machine Learning Algorithms for Genomics Data Classification”. Ernest Bonat, Ph.D., Bishes Rayamajhi, MS. February 03, 2021.
“Web Deployment of Genomics Machine Learning Models Using Flask Web Framework”. Ernest Bonat, Ph.D., June 18, 2022.
“Building Machine Learning Clustering Models for Gene Expression RNA-Seq Data”. Ernest Bonat, Ph.D., Dec 28, 2022.

2. Fast RNA-Seq Gene Expression Dataset Load

In this blog paper the gene expression cancer RNA-Seq dataset will be used. This collection of data is part of the RNA-Seq (HiSeq) Pan-Cancer Atlas dataset, it is a random extraction of gene expressions of patients having different types of tumors including BRCA — Breast invasive carcinoma, KIRC — Kidney renal clear cell carcinoma, COAD — Colon adenocarcinoma, LUAD — Lung adenocarcinoma and PRAD — Prostate adenocarcinoma. The downloaded file TCGA-PANCAN-HiSeq-80120531.tar.gz contains two CSV files. The data.csv file that represents the X feature in Machine Learning and the lable.csv file which the y target (label) data. Below is the library pandas data frame information for each of them. Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

For X feature (data.csv file).

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 801 entries, 0 to 800
Columns: 20531 entries, gene_0 to gene_20530
dtypes: float64(20531)
memory usage: 125.5 MB
None

For y target (label.csv file).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 1 columns):
# Column Non-Null Count Dtype
---  ------  --------------  -----
0 Class 801 non-null object
dtypes: object(1)
memory usage: 6.4+ KB
None

During the Machine Learning model development and tests, these CSV files will need to be loaded many times in the Python IDE program. In general, the duration time to load these files is approximately around 16–18 seconds (according to my laptop). This will increase considerable the development time of the total model final implementation. The question is: How to speed-up this data loading process? One of the easy ways is to serialize X and y objects once and then deserialize them back many times. The next time when you run the Python file, you’ll need to deserialize X and y only as needed. This Python CSV files loading technique can reduce the development time up to 1–2 seconds for each run — excellent idea with a good result! This object serialization and deserialization in Python can be done by importing the pickle or joblib modules. I would recommend each Machine Learning Developer to implement this fast multi-dimensional data loading technique if necessary.

Here is the PyDNA code to serialize and deserialize X and y pandas objects.

# X feature:rna_seq_X_path = os.path.join(rna_seq_pkl_path, "rna_seq_X.pkl")
PyDNA.pickle_serialize_object(rna_seq_X_path, X)
X = PyDNA.pickle_deserialize_object(rna_seq_X_path)# y target:rna_seq_y_path = os.path.join(rna_seq_pkl_path, "rna_seq_y.pkl")
PyDNA.pickle_serialize_object(rna_seq_y_path, y)
y = PyDNA.pickle_deserialize_object(rna_seq_y_path)

3. RNA-Seq Gene Expression Dataset Imbalanced Class Analysis

The imbalanced class of the target data in real business datasets is a very common issue today. There are many papers written about this important classification issue applied to Machine Learning projects. For example, the How to deal with imbalanced data in Python blog paper contains an excellent information about pythonic solutions to balance the target imbalanced class. The bar chart shown below contains the imbalanced class for our selected genomic dataset.

After applying the Synthetic Minority Oversampling Technique (SMOTE) algorithm all classes were balanced with 300 rows each as you can see in the final bar chart below.

A simple line of Python code was applied to obtain this requirement.

X_train, y_train = PyDNA.balance_class_smote(X_train, y_train)

4. Applying Principal Component Analysis (PCA) to RNA-Seq Gene Expression Dataset

As we know already, the X feature data frame contains 20,531 gene expression columns. We’ll need to reduce this number of columns considerable to properly apply Machine Learning algorithms to this dataset. The main questions here is how to determine how many columns to reduce to? Let’s look at the feature reduction solution first.

Feature reduction, also known as dimensional reduction, is the process of reducing the number of features in a resource heavy computation without losing important information. Reducing the number of features means the number of variables is reduced making the computer’s work easier and faster. Feature reduction can be divided into two processes: feature selection and feature extraction. There are many techniques by which feature reduction is accomplished. Some of the most popular are generalized discriminant analysis, autoencoders, non-negative matrix factorization, and A One-Stop Shop for Principal Component Analysis. PCA is the most common algorithm used for feature reduction in Machine Learning projects workflow today.

A simple way to determine the necessary quantity of principal components is by calculating the individual explained variance ratio. The paper PCA Explained Variance Concepts with Python Example contains very good explanation of this technique including the Python code too.

From the bar chart below we can see that taking 7–8 principal components from 20,531 columns will be sufficient for using any classification Machine Learning algorithms. Of course, the final classification metrics values will decide if this selection is correct or not. More explanation about it will be provided after applying the classification Machine Learning algorithms.

In my experience, PCA has been working very well for me with many high-dimensional datasets. Here is the Python code to calculate the explained variance ration and cumulative sum of eigenvalues of all eigenvectors.

pca = PCA(n_components=config.PCA_N_COMPONENTS_COUNT, random_state=100)
X_train, X_valid, pca_explained_variance, cumulative_sum_eigenvalues = PyDNA.X_train_valid_pca(pca, X_train, X_valid)

5. Applying Machine Learning Classification Algorithms to RNA-Seq Gene Expression Dataset

The following twelve Machine Learning workflow steps are required to apply any traditional classification algorithm:

Data loading and object serialization
Data object deserialization
Target data flattening and encoding
Data split into training (80%), validation (10%) and test (10%)
Balance feature training data if necessary
Feature training data scaling
Feature data PCA dimensional reduction
Classification model development and optimization
Classification model fitting
Model target prediction
Classification metrics calculation for validation and test data
Production model deployment and maintenance

The table below show the results of applying twelve Machine Learning classic classification algorithms for our selected dataset. The main classification output metrics are validation and test accuracy scores.

The winner is Light Gradient Boosting Machine (LightGBM) with 98.76% of classification test accuracy score. LightGBM is a free and open source distributed gradient boosting framework for Machine Learning originally developed by Microsoft in 2016. It is based on Decision Tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

Here are the program results for this LightGBM classifier algorithm.

MODEL VALIDATIONvalid accuracy score:
98.75valid precision:
98.88valid recall:
98.75valid f1 score:
98.76valid confusion matrix:[[30  0  0  0  0]
 [ 0  8  0  0  0]
 [ 0  0 15  0  0]
 [ 0  1  0 13  0]
 [ 0  0  0  0 13]]valid classification report:       precision   recall   f1-score   support0.0       1.00      1.00      1.00        30
1.0       0.89      1.00      0.94         8
2.0       1.00      1.00      1.00        15
3.0       1.00      0.93      0.96        14
4.0       1.00      1.00      1.00        13accuracy                            0.99        80
macro avg       0.98      0.99      0.98        80
weighted avg    0.99      0.99      0.99        80MODEL TESTtest accuracy score:
98.76test precision:
98.84test recall:
98.76test f1 score:
98.76test confusion matrix:[[30  0  0  0  0]
 [ 0  8  0  0  0]
 [ 0  0 14  1  0]
 [ 0  0  0 14  0]
 [ 0  0  0  0 14]]test classification report:        precision  recall   f1-score   support0.0       1.00      1.00      1.00        30
1.0       1.00      1.00      1.00         8
2.0       1.00      0.93      0.97        15
3.0       0.93      1.00      0.97        14
4.0       1.00      1.00      1.00        14accuracy                             0.99        81
macro avg        0.99      0.99      0.99        81
weighted avg     0.99      0.99      0.99        81

It’s good to mention how well the gradient boosting algorithms work with this type of unlabeled dataset. The Random Forest (RF) and Multi-Layer Perceptron (MLP) algorithms provide very good results with 97.53% accuracy score. This is a very good sign because many companies using RF as the main Machine Learning project algorithm. RF has proved to create excellent regression and classification models with many different company business datasets. Another interest result was the K-Nearest Neighbors (KNN) algorithm with high 98.71% accuracy score. This is my first time I saw this result. It makes sense to try KNN with different gene expression datasets. In future paper I would like to find out how good are the Deep Learning algorithms with this type of datasets. The main question could be: Can CNN or LSTM performance better than LigthGBM with 98.76% of classification accuracy score?

6. PCA Quantity Components Selection using Classification Accuracy Scores Variation

The accuracy score is one of the main metrics to validate classification models performance. It’s clear that based on the PCA analysis and classification algorithms results table, 7–8 principal components will do the job. Another way, that I have not seen before, is to visualize how the validation and test scores vary based on selected principal components quantity. The figure below shows how both accuracy scores got the max values and got close with six selected principal components. This is another proof of our right 7–8 principal components selection.

Below is the Python code to calculate the require validation classification metrics.

print("MODEL VALIDATION")
accuracy_score_value, precision_value, recall_value, f1_score_value, confusion_matrix_value, classification_report_value = PyDNA.calculate_classification_metrics(y_val, y_predicted)
print("valid accuracy score:\n{}\n".format(accuracy_score_value))
print("valid precision:\n{}\n".format(precision_value))
print("valid recall:\n{}\n".format(recall_value))
print("valid f1 score:\n{}\n".format(f1_score_value))
print("valid confusion matrix:\n{\n".format(confusion_matrix_value))
print("valid classification report:\n{ \n".format(classification_report_value))

7. Conclusions

Data loading pandas objects serialize and deserialize will speed-up the process of Machine Learning models development and optimization
Target imbalanced class is an issue that need to be resolved before applying classification Machine Learning algorithms. Using SMOTE technique could be a good place to start to.
Feature reduction will be required for multi-dimensional gene expression datasets. PCA could be the optimal option with individual explained variance analysis and visualization.
Traditional classification gradient boosting models guaranty good metrics results with Pan-Cancer Atlas dataset.
The selection of the principal component quantity can be validated and visualized by using classification accuracy scores variation.
It’s strongly recommended to apply all traditional and modern Machine Learning classification algorithms to any genomic dataset and find out which model provides the best prediction results.