Building Machine Learning Clustering Models for Gene Expression RNA-Seq Data

10 min readDec 28, 2022

1. Introduction
2. Using Clustering Algorithms in Bioinformatics
3. Cancer Gene Expression RNA-Seq Dataset
4. K-Means Clustering Evaluation Metrics
5. Determine the Number of Clusters Using the Elbow Method
6. Using the Kneed Library to Determine the Number of Clusters
7. Applying K-Means Clustering Model
8. K-Means Adjusted Rand Index Metric
9. Silhouette Analysis for K-Means Clustering
10. Conclusions

1. Introduction

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters (“K-Means Clustering in Python: A Practical Guide”). The primary goal of clustering is the grouping of data into clusters based on similarity, density, intervals or particular statistical distribution measures of the data space (“Deep Learning-based Clustering Approaches for Bioinformatics”). In general, clustering helps at analyzing unstructured and high-dimensional data in the form of sequences, expressions, texts and images. Some of the latest and best practices of Machine Learning algorithms applied in genomics Life Sciences have been published on Medium.com in the paper “Machine Learning Applications in Genomics Life Sciences by Ernest Bonat, Ph.D.”

2. Using Clustering Algorithms in Bioinformatics

Gene expression data contains DNA microarray data and RNA-seq data. Analysis of microarray data helps clarify biological mechanisms and push drugs toward a more predictable future. Compared to hybridization-based microarray technology, RNA-seq has a larger range of expression levels, and more information is detected (“An Efficient Feature Selection Strategy Based on Multiple Support Vector Machine Technology with Gene Expression Data”).

DNA microarrays revolutionized the approach to gene expression profiling. Compared to prior methods, DNA microarrays have dramatically high throughput and are less cumbersome. Microarrays developed as a technique for large-scale DNA mapping and sequencing. However, changing the support surface from a porous membrane to a solid surface afforded significant improvements by increasing reaction kinetics and reducing background noise. The technology for DNA microarrays continues to evolve (“Chapter 7-DNA Microarrays in Biological Discovery and Patient Care”).

Some clustering applications in Bioinformatics could be: Identified groups of patients who respond differently to medical treatments for specify disease; Reveal groups of functionally related genes in which genes with a small distance share the same expression patterns and might be similar. Visualizing, interpreting and analyzing high-dimensional and large-scale biological data in its unstructured entirety can be perplexing unless the data is organized into clusters. Another example is the clustering of genes or biomedical images by learning the hidden patterns from an unlabeled dataset (“An Efficient Feature Selection Strategy Based on Multiple Support Vector Machine Technology with Gene Expression Data”).

The RNA-Seq has become the main data for gene expression measurements and offers several advantages over microarrays. Further, the single-cell experiments have become a fundamental Bioinformatics research project, where clustering algorithms are important parts of the Machine Learning (ML) projects implementation.

In this ML paper the K-Means clustering algorithm will be applied to build a predicted model for a gene expression RNA-Seq dataset. This clustering algorithm is an Unsupervised Learning technique used to identify clusters of multidimensional data objects in a dataset. The K-Means clustering algorithm is part of the popular ML scikit-learning framework library.

3. Cancer Gene Expression RNA-Seq Dataset

The cancer gene expression RNA-Seq dataset can be downloaded from UCI Machine Learning Repository. This collection of data is part of the RNA-Seq (HiSeq) Pan-Cancer Atlas dataset. It is a random extraction of gene expressions of patients having different types of tumors including BRCA — Breast invasive carcinoma, KIRC — Kidney renal clear cell carcinoma, COAD — Colon adenocarcinoma, LUAD — Lung adenocarcinoma and PRAD — Prostate adenocarcinoma. The downloaded file, TCGA-PANCAN-HiSeq-80120531.tar.gz, contains two CSV files. The data.csv file represents the X feature in Machine Learning and the label.csv file represents the y target (label) data. The dataset contains gene expression RNA-Seq values from 20,531 genes to 881 samples.

The imbalanced class analysis from this dataset was provided in the paper “RNA-Seq Gene Expression Classification Using Machine Learning Algorithms”. The bar chart shown below contains the imbalanced class for our selected genomic dataset.

After applying the Synthetic Minority Oversampling Technique (SMOTE) algorithm all classes were balanced with 300 rows each as you can see in the final bar chart below. Using the PyDNA library, the X feature and y label will be converted using the function below balance_class_smote().

def balance_class_smote(X, y):
    """X and Y smote over sampling
    args:
        X feature
        y label
    returns:
        smote X feature
        smote y label
    """
    try:
        smote_over_sampling = SMOTE(random_state=50, n_jobs=-1)    
        X, y = smote_over_sampling.fit_resample(X, y)
    except:            
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())  
    return X, y

# calling function
X, y = PyDNA.balance_class_smote(X, y)

The Principal Component Analysis (PCA) was used to reduce the number of X features (columns) to 7–10 from 20,531. The non-clustered dataset can be visualized using PCA first and second components. Below is the function pca_x_reduction() and the visualization plot.

pca = PCA(n_components=config.PCA_N_COMPONENTS_COUNT, random_state=50)
def pca_x_reduction(pca, X):
    """ X feature reduction
    args:
        pca class object
        X feature
    returns:
        X new feature
        PCA explained variance
        cumulative sum of eigen values
    """
    try:
        X = pca.fit_transform(X)
        pca_explained_variance = pca.explained_variance_ratio_                
        cumulative_sum_eigenvalues = np.cumsum(pca_explained_variance)
    except:               
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())
    return X, pca_explained_variance, cumulative_sum_eigenvalues

# calling function
X = PyDNA.pca_x_reduction(pca_component_number, X)

4. K-Means Clustering Evaluation Metrics

Contrary to Machine Supervised Learning where we have the ground truth to evaluate the model’s performance, clustering analysis doesn’t have a solid evaluation metric that we can use to evaluate the outcome of different clustering algorithms. Moreover, since K-Means requires the number of clusters k as an input and doesn’t learn it from data, there is no right answer in terms k that we should have in any problem. Sometimes domain knowledge and intuition may help but usually that is not the case. In the cluster-predict methodology, we can evaluate how well the models are performing based on different k clusters since clusters are used in the downstream modeling. I could recommended everyone to read and understand the paper “K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks”.

5. Determine the Number of Clusters Using the Elbow Method

The Elbow Method is the most popular method to determine the optimal number of clusters. This method uses the Sum of Squared Error (SSE) parameter between data points and their assigned clusters’ centroids. The k cluster value should be selected at the spot where SSE starts to flatten out and form an elbow. Below is the code to calculate the SSE and the elbow method plot.

def calculate_kmeans_sse(X, max_elbow_number):
    """ calculate sum of squared error (sse)
    args:
        X feature
        max of elbow number
    returns:
        SSE
    """
    try:
        sse = []
        for i in range(1, max_elbow_number):
            kmeans = KMeans(n_clusters=i, random_state=50)
            kmeans.fit(X)
            sse.append(kmeans.inertia_)       
    except:            
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())  
    return sse

def elbow_method_plot(sum_squared_error, max_elbow_number):
    """elbow method clustering plot
    args:
       sum squared error
       max of elbow number
    return:
       None
    """
    try:       
        plt.grid(True)
        plt.plot(range(1, max_elbow_number), sum_squared_error)
        plt.title("Elbow Method - Cancer Gene Expression RNA-Seq")
        plt.xlabel("Number of clusters")
        plt.ylabel("Sum of Squared Error (Distortion)")
        plt.show()
    except:            
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())  

# calling functions
max_elbow_number = config.MAX_ELBOW_NUMBER
sum_squared_error = PyDNA.calculate_kmeans_sse(X, max_elbow_number)
PyDNA.elbow_method_plot(sum_squared_error, max_elbow_number)

As we can see the right number of clusters should be five. Each of them represents a cancer class type. There are other available methods to determine the optimal number of clusters including in the paper “5 Ways for Deciding Number of Clusters in a Clustering Model”.

6. Using the Kneed Library to Determine the Number of Clusters

The kneed library is used for knee-point detection in Python. Given a set of x and y values, kneed will return the knee point of the function. The knee point is the point of maximum curvature. If the elbow method is too hard to determine the number of clusters, then I recommend using this library to verify and determine the right optimal number of clusters. Below is the function get_kneed_elbow_point() code that calculate five clusters. As you can see the result of the applied kneed library validate our correct selection user with the elbow method.

def get_kneed_elbow_point(x_coordinate , y_coordinate ):
    """determine the kneed elbow point
    args:
        x coordinate
        y coordinate
    returns:
        kneed elbow point
    """
    try:
        kneed_locator = KneeLocator(x=x_coordinate, y=y_coordinate, curve="convex", direction="decreasing")
        kneed_locator_elbow = kneed_locator.elbow
    except:            
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())  
    return kneed_locator_elbow

# calling functions
max_elbow_number = config.MAX_ELBOW_NUMBER
sum_squared_error = PyDNA.calculate_kmeans_sse(X, max_elbow_number)
kneed_locator_elbow = PyDNA.get_kneed_elbow_point(range(1, max_elbow_number), sum_squared_error)
print(kneed_locator_elbow)

Result: 5

7. Applying K-Means Clustering Model

Now that we know that the number of clusters is five, the K-Means model can be applied as shown below.

number_clusters = config.NUMBER_CLUSTERS
k_means = PyDNA.k_means_model(number_clusters)
y_predicted = PyDNA.k_means_predict(X)
cluster_centroids = PyDNA.k_means_cluster_centroids(kmeans.cluster_centers_)
print(cluster_centroids)

Here are the cluster centroids for the selected dataset.

[[  -8.45801335  -10.14582353   76.74888149  -57.75948388    4.03079771
    16.34636755   -5.99093303]
 [ 137.31363915  -57.41108246  -33.51073392    4.60164944    5.01152715
     0.58311618    1.11230399]
 [ -14.69712212  119.25252845  -58.71502055  -22.30000913   -1.16636062
     3.59466095    5.54235961]
 [-106.38275644  -90.37657514  -41.62782788    2.08651512   -9.49929942
   -11.24228754   -1.96960122]
 [  -8.43637589   36.25531939   54.76312094   72.61165042    1.44742792
    -9.41736093    1.29875299]]

Using the K-Means model and the predicted labels we can plot the five clusters for each type of cancer gene expression RNA-seq data.

8. K-Means Adjusted Rand Index Metric

A common metric to validate K-Means model’s performance is called the Adjusted Rand Index (ARI). Unlike the Silhouette Coefficient, the ARI uses true cluster assignments to measure the similarity between true and predicted labels. The ARI output values range between -1.0 and 1.0. A score close to 0.0 indicates random assignments, and a score close to 1.0 indicates perfectly labeled clusters. The ARI is bounded below by -0.5 for especially discordant clusterings. The line of code below calculate ARI value by using the function calculate_adjusted_rand_score(). The 0.98 number represents a very good performance of our selected K-Means model.

def calculate_adjusted_rand_score(y, y_predicted):
    """ calculate the adjusted rand score
    args:
        y label
        y predicted label
    returns:
        adjusted rand score
    """
    try:
        adjusted_rand_index = adjusted_rand_score(y, y_predicted) 
        adjusted_rand_index = float("{0:0.2f}".format(adjusted_rand_index)) 
    except:               
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())
    return adjusted_rand_index

# calling function
adjusted_rand_index = PyDNA.calculate_adjusted_rand_score(y, y_predicted) 
print(adjusted_rand_index)

Result: 0.98

A 3D clustered plot can be built by using the matplotlib library.

9. Silhouette Analysis for K-Means Clustering

The Silhouette Analysis is used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].

To help to determine the optimal number of clusters, this Silhouette Analysis provides two main metrics. The silhouette_samples() — compute the Silhouette Coefficient for each sample. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. The silhouette_score() — compute the mean Silhouette Coefficient of all samples. This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples(). The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. Here is the code to compute the silhouette score value.

def calculate_print_silhouette_score(X):
    """calculate silhouette score for each cluster number
    args:
        X feature
    returns
        None
    """    
    try:
        range_number_clusters = [2, 3, 4, 5]
        for number_cluster in range_number_clusters:
            print(number_cluster)
            kmeans = KMeans(n_clusters=number_cluster, random_state=50)
            y_cluster_labels = kmeans.fit_predict(X)
            silhouette_score_avg = silhouette_score(X, y_cluster_labels)
            print(silhouette_score_avg)
    except:               
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())

# calling function
PyDNA.calculate_print_silhouette_score(X)

For our dataset, the table below shows calculated silhouette score values for specific number of clusters.

As we can see the highest value of the silhouette score is 0.57 for five clusters. So, the optimal number of clusters for our dataset has been verified as five, as it should be from the elbow methods and kneed library calculation results.

The silhouette plot is shown below. The thickness of this plot gives an indication of how big each cluster is. It proves that for five clusters the highest value of the silhouette score is 0.57.

10. Conclusions

1. The K-Means clustering algorithm proves to be a good option for multidimensional gene expression RNA-Seq datasets.

2. The Elbow Method provides a simple graphical solution to determine the optimal number of clusters.

3. The Need library can determine and verify the optimal number of clusters for multidimensional gene expression RNA-Seq datasets.

4. The Adjusted Rand Index value is an excellent metric to validate the K-Means model performance.

5. The Silhouette Analysis provides a good solution to determine the optimal number of clusters by using the calculations of silhouette samples and score metrics.