Advanced DNA Sequence Text Classification Using Natural Language Processing

11 min read · Dec 21, 2023

1. Overview
2. Generate DNA Sequence K-mers Substrings
3. DNA Sequence Text Preprocessing for Machine Learning Algorithms
4. DNA Sequence Protein Bound Dataset
5. DNA Sequence Text Classification
6. Detect DNA Sequence Model Overfitting
6.1 Compare Validation and Test Classification Metrics
6.2 K-fold Cross-Validation Algorithm
7. Conclusion

1. Overview

A DNA sequence is simply text: a string data type in any programming language. Treating it as text, we can ask whether Natural Language Processing (NLP) techniques can be used for DNA sequence classification. Several advanced Machine Learning application papers on processing and classifying DNA sequence datasets have been published in “Machine Learning Applications in Genomics Life Sciences by Ernest Bonat, Ph.D.”

NLP is an AI field that focuses on enabling computers to understand, interpret, generate, and respond to human language in a valuable and meaningful manner. It involves the interaction between computers and humans through natural language. NLP techniques encompass Machine Learning algorithms, statistical models, and Deep Learning Networks like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers such as BERT, GPT, and their respective variants. These methods aid machines in learning patterns, structures, and subtleties within language, allowing them to efficiently perform various NLP tasks.

Machine Learning text classification is a technique used in the field of NLP. It involves training a Machine Learning algorithm to automatically classify text documents into predefined categories or classes. The goal is to teach a computer model to identify patterns and features within textual data, allowing it to make accurate predictions or categorizations when presented with new, unseen text.

Specifically for text classification using NLP, there are two unique machine learning steps:

Preprocessing: Cleaning and preparing the text data by removing irrelevant information, such as punctuation and stop words, and converting the text to a consistent format (lowercasing, tokenization, stemming, etc.)
Feature Extraction: Transforming the text data into numerical features that the machine learning algorithm can understand. Techniques like Bag-of-Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings (Word2Vec, GloVe), or more sophisticated methods like BERT embeddings are commonly used for this purpose.
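
The two steps above can be sketched for ordinary English text with nothing but the Python standard library. This is a minimal illustration, not a production pipeline: the stop-word list, the function names, and the two sample sentences are assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use a fuller one).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, tokenize, drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [preprocess(document) for document in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


documents = ["The cat sat in the hat.", "A cat is a cat!"]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)  # ['cat', 'hat', 'sat']
print(vectors)     # [[1, 1, 1], [2, 0, 0]]
```

Each column of the resulting matrix corresponds to one vocabulary term, which is the numeric form a classifier can consume.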

2. Generate DNA Sequence K-mers Substrings

DNA k-mers are substrings of length k that are extracted from a DNA sequence. In genetics and bioinformatics, DNA sequences consist of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T). K-mers are formed by taking every contiguous substring of length k within a longer DNA sequence, so a sequence of length n yields n - k + 1 overlapping k-mers. They are widely used in bioinformatics applications such as sequence alignment, assembly, analysis, and pattern identification, because they carry information about sequence composition, structure, and function. Analyzing k-mers aids in tasks such as identifying motifs, detecting sequence similarities, and studying genetic variations among different organisms.

import traceback as tb


def dna_k_mers_generation(dna_sequence, k_mers_length):
    """generate dna k-mers based on the substring length number
    args:
        dna_sequence (string): dna sequence string
        k_mers_length (integer): k-mers substring length
    returns:
        k_mers_list (list): k-mers list
    """
    # initialize before the try block so the return never hits an undefined name
    k_mers_list = []
    try:
        # slide a window of k_mers_length over the sequence, one base at a time
        for i in range(len(dna_sequence) - k_mers_length + 1):
            k_mer_substring = dna_sequence[i : i + k_mers_length]
            k_mers_list.append(k_mer_substring)
    except Exception:
        tb.print_exc()
    return k_mers_list

As you can see, this function includes a docstring and error handling, both of which are good software development practices. Poorly written Python code is easy to find online today. ‘Be a Python programmer, not just a ‘Pythonic’ programmer’. For more on this topic, feel free to read “Refactoring Python Code for Machine Learning Projects. Python “Spaghetti Code” Everywhere!”

Let’s look at some results:

DNA sequence:
CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGACACC
DNA sequence k-mers length:
2
DNA sequence k-mers substring:
['CC', 'CG', 'GA', 'AG', 'GG', 'GG', 'GC', 'CT', 'TA', 'AT', 'TG', 'GG', 'GT', 'TT', 'TT', 'TG', 'GG', 'GA', 'AA', 'AG', 'GT', 'TT', 'TA', 'AG', 'GA', 'AA', 'AC', 'CC', 'CC', 'CT', 'TG', 'GG', 'GG', 'GG', 'GC', 'CT', 'TT', 'TC', 'CT', 'TC', 'CG', 'GC', 'CG', 'GG', 'GA', 'AC', 'CA', 'AC', 'CC']

DNA sequence k-mers length:
3
DNA sequence k-mers substring:
['CCG', 'CGA', 'GAG', 'AGG', 'GGG', 'GGC', 'GCT', 'CTA', 'TAT', 'ATG', 'TGG', 'GGT', 'GTT', 'TTT', 'TTG', 'TGG', 'GGA', 'GAA', 'AAG', 'AGT', 'GTT', 'TTA', 'TAG', 'AGA', 'GAA', 'AAC', 'ACC', 'CCC', 'CCT', 'CTG', 'TGG', 'GGG', 'GGG', 'GGC', 'GCT', 'CTT', 'TTC', 'TCT', 'CTC', 'TCG', 'CGC', 'GCG', 'CGG', 'GGA', 'GAC', 'ACA', 'CAC', 'ACC']

DNA sequence k-mers length:
4
DNA sequence k-mers substring:
['CCGA', 'CGAG', 'GAGG', 'AGGG', 'GGGC', 'GGCT', 'GCTA', 'CTAT', 'TATG', 'ATGG', 'TGGT', 'GGTT', 'GTTT', 'TTTG', 'TTGG', 'TGGA', 'GGAA', 'GAAG', 'AAGT', 'AGTT', 'GTTA', 'TTAG', 'TAGA', 'AGAA', 'GAAC', 'AACC', 'ACCC', 'CCCT', 'CCTG', 'CTGG', 'TGGG', 'GGGG', 'GGGC', 'GGCT', 'GCTT', 'CTTC', 'TTCT', 'TCTC', 'CTCG', 'TCGC', 'CGCG', 'GCGG', 'CGGA', 'GGAC', 'GACA', 'ACAC', 'CACC']

DNA sequence k-mers length:
5
DNA sequence k-mers substring:
['CCGAG', 'CGAGG', 'GAGGG', 'AGGGC', 'GGGCT', 'GGCTA', 'GCTAT', 'CTATG', 'TATGG', 'ATGGT', 'TGGTT', 'GGTTT', 'GTTTG', 'TTTGG', 'TTGGA', 'TGGAA', 'GGAAG', 'GAAGT', 'AAGTT', 'AGTTA', 'GTTAG', 'TTAGA', 'TAGAA', 'AGAAC', 'GAACC', 'AACCC', 'ACCCT', 'CCCTG', 'CCTGG', 'CTGGG', 'TGGGG', 'GGGGC', 'GGGCT', 'GGCTT', 'GCTTC', 'CTTCT', 'TTCTC', 'TCTCG', 'CTCGC', 'TCGCG', 'CGCGG', 'GCGGA', 'CGGAC', 'GGACA', 'GACAC', 'ACACC']
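
A quick sanity check on listings like these is that a sequence of length n always produces exactly n - k + 1 overlapping k-mers. The short sketch below verifies this for the 50-base sequence used above; the helper name k_mers() is just a compact stand-in written for this check, not part of any library.

```python
def k_mers(sequence, k):
    """Return all overlapping substrings of length k (empty list if k > len)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


dna_sequence = "CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGACACC"

for k in (2, 3, 4, 5):
    substrings = k_mers(dna_sequence, k)
    # A sequence of length n has exactly n - k + 1 overlapping k-mers.
    assert len(substrings) == len(dna_sequence) - k + 1
    print(k, len(substrings), substrings[:3], "...", substrings[-1])
```

For this sequence the counts are 49, 48, 47, and 46 for k = 2, 3, 4, and 5.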

3. DNA Sequence Text Preprocessing for Machine Learning Algorithms

I could not find sufficient and comprehensible online information about DNA sequence text preprocessing using NLP, so allow me to develop and explain the logic behind it. Preprocessing DNA sequence text differs from preprocessing ordinary sentences. For DNA sequence text, there is no need for some of the standard tasks applied to typical text data, such as word tokenization, stop word removal, stemming, and lemmatization. For DNA sequence k-mers substrings, the following three preprocessing steps are required:

K-mers Generation — generating the k-mers substring.
K-mers Concatenation — concatenating the k-mers substring.
K-mers Vectorization — vectorizing the k-mers substring.

The code below demonstrates the main DNA sequence text preprocessing functions and their results. After this process, the DNA sequence dataset is ready to be used with any Supervised Machine Learning algorithm.

print("DNA sequence text preprocess using NLP ")
dna_sequence = "AACTTCTCCAACGACATCATGCTACTGCAGGTCAGGCACACTCCTGCCACTCTTG"
print("DNA sequence string:\n{}".format(dna_sequence))

# 1. generate the overlapping k-mers substrings
k_mers_length = 3
dna_k_mers_generate_list = PyDNA.dna_k_mers_generation(dna_sequence, k_mers_length)
print("DNA sequence k-mers generation:\n{}".format(dna_k_mers_generate_list))

# 2. concatenate the k-mers into a single space-separated sentence
dna_k_mers_concatenate_list = PyDNA.dna_k_mers_concatenation(dna_k_mers_generate_list)
print("DNA sequence k-mers concatenation:\n{}".format(dna_k_mers_concatenate_list))

# 3. vectorize the k-mers sentence into numeric counts
dna_k_mers_vectorizer_array = PyDNA.dna_k_mers_vectorization(dna_k_mers_concatenate_list)
print("DNA sequence k-mers vectorization :\n{}".format(dna_k_mers_vectorizer_array))

Results.

DNA sequence text preprocess using NLP 
DNA sequence string:
AACTTCTCCAACGACATCATGCTACTGCAGGTCAGGCACACTCCTGCCACTCTTG

DNA sequence k-mers generation:
['AAC', 'ACT', 'CTT', 'TTC', 'TCT', 'CTC', 'TCC', 'CCA', 'CAA', 'AAC', 'ACG', 'CGA', 'GAC', 'ACA', 'CAT', 'ATC', 'TCA', 'CAT', 'ATG', 'TGC', 'GCT', 'CTA', 'TAC', 'ACT', 'CTG', 'TGC', 'GCA', 'CAG', 'AGG', 'GGT', 'GTC', 'TCA', 'CAG', 'AGG', 'GGC', 'GCA', 'CAC', 'ACA', 'CAC', 'ACT', 'CTC', 'TCC', 'CCT', 'CTG', 'TGC', 'GCC', 'CCA', 'CAC', 'ACT', 'CTC', 'TCT', 'CTT', 'TTG']

DNA sequence k-mers concatenation:
['AAC ACT CTT TTC TCT CTC TCC CCA CAA AAC ACG CGA GAC ACA CAT ATC TCA CAT ATG TGC GCT CTA TAC ACT CTG TGC GCA CAG AGG GGT GTC TCA CAG AGG GGC GCA CAC ACA CAC ACT CTC TCC CCT CTG TGC GCC CCA CAC ACT CTC TCT CTT TTG']

DNA sequence k-mers vectorization:
[[2 2 1 4 2 1 1 1 3 2 2 2 1 1 1 3 2 2 1 2 1 1 1 1 1 1 2 2 2 3 1 1]]

Or, we can obtain the same results by using the following function dna_bag_of_words().

dna_k_mers_vectorizer_array = PyDNA.dna_bag_of_words(dna_k_mers_concatenate_list)
print("DNA sequence k-mers vectorizer:\n{}".format(dna_k_mers_vectorizer_array))

DNA sequence k-mers vectorization:
[[2 2 1 4 2 1 1 1 3 2 2 2 1 1 1 3 2 2 1 2 1 1 1 1 1 1 2 2 2 3 1 1]]
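
The internals of the PyDNA functions are not shown here, so the following is only a standard-library sketch of what the three steps might look like. The function names are hypothetical. It assumes that vectorization is a plain bag-of-words count over an alphabetically sorted vocabulary, which is how scikit-learn's CountVectorizer orders its columns; under that assumption the sketch reproduces the vector printed above.

```python
from collections import Counter


def generate_k_mers(sequence, k):
    """Step 1: slide a window of length k over the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def concatenate_k_mers(k_mers_list):
    """Step 2: join the k-mers into one space-separated 'sentence'."""
    return " ".join(k_mers_list)


def vectorize_k_mers(sentences):
    """Step 3: count each k-mer per sentence, columns in sorted vocabulary order."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


dna_sequence = "AACTTCTCCAACGACATCATGCTACTGCAGGTCAGGCACACTCCTGCCACTCTTG"
sentence = concatenate_k_mers(generate_k_mers(dna_sequence, 3))
vocabulary, rows = vectorize_k_mers([sentence])
print(len(vocabulary))  # 32 distinct 3-mers
print(rows[0])
```

Treating each k-mer as a "word" and each sequence as a "sentence" is what lets ordinary text vectorizers work on DNA without any further tokenization.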

4. DNA Sequence Protein Bound Dataset

An example of the DNA sequence protein bound dataset “Deep Learning Genomics Primer” can be downloaded from the following URL links:

The pandas DataFrame info() method can be used to display the dataset metadata.

<class 'pandas.core.frame.DataFrame'>    
RangeIndex: 2000 entries, 0 to 1999      
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   dna_sequence  2000 non-null   object
 1   dna_class     2000 non-null   int64 
dtypes: int64(1), object(1)

This dataset contains a different number of samples for each DNA class, as illustrated in the picture below. In Machine Learning, this is referred to as an imbalanced class dataset.

After applying the ETL SMOTE package from “Advanced DNA Sequences Preprocessing for Deep Learning Networks”, the DNA classes have been preprocessed, balanced and are now prepared for Machine Learning algorithms.
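
Before and after balancing, it helps to look at the class counts directly. The sketch below is a minimal standard-library check, not SMOTE itself and not part of the ETL package mentioned above; the function names and the toy labels are assumptions made up for illustration and are not the counts of the real dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, fraction of the dataset)} sorted by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}


def imbalance_ratio(labels):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Toy labels for illustration only.
toy_labels = [0] * 70 + [1] * 30
print(class_distribution(toy_labels))
print("imbalance ratio: {:.2f}".format(imbalance_ratio(toy_labels)))
```

A ratio close to 1.0 after balancing indicates that the classes are ready for training.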

5. DNA Sequence Text Classification

Now that the DNA sequence column has been preprocessed as text data, any Supervised Machine Learning algorithm can be applied. The code below utilizes the LazyClassifier library to employ multiple Machine Learning classification models.

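import os

from lazypredict.Supervised import LazyClassifier

# PyDNA is the author's helper class (import not shown)


def main():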
    # dna sequence dataset loading   
    csv_path_folder = r"data_folder_path\csv"
    csv_path_file = os.path.join(csv_path_folder, "dna_sequence_protein_original.csv")        
    df_genomics = PyDNA.pandas_read_data("CSV", csv_path_file, None)  
    df_genomics.info()

    # select y target
    y = PyDNA.dna_y_label(df_genomics, "dna_class")     
    
    # generate the k-mers substring
    X = PyDNA.dna_k_mers_generation("dna_sequence", df_genomics)    
    
    # concatenate the k-mers substring
    X = PyDNA.dna_k_mers_concatenation(X, "k_mers")    

    # vectorize the k-mers substring
    X = PyDNA.dna_k_mers_vectorization(X)  
    
    # training, validation and test data split
    X_train, y_train, X_valid, y_valid, X_test, y_test = PyDNA.train_validation_test_split(X, y, test_size=0.2, valid_size=0.5, random_state=50, stratify=False)    
      
    # lazy classifier implementation
    lazy_classifier = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None, predictions=False, random_state=50, classifiers="all")
    models, predictions = lazy_classifier.fit(X_train, X_test, y_train, y_test)
    print(models)    

if __name__ == '__main__':    
    main()

Here are the results obtained using the LazyClassifier library.

As you can see, the use of tree and boosting models significantly enhances the classification performance in the DNA dataset. It’s particularly intriguing that the RandomForestClassifier model achieves a perfect 100% accuracy score. This outcome is quite unprecedented in my experience, and I am concerned about the potential for overfitting the model in general. It might be prudent to conduct some testing to validate this specific case. Additionally, it’s important to note that the LazyClassifier library does not provide the confusion matrix and classification report model metrics. I hope that the LazyClassifier library development team will consider incorporating these features in future updates.

6. Detect DNA Sequence Model Overfitting

Overfitting occurs when a Machine Learning model learns the training data too well to the point that it negatively impacts its performance on unseen or new data. There are several ways to detect overfitting in a machine learning model:

1. High Training Accuracy, Low Test Accuracy: If the model shows significantly higher accuracy on the training dataset compared to the test dataset, it might be overfitting. A large gap between training and test accuracy indicates the model is not generalizing well to new data.

2. Validation Curves: Plotting the training and validation (or test) accuracy/loss over different epochs can provide insight. If the training accuracy keeps increasing while the validation accuracy plateaus or decreases, it suggests overfitting.

3. Cross-Validation: Using techniques like k-fold cross-validation can help assess the model’s performance on different subsets of the data. If the model performs significantly better on the training set compared to the validation sets, it might be overfitting.

4. Learning Curves: Plotting learning curves that show the model’s performance (accuracy or loss) on the training and validation datasets as the training size increases can reveal overfitting. If the training error is low, but the validation error remains high even with more data, it could indicate overfitting.

5. Regularization Techniques: Applying regularization techniques like L1 or L2 regularization, dropout, or early stopping can help prevent overfitting. Monitoring changes in performance with and without these techniques can indicate whether overfitting is present.

6. Feature Importance: If the model relies too heavily on specific features in the training data, it might not generalize well to new data. Analyzing feature importance can help identify overfitting caused by an excessive focus on certain features.

7. Confusion Matrix and Metrics: Examining metrics like precision, recall, F1-score, or the confusion matrix can reveal if the model is performing well on all classes or if it’s biased towards certain classes due to overfitting.

It’s important to note that overfitting is a common issue in machine learning, and mitigating it often involves a balance between model complexity, data size, and various regularization techniques. Employing proper evaluation methods and understanding the model’s behavior with different datasets can help in detecting and addressing overfitting issues.

6.1 Compare Validation and Test Classification Metrics

In general, the scikit-learn ‘train_test_split()’ function splits the data into training and testing sets. In my view, the ‘test’ data returned by this function is better called ‘validation’ data, because it is used to validate the model’s performance during development. The ‘test’ set should instead represent real production application data that the model has never seen. With that in mind, I developed my own ‘train_validation_test_split()’ function to split the whole dataset into train, validation, and test (train/validation/test) sets, as shown below.

X_train, X_valid, X_test, y_train, y_valid, y_test = PyDNA.train_validation_test_split(X, y, test_valid_size=0.35, valid_size=0.5, random_state=50, stratify=True)
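
The body of train_validation_test_split() is not shown, so here is a rough standard-library sketch of how a two-stage split with these parameters might work. The function name is hypothetical, and the sketch assumes that test_valid_size is the fraction held out from training and that valid_size is the share of that hold-out given to validation. It shuffles without stratification, unlike the stratify=True call above.

```python
import random


def train_validation_test_split_sketch(X, y, test_valid_size=0.35, valid_size=0.5, random_state=None):
    """Shuffle indices once, hold out a fraction, then divide the hold-out in two."""
    indices = list(range(len(X)))
    random.Random(random_state).shuffle(indices)

    n_holdout = round(len(indices) * test_valid_size)  # validation + test together
    n_valid = round(n_holdout * valid_size)            # validation share of the hold-out

    holdout, train = indices[:n_holdout], indices[n_holdout:]
    valid, test = holdout[:n_valid], holdout[n_valid:]

    def pick(data, chosen):
        return [data[i] for i in chosen]

    return (pick(X, train), pick(X, valid), pick(X, test),
            pick(y, train), pick(y, valid), pick(y, test))


# Toy data with the same number of rows as the dataset (2000).
X = list(range(2000))
y = [value % 2 for value in X]
X_train, X_valid, X_test, y_train, y_valid, y_test = train_validation_test_split_sketch(
    X, y, test_valid_size=0.35, valid_size=0.5, random_state=50)
print(len(X_train), len(X_valid), len(X_test))  # 1300 350 350
```

With 2000 rows, these parameters give 1300 training, 350 validation, and 350 test samples.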

Comparing the validation and test classification metrics then makes it easy to check whether the model is overfitting. With test_valid_size=0.35 and valid_size=0.5, 65% of the dataset is used for training and the remaining 35% is divided equally between validation and testing (65%/17.5%/17.5%), which matches the support of 350 samples in each report below. Now, using the Random Forest classifier model.

RandomForestClassifier(n_jobs=-1, random_state=50)

The following results can be obtained.

Validate metrics:

classification accuracy score:
99.14

classification confusion matrix:
[[174   3]
 [  0 173]]

classification report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       177
           1       0.98      1.00      0.99       173

    accuracy                           0.99       350
   macro avg       0.99      0.99      0.99       350
weighted avg       0.99      0.99      0.99       350


Test metrics:

classification accuracy score:
99.43

classification confusion matrix:
[[176   2]
 [  0 172]]

classification report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       178
           1       0.99      1.00      0.99       172

    accuracy                           0.99       350
   macro avg       0.99      0.99      0.99       350
weighted avg       0.99      0.99      0.99       350

As we can see, the classification accuracy scores of the validation and test sets are very close (99.14% vs. 99.43%). This indicates that the developed model is not overfitted. Very simple!
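
The accuracy, precision, and recall values in the reports above can be recomputed by hand from the two confusion matrices. The sketch below uses only the standard library and assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the function name is made up for this example.

```python
def metrics_from_confusion_matrix(matrix):
    """Return accuracy and {class: (precision, recall)}; matrix[true][predicted]."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / total

    per_class = {}
    for c in range(n_classes):
        predicted_as_c = sum(matrix[r][c] for r in range(n_classes))  # column sum
        actually_c = sum(matrix[c])                                   # row sum (support)
        precision = matrix[c][c] / predicted_as_c if predicted_as_c else 0.0
        recall = matrix[c][c] / actually_c if actually_c else 0.0
        per_class[c] = (precision, recall)
    return accuracy, per_class


validation_matrix = [[174, 3], [0, 173]]
test_matrix = [[176, 2], [0, 172]]

for name, matrix in (("validation", validation_matrix), ("test", test_matrix)):
    accuracy, per_class = metrics_from_confusion_matrix(matrix)
    print("{} accuracy: {:.2f}".format(name, accuracy * 100))
    for label, (precision, recall) in per_class.items():
        print("  class {}: precision {:.2f}, recall {:.2f}".format(label, precision, recall))
```

The validation and test accuracies come out as 99.14 and 99.43, in agreement with the scores reported above.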

6.2 K-fold Cross-Validation Algorithm

Another effective method to detect overfit models is by using k-fold cross-validation on the test data. A high error rate in this data serves as an indicator of overfitting. The k_folds_cross_validation() static function allows calculation of the mean and standard deviation values of the cross-validation scores array for a specific k-fold value.

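import traceback as tb

from sklearn.model_selection import KFold, cross_val_score


@staticmethod
def k_folds_cross_validation(model_estimator, X_feature, y_label, k_number_splits, k_random_state=None):
    """calculate the k-fold cross-validation scores with their mean and standard deviation
    args:
        model_estimator (object): object model
        X_feature (dataframe): pandas data frame
        y_label (series): pandas data series
        k_number_splits (int): k-fold number of splits
        k_random_state (int, optional): defaults to none
    returns:
        mean_cv_score (float): cv score mean value
        standard_deviation_cv_score (float): cv score standard deviation value
        cv_scores_array (array): cv score of each fold
    """
    # initialize the results so the return is safe if an exception is caught
    mean_cv_score = standard_deviation_cv_score = cv_scores_array = None
    try:
        # KFold accepts random_state only when shuffle=True
        k_folds = KFold(n_splits=k_number_splits, shuffle=k_random_state is not None, random_state=k_random_state)
        # pass the KFold object so the configured splitter is actually used
        cv_scores_array = cross_val_score(estimator=model_estimator, X=X_feature, y=y_label, cv=k_folds, n_jobs=-1)
        mean_cv_score = cv_scores_array.mean()
        standard_deviation_cv_score = cv_scores_array.std()
    except Exception:
        tb.print_exc()
    return mean_cv_score, standard_deviation_cv_score, cv_scores_array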

Here is an example of using the Random Forest classifier with five folds.

model_classifier = RandomForestClassifier(n_jobs=-1, random_state=50)
model_estimator = model_classifier
X_feature = X_test
y_label = y_test
k_number_splits = 5
mean_cv_score, standard_deviation_cv_score, cv_scores_array = PyDNA.k_folds_cross_validation(model_estimator, X_feature, y_label, k_number_splits, k_random_state=None)
print("mean cv score: {}".format(mean_cv_score))
print("standard deviation cv score: {}".format(standard_deviation_cv_score))
print("cv scores array: {}".format(cv_scores_array))

Results:

mean cv score: 0.9933333333333334
standard deviation cv score: 0.0062360956446232485
cv scores array: [0.98333333 0.99166667 1. 0.99166667 1.]
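
To make the mechanics concrete, here is a small standard-library sketch of how k folds partition the sample indices and how the summary statistics above follow from the per-fold scores. The function name is hypothetical and the folds are contiguous and unshuffled, which is a simplifying assumption; scikit-learn's splitters offer shuffling and stratification on top of this.

```python
from statistics import fmean, pstdev


def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)

# Per-fold scores copied from the results above.
cv_scores = [0.98333333, 0.99166667, 1.0, 0.99166667, 1.0]
mean_cv_score = fmean(cv_scores)
# pstdev is the population standard deviation, the same default as numpy's std().
standard_deviation_cv_score = pstdev(cv_scores)
print("mean cv score: {:.6f}".format(mean_cv_score))
print("standard deviation cv score: {:.6f}".format(standard_deviation_cv_score))
```

Every sample lands in exactly one test fold, so each score is computed on data the model did not see while training on the other folds.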

The table below displays the stability of the mean and standard deviation values for k-folds 5, 10, 15, 20, 25, and 30. It is evident that the developed Random Forest classifier model is not overfitting.

7. Conclusion

1. Three primary DNA sequence k-mer preprocessing steps were developed and tested: generation, concatenation, and vectorization.

2. Traditional machine learning tree and boosting algorithms yield high accuracy scores in DNA sequence text classification. Practically the same results were obtained in “Apply Machine Learning Algorithms for Genomics Data Classification”.

3. Both the comparison of the validation/test classification metrics and the k-fold cross-validation results indicate that the developed Random Forest classifier model is not overfitted.