Apply Machine Learning Algorithms for Genomics Data Classification

Ernest Bonat, Ph.D.
33 min readFeb 3, 2021

--

Ernest Bonat, Ph.D., Bishes Rayamajhi, M.S.

(Updated: 09/30/2023)

1. Overview
2. DNA Sequence
3. Validating DNA Sequence String
4. Counting Base Nucleotides in a DNA Sequence String
5. Reversing a DNA Sequence String
6. Complementing a DNA Sequence String
7. Reversing-Complementing a DNA Sequence String
8. Counting CG-Content in a DNA Sequence String
9. Converting DNA Sequence String into NumPy Array
10. Searching for DNA Sequence Patterns
11. DNA Sequence String Translation into Protein
12. DNA Sequence String Transcription into RNA Sequence String
13. Encoding DNA Sequence String
14. Bag of Words for a DNA Sequence String
15. RNA Sequence
16. Validating RNA Sequence String
17. Counting Base Nucleotides in an RNA Sequence String
18. RNA Sequence String Translation into Protein
19. DNA, RNA and Protein Sequence Alignments
20. Biopython Library Overview
21. Custom PyDNA Library
22. DNA Sequence Classification using Machine Learning Algorithms
23. Applying Convolutional Neural Networks (CNN) for DNA Sequence Classification
24. Applying Long Short-Term Memory Networks (LSTM) for DNA Sequence Classification
25. Conclusions

1. Overview

In the fields of molecular biology and genetics, a genome is all genetic material of an organism. It consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and the non-coding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is called genomics . It includes the interactions of genes with each other and with the person’s environment. Genomic data is used in the field of Bioinformatics for collection, storage and processing of living being genomes. Genomic data processing requires a significant amount of data storage and high-performance hardware and software for statistical analysis.

Machine Learning (ML) is an application of Artificial Intelligence (AI) that enables automatic learning and improvement from experience without explicitly programming and knowledge of the learning environment. The most important goal is to develop and deploy a computer system that can learn automatically without any human interference. By definition it must be able to adjust its actions from prior results.

Machine Learning has become one of the main methods for many genomics researches tasks today, including:

1. Description and interpretation of large-scale genomic datasets.
2. Annotation of wide variety of genomics sequence elements.
3. Predicting the influence of genetic variation on DNA/RNA sequences.
4. Determine the likelihood of developing a particular disease and also to determine genetic heredity.
5. Identify patterns, make predictions and model the progression or treatment of a specific disease.

The future applications of ML in genomics could be: Pharmacogenomics, newborn genetic screening tools, agriculture, etc. Based on ML project types we can define specific applications. For classification (Supervised Learning): classifying shorter sequences into classes (phylum, genus, species, etc.); phylogenetic inference of the sequences; detection of plasmids and chromosomes; finding coding regions; chromosome prediction in human genomics; etc. For clustering (Unsupervised Learning): binning of metagenomics contigs; identification of plasmids and chromosomes; clustering reads into chromosomes for better assembly; clustering of reads as a preprocessor for assembly of reads, etc.

In this blog paper, we will understand the DNA/RNA/protein sequences structure and their manipulation using Python Data Ecosystem libraries. It will show how ML algorithms can be used for DNA/RNA/protein sequences classification and prediction. A comparison table of traditional and modern ML classification algorithms for genomic datasets will be provided. A simple Python library PyDNA will be presented for DNA/RNA/protein sequences string processing and ML classification algorithms. Some of the latest and best practices of Machine Learning algorithms applied in genomics Life Sciences have been published on Medium.com in the paper “Machine Learning Applications in Genomics Life Sciences by Ernest Bonat, Ph.D.

2. DNA Sequence

Based on the National Human Genome Research Institute, Deoxyribonucleic Acid (DNA) is the chemical compound which consists of the instructions needed to develop and direct the activities of nearly all living organisms. A DNA molecule is a double helix structure consisting of two twisted paired strands. DNA consists of four bases which are adenine [A], cytosine [C], guanine [G], or thymine [T]. DNA sequence is a laboratory process of determining the sequence of these four bases in a DNA molecule. More info about DNA Sequencing can be found in “Introduction to DNA Sequencing”.

Let’s look at DNA sequence strings manipulation using Python Data Ecosystem libraries. All provided function calls have been implemented in a Python custom DNA library named PyDNA. Further explanation about this library is found later in this blog paper.

3. Validating DNA Sequence String

The DNA sequence string must contain four base nucleotides [“A”,”C”,”G”,”T”]. This function allows to validate DNA sequences with any defined custom base of nucleotides.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT”
is_dna_result = PyDNA.is_dna(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“Is DNA:\n{}”.format(is_dna_result))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT
Is DNA:
True

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTF”
is_dna_result = PyDNA.is_dna(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“Is DNA:\n{}”.format(is_dna_result))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTF
Is DNA:
False

4. Counting Base Nucleotides in a DNA Sequence String

The function below allows to count DNA sequence nucleotides with any defined custom base. The length of the DNA sequence will be returned too.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
base_sequence_count, dna_sequence_length = PyDNA.dna_count_nucleotide(dna_sequence_string, base_sequence=[“A”,”C”,”G”,”T”], is_length=True)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“DNA nucleotides count:\n{}”.format(base_sequence_count))
print(“DNA length:\n{}”.format(dna_sequence_length))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
DNA nucleotides count:
{‘A’: 13, ‘C’: 7, ‘G’: 12, ‘T’: 23}
DNA length:
55

Example:

dna_sequence_string = “ATYTRTCCYGGYAATRYTCGTAGTTAGRCTGATYTTATTGGYGCGAARATTYYTR”base_sequence_count, dna_sequence_length = PyDNA.dna_count_nucleotide(dna_sequence_string, base_sequence=[“A”,”C”,”G”,”T”,”R”,”Y”], is_length=True)print(“DNA Nucleotide Count:\n{}”.format(base_sequence_count))print(“DNA length:\n{}”.format(dna_sequence_length))

Results:

DNA Nucleotide Count:
{‘A’: 10, ‘C’: 5, ‘G’: 10, ‘T’: 17, ‘R’: 5, ‘Y’: 8}
DNA length:
55

5. Reversing a DNA Sequence String

The following function reverse a DNA sequence string.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_reverse_sequence = PyDNA.dna_sequence_reverse(dna_sequence_string)
print(“DNA Sequence String:\n{}”.format(dna_sequence_string))
print(“Reverse DNA sequence:\n{}”.format(dna_reverse_sequence))

Results:

DNA Sequence String:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
Reverse DNA sequence:
TTTTTTAAAAGCGCGGTTATTTTAGTCGGATTGATGCTTTTAAGGGCCCTATATA

6. Complementing a DNA Sequence String

The complementation of a DNS sequence is based on the IUPAC Degeneracies Conversion (https://www.bioinformatics.org/sms/iupac.html).

Example:

dna_sequence_test = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_complement_sequence = PyDNA.dna_sequence_complement(dna_sequence_test)
print(“DNA sequence string:\n{}”.format(dna_sequence_test))
print(“Complement DNA sequence:\n{}”.format(dna_complement_sequence))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
Complement DNA sequence:
TATATAGGGCCCTTAAAAGCATCAATCCGACTAAAATAACCGCGCTTTTAAAAAA

7. Reversing-Complementing a DNA Sequence String

The reverse-complement of DNS sequence string is implemented by using the dna_sequence_reverse() and dna_sequence_complement() functions.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT”
dna_reverse_complement_sequence = PyDNA.dna_sequence_reverse_complement(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“Reverse-Complement DNA sequence:\n{}”.format(dna_reverse_complement_sequence))

Results:

ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT
AAATTTTCGCGCCAATAAAATCAGCCTAACTACGAAAATTCCCGGGATATAT

8. Counting GC-Content in a DNA Sequence String

The GC-Content represents the percentage of nitrogenous bases in a DNA or RNA sequences.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT”
gc_content = PyDNA.dna_count_gc_content(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“GC-content:\n{}”.format(gc_content))

Result:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT
GC-content:
36.5%

9. Converting DNA Sequence String into NumPy Array

Numerical Python (NumPy) is a main library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. This library is one the fastest available in Python Data Ecosystem today.

The NumPy core program is compiled and well-optimized C code. Because of that it’s highly recommended to be used in Computational Biology and Bioinformatics for string high performance processing and Machine Learning algorithms. Still many Data Scientists using List/Sets/Dictionary objects for strings manipulation in Python. Some of them complain that Python is very slow when they are writing poor programming codes in their daily work. I always recommend my Data Scientist/Engineer friends take advanced Python programming courses before writing a single line of Python code. When you have a chance, I would like you to read the following blog paper for fun: “Refactoring Python Code for Machine Learning Projects. Python ‘Spaghetti Code’ Everywhere!

Below is the PyDNA function code to convert a DNA sequence string to NumPy one-dimensional array.

@staticmethod
def dna_sequence_np_array(dna_sequence_string):
dna_sequence_array = None
try:
dna_sequence_string = dna_sequence_string.lower()
regex_acgt = re.compile('[^acgt]')
if (regex_acgt.search(dna_sequence_string) == None):
dna_sequence_array = np.array(list(dna_sequence_string))
else:
dna_sequence_array = None
except:
print(PyDNA.get_exception_info())
if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())
return dna_sequence_array

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_np_array = PyDNA.dna_sequence_np_array(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“DNA NumPy array:\n{}”.format(dna_np_array))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
DNA NumPy array:
['a' 't' 'a' 't' 'a' 't' 'c' 'c' 'c' 'g' 'g' 'g' 'a' 'a' 't' 't' 't' 't' 'c' 'g' 't' 'a' 'g' 't' 't' 'a' 'g' 'g' 'c' 't' 'g' 'a' 't' 't' 't' 't' 'a' 't' 't' 'g' 'g' 'c' 'g' 'c' 'g' 'a' 'a' 'a' 'a' 't' 't' 't' 't' 't' 't']

10. Searching for DNA Sequence Patterns

Searching for DNA sequence patterns is a standard task in Bioinformatics today including protein domains, DNA transcription factor binding motifs, restriction enzyme cut sites, degenerate PCR primer sites, runs of mononucleotides and many more. The PyDNA library contains a simple dna_sequence_pattern() method to find patterns in DNA sequence strings.

@staticmethod
def dna_sequence_pattern(dna_sequence_string, dna_sequence_pattern):
search_result = False
try:
search_pattern = re.search(dna_sequence_pattern.lower(), dna_sequence_string.lower())
if search_pattern: search_result = True
except:
print(PyDNA.get_exception_info())
if PyDNA._app_is_log: PyDNA.write_log_file("error", PyDNA.get_exception_info())
return search_result

Let’s look at some examples.

Example 1.

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_sequence_pattern = “AATTTT”
result = PyDNA.dna_sequence_pattern(dna_sequence_string, dna_sequence_pattern)
print(result)
True

Example 2.

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_sequence_pattern = “AATTTTAA”
result = PyDNA.dna_sequence_pattern(dna_sequence_string, dna_sequence_pattern)
print(result)
False

11. DNA Sequence String Translation into Protein

To translate a DNA sequence into proteins the DNA Genetic Code Chart is used.

Example:

dna_sequence_string = “ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG” 
protein_translation = PyDNA.dna_protein_translation(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“Protein translation:\n{}”.format(protein_translation))

Results:

DNA sequence string:
ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG
Protein translation:
MEVFKAPPIGI

12. DNA Sequence String Transcription into RNA Sequence String

The function below transcript a DNA sequence string into an RNA sequence string.

Example:

dna_sequence_string = “ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG” 
rna_transcription = PyDNA.dna_rna_transcription(dna_sequence_string)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“RNA transcription:\n{}”.format(rna_transcription))

Results:

DNA sequence string:
ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG
RNA transcription:
AUGGAAGUAUUUAAAGCGCCACCUAUUGGGAUAUAAG

13. Encoding DNA Sequence String

To use the DNA sequence string in ML projects it needs to be encoded first. ML mathematical algorithms don’t work with text categorical data. Depending on the ML algorithm selected, there are three main types of DNA sequence string encoding:

1. Label Encoding — this label (ordinary) encoding will encode each base nucleotide as a custom numerical value. In general, to be more accurate with ML algorithms floating (decimal) numbers are used. Before applying this encoding, the DNA sequence string needs to be converted into NumPy one-dimensional array. This type of encoding is very popular in Supervised Learning using scikit-learn library. In the example bellow the “ACGT” sequence string will be encoded as [0.25, 0.5, 0.75, 1.0].

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_np_array = PyDNA.dna_sequence_np_array(dna_sequence_string)
dna_label_encoder = PyDNA.dna_label_encoder(dna_np_array)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“DNA NumPy array:\n{}”.format(dna_np_array))
print(“Custom Label Encoding:\n{}”.format(dna_label_encoder))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
DNA NumPy array:
[‘a’ ‘t’ ‘a’ ‘t’ ‘a’ ‘t’ ‘c’ ‘c’ ‘c’ ‘g’ ‘g’ ‘g’ ‘a’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘c’ ‘g’ ‘t’ ‘a’ ‘g’ ‘t’ ‘t’ ‘a’ ‘g’ ‘g’ ‘c’ ‘t’ ‘g’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘a’ ‘t’ ‘t’ ‘g’ ‘g’ ‘c’ ‘g’ ‘c’ ‘g’ ‘a’ ‘a’ ‘a’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘t’ ‘t’]
Custom Label Encoding:
[0.25 1. 0.25 1. 0.25 1. 0.5 0.5 0.5 0.75 0.75 0.75 0.25 0.25 1. 1. 1. 1. 0.5 0.75 1. 0.25 0.75 1. 1. 0.25 0.75 0.75 0.5 1. 0.75 0.25 1. 1. 1. 1. 0.25 1. 1. 0.75 0.75 0.5 0.75 0.5 0.75 0.25 0.25 0.25 0.25 1. 1. 1. 1. 1. 1. ]

2. One-hot Encoding — this encoding is often use in Artificial Neural Networks (ANN). Many times, ANN are called Deep Learning. Deep learning (also known as Deep Structured Learning) is part of a broader family of ML methods based on ANN with representation learning. It includes the following main architectures: Deep Neural Networks (DNN), Deep Belief Networks (DBN), Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) . The scikit-learn and Keras libraries can provide this one-hot encoding implementation. For the standard base of nucleotides, the “ACGT” sequence string will be one-hot encoded as [[1. 0. 0. 0.] [0. 1. 0. 0.] [0. 0. 1. 0.] [0. 0. 0. 1.]] using the NumPy array [‘a’ ‘c’ ‘g’ ‘t’].

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_np_array = PyDNA.dna_sequence_np_array(dna_sequence_string)
dna_one_hot = PyDNA.dna_onehot_encoder(dna_np_array)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“DNA NumPy array:\n{}”.format(dna_np_array))
print(“DNA One-Hot Encoding with Scikit-Learn framework:\n{}”.format(dna_one_hot))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT
DNA NumPy array:
[‘a’ ‘t’ ‘a’ ‘t’ ‘a’ ‘t’ ‘c’ ‘c’ ‘c’ ‘g’ ‘g’ ‘g’ ‘a’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘c’ ‘g’ ‘t’ ‘a’ ‘g’ ‘t’ ‘t’ ‘a’ ‘g’ ‘g’ ‘c’ ‘t’ ‘g’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘a’ ‘t’ ‘t’ ‘g’ ‘g’ ‘c’ ‘g’ ‘c’ ‘g’ ‘a’ ‘a’ ‘a’ ‘a’ ‘t’ ‘t’ ‘t’ ‘t’ ‘t’ ‘t’]
DNA One-Hot Encoding with Scikit-Learn framework:
[[1. 0. 0. 0.] [0. 0. 0. 1.] [1. 0. 0. 0.] [0. 0. 0. 1.] [1. 0. 0. 0.] [0. 0. 0. 1.] [0. 1. 0. 0.] [0. 1. 0. 0.] [0. 1. 0. 0.] [0. 0. 1. 0.] [0. 0. 1. 0.] [0. 0. 1. 0.] [1. 0. 0. 0.] [1. 0. 0. 0.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 1. 0. 0.] [0. 0. 1. 0.] [0. 0. 0. 1.] [1. 0. 0. 0.] [0. 0. 1. 0.] [0. 0. 0. 1.] [0. 0. 0. 1.] [1. 0. 0. 0.] [0. 0. 1. 0.] [0. 0. 1. 0.] [0. 1. 0. 0.] [0. 0. 0. 1.] [0. 0. 1. 0.] [1. 0. 0. 0.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [1. 0. 0. 0.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 1. 0.] [0. 0. 1. 0.] [0. 1. 0. 0.] [0. 0. 1. 0.] [0. 1. 0. 0.] [0. 0. 1. 0.] [1. 0. 0. 0.] [1. 0. 0. 0.] [1. 0. 0. 0.] [1. 0. 0. 0.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.] [0. 0. 0. 1.]]

The same results can be provided by using the Keras library.

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTTTTT”
dna_np_array = PyDNA.dna_sequence_np_array(dna_sequence_string)
dna_one_hot = PyDNA.dna_onehot_encoder_keras(dna_np_array)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“DNA NumPy array:\n{}”.format(dna_np_array))
print(“DNA One-Hot Encoding with Scikit-Learn framework:\n{}”.format(dna_one_hot))

3. K-mer Counting — In Bioinformatics, k-mers are subsequences of length contained within a biological sequence. Usually, the term k-mer refers to all of a sequence’s subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT) — from k-mer definition. In general, decomposing a sequence into its k-mers fixed-size chunks allows fast and easy string manipulation. This is wisely applied in Natural Language Processing (NLP) bag of words method for ML algorithms. This method will be covered in the next topic.

Example:

dna_sequence_string = “ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT”
k_mer_list, k_mer_numpy_array = PyDNA.k_mer_words(dna_sequence_string, k_mer_length=6)
print(“DNA sequence string:\n{}”.format(dna_sequence_string))
print(“K-mer list:\n{}”.format(k_mer_list))
print(“K-mer array:\n{}”.format(k_mer_numpy_array))

Results:

DNA sequence string:
ATATATCCCGGGAATTTTCGTAGTTAGGCTGATTTTATTGGCGCGAAAATTT
K-mer list:
[‘atatat’, ‘tatatc’, ‘atatcc’, ‘tatccc’, ‘atcccg’, ‘tcccgg’, ‘cccggg’, ‘ccggga’, ‘cgggaa’, ‘gggaat’, ‘ggaatt’, ‘gaattt’, ‘aatttt’, ‘attttc’, ‘ttttcg’, ‘tttcgt’, ‘ttcgta’, ‘tcgtag’, ‘cgtagt’, ‘gtagtt’, ‘tagtta’, ‘agttag’, ‘gttagg’, ‘ttaggc’, ‘taggct’, ‘aggctg’, ‘ggctga’, ‘gctgat’, ‘ctgatt’, ‘tgattt’, ‘gatttt’, ‘atttta’, ‘ttttat’, ‘tttatt’, ‘ttattg’, ‘tattgg’, ‘attggc’, ‘ttggcg’, ‘tggcgc’, ‘ggcgcg’, ‘gcgcga’, ‘cgcgaa’, ‘gcgaaa’, ‘cgaaaa’, ‘gaaaat’, ‘aaaatt’, ‘aaattt’]
K-mer array:
[‘atatat’ ‘tatatc’ ‘atatcc’ ‘tatccc’ ‘atcccg’ ‘tcccgg’ ‘cccggg’ ‘ccggga’ ‘cgggaa’ ‘gggaat’ ‘ggaatt’ ‘gaattt’ ‘aatttt’ ‘attttc’ ‘ttttcg’ ‘tttcgt’ ‘ttcgta’ ‘tcgtag’ ‘cgtagt’ ‘gtagtt’ ‘tagtta’ ‘agttag’ ‘gttagg’ ‘ttaggc’ ‘taggct’ ‘aggctg’ ‘ggctga’ ‘gctgat’ ‘ctgatt’ ‘tgattt’ ‘gatttt’ ‘atttta’ ‘ttttat’ ‘tttatt’ ‘ttattg’ ‘tattgg’ ‘attggc’ ‘ttggcg’ ‘tggcgc’ ‘ggcgcg’ ‘gcgcga’ ‘cgcgaa’ ‘gcgaaa’ ‘cgaaaa’ ‘gaaaat’ ‘aaaatt’ ‘aaattt’]

14. Bag of Words for a DNA Sequence String

A DNA sequence is a simple unstructured text data. Because of that, NLP should be an excellent tool to use for it. The main idea of using NLP is to allow computers to understand the unstructured text and retrieve meaningful pieces of information from it to make business decisions. NLP is part of the Artificial Intelligence Ecosystem.

Bag of words is one of the methods used in NLP. Its main function is to convert raw text data into words and count the frequency of them in the text. The order of these words in the text is not relevant. For a text document a matrix of token counts is generated. In general, this matrix represents the final features vector to be applied in ML algorithms. In our example below the k-mers list of DNA subsequences is the input parameter to generate the bag of words for it.

Example:

k_mer_list = [‘atatat’, ‘tatatc’, ‘atatcc’, ‘tatccc’, ‘atcccg’, ‘tcccgg’, ‘cccggg’, ‘ccggga’, ‘cgggaa’, ‘gggaat’]
word_ngram = 1
k_mer_token_count = PyDNA.bag_of_word_list(k_mer_list, word_ngram)
print(“K-mer list:\n{}”.format(k_mer_list))
print(“Word ngram:\n{}”.format(word_ngram))
print(“K-mer matrix token counts:\n{}”.format(k_mer_token_count.toarray()))

Results:

K-mer list:
[‘atatat’, ‘tatatc’, ‘atatcc’, ‘tatccc’, ‘atcccg’, ‘tcccgg’, ‘cccggg’, ‘ccggga’, ‘cgggaa’, ‘gggaat’]
Word ngram:
1
K-mer matrix of token counts:
[[1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]]

15. RNA Sequence

Based on the National Human Genome Research Institute, Ribonucleic Acid (RNA) is a nucleic acid that is similar in structure to DNA but different in subtle ways. The cell uses RNA for a number of different tasks, one of which is called messenger RNA, or mRNA. And that is the nucleic acid information molecule that transfers information from the genome into proteins by translation. Another form of RNA is tRNA, or transfer RNA, and these are non-protein encoding RNA molecules that physically carry amino acids to the translation site that allows them to be assembled into chains of proteins in the process of translation.

RNA sequence analyses the continuously changing cellular transcriptome using Next-Generation Sequencing (NGS) techniques. NGS is a large-scale automation process which accomplishes rapid sequencing of base pairs in a DNA or RNA sample. NGS has enabled researchers to perform several biological studies and can sequence an entire human genome in a single day. NGS applications are used in a wide variety of modern technologies such as high-throughput complete viral genome sequencing, virus genome variability detection and evolution in a host.

Let’s look at some PyDNA functions to support RNA sequence strings manipulation.

16. Validating RNA Sequence String

The RNA sequence string must contain four base nucleotides [“A”,”C”,”G”,”U”]. This function allows validating RNA sequences with any defined custom base nucleotides.

Example:

rna_sequence_string = “AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA”
is_rna_result = PyDNA.is_rna(rna_sequence_string, base_sequence=[“A”,”C”,”G”,”U”])
print(“RNA sequence string:\n{}”.format(rna_sequence_string))
print(“Is RNA:\n{}”.format(is_rna_result))

Results:

RNA sequence string:
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Is RNA:
True

Example:

rna_sequence_string = “AUGGCCTUGGCGCCCAGAACUGAGAUCTAUAGUACCCGUAUUAACTGGUGA”
is_rna_result = PyDNA.is_rna(rna_sequence_string, base_sequence=[“A”,”C”,”G”,”U”])
print(“RNA sequence string:\n{}”.format(rna_sequence_string))
print(“Is RNA:\n{}”.format(is_rna_result))

Results:

RNA sequence string:
AUGGCCTUGGCGCCCAGAACUGAGAUCTAUAGUACCCGUAUUAACTGGUGA
Is RNA:
False

17. Counting Base Nucleotides in an RNA Sequence String

The function below allows counting RNA sequence nucleotides with any defined custom base. The length of the RNA sequence is returned too.

Example:

rna_sequence_string = “AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA”
base_sequence_count, rna_sequence_length = PyDNA.rna_count_nucleotide(rna_sequence_string, base_sequence=[“A”,”C”,”G”,”U”], is_length=True)
print(“RNA sequence string:\n{}”.format(rna_sequence_string))
print(“RNA nucleotides count:\n{}”.format(base_sequence_count))
print(“RNA length:\n{}”.format(rna_sequence_length))

Results:

RNA sequence string:
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
RNA nucleotides count:
{‘A’: 15, ‘C’: 12, ‘G’: 14, ‘U’: 10}
RNA le

18. RNA Sequence String Translation into Protein

To translate an RNA sequence into proteins the RNA Genetic Code Chart is used.

Example:

rna_sequence_string = “AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA”
protein_translation = PyDNA. rna_protein_translation(rna_sequence_string)
print(“RNA sequence string:\n{}”.format(rna_sequence_string))
print(“Protein translation:\n{}”.format(protein_translation))

Results:

RNA sequence string:
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Protein translation:
MAMAPRTEINSTRING

19. DNA, RNA and Protein Sequence Alignments

Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

One of the fundamental techniques used for faster sequence alignment today is Dynamic Programming (DP). DP is both a mathematical optimization method and a computer programming method. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have optimal substructure.

Sequence alignment can have the following main applications: Given a new sequence, predict its function based on similarity to another sequence, find important molecular regions, determine the evolutionary constraints at work, find mutations in a population or family of genes, network biology gives us functional information on genes/proteins, analysis of mutant’s links unknown genes to diseases, etc.

If we compare two sequences, it is known as pairwise sequence alignment. If we compare more than two sequences, it is known as multiple sequence alignment. In general, there are two types of sequence alignment: global and local alignments. Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

Below is the code to compare two sequences using global and local algorithms with the PyDNA library.

sequence_1 = “ACGGGT”
sequence_2 = “ACG”
print(“Needleman-Wunsch Global Algorithm”)
PyDNA..needleman_wunsch_global(sequence_1, sequence_2)
print(“Smith-Waterman Local Algorithm”)
PyDNA.smith_waterman_local(sequence_1, sequence_2)

Results:

Needleman-Wunsch Global Algorithm
score: 15
ACGGGT
AC G
AC — G-
Smith-Waterman Local Algorithm
score: 30
ACG

20. Biopython Library Overview

The most popular Python library used in Computational Biology and Bioinformatics is Biopython . The Biopython project is an open-source collection of non-commercial Python tools for Computational Biology and Bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython’s capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and Machine Learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in Computational Biology.

To compare, let’s use this library to align the sequences shown above. As you can see from the program below the Bio library has been imported.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment
sequence_1 = "ACGGGT"
sequence_2 = "ACG"
print("Needleman-Wunsch Global Algorithm")
alignments = pairwise2.align.globalxx(sequence_1, sequence_2)
for item in alignments:
print(format_alignment(*item))
print("Smith-Waterman Local Algorithm")
alignments = pairwise2.align.localxx(sequence_1, sequence_2)
for item in alignments:
print(format_alignment(*item))

Results:

Needleman-Wunsch Global Algorithm
ACGGGT
|| |
AC--G-
Score=3
ACGGGT
|| |
AC-G--
Score=3
ACGGGT
|||
ACG---
Score=3
Smith-Waterman Local Algorithm
1 ACGGG
|| |
1 AC--G
Score=3
1 ACGG
|| |
1 AC-G
Score=3
1 ACG
|||
1 ACG
Score=3

Both libraries produce the same results. For our case, the best global and local alignments are AC — G- and ACG.

21. Custom PyDNA Library

During researching the applications of ML algorithms for genomics data I found many standard and generic programming procedures and sequence data preprocessing were unavailable. The Biopython library does not provide the required modern ANN and Boosting Gradient models like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Extreme Gradient Boosting (XGBoost), etc. Because of this, I decided to build a simple Python library so I could reuse it in my daily ML bioinformatics work. After a couple of months of work, this library became simple, good and large. The setup packaging for it may be available in the future. Below are the main design basics.

1. Easy to use by copying and pasting the files pydna.py and ipydna.py into any Python project.
2. Generic ML methods and logic procedures for the whole project work flow.
3. Error handling, configuration file and log messages implementation.
4. Simple maintenance and future upgrades.
5. Unit tests project implementation.

Here is an example code of the PyDNA library public interface file.

from zope.interface import Interface
class IPyDNA(Interface):
def dna_sequence_np_array(dna_sequence_string):
"""
convert a dna sequence string to numpy one-dimensional array
dna_sequence_string: dna sequence string
return: numpy one-dimensional array
"""
pass

22. DNA Sequence Classification using Machine Learning Algorithms

There is a huge demand of applying ML to genomics dataset analytics today. Many questions about it are still not very clear at all. Let me mentions some of them:

  • What DNA sequence encoder to use based on the selected ML model?
  • What ML models to use for generic genomics datasets and how to interpret it?
  • How to optimize the ML model hyperparameters?
  • How to apply Natural Language processing to DNA sequence as an unstructured text dataset?
  • How to handle high dimensional and unlabeled gene expression datasets?
  • What method to use to handle genomic datasets with imbalanced classes?
  • What metrics to use to validate the ML model with genomics datasets?
  • How to expose and consume the ML model using web services API’s call?
  • How to design and build a client side and/or desktop application to use these ML models in real research and/or production business environments?

Selection of the ML model and its hyperparameters optimization is part of the ML project flow from the steps shown below:

  • Project description and specifications.
  • Data loading.
  • Data preprocessing.
  • Data exploration and visualization.
  • Features engineering and reduction.
  • Features and labels encoding.
  • Features and labels data splitting.
  • Features and labels scaling.
  • Model selection and hyperparameters optimization.
  • Model cross validation.
  • Model prediction.
  • Model performance metrics analysis.
  • Model production deployment using Web APIs IT application integration.
  • Model retraining schedule and redeployment business intelligence and decisions making.

To define this model for specific genomics dataset is an important decision for Bioinformaticians, Biologists and Clinicians. In many use cases this selection is defined, for example, based on the uniformity of the DNA sequence length across the dataset. Traditional ML algorithms (Linear and Logistics Regressions, Decision Trees, Support Vector Machines, Random Forest, Boosting Algorithms, Bayesian Network, etc.) can be used with any DNA sequence length. Modern ANN algorithms like CNN and RNN require consistent DNA sequence length in the whole dataset column. The PyDNA library provides a simple function to determine if the selected DNA sequence string contains uniform length or not.

dna_is_same_length = PyDNA.dna_sequence_is_equal_length(X)   
if dna_is_same_length == False:
print("DNA sequence length validation")

Let’s look at the first use case. Suppose we need to build a classification model that can predict a gene family based on a human DNA sequence dataset. Genes are categorized into families based on shared nucleotide or protein sequences. A gene family is a set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions. This case will use a ‘human_data.txt’ file that can be downloaded from GitHub website (https://github.com/nageshsinghc4/DNA-Sequence-Machine-learning/blob/master/human_data.txt)

The information method of the pandas DataFrame will provide us a complete dataset description. It contains two columns ‘sequence’ and ‘class’ with 4,380 rows.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4380 entries, 0 to 4379
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sequence 4380 non-null object
1 class 4380 non-null int64
dtypes: int64(1), object(1)
memory usage: 68.6+ KB
None

Dataset example:

sequence   class
0 ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCA... 4
1 ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG... 4
2 ATGTGTGGCATTTGGGCGCTGTTTGGCAGTGATGATTGCCTTTCTG... 3
3 ATGTGTGGCATTTGGGCGCTGTTTGGCAGTGATGATTGCCTTTCTG... 3
[4380 rows x 2 columns]

As you can see this dataset contains two data columns. The ‘sequence’ column is the DNA sequences string and a ‘class’ column that contain seven possible gene family labels shown in the table 1 below.

Table 1. Gene family name and count.

Figure 1 shows a bar chat of the class labels. We got an imbalanced classes situation with this dataset. In general, before applying ML algorithms, these classes need to be an oversampling or undersampling. Let’s find out if we need to apply any of these techniques for our classification model.

Figure 1. Human class labels bat chart.

Below are y class label examples.

y     class
0 4
1 4
2 3
3 3

Here is the final X feature bag of words and class column examples.

 X           class
(0, 52803) 1
(0, 207969) 1
(0, 136621) 1
(0, 79202) 1

Let’s check for DNA sequence uniform length.

X = PyDNA.select_df_column(df_dna, “sequence”)
dna_is_same_length = PyDNA.dna_sequence_is_equal_length(X)
print(dna_is_same_length)
False

The result is “False”, so the DNA sequence length is not uniform across the column. In this case, traditional ML algorithms will be used.

Based on the PyDNA library the complete code of the ML algorithms is show below. I have commented on every line of code for anyone to understand how this program works.

import sys
import time
import os
os.system("cls")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
from pydna import PyDNA
import warnings
warnings.filterwarnings('ignore')
def get_program_running(start_time):
end_time = time. process_time()
diff_time = end_time - start_time
result = time.strftime("%H:%M:%S", time.gmtime(diff_time))
print("program runtime: {}".format(result))
def main():
# text file path and name. this path to be defined in the project config file
human_data_txt = r"folder_path\human_data.txt"

# data load
df_dna = PyDNA.pandas_read_data("TXT", human_data_txt, None)
# select y label
y = PyDNA.select_y_label(df_dna, "class")

# generate a k-mer of words data frame column
df_dna = PyDNA.create_dataframe_column_words(df_dna, "sequence", "words")

# show y label imbalanced classes plot
PyDNA.imbalanced_classes_plot(y, True, "class", "Gene Family Class", "Count", "Human Gene Family Classes")

# generate X feature bag of works
X = PyDNA.bag_of_word_series(df_dna["words"], 4)

# data split in train, valid and test (80%/10%/10%)
X_train, y_train, X_valid, y_valid, X_test, y_test = PyDNA.train_validation_test_split(X, y, True, test_size=0.2, valid_size=0.5)

# create machine learning model and optimize it's hyperparameters
ml_model, ml_model_hyperparameter = PyDNA.create_ml_model("MultinomialNB", X_train, y_train)
print(ml_model_hyperparameter)

# get y predicted valid
y_predicted_valid = PyDNA.ml_model_predict(ml_model, X_valid)

# calculate valid classification metrics
accuracy_score_value, precision_value, recall_value, f1_score_value, confusion_matrix_value, classification_report_value = PyDNA.calculate_classification_metrics(y_valid, y_predicted_valid)
print("valid accuracy score:\n{}\n".format(accuracy_score_value))
print("valid precision:\n{}\n".format(precision_value))
print("valid recall:\n{}\n".format(recall_value))
print("valid f1 score:\n{}\n".format(f1_score_value))
print("valid confusion matrix:\n{}\n".format(confusion_matrix_value))
print("valid classification report:\n{}\n".format(classification_report_value))

# get y predicted test
y_predicted_test = PyDNA.ml_model_predict(ml_model, X_test)

# calculate test classification metrics
accuracy_score_value, precision_value, recall_value, f1_score_value, confusion_matrix_value, classification_report_value = PyDNA.calculate_classification_metrics(y_test, y_predicted_test)
print("test accuracy score:\n{}\n".format(accuracy_score_value))
print("test precision:\n{}\n".format(precision_value))
print("test recall:\n{}\n".format(recall_value))
print("test f1 score:\n{}\n".format(f1_score_value))
print("test confusion matrix:\n{}\n".format(confusion_matrix_value))
print("test classification report:\n{}\n".format(classification_report_value))
if __name__ == '__main__':
start_time = time. process_time()
main()
get_program_running(start_time)

Below are the program results using the Multinomial Naive Bayes classifier model. The results show six main calculated metrics used in classification models validation: accuracy score, precision, recall, f1 score, confusion matrix and classification report. The 97.7% of test accuracy score is an excellent result for our genomic dataset. Looking at the validation and test accuracy score values we can ensure that model overfitting/underfitting issue is not our case. For these reasons there is no need to apply any imbalanced classes methods for our dataset.

validation accuracy score:
98.858
validation precision:
98.872
validation recall:
98.858
validation f1 score:
98.86
validation confusion matrix:
[[ 53 0 0 0 0 0 0]
[ 0 52 0 0 0 1 0]
[ 0 0 35 0 0 0 0]
[ 0 0 0 66 1 0 0]
[ 1 0 0 0 69 0 1]
[ 0 0 0 0 0 24 0]
[ 0 0 0 0 1 0 134]]
validation classification report:
precision recall f1-score support
0 0.98 1.00 0.99 53
1 1.00 0.98 0.99 53
2 1.00 1.00 1.00 35
3 1.00 0.99 0.99 67
4 0.97 0.97 0.97 71
5 0.96 1.00 0.98 24
6 0.99 0.99 0.99 135
accuracy 0.99 438
macro avg 0.99 0.99 0.99 438
weighted avg 0.99 0.99 0.99 438
test accuracy score:
97.717
test precision:
97.8
test recall:
97.717
test f1 score:
97.732
test confusion matrix:
[[ 50 0 0 0 3 0 0]
[ 0 53 0 0 0 0 1]
[ 0 0 35 0 0 0 0]
[ 0 0 0 67 0 0 0]
[ 0 1 0 0 70 0 0]
[ 0 0 0 0 0 23 1]
[ 0 0 0 0 1 3 130]]
test classification report:
precision recall f1-score support
0 1.00 0.94 0.97 53
1 0.98 0.98 0.98 54
2 1.00 1.00 1.00 35
3 1.00 1.00 1.00 67
4 0.95 0.99 0.97 71
5 0.88 0.96 0.92 24
6 0.98 0.97 0.98 134
accuracy 0.98 438
macro avg 0.97 0.98 0.97 438
weighted avg 0.98 0.98 0.98 438

Table 2 shows the results of applying different types of ML classification models. The best results were obtained with Multinomial Naive Bayes and Multi-layer Perceptron classifier models with more that 97% of test accuracy score. Logistic Regression and Random Forest models provided around 92% of test accuracy score. Usually, Random Forest model provides good results, practically, for many possible datasets. I would recommend starting with Random Forest model for any ML regression and classification projects. Gradient boosting models didn’t do a job for our dataset. Maybe, in general, these models are not good enough for classification of genomics datasets. I’ll be providing more information and new results about it in the next coming blog papers.

Table 2. ML classifier models results with gene family dataset.

23. Applying Convolutional Neural Networks (CNN) for DNA Sequence Classification

In our second use case, we’ll be predicting if a DNA sequence can bind to protein or not. DNA-binding proteins are proteins that have DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA. A DNA-binding domain is an independently folded protein domain that contains at least one structural motif that recognizes double- or single-stranded DNA. This is a standard functional genomic question to detect of transcription-factor binding sites in DNA sequences.

Convolutional Neural Networks (CNN) is a class of Deep Neural Networks (DLN), most commonly applied for analyzing visual imagery. They are also known as shift invariant or space Invariant Artificial Neural Networks (SIANN), based on their shared-weights architecture and translation invariance characteristics. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, brain-computer interfaces, financial time series, etc.

It’s known that CNN is used, in general, for two-dimensional (2D) and three-dimensional (3D) image convolutions analysis. With our genomic dataset a simple 1D CNN model and Keras library will be applied. An example of the DNA sequence protein bound dataset “Deep Learning Genomics Primer” can be downloaded from the following URL links:

dna_sequence = https://github.com/abidlabs/deep-learning-genomics-primer/blob/master/sequences.txt
dna_label = “https://github.com/abidlabs/deep-learning-genomics-primer/blob/master/labels.txt"

In ML interactive project projects using Jupyter Notebooks, for example, loading this data into a pandas DataFrame will take some program runtime. To avoid that, a simple generic function was developed in the PyDNA library. This function loads these two links in parallel and creates a final CSV file with DNA sequence string and label number columns. The third parameter ‘dna_sequence_protein’ is a defined name of the CSV file. The calling function is shown below.

PyDNA.GenerateCSVFileParallel(dna_sequence, dna_label, ‘’dna_sequence_protein’’)

Data loading in ML projects is a concerned task especially in the area of Big Data domain. Even a simple dataset with millions of rows will slow down the process of loading a pandas DataFrame. Applying techniques like Python asynchronous and multiprocessing programming will improve this process considerably. I’ll prepare a good blog paper for this issue in the future.

The table 3 show ten rows of the final ‘dna_sequence_protein.csv’ file. The ‘dna_label’ columns contain two binary values: 0 — DNA sequence can’t bind to protein and 1 — can bind to protein. This is a simple binary classification ML task.

Table 3. ‘dna_sequence_protein.csv’ data rows.

To see the dataset description, the pandas DataFrame information method was applied. This dataset contains 2,000 rows with two columns ‘dna_sequence’ strings and ‘dna_label’ binary numbers of 0 and 1.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dna_sequence 2000 non-null object
1 dna_label 2000 non-null int64
dtypes: int64(1), object(1)
memory usage: 46.9+ KB
None

Before applying any ANN models, the DNA sequence length needs to be checked for uniformity. The result of the code below is ‘True’, so CNN and RNN models can be applied to this genomic dataset.

X = PyDNA.select_df_column(df_genomics, “dna_sequence”)
dna_is_same_length = PyDNA.dna_sequence_is_equal_length(X)
print(dna_is_same_length)

The DNA sequences and the class labels had been encoded using one-hot encoding. As it was explained before this is one of the main standard encoding uses in CNN and RNN models. Example of the DNA sequences one-hot encoding (X features) is shown below.

[[[0. 1. 0. 0.] 
[0. 1. 0. 0.]
[0. 0. 1. 0.]
...
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]]
[[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
...
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]]
[[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
...
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]]]

Here is the class label one-hot encoding (y label).

[[1. 0.]
[1. 0.]
[1. 0.]
...
[1. 0.]
[0. 1.]
[0. 1.]]

Below is the full program code to implement the CNN model for this genomic dataset using PyDNA custom library.

import sys
import time
import os
os.system("cls")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
tf.random.set_seed(10)
import tensorflow.keras.backend as K
from tensorflow.keras.models import load_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score
from pydna import PyDNA
import warnings
warnings.filterwarnings("ignore")
def get_program_running(start_time):
end_time = time.process_time()
diff_time = end_time - start_time
result = time.strftime("%H:%M:%S", time.gmtime(diff_time))
print("program runtime: {}".format(result))

def main():
# csv file path and name
csv_path_file = r"csv_file_path/name.csv"
# data frame load
df_genomics = PyDNA.pandas_read_data("CSV", csv_path_file, None)
# remove rows and columns with missing values.
df_genomics.dropna(how="all", inplace=True)
# select X features
X = PyDNA.select_df_column(df_genomics, "dna_sequence")
# check if dna sequences have the same legnth - part of dna sequence preprocessing!
dna_is_same_length = PyDNA.dna_sequence_is_equal_length(X)
if dna_is_same_length == False:
exit()
# X features one-hot encoder
X = PyDNA.cnn_X_onehot_encoder(X)
# select y label
y = PyDNA.select_df_column(df_genomics, "dna_label")
# show y label imbalanced classes plot
PyDNA.imbalanced_classes_plot(y, True, "dna_label", "DNA bind to protein class", "Count", "DNA Sequence Protein Classes")
# y label one-hot encoder
y = PyDNA.cnn_y_onehot_encoder(y)
# data split in train, valid and test (80%/10%/10%)
X_train, y_train, X_valid, y_valid, X_test, y_test = PyDNA.train_validation_test_split(X, y, test_size=0.2, valid_size=0.5)
# create cnn model and get loss/metrics values history
epochs = 50
data_split = 0.2
cnn_model, cnn_history = PyDNA.create_cnn_model(y_train, X_train, epochs, data_split)
# plot cnn model loss
font_size = 8
PyDNA.cnn_model_loss_plot(cnn_history, font_size, "CNN Model Loss", "Epoch", "Loss", ["Train", "Validation"])
# plot cnn model accuracy
PyDNA.cnn_model_accuracy_plot(cnn_history, font_size, "CNN Model Accuracy", "Epoch", "Accuracy", ["Train", "Validation"])
print("Model Validation")
# get y_predicted valid
y_predicted = cnn_model.predict(X_valid)
# get max indices of the maximum values along an axis
y_val_max = PyDNA.get_max_nparray(y_valid)
y_predicted_max = PyDNA.get_max_nparray(y_predicted)
# calculate valid classification metrics
accuracy_score_value, precision_value, recall_value, f1_score_value, confusion_matrix_value, classification_report_value = PyDNA.calculate_classification_metrics(y_val_max, y_predicted_max)
print("valid accuracy score:\n{}\n".format(accuracy_score_value))
print("valid precision:\n{}\n".format(precision_value))
print("valid recall:\n{}\n".format(recall_value))
print("valid f1 score:\n{}\n".format(f1_score_value))
print("valid confusion matrix:\n{}\n".format(confusion_matrix_value))
print("valid classification report:\n{}\n".format(classification_report_value))
print("Model Test")
# get y_predicted test
y_predicted = cnn_model.predict(X_test)

# get max indices of the maximum values along an axis
y_test_max = PyDNA.get_max_nparray(y_test)
y_predicted_max = PyDNA.get_max_nparray(y_predicted)
# calculate test classification metrics
accuracy_score_value, precision_value, recall_value, f1_score_value, confusion_matrix_value, classification_report_value = PyDNA.calculate_classification_metrics(y_test_max, y_predicted_max)
print("test accuracy score:\n{}\n".format(accuracy_score_value))
print("test precision:\n{}\n".format(precision_value))
print("test recall:\n{}\n".format(recall_value))
print("test f1 score:\n{}\n".format(f1_score_value))
print("test confusion matrix:\n{}\n".format(confusion_matrix_value))
print("test classification report:\n{}\n".format(classification_report_value))

# save cnn model h5
cnn_model_path = r"cnn_model_path "
cnn_model_name = "cnn_model_name"
PyDNA.cnn_model_save_h5(cnn_model, cnn_model_path, cnn_model_name)

# load cnn model h5
cnn_model = PyDNA.cnn_model_load_h5(cnn_model_path, cnn_model_name)
print(cnn_model)
if __name__ == '__main__':
start_time = time.process_time()
main()
get_program_running(start_time)

One of first things to do in ML classification projects is to check for imbalanced classes as we did with the previous gene family dataset. Figure 2 shows a bar chat of the DNA binary class labels. It’s obvious that both class labels are in a good balance.

Figure 2. DNA binary class labels bar chart.

The results of the program are shown below. The 97.0 % of the test accuracy score is a very good one. This CNN model can be applied to classify a DNA sequence that can bind to protein or not.

Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_1 (Conv1D) (None, 39, 32) 1568
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 9, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 288) 0
_________________________________________________________________
dense_1 (Dense) (None, 16) 4624
_________________________________________________________________
dense_2 (Dense) (None, 2) 34
=================================================================
Total params: 6,226
Trainable params: 6,226
Non-trainable params: 0
_________________________________________________________________
validation accuracy score:
97.5
validation precision:
97.544
validation recall:
97.5
validation f1 score:
97.5
validation confusion matrix:
[[98 4]
[ 1 97]]
validation classification report:
precision recall f1-score support
0 0.99 0.96 0.98 102
1 0.96 0.99 0.97 98
accuracy 0.97 200
macro avg 0.98 0.98 0.97 200
weighted avg 0.98 0.97 0.98 200
test accuracy score:
97.0
test precision:
97.019
test recall:
97.0
test f1 score:
97.0
test confusion matrix:
[[97 4]
[ 2 97]]
test classification report:
precision recall f1-score support
0 0.98 0.96 0.97 101
1 0.96 0.98 0.97 99
accuracy 0.97 200
macro avg 0.97 0.97 0.97 200
weighted avg 0.97 0.97 0.97 200

It’s a good practice when using CNN in ML to visualize and validate the loss and accuracy plots of train/validation sets based on epoch iterations. The number of epochs is a model hyperparameter that defines the number times that the learning algorithm will work through the entire training dataset. Figure 3 show CNN model train/validation loss plot (curve). Once the loss for the validation set stops improving or gets worse throughout the learning cycles, it is time to stop training because the model has already converged and may be just overfitting. In our case the number of epochs should be between 40 and 50. This number shouldn’t be smaller than 40 to prevent model underfitting. So, the selection of the number of epochs is very important to guarantee CNN model accuracy score.

Figure 3. CNN model train/validation loss plot.

The figure 4 shows CNN model train/validation accuracy plot. As you can see, after 30 epochs the validation set accuracy is stable. So, combining these two plots we can conclude that selecting the number of epochs between 40 and 50 is a good solution for this specific genomics dataset.

Figure 4. CNN model train/validation accuracy plot.

24. Applying Long Short-Term Memory Networks (LSTM) for DNA Sequence Classification

A Recurrent Neural Network (RNN) is a class of Artificial Neural Networks (ANN) where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

Long Short-Term Memory (LSTM) is RNN architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition and anomaly detection in network traffic or IDSs.

LSTMs are capable of learning long-term dependencies with memorize many previous steps in a sequence providing that enough data is available. LSTM holds promise for any sequential processing task in which a possible hierarchical decomposition may exist, but do not know, in advance, what that decomposition is. If we define a DNA as a sequence with a long-memory that is manifested via the long-range correlations along the sequence, why not apply LSTM algorithm for DNA sequence classification.

To run the LSTM algorithm the following changes are required in the previous program using CNN.

# create lstm model and get loss/metrics values history
epoch = 50
data_split = 0.2
lstm_model, lstm_history = PyDNA.create_lstm_model(y_train, X_train, epoch, data_split)
# plot lstm model loss
font_size = 8
PyDNA. lstm_model_loss_plot(lstm_history, font_size, "LSTM Model Loss", "Epoch", "Loss", ["Train", "Validation"])
# plot lstm model accuracy
PyDNA.lstm_model_accuracy_plot(lstm_history, font_size, "LSTM Model Accuracy", "Epoch", "Accuracy", ["Train", "Validation"])

For the same ‘dna_sequence_protein.csv’ file the results are shown below.

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 39, 32) 1568
_________________________________________________________________
lstm_1 (LSTM) (None, 39, 64) 24832
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 9, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 576) 0
_________________________________________________________________
masking_1 (Masking) (None, 576) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 36928
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 130
=================================================================
Total params: 63,458
Trainable params: 63,458
Non-trainable params: 0
_________________________________________________________________

validation accuracy score:
98.5
validation precision:
98.545
validation recall:
98.5
validation f1 score:
98.5
validation confusion matrix:
[[99 3]
[ 0 98]]
validation classification report:
precision recall f1-score support
0 1.00 0.97 0.99 102
1 0.97 1.00 0.98 98
accuracy 0.98 200
macro avg 0.99 0.99 0.98 200
weighted avg 0.99 0.98 0.99 200
test accuracy score:
99.5
test precision:
99.505
test recall:
99.5
test f1 score:
99.5
test confusion matrix:
[[100 1]
[ 0 99]]
test classification report:
precision recall f1-score support
0 1.00 0.99 1.00 101
1 0.99 1.00 0.99 99
accuracy 0.99 200
macro avg 0.99 1.00 0.99 200
weighted avg 1.00 0.99 1.00 200

In evaluating the model performance on the test data, the accuracy score obtained was 99.5%. I did run a couple more genomic datasets and LSTM algorithms provides the best results for DNA sequence classification so far. If we apply the Multi-layer Perceptron classifier algorithm MLPClassifier(), the final results can be shown below. The test accuracy score is very good as 98.0%. LSTM algorithm performances is better still.

validation accuracy score:
96.0
validation precision:
96.299
validation recall:
96.0
validation f1 score:
95.995
validation confusion matrix:
[[93 8]
[ 0 99]]
validation classification report:
precision recall f1-score support
0 1.00 0.92 0.96 101
1 0.93 1.00 0.96 99
accuracy 0.96 200
macro avg 0.96 0.96 0.96 200
weighted avg 0.96 0.96 0.96 200
test accuracy score:
98.0
test precision:
98.078
test recall:
98.0
test f1 score:
98.0
test confusion matrix:
[[98 4]
[ 0 98]]
test classification report:
precision recall f1-score support
0 1.00 0.96 0.98 102
1 0.96 1.00 0.98 98
accuracy 0.98 200
macro avg 0.98 0.98 0.98 200
weighted avg 0.98 0.98 0.98 200

The figures 5 and 6 shows LSTM model train/validation loss and accuracy plots. As we can see, selecting the number of epochs between 30 and 40 can guarantee a good LSTM model performance.

Figure 5. LSTM model train/validation loss plot.

Figure 6. LSTM model train/validation accuracy plot.

The results below were obtained with the Multinomial Naive Bayes classifier model MultinomialNB(). Well, as you can see, this classifier does not perform very well with this specific genomic dataset. This is a very good practical example to prove that every genomic dataset is unique and the best practices for any Data Scientists is to apply all traditional and modern ML algorithms for each one. Many Data Scientists thinking that Deep Learning can do everything in ML for them today. It’s very simple and easy to prove that even Random Forest and Extreme Gradient Boosting algorithms have performed better in many cases compared with Deep Learning ones including CNN and LSTM. So, always apply all of them for each specific genomic dataset and use the one with better metrics performance. I do understand this may take some time but it’s very necessary requirement for getting the best possible model.

validation accuracy score:
80.5
validation precision:
86.011
validation recall:
80.5
validation f1 score:
79.772
validation confusion matrix:
[[62 39]
[ 0 99]]
validation classification report:
precision recall f1-score support
0 1.00 0.61 0.76 101
1 0.72 1.00 0.84 99
accuracy 0.81 200
macro avg 0.86 0.81 0.80 200
weighted avg 0.86 0.81 0.80 200
test accuracy score:
79.5
test precision:
85.547
test recall:
79.5
test f1 score:
78.695
test confusion matrix:
[[61 41]
[ 0 98]]
test classification report:
precision recall f1-score support
0 1.00 0.60 0.75 102
1 0.71 1.00 0.83 98
accuracy 0.80 200
macro avg 0.85 0.80 0.79 200
weighted avg 0.86 0.80 0.79 200

25. Conclusions

  • A custom Python library PyDNA was developed for DNA/RNA/protein sequences string processing and ML classification genomic datasets.
  • To increase performance of large DNA sequences string processing, it’s recommended to use NumPy ndarrys where ever possible.
  • The following main DNA sequence string encoding were presented and analyzed: Label Encoding, One-hot Encoding and K-mer Counting.
  • Natural Language Processing bag of words algorithm was implemented for DNA sequence strings processing.
  • The uniformity of the DNA sequence length across the dataset can determine the right ML algorithm to use.
  • Multinomial Naive and Multi-layer Perceptron classifier models provide more than 97% of classification accuracy score with DNA multi-class dataset and no uniform string length.
  • Convolutional Neural Networks model can provide 97% of classification accuracy with DNA uniform sequence string length.
  • Long Short-Term Memory model provides best result with 99.5% of classification accuracy score with DNA uniform sequence strings length.
  • It’s strongly recommended to apply all traditional and modern ML classification algorithms to any genomic dataset and find out which model provides the best prediction results.

--

--

Ernest Bonat, Ph.D.

I’m a Senior Data Scientist and Engineer consultant. I work on Machine Learning application projects for Life Sciences using Python and Python Data Ecosystem.