Advanced DNA Sequences Preprocessing for Deep Learning Networks

Ernest Bonat, Ph.D.
19 min read · Mar 29, 2023


Introduction
DNA Sequence One-hot Encoder Review
DNA Sequence One-Hot Encoder Padding
DNA Sequence Dataset Cleaning
Splice-junction Gene Sequences Dataset
DNA Sequence Imbalanced Classes Problems
Advanced DNA Sequence Imbalanced Classes Solution using ETL Data Pipeline
1. ETL DNA Sequence Dataset Cleaning
2. ETL DNA Sequence Dataset Label Encoding
3. ETL DNA Sequence Dataset SMOTE
4. ETL DNA Sequence Dataset Preprocessed for Deep Learning Networks
5. ETL DNA Sequence Final Dataset Cleaning
ETL DNA Sequence Full Data Processing
Applying Convolutional Neural Networks to DNA sequence imbalanced and balanced classes datasets
Conclusions

Introduction

Deep-learning architectures such as Deep Neural Networks, Deep Belief Networks, Deep Reinforcement Learning, Recurrent Neural Networks, Convolutional Neural Networks and Transformers have been applied to fields including Computer Vision, Speech Recognition, Natural Language Processing, Machine Translation, Bioinformatics, Drug Design, Medical Image Analysis, Climate Science, Material Inspection and Board Game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

Machine Learning (ML) has demonstrated very promising results for DNA sequence analysis in clinical medicine. It has been shown to solve specific tasks in clinical genomics such as variant calling, phenotype-to-genotype mapping, genome annotation and variant classification.

In many cases, when we apply ML for DNA sequence datasets, the following problems may appear:

  • Not enough training and validation data.
  • Data cleaning and encoding requirements.
  • Imbalanced class labels.

These problems can be fixed by generating synthetic DNA sequence data. This paper proposes an Extract-Transform-Load (ETL) data pipeline process to solve the above problems. It applies DNA sequence string cleaning and validation, label encoding and the Synthetic Minority Over-sampling Technique (SMOTE) algorithm. Some of the latest best practices of Machine Learning algorithms applied in genomics Life Sciences have been published on Medium.com in the paper “Machine Learning Applications in Genomics Life Sciences” by Ernest Bonat, Ph.D.

DNA Sequence One-hot Encoder Review

For Deep Learning Networks (DLN), DNA sequences should be one-hot encoded, as explained in the “Apply Machine Learning Algorithms for Genomics Data Classification” paper. Let’s review and analyze the one-hot encoder for a DNA sequence containing the four main nucleotides A, C, G and T. The dna_sequence_one_hot_encoder() function will be used. This function was developed using the scikit-learn OneHotEncoder() class.

dna_sequence = "ACGT"
dna_onehot_encoder = PyDNA.dna_sequence_one_hot_encoder(dna_sequence)

Results:

DNA sequence string:
ACGT
DNA sequence list:
['A', 'C', 'G', 'T']
DNA sequence label encoder:
[0 1 2 3]
DNA sequence label encoder reshape:
[[0 1 2 3]]
DNA sequence one-hot encoder:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
DNA sequence one-hot encoder shape:
(4, 4)
Number of rows:
4
Number of columns:
4

As you can see the number of rows represents the size (length) of the DNA sequence and number of columns determines the number of unique base nucleotides. The number of columns should be four. This is a requirement for DLN models’ development with DNA sequence one-hot encoder.

Let’s try a DNA sequence with the size of 20 base nucleotides.

dna_sequence = "ATGATCGCATAGATGACTAG"
dna_onehot_encoder = PyDNA.dna_sequence_one_hot_encoder(dna_sequence)

Results:

DNA sequence string:
ATGATCGCATAGATGACTAG
DNA sequence list:
['A', 'T', 'G', 'A', 'T', 'C', 'G', 'C', 'A', 'T', 'A', 'G', 'A', 'T', 'G',
'A', 'C', 'T', 'A', 'G']
DNA sequence label encoder:
[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2]
DNA sequence label encoder reshape:
[[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2]]
DNA sequence one-hot encoder:
[[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]]
DNA sequence one-hot encoder shape:
(20, 4)
Number of rows:
20
Number of columns:
4

We get 20 rows and 4 columns as it should be.

In the DNA sequence below, the G base nucleotide is missing.

DNA sequence list:
['A', 'T', 'C', 'A', 'T', 'C', 'C', 'C', 'A', 'T', 'A', 'C', 'A', 'T', 'C',
'A', 'C', 'T', 'A', 'C']
DNA sequence label encoder:
[0 2 1 0 2 1 1 1 0 2 0 1 0 2 1 0 1 2 0 1]
DNA sequence label encoder reshape:
[[0 2 1 0 2 1 1 1 0 2 0 1 0 2 1 0 1 2 0 1]]
DNA sequence one-hot encoder:
[[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
DNA sequence one-hot encoder shape:
(20, 3)
Number of rows:
20
Number of columns:
3

As you can see, the number of columns decreased to three. In a DNA sequence dataset, this row will not work with one-hot encoding for ML algorithms. For DLN in particular, the data must be vectorized to perform the required matrix operations efficiently, which means the number of rows and columns must be the same for every DNA sequence in the dataset. For example, if a base nucleotide is missing from one sequence, its number of columns differs from the others and the following error occurs when creating the final input one-hot encoder matrix.

[File Name]: \anaconda3\envs\python3.10.9\lib\site-packages\numpy\core\shape_base.py [Procedure Name]: stack [Error Message]: all input arrays must have the same shape [Error Type]: <class ‘ValueError’> [Line Number]: 426 [Line Code]: raise ValueError(‘all input arrays must have the same shape’)

For these use cases we can remove the row (and lose the data) or use some kind of sequence padding. With either solution, the selected ML model(s) must be tested carefully with real production datasets.
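The column count drops because the encoder learns its categories from each sequence separately. As a minimal sketch of how the width could be held at four, the plain-Python helper below encodes against a fixed alphabet. The name one_hot_fixed_alphabet() and the NUCLEOTIDES constant are hypothetical and not part of PyDNA; the scikit-learn OneHotEncoder() class accepts a categories argument that serves a similar purpose.

```python
# Hypothetical helper (not part of PyDNA): one-hot encode against a fixed
# alphabet so every sequence yields exactly four columns.
NUCLEOTIDES = "ACGT"


def one_hot_fixed_alphabet(dna_sequence):
    """Return one row per base, each row with len(NUCLEOTIDES) columns."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in dna_sequence:
        if base not in index:
            raise ValueError("unexpected nucleotide: {!r}".format(base))
        row = [0.0] * len(NUCLEOTIDES)
        row[index[base]] = 1.0
        rows.append(row)
    return rows


# The 20-base sequence without G from the example above.
encoded = one_hot_fixed_alphabet("ATCATCCCATACATCACTAC")
print(len(encoded), len(encoded[0]))
```

With this approach the G column is still present and simply stays at zero for the whole sequence. Whether a model should be trained on such rows is a separate decision.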

DNA Sequence One-Hot Encoder Padding

Let’s look at the DNA sequence padding solutions. The blog paper “Handling variable size DNA inputs” provides the basics of sequence padding using the Keras library (https://keras.io/). A good introduction to the pad_sequences() function is given in the “Keras pad_sequences” paper.

Suppose we have two DNA sequences with different sizes. Here is the first one.

1. DNA sequence string: 
ATGATCGCATAGATGACTAGT
DNA sequence list:
['A', 'T', 'G', 'A', 'T', 'C', 'G', 'C', 'A', 'T', 'A', 'G', 'A', 'T', 'G',
'A', 'C', 'T', 'A', 'G', 'T']
DNA sequence label encoder:
[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2 3]
DNA sequence label encoder reshape:
[[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2 3]]
DNA sequence one-hot encoder (dna_one_hot_encoder1):
[[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
DNA sequence one-hot encoder shape:
(21, 4)
Number of rows:
21
Number of columns:
4

And the second one.

2. DNA sequence string:
ATGATCGCATAGATGACTAGTAGAT
DNA sequence list:
['A', 'T', 'G', 'A', 'T', 'C', 'G', 'C', 'A', 'T', 'A', 'G', 'A', 'T', 'G',
'A', 'C', 'T', 'A', 'G', 'T', 'A', 'G', 'A', 'T']
DNA sequence label encoder:
[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2 3 0 2 0 3]
DNA sequence label encoder reshape:
[[0 3 2 0 3 1 2 1 0 3 0 2 0 3 2 0 1 3 0 2 3 0 2 0 3]]
DNA sequence one-hot encoder (dna_one_hot_encoder2):
[[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
DNA sequence one-hot encoder shape:
(25, 4)
Number of rows:
25
Number of columns:
4

The first DNA sequence has shape (21, 4) and the second (25, 4). We’ll need to use some kind of padding for the first one to add four more rows to it. To do that the Keras pad_sequences() function shown below will be used.

@staticmethod
def dna_onehot_encoder_padding(dna_onehot_encoded_list, data_type, padding_type, padding_value):
    # pad every one-hot encoded sequence to the length of the longest one
    dna_padding = None
    try:
        dna_padding = tf.keras.preprocessing.sequence.pad_sequences(sequences=dna_onehot_encoded_list,
            dtype=data_type, padding=padding_type, value=padding_value)
    except Exception:
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log:
            PyDNA.write_log_file("error", PyDNA.get_exception_info())
    return dna_padding

In the program below, the list of the DNA sequences one-hot encoder is passed to PyDNA.dna_onehot_encoder_padding() to pad the first one.

dna_onehot_encoded_list = [dna_one_hot_encoder1, dna_one_hot_encoder2]
for item in dna_onehot_encoded_list:
    print("DNA one-hot encoder:\n{}".format(item))
dna_padding = PyDNA.dna_onehot_encoder_padding(dna_onehot_encoded_list, "float32", "pre", 0)
for padded in dna_padding:
    print(padded.shape)
    print("DNA one-hot encoder padded:\n{}".format(padded))

Let’s look at the program results for the first DNA sequence encoder dna_one_hot_encoder1 before padding.

DNA one-hot encoder:
[[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
shape: (21, 4)

Here are the results after the padding.

DNA one-hot encoder padded:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
shape: (25, 4)

As you can see, the following four rows with zero values have been added to the beginning of the numpy array encoder.

[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]

If the padding value changes to 1, the first four rows will be:

[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]

It’s important to mention that if any of these DNA sequence one-hot encoded datasets have a different number of columns this sequence padding function will not work. For this case the following error will occur.

[File Name]: C:\Users\portland\anaconda3\envs\python3.10.9\lib\site-packages\keras\utils\data_utils.py [Procedure Name]: pad_sequences [Error Message]: Shape of sample (3,) of sequence at position 1 is different from expected shape (4,) [Error Type]: <class ‘ValueError’> [Line Number]: 1084 [Line Code]: raise ValueError(Shape of sample (3,) of sequence at position 1 is different from expected shape (4,))

Now that all DNA sequence one-hot encoders have the same shape, any DLN algorithm can be applied to develop the final predictive models. These models must be tested with real, known production data. I have decided to write another blog paper about it in the near future.
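To make the “pre” padding behaviour concrete, here is a small plain-Python sketch of the same idea: constant rows are prepended until every encoder has as many rows as the longest one, and a column mismatch is rejected. The function pad_one_hot_pre() is a hypothetical illustration and not the Keras implementation.

```python
# Plain-Python illustration of "pre" padding; names are hypothetical.
def pad_one_hot_pre(encoded_list, padding_value=0.0):
    """Left-pad each encoded sequence with constant rows up to the longest length."""
    max_rows = max(len(encoded) for encoded in encoded_list)
    n_cols = len(encoded_list[0][0])
    padded_list = []
    for encoded in encoded_list:
        # every row of every sequence must have the same number of columns
        if any(len(row) != n_cols for row in encoded):
            raise ValueError("all sequences must have the same number of columns")
        filler = [[padding_value] * n_cols for _ in range(max_rows - len(encoded))]
        padded_list.append(filler + [list(row) for row in encoded])
    return padded_list


short = [[1.0, 0.0, 0.0, 0.0]] * 21   # stands in for a (21, 4) encoder
long_ = [[0.0, 0.0, 0.0, 1.0]] * 25   # stands in for a (25, 4) encoder
padded = pad_one_hot_pre([short, long_])
print([(len(p), len(p[0])) for p in padded])
```

Both results have 25 rows and 4 columns, and the first four rows of the shorter one hold only the padding value.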

DNA Sequence Dataset Cleaning

During ML development, data cleaning (preprocessing) is about 60% to 70% of the whole work. Data Engineers know that very well. For ML DNA sequence projects using classification, regression and clustering, the data cleaning process is very important and required. For DNA sequence validation there are two main requirements:

1. The DNA sequence string must only contain the main four nucleotide letters A, C, G and T. Any other letter in the sequence string will generate a new column when converted to a one-hot encoder.

2. The DNA sequence string must contain all four main nucleotide letters A, C, G and T. Any missing letter in the sequence string will decrease the number of columns when it is converted to a one-hot encoder.

Let’s look at some use cases to verify these two requirements. In the code results below, we can see a valid DNA sequence that contains all the main four nucleotides. The one-hot encoder shape has 60 rows and 4 columns.

DNA sequence string:
AACTTCTCCAACGACATCATGCTACTGCAGGTCAGGCACACTCCTGCCACTCTTGCTCTT
DNA sequence validation result:
True
DNA sequence one-hot encoder shape:
(60, 4)

In the second set of results shown below, we can see the DNA sequence contains a new letter ‘N’. Because of that, the validation fails. The one-hot encoder shape has increased by one column.

DNA sequence string:
AACTTCTCCAACGACATCATGCTACTGCAGGNCAGGCACACTCCTGCCACTCTTGCTCTT
DNA sequence validation result:
False
DNA sequence one-hot encoder shape:
(60, 5)

In the third code results below, we can see the DNA sequence is missing the base nucleotide letter ‘G’. The validation fails because of this. The one-hot encoder shape has decreased by one column.

DNA sequence string:
AATCTTCCCAACCCCTCTCTCTTACTTTCTAATCTATCATCTACTCATCTATCCTCACTT
DNA sequence validation result:
False
DNA sequence one-hot encoder shape:
(60, 3)

For the second and third above cases the DNA sequence rows must be updated or removed. It’s up to the ML project management team to make these decisions.
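The two validation requirements can be expressed as a single set comparison. The sketch below is a hypothetical validator written for illustration; the author’s PyDNA.is_dna() may be implemented differently. It is case-sensitive, so lowercase letters would be rejected.

```python
# Hypothetical validator; PyDNA.is_dna() may be implemented differently.
REQUIRED_NUCLEOTIDES = frozenset("ACGT")


def is_valid_dna(dna_sequence):
    """True only if the sequence uses exactly the letters A, C, G and T."""
    return set(dna_sequence) == REQUIRED_NUCLEOTIDES


# The three 60-base sequences from the use cases above.
print(is_valid_dna("AACTTCTCCAACGACATCATGCTACTGCAGGTCAGGCACACTCCTGCCACTCTTGCTCTT"))
print(is_valid_dna("AACTTCTCCAACGACATCATGCTACTGCAGGNCAGGCACACTCCTGCCACTCTTGCTCTT"))
print(is_valid_dna("AATCTTCCCAACCCCTCTCTCTTACTTTCTAATCTATCATCTACTCATCTATCCTCACTT"))
```

The set comparison covers both rules at once: an extra letter such as ‘N’ makes the set larger than the required one, and a missing letter makes it smaller.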

Splice-junction Gene Sequences Dataset

The splice-junction gene sequences dataset contains information about splice junctions taken from GenBank 64.1. According to the task description, the regions of a gene that are removed during RNA processing are called introns, while the regions that are used to generate mRNA are called exons. The junctions between them are called splice junctions: points on a DNA sequence at which ‘superfluous’ DNA is removed during the process of protein creation in higher organisms. There are two kinds of splice junctions: exon-intron junctions and intron-exon junctions. Each DNA sequence in this dataset is 60 base nucleotides long and belongs to one of three classes: “EI” (exon-intron junction), “IE” (intron-exon junction) and “N” (neither EI nor IE). There are 767 sequences with the EI label, 768 with the IE label, and 1655 with the N label. The task is to recognize, given a DNA sequence, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

DNA Sequence Imbalanced Classes Problems

Dealing with the real-world problems of imbalanced classes datasets using today’s ML is a simple task for numerical feature data types. In the paper “RNA-Seq Gene Expression Classification Using Machine Learning Algorithms” I used the Synthetic Minority Oversampling Technique (SMOTE) for the RNA-Seq (HiSeq) Pan-Cancer Atlas dataset. This was implemented with the following three lines of Python code.

from imblearn.over_sampling import SMOTE

# n_jobs is omitted here: the argument was deprecated in recent imbalanced-learn releases
smote_over_sampling = SMOTE(random_state=50)
X, y = smote_over_sampling.fit_resample(X, y)

Where X is the matrix of the features values and y is the matrix (or one-dimensional vector) of the target (label) values. That’s all so far, good and simple!

In our genomics datasets the DNA sequences are categorical (text) data. SMOTE, like most imbalanced classes algorithms, works with numerical data only, so the DNA sequence dataset must first be validated (cleaned up) and preprocessed (encoded). In my paper “Apply Machine Learning Algorithms for Genomics Data Classification”, I covered three main DNA sequence string encodings: Label Encoding, One-Hot Encoding and K-mer Counting. In this paper, I’ll be using label encoding. For our four base nucleotides “ACGT” the default label encoding will be [0, 1, 2, 3]. For the original splice-junction dataset the imbalanced DNA classes boundaries plot is shown below.

Advanced DNA Sequence Imbalanced Classes Solution using ETL Data Pipeline

For DNA sequence validation and encoding of the splice-junction dataset, I’ll be using a very popular data pipeline process called Extract-Transform-Load (ETL), where ‘Extract’ is data extraction from one or many sources, ‘Transform’ is data transformation (preprocessing or cleansing) and ‘Load’ is data loading to one or many destinations.

“ETL is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s). The ETL process became a popular concept in the 1970s and is often used in data warehousing. Data extraction involves extracting data from homogeneous or heterogeneous sources; data transformation processes data by data cleansing and transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.” Here is a simple diagram of an ETL process.

In our case, the final product of this ETL process is a validated and preprocessed CSV file. This file can be used with any supervised ML classification algorithm, including DLN.

The following new ETL DNA sequence algorithm was designed, developed and tested with many genomics datasets.

1. ETL DNA Sequence Dataset Cleaning

For this initial ETL process, open the original dataset file ‘splice_junction_dna_sequence_original.csv’, apply the DNA sequence cleaning validation and generate a cleanup file ‘splice_junction_dna_sequence_cleanup.csv’. Below is the first function ‘etl_dna_sequence_cleanup()’.

import pandas as pd  # PyDNA is the author's helper library and must be importable as well

def etl_dna_sequence_cleanup(dna_sequence_original_csv_path, dna_sequence_cleanup_csv_path, dna_sequence_columns_list):
    """splice_junction_dna_sequence original data preprocessing to create a
    new splice_junction_dna_sequence cleanup data csv file
    args:
        dna_sequence_original_csv_path (string): dna sequence original csv path
        dna_sequence_cleanup_csv_path (string): dna sequence cleanup csv path
        dna_sequence_columns_list (list): list example ["dna_class", "dna_sequence"]
    return:
        bool: True if the cleanup csv file was created, False otherwise
    """
    result = False
    try:
        df_genomics_original = PyDNA.pandas_read_data("CSV", dna_sequence_original_csv_path, None)
        df_genomics_original.dropna(how="all", inplace=True)
        print("DNA sequence original data frame shape:\n{} ".format(df_genomics_original.shape))
        dna_class_list = []
        dna_sequence_list = []
        for row in df_genomics_original.itertuples():
            dna_class = row.dna_class
            dna_sequence = row.dna_sequence
            # keep only the rows whose sequence passes the DNA validation
            if PyDNA.is_dna(dna_sequence):
                dna_class_list.append(dna_class)
                dna_sequence_list.append(dna_sequence)
        df_genomics_cleanup = pd.DataFrame(list(zip(dna_class_list, dna_sequence_list)), columns=dna_sequence_columns_list)
        print("DNA sequence cleanup data frame shape:\n{}".format(df_genomics_cleanup.shape))
        total_remove_rows = df_genomics_original.shape[0] - df_genomics_cleanup.shape[0]
        print("DNA sequence original total remove rows:\n{}".format(total_remove_rows))
        df_genomics_cleanup.to_csv(path_or_buf=dna_sequence_cleanup_csv_path, index=False)
        print("DNA sequence cleanup csv file created:\n{}".format(dna_sequence_cleanup_csv_path))
        result = True
    except Exception:
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log:
            PyDNA.write_log_file("error", PyDNA.get_exception_info())
    return result

Passing parameters and calling function code.

csv_path_folder = r"\csv_folder_path"
dna_sequence_original_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_original.csv")
dna_sequence_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_cleanup.csv")
dna_sequence_columns_list = ["dna_class", "dna_sequence"]
result = etl_dna_sequence_cleanup(dna_sequence_original_csv_path, dna_sequence_cleanup_csv_path, dna_sequence_columns_list)
print(result)

Results:

DNA sequence original data frame shape:
(3190, 2)
DNA sequence cleanup data frame shape:
(3169, 2)
DNA sequence original total remove rows:
21
DNA sequence cleanup csv file created:
\csv_folder_path\splice_junction_dna_sequence_cleanup.csv
True

As you can see, the first ETL process removes 21 rows from the original DNA sequence file. Here is the new DNA sequence imbalanced classes boundaries plot.

For the ‘EI’ class 5 rows were removed, 3 rows for the ‘IE’ class and 13 rows for the ‘N’ class. Very good first ETL cleaning process.

2. ETL DNA Sequence Dataset Label Encoding

The second ETL process uses the function ‘etl_dna_sequence_label_encoder()’ to generate the file ‘splice_junction_dna_sequence_labelencoder.csv’. This function converts each DNA sequence string row to an ordinal label encoding using the PyDNA.data_frame_label_encoder(dna_numpy_array) API method provided in the “Apply Machine Learning Algorithms for Genomics Data Classification” blog paper.
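The body of this function is not reproduced in this paper. As a rough sketch of the row transformation it performs, assuming the default mapping A=0, C=1, G=2, T=3, the hypothetical helpers below turn one class label and one sequence string into a CSV row with one column per base.

```python
# Hypothetical sketch; the author's etl_dna_sequence_label_encoder() is not shown.
NUCLEOTIDE_TO_LABEL = {"A": 0, "C": 1, "G": 2, "T": 3}


def label_encode_sequence(dna_sequence):
    """Map each base of an already validated sequence to its integer label."""
    return [NUCLEOTIDE_TO_LABEL[base] for base in dna_sequence]


def label_encode_row(dna_class, dna_sequence):
    """Build one CSV row: the class label followed by one integer per base."""
    return [dna_class] + label_encode_sequence(dna_sequence)


row = label_encode_row("EI", "ACGTTG")
print(row)
```

For a 60-base sequence such a row has 61 values, one for the class and 60 for the bases.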

Passing parameters and calling function code.

csv_path_folder = r"\csv_folder_path"
dna_sequence_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_cleanup.csv")
dna_sequence_labelencoder_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder.csv")
result = etl_dna_sequence_label_encoder(dna_sequence_cleanup_csv_path, dna_sequence_labelencoder_csv_path)
print(result)

Results:

DNA sequence cleanup data frame shape:
(3169, 2)
DNA sequence label encoder data frame shape:
(3169, 61)
DNA sequence label encoder csv file created:
\csv_folder_path\splice_junction_dna_sequence_labelencoder.csv
True

Below are the first 10 rows for ‘EI’ class.

EI,1,1,0,2,1,3,2,1,0,3,1,0,1,0,2,2,0,2,2,1,1,0,2,1,2,0,2,1,0,2,2,3,1,3,2,3,3,1,1,0,0,2,2,2,1,1,3,3,1,2,0,2,1,1,0,2,3,1,3,2
EI,0,2,0,1,1,1,2,1,1,2,2,2,0,2,2,1,2,2,0,2,2,0,1,1,3,2,1,0,2,2,2,3,2,0,2,1,1,1,1,0,1,1,2,1,1,1,1,3,1,1,2,3,2,1,1,1,1,1,2,1
EI,2,0,2,2,3,2,0,0,2,2,0,1,2,3,1,1,3,3,1,1,1,1,0,2,2,0,2,1,1,2,2,3,2,0,2,0,0,2,1,2,1,0,2,3,1,2,2,2,2,2,1,0,1,2,2,2,2,0,3,2
EI,2,2,2,1,3,2,1,2,3,3,2,1,3,2,2,3,1,0,1,0,3,3,1,1,3,2,2,1,0,2,2,3,0,3,2,2,2,2,1,2,2,2,2,1,3,3,2,1,3,1,2,2,3,3,3,3,1,1,1,1
EI,2,1,3,1,0,2,1,1,1,1,1,0,2,2,3,1,0,1,1,1,0,2,2,0,0,1,3,2,0,1,2,3,2,0,2,3,2,3,1,1,1,1,0,3,1,1,1,2,2,1,1,1,3,3,2,0,1,1,1,3
EI,1,0,2,0,1,3,2,2,2,3,2,2,0,1,0,0,1,0,0,0,0,1,1,3,3,1,0,2,1,2,2,3,0,0,2,0,2,0,2,2,2,1,1,0,0,2,1,3,1,0,2,0,2,0,1,1,0,1,0,2
EI,1,1,3,3,3,2,0,2,2,0,1,0,2,1,0,1,1,0,0,2,0,0,2,3,2,3,2,1,0,2,2,3,0,1,2,3,3,1,1,1,0,1,1,3,2,1,1,1,3,2,2,3,2,2,1,1,2,1,1,0
EI,1,1,1,3,1,2,3,2,1,2,2,3,1,1,0,1,2,0,1,1,0,0,2,0,1,1,0,2,1,2,2,3,2,0,2,1,1,0,1,2,2,2,1,0,2,2,1,1,2,2,2,2,3,1,2,3,2,2,2,2
EI,3,2,2,1,2,0,1,3,0,1,2,2,1,2,1,2,2,0,2,2,1,1,1,3,2,2,0,2,0,2,2,3,2,0,2,2,0,1,1,1,3,1,1,3,2,3,1,1,1,3,2,1,3,1,1,0,2,3,1,1
EI,0,0,2,1,3,2,0,1,0,2,3,2,2,0,1,1,1,2,2,3,1,0,0,1,3,3,1,0,0,2,2,3,2,0,2,1,1,0,2,2,0,2,3,1,2,2,2,3,2,2,2,0,2,2,2,3,2,0,2,0

3. ETL DNA Sequence Dataset SMOTE

The third ETL process uses the function ‘etl_dna_sequence_label_encoder_smote()’ to generate the file ‘splice_junction_dna_sequence_labelencoder_smote.csv’. This function uses the Synthetic Minority Oversampling Technique (SMOTE) algorithm to balance all three classes to 1,642 rows. The total number of rows in this dataset should now be 1,642 x 3 = 4,926, as you can see below from the SMOTE data frame shape calculation.
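The 1,642 figure follows from the class counts quoted earlier: the original counts minus the rows removed by the first ETL process, with every class then raised to the size of the largest one, which is what SMOTE does with its default sampling strategy. A short arithmetic check in Python:

```python
# Class counts quoted in this article and the rows removed by the cleanup step.
original_counts = {"EI": 767, "IE": 768, "N": 1655}
removed_rows = {"EI": 5, "IE": 3, "N": 13}

cleanup_counts = {label: original_counts[label] - removed_rows[label]
                  for label in original_counts}
# With the default strategy every minority class is raised to the majority count.
target_per_class = max(cleanup_counts.values())
synthetic_rows = {label: target_per_class - count
                  for label, count in cleanup_counts.items()}

print(cleanup_counts)
print(synthetic_rows)
print(target_per_class * len(cleanup_counts))
```

Under these numbers, 880 synthetic ‘EI’ rows and 877 synthetic ‘IE’ rows are needed, while the ‘N’ class is left unchanged.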

Passing parameters and calling function code.

csv_path_folder = r"\csv_folder_path"
dna_sequence_labelencoder_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder.csv")
dna_sequence_labelencoder_smote_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder_smote.csv")
dna_sequence_size = 60
result = etl_dna_sequence_label_encoder_smote(dna_sequence_labelencoder_csv_path, dna_sequence_labelencoder_smote_csv_path, dna_sequence_size)
print(result)

Results:

DNA sequence label encoder data frame shape:
(3168, 61)
DNA sequence SMOTE data frame shape:
(4926, 61)
DNA sequence smote csv file created:
\csv_folder_path\splice_junction_dna_sequence_labelencoder_smote.csv
True

The API method PyDNA.balance_class_smote(X, y) was provided in RNA-Seq Gene Expression Classification Using Machine Learning Algorithms blog paper. The final DNA classes balanced boundaries plot is shown below.

4. ETL DNA Sequence Dataset Preprocessed for Deep Learning Networks

The fourth ETL process uses the function ‘etl_dna_sequence_final_dln()’ to generate the file ‘splice_junction_dna_sequence_final_to_dln.csv’. This process converts the DNA sequence label encoding back to a DNA sequence string. The API method PyDNA.dna_labelencoder_to_sequence_string(X) was used for it.
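The conversion back to a string is the inverse of the label encoding. The sketch below is a hypothetical stand-in for PyDNA.dna_labelencoder_to_sequence_string(), whose code is not shown here. Because SMOTE builds synthetic rows by interpolating between neighbours, the values may not be exact integers depending on the input data type, so this sketch assumes rounding to the nearest label and clamping to the range 0..3. That rounding rule is an assumption and is not confirmed by the original implementation.

```python
# Hypothetical sketch; PyDNA.dna_labelencoder_to_sequence_string() is not shown.
LABEL_TO_NUCLEOTIDE = "ACGT"  # index 0 -> A, 1 -> C, 2 -> G, 3 -> T


def labels_to_sequence(labels):
    """Map numeric labels back to nucleotides, rounding and clamping to 0..3."""
    bases = []
    for value in labels:
        index = min(max(int(round(value)), 0), len(LABEL_TO_NUCLEOTIDE) - 1)
        bases.append(LABEL_TO_NUCLEOTIDE[index])
    return "".join(bases)


print(labels_to_sequence([0, 1, 2, 3]))
print(labels_to_sequence([0.2, 1.7, 2.9, 0.4]))
```

A decoded synthetic sequence is not guaranteed to contain all four nucleotides, which is one reason to validate the final dataset again.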

Passing parameters and calling function code.

dna_sequence_labelencoder_smote_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder_smote.csv")    
dna_sequence_final_to_dln_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln.csv")
dna_sequence_size = 60
result = etl_dna_sequence_final_dln(dna_sequence_labelencoder_smote_csv_path, dna_sequence_final_to_dln_csv_path, dna_sequence_size)
print(result)

Results:

DNA sequence SMOTE data frame shape:
(4926, 61)
DNA sequence DLN data frame shape:
(4926, 2)
DNA sequence DLN csv file created:
\csv_folder_path\splice_junction_dna_sequence_final_to_dln.csv
True

Now that we have a valid and complete DNA sequence dataset, we can use any of the DLN algorithms. Below are 10 example rows for the ‘EI’ class.

dna_class,dna_sequence
EI,GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG
EI,GGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCC
EI,GCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCT
EI,CAGACTGGGTGGACAACAAAACCTTCAGCGGTAAGAGAGGGCCAAGCTCAGAGACCACAG
EI,CCTTTGAGGACAGCACCAAGAAGTGTGCAGGTACGTTCCCACCTGCCCTGGTGGCCGCCA
EI,CCCTCGTGCGGTCCACGACCAAGACCAGCGGTGAGCCACGGGCAGGCCGGGGTCGTGGGG
EI,TGGCGACTACGGCGCGGAGGCCCTGGAGAGGTGAGGACCCTCCTGTCCCTGCTCCAGTCC
EI,AAGCTGACAGTGGACCCGGTCAACTTCAAGGTGAGCCAGGAGTCGGGTGGGAGGGTGAGA
EI,TGGCGACTACGGCGCGGAGGCCCTGGAGAGGTGAGGACCCTGGTATCCCTGCTGCCAGTC
EI,AAGCTGAGAGTGGACCCTGTCAACTTCAAGGTGAGCCACCAGTCGGGTGGGGAGGGTGAG

5. ETL DNA Sequence Final Dataset Cleaning

I recommend applying the first DNA sequence ETL process again to make sure we have a valid final dataset.

Passing parameters and calling function code.

csv_path_folder = r"\csv_folder_path"
dna_sequence_original_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln.csv")
dna_sequence_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln_cleanup.csv")
dna_sequence_columns_list = ["dna_class", "dna_sequence"]
result = etl_dna_sequence_cleanup(dna_sequence_original_csv_path, dna_sequence_cleanup_csv_path, dna_sequence_columns_list)
print(result)

Results:

DNA sequence final DLN data frame shape:
(4926, 2)
DNA sequence final DLN cleanup data frame shape:
(4926, 2)
DNA sequence DLN csv file created:
\csv_folder_path\splice_junction_dna_sequence_final_to_dln_cleanup.csv
True

ETL DNA Sequence Full Data Processing

Below is the code of the whole ETL DNA sequence data pipeline process.

# 1. ETL DNA Sequence Cleanup
csv_path_folder = r"G:\Visual WWW\Python\1000_python_workspace\bushnell_ml_ai_project\cvs2"
dna_sequence_original_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_original.csv")
dna_sequence_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_cleanup.csv")
dna_sequence_columns_list = ["dna_class", "dna_sequence"]
result = etl_dna_sequence_cleanup(dna_sequence_original_csv_path, dna_sequence_cleanup_csv_path, dna_sequence_columns_list)
print("1. ETL DNA Sequence Cleanup: {}".format(result))
if not result:
    exit()

# 2. ETL DNA Sequence Label Encoder
dna_sequence_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_cleanup.csv")
dna_sequence_labelencoder_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder.csv")
result = etl_dna_sequence_label_encoder(dna_sequence_cleanup_csv_path, dna_sequence_labelencoder_csv_path)
print("2. ETL DNA Sequence Label Encoder: {}".format(result))
if not result:
    exit()

# 3. ETL DNA Sequence SMOTE
dna_sequence_labelencoder_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder.csv")
dna_sequence_labelencoder_smote_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder_smote.csv")
dna_sequence_size = 60
result = etl_dna_sequence_label_encoder_smote(dna_sequence_labelencoder_csv_path, dna_sequence_labelencoder_smote_csv_path, dna_sequence_size)
print("3. ETL DNA Sequence SMOTE: {}".format(result))
if not result:
    exit()

# 4. ETL DNA Sequence Final DLN
dna_sequence_labelencoder_smote_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_labelencoder_smote.csv")
dna_sequence_final_to_dln_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln.csv")
dna_sequence_size = 60
result = etl_dna_sequence_final_dln(dna_sequence_labelencoder_smote_csv_path, dna_sequence_final_to_dln_csv_path, dna_sequence_size)
print("4. ETL DNA Sequence Final DLN: {}".format(result))
if not result:
    exit()

# 5. ETL DNA Sequence Final DLN Cleanup
dna_sequence_final_to_dln_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln.csv")
dna_sequence_final_to_dln_cleanup_csv_path = os.path.join(csv_path_folder, "splice_junction_dna_sequence_final_to_dln_cleanup.csv")
dna_sequence_columns_list = ["dna_class", "dna_sequence"]
result = etl_dna_sequence_cleanup(dna_sequence_final_to_dln_csv_path, dna_sequence_final_to_dln_cleanup_csv_path, dna_sequence_columns_list)
print("5. ETL DNA Sequence Final DLN Cleanup: {}".format(result))
if not result:
    exit()
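Since every step returns True or False and the pipeline stops at the first failure, the same control flow can also be written as a loop. The runner below is a hypothetical sketch, not the author’s code. The demo steps are stand-in lambdas; in practice each one would wrap one of the etl_dna_sequence_*() calls with its CSV paths.

```python
# Hypothetical runner: each step is a zero-argument callable returning True on success.
def run_etl_steps(steps):
    """Run (name, step) pairs in order; return the first failing step name, or None."""
    for name, step in steps:
        result = step()
        print("{}: {}".format(name, result))
        if not result:
            return name
    return None


# Stand-in steps for illustration only.
demo_steps = [
    ("1. ETL DNA Sequence Cleanup", lambda: True),
    ("2. ETL DNA Sequence Label Encoder", lambda: False),
    ("3. ETL DNA Sequence SMOTE", lambda: True),
]
failed_step = run_etl_steps(demo_steps)
print("Stopped at:", failed_step)
```

Returning the name of the failed step instead of calling exit() lets the caller decide how to react, for example by logging the failure before stopping.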

Applying Convolutional Neural Networks to DNA sequence imbalanced and balanced classes datasets

In my paper “Apply Machine Learning Algorithms for Genomics Data Classification” the Convolutional Neural Networks (CNN) algorithm was applied to predict DNA sequence binding to protein. For our splice-junction dataset, let’s see how a CNN model performs with imbalanced and balanced DNA sequence classes. Below are the results for the imbalanced classes in the csv file ‘splice_junction_dna_sequence_cleanup.csv’.

Model: "sequential"
_________________________________________________________________
 Layer (type)                   Output Shape           Param #
=================================================================
 conv1d (Conv1D)                (None, 49, 32)         1568
 max_pooling1d (MaxPooling1D)   (None, 12, 32)         0
 flatten (Flatten)              (None, 384)            0
 dense (Dense)                  (None, 16)             6160
 dense_1 (Dense)                (None, 3)              51
=================================================================
Total params: 7,779
Trainable params: 7,779
Non-trainable params: 0
_________________________________________________________________
Model Validation
10/10 [==============================] - 0s 2ms/step
valid accuracy score:
95.584

valid precision:
95.686

valid recall:
95.584

valid f1 score:
95.544

valid confusion matrix:
[[ 75 0 1]
[ 2 68 7]
[ 3 1 160]]

valid classification report:
              precision    recall  f1-score   support
           0       0.94      0.99      0.96        76
           1       0.99      0.88      0.93        77
           2       0.95      0.98      0.96       164
    accuracy                           0.96       317
   macro avg       0.96      0.95      0.95       317
weighted avg       0.96      0.96      0.96       317

Model Test
10/10 [==============================] - 0s 1ms/step
test accuracy score:
95.268

test precision:
95.336

test recall:
95.268

test f1 score:
95.189

test confusion matrix:
[[ 75 0 1]
[ 4 65 7]
[ 1 2 162]]

test classification report:
              precision    recall  f1-score   support
           0       0.94      0.99      0.96        76
           1       0.97      0.86      0.91        76
           2       0.95      0.98      0.97       165
    accuracy                           0.95       317
   macro avg       0.95      0.94      0.95       317
weighted avg       0.95      0.95      0.95       317

Model One Test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dna_sequence 1 non-null object
dtypes: object(1)
memory usage: 136.0+ bytes
1/1 [==============================] - 0s 86ms/step
Class Predictive: 1
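The output shapes and parameter counts in the model summary can be checked by hand. The arithmetic below assumes a one-hot input of shape (60, 4), a Conv1D kernel size of 12 and a pooling size of 4. These values are not stated in this paper; they are inferred because they reproduce every number in the summary.

```python
# Inferred, not stated in the article: kernel_size=12, pool_size=4, input shape (60, 4).
sequence_length, channels = 60, 4
filters, kernel_size, pool_size = 32, 12, 4
dense_units, n_classes = 16, 3

conv_steps = sequence_length - kernel_size + 1             # positions of a "valid" convolution
conv_params = kernel_size * channels * filters + filters   # weights plus biases
pooled_steps = conv_steps // pool_size                     # non-overlapping pooling windows
flat_units = pooled_steps * filters
dense_params = flat_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

print(conv_steps, pooled_steps, flat_units)
print(conv_params, dense_params, output_params,
      conv_params + dense_params + output_params)
```

The totals come out to 1,568, 6,160 and 51 parameters, or 7,779 overall, matching the summary above.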

For DNA sequence balanced classes in the csv file ‘splice_junction_dna_sequence_final_to_dln_cleanup.csv’ the results are shown below.

Model: "sequential"
_________________________________________________________________
 Layer (type)                   Output Shape           Param #
=================================================================
 conv1d (Conv1D)                (None, 49, 32)         1568
 max_pooling1d (MaxPooling1D)   (None, 12, 32)         0
 flatten (Flatten)              (None, 384)            0
 dense (Dense)                  (None, 16)             6160
 dense_1 (Dense)                (None, 3)              51
=================================================================
Total params: 7,779
Trainable params: 7,779
Non-trainable params: 0
_________________________________________________________________
Model Validation
16/16 [==============================] - 0s 3ms/step
valid accuracy score:
95.732

valid precision:
95.746

valid recall:
95.732

valid f1 score:
95.733

valid confusion matrix:
[[157 3 4]
[ 3 156 5]
[ 4 2 158]]

valid classification report:
              precision    recall  f1-score   support
           0       0.96      0.96      0.96       164
           1       0.97      0.95      0.96       164
           2       0.95      0.96      0.95       164
    accuracy                           0.96       492
   macro avg       0.96      0.96      0.96       492
weighted avg       0.96      0.96      0.96       492

Model Test
16/16 [==============================] - 0s 2ms/step
test accuracy score:
97.154

test precision:
97.19

test recall:
97.154

test f1 score:
97.16

test confusion matrix:
[[161 0 3]
[ 1 156 6]
[ 1 3 161]]

test classification report:
              precision    recall  f1-score   support
           0       0.99      0.98      0.98       164
           1       0.98      0.96      0.97       163
           2       0.95      0.98      0.96       165
    accuracy                           0.97       492
   macro avg       0.97      0.97      0.97       492
weighted avg       0.97      0.97      0.97       492

Model One Test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dna_sequence 1 non-null object
dtypes: object(1)
memory usage: 136.0+ bytes
1/1 [==============================] - 0s 58ms/step
Class Predictive: 1

Based on the above CNN results, the test accuracy score is better with the balanced classes dataset (97.15%) than with the imbalanced classes dataset (95.27%). I hope this shows why the classes of a DNA sequence dataset should be balanced before applying any DLN algorithm. It’s very important! In the future I’ll be writing a new paper to cover in detail the DLN classification algorithms for different DNA sequence datasets.
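The reported test accuracy and per-class recall values can be recomputed directly from the two test confusion matrices, where each row is a true class and each column a predicted class. The helper name accuracy_and_recall() is only for illustration.

```python
def accuracy_and_recall(confusion_matrix):
    """Rows are true classes, columns are predicted classes."""
    total = sum(sum(row) for row in confusion_matrix)
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion_matrix)]
    return correct / total, recalls


# Test confusion matrices copied from the results above.
imbalanced_test = [[75, 0, 1], [4, 65, 7], [1, 2, 162]]
balanced_test = [[161, 0, 3], [1, 156, 6], [1, 3, 161]]

for name, matrix in (("imbalanced", imbalanced_test), ("balanced", balanced_test)):
    accuracy, recalls = accuracy_and_recall(matrix)
    print(name, round(100 * accuracy, 3), [round(r, 2) for r in recalls])
```

The largest per-class change is the recall of class 1, which goes from 0.86 on the imbalanced test set to 0.96 on the balanced one.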

Conclusions

1. To apply any DLN algorithm to DNA sequence datasets, the following validations are required:

  • The DNA sequence string must contain the four main nucleotide letters A, C, G and T only. Any other letter in the sequence string will generate a new column when converted to a one-hot encoder.
  • The DNA sequence string must contain all four main nucleotide letters A, C, G and T. Any missing letter in the sequence string will decrease the number of columns when it is converted to a one-hot encoder.

2. To generate synthetic DNA sequence data, an ETL pipeline process was proposed to increase the training/validation data and solve the imbalanced class labels problem.

3. Solving the imbalanced class labels problem in DNA sequence datasets improves the performance of DLN classification models. This should be a requirement for any ML DNA sequence classification, regression and clustering project.


Ernest Bonat, Ph.D.

I’m a Senior Data Scientist and Engineer consultant. I work on Machine Learning application projects for Life Sciences using Python and Python Data Ecosystem.