DNA Sequence String Conversion to Label Encoder for Machine Learning Algorithms

Ernest Bonat, Ph.D.
5 min read · Apr 18, 2023


1. Overview
2. Convert DNA Label Encoder Class Object
3. Convert DNA Sequence String CSV File to DNA Label Encoder CSV File
4. Convert DNA Label Encoder CSV File to DNA Sequence String CSV File
5. Conclusions

1. Overview

I have received many questions about converting DNA sequence datasets for Machine Learning algorithms, so I decided to write this short paper to demonstrate how to implement this data preprocessing task. In genomics datasets, DNA sequences are stored as string (text) data. However, most Machine Learning algorithms work only with numerical data, so DNA sequence strings must be converted to numerical values. It is important for bioinformaticians and computational biologists to know how to implement this conversion. It can be done with any of the following encoding methods: Label Encoder, One-Hot Encoder, or K-mer Counting (as discussed in “Apply Machine Learning Algorithms for Genomics Data Classification”). This paper covers the Label Encoder method, which was used in “Advanced DNA Sequences Preprocessing for Deep Learning Networks” to generate synthetic DNA sequence string data. The code to convert DNA sequence string datasets to DNA label encoder datasets and vice versa is provided in a GitHub repository. Some of the latest and best practices of Machine Learning algorithms applied in genomics Life Sciences have been published on Medium.com in the paper “Machine Learning Applications in Genomics Life Sciences” by Ernest Bonat, Ph.D.
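As a minimal sketch of the Label Encoder idea, the snippet below encodes a single made-up sequence with scikit-learn's `LabelEncoder`. The sequence and variable names are illustrative assumptions, not taken from the repository. The encoder is fitted on the four-letter alphabet, so A, C, G and T map to 0, 1, 2 and 3.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the full nucleotide alphabet; LabelEncoder sorts its classes,
# so the mapping is A=0, C=1, G=2, T=3.
label_encoder = LabelEncoder()
label_encoder.fit(["A", "C", "G", "T"])

dna_sequence = "GATTACA"  # hypothetical example sequence
encoded = label_encoder.transform(list(dna_sequence))
print(encoded.tolist())  # [2, 0, 3, 3, 0, 1, 0]

# inverse_transform maps the integers back to letters
decoded = "".join(label_encoder.inverse_transform(encoded))
print(decoded)  # GATTACA
```

The encoded values are arbitrary integer labels; they carry no numerical ordering between nucleotides.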

2. Convert DNA Label Encoder Class Object

A generic class was developed to handle the conversion of DNA sequence string datasets to DNA label encoder datasets and vice versa. The code is shown below.

import sys
import time
import traceback
import warnings

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings("ignore")


class ConvertDNALabelEncoder(object):
    """
    convert dna sequence string csv file to dna label encoder csv file and vice versa
    """

    # fixed alphabet so that every row gets the same mapping: A=0, C=1, G=2, T=3
    NUCLEOTIDES = ["A", "C", "G", "T"]

    def __init__(self):
        pass

    @staticmethod
    def convert_dna_string_to_dna_labelencoder(dna_string_csv_path, dna_labelencoder_csv_path):
        """
        convert dna sequence string csv file to dna label encoder csv file
        args:
            dna_string_csv_path (string): dna string csv file path
            dna_labelencoder_csv_path (string): dna label encoder csv file path
        returns:
            none
        """
        try:
            df_dna_string = pd.read_csv(filepath_or_buffer=dna_string_csv_path)
            # fit once on the full alphabet; fitting on each row would renumber
            # the labels for any sequence that lacks one of the four letters
            label_encoder = LabelEncoder()
            label_encoder.fit(ConvertDNALabelEncoder.NUCLEOTIDES)
            dna_labelencoder_list = []
            for row in df_dna_string.itertuples():
                dna_string_nparray = np.array(list(row.dna_sequence))
                # transform raises an error for letters other than A, C, G, T
                dna_labelencoder_row = label_encoder.transform(dna_string_nparray)
                dna_labelencoder_list.append(dna_labelencoder_row)
            df_dna_labelencoder = pd.DataFrame(dna_labelencoder_list)
            df_dna_labelencoder.to_csv(path_or_buf=dna_labelencoder_csv_path, index=False, header=False)
        except Exception:
            print("An error occurred. {}".format(ConvertDNALabelEncoder.get_exception_stack_trace()))

    @staticmethod
    def convert_dna_labelencoder_to_dna_string(dna_labelencoder_csv_path, dna_string_csv_path):
        """
        convert dna sequence label encoder csv file to dna string csv file
        args:
            dna_labelencoder_csv_path (string): dna label encoder csv file path
            dna_string_csv_path (string): dna string csv file path
        returns:
            none
        """
        try:
            df_dna_labelencoder = pd.read_csv(filepath_or_buffer=dna_labelencoder_csv_path, header=None)
            dna_labelencoder_list = df_dna_labelencoder.values.tolist()
            # inverse of the encoding above: 0=A, 1=C, 2=G, 3=T
            label_to_nucleotide = dict(enumerate(ConvertDNALabelEncoder.NUCLEOTIDES))
            dna_string_list = []
            for item in dna_labelencoder_list:
                # an unknown label raises an error instead of silently reusing a previous letter
                dna_string = "".join(label_to_nucleotide[int(label)] for label in item)
                dna_string_list.append(dna_string)
            # keep the dna_sequence header so the file has the same layout as the input file
            df_dna_string = pd.DataFrame(dna_string_list, columns=["dna_sequence"])
            df_dna_string.to_csv(path_or_buf=dna_string_csv_path, index=False)
        except Exception:
            print("An error occurred. {}".format(ConvertDNALabelEncoder.get_exception_stack_trace()))

    @staticmethod
    def get_exception_stack_trace():
        """
        get exception stack trace
        args:
            none
        returns:
            exception_stack_trace (string): exception stack trace parameters
        """
        # default value so the return statement never fails
        exception_stack_trace = ""
        try:
            exception_type, exception_value, exception_traceback = sys.exc_info()
            # last frame of the traceback: where the exception was raised
            file_name, line_number, procedure_name, line_code = traceback.extract_tb(exception_traceback)[-1]
            exception_stack_trace = ("[Time Stamp]: " + time.strftime("%d-%m-%Y %I:%M:%S %p") + " "
                                     + "[File Name]: " + str(file_name) + " "
                                     + "[Procedure Name]: " + str(procedure_name) + " "
                                     + "[Error Message]: " + str(exception_value) + " "
                                     + "[Error Type]: " + str(exception_type) + " "
                                     + "[Line Number]: " + str(line_number) + " "
                                     + "[Line Code]: " + str(line_code))
        except Exception:
            print("An error occurred in {}".format("get_exception_stack_trace() function"))
        return exception_stack_trace

    @staticmethod
    def get_program_running(start_time):
        """
        calculate program running
        args:
            start_time (float): program start time, as returned by time.time()
        returns:
            none
        """
        try:
            end_time = time.time()
            diff_time = end_time - start_time
            result = time.strftime("%H:%M:%S", time.gmtime(diff_time))
            print("program runtime: {}".format(result))
        except Exception:
            print("An error occurred. {}".format(ConvertDNALabelEncoder.get_exception_stack_trace()))

As the class code above shows, the following static methods were developed.

def convert_dna_string_to_dna_labelencoder(dna_string_csv_path, dna_labelencoder_csv_path):
    """
    convert dna sequence string csv file to dna label encoder csv file
    args:
        dna_string_csv_path (string): dna string csv file path
        dna_labelencoder_csv_path (string): dna label encoder csv file path
    returns:
        none
    """

def convert_dna_labelencoder_to_dna_string(dna_labelencoder_csv_path, dna_string_csv_path):
    """
    convert dna sequence label encoder csv file to dna string csv file
    args:
        dna_labelencoder_csv_path (string): dna label encoder csv file path
        dna_string_csv_path (string): dna string csv file path
    returns:
        none
    """

def get_exception_stack_trace():
    """
    get exception stack trace
    args:
        none
    returns:
        exception_stack_trace (string): exception stack trace parameters
    """

def get_program_running(start_time):
    """
    calculate program running
    args:
        start_time (float): program start time, as returned by time.time()
    returns:
        none
    """

This class code implements two important Python programming practices: function documentation strings (docstrings) and exception handling. The exception handling code captures stack trace details (file name, procedure name, line number and line code), not just the exception message. With only a simple exception message, it can be difficult to identify and fix errors quickly. I have noticed that many online Python resources lack these essential practices. Unfortunately, many people write poor-quality Python code today. I invite you to read the following paper: “Refactoring Python Code for Machine Learning Projects. Python “Spaghetti Code” Everywhere!”. I believe you will find it informative and enjoyable.
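To illustrate the difference between an exception message and a stack trace, here is a small sketch using the standard library's `traceback.format_exc()`. The functions `parse_nucleotide` and `describe_failure` are hypothetical helpers that exist only to trigger and report an error; they are not part of the class above.

```python
import traceback


def parse_nucleotide(letter):
    """Return the integer label for a nucleotide letter (hypothetical helper)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    return mapping[letter]  # raises KeyError for unknown letters


def describe_failure(letter):
    """Return (message_only, full_trace) for a failing lookup, else (None, None)."""
    try:
        parse_nucleotide(letter)
    except KeyError as error:
        message_only = str(error)            # only the offending key
        full_trace = traceback.format_exc()  # file, line, function and message
        return message_only, full_trace
    return None, None


message_only, full_trace = describe_failure("N")
print(message_only)
print(full_trace)
```

The message alone only names the bad key, while the trace also shows which function and line raised the error.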

3. Convert DNA Sequence String CSV File to DNA Label Encoder CSV File

Below is the code to convert a DNA sequence string CSV file to a DNA label encoder CSV file.

from class_convert_dna_label_encoder import ConvertDNALabelEncoder

dna_string_csv_path = r"\folder_path\dna_sequence_string_example.csv"
dna_labelencoder_csv_path = r"\folder_path\dna_label_encoder_example.csv"
ConvertDNALabelEncoder.convert_dna_string_to_dna_labelencoder(dna_string_csv_path, dna_labelencoder_csv_path)

Here are examples of a DNA sequence string dataset and the corresponding DNA label encoder dataset. As you can see, the second dataset is fully numerical and can be used as input to Machine Learning algorithms.

1. ‘dna_sequence_string_example.csv’ CSV file example data.

2. ‘dna_label_encoder_example.csv’ CSV file example data.
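To make the two file layouts concrete, the sketch below builds a tiny dataset in memory and prints its label-encoded form. The three sequences are invented for illustration and are not the article's example data; only the `dna_sequence` column name comes from the class code. A plain dictionary stands in for `LabelEncoder` here, assuming the mapping A=0, C=1, G=2, T=3.

```python
import io

import pandas as pd

NUCLEOTIDE_TO_LABEL = {"A": 0, "C": 1, "G": 2, "T": 3}

# hypothetical content of a DNA sequence string CSV file
dna_string_csv = "dna_sequence\nACGT\nTTGA\nCCAA\n"
df_dna_string = pd.read_csv(io.StringIO(dna_string_csv))

# one row of integer labels per sequence, one column per position
rows = [
    [NUCLEOTIDE_TO_LABEL[letter] for letter in sequence]
    for sequence in df_dna_string["dna_sequence"]
]
df_dna_labelencoder = pd.DataFrame(rows)

# with no path given, to_csv returns the CSV text instead of writing a file
dna_labelencoder_csv = df_dna_labelencoder.to_csv(index=False, header=False)
print(dna_labelencoder_csv)
# 0,1,2,3
# 3,3,2,0
# 1,1,0,0
```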

4. Convert DNA Label Encoder CSV File to DNA Sequence String CSV File

Below is the code to convert a DNA label encoder CSV file to a DNA sequence string CSV file.

from class_convert_dna_label_encoder import ConvertDNALabelEncoder

dna_labelencoder_csv_path = r"\folder_path\dna_label_encoder_example.csv"
dna_string_csv_path_back = r"\folder_path\dna_sequence_string_example_back.csv"
# the label encoder file is the input (first argument), the string file is the output
ConvertDNALabelEncoder.convert_dna_labelencoder_to_dna_string(dna_labelencoder_csv_path, dna_string_csv_path_back)

After this conversion, we get back the original DNA sequence strings, as expected.

1. ‘dna_label_encoder_example.csv’ CSV file example data.

2. ‘dna_sequence_string_example_back.csv’ CSV file example data.
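A round trip like this is worth checking automatically. One caveat to keep in mind: `LabelEncoder` numbers the classes it sees during `fit` in sorted order, so an encoder fitted on a sequence that lacks a letter assigns different integers. The sketch below uses made-up sequences and hypothetical helper names (`encode`, `decode`); it fits on the full alphabet, asserts the round trip, and then shows what goes wrong with a per-sequence fit.

```python
from sklearn.preprocessing import LabelEncoder

ALPHABET = ["A", "C", "G", "T"]
LABEL_TO_NUCLEOTIDE = dict(enumerate(ALPHABET))  # {0: "A", 1: "C", 2: "G", 3: "T"}

# fit once on the full alphabet so the mapping is the same for every sequence
encoder = LabelEncoder().fit(ALPHABET)


def encode(sequence):
    """Encode a DNA string as a list of integer labels."""
    return encoder.transform(list(sequence)).tolist()


def decode(labels):
    """Decode a list of integer labels back into a DNA string."""
    return "".join(LABEL_TO_NUCLEOTIDE[int(label)] for label in labels)


sequences = ["ACGT", "ACCT", "TTTT"]  # hypothetical test sequences
for sequence in sequences:
    assert decode(encode(sequence)) == sequence

# fitting on a single sequence without G shifts T from 3 to 2
per_sequence_labels = LabelEncoder().fit_transform(list("ACCT")).tolist()
print(per_sequence_labels)          # [0, 1, 1, 2]
print(decode(per_sequence_labels))  # ACCG, not ACCT
```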

5. Conclusions

1. DNA sequence string datasets need to be converted to numerical datasets for Machine Learning classification, regression and clustering projects.

2. The Label Encoder method provides a simple way to convert DNA sequence string datasets to numerical datasets.

3. The DNA Label Encoder class provides the static methods needed to convert a DNA sequence string CSV file to a DNA label encoder CSV file and vice versa.


Ernest Bonat, Ph.D.

I’m a Senior Data Scientist and Engineer consultant. I work on Machine Learning application projects for Life Sciences using Python and the Python data ecosystem.