Fast DNA Sequence Data Loading — List vs. NumPy Array

Ernest Bonat, Ph.D.
Aug 9, 2023

1. Overview
2. Really Fast NumPy Arrays
3. Simple Generation of Test DNA Sequence String Data
4. Serializing And Deserializing NumPy Arrays Objects
5. DNA Sequence Data Loading Using List and NumPy Array
6. Program Execution Time Results Using List and NumPy Array
7. Conclusion

1. Overview

Genomics DNA sequence data can be imported from many different sources, including files, databases, and cloud storage. To use this data for statistical analysis, it must first be loaded into a data object of the chosen programming language. In Python, the most common built-in data structures are tuples, sets, dictionaries, and lists. Among the Python Data Ecosystem libraries, one of the most useful structures is the NumPy array. NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Genomics datasets are often huge, and loading them into the selected data object takes time. In posts found through Google search, the most commonly used Python data structure is the List object, which is very simple and easy to use, regardless of whether the data is numerical or categorical. NumPy arrays, on the other hand, are widely regarded as the fastest data manipulation structures in Python today. In this paper, I would like to find out if this holds for a genomics DNA sequence string dataset with thousands of rows. I have written many high-level Data Analytics and Machine Learning papers using DNA/RNA sequence datasets (“Machine Learning Applications in Genomics Life Sciences by Ernest Bonat, Ph.D.”). I truly believe that you can become a better Bioinformatics practitioner if you read and understand them.

2. Really Fast NumPy Arrays

The paper “How Fast Numpy Really is and Why?” explains very well why NumPy arrays are so fast for data manipulation, especially for numerical data types. Here are the three main reasons:

  1. A NumPy array is a collection of elements of the same data type, densely packed in memory. A Python list can hold different data types, which imposes many extra constraints when performing computations on it.
  2. NumPy can divide some tasks into multiple subtasks and process them in parallel.
  3. NumPy functions are implemented in C, which further increases their speed compared to operations on Python lists.
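
The first and third reasons can be illustrated with a short sketch. The values and variable names here are illustrative only and do not come from the cited paper.

```python
import numpy as np

# A Python list may mix types; each element is a separate Python object.
mixed_list = [1, 2.5, "A"]

# A NumPy array stores a single dtype in one contiguous block of memory.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Arithmetic on the array runs in compiled loops, with no Python-level loop.
doubled_array = values * 2

# The equivalent list operation needs an explicit Python loop.
doubled_list = [v * 2 for v in values.tolist()]

print(values.dtype)    # float64
print(doubled_array)
print(doubled_list)
```

Both approaches produce the same numbers. The difference lies in where the loop runs, which is why the advantage applies mainly to numerical data.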

Can a List sometimes be faster than a NumPy array? The paper “Python Lists Are Sometimes Much Faster Than NumPy. Here’s Proof” shows some examples where the List is faster than the NumPy array. Based on this information, I recommend trying both of these Python data structures and finding out which one offers the fastest speed for your specific data loading requirements.

3. Simple Generation of Test DNA Sequence String Data

To find out whether the List or the NumPy array loads a DNA sequence dataset faster, we need dataset files of different sizes. Searching Google for a simple DNA sequence generator, I found one on GitHub, DNASequenceGenerator/main.py. I refactored the code and made it faster using the Numba library, as you can see in the code below. Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. An example of using Numba in Python is given in “High Performance Big Data Analysis Using NumPy, Numba & Python Asynchronous Programming”.

import random

from numba import jit

import config  # the author's settings module; this import path is assumed

@jit(nopython=True)
def get_dna_sequence_base_nucleotide(x):
    # map a random integer (0-3) to its nucleotide letter
    return {0: 'A', 1: 'C', 2: 'G', 3: 'T'}[x]

@jit(nopython=True)
def generate_dna_sequence_list(number_of_sequences, compilation=None):
    if compilation is None:
        pass
    else:
        dna_sequence_list = []
        length_start = config.LENGTH_START  # 15
        length_end = config.LENGTH_END  # 20
        for i in range(number_of_sequences):
            dna_sequence = ""
            # random.randint() includes both end points
            for _ in range(random.randint(length_start, length_end)):
                dna_sequence += get_dna_sequence_base_nucleotide(random.randint(0, 3))
            dna_sequence_list.append(dna_sequence)
        return dna_sequence_list
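
For readers who do not have Numba installed, the same idea can be sketched in pure Python. This is not the author's generator: the function name generate_dna_sequences and the seed parameter are hypothetical, and the default lengths of 15 and 20 are taken from the comments in the code above.

```python
import random

def generate_dna_sequences(number_of_sequences, length_start=15, length_end=20, seed=None):
    """Return a list of random DNA strings (pure Python, no Numba)."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    sequences = []
    for _ in range(number_of_sequences):
        length = rng.randint(length_start, length_end)  # both ends inclusive
        sequences.append("".join(rng.choices("ACGT", k=length)))
    return sequences

sample = generate_dna_sequences(5, seed=42)
print(sample)
```

Passing a fixed seed returns the same sequences on every run, which is convenient when comparing timings across runs.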

The main() function shown below calls generate_dna_sequence_list() to generate a DNA sequence dataset list with a predefined number of sequences. An example of 100 rows is provided.

import os
import time

import numpy as np

# PyDNA (the author's library) and config must also be imported

def main():
    number_of_sequences = config.NUMBER_OF_SEQUENCES
    start = time.perf_counter()
    # generate a dna sequence list
    dna_sequence_list = generate_dna_sequence_list(number_of_sequences, compilation=True)
    end = time.perf_counter()
    print("Elapsed run time = {}s".format((end - start)))

    # convert list to numpy array with 100 DNA sequence rows
    reads_nparray_100 = np.array(dna_sequence_list)
    print("reads_nparray_100.shape")
    print(reads_nparray_100.shape)

    # select the pickle file path
    pkl_directory_path = r"\pkl"
    file_path_name = os.path.join(pkl_directory_path, "reads_nparray_100.pkl")
    # use pydna library to create the pickle file of a numpy array
    PyDNA.pickle_serialize_object(file_path_name, reads_nparray_100)
    print("File {} has been created.".format(file_path_name))

4. Serializing And Deserializing NumPy Arrays Objects

As you can see, the PyDNA library (“Apply Machine Learning Algorithms for Genomics Data Classification”. Ernest Bonat, Ph.D., Bishes Rayamajhi, MS. February 03, 2021) was used to serialize the NumPy array using the pickle library. The pickle module implements binary protocols for serializing and deserializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. The following seven NumPy array pickle files were created. They make it possible to select the right DNA sequence dataset without running the generate_dna_sequence_list() function every time during the speed performance test.

  • reads_nparray_100.pkl
  • reads_nparray_500.pkl
  • reads_nparray_1000.pkl
  • reads_nparray_50000.pkl
  • reads_nparray_100000.pkl
  • reads_nparray_500000.pkl
  • reads_nparray_1000000.pkl
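
The PyDNA helper functions themselves are not shown in this paper. As a rough sketch, helpers of this kind could be written with the standard pickle module as follows. The two functions reuse the names from the PyDNA calls for readability, but they are stand-ins written for this example, not the library's actual implementation. The file name reads_nparray_demo.pkl and the two sequences are also made up.

```python
import os
import pickle
import tempfile

import numpy as np

def pickle_serialize_object(file_path_name, obj):
    """Write any picklable object to a binary file."""
    with open(file_path_name, "wb") as file:
        pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)

def pickle_deserialize_object(file_path_name):
    """Read an object back from a pickle file (only open files you trust)."""
    with open(file_path_name, "rb") as file:
        return pickle.load(file)

reads_nparray = np.array(["ACGTACGTACGTACGT", "TTGACCATGACCATGA"])

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as pkl_directory_path:
    file_path_name = os.path.join(pkl_directory_path, "reads_nparray_demo.pkl")
    pickle_serialize_object(file_path_name, reads_nparray)
    restored = pickle_deserialize_object(file_path_name)

print(restored.tolist() == reads_nparray.tolist())  # True
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.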

5. DNA Sequence Data Loading Using List and NumPy Array

The code shown below defines two loading functions: load_empty_list() and load_empty_nparray(). Both include document string (docstring) comments and exception handling that prints the complete stack trace. It’s very interesting to see in Google search how much poorly written Python code Data Scientists publish today. Many times, I try to find a good piece of Python code and can’t find one, which is a complete waste of time. Here are the three main issues:

  • Declaring, using and hardcoding variables with no meaning at all
  • Poorly developed functions without even docstring comments
  • No exception handling implementation at all

import traceback as tb

import numpy as np

def load_empty_list(dna_sequence_list):
    """ load empty list with dna sequence list
    args:
        dna_sequence_list (list): test dna sequence list
    returns:
        list_empty (list)
    """
    try:
        list_empty = []
        for item in dna_sequence_list:
            list_empty.append(item)
    except Exception:
        tb.print_exc()
    return list_empty

def load_empty_nparray(dna_sequence_list):
    """ load empty nparray with dna sequence list
    args:
        dna_sequence_list (list): test dna sequence list
    returns:
        nparray_empty (numpy array)
    """
    try:
        nparray_empty = np.array([])
        for item in dna_sequence_list:
            # np.append() returns a new array on every call
            nparray_empty = np.append(nparray_empty, item)
    except Exception:
        tb.print_exc()
    return nparray_empty

The main() function code below deserializes the selected NumPy array pickle file and converts it to a list for easy manipulation. Both functions, load_empty_list() and load_empty_nparray(), are called, and the elapsed program run time is calculated in seconds for each one.

import os
import time

def main():
    pkl_directory_path = r"\pkl"

    file_path_name = os.path.join(pkl_directory_path, "reads_nparray_1000000.pkl")
    # deserialize numpy array pickle file path name
    reads_nparray = PyDNA.pickle_deserialize_object(file_path_name)

    # convert numpy array to list
    reads_list = reads_nparray.tolist()
    # print(len(reads_list))

    # load empty list and calculate the elapsed run time
    start = time.perf_counter()
    list_empty = load_empty_list(reads_list)
    end = time.perf_counter()
    print("Elapsed runtime (load_empty_list) = {}s".format((end - start)))

    # load empty numpy array and calculate the elapsed run time
    start = time.perf_counter()
    nparray_empty = load_empty_nparray(reads_list)
    end = time.perf_counter()
    print("Elapsed runtime (load_empty_nparray) = {}s".format((end - start)))
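
To try the comparison without PyDNA or the pickle files, the same timing pattern can be reduced to a small self-contained sketch. The helper time_call(), the two loader names, and the test data are hypothetical. Unlike load_empty_nparray(), this version starts from an empty string array (dtype=str) so that no float-to-string conversion is involved.

```python
import time

import numpy as np

def time_call(function, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start

def append_to_list(items):
    loaded = []
    for item in items:
        loaded.append(item)
    return loaded

def append_to_nparray(items):
    loaded = np.array([], dtype=str)
    for item in items:
        loaded = np.append(loaded, item)  # allocates a new array each time
    return loaded

# Small, fixed test data; the datasets used in this paper are far larger.
reads = ["ACGTACGTACGTACG", "TTGACCATGACCATGACC"] * 1000

list_result, list_seconds = time_call(append_to_list, reads)
array_result, array_seconds = time_call(append_to_nparray, reads)
print("list: {:.6f}s, np.append: {:.6f}s".format(list_seconds, array_seconds))
```

The absolute numbers depend on the machine, so no expected output is given here. For more stable measurements, the standard timeit module can repeat each call several times.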

6. Program Execution Time Results Using List and NumPy Array

Python provides several functions to measure the execution time of a program. In this paper, I decided to use the perf_counter() function. The table below shows the number of DNA sequence rows in the dataset, the List data loading time (seconds), the NumPy array data loading time (seconds), and how much faster the List loads compared to the NumPy array (seconds). The measured execution times will vary depending on the PC hardware used. Based on this table, the List is much faster than the NumPy array with DNA sequence string datasets. This is not surprising, because the NumPy library was designed and developed specifically for numerical datasets. It’s interesting to see how slow the append() function in NumPy can be. Each call allocates a new array and copies all existing items into it, so the cost of every append grows with the size of the array. Why not reuse the same array and just add a new item at the end, as the List does? A NumPy array is a fixed-size block of memory, whereas a List reserves spare capacity so that most appends need no copying. Maybe the NumPy development team can find a way to speed up append() for string datasets. For now, the List is the better and faster solution for loading string data item by item.
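
The copying behavior of np.append() is easy to check with a short sketch. The sample sequences are made up. The single bulk conversion at the end is the same np.array() call used earlier in this paper to build the pickle files, shown here only as a point of comparison.

```python
import numpy as np

original = np.array(["ACGT", "TTGA"])
extended = np.append(original, "GGCC")

# np.append() returns a new array; the original is left unchanged.
print(original.shape, extended.shape)  # (2,) (3,)
print(extended is original)            # False

# When all sequences are already in a list, one conversion creates
# the array in a single step instead of one copy per item.
reads_list = ["ACGT", "TTGA", "GGCC"]
reads_nparray = np.array(reads_list)
print(reads_nparray.shape)             # (3,)
```

This is why the loop in load_empty_nparray() slows down as the dataset grows: every iteration copies everything loaded so far.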

7. Conclusion

For loading DNA sequence string datasets, use the Python List object. In the tests described here, it was faster than NumPy arrays.

Ernest Bonat, Ph.D.

I’m a Senior Machine Learning Developer. I work on Machine Learning application projects for Life Sciences using Python and the Python Data Ecosystem.