ELT Package Development with 3-Tier Architecture for Data Engineering

Ernest Bonat, Ph.D.
13 min read · Jan 23, 2024

--

1. Overview
2. Why do you need ETL for Data Engineering?
3. A Simple ETL 3-Tier Architecture
4. ETL Package Development Example
5. Extract Tier
6. Application Configuration File
7. Application Log File
8. Transform Tier
9. Load Tier
10. Main Execution File
11. ETL Static Library
12. ETL DNA Sequence Dataset Preprocessing
13. Conclusion

1. Overview

One of the main tasks of Machine Learning (ML) is data preprocessing (cleansing). In general, this task constitutes about 60% to 70% of the entire ML project. There is a significant amount of data preprocessing work to be done before applying your initially selected ML method. A company's Engineering Department is in charge of providing and maintaining these tasks, which are handled by specialized Data Engineers and ETL Developers.

From Wikipedia site: “Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s). The ETL process became a popular concept in the 1970s and is often used in data warehousing. Data extraction involves extracting data from homogeneous or heterogeneous sources; data transformation processes data by data cleansing and transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.”

Let’s define the ETL process in a very simple and clear manner:

  1. Extract — data extraction from one or many sources.
  2. Transform — data transformation (preprocessing or cleansing). A clear document outlining the data transformation requirements is necessary.
  3. Load — data loading into one or many destinations.

The data sources could include database engines (such as Oracle, SQL Server, PostgreSQL, MySQL, SQLite, MongoDB, MariaDB, etc.), data files (LOG, CSV, XML, XLS, JSON, etc.), and any combination of these.
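
To make these three steps concrete, here is a minimal, generic sketch of an ETL run in Python. The function bodies, file names, and the pandas-based approach are illustrative assumptions only, not part of the package developed below.

import pandas as pd

def extract(source_csv_path):
    # extract: read raw data from a source (a CSV file in this sketch)
    return pd.read_csv(source_csv_path)

def transform(df_raw):
    # transform: cleanse the data (dropping duplicate rows as a trivial example)
    return df_raw.drop_duplicates()

def load(df_clean, target_csv_path):
    # load: write the processed data to a destination
    df_clean.to_csv(target_csv_path, index=False)

if __name__ == "__main__":
    load(transform(extract("raw_data.csv")), "clean_data.csv")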

Here is a simple ETL diagram.

2. Why do you need ETL for Data Engineering?

There are many reasons for adopting an ETL process in an organization: “ETL (Extract, Transform, and Load) Process in Data Warehouse”.

  • It helps companies analyze their business data to make critical business decisions.
  • Transactional databases cannot answer complex business questions that can be answered by ETL.
  • A Data Warehouse provides a common data repository.
  • ETL provides a method for moving data from various sources into a data warehouse.
  • As data sources change, the Data Warehouse will automatically update.
  • A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
  • It allows verification of data transformation, aggregation, and calculation rules.
  • The ETL process allows sample data comparison between the source and the target system.
  • The ETL process can perform complex transformations and requires a staging area to store the data.
  • ETL helps migrate data into a Data Warehouse, converting various formats and types to adhere to one consistent system.
  • ETL is a predefined process for accessing and manipulating source data into the target database.
  • ETL offers deep historical context for the business.
  • It helps improve productivity because it codifies and reuses processes without requiring additional technical skills.

3. A Simple ETL 3-Tier Architecture

Now that we have a good understanding of the ETL process, let’s define a simple architecture for its development and maintenance. As you can see, there are three main tiers (components) in any ETL process. Thinking about three tiers, I remember the 1990s, when Client/Server Windows application development was very popular. The most widely used application architecture pattern at that time was the 3-Tier Architecture Pattern, which includes the Presentation (GUI), Business Logic (BL), and Data Access (DA) tiers. Let’s apply the same logic to the ETL process. Below is the similarity table:

As you can see, the Business Logic tier handles business data preprocessing in both directions, while the Transform tier processes data in one direction, after it has been extracted from the original source(s). It’s essential to mention that designing an application with a 3-tier architecture ensures independence between the tiers. Each tier operates independently, promoting modularity, scalability, and maintainability. This design approach enhances the application’s overall flexibility and ease of maintenance. Based on this logic, the main folder structure for any ETL process will include extract, transform, and load folders. Below is the general folder setup.
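
The article does not list the full folder tree, but based on the imports and file paths used throughout the code below, a typical layout might look like this (the name of the main script is assumed):

src/
    config/
        app.cfg
    log/
        app.log
    data/
    library/
        etl_library.py
    extract/
        extract_data_interface.py
        extract_data.py
    transform/
        transform_data_interface.py
        transform_data.py
    load/
        load_data_interface.py
        load_data.py
    main.py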

4. ETL Package Development Example

Suppose a business would like to receive an extended price report for a specific customer order of products every day. The extended price is very helpful when it comes to understanding the total cost paid for each unit, especially when the aim is to resell those units to a client base. To calculate the extended price, the following simple formula will be used: ExtendedPrice = Quantity x UnitPrice x (1 - Discount). Here is the step-by-step process:

  1. Select the product order ID.
  2. Retrieve the list of products by order ID from the “products” and “order details” tables in the SQL Server database.
  3. Calculate the extended price for each quantity of the product by order ID.
  4. Generate a final CSV report file.

In this case, the ETL 3-tier process can be defined as:

1. Extract: Retrieve the list of products based on order ID from the “products” and “order details” tables in the SQL Server database.

2. Transform: Process the list of products by calculating and adding the extended price values.

3. Load: Write the final list of products to a CSV file.

In this example, the ETL data process involves extracting data from SQL Server database tables and generating a CSV file.
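
As a quick check of the formula, take the first product row returned by the extract tier below (Pavlova: Quantity = 21, UnitPrice = 13.90, Discount = 0.15): ExtendedPrice = 21 x 13.90 x (1 - 0.15) = 291.90 x 0.85 = 248.115, which is approximately $248.12 after rounding to two decimal places.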

5. Extract Tier

The extract tier's Python code implementation is provided in the ‘extract’ folder. For this example, the standard SQL Server database ‘Northwind’ will be used. The data will be taken from the ‘Products’ and ‘[Order Details]’ tables. Here is the SQL statement for ‘OrderID’ equal to 10324.

SELECT ProductName, 
O.UnitPrice,
Quantity,
Discount
FROM Products P, [Order Details] O
WHERE O.ProductID = P.ProductID
AND O.OrderID = 10324;

If this statement runs in SQL Server Management Studio, the following result is obtained.

Let’s include this SQL statement in a stored procedure. Stored procedures can enhance performance by reducing network traffic. Instead of sending multiple SQL statements from the application to the database, a single call to the stored procedure is made. This can result in fewer round-trips between the application and the database, improving overall performance. Using database stored procedures offers several advantages in software development: modularity and reusability, performance, security, encapsulation of business logic, transaction management, reduced code duplication, ease of maintenance and version control.

USE [Northwind]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[CustomerOrdersDetail]
(
    @OrderID INT,
    @ErrorMessage VARCHAR(MAX) OUTPUT
)
AS
BEGIN TRY
    SET NOCOUNT ON;
    SELECT ProductName,
           O.UnitPrice,
           Quantity,
           Discount
    FROM Products P, [Order Details] O
    WHERE O.ProductID = P.ProductID
    AND O.OrderID = @OrderID;
END TRY
BEGIN CATCH
    SET @ErrorMessage = 'An error occurred: ' + ERROR_PROCEDURE() + ', ' +
                        CAST(ERROR_LINE() AS VARCHAR(50)) + ', ' +
                        ERROR_MESSAGE();
END CATCH;

This stored procedure will be executed using the PyODBC library, which provides the necessary database APIs to manipulate the SQL Server data. For example, the following is the function to connect to SQL Server. This function returns the SQL Server connection object.

def sqlserver_open_connection(self, timeout=15):
    """get sql server connection object
    args:
        timeout (int, optional): connection timeout. defaults to 15
    returns:
        sqlserver_connection (object): sql server connection object
    """
    sqlserver_connection = None
    try:
        connection_string = "DRIVER={SQL Server};"\
            "SERVER=" + self.read_app_config_file("sql_server_home", "server") + ";"\
            "DATABASE=" + self.read_app_config_file("sql_server_home", "database") + ";"\
            "UID=" + self.read_app_config_file("sql_server_home", "uid") + ";"\
            "PWD=" + self.read_app_config_file("sql_server_home", "pwd") + ";"
        sqlserver_connection = pyodbc.connect(connection_string)
        sqlserver_connection.timeout = timeout
    except:
        # return None when the connection cannot be opened
        sqlserver_connection = None
        self.logging_exception_message()
    return sqlserver_connection

Here is the complete code for the ‘extract_data.py’ file, implemented in the class ExtractData().

from library.etl_library import ETLLibrary
from zope.interface import implementer
from extract.extract_data_interface import ExtractDataInterface

@implementer(ExtractDataInterface)
class ExtractData(ETLLibrary):

    def __init__(self, app_config_file, app_log_file):
        super().__init__(app_config_file, app_log_file)
        self.app_config_file = app_config_file
        self.app_log_file = app_log_file

    def customer_orders_detail_extract_sp(self):
        """get data row values, columns name and execution data error if any
        returns:
            data_row (object): dataset row values
            data_column (object): dataset columns name
            data_error (object): execution data error if any
        """
        data_row, data_column, data_error = (None, None, None)
        try:
            # get sql server connection object
            sqlserver_connection = self.sqlserver_open_trusted_connection()
            # check for database connection
            if sqlserver_connection is not None:
                # set stored procedure string and input/output parameters
                procedure_string = '''\
                DECLARE @ErrorMessage varchar(max)
                EXEC [dbo].[CustomerOrdersDetail] @OrderID = ?, @ErrorMessage = @ErrorMessage OUTPUT
                SELECT @ErrorMessage AS '@ErrorMessage'
                '''
                # set customer order id parameter
                customer_order_id = int(self.read_app_config_file("stored_procedure_input_parameter", "customer_order_id"))
                procedure_parameter = (customer_order_id,)
                # get data row, columns name and error if any
                data_row, data_column, data_error = self.sqlserver_cursor_select(sqlserver_connection, procedure_string, procedure_parameter)
        except:
            self.logging_exception_message()
        finally:
            self.sqlserver_close_connection(sqlserver_connection)
        return data_row, data_column, data_error

Let’s provide some code explanation. The class ExtractData() inherits from the general ETLLibrary() class, which allows us to use all the public methods developed in that library. In this tier, it’s important to show how the PyODBC library executes a SQL Server stored procedure. Many Python application developers have asked me this question before. Here are the steps:

1. Get the SQL Server connection object.

sqlserver_connection = self.sqlserver_open_trusted_connection()

2. Set the stored procedure string.

procedure_string = '''\
DECLARE @ErrorMessage varchar(max)
EXEC [dbo].[CustomerOrdersDetail] @OrderID = ?, @ErrorMessage = @ErrorMessage OUTPUT
SELECT @ErrorMessage AS '@ErrorMessage'
'''

3. Get the input parameter(s). In our case, the customer order ID from the ‘app.cfg’ configuration file.

customer_order_id = int(self.read_app_config_file("stored_procedure_input_parameter", "customer_order_id"))
procedure_parameter = (customer_order_id,)

4. Execute the stored procedure by using the cursor object.

data_row, data_column, data_error = self.sqlserver_cursor_select(sqlserver_connection, procedure_string, procedure_parameter)    

The results after running this extract tier are shown below.

data row:
[('Pavlova', Decimal('13.9000'), 21, 0.15000000596046448), ('Steeleye Stout', Decimal('14.4000'), 70, 0.15000000596046448), ('Spegesild', Decimal('9.6000'), 30, 0.0), ('Raclette Courdavault', Decimal('44.0000'), 40, 0.15000000596046448), ('Vegie-spread', Decimal('35.1000'), 80, 0.15000000596046448)]

data columns:
['ProductName', 'UnitPrice', 'Quantity', 'Discount']

data error:
None
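
Note that the ExtractData class is decorated with @implementer(ExtractDataInterface) from the zope.interface package. The interface module itself is not listed in this article; a minimal sketch of what ‘extract_data_interface.py’ could contain, with the method name taken from the class above, is shown here.

from zope.interface import Interface

class ExtractDataInterface(Interface):
    """contract for the extract tier"""

    def customer_orders_detail_extract_sp():
        """return data rows, column names and any execution error from the stored procedure"""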

6. Application Configuration File

Each ETL tier requires the application configuration and log file paths. Usually, these paths are passed to the class constructor ‘def __init__()’. Application configuration files provide a way to parameterize an application’s behavior, making it more adaptable, maintainable, and scalable. They are a standard practice in software development to enhance flexibility and eliminate hardcoded values in the application. In our case, the configuration file contains the application settings, including SQL Server login parameters, logging formats, directory paths, file names, etc. Below is an example of the ‘app.cfg’ configuration file.

[sql_server_home]
server=server_name
database=Northwind
uid=user_id
pwd=user_password

[logging_format]
logging_format_message=%%(levelname)s: %%(asctime)s: %%(message)s
logging_format_datetime=[%%Y-%%m-%%d %%H:%%M:%%S]

[directory]
data_directory=data_directory_path

[file]
config_file=app.cfg
log_file=app.log
app_file_name=customer_orders_detail
app_file_name_ml=customer_orders_detail_ml
time_stamp=_%%Y_%%m_%%d_%%H_%%M_%%S

[stored_procedure_input_parameter]
customer_order_id=10324

[preprocessing]
decimal_place=2
date_time_format=%%Y-%%m-%%d %%I:%%M:%%S %%p

[pyinstaller_directories_files]
build_directory=build
distribution_directory=dist
spec_file=etl_installer.spec
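
The read_app_config_file method used throughout the code reads values from this file. Its implementation belongs to the ETL library and is not shown in the article; a minimal sketch based on Python's standard configparser module, assuming the method and attribute names used elsewhere in the code, could be:

import configparser

# method of the ETLLibrary class (sketch only)
def read_app_config_file(self, section, option):
    """read a single option value from the application configuration file"""
    option_value = None
    try:
        config_parser = configparser.ConfigParser()
        config_parser.read(self.app_config_file)
        # values are returned as strings; callers cast them as needed
        option_value = config_parser.get(section, option)
    except:
        self.logging_exception_message()
    return option_value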

7. Application Log File

An application log file records events, messages, or activities generated by a software application during its runtime. It is a fundamental tool for developers, administrators, and support teams, playing a crucial role in diagnosing issues, monitoring performance, ensuring security, and maintaining compliance. Effective logging is a best practice in ETL package development to enhance the reliability, maintainability, and security of applications. In our case, the application log file ‘app.log’ is set up in the ‘log’ folder. Messages are written to the log file using the function ‘etl_library.write_log_file(event_level, event_message)’. Here are some examples of ETL activities, recording events and capturing errors that occurred in the application.

  1. Good running process.

INFO: [2024-01-17 20:52:37]: etl process started.
INFO: [2024-01-17 20:52:37]: extract data started.
INFO: [2024-01-17 20:52:41]: extract data ended.
INFO: [2024-01-17 20:52:41]: transform data started.
INFO: [2024-01-17 20:52:41]: transform data ended.
INFO: [2024-01-17 20:52:41]: load data started.
INFO: [2024-01-17 20:52:41]: csv file 'customer_orders_detail_2024_01_17_20_52_37.csv' has been created.
INFO: [2024-01-17 20:52:41]: load data ended.
INFO: [2024-01-17 20:52:41]: etl process ended.

2. SQL Server connection object error.

ERROR: [2023-09-23 20:30:50]: An error occurred: [Time Stamp]: 2023-09-23 08:30:50 PM; [File Name]: C:\folder_path\etl_package_customer_orders_detail\src\library\etl_library.py; [Procedure Name]: sqlserver_open_connection_home; [Error Message]: ('08001', '[08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]SQL Server does not exist or access denied. (17) (SQLDriverConnect); [08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (Connect()). (53)'); [Error Type]: <class 'pyodbc.OperationalError'>; [Line Number]: 74; [Line Code]: sqlserver_connection = pyodbc.connect(connection_string)

3. SQL Server stored procedure execution error.

ERROR: [2023-03-22 10:04:35]: An error occurred: [Time Stamp]: 2023-03-22 10:04:35 AM; [File Name]: C:\folder_path\etl_package_customer_orders_detail\src\library\etl_library.py; [Procedure Name]: sqlserver_cursor_select; [Error Message]: ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'dbo.CustomerOrdersDetail'. (2812) (SQLExecDirectW)"); [Error Type]: <class 'pyodbc.ProgrammingError'>; [Line Number]: 86; [Line Code]: sqlserver_cursor.execute(sql, sql_parameters)
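
The write_log_file and logging_exception_message methods are also part of the ETL library and are not listed here. A simplified sketch using Python's standard logging module and the formats defined in ‘app.cfg’ could look like the following (the real error messages above include more detail, such as the file name, procedure name, and line number):

import logging
import sys

# methods of the ETLLibrary class (sketch only)
def write_log_file(self, event_level, event_message):
    """write an event message to the application log file"""
    logging.basicConfig(
        filename=self.app_log_file,
        level=logging.INFO,
        format=self.read_app_config_file("logging_format", "logging_format_message"),
        datefmt=self.read_app_config_file("logging_format", "logging_format_datetime"))
    if event_level == "info":
        logging.info(event_message)
    elif event_level == "error":
        logging.error(event_message)

def logging_exception_message(self):
    """log the current exception type and message as an error event"""
    exception_type, exception_value, _ = sys.exc_info()
    self.write_log_file("error", "An error occurred: [Error Type]: {}; [Error Message]: {}".format(exception_type, exception_value))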

8. Transform Tier

The ‘transform_data.py’ file contains the necessary code to preprocess the list of products by calculating and adding the extended price values. The code for the class ‘TransformData()’ is shown below.

import pandas as pd
from library.etl_library import ETLLibrary
from zope.interface import implementer
from transform.transform_data_interface import TransformDataInterface

@implementer(TransformDataInterface)
class TransformData(ETLLibrary):

    def __init__(self, app_config_file, app_log_file):
        super().__init__(app_config_file, app_log_file)
        self.app_config_file = app_config_file
        self.app_log_file = app_log_file

    def customer_orders_detail_transform(self, data_row, data_column):
        df_customer_orders = None
        is_transform = True
        try:
            # create pandas data frame from data row and column
            df_customer_orders = pd.DataFrame.from_records(data=data_row, columns=data_column)
            # data conversion and rounding
            decimal_place = int(self.read_app_config_file("preprocessing", "decimal_place"))
            df_customer_orders["UnitPrice"] = df_customer_orders["UnitPrice"].astype(float).round(decimals=decimal_place)
            df_customer_orders["Quantity"] = df_customer_orders["Quantity"].astype(int)
            df_customer_orders["Discount"] = df_customer_orders["Discount"].astype(float).round(decimals=decimal_place)
            # calculate extended price
            df_customer_orders["ExtendedPrice"] = df_customer_orders["Quantity"] * (1 - df_customer_orders["Discount"]) * df_customer_orders["UnitPrice"]
            # add dollar sign and round extended price
            df_customer_orders["ExtendedPrice"] = "$" + (df_customer_orders["ExtendedPrice"].round(decimals=decimal_place).astype(str))
        except:
            is_transform = False
            self.logging_exception_message()
        return df_customer_orders, is_transform

A pandas DataFrame is used to read the data, including rows and column names, from the extract tier. The columns ‘UnitPrice’ and ‘Discount’ are converted to float data types and rounded, and ‘Quantity’ is converted to an integer. The extended price column is then added to the DataFrame and calculated using the following line of code.

df_customer_orders["ExtendedPrice"] = df_customer_orders["Quantity"] * (1 - df_customer_orders["Discount"]) * df_customer_orders["UnitPrice"]

The transform tier produces the final DataFrame with the following data.

9. Load Tier

This load tier is implemented in the ‘load_data.py’ file. Based on the code below, this tier takes the pandas DataFrame from the transform tier and writes it to a CSV dataset file. That’s all.

from library.etl_library import ETLLibrary
from zope.interface import implementer
from load.load_data_interface import LoadDataInterface

@implementer(LoadDataInterface)
class LoadData(ETLLibrary):

    def __init__(self, app_config_file, app_log_file):
        super().__init__(app_config_file, app_log_file)
        self.app_config_file = app_config_file
        self.app_log_file = app_log_file

    def customer_orders_detail_to_file(self, df_data, file_path_name, file_type="csv"):
        is_load = True
        try:
            # data load to csv, hdf5 or json file
            if file_type == "csv":
                df_data.to_csv(file_path_name, index=False)
            elif file_type == "hdf5":
                df_data.to_hdf(file_path_name, key="hdf5")
            elif file_type == "json":
                df_data.to_json(file_path_name)
        except:
            is_load = False
            self.logging_exception_message()
        return is_load

Here is the data from the created CSV file.

10. Main Execution File

The main execution file runs the three ETL tiers in order. It represents a simple 3-tier pipeline process. In this file, the paths for the application configuration and log files are set, and the timestamp is set up as well. The code below includes well-defined comments, making the pipeline logic easy to understand.

import time
import datetime
import os

from extract.extract_data import ExtractData
from transform.transform_data import TransformData
from load.load_data import LoadData
from library.etl_library import ETLLibrary

def main():
    """etl 3-tier code
    """
    # get default app config and log files
    app_directory = os.path.dirname(__file__)
    app_config_file = os.path.join(app_directory, r"config\app.cfg")
    app_log_file = os.path.join(app_directory, r"log\app.log")

    # instantiate the etl library
    etl_library = ETLLibrary(app_config_file, app_log_file)

    # set file time stamp
    time_stamp_format = etl_library.read_app_config_file("file", "time_stamp")
    time_stamp_file = datetime.datetime.now().strftime(time_stamp_format)

    # etl process started
    etl_library.write_log_file("info", "etl process started.")

    # 1. data extract tier
    etl_library.write_log_file("info", "extract data started.")
    extract_data = ExtractData(app_config_file, app_log_file)
    data_row, data_column, data_error = extract_data.customer_orders_detail_extract_sp()
    if data_error is not None:
        etl_library.write_log_file("error", "extract data ended. " + str(data_error))
        exit()
    if (data_row is None) or (len(data_row) == 0):
        etl_library.write_log_file("info", "extract data ended. no data was extracted.")
        exit()
    etl_library.write_log_file("info", "extract data ended.")

    # 2. data transform (preprocessing) tier
    etl_library.write_log_file("info", "transform data started.")
    transform_data = TransformData(app_config_file, app_log_file)
    df_customer_orders, is_transform = transform_data.customer_orders_detail_transform(data_row, data_column)
    if (df_customer_orders is None) or (is_transform == False):
        etl_library.write_log_file("info", "transform data ended. no data was transformed.")
        exit()
    etl_library.write_log_file("info", "transform data ended.")

    # 3. data load tier
    etl_library.write_log_file("info", "load data started.")
    load_data = LoadData(app_config_file, app_log_file)
    data_directory = etl_library.read_app_config_file("directory", "data_directory")
    app_file_name = etl_library.read_app_config_file("file", "app_file_name")
    csv_file_name = ''.join([app_file_name, time_stamp_file, ".csv"])
    csv_file_path_name = os.path.join(data_directory, csv_file_name)
    is_file = load_data.customer_orders_detail_to_file(df_customer_orders, csv_file_path_name)
    if is_file == False:
        etl_library.write_log_file("info", "data load ended. no file was created.")
    else:
        etl_library.write_log_file("info", "csv file '{}' has been created.".format(csv_file_name))
        etl_library.write_log_file("info", "load data ended.")

    # etl process ended
    etl_library.write_log_file("info", "etl process ended.")

if __name__ == '__main__':
    main()

11. ETL Static Library

The development of programming libraries contributes to code reusability, abstraction, modularity, community collaboration, adherence to standards, efficiency, cross-platform compatibility, specialized functionality, security, etc. These libraries reduce the amount of application code and eliminate code duplication. The ETL library provides public methods that can be used from any tier; because each tier class inherits from ETLLibrary, no separate instantiation is needed. Here are the main methods used to manipulate SQL Server data with the PyODBC library.

import sys
import os
import traceback
import datetime
import time
import logging

import pyodbc
import config
import base64

class ETLLibrary(object):

    def sqlserver_open_trusted_connection(self, timeout=15):
        """get sql server connection object using windows trusted authentication"""
        sqlserver_connection = None
        try:
            connection_string = "DRIVER={SQL Server};"\
                "SERVER=" + self.read_app_config_file("sql_server_home", "server") + ";"\
                "DATABASE=" + self.read_app_config_file("sql_server_home", "database") + ";"\
                "Trusted_Connection=yes;"
            sqlserver_connection = pyodbc.connect(connection_string)
            sqlserver_connection.timeout = timeout
        except:
            sqlserver_connection = None
            self.logging_exception_message()
        return sqlserver_connection

    def sqlserver_open_connection(self, timeout=15):
        """get sql server connection object using sql server authentication"""
        sqlserver_connection = None
        try:
            connection_string = "DRIVER={SQL Server};"\
                "SERVER=" + self.read_app_config_file("sql_server_home", "server") + ";"\
                "DATABASE=" + self.read_app_config_file("sql_server_home", "database") + ";"\
                "UID=" + self.read_app_config_file("sql_server_home", "uid") + ";"\
                "PWD=" + self.read_app_config_file("sql_server_home", "pwd") + ";"
            sqlserver_connection = pyodbc.connect(connection_string)
            sqlserver_connection.timeout = timeout
        except:
            sqlserver_connection = None
            self.logging_exception_message()
        return sqlserver_connection

    def sqlserver_close_connection(self, sqlserver_connection):
        """close sql server connection object"""
        try:
            if sqlserver_connection is not None:
                sqlserver_connection.close()
        except:
            self.logging_exception_message()

    def sqlserver_cursor_select(self, sqlserver_connection, sql, sql_parameters=None):
        """execute a select statement or stored procedure and return rows, columns and error"""
        data_row, data_column, data_error = (None, None, None)
        sqlserver_cursor = None
        try:
            sqlserver_cursor = sqlserver_connection.cursor()
            if sql_parameters is not None:
                sqlserver_cursor.execute(sql, sql_parameters)
            else:
                sqlserver_cursor.execute(sql)
            data_row = sqlserver_cursor.fetchall()
            data_column = [column[0] for column in sqlserver_cursor.description]
            # when no rows are returned, check the next result set for the output error message
            if len(data_row) == 0:
                if sqlserver_cursor.nextset():
                    data_error = sqlserver_cursor.fetchone()[0]
        except:
            self.logging_exception_message()
        finally:
            self.sqlserver_close_cursor(sqlserver_cursor)
        return data_row, data_column, data_error

    def sqlserver_cursor_transaction(self, sqlserver_connection, sql, parameters=None):
        """execute an insert, update or delete statement and commit the transaction"""
        is_transaction = True
        sqlserver_cursor = None
        try:
            sqlserver_cursor = sqlserver_connection.cursor()
            if parameters is not None:
                sqlserver_cursor.execute(sql, parameters)
            else:
                sqlserver_cursor.execute(sql)
            sqlserver_connection.commit()
        except:
            is_transaction = False
            self.logging_exception_message()
        finally:
            self.sqlserver_close_cursor(sqlserver_cursor)
        return is_transaction

    def sqlserver_close_cursor(self, sqlserver_cursor):
        """close sql server cursor object"""
        try:
            if sqlserver_cursor is not None:
                sqlserver_cursor.close()
        except:
            self.logging_exception_message()
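
The sqlserver_cursor_transaction method is not used in the customer orders example, but it supports tiers that need to write data back to SQL Server. A hypothetical call from inside a tier class, with placeholder parameter values, might look like this:

# hypothetical usage of the transaction helper (values for illustration only)
sqlserver_connection = self.sqlserver_open_connection()
if sqlserver_connection is not None:
    sql = "UPDATE [Order Details] SET Discount = ? WHERE OrderID = ? AND ProductID = ?"
    is_transaction = self.sqlserver_cursor_transaction(sqlserver_connection, sql, (0.10, 10324, 1))
    self.sqlserver_close_connection(sqlserver_connection)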

12. ETL DNA Sequence Dataset Preprocessing

The ETL package development approach described here was applied to address imbalanced classes in Machine Learning DNA sequence classification. The paper ‘Advanced DNA Sequences Preprocessing for Deep Learning Networks’ proposes a 5-step pipeline for preprocessing DNA sequence datasets, including:

1. ETL DNA Sequence Dataset Cleaning
2. ETL DNA Sequence Dataset Label Encoding
3. ETL DNA Sequence Dataset SMOTE
4. ETL DNA Sequence Dataset Preprocessed for Deep Learning Networks
5. ETL DNA Sequence Final Dataset Cleaning

Because data preprocessing is very important for Machine Learning algorithms, I’ll be providing more ETL solutions applicable to Life Sciences disciplines, including Biology, Genetics, Ecology, Biochemistry, Biotechnology, etc. Stay connected! Some of the latest best practices for Machine Learning algorithms applied in genomics Life Sciences have been published on Medium.com in the paper Machine Learning Applications in Genomics Life Sciences by Ernest Bonat, Ph.D.

13. Conclusion

1. The 3-tier architecture design and development of ETL packages is an excellent software design in which the extract, transform, and load tiers operate independently, promoting modularity, scalability, and maintainability.

2. Using Object-Oriented Programming (OOP) patterns for each ETL tier provides good encapsulation of business requirements and supports automated continuous integration.

3. The ETL static library provides programming code reusability, abstraction, and portability.

--

Ernest Bonat, Ph.D.

I’m a Senior Data Scientist and Engineer consultant. I work on Machine Learning application projects for Life Sciences using Python and the Python Data Ecosystem.