Advanced Synthetic DNA Sequence Generation and Preprocessing for Natural Language Processing
Ernest Bonat, Ph.D.
Updated 01/30/2025
1. Overview
2. Why is creating synthetic DNA sequence data very important?
3. Promoter Gene Sequences Dataset
4. Building Machine Learning Random Forest Model
5. Using Gretel Platform to Generate Synthetic DNA Sequences
6. DNA Sequences ETL Preprocess Pipeline
7. DNA Sequence Production Machine Learning Model Test
8. Conclusions
1. Overview
This research presents a practical approach to generating and preprocessing synthetic DNA sequences for application in Natural Language Processing (NLP) algorithms and data-driven Machine Learning (ML) pipelines. Using a small promoter gene sequence dataset as a foundation, the study addresses the challenge of limited data availability by employing synthetic data generation through the Gretel platform and developing an ETL (Extract, Transform, Load) pipeline for preprocessing. For practical ML applications in the Life Sciences, it is highly recommended to read “Machine Learning Applications in Genomics Life Sciences” by Ernest Bonat, Ph.D.
2. Why is creating synthetic DNA sequence data very important?
Creating synthetic DNA sequence data is critically important for advancing science and technology because it allows researchers to simulate, analyze, and innovate without depending on real biological samples. Below are key reasons why synthetic DNA sequence data is particularly important:
1. Simulation and Modeling
- Testing Hypotheses: Synthetic DNA sequences can model real-world genetic variations, enabling researchers to test specific hypotheses without needing access to actual biological samples.
- Designing and Optimizing Experiments: Scientists can simulate the behavior of DNA sequences to refine experimental conditions before testing with live samples.
2. Developing Machine Learning and AI Models
Synthetic DNA data is crucial for training computational models in genomics and bioinformatics:
- Data Augmentation: In fields like predictive genomics, synthetic sequences increase the diversity of training datasets, improving model accuracy.
- Algorithm Development: New methods for sequence alignment, variant calling, or motif discovery can be tested using controlled, synthetic datasets.
3. Accelerating Genetic and Genomic Research
- Studying Gene Functions: Synthetic DNA sequences allow researchers to introduce specific mutations or variations to study how genes and regulatory elements work.
- Exploring Genomic Variability: Synthetic data enables researchers to simulate different populations or evolutionary scenarios.
4. Supporting Gene Therapy Development
Synthetic DNA sequences are used to test therapeutic approaches in a controlled environment:
- Optimizing Gene Editing: Techniques like CRISPR-Cas9 rely on synthetic DNA to test editing strategies.
- Creating Gene Constructs: Synthetic sequences are critical for designing therapeutic vectors, such as those used in gene therapy.
5. Ethical and Safe Alternatives
- Reducing Dependence on Natural Samples: Synthetic data eliminates the need for collecting sensitive or hard-to-access samples, such as human or endangered species’ DNA.
- Bioethical Standards: It avoids ethical concerns about using live organisms or human tissue in research.
6. DNA-Based Data Storage
Synthetic DNA sequence data is essential for advancing DNA-based data storage:
- Encoding Digital Information: Artificial DNA sequences are used to encode digital data for high-density, long-term storage.
- Optimizing Read/Write Methods: Synthetic data helps researchers refine DNA synthesis and sequencing techniques for data storage applications.
7. Synthetic Biology and Biotechnology
- Designing New Organisms: Synthetic DNA sequences enable the creation of engineered microbes with custom genetic instructions for producing biofuels, pharmaceuticals, or materials.
- Metabolic Pathway Engineering: Artificial sequences are used to optimize biochemical pathways in organisms to enhance production efficiency.
8. Advancing Education and Training
- Teaching Bioinformatics: Synthetic data is invaluable for teaching genomics and bioinformatics concepts in a controlled, reproducible manner.
- Safe Learning Environment: Synthetic sequences eliminate safety risks associated with handling pathogenic or hazardous DNA samples.
9. Enabling Personalized Medicine
- Simulating Patient-Specific Genomes: Synthetic data can mimic patient-specific genetic variations for testing drug efficacy or predicting disease susceptibility.
- Designing Precision Therapies: Synthetic sequences help create personalized DNA constructs for treatments like cancer immunotherapy.
10. Controlling Experimental Conditions
- Reducing Noise in Data: Synthetic DNA sequences can eliminate variability and noise, providing clean datasets for benchmarking analytical tools.
- Benchmarking Tools: Synthetic data helps evaluate the performance of sequencing pipelines, error-detection tools, and variant-calling algorithms.
11. Exploring Rare or Hypothetical Scenarios
- Simulating Rare Mutations: Researchers can create synthetic DNA to study the effects of rare or hypothetical mutations that might not be available in natural datasets.
- Evolutionary Studies: Synthetic sequences can model ancestral or extinct genomes to study evolutionary processes.
12. Pioneering Research in Epigenetics
Synthetic DNA data can include modifications (e.g., methylation patterns) to study epigenetic mechanisms and their impact on gene expression.
Challenges and Future Considerations
While synthetic DNA sequence data is incredibly valuable, it must be used responsibly:
- Data Validation: Synthetic data must be validated to ensure it accurately represents biological systems.
- Ethical Concerns: Synthetic sequences should be carefully regulated to avoid misuse (e.g., creating harmful synthetic pathogens).
Creating synthetic DNA sequence data is vital for advancing genomics, bioinformatics, and biotechnology. It provides a safe, ethical, and scalable way to drive innovation, simulate biological systems, and optimize experimental workflows, ultimately unlocking new possibilities in science and medicine.
3. Promoter Gene Sequences Dataset
The promoter gene sequences dataset is used in this paper. Promoters are regions in DNA sequences that are recognition sites for RNA polymerase to start transcription. These promoter sequences are fairly conserved throughout each domain of life, so they allow for gene identification even when most gene sequences differ. Gene identification is important when trying to identify the DNA sequences of proteins, which perform most of the biochemical processes in the body. A general definition can be found on the Promoter NHGRI site.
The CSV dataset file can be downloaded from the UCI Machine Learning Repository Molecular Biology (Promoter Gene Sequences). The dataset used in this study contains 106 instances in total, with two classes distributed equally: 53 promoter sequences (positive cases) and 53 non-promoter sequences (negative cases). This dataset was collated by Towell to evaluate a hybrid learning algorithm, KBANN, which uses examples to inductively refine pre-existing knowledge. Each instance consists of 57 base-pair positions, starting at position -50 and ending at position +7.
The pandas DataFrame info of the CSV file ‘promoter_dna_sequence_original_106.csv’ is shown below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dna_sequence 106 non-null object
1 dna_label 106 non-null int64
dtypes: int64(1), object(1)
Six rows will be removed from the original file, three from each class (labels 1 and 0), and reserved for production model deployment testing. This step is crucial for validating the ML model’s performance on the real, original data. Many open-source papers and projects overlook this critical step in the Machine Learning workflow. Below is part of the CSV spreadsheet file.
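The production holdout step described above can be sketched with pandas. This is a minimal illustration, not the article’s actual code: the toy DataFrame stands in for the 106-row CSV, and the stratified sampling by label is an assumed implementation detail.

```python
import pandas as pd

# Toy stand-in for the promoter dataset, assuming columns 'dna_sequence'
# and 'dna_label'; in practice the 106-row CSV file would be loaded instead.
df = pd.DataFrame({
    "dna_sequence": [f"ATCG{i:02d}" for i in range(10)],
    "dna_label": [1, 0] * 5,
})

# Hold out 3 rows per class for production testing; the rest becomes the
# training file (random_state fixed for reproducibility)
holdout = df.groupby("dna_label", group_keys=False).sample(n=3, random_state=42)
train_df = df.drop(holdout.index)

print(holdout["dna_label"].value_counts().to_dict())  # 3 rows of each class
```

The same pattern applied to the real file would produce the 100-row training CSV and the 6-row production test CSV used later in the article.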
4. Building Machine Learning Random Forest Model
Now, the new ‘promoter_dna_sequence_original_100.csv’ file contains 100 rows. Let’s try to build a Random Forest model using Natural Language Processing (NLP) techniques provided in Advanced DNA Sequence Text Classification Using Natural Language Processing. In this case the following functions were used.
# dna k-mers concatenation
X = dna_k_mers_concatenation(X, config.K_MERS_LENGTH)
# dna k-mers vectorization
X, tfidf_vectorizer = dna_k_mers_vectorization(X)
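The bodies of these two helpers are not shown here; a hedged sketch of what they may do, assuming the usual k-mer plus TF-IDF approach, is the following (the function names mirror the article, the implementations are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def dna_k_mers_concatenation(sequences, k):
    """Split each DNA sequence into overlapping k-mers joined as one 'sentence'."""
    return [" ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
            for seq in sequences]

def dna_k_mers_vectorization(k_mer_sentences):
    """Fit a TF-IDF vectorizer over the k-mer 'sentences'."""
    tfidf_vectorizer = TfidfVectorizer()
    X = tfidf_vectorizer.fit_transform(k_mer_sentences)
    return X, tfidf_vectorizer

# A 20-base sequence with k=6 yields 20 - 6 + 1 = 15 overlapping k-mers
sentences = dna_k_mers_concatenation(["tactagcaatacgcttgcgt"], k=6)
X, vectorizer = dna_k_mers_vectorization(sentences)
print(X.shape[0])  # 1 document row in the TF-IDF matrix
```

Returning the fitted vectorizer matters: it must be serialized and reused at prediction time so production sequences are transformed into the same feature space.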
The bar chart plot below shows the absolute label class distribution. Since the classes are balanced, SMOTE algorithms will not be applied at this time.
From the results of the program below, we can see that the Random Forest model is overfitted, with a 100% validation accuracy score and a 90% test score.
validate metrics
classification accuracy score:
100.0
classification confusion matrix:
[[4 0]
[0 6]]
classification report:
precision recall f1-score support
0 1.00 1.00 1.00 4
1 1.00 1.00 1.00 6
accuracy 1.00 10
macro avg 1.00 1.00 1.00 10
weighted avg 1.00 1.00 1.00 10
test metrics
classification accuracy score:
90.0
classification confusion matrix:
[[3 1]
[0 6]]
classification report:
precision recall f1-score support
0 1.00 0.75 0.86 4
1 0.86 1.00 0.92 6
accuracy 0.90 10
macro avg 0.93 0.88 0.89 10
weighted avg 0.91 0.90 0.90 10
In general, this can occur when the dataset has an insufficient number of rows, such as only 100 here. We could simply request a larger dataset file. However, suppose no additional dataset is available, and the ML model still needs to be built for prediction and for important patient healthcare decisions. This scenario can arise at any time in any field of Life Sciences. The question for the Data Scientist is, ‘What should be done?’ Well, a simple solution is to generate synthetic DNA sequence data based on the 100 existing rows. Let’s try to do that using the Gretel platform.
5. Using Gretel Platform to Generate Synthetic DNA Sequences
Gretel is a synthetic data platform purpose-built for AI. It generates artificial, synthetic datasets with the same characteristics as real data, so you can improve AI models without compromising on privacy. Open the Gretel site, create an account, and generate your API key. On the ‘Dashboard’ page, choose the preconfigured blueprint ‘Synthesize tabular data with Navigator Fine Tuning’ and follow the instructions.
Let’s try to generate 10,000 synthetic rows from the 100 original DNA sequence rows. After the CSV file promoter_dna_sequence_synthetics_10000.csv was created, we observed some malformed generated DNA sequences, shown below. It is unclear why the Gretel platform produces these non-functional DNA sequences, given that the original dataset consists solely of combinations of the nucleotides A, T, C, and G.
cattaaaaaaftopsacgcttagccgcalgturhiCGTZguACGCTAATCGCATCTTCC
gcaaataatcaatgtggacttttctgccgtgattatagacatingгаcgacgdicgcg
cfgsrrgctacmgttcggtggã‚¿O retired cttcttctggcgtactccaaga
tgcacgggttGCGCCGTTTCTTGGGTTTTGTGACTAGSFTGSCTACGfATYnCA
atcgctcaaco activities ctttctwct nagc CTCAATSvWyCTXYFAACGTCATT
tabgtcagtttat cc actu cut emettt hhgtgt taacgtagactaccguérékt
To fix these synthetic DNA sequence errors and create the final ML CSV file, a data ETL preprocessing pipeline is required. A good practical implementation of ETL package development can be found in the “ELT Package Development with 3-Tier Architecture for Data Engineering” paper.
6. DNA Sequences ETL Preprocess Pipeline
A DNA sequence ETL preprocessing pipeline library was developed to create the final CSV file to be applied in ML algorithms. This pipeline contains three steps.
- DNA sequences original preprocess file
- DNA sequences synthetic preprocess file
- DNA sequences final ML file development
The main run_etl_pipeline() function code is shown below.
def run_etl_pipeline(self):
    try:
        self.logger.info("Running ETL pipeline...")
        self.dna_extract_data_csv()
        self.dna_remove_leading_trailing_spaces()
        self.dna_convert_upper_case()
        self.dna_validate_nucleotides()
        self.dna_remove_empty_row()
        self.dna_truncate_column()
        self.dna_padding_column()
        self.dna_load_data_csv()
        self.logger.info("ETL Pipeline completed successfully")
    except Exception:
        self.logger.exception("ETL Pipeline stopped due to an error.")
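The bodies of the individual pipeline steps are not shown in this paper. A minimal sketch of the nucleotide-validation step, assuming a pandas DataFrame with a ‘dna_sequence’ column and a simple regular-expression check, could look like this (the standalone function name and details are illustrative, not the library’s actual code):

```python
import pandas as pd

def validate_nucleotides(df, column="dna_sequence"):
    """Keep only rows whose sequence contains solely A, T, C, G after
    stripping whitespace and upper-casing; drop everything else."""
    cleaned = df[column].str.strip().str.upper()
    mask = cleaned.str.match(r"^[ATCG]+$", na=False)  # na=False also drops NaN rows
    out = df.loc[mask].copy()
    out[column] = cleaned[mask]
    return out

# Two valid rows, one with non-nucleotide characters, one empty
df = pd.DataFrame({
    "dna_sequence": ["atcgGT", "cattaaaaaaftops", None, "TTAA"],
    "dna_label": [1, 0, 1, 0],
})
print(validate_nucleotides(df).shape)  # only the two valid rows survive
```

A filter like this is what reduces the 10,000 synthetic rows to the 8,911 valid ones reported in the pipeline logs.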
Now we have two CSV DNA sequence files: the original and the synthetic one.
- ‘promoter_dna_sequence_original_100.csv’
- ‘promoter_dna_sequence_synthetics_10000.csv’
These two CSV files need to be processed using the ETL pipeline procedures. Below are the results of the run_etl_pipeline() function for both CSV files.
2025-01-02 08:56:35 - INFO: Running ETL pipeline...
2025-01-02 08:56:35 - INFO: Extracting csv input file: promoter_dna_sequence_original_100.csv...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Removing leading and trailing spaces...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Converting DNA sequence string to upper case...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Validating DNA nucleotides ATCG...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Removing empty (NaN or None) rows...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Truncating sequence column...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Padding sequence column...
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: Loading csv output file: promoter_dna_sequence_original_100_preprocess.csv...
2025-01-02 08:56:35 - INFO: CSV output file promoter_dna_sequence_original_100_preprocess.csv created successfully.
2025-01-02 08:56:35 - INFO: Dataframe shape: (100, 2)
2025-01-02 08:56:35 - INFO: ETL Pipeline completed successfully
2025-01-02 08:56:35 - INFO: Running ETL pipeline...
2025-01-02 08:56:35 - INFO: Extracting csv input file: promoter_dna_sequence_synthetic_10000.csv...
2025-01-02 08:56:35 - INFO: Dataframe shape: (10000, 2)
2025-01-02 08:56:35 - INFO: Removing leading and trailing spaces...
2025-01-02 08:56:35 - INFO: Dataframe shape: (10000, 2)
2025-01-02 08:56:35 - INFO: Converting DNA sequence string to upper case...
2025-01-02 08:56:35 - INFO: Dataframe shape: (10000, 2)
2025-01-02 08:56:35 - INFO: Validating DNA nucleotides ATCG...
2025-01-02 08:56:35 - INFO: Dataframe shape: (8911, 2)
2025-01-02 08:56:35 - INFO: Removing empty (NaN or None) rows...
2025-01-02 08:56:35 - INFO: Dataframe shape: (8911, 2)
2025-01-02 08:56:35 - INFO: Truncating sequence column...
2025-01-02 08:56:35 - INFO: Dataframe shape: (8911, 2)
2025-01-02 08:56:35 - INFO: Padding sequence column...
2025-01-02 08:56:35 - INFO: Dataframe shape: (8911, 2)
2025-01-02 08:56:35 - INFO: Loading csv output file: promoter_dna_sequence_original_synthetic_10000_preprocess.csv...
2025-01-02 08:56:35 - INFO: CSV output file promoter_dna_sequence_original_100_preprocess_synthetic_preprocess_yes_duplicates.csv created successfully.
2025-01-02 08:56:35 - INFO: Dataframe shape: (8911, 2)
2025-01-02 08:56:35 - INFO: ETL Pipeline completed successfully
Final ML output file: promoter_dna_sequence_ml_final.csv created successfully.
Dataframe shape: (9011, 2)
As you can see, the original CSV file remains intact, as the DataFrame shape values (100, 2) are the same before and after applying the ETL pipeline procedures. For the synthetic CSV file, however, the DataFrame shape values are different. The Gretel platform generated 10,000 synthetic rows as specified. After applying the ETL pipeline procedures, the DataFrame shape was reduced to (8,911, 2), meaning 1,089 rows were removed. The final ML CSV file will have a DataFrame shape of (9,011, 2).
Now, let’s run the same NLP algorithm on this final ML CSV file. As you can observe, the classes are imbalanced, so the SMOTE algorithm will need to be applied to balance the classes.
After applying SMOTE, the training data is balanced and ready for the Machine Learning models, as shown below.
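To make the balancing step concrete, here is a simplified SMOTE-style oversampling sketch on toy numeric data. The real project would use the standard SMOTE implementation (which interpolates toward k-nearest minority neighbors); this self-contained version picks a random minority neighbor instead, and is applied to training data only, never to the validation or test splits.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like_oversample(X, y, minority_label):
    """Synthesize minority samples by interpolating between pairs of existing
    minority samples until both classes have the same count (simplified:
    real SMOTE interpolates toward one of the k nearest neighbors)."""
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    synthetic = []
    for _ in range(n_needed):
        i, j = rng.integers(0, len(X_min), size=2)  # sample and its "neighbor"
        lam = rng.random()                          # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    X_new = np.vstack([X] + [np.asarray(synthetic)])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new

# Imbalanced toy data: 70 majority rows, 30 minority rows
X = rng.normal(size=(100, 5))
y = np.array([0] * 70 + [1] * 30)
X_bal, y_bal = smote_like_oversample(X, y, minority_label=1)
print(np.bincount(y_bal))  # both classes now have 70 samples
```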
Here are the final results.
validate metrics
classification accuracy score:
97.45
classification confusion matrix:
[[405 16]
[ 7 473]]
classification report:
precision recall f1-score support
0 0.98 0.96 0.97 421
1 0.97 0.99 0.98 480
accuracy 0.97 901
macro avg 0.98 0.97 0.97 901
weighted avg 0.97 0.97 0.97 901
test metrics
classification accuracy score:
97.23
classification confusion matrix:
[[402 12]
[ 13 475]]
classification report:
precision recall f1-score support
0 0.97 0.97 0.97 414
1 0.98 0.97 0.97 488
accuracy 0.97 902
macro avg 0.97 0.97 0.97 902
weighted avg 0.97 0.97 0.97 902
Both validation and test metrics show high performance, with accuracy above 97%. Class 1 achieves slightly higher F1-scores due to better recall. The confusion matrices show minimal errors, with few false positives and false negatives in both the validation and test datasets. These strong results indicate that the model is not overfitting, as the validation and test accuracy scores are very close. However, I believe we still need to validate whether our ML model can be used in production with new, original DNA sequences. Let’s deploy the model and find out.
7. DNA Sequence Production Machine Learning Model Test
The production model test implementation is shown below.
def main():
    # dna sequence production dataset loading (6 dna rows: 3 with class 1 and 3 with class 0)
    csv_path_file = os.path.join(config.CSV_FOLDER_PATH, "promoter_dna_sequence_original_3_preprocess_production_data.csv")
    df = pd.read_csv(filepath_or_buffer=csv_path_file)

    # pickle deserialize tfidf vectorizer object
    tfidf_vectorizer_pickle_path = os.path.join(config.PICKLE_FOLDER_PATH, "tfidf_vectorizer.pkl")
    pickle_class = PickleClass(tfidf_vectorizer_pickle_path)
    tfidf_vectorizer = pickle_class.pickle_deserialize_object()

    # pickle deserialize model classifier object
    model_classifier_pickle_path = os.path.join(config.PICKLE_FOLDER_PATH, "model_classifier.pkl")
    pickle_class = PickleClass(model_classifier_pickle_path)
    model_classifier = pickle_class.pickle_deserialize_object()

    # predict the dna label for each sequence
    df['predicted_label'] = df['dna_sequence'].apply(lambda sequence: predict_dna_label(sequence, config.K_MERS_LENGTH, tfidf_vectorizer, model_classifier))
    print(df)

    # create a csv file to compare the original dna labels with the predicted labels
    csv_file_predicted = "promoter_dna_sequence_original_3_preprocess_production_data_predicted.csv"
    csv_path_file = os.path.join(config.CSV_FOLDER_PATH, csv_file_predicted)
    df.to_csv(path_or_buf=csv_path_file, index=False)
    print(f"CSV file {csv_file_predicted} has been saved successfully")

if __name__ == '__main__':
    main()
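The predict_dna_label() helper called above is not shown in this paper. A hedged reconstruction, assuming it mirrors the training-time preprocessing (k-mer split, transform with the already-fitted TF-IDF vectorizer, then predict), could look like this:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def predict_dna_label(sequence, k_mers_length, tfidf_vectorizer, model_classifier):
    """Predict the label of one DNA sequence using the SAME fitted vectorizer
    and classifier that were serialized at training time."""
    # Split the sequence into overlapping k-mers joined as one "sentence"
    k_mers = " ".join(sequence[i:i + k_mers_length]
                      for i in range(len(sequence) - k_mers_length + 1))
    # Transform into the training feature space, then predict
    X = tfidf_vectorizer.transform([k_mers])
    return model_classifier.predict(X)[0]

# Tiny end-to-end check with a toy vectorizer and classifier
train_sequences = ["ATCGAT", "GGGCCC", "ATCGAA", "GGGCCG"]
train_labels = [1, 0, 1, 0]
docs = [" ".join(s[i:i + 3] for i in range(len(s) - 2)) for s in train_sequences]
vectorizer = TfidfVectorizer().fit(docs)
classifier = RandomForestClassifier(random_state=42).fit(
    vectorizer.transform(docs), train_labels)
print(predict_dna_label("ATCGAT", 3, vectorizer, classifier))
```

The key point is that the vectorizer must be deserialized from the training run, not refit on production data, or the feature columns would no longer match what the classifier expects.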
Below is the final predicted DNA sequence file. As you can see, the values in the columns ‘dna_label_original’ and ‘dna_label_predicted’ are the same. This confirms that the developed ETL preprocessing pipeline is essential for finalizing the synthetic DNA sequence dataset process for the ML project workflow. I believe this is good news for Bioinformatics and Life Sciences research and applications, especially when there is not enough data to apply Statistical Analysis and Machine Learning algorithms.
8. Conclusions
1. Dataset Selection and Preparation:
- Utilized promoter gene sequences from the UCI ML Repository (106 instances, balanced classes).
- Reserved 6 rows (3 per class) for production testing, ensuring robust deployment validation.
2. Machine Learning Pipeline:
- Developed a Random Forest model leveraging NLP techniques like DNA k-mer concatenation and TF-IDF vectorization.
- Highlighted overfitting issues with the original dataset and introduced synthetic data generation as a solution.
3. Synthetic Data Generation with Gretel:
- Generated 10,000 synthetic DNA sequences, identifying inconsistencies in synthetic sequence generation, particularly with platforms like Gretel. Some synthetic sequences contained errors that required extensive preprocessing through an ETL pipeline.
- Combined 100 original sequences with 8,911 valid synthetic sequences, resulting in a final ML dataset of 9,011 rows.
4. Class Imbalance Resolution and Model Optimization:
- Applied SMOTE to balance the dataset, improving training and test performance.
- Achieved a validation accuracy of 97.45% and a test accuracy of 97.23%, demonstrating robust generalization.
5. Production Deployment and Testing:
- Validated model performance on real-world production data, ensuring reliability and accuracy.
- Provided reproducible workflows and code implementations for further development in DNA sequence analysis.