Advanced Paternity DNA Sequence Classification Using Dynamic Programming and Machine Learning Algorithms. Part 2

Ernest Bonat, Ph.D.
18 min readJan 19, 2025

Ernest Bonat, Ph.D., Mohamed Abdulaziz Eisa, B.Sc. AI

1. Overview
2. Data Collection
3. Data Preparation
4. Data Merging
5. Feature Engineering
6. Genetic Similarity in Full Siblings
7. DNA Natural Language Processing (NLP)
8. DNA K-Mers Processing
9. NLP Feature Transformation
10. DNA Bag of Words Processing
11. Feature Scaling
12. Data Splitting
13. Machine Learning Models
14. Summary and Observations
15. Project Coding APIs
16. Feature Work
17. References

1. Overview

DNA forensics has been revolutionized by high-throughput sequencing (HTS) technologies, particularly through the analysis of single nucleotide polymorphisms (SNPs). This study explores a structured approach to processing, merging, and analyzing SNP datasets spanning eight generations, aimed at developing efficient machine learning models for paternity testing and other genetic analyses. Emphasis is placed on feature engineering, DNA natural language processing, and the practical deployment of these models. This paper is part 2 of “Advanced Paternity DNA Sequence Classification Using Dynamic Programming and Machine Learning Algorithms. Part 1”. Ernest Bonat, Ph.D., Mohamed Abdulaziz Eisa, B.Sc. AI, Sep 12, 2024.

2. Data Collection

The high-throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) has opened new avenues in DNA forensics, offering applications such as identification, mixture analysis, kinship prediction, and biogeographic ancestry prediction. The dataset under consideration, generated on May 14, 2018, provides a valuable resource for the development and testing of advanced DNA forensics algorithms. (11 million SNP Profiles datasets (Harvard Data verse))

The profiles are associated with individuals having defined ethnicities and family relationships. The dataset spans eight generations, providing a comprehensive genetic landscape. Each generation dataset encompasses approximately 21,892 rows. Each row represents an individual, providing a manageable yet comprehensive sample size. The dataset is structured into 10 columns, capturing various genetic and demographic attributes.

Figure 1: Dataset Example

The first question to address is: How can we Prepare this data to make it suitable for a Machine Learning (ML) model? Below is the answer to this question.

3. Data Preparation

Data Preparation is the process of preparing raw data so that it is suitable for further processing and analysis. Key steps include collecting, cleaning, and labeling raw data into a form suitable for Machine Learning (ML) algorithms and then exploring and visualizing the data. Data preparation can take up to 80% of the time spent on an ML project.

The data was integrated from two distinct datasets: the child dataset and the parent dataset. This section details the systematic process employed to merge these datasets, ensuring the integrity and relevance of the merged data for subsequent analysis. Before merging the datasets, we prepared the data by truncating and concatenating sequences to standardize the data structure across both datasets.

The data pipeline follows a systematic approach to process genetic sequence data and establish parent-child relationships in the dataset. Each sequence is standardized to a specific length (1000 bases per allele) before merging, ensuring consistency in the final analysis.

Figure 2: DNA Sequence First Pipeline of Data Preparation

In our case, before merging the datasets, we prepared the data by truncating and concatenating sequences to standardize the data structure across both datasets.

Step-by-Step Process:

  1. Truncate Sequences: Each allele sequence in both child and parent datasets was truncated to the first 1000 bases. This step was crucial to maintain a consistent sequence length for analysis.
  2. Concatenate Sequences: For each individual, the truncated sequences of Allele1 and Allele2 were concatenated to form a full sequence of DNA.

For the child dataset we have.

Figure 3: Child Truncate DNA sequence
Figure 4: Length of DNA after Truncate for child

For the parent dataset we have.

Figure 5: Parent Truncate DNA Sequence
Figure 6: Length of DNA after Truncate for Parent

4. Data Merging

The next step was to merge the child and parent DataFrames using the parent identifiers provided in the child dataset. Here are the objectives.

  • To create a dataset that links each child to their respective parents.
  • To ensure that the merged data retains the genetic information necessary for further analysis.

The merging strategy could be defined in the following steps.

  1. Father’s Data: Merged the parent dataset with the child dataset on the ParentF (Father) identifier.
  2. Mother’s Data: Merged the parent dataset with the child dataset on the ParentM (Mother) identifier.
  3. Combine Merged Data: Combined the results of the two merges into a single DataFrame, ensuring that each child is associated with both parents.
Figure 7: Merging Strategy for data of father
Figure 8: Merging Strategy for data of mother
Figure 9: Clean up and combine the merged DataFrames

To ensure the accuracy and integrity of the merged dataset, we conducted the following validations:

  • Length Check: Verified the length of sequences to confirm that truncation and concatenation were correctly applied.
  • Sample Inspection: Randomly sampled the merged DataFrame to visually inspect the correctness of parent-child associations.
Figure 10: Sample of merged DataFrames

5. Feature Engineering

Feature Engineering is the process of transforming raw data into a format that enhances the performance of machine learning models. It aims to improve model accuracy by providing more meaningful and relevant information, and it is considered one of the most valuable techniques. The key steps in feature engineering include feature creation, transformation, feature extraction, and feature selection.

The objective is to perform feature extraction to create a likelihood feature and a target label feature for paternity testing based on DNA sequence similarity. The feature creation steps are provided below:

Step 1: Creating a Global Weighted Alignment Sequence Feature

· Implement the Needleman-Wunsch algorithm to compare DNA sequences.

· Calculate the similarity score between parent and child DNA sequences.

Figure 11: Illustration for Target Labels Method
Figure 12: Applying Needleman-Wunsch algorithm

How similar is the DNA of siblings?

The concept of ‘shared’ DNA differs depending on how much DNA is being compared. When comparing our whole genome, all 3 billion letters, we share about 99.9% with every other human. However, we can also analyze shared DNA by narrowing our scope and just comparing the 3 million genetic differences we know exist between individuals. Of these 3 million differences, on average we share about 50 percent of those with our full siblings.

Children inherit half of their DNA from their mother and half from their father. However, unless they are identical twins, siblings won’t inherit exactly the same DNA. Depending on their biological sex, a parent will produce either a sperm or egg cell by meiosis. Meiosis is a process where a single cell divides twice to produce four cells containing half the original amount of genetic information.

Step 2: Generating a Target Data Feature

  • Generate a binary target label based on the similarity score.
Figure 13: Generating a Target Feature

We have a lot of technical stuff for different datasets. In our case, we are using the K-Mer counting algorithm for feature extraction.

6. Genetic Similarity in Full Siblings

  • Full siblings share about 50% of their genetic differences (the 3 million SNPs) on average.
  • This sharing is due to the random combination of alleles inherited from the parents.
  • Siblings are genetically distinct unless they are identical twins, who share 100% of their DNA.
Figure 14: Final Shape of Data before Processing

7. DNA Natural Language Processing (NLP)

DNA NLP is a novel interdisciplinary field that draws parallels between DNA sequences and natural language to apply computational linguistics techniques to genomic data. DNA sequences can be thought of as “texts” written in a four-letter alphabet (A, T, C, G), and the analysis of these sequences can benefit from methods traditionally used in NLP.

Figure 15: DNA Natural Language Processing Steps

8. DNA K-Mers Processing

A K-mer is a substring of length k derived from a longer DNA or RNA sequence. K-mers are fundamental in bioinformatics for analyzing nucleotide sequences such as DNA or RNA. They allow the representation, comparison, and manipulation of sequences in a computationally efficient way.

A window of length 6 was created and slide it from left to right, shifting one character at a time. If the length of the given DNA sequence is N, we would end up with N — k+1 total of K-Mers.

Figure 16: DNA Sequence using K-mers

Examples of DNA sequences.

Representation of the K-mers in the dataset.

Processing Steps

  1. The function uses a list comprehension to iterate over the sequence.
  2. sequence[i:i+k] extracts a substring (k-mer) starting at position i and of length k.

The range for i is from 0 to len(sequence) - k + 1, ensuring that the last k-mer generated is exactly of length k. The function returns a list of K-mers of length k.

Figure 16: k-mers of length k

Let’s go through an example to illustrate how the function works.

Example Input

  • sequence = "ACGTACGT"
  • k = 3

Execution Steps

  • The length of the sequence is 8, and k is 3.
  • The range for i will be from 0 to 8 - 3 + 1 = 6. This means i will take values 0, 1, 2, 3, 4, 5.
  • For each value of i, the substring sequence [i: i+3] is extracted.

Let’s see each step in detail:

  • When i = 0, sequence [0:3] is "ACG".
  • When i = 1, sequence [1:4] is "CGT".
  • When i = 2, sequence [2:5] is "GTA".
  • When i = 3, sequence [3:6] is "TAC".
  • When i = 4, sequence [4:7] is "ACG".
  • When i = 5, sequence [5:8] is "CGT".

9. NLP Feature Transformation

From NLP the count vectorizer method is used. It is used to transform a given text into a vector based on the frequency (count) of each word that occurs in the entire text.

What is word embedding (vectorization)?

Word embeddings are the texts converted into numbers and there may be different numerical representations of the same text. But before we dive into the details of Word Embeddings. It’s known that many Machine Learning algorithms and almost all Deep Learning Architectures are not capable of processing strings or plain text in their raw form. In a broad sense, they require numerical numbers as inputs to perform any sort of task, such as classification, regression, clustering, etc. Also, from the huge amount of data that is present in the text format, it is imperative to extract some knowledge out of it and build any useful applications.

What is Count Vectorizer?

Count Vectorizer method creates a document term matrix, which is a set of dummy variables that indicates if a particular word appears in the document. It fits and learn the word vocabulary and try to create a document term matrix in which the individual cells denote the frequency of that word in a particular document, which is also known as term frequency, and the columns are dedicated to each word in the corpus.

Figure 17: Count vectorizer method

Methods also we try to use it and make good results with data preprocessing

10. DNA Bag of Words Processing

The DNA Bag of Words (BoW) is a technique inspired by the classic Bag of Words model in Natural Language Processing (NLP), adapted to analyze DNA or genomic sequences. This approach represents DNA sequences as collections of “words” (e.g., K-mers), ignoring their order but capturing their frequency or presence.

Steps:

1. Convert Column to List

  • word_list = list(word_column): Converts the input column to a list of strings.

2. Join Characters with Spaces

  • The function iterates through the list and joins characters in each string with spaces. This step seems unnecessary if the input is already a list of words, but it may be useful if the input is a list of individual characters.

3. Create Count Vectorizer

  • count_vectorizer = CountVectorizer(ngram_range=(word_ngram, word_ngram)): Initializes a CountVectorizer that will create n-grams of the specified length.

4. Fit and Transform

  • X = count_vectorizer.fit_transform(word_list): Fits the CountVectorizer to the word list and transforms it into a sparse matrix of n-gram counts.

5. Return Result:

  • Returns the sparse matrix X which contains the n-gram counts.

11. Feature Scaling

Feature scaling is an essential step in developing precise and effective machine learning models. A crucial aspect of feature engineering includes techniques such as scaling, normalization. These methods transform data to enhance its suitability for modeling, improve model performance, mitigate the influence of outliers, and ensure that data features are on a consistent scale.

Dealing with our dataset that contain features of varying scales, it’s important to normalize the data to ensure optimal model performance. One effective normalization technique is the MaxAbsScaler. The MaxAbsScaler scales each feature by its maximum absolute value, ensuring that all features range between -1 and 1 without altering the data’s sparsity. This approach is particularly beneficial for data with varying magnitudes and when preserving zero entries is crucial. By employing MaxAbsScaler, we can enhance the convergence and accuracy of machine learning algorithms, particularly in scenarios involving sparse datasets or models sensitive to feature scaling.

12. Data Splitting

The ‘train_validation_test_split’ function splits a dataset into training, validation, and test sets. The function allows control over the proportions of the dataset assigned to each set and provides options for stratification to ensure that class distributions are maintained across splits.

Figure 18: Train test split

Our dataset with 43,784 samples, the function performs the following steps to split the data:

1. Initial Split:

  • Training set: 80% of the dataset.
  • Combined test-validation set: 20% of the dataset.

2. Second Split:

  • Validation set: 50% of the combined test-validation set (10% of the total dataset).
  • Test set: 50% of the combined test-validation set (10% of the total dataset).

Calculation Steps:

1. Initial Split:

  • Training set size: 0.8×43,784=35,027.20.8 \times 43,784 = 35,027.20.8×43,784=35,027.2 (rounded to 35,027 samples).
  • Combined test-validation set size: 0.2×43,784=8,756.80.2 \times 43,784 = 8,756.80.2×43,784=8,756.8 (rounded to 8,757 samples).

2. Second Split:

  • Validation set size: 0.5×8,757=4,378.50.5 \times 8,757 = 4,378.50.5×8,757=4,378.5 (rounded to 4,378 samples).
  • Test set size: 0.5×8,757=4,378.50.5 \times 8,757 = 4,378.50.5×8,757=4,378.5 (rounded to 4,379 samples).

Split Sizes

The Training Set: Used to train the model, learning hidden features and patterns through repeated exposure in each epoch.

The Validation Set: Separate from the training set, it is used to tune model hyperparameters and configurations, preventing overfitting by evaluating performance during training.

The Test Set: A separate dataset used to provide an unbiased evaluation of the final model’s performance after training.

  • Training Set: 35,027 samples
  • Validation Set: 4,378 samples
  • Test Set: 4,379 samples
Figure 19: Training data/validation/test split

13. Machine Learning Models

In this section, we outline the process of developing a ML model to classify DNA sequences. The objective is to identify the most effective model and optimal k-mer length for accurate classification.

Model Selection

Three ML models were selected for this project:

  • Random Forest (baseline): An ensemble learning method that constructs multiple decision trees and outputs the mode of their predictions.
  • XGBoost: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
  • LightGBM: A gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the capability of handling large-scale data.

Model Hyperparameters

To fine-tuned the hyperparameters of these models the GridSearchCV was used. After extensive testing, including varying learning rates from 0.1 to 0.5, n_estimators from 5000 to 20000, max depth from 5 to 20, and other parameters, I identified the optimal settings that provided the best accuracy for each model.

Model Hyperparameters

To fine-tuned the hyperparameters of these models the GridSearchCV was used. After extensive testing, including varying learning rates from 0.1 to 0.5, n_estimators from 5000 to 20000, max depth from 5 to 20, and other parameters, I identified the optimal settings that provided the best accuracy for each model.

Model Training

Each model was trained using different K-mer lengths (3, 6, and 7) to determine the optimal length that provides the best classification performance. The training process involved splitting the data into training and validation sets and fitting the models on the training data.

Performance Metrics

The models were evaluated using several metrics to assess their performance:

  • Validation Accuracy: The percentage of correct predictions on the validation set.
  • Test Accuracy: The percentage of correct predictions on the test set.
  • Confusion Matrix: A table used to describe the performance of a classification model, displaying the true positives, true negatives, false positives, and false negatives.
  • Precision, Recall, and F1-Score: Precision is the ratio of correctly predicted positive observations to the total predicted positives. Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. F1-Score is the weighted average of Precision and Recall.

Overall Metrics

The performance metrics for the different K-mer lengths and models are summarized below:

Detailed Metrics for K-mer = 3, 6, and 7.

For each K-mer length, detailed metrics including precision, recall, and F1-scores for both classes (0 and 1) were computed. Additionally, macro and weighted averages of these metrics were calculated to provide a holistic view of the model performance.

Figure 20: Validation accuracy comparison
Figure 21: Testing accuracy comparison

Validation and Test Metrics for K-mer = 3.

Validation metrics diagrams of Model Performance for K-mer = 3.

1- Confusion matrix (Validation)

For LigtGBM model.

For Random Forest model.

2- Classification matrices for Class 1

For LigtGBM model:

For Random Forest model.

For XGBoost model.

3- Calibration Curve Validation for Highest Score

For LigtGBM Model

For Random Forest Model

For XGBoost model.

Analysis of Model Performance for k-mer = 3:

· Accuracy Score: LightGBM has the highest accuracy on both the validation and test sets, it is the best model among the three for our classification task.

· Confusion Matrix: LightGBM generally has fewer false positives and false negatives compared to Random Forest and XGBoost, indicating better performance in correctly classifying both classes.

· Precision, Recall, and F1-Score: All three models have very similar precision, recall, and F1-scores for both classes, indicating comparable performance in terms of classification quality. However, LightGBM slightly edges out in most metrics.

14. Summary and Observations

Summary:

  • The analysis focused on using traditional machine learning models (Random Forest, XGBoost, and LightGBM) to classify DNA sequences for paternity testing. The models were evaluated using different k-mer lengths (3, 6, and 7) to determine their impact on performance metrics.

Performance Metrics:

  • Validation Accuracy and Test Accuracy were the primary metrics used for evaluation.
  • Confusion Matrices were analyzed for true positives, true negatives, false positives, and false negatives.
  • Precision, Recall, and F1-Score were calculated for both classes and averaged for macro and weighted metrics.

Conclusion:

  • LightGBM consistently outperformed both Random Forest and XGBoost across all k-mer lengths, indicating its superior capability in handling DNA sequence classification for paternity testing.
  • Increasing the k-mer length generally improved the models’ accuracy, with the highest performance observed at k-mer = 7.
  • Attempting k-mer lengths of 8 or higher caused kernel crashes, suggesting a need for more powerful computational resources for further improvements.
  • With a more powerful GPU, it is anticipated that higher k-mer lengths could be explored, potentially leading to even better model accuracy and performance.

15. Project Coding APIs

Model Deployment

We use a code Python script that implements a Flask web application for DNA sequence matching , identification and predicting relative or not. It uses machine learning models to compare a given DNA sequence with sequences stored in a database and files present in a folder. The code retrieves DNA sequences from the database, reads sequences from files, and performs sequence matching and identification predication using LightGBM and other techniques.

Dependencies

The code relies on several external libraries and modules, which need to be installed before running the script:

  • Flask
  • Python3.8 or Higher
  • Bio (Biopython)
  • requests
  • pickle
  • numpy
  • pandas
  • scikit-learn
  • lightgbm
  • joblib
  • os
  • scipy

Ensure these dependencies are installed in your Python environment before executing the code.

Code Structure

The code is modular, ensuring clarity and ease of maintenance. It is divided into several sections, each serving a specific purpose. This document outlines the steps involved in setting up, running, and understanding the functionality of each section of the code.

1. Importing Required Libraries

Objective: Import essential libraries for the application to function.

  • Flask: Web framework for creating the web application.
  • Biopython: Used for DNA sequence alignment.
  • Requests: For making API calls.
  • Various Machine Learning and Utility Libraries: For data processing and machine learning tasks.

2. Flask App Configuration

Objective: Set up and configure the Flask application.

  • Create Flask Application: Initialize the Flask app with specified template and static folder locations.
  • Set API URL: Define the API endpoint for retrieving population data.

3. DNA Sequence Alignment

Objective: Define a function to compare two DNA sequences.

  • Function Definition: Implement needleman_wunsch_similarity to compare DNA sequences using the Needleman-Wunsch algorithm.
  • Parameters: Accept sequences and alignment scoring parameters (match, mismatch, gap penalties).
  • Output: Return a similarity score.

4. DNA Sequence Retrieval

Objective: Retrieve DNA sequences from files or an external API.

  • Retrieve from File: Define retrieve_dna_sequence_from_file to read and process DNA sequences from uploaded files.
  • Retrieve from API: Implement retrieve_api_data to fetch population data from the specified API endpoint.

5. Worker Function

Objective: Handle background tasks.

  • Function Definition: Implement a basic worker function to demonstrate background processing capability.

6. Flask Routes

6.1 Home Page

Objective: Render the home page of the application.

  • Route Definition: Define a route to render index.html.

6.2 DNA Sequence Comparison

Objective: Handle DNA sequence comparison requests.

  • Comparison Page: Define a route to render compare.html.
  • Comparison Logic:
  • Validate file uploads.
  • Read and process DNA sequences.
  • Calculate similarity using needleman_wunsch_similarity.
  • Return results as JSON.

6.3 DNA Sequence Identification

Objective: Handle DNA sequence identification requests.

  • Identification Page: Define a route to render identify.html.
  • Identification Logic:
  • Validate file uploads.
  • Retrieve population data from API.
  • Filter and compare sequences.
  • Return results as JSON.

6.4 Missing DNA Sequence

Objective: Handle cases of missing DNA sequences.

  • Missing Page: Define a route to render missing.html.
  • Missing Sequence Logic:
  • Validate file uploads.
  • Retrieve population data from API.
  • Compare sequences to find matches and potential children.
  • Return results as JSON.

6.5 Predicting DNA Sequence

Objective: Handle DNA paternity prediction requests.

  • Prediction Page: Define a route to render a page for predicting DNA paternity.
  • Prediction Form:

§ Render a form in HTML (predict.html) that allows users to upload two DNA files: one for the parent and one for the child.

§ Use the following HTML code for predict.html:

16. Feature Work

1. We will introduce a deep learning Convolutional Neural Networks (CNNs) for Local sequence alignment. The objective is to enhance the accuracy of sequence alignment beyond the capabilities of classical machine learning techniques.

2. Working in some Advanced topics in Bioinformatics like:

  • Finding Hidden Messages in DNA
  • Finding Mutations in DNA
  • Clustering DNA (genome) Sequence

3. Making new model to predict the DNA Next Generation Sequence (NGS), it’s really will help to make new vaccine that’s hold immense promise for transformative breakthroughs in the understanding and treatment of various diseases.

17. References

--

--

Ernest Bonat, Ph.D.
Ernest Bonat, Ph.D.

Written by Ernest Bonat, Ph.D.

I’m a Senior Machine Learnig Developer. I work on Machine Learning application projects for Life Sciences using Python and Python Data Ecosystem.

No responses yet