Advanced Paternity DNA Sequence Classification Using Dynamic Programming and Machine Learning Algorithms. Part 1
Ernest Bonat, Ph.D., Mohamed Abdulaziz Eisa, B.Sc. AI
1. Overview
2. Research Limitations
3. Machine Learning Challenges
4. Biological Sequence Alignment
5. Types of Alignment Algorithms
6. Importance of Biological Sequence Alignment
7. Overview of Pairwise Sequence Alignment Techniques
8. Dynamic Programming Algorithm for Sequence Alignment Steps
9. Scoring Scheme
10. Global Alignments
11. Local Alignments
12. Table of Comparison of Alignment Methods
13. Paternity DNA Sequence Alignment
14. Conclusions
15. Future Work Plan (Part 2)
1. Overview
A paternity test is a type of genetic test that determines whether a man is the biological father of a child. Paternity tests are used to establish a biological relationship between a father and a child. This can be important for legal, medical, or personal reasons. DNA Testing is the most common and accurate method. It involves collecting DNA samples from the alleged father and the child, usually through a cheek swab or blood sample. In this paper, Dynamic Programming and Machine Learning algorithms will be used to for paternity DNA sequence classification.
The primary aim of the proposed research is to advance the efficiency of sequence alignment algorithms through complete parallelization. This approach targets commonly used algorithms, specifically the Smith-Waterman and Needleman-Wunsch algorithms, leveraging efficient and low-cost hardware and software platforms to address and mitigate the challenges associated with dynamic programming and traditional hardware used in hospitals and laboratories.
2. Research Limitations
The proposed approach has several limitations:
- Sequence Length Restriction: The method relies on using equal-length sequences, specifically multiples of four (e.g., 4, 8, 12), which aligns with the four nucleobases in DNA sequences. This restriction does not directly apply to protein sequences, which consist of 20 amino acids.
- Algorithm Flexibility: Although the proposed method can be adapted to other local or global alignment algorithms such as Gotoh, Miller-Myers, and Hirschberg, its effectiveness is contingent upon the assumption that the parallel software implementation will be more efficient than traditional methods. The approach aims to be faster and less computationally and power-intensive, but this efficiency may vary depending on the specific algorithm and implementation context.
In our research, we focused on pair-wise sequence alignment. Specifically, we utilized the Smith-Waterman (SW) algorithm for local alignment and the Needleman-Wunsch (NW) algorithm for global alignment. These techniques are designed to align sequences by comparing them in pairs, providing detailed insights into their similarities and differences.
3. Machine Learning Challenges
We began developing a machine learning-based DNA classifier for paternity testing, which presents several key challenges.
- Large and Long Sequences: The classifier must handle a large number of DNA sequences, each of which can be very long. This poses significant computational and memory challenges.
- Computational Complexity: Sequence alignment algorithms used in this context have a time complexity of O(MN), where M and N represent the lengths of the sequences. This complexity can lead to high computational costs and processing times.
- Hardware Limitations: Traditional hardware implementations of sequence alignment algorithms may not be effective for handling sequential process problems, which can impact system performance and speed.
Solutions:
- Efficient Data Processing: Employ data preprocessing techniques such as sequence chunking and compression to manage large datasets and reduce memory usage.
- Optimized Algorithms: Utilize optimized algorithms and approximations to reduce computational complexity. Techniques such as heuristic methods and algorithmic improvements can help speed up processing.
- Parallel Computing: Leverage parallel computing and hardware acceleration, such as GPUs, to handle large-scale computations more efficiently and improve processing times.
4. Biological Sequence Alignment
We can define biological sequence alignment as the process of arranging and comparing DNA, RNA, or protein sequences to identify regions of similarity. These similarities can reveal important functional, structural, or evolutionary relationships between the sequences.
The purpose of sequence alignment is:
- Similarity Identification: The primary goal is to detect regions of similarity between sequences. These similarities can indicate shared functions or evolutionary origins.
- Functional and Evolutionary Insights: Identifying conserved regions can help infer the function of genes or proteins, understand evolutionary relationships, and predict the structure and function of new sequences.
5. Types of Alignment Algorithms
We will explore the fundamentals and techniques of sequence alignment in bioinformatics, focusing on how to compare and analyze biological sequences to uncover functional, structural, and evolutionary relationships.
a. Pairwise Sequence Alignment
This involves aligning two sequences at a time. The aim is to find the best alignment that maximizes the number of matches and minimizes mismatches and gaps.
· Local vs. Global Alignment
- Local Alignment: Focuses on finding the most similar region within two sequences. The Smith-Waterman algorithm is commonly used for local alignment, which is useful when sequences share a common sub-region but differ significantly overall.
- Global Alignment: Aligns the entire length of two sequences, accounting for possible gaps at the ends. The Needleman-Wunsch algorithm is typically used for global alignment, suitable when comparing sequences of similar length and overall similarity.
· Applications: Used for comparing two sequences to understand their functional similarity or evolutionary relationship. For instance, comparing a gene sequence to a database to identify potential homologous genes.
b. Multiple Sequence Alignment
This method aligns three or more sequences simultaneously. The objective is to find a common alignment that maximizes overall similarity across all sequences.
• Algorithm Examples:
- ClustalW: A widely used tool for constructing multiple sequence alignments. It uses progressive alignment techniques to build up the alignment step-by-step.
- MUSCLE: An algorithm known for its speed and accuracy in generating multiple sequence alignments.
- T-Coffee: Combines multiple alignment methods to improve accuracy by integrating information from different alignments.
- Applications: Essential for studying the evolutionary relationships among a group of sequences, identifying conserved motifs, and understanding functional domains shared by related proteins or genes. MSA helps in predicting the secondary and tertiary structures of proteins by aligning homologous sequences.
6. Importance of Biological Sequence Alignment
In this topic, we will explain the significance and methodologies of sequence alignment in bioinformatics. We will explore how sequence alignment helps in various fields of biological research and practical applications. Specifically, we will cover:
- Evolutionary Studies: Helps in constructing phylogenetic trees to understand the evolutionary relationships among species or genes.
- Functional Annotation: Aids in predicting the function of genes or proteins by identifying conserved regions that are crucial for their biological roles.
- Drug Discovery and Development: Identifying conserved motifs or domains can help in designing drugs that target specific biological functions or interactions.
- Genomic Research: Useful in genome annotation, identifying gene families, and understanding the genetic basis of diseases.
By aligning biological sequences, researchers can uncover valuable information about the similarities and differences between sequences, leading to a better understanding of biological functions and evolutionary processes.
7. Overview of Pairwise Sequence Alignment Techniques
Let’s look at these figures.
1. Heuristic Programming
We will explain heuristic programming as techniques used for sequence alignment in bioinformatics. Heuristic methods provide quick and approximate solutions to sequence alignment problems, making them ideal for searching large databases. We will discuss the following tools and their applications:
1.1 BLAST (Basic Local Alignment Search Tool)
- A widely used heuristic tool for finding regions of local similarity between sequences. BLAST searches databases for sequences that align with a query sequence, providing quick and approximate results.
- Type: Local Alignment
- Usage: Ideal for searching large databases to identify homologous sequences or functional similarities.
1.2 FASTA
- Another heuristic method used to search for local alignments between a query sequence and sequences in a database. FASTA uses a scoring matrix and gap penalties to identify regions of similarity.
- Type: Local Alignment
- Usage: Useful for identifying similar sequences in large databases and performing quick alignments.
1.3 SIM2
- Designed for aligning protein sequences to a database. SIM2 improves alignment accuracy by incorporating additional scoring systems tailored for protein sequences.
- Type: Local Alignment
- Usage: Enhances local alignment accuracy, particularly for protein sequences.
2. Dynamic Programming
We will explore dynamic programming techniques used for sequence alignment in bioinformatics. Dynamic programming provides rigorous and accurate solutions for aligning sequences, making it essential for detailed analysis. We will discuss the following algorithms and their applications:
2.1 Needleman-Wunsch (NW)
- A dynamic programming algorithm for finding the optimal global alignment between two sequences. It aligns sequences from end to end, accounting for all gaps and mismatches.
- Type: Global Alignment
- Usage: Suitable for sequences that are of similar length and overall similarity, providing a comprehensive alignment.
2.2 Smith-Waterman (SW)
- A dynamic programming algorithm used for finding the optimal local alignment between two sequences. It focuses on aligning the most similar sub-region within the sequences.
- Type: Local Alignment
- Usage: Effective for identifying highly similar regions within sequences that may differ significantly in other areas.
2.3 Hirschberg
- An optimized variant of the Needleman-Wunsch algorithm that reduces memory usage by using a divide-and-conquer approach. It still performs global alignment but with reduced space complexity.
- Type: Global Alignment (Optimized for space)
- Usage: Useful when aligning long sequences where memory is a constraint.
2.4 Gotoh:
- Description: An enhancement of the Needleman-Wunsch algorithm that includes affine gap penalties, providing a more accurate alignment by accounting for the cost of gaps more effectively.
- Type: Global Alignment
- Usage: Improves global alignment accuracy by refining gap penalties.
2.5 Miller-Myers:
- Description: An extension of the Needleman-Wunsch algorithm designed to improve alignment efficiency by incorporating additional heuristics for gap penalties and mismatches.
- Type: Global Alignment
- Usage: Suitable for aligning sequences with varying lengths and gap penalties.
3. Specialized Techniques
We will explore specialized techniques used for sequence alignment in bioinformatics. These techniques offer unique methods for visualizing and analyzing sequence alignments, providing valuable insights into the relationships between sequences. We will discuss the following technique and its application:
3.1 Dot-Matrix Technique
- Description: A graphical method for visualizing sequence alignments by plotting a matrix where dots represent matching characters. This technique helps in identifying regions of similarity and differences.
- Type: Global and Local Alignment
- Usage: Useful for visual inspection and initial exploration of sequence similarities and differences.
Heuristic Methods (BLAST, FASTA, SIM2): Provide fast, approximate alignments, suitable for large-scale database searches and quick comparisons.
- Dynamic Programming (Needleman-Wunsch, Smith-Waterman, Hirschberg, Gotoh, Miller-Myers): Offer accurate alignments with various optimizations for space and gap penalties, suitable for detailed analysis of sequence similarities.
- Specialized Techniques (Dot-Matrix): Provide visual and exploratory tools for understanding sequence relationships and similarities.
8. Dynamic Programming Algorithm for Sequence Alignment Steps
As we discussed earlier, dynamic programming in sequence alignment is a technique used to determine the optimal alignment of two or more sequences — such as DNA, RNA, or protein sequences — by breaking the problem into simpler, manageable subproblems. Here’s a simplified explanation:
- Set the reference sequence across the top of the N*M matrix and the read sequence along the side.
- Initialize the first row and first column of the score matrix (values depend on algorithm)
- For each element, derived scores from neighboring above, above left, and left units.
- For each element, compute match and mismatch scores on above-left score and gap score on above and left scores. Choose maximum of computed score as final score.
- Once all elements in matrix are filled, find the highest score, which is where the last base in the alignment occurs.
General Workflow
In this section, we will explain the steps involved in sequence alignment using the Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms.\
Step 1: The Basic Concept
Both SW and NW algorithms use a matrix to align two sequences. Let’s call them Seq1 (M) and Seq2 (N). The matrix helps in scoring alignments based on matches, mismatches, and gaps.
- Match: +1 point
- Mismatch: -1 point
- Gap: -2 points
Step 2: Matrix Initialization
We start by setting up a matrix where:
- The rows represent one sequence (Seq2 = ATCGT).
- The columns represent the other sequence (Seq1 = TGGGTG).
Example:
Seq1: T G G G T G
Seq2:
A
T
C
G
T
The first row and column are initialized with gap penalties:
T G G G T G
0 -2 -4 -6 -8 -10 -12
A -2
T -4
C -6
G -8
T-10
Step 3: Matrix Filling
For each cell in the matrix, we calculate scores based on possible alignments:
- Diagonal (match/mismatch): Aligns the characters from both sequences.
- Top (gap in Seq1): Aligns a gap in Seq1 with a character in Seq2.
- Left (gap in Seq2): Aligns a character in Seq1 with a gap in Seq2.
Formula:
T[i,j] = max { T[i-1, j-1] + σ(S1[i], S2[j]), T[i-1, j] + gap penalty, T[i, j-1] + gap penalty }
Step 4: Traceback and Result Generation
Once the matrix is filled, we trace back from the highest score to generate the optimal alignment. This process involves following the path of the highest scores backward through the matrix.
Example Explained
Let’s align Seq1 (TGGGTG) and Seq2 (ATCGT) using our scoring scheme. Here’s how we fill the matrix step by step.
1. Initialization:
- The first row and column are filled with multiples of the gap penalty (-2).
2. Filling the Matrix:
- For cell (i=4, j=3) aligning G (from Seq1) with C (from Seq2), we calculate:
— Diagonal: T[3,2] — 1 (mismatch)
— Top: T[3,3] — 2 (gap)
— Left: T[4,2] — 2 (gap)
— Result: max(-3, -3, -6) = -3
3. Traceback:
- After filling the matrix, we find the highest score and trace back to form the optimal alignment.
9. Scoring Scheme
Let’s explore more and understand the scoring scheme used in sequence alignment. A scoring scheme assigns numerical values to matches, mismatches, and gaps to determine the optimal alignment between sequences.
Example Scoring Scheme
- Match: +1 point
- Mismatch: -1 point
- Gap: 0 points
Example Sequences
- Sequence 1: GATTACA
- Sequence 2: GCTAC
Alignment Example
GATTACA
| | |
GCT-AC
Scoring Details:
1. G vs G: Match = +1
2. A vs C: Mismatch = -1
3. T vs T: Match = +1
4. T vs A: Mismatch = -1
5. A vs C: Mismatch = -1
6. C vs -: Gap = 0
7. A vs -: Gap = 0
Total Score Calculation
- Matches: +1 (G-G) + +1 (T-T) = +2
- Mismatches: -1 (A-C) + -1 (T-A) + -1 (A-C) = -3
- Gaps: 0 (C-) + 0 (A-) = 0
Final Alignment Score: +2 (matches) — 3 (mismatches) + 0 (gaps) = -1
New Example Alignment
- Sequence A1: A-TGAG
- Query: ATGGCGSequence A2: ATG-AG
Total Score Calculation
- Final Alignment Score (A1) with Query: +3 (matches) — 2 (mismatches) + 1 (gaps) = +1
- Final Alignment Score (A2) with Query: +4 (matches) — 1 (mismatches) + 1 (gaps) = +3
10. Global Alignments
Binary Global Alignment
In this topic we will discuss in deep Binary Global Alignment is a method used to compare two sequences by aligning them completely from start to finish. It finds the optimal match by considering all possible alignments, including gaps, and scores them based on matches, mismatches, and gaps. This helps in identifying the best overall alignment between the sequences.
1. Same Sequences
If both the sequences is exactly same then:
- Match Values scoring = 1
- Mismatch value Scoring = 0
- Gap penalty is not considered then let’s show how it act with code.
It’s the same exactly score if we calculated it 39 out of 39.
2. Different Sequences
Now, let us take two different sequence one from one person generation and another from another generation (different states).
Here we can see that score is 21 out of 39.
Weighted Global Alignment
In this topic, we will explore Weighted Global Alignment, an advanced method for comparing two sequences. Unlike basic alignment methods, Weighted Global Alignment assigns different weights to matches, mismatches, and gaps, reflecting their varying significance in different contexts. This technique ensures a more precise and tailored alignment by considering the importance of specific alignments. By optimizing the alignment scores with these weights, Weighted Global Alignment provides a more accurate and meaningful comparison between sequences, making it invaluable in various fields such as bioinformatics and computational biology.
1.Same Sequences
Here, each and every possible type of scoring is given a weight.
- Match Scoring = 2
- Mismatching Scoring = -1
- Gap Penalty = (open = -0.5 , extend = -0.1)
we get this score (39 * 2 ) = 78 out of 78.
2. Different Sequences
Let’s try to test with different sequences and see the results for weighted global alignment.
Score = 31.2
11. Local Alignments
Binary Local Alignment
we will explain Binary Local Alignment, a method used to find the optimal alignment between the most similar subregions of two sequences. Unlike global alignment, which aligns sequences from end to end, Binary Local Alignment focuses on aligning the best matching parts, allowing for more flexibility and accuracy in identifying regions of high similarity. By scoring matches, mismatches, and gaps, this method helps in pinpointing the most relevant and significant local alignments between sequences.
1. Same Sequences
If both the sequences is exactly same then
- Match Values scoring = 1
- Mismatch value Scoring = 0
- Gap penalty is not considered then let’s show how it acts with code.
It’s the same exactly score if we calculated it 39 out of 39.
2. Different Sequences
Let’s try to test with different sequences and see the results for binary local alignment.
Here we can see that score is 21 out of 39.
Weighted Local Alignment
We will discuss Weighted Local Alignment, a method used to compare specific parts of two sequences. Unlike global alignment, which aligns sequences from start to finish, Weighted Local Alignment focuses on the most similar sub-regions. It uses different weights for matches, mismatches, and gaps to find the best local alignment. This approach helps identify the most relevant and meaningful similarities between sequences.
1. Same Sequences
Here, each and every possible type of scoring is given a weight.
- Match Scoring = 2
- Mismatching Scoring = -1
- Gap Penalty = (open = -0.5 , extend = -0.1)
It’s the same Score = 78 like globalms (weighted global alignment).
2. Different Sequences
Let’s try to test with different sequences and see the results for weighted Local alignment.
Score = 32.2, it’s different score from weighted global with different sequences. Example: for Local Alignment in Dynamic Programming “Smith-Waterman (SW)”.
12. Table of Comparison of Alignment Methods
The table that local alignment is more adaptable when dealing with different sequences, focusing on the most similar regions to provide better alignment scores. In contrast, global alignment is most effective for aligning sequences that are highly similar across their entire lengths.
Alignment Types
1. Global Alignment
- Same Sequences: Global alignment aligns the entire length of two sequences, maximizing the match across their full lengths. For identical sequences, the binary score is 39/39, and the weighted score is 78/78, indicating perfect alignment.
- Different Sequences: When aligning different sequences globally, the scores drop, reflecting the mismatches and gaps introduced. Here, the binary score is 21/39, and the weighted score is 31.2/78, showing that while some alignment is possible, it is less effective due to differences in the sequences.
2. Local Alignment
- Same Sequences: Local alignment finds the best matching subsequence within the sequences. For identical sequences, it effectively behaves like global alignment, yielding a binary score of 39/39 and a weighted score of 78/78.
- Different Sequences: For different sequences, local alignment focuses on regions with the highest similarity, resulting in a better alignment score than global alignment. The binary score here is 21/39, but the weighted score is slightly higher at 32.5/78, indicating more effective alignment of the most similar regions.
3. Scoring Schemes
- Binary Score: Measures the number of aligned positions, indicating how many matches occurred.
- Weighted Score: Considers the quality of matches, mismatches, and gaps, providing a more nuanced view of alignment effectiveness.
13. Paternity DNA Sequence Alignment
- In the context of paternity testing, local alignment is generally preferred due to its suitability for identifying specific regions or markers used to establish genetic relationships between individuals. This method allows for a focus on the most relevant portions of DNA sequences, such as short tandem repeats (STRs) or other genetic markers commonly utilized in such tests.
- However, our case involves a different approach. Given that our data comprises very large sequences, specifically around 2000 letters of single nucleotide polymorphism (SNPs) rather than markers or STRs, local alignment is not the optimal method. Instead, we require a technique suitable for comparing entire sequences due to the expected high similarity across the full length of these sequences, akin to whole genome comparison.
- Therefore, our alignment strategy must accommodate the extensive nature and high similarity of the SNP sequences to accurately reflect the genetic relationships in our dataset.
14. Conclusions
1. The research focused on improving the efficiency of DNA sequence alignment algorithms, specifically the Smith-Waterman and Needleman-Wunsch algorithms, through parallelization.
2. Addressed computational and memory challenges associated with large DNA sequences, emphasizing the need for optimized algorithms and advanced hardware.
3. Employed dynamic programming to manage sequence alignment, showcasing its effectiveness in identifying similarities between sequences for paternity testing.
4. Weighted Global Alignment: This method was chosen for its ability to provide more accurate alignments by assigning different weights to matches, mismatches, and gaps, thus better handling the variability and complexity of DNA sequences.
Importance of Weighted Global Alignment:
- Accuracy: By assigning specific weights, the algorithm can more accurately reflect biological realities, leading to more precise alignments.
- Flexibility: Weighted global alignment can be adapted to different types of sequences and alignment needs, making it versatile for various applications.
- Performance: Optimization and parallelization of this method improve computational efficiency, making it suitable for large-scale DNA data processing.
15. Future Work Plan (Part 2)
- Code Implementation
- Develop Algorithms: Implement the weighted global alignment algorithm in a programming language suitable for handling large data sets, such as Python.
- Optimize for Performance: Utilize libraries and frameworks that support parallel computing, such as CUDA for GPU acceleration.
2. Preparing Data
- Data Collection: Gather a diverse set of DNA sequences to ensure the robustness of the algorithm.
- Preprocessing: Implement techniques such as sequence chunking and compression to manage memory usage and improve processing speed.
3. Research Notebooks
- Documentation: Create detailed research notebooks documenting the implementation, experiments, and results.
- Visualizations: Include visualizations of sequence alignments and performance metrics to illustrate the effectiveness of the algorithm.
4. Deployment and Testing
- Real-World Data Testing: Apply the algorithm to real-world DNA data to validate its accuracy and efficiency.
- Continuous Improvement: Iterate on the algorithm based on test results, optimizing for accuracy and performance.
5. Final Presentation
- Compile Results: Summarize findings, methods, and results in a comprehensive report.
- Demonstration: Prepare a live demonstration of the algorithm processing DNA sequences, showcasing its practical application in paternity testing and genomic research.