Sequence Alignment as Dynamic Programming — Needleman-Wunsch for CS Folks

publiclearnv1·by Andre Pacheco·May 14, 2026productoplaybook

Tiempo de lectura:12 min

Prereqs:DP básico, recursión

Dificultad:Media

Lenguaje:Pseudocódigo

You already know dynamic programming. You've solved Longest Common Subsequence in an interview, smiled smugly, and moved on. Good news: biologists have been using almost the exact same algorithm since 1970, and they call it Needleman-Wunsch. It aligns DNA, proteins, and any sequence where insertions and deletions matter. This doc maps the bio vocabulary onto the CS vocabulary you already own.

TL;DR: Needleman-Wunsch is LCS with a richer scoring scheme and a gap penalty. Same O(nm) table. Same trace-back. Different domain, same beast.

What you'll get from this doc

Why DNA alignment is a string-matching problem in disguise

How the DP recurrence works (with the bio scoring twist)

Reading the trace-back to recover the actual alignment

When to reach for global vs local alignment

Affine gap penalties (Gotoh 1982)

Multi-sequence alignment (use a real library)

The setup: two strings, one biology problem

You've got two DNA sequences:

S1 = GATTACA
S2 = GCATGCU

A biologist wants to know: how related are these? Specifically — what's the best way to line them up so matching characters stack, allowing for insertions (a base appeared) and deletions (a base went missing) along the way?

That "best way" is the alignment. One candidate:

G - A T T A C A
G C A T - G C U

Three matches (G, A, T, C), one mismatch (A/G, A/U), two gaps (-). Is that the best alignment? You need a score function and an algorithm to maximize it. That's where DP enters.

The CS analogy you already have

Biology says	CS hears
Sequence	String
Base / residue	Character
Match score (+1)	LCS reward for equal chars
Mismatch penalty (-1)	Substitution cost in edit distance
Gap penalty (-2)	Insert/delete cost in edit distance
Optimal alignment	Trace-back from DP table
Substitution matrix (BLOSUM62)	Per-pair cost lookup table

If you've written edit distance, you've already written 80% of Needleman-Wunsch. The only differences: (1) you're maximizing instead of minimizing, (2) the scoring is asymmetric and domain-specific.

The recurrence

Let H[i][j] be the optimal alignment score of S1[1..i] against S2[1..j]. Then:

H[i][j] = max(
  H[i-1][j-1] + s(S1[i], S2[j]),   // diagonal: match or mismatch
  H[i-1][j]   + gap,                // up:       gap in S2
  H[i][j-1]   + gap                 // left:     gap in S1
)

with base cases H[0][0] = 0, H[i][0] = i * gap, H[0][j] = j * gap.

Yes — it's literally LCS with a scoring tweak.

The algorithm in 4 acts

✓
Initialize
Allocate (n+1)×(m+1) table. Fill row 0 and column 0 with cumulative gap penalties.
✓
Fill
Sweep row by row, left to right. Each cell is max of three predecessors. O(nm) cells × O(1) work each.
✓
Score
H[n][m] is the optimal alignment score. That single number ranks how related S1 and S2 are.
4
Trace back
Walk from H[n][m] to H[0][0] following the choices that produced each max. Recover the alignment string.

Why DP beats naive recursion (the part you tell freshmen)

The naive recursive solution explores every possible alignment — at each step you either match, gap-up, or gap-left. That's a ternary tree of depth n+m. Cosmically expensive.

	Naive recursion	Needleman-Wunsch (DP)
Time complexity	O(3^(n+m))	O(n·m)
Space complexity	O(n+m) stack	O(n·m) table
Recomputes subproblems	Yes — exponentially	No — memoized
Practical for n=1000	Heat death of universe	~1M ops, ~ms

For two 1000-base sequences, naive recursion is roughly 3^2000 operations. NW is one million. You will publish before the naive version returns.

Memory trick: If you only need the score (not the alignment), you can drop to O(min(n,m)) space by keeping only the previous row. Hirschberg's algorithm even gives you the alignment in O(n·m) time and O(n+m) space — divide-and-conquer on the DP. Worth knowing.

The DP table, visualized

For S1 = GATTACA aligned to S2 = GCATGCU with match=+1, mismatch=-1, gap=-2:

When you trace back, diagonal = match or mismatch (consume both chars), up = gap in S2 (consume only S1), left = gap in S1 (consume only S2). Reverse the path string and you've got the alignment.

Global vs local — when to switch

Needleman-Wunsch is global: it aligns the entire length of both sequences. That's right when the sequences are roughly the same length and you suspect they're homologous end-to-end. When you're searching for a short motif inside a long genome, you want Smith-Waterman instead — same DP shape, but with a max(0, ...) clamp and trace-back from the highest cell anywhere in the table, not the corner.

	Needleman-Wunsch	Smith-Waterman
Alignment scope	Global (end-to-end)	Local (best subsequence)
Negative scores allowed?	Yes	No (clamped to 0)
Trace-back start	Bottom-right corner	Highest cell anywhere
Best for	Whole-gene alignment	Motif / domain search

Heads up: Real biology uses substitution matrices (BLOSUM62 for proteins, PAM for evolutionary distance) instead of flat ±1 scoring. The DP is identical — just swap your s(a,b) lookup. Production tools also use affine gap penalties (different cost for opening vs extending a gap). Add Gotoh's O(nm) extension when you need that.

Where to take it from here

Next step

Implement NW in your language of choice with the GATTACA / GCATGCU example, then read the BLAST paper to see how heuristic seeding makes alignment tractable on billion-base genomes.

Wikipedia: Needleman-Wunsch →

You now have the bridge: every alignment algorithm in computational biology — Needleman-Wunsch, Smith-Waterman, Gotoh, Hirschberg — is a DP table you already know how to fill. The biology is just a richer scoring function over the same recurrence.

Welcome to the field. Bring snacks; the genomes are big.