Introduction (Understanding the Concept)

What is Minimum Edit Distance?

In simple words: A mathematical way to measure the "difference" between two strings by counting the minimum number of operations needed to transform one string into another.

Example:

String A = "CAT"

String B = "CAR"

Difference? Only one letter (T → R). Therefore distance = 1

How Do We Measure It?

Three operations are allowed (each operation typically costs 1):

Real-Life Applications

What Will You Learn in This Lab?

  1. Understanding the edit distance formula and DP table
  2. Implementing from scratch in Python
  3. Backtracing to find which operations were used
  4. Building a real spelling corrector
  5. Customizing costs for different applications

Visual Example: "CAT" → "CAR"

Step-by-step visualization:

Method 1 (Optimal):
CAT
 │
 └─> Substitute 'T' with 'R'
     │
     CAR
Distance = 1 (only one operation)

Method 2 (Not optimal):
CAT → CA (delete 'T') → CAR (insert 'R')
Distance = 2 (two operations)

Method 3 (Not optimal):
CAT → CART (insert 'R') → CAR (delete 'T')
Distance = 2 (two operations)

Practical Lab (Step-by-Step)

Step 1: Understanding the DP Table Manually

Strings: "CAT" (source) → "CAR" (target)

Table Size: (len(src)+1) × (len(target)+1) = 4 × 4

Rules:

  • Row 0, Column 0 = 0
  • Row 0, Column j = j (need j insertions to reach target)
  • Column 0, Row i = i (need i deletions from source)

The Recurrence Formula:


D[i][j] = minimum of:
    D[i-1][j] + 1     # Delete from source
    D[i][j-1] + 1     # Insert into source
    D[i-1][j-1] + cost # Substitute (0 if same, 1 if different)

Initialize the table:

     ''  C   A   R
''    0   1   2   3
C     1   ?   ?   ?
A     2   ?   ?   ?
T     3   ?   ?   ?

Fill Cell (1,1): Compare 'C' (source) with 'C' (target) → Same character

Delete: D[0][1] + 1 = 1 + 1 = 2
Insert: D[1][0] + 1 = 1 + 1 = 2
Substitute: D[0][0] + 0 = 0 + 0 = 0
Minimum = 0

Updated table:

     ''  C   A   R
''    0   1   2   3
C     1   0   ?   ?
A     2   ?   ?   ?
T     3   ?   ?   ?

Fill Cell (1,2): Compare 'C' (source) with 'A' (target) → Different

Delete: D[0][2] + 1 = 2 + 1 = 3
Insert: D[1][1] + 1 = 0 + 1 = 1
Substitute: D[0][1] + 1 = 1 + 1 = 2
Minimum = 1

Continue filling all cells...

Final completed table:

     ''  C   A   R
''    0   1   2   3
C     1   0   1   2
A     2   1   0   1
T     3   2   1   1

Answer: Bottom-right cell (3,3) = 1

So the minimum edit distance from "CAT" to "CAR" is 1.

Step 2: Python Implementation (Simple Version)

def min_edit_distance(src, target):
    """
    Calculate minimum edit distance between src and target
    Default costs: insertion=1, deletion=1, substitution=1
    
    Args:
        src: Source string
        target: Target string
    
    Returns:
        int: Minimum edit distance
    """
    n, m = len(src), len(target)
    
    # Create DP table with zeros
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    
    # Initialize base cases
    for i in range(n + 1):
        dp[i][0] = i  # Delete all characters from src
    for j in range(m + 1):
        dp[0][j] = j  # Insert all characters into src
    
    # Fill the table using the recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i-1] == target[j-1]:
                cost = 0  # Same character, no substitution needed
            else:
                cost = 1  # Different character, need substitution
            
            dp[i][j] = min(
                dp[i-1][j] + 1,      # Deletion
                dp[i][j-1] + 1,      # Insertion
                dp[i-1][j-1] + cost  # Substitution
            )
    
    # Print the DP table for understanding (optional)
    print("\nDP Table:")
    print("    ", "  ".join([''] + list(target)))
    for i in range(n + 1):
        if i == 0:
            print(f"   {dp[i]}")
        else:
            print(f"{src[i-1]}  {dp[i]}")
    
    return dp[n][m]

# Test the function
print("=" * 50)
print("Testing 'CAT' → 'CAR':")
dist = min_edit_distance("CAT", "CAR")
print(f"\nMinimum Edit Distance: {dist}")

print("\n" + "=" * 50)
print("Testing 'kitten' → 'sitting':")
dist = min_edit_distance("kitten", "sitting")
print(f"\nMinimum Edit Distance: {dist}")

Step 3: Backtrace to Show Operations

def edit_distance_with_operations(src, target):
    """
    Calculate edit distance AND show the sequence of operations
    
    Args:
        src: Source string
        target: Target string
    
    Returns:
        tuple: (distance, list_of_operations)
    """
    n, m = len(src), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    
    # Initialize
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    
    # Store which operation gave the minimum value
    ops = [[''] * (m + 1) for _ in range(n + 1)]
    
    # Fill the table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i-1] == target[j-1]:
                dp[i][j] = dp[i-1][j-1]
                ops[i][j] = 'M'  # Match (keep character)
            else:
                # Try all three operations
                delete_cost = dp[i-1][j] + 1
                insert_cost = dp[i][j-1] + 1
                substitute_cost = dp[i-1][j-1] + 1
                
                if delete_cost <= insert_cost and delete_cost <= substitute_cost:
                    dp[i][j] = delete_cost
                    ops[i][j] = 'D'  # Delete
                elif insert_cost <= delete_cost and insert_cost <= substitute_cost:
                    dp[i][j] = insert_cost
                    ops[i][j] = 'I'  # Insert
                else:
                    dp[i][j] = substitute_cost
                    ops[i][j] = 'S'  # Substitute
    
    # Backtrace to find the sequence of operations
    i, j = n, m
    sequence = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ops[i][j] == 'M':
            sequence.append(f"Keep '{src[i-1]}'")
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and ops[i][j] == 'S':
            sequence.append(f"Substitute '{src[i-1]}' → '{target[j-1]}'")
            i -= 1
            j -= 1
        elif i > 0 and ops[i][j] == 'D':
            sequence.append(f"Delete '{src[i-1]}'")
            i -= 1
        elif j > 0 and ops[i][j] == 'I':
            sequence.append(f"Insert '{target[j-1]}'")
            j -= 1
    
    return dp[n][m], list(reversed(sequence))

# Test the function
print("=" * 60)
print("Edit Distance with Operations")
print("=" * 60)

src, tgt = "CAT", "CAR"
dist, operations = edit_distance_with_operations(src, tgt)
print(f"\nConverting '{src}' → '{tgt}'")
print(f"Minimum Edit Distance: {dist}")
print("Operations performed:")
for i, op in enumerate(operations, 1):
    print(f"  {i}. {op}")

# Another example
print("\n" + "-" * 60)
src, tgt = "kitten", "sitting"
dist, operations = edit_distance_with_operations(src, tgt)
print(f"\nConverting '{src}' → '{tgt}'")
print(f"Minimum Edit Distance: {dist}")
print("Operations performed:")
for i, op in enumerate(operations, 1):
    print(f"  {i}. {op}")

Step 4: Real Application - Spelling Corrector

def spell_check(word, dictionary):
    """
    Find the closest word in dictionary to the misspelled word
    
    Args:
        word: Misspelled word
        dictionary: List of correctly spelled words
    
    Returns:
        tuple: (best_match, distance)
    """
    best_word = None
    min_dist = float('inf')
    
    for correct_word in dictionary:
        dist = min_edit_distance(correct_word, word)
        if dist < min_dist:
            min_dist = dist
            best_word = correct_word
    
    return best_word, min_dist

# Dictionary of correctly spelled words
dictionary = [
    "apple", "apply", "apricot", "app", "appeal", 
    "application", "appreciate", "approach", "apartment"
]

# Misspelled words to correct
misspelled_words = ["aple", "aply", "apel", "appl", "apricot", "apreciate"]

print("=" * 50)
print("Spell Checker in Action")
print("=" * 50)

for wrong in misspelled_words:
    correct, dist = spell_check(wrong, dictionary)
    print(f"'{wrong}' → '{correct}' (Edit Distance: {dist})")

Step 5: Customizable Costs Version

Different applications may need different costs. For example:

  • In DNA analysis, substitution might cost more than insertion/deletion
  • In typing, insertion might be cheaper than deletion
def min_edit_distance_custom(src, target, insert_cost=1, delete_cost=1, substitute_cost=1):
    """
    Calculate edit distance with custom costs for each operation
    
    Args:
        src: Source string
        target: Target string
        insert_cost: Cost of inserting a character
        delete_cost: Cost of deleting a character
        substitute_cost: Cost of substituting a character
    
    Returns:
        int: Minimum edit distance with custom costs
    """
    n, m = len(src), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    
    # Initialize with custom costs
    for i in range(n + 1):
        dp[i][0] = i * delete_cost
    for j in range(m + 1):
        dp[0][j] = j * insert_cost
    
    # Fill table with custom costs
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i-1] == target[j-1]:
                sub_cost = 0
            else:
                sub_cost = substitute_cost
            
            dp[i][j] = min(
                dp[i-1][j] + delete_cost,      # Deletion
                dp[i][j-1] + insert_cost,      # Insertion
                dp[i-1][j-1] + sub_cost        # Substitution
            )
    
    return dp[n][m]

# Test with different costs
print("\n" + "=" * 60)
print("Custom Costs Example")
print("=" * 60)

src, tgt = "START", "STATE"

print(f"\nTransforming '{src}' → '{tgt}'")
print(f"Default costs (insert=1, delete=1, substitute=1): {min_edit_distance_custom(src, tgt)}")
print(f"Expensive substitution (insert=1, delete=1, substitute=5): {min_edit_distance_custom(src, tgt, substitute_cost=5)}")
print(f"Expensive insertion (insert=5, delete=1, substitute=1): {min_edit_distance_custom(src, tgt, insert_cost=5)}")
print(f"Expensive deletion (insert=1, delete=5, substitute=1): {min_edit_distance_custom(src, tgt, delete_cost=5)}")

Step 6: Find Top-K Closest Words

def spell_check_top_k(word, dictionary, k=3):
    """
    Find top K closest words from dictionary
    
    Args:
        word: Misspelled word
        dictionary: List of correctly spelled words
        k: Number of suggestions to return
    
    Returns:
        list: List of (word, distance) tuples
    """
    distances = []
    for correct_word in dictionary:
        dist = min_edit_distance(correct_word, word)
        distances.append((correct_word, dist))
    
    # Sort by distance and return top k
    distances.sort(key=lambda x: x[1])
    return distances[:k]

# Test with larger dictionary
large_dict = [
    "hello", "help", "hero", "hell", "held", "hill", 
    "heal", "heel", "halo", "hole", "whole", "hollow",
    "hello", "hollow", "hall", "hull", "hilly"
]

misspelled = "helo"

print("\n" + "=" * 60)
print("Top-K Closest Words")
print("=" * 60)

print(f"Misspelled word: '{misspelled}'")
print(f"\nTop 5 suggestions:")
suggestions = spell_check_top_k(misspelled, large_dict, k=5)
for i, (word, dist) in enumerate(suggestions, 1):
    print(f"  {i}. '{word}' (distance: {dist})")

Step 7: Visualizing the Alignment

def align_strings(src, target):
    """
    Show the alignment between source and target strings
    
    Args:
        src: Source string
        target: Target string
    
    Returns:
        str: Visual alignment
    """
    n, m = len(src), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    
    # Initialize
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    
    # Fill table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i-1] == target[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]) + 1
    
    # Backtrace to create alignment
    i, j = n, m
    src_aligned = []
    tgt_aligned = []
    
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i-1] == target[j-1]:
            src_aligned.append(src[i-1])
            tgt_aligned.append(target[j-1])
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            src_aligned.append(src[i-1])
            tgt_aligned.append(target[j-1])
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            src_aligned.append(src[i-1])
            tgt_aligned.append('-')
            i -= 1
        else:
            src_aligned.append('-')
            tgt_aligned.append(target[j-1])
            j -= 1
    
    # Reverse to get correct order
    src_aligned = ''.join(reversed(src_aligned))
    tgt_aligned = ''.join(reversed(tgt_aligned))
    
    # Create alignment visualization
    alignment = []
    for s, t in zip(src_aligned, tgt_aligned):
        if s == t:
            alignment.append('|')
        elif s == '-':
            alignment.append(' ')
        elif t == '-':
            alignment.append(' ')
        else:
            alignment.append('X')
    
    result = f"""
Source:   {src_aligned}
Target:   {tgt_aligned}
          {''.join(alignment)}
Legend: | = match, X = substitution, space = gap
"""
    return result

# Test alignment
print("\n" + "=" * 60)
print("String Alignment Visualization")
print("=" * 60)

print(align_strings("CAT", "CAR"))
print(align_strings("kitten", "sitting"))
print(align_strings("algorithm", "logarithm"))

Lab Assignments

Assignment 1: Custom Costs Implementation

Implement edit distance with the following costs:

  • Insertion cost = 2
  • Deletion cost = 1
  • Substitution cost = 3

Test on: "START" → "STATE"

def custom_edit_distance(src, target):
    """
    Implement edit distance with custom costs
    Insertion cost = 2
    Deletion cost = 1
    Substitution cost = 3
    """
    # Your code here
    pass

# Test your function
print(custom_edit_distance("START", "STATE"))
# Expected output: (calculate manually to verify)

Assignment 2: Autocorrect System

Create an autocorrect function that:

  • Takes a misspelled word and a dictionary
  • Returns top 3 closest words with their distances
  • Handles case insensitivity
def autocorrect(misspelled, dictionary):
    """
    Returns top 3 closest words with distances
    
    Args:
        misspelled: The misspelled word
        dictionary: List of correct words
    
    Returns:
        list: List of (word, distance) tuples for top 3 matches
    """
    # Your code here
    pass

# Test with:
dictionary = ["hello", "help", "hero", "hell", "held", "hill", "hollow"]
print(autocorrect("helo", dictionary))
# Expected: [('hello', 1), ('hell', 1), ('help', 2)]

Assignment 3: DNA Sequence Comparison

Two DNA sequences: "AGCTAGCT" and "AGCTGGCT"

Find the edit distance considering:

  • Substitution cost = 2 (different nucleotides)
  • Insertion/Deletion cost = 1
def dna_distance(seq1, seq2):
    """
    Calculate DNA sequence distance with custom costs
    Substitution = 2, Insertion/Deletion = 1
    """
    # Your code here
    pass

# Test
seq1 = "AGCTAGCT"
seq2 = "AGCTGGCT"
print(f"DNA Distance: {dna_distance(seq1, seq2)}")

Assignment 4: Word Similarity Threshold

Write a function that checks if two words are "similar enough" (edit distance <= threshold) and prints the operations needed.

def are_similar(word1, word2, threshold=2):
    """
    Check if two words are similar within threshold
    
    Returns:
        tuple: (bool, distance, operations)
    """
    # Your code here
    pass

# Test cases
print(are_similar("cat", "car", threshold=1))  # Should return (True, 1, [...])
print(are_similar("computer", "compute", threshold=2))  # Should return (True, 1, [...])
print(are_similar("hello", "goodbye", threshold=2))  # Should return (False, 5, [...])