StringProblem 25 of 43

Boyer Moore Algorithm for Pattern Searching

Brute Force
Time: O(n * m)
Space: O(1)
Optimal
Time: O(n / m) average, O(n * m) worst
Space: O(m + σ)

Problem Statement

Given a text string and a pattern string, find all occurrences of the pattern in the text using the Boyer-Moore algorithm. This algorithm is one of the most efficient string matching algorithms for practical use cases.

Example:

  • Input: text = "ABAAABCDABC", pattern = "ABC"
  • Output: Pattern found at indices [4, 8]
  • Explanation: "ABC" appears at positions 4 and 8 in the text

Approach 1: Brute Force (Naive Pattern Matching)

Intuition

Slide the pattern over the text one character at a time and check for a match at each position.

Algorithm

  1. For each position i from 0 to n-m in text
  2. Check if pattern matches starting at position i
  3. If all characters match, record the position
  4. Move to next position
java
import java.util.*;

public class Solution {
    public List<Integer> naiveSearch(String text, String pattern) {
        List<Integer> result = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        
        for (int i = 0; i <= n - m; i++) {
            int j;
            for (j = 0; j < m; j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
            }
            if (j == m) {
                result.add(i);
            }
        }
        
        return result;
    }
}

Complexity Analysis

  • Time Complexity: O(n * m) - For each position, we may compare up to m characters
  • Space Complexity: O(1) - No extra space needed

Approach 2: Boyer-Moore Algorithm (Optimal)

Intuition

Boyer-Moore works by comparing the pattern from right to left and using two heuristics to skip alignments:

  1. Bad Character Rule: When a mismatch occurs, shift the pattern to align the bad character in text with its rightmost occurrence in the pattern
  2. Good Suffix Rule: When a mismatch occurs, use information about the matched suffix to determine the shift

Algorithm

  1. Preprocess pattern for bad character heuristic
  2. Preprocess pattern for good suffix heuristic
  3. Align pattern at the beginning of text
  4. Compare from right to left
  5. On mismatch, shift by max of both heuristics
  6. Repeat until pattern goes past text end
java
import java.util.*;

public class BoyerMoore {
    private static final int ALPHABET_SIZE = 256;
    
    private int[] preprocessBadCharacter(String pattern) {
        int m = pattern.length();
        int[] badChar = new int[ALPHABET_SIZE];
        
        Arrays.fill(badChar, -1);
        
        for (int i = 0; i < m; i++) {
            badChar[pattern.charAt(i)] = i;
        }
        
        return badChar;
    }
    
    private int[] preprocessGoodSuffix(String pattern) {
        int m = pattern.length();
        int[] shift = new int[m + 1];
        int[] borderPos = new int[m + 1];
        
        int i = m, j = m + 1;
        borderPos[i] = j;
        
        while (i > 0) {
            while (j <= m && pattern.charAt(i - 1) != pattern.charAt(j - 1)) {
                if (shift[j] == 0) {
                    shift[j] = j - i;
                }
                j = borderPos[j];
            }
            i--;
            j--;
            borderPos[i] = j;
        }
        
        // Case 2: Prefix matches suffix
        j = borderPos[0];
        for (i = 0; i <= m; i++) {
            if (shift[i] == 0) {
                shift[i] = j;
            }
            if (i == j) {
                j = borderPos[j];
            }
        }
        
        return shift;
    }
    
    public List<Integer> search(String text, String pattern) {
        List<Integer> result = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        
        if (m == 0 || n < m) return result;
        
        int[] badChar = preprocessBadCharacter(pattern);
        int[] shift = preprocessGoodSuffix(pattern);
        
        int s = 0;
        
        while (s <= n - m) {
            int j = m - 1;
            
            while (j >= 0 && pattern.charAt(j) == text.charAt(s + j)) {
                j--;
            }
            
            if (j < 0) {
                result.add(s);
                s += shift[0];
            } else {
                int badCharShift = j - badChar[text.charAt(s + j)];
                int goodSuffixShift = shift[j + 1];
                s += Math.max(badCharShift, goodSuffixShift);
            }
        }
        
        return result;
    }
}

Complexity Analysis

  • Time Complexity:
    • Average case: O(n/m) - Sublinear due to skipping
    • Worst case: O(n * m) - When pattern has many repeated characters
  • Space Complexity: O(m + σ) - Where σ is alphabet size (256 for ASCII)

Boyer-Moore: Why It Works

Bad Character Rule

When we find a mismatch at position j in pattern:

  • If the mismatched character doesn't exist in pattern, we can shift past it entirely
  • If it exists, we align it with its rightmost occurrence
Text: ...XABCDEF... Pattern: ABXABC ^mismatch at 'D' vs 'B' 'D' not in pattern, shift entire pattern past 'D'

Good Suffix Rule

When a suffix of the pattern has matched but there's a mismatch:

  • Look for another occurrence of the matched suffix in the pattern
  • Align that occurrence with the matched text
Text: ...CABAB... Pattern: BABAB ^^matched "AB" Shift to align another "AB" in pattern

Key Takeaways

  1. Boyer-Moore is often the fastest string matching algorithm in practice
  2. Right-to-left comparison allows larger shifts on mismatch
  3. Two heuristics complement each other: Bad character for random text, good suffix for repetitive patterns
  4. Preprocessing is O(m + σ) but pays off for long texts
  5. Used in text editors, search tools (grep), and databases
  6. For single pattern search in large texts, Boyer-Moore often outperforms KMP