StringProblem 25 of 43
Boyer Moore Algorithm for Pattern Searching
Problem Statement
Given a text string and a pattern string, find all occurrences of the pattern in the text using the Boyer-Moore algorithm. This algorithm is one of the most efficient string matching algorithms for practical use cases.
Example:
- Input: text = "ABAAABCDABC", pattern = "ABC"
- Output: Pattern found at indices [4, 8]
- Explanation: "ABC" appears at positions 4 and 8 in the text
Approach 1: Brute Force (Naive Pattern Matching)
Intuition
Slide the pattern over the text one character at a time and check for a match at each position.
Algorithm
- For each position i from 0 to n-m in text
- Check if pattern matches starting at position i
- If all characters match, record the position
- Move to next position
java
import java.util.*;
public class Solution {
public List<Integer> naiveSearch(String text, String pattern) {
List<Integer> result = new ArrayList<>();
int n = text.length();
int m = pattern.length();
for (int i = 0; i <= n - m; i++) {
int j;
for (j = 0; j < m; j++) {
if (text.charAt(i + j) != pattern.charAt(j)) {
break;
}
}
if (j == m) {
result.add(i);
}
}
return result;
}
}Complexity Analysis
- Time Complexity: O(n * m) - For each position, we may compare up to m characters
- Space Complexity: O(1) - No extra space needed
Approach 2: Boyer-Moore Algorithm (Optimal)
Intuition
Boyer-Moore works by comparing the pattern from right to left and using two heuristics to skip alignments:
- Bad Character Rule: When a mismatch occurs, shift the pattern to align the bad character in text with its rightmost occurrence in the pattern
- Good Suffix Rule: When a mismatch occurs, use information about the matched suffix to determine the shift
Algorithm
- Preprocess pattern for bad character heuristic
- Preprocess pattern for good suffix heuristic
- Align pattern at the beginning of text
- Compare from right to left
- On mismatch, shift by max of both heuristics
- Repeat until pattern goes past text end
java
import java.util.*;
public class BoyerMoore {
private static final int ALPHABET_SIZE = 256;
private int[] preprocessBadCharacter(String pattern) {
int m = pattern.length();
int[] badChar = new int[ALPHABET_SIZE];
Arrays.fill(badChar, -1);
for (int i = 0; i < m; i++) {
badChar[pattern.charAt(i)] = i;
}
return badChar;
}
private int[] preprocessGoodSuffix(String pattern) {
int m = pattern.length();
int[] shift = new int[m + 1];
int[] borderPos = new int[m + 1];
int i = m, j = m + 1;
borderPos[i] = j;
while (i > 0) {
while (j <= m && pattern.charAt(i - 1) != pattern.charAt(j - 1)) {
if (shift[j] == 0) {
shift[j] = j - i;
}
j = borderPos[j];
}
i--;
j--;
borderPos[i] = j;
}
// Case 2: Prefix matches suffix
j = borderPos[0];
for (i = 0; i <= m; i++) {
if (shift[i] == 0) {
shift[i] = j;
}
if (i == j) {
j = borderPos[j];
}
}
return shift;
}
public List<Integer> search(String text, String pattern) {
List<Integer> result = new ArrayList<>();
int n = text.length();
int m = pattern.length();
if (m == 0 || n < m) return result;
int[] badChar = preprocessBadCharacter(pattern);
int[] shift = preprocessGoodSuffix(pattern);
int s = 0;
while (s <= n - m) {
int j = m - 1;
while (j >= 0 && pattern.charAt(j) == text.charAt(s + j)) {
j--;
}
if (j < 0) {
result.add(s);
s += shift[0];
} else {
int badCharShift = j - badChar[text.charAt(s + j)];
int goodSuffixShift = shift[j + 1];
s += Math.max(badCharShift, goodSuffixShift);
}
}
return result;
}
}Complexity Analysis
- Time Complexity:
- Average case: O(n/m) - Sublinear due to skipping
- Worst case: O(n * m) - When pattern has many repeated characters
- Space Complexity: O(m + σ) - Where σ is alphabet size (256 for ASCII)
Boyer-Moore: Why It Works
Bad Character Rule
When we find a mismatch at position j in pattern:
- If the mismatched character doesn't exist in pattern, we can shift past it entirely
- If it exists, we align it with its rightmost occurrence
Text: ...XABCDEF...
Pattern: ABXABC
^mismatch at 'D' vs 'B'
'D' not in pattern, shift entire pattern past 'D'
Good Suffix Rule
When a suffix of the pattern has matched but there's a mismatch:
- Look for another occurrence of the matched suffix in the pattern
- Align that occurrence with the matched text
Text: ...CABAB...
Pattern: BABAB
^^matched "AB"
Shift to align another "AB" in pattern
Key Takeaways
- Boyer-Moore is often the fastest string matching algorithm in practice
- Right-to-left comparison allows larger shifts on mismatch
- Two heuristics complement each other: Bad character for random text, good suffix for repetitive patterns
- Preprocessing is O(m + σ) but pays off for long texts
- Used in text editors, search tools (grep), and databases
- For single pattern search in large texts, Boyer-Moore often outperforms KMP