String searching

To keep my algorithmic thinking skills from degrading I’ve decided to implement simple algorithms in C. I decided to start really easy and decided to solve the classic string search problem. The number of subtleties surprised me.

The problem can be casually stated as “Give a string, find the first occurrence within a text.”

My naive solution was actually incorrect. I read the input character by character and checked if the characters of the pattern string matched one at a time. If they did not, I simply restarted the procedure at the character which failed the match.

Finding a counter-example to the above incorrect algorithm is not immediately obvious if you think about English words. Given an arbitrary text and search pattern, however one quickly finds that searching for AAB in the text AAAB fails since after failing on the 3rd ‘A’ it starts at the 3rd ‘A’ and fails to back-track. It is an interesting question however to try to figure out which conditions on the pattern string need to be imposed for the algorithm to be correct. In fact, one is likely to stumble upon a much faster algorithm than the correct naive approach, the KMP algorithm.