I've read a few blog posts recently that mentioned the concept of Levenshtein distance. It's a measure of the difference between two strings, defined as "the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other". The definition is very straightforward, but when I thought about calculating it, I saw no immediately obvious way to go about it. Rather than looking it up, I decided to discover an algorithm on my own, with nothing but the definition to start from.
After a period of requisite head-scratching and ad-hoc attempts I identified two trivial corner cases: identical words (d = 0) and words with no letters in common (d = max(first.length, second.length); let's call such pairs orthogonal). Then came the crucial realization: any pair of words can be chopped up into a sequence of identical and orthogonal sub-words:
d  | ar | t | e | d
st | ar |   | e |
The total distance is then the sum of the distances of the orthogonal parts. Note that an orthogonal pair may consist of one empty and one non-empty string as well, such as "t" vs. "" in the example above.
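For that split the orthogonal pairs are "d"/"st", "t"/"" and "d"/"", so the total comes to max(1, 2) + max(1, 0) + max(1, 0) = 4. In Python terms:

# Sum of max-lengths of the orthogonal pairs; the identical pairs cost nothing.
print(sum(max(len(a), len(b)) for a, b in [("d", "st"), ("t", ""), ("d", "")]))  # prints 4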
Trouble is, there may be more than one way to slice the words:
barbara
abracadabra
barbara
abracadabra
barbara
abracadabra
and so on. The distances corresponding to these splits are 10, 11 and 8, respectively. The actual minimal distance is 6; discovering the correct split (or splits) is left as an exercise for the reader. The way my algorithm goes about it is a straightforward exhaustive trawling of the solution space. In Python:
def distance(left, right):
    # Worst case: price the whole pair as one orthogonal block.
    result = max(len(left), len(right))
    for x, letter in enumerate(left):
        # Try every location of this letter in the right word as a match point.
        for y in (i for i, c in enumerate(right) if c == letter):
            subLeft = left[x:]
            subRight = right[y:]
            # Everything before the match point is priced as an orthogonal pair.
            beforeMatch = max(x, y)
            # Length of the identical prefix of subLeft and subRight.
            match = 0
            while (match < len(subLeft) and match < len(subRight)
                   and subLeft[match] == subRight[match]):
                match += 1
            afterMatch = distance(subLeft[match:], subRight[match:])
            newDistance = beforeMatch + afterMatch
            result = min(result, newDistance)
    return result
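Trying it out on the two pairs from above:

print(distance("darted", "stare"))         # prints 4
print(distance("barbara", "abracadabra"))  # prints 6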
As you can see, it's a recursive function not amenable to tail-call optimization, so it's prone to overflowing the stack, among other things. There's a ton of potential for performance improvement. One thing I actually have in my implementation is that when beforeMatch >= result I don't go into recursion, as it can't possibly produce a lower newDistance. This mini-optimization is omitted from the code above for clarity.
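Grafted onto the Python above, that cutoff looks roughly like this (a sketch of the idea, not a copy of my actual code):

def distance_pruned(left, right):
    result = max(len(left), len(right))
    for x, letter in enumerate(left):
        for y in (i for i, c in enumerate(right) if c == letter):
            beforeMatch = max(x, y)
            # Cutoff: this branch can't possibly beat the best split found so far.
            if beforeMatch >= result:
                continue
            subLeft = left[x:]
            subRight = right[y:]
            match = 0
            while (match < len(subLeft) and match < len(subRight)
                   and subLeft[match] == subRight[match]):
                match += 1
            result = min(result, beforeMatch + distance_pruned(subLeft[match:], subRight[match:]))
    return result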
Other than that, proper ordering seems to be the key. The algorithm is asymmetric in that it "starts from" the left word and tries to "reach" the right one. Should it always start from the shorter word or from the longer one? Or from the word with fewer unique letters or more unique letters? Should the letters with most locations in the starting word be tried first? Or the ones closest to its beginning?
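Just to make one of those orderings concrete, here's an untested sketch (the helper name is made up for this post) of trying the starting word's letters with the most locations first:

def by_most_locations_first(left):
    # Indices of the starting word, letters with the most occurrences in it first.
    return sorted(range(len(left)), key=lambda x: -left.count(left[x]))

The outer loop of distance would then run over by_most_locations_first(left) instead of walking the word front to back. Whether that actually helps is exactly the kind of question I haven't answered.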
A proper complexity analysis (or a robust set of test pairs with known Levenshtein distances) would answer those questions and increase the likelihood of encountering good matches early, cutting off bigger branches of the search tree. Alas, I have no time for such work, no matter how much fun it would be and how much I'd learn from it. I've solved the puzzle itself, and the optimization paths are all well trodden by more capable explorers, I'm sure. Also, given that my solution completely ignores the distribution of letters in the "target" word, there's bound to be a fundamentally better, more symmetric approach. I'm looking forward to reading up on it :-)
My original implementation is in JavaScript, with a web page for easy testing of correctness. I later rewrote it as command-line scripts in node.js, Python and Go in order to compare performance. Surprisingly, Go seems to be only about 33% faster than both node.js and Python. Mind you, I don't know any of those languages intimately enough to be sure I didn't screw something up performance-wise, so the comparison is very rough and not serious in any way. Tasting the different languages' flavors was great fun, though, and I'm itching for C and Drools implementations if I find the time. A functional variant in Scala or Clojure would also be nice, but swapping between the imperative and functional mind-sets is really time-consuming.