Articles tagged with "algorithms"

The Levenshtein rules

I continued my exploration of the Levenshtein distance by writing an implementation in the Drools Rule Language a.k.a. DRL. It turned out to be very interesting because the properties of DRL forced me to formulate the problem in a completely different way.

This isn't the right place to delve into rule engines but I'll try to explain the basics. A DRL program is executed in a working memory containing facts and rules. A rule is a simple statement of the form

when
    the working memory is in such and such state
then
    perform these actions
    (potentially affecting the working memory)

The when part contains a fact pattern. When a fact is inserted into the working memory, all rules whose when parts match the new fact get their then parts executed. This may cause other facts to appear in the working memory, triggering a cascade of rule firing. Well-written rules adhere to two important principles:

  1. The then part should contain no conditional logic (as Drools creator Mark Proctor says, it should be "if this then that", not "if this then maybe that"). All decision-making should be expressed in the when sections.
  2. The rules should have the same semantics regardless of their execution order (e.g. when a new fact matches two rules it shouldn't matter which fires first).

As you can see, a rule author gives up a lot of control over the program flow. The idea is to specify what should happen and let the rule engine figure out how to do it. The way it looks in practice is that you decompose your input into very small parts that are straightforward to reason about. From that you can formulate rules that let the engine construct the desired result.
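To get a feel for the fire-on-insert behaviour, here's a minimal toy sketch in Python. It is emphatically not Drools (no Rete network, no conflict resolution, no retraction) and the ToyEngine class is purely hypothetical, but it shows the shape of the when/then contract: rules are condition/action pairs, and inserting a fact makes every matching rule fire, possibly cascading.

# A toy forward-chaining loop, NOT Drools: rules are (when, then) pairs where
# "when" inspects a newly inserted fact and "then" may insert further facts.
class ToyEngine:
    def __init__(self, rules):
        self.rules = rules          # list of (condition, action) callables
        self.memory = []            # the working memory: a plain list of facts

    def insert(self, fact):
        self.memory.append(fact)
        # Every rule whose "when" part matches the new fact fires immediately,
        # possibly cascading by inserting more facts.
        for condition, action in self.rules:
            if condition(fact, self.memory):
                action(fact, self)

# Example: a single rule that turns every inserted word into a "length fact".
rules = [(
    lambda fact, memory: isinstance(fact, str),
    lambda word, engine: engine.insert(("length", word, len(word))),
)]
engine = ToyEngine(rules)
engine.insert("banana")
print(engine.memory)   # ['banana', ('length', 'banana', 6)]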

My original Levenshtein distance algorithm used the concepts of identical and orthogonal sub-words. Those are not really suitable for a rule engine because their discovery is in itself quite complex. I replaced them with the idea of character locations. A character location is a simple object that says "there is an 'a' at offset 2 in the word 'banana'". Converting a word into character locations is trivial and I can then throw them into the working memory as new facts (the examples use pseudo-code rather than actual DRL syntax):

when
    word : String
then
    for offset from 1 to word.length
    insert CharacterLocation(word, offset)
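Outside a rule engine, the same explosion of a word into facts is a one-liner. A rough Python equivalent, assuming 0-based offsets (unlike the 1-based ones in the rule above) and a simple named tuple standing in for the CharacterLocation fact:

from collections import namedtuple

# A hypothetical stand-in for the CharacterLocation fact.
CharacterLocation = namedtuple("CharacterLocation", ["word", "offset", "character"])

def character_locations(word):
    # One fact per character: "there is an 'a' at offset 1 in 'banana'".
    return [CharacterLocation(word, offset, character)
            for offset, character in enumerate(word)]

print(character_locations("banana")[1])
# CharacterLocation(word='banana', offset=1, character='a')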

The DRL rule above will be triggered for each of the words as they are inserted into the working memory. Armed with a bunch of CharacterLocations, I can identify character matches:

when
    location1, location2 : CharacterLocation
    location1.character == location2.character
    location1.word != location2.word
then
    insert Match(location1, location2)

This rule, in turn, will be triggered for each suitable pair of CharacterLocations, generating all possible Matches:

[figure: all possible matches between the two words]
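Expressed imperatively, the same cross-product is just a nested loop over the two words' positions. A small Python sketch, where a match is simply a pair of offsets, one into each word:

from itertools import product

def matches(word1, word2):
    # Pair every position in one word with every same-character
    # position in the other word.
    return [(x, y) for (x, a), (y, b) in product(enumerate(word1), enumerate(word2))
            if a == b]

print(len(matches("darted", "stare")))   # 4: one pair each for 'a', 'r', 't' and 'e'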

For the Levenshtein distance I need a combination of Matches that covers as much of the two words as possible. Not every combination makes sense:

[figure: an invalid match set]

so I'm actually looking for sequences of strictly ordered Matches, such as this one:

[figure: a valid match set]

Generating valid sequences takes a bit of induction. I first create "seeds" - sequences containing just two Matches:

when
    x, y : Match
    x < y
    not exists Sequence s (s.contains(x, y))
then
    insert Sequence(x, y)
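The pseudo-code leans on an ordering of Matches. My reading of "x < y" (an assumption; the rule above doesn't spell it out) is that both coordinates must strictly increase: x's offset in the left word and its offset in the right word both come before y's. With matches as offset pairs, the predicate and the seed generation could look like this in Python:

from itertools import permutations

def before(x, y):
    # Assumed meaning of "x < y": both offsets strictly increase,
    # so the two matches neither cross nor share a character.
    return x[0] < y[0] and x[1] < y[1]

def seeds(matches):
    # Every ordered pair of matches that respects the ordering
    # becomes a two-element seed sequence.
    return [[x, y] for x, y in permutations(matches, 2) if before(x, y)]

# The four "darted" / "stare" matches: 'a', 'r', 't', 'e'.
print(len(seeds([(1, 2), (2, 3), (3, 1), (4, 4)])))   # 4 seeds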

I proceed to grow each Sequence from the middle, using the visited set to avoid creating the same one twice:

when
    x, y, candidate : Match
    s : Sequence
    x < candidate < y
    !s.visited.contains(candidate)
    !s.contains(candidate)
    s.contains(x, y)
    s.indexOf(y) == s.indexOf(x) + 1
then
    insert s.clone(visited += candidate)
    s.insert(candidate between x and y)
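For comparison, outside the engine the same family of sequences can be enumerated directly. The sketch below is not the rule-based seed-and-grow construction above, just its imperative cousin; it also keeps single-match chains, which the rules above never create:

def chains(matches):
    # Enumerate every strictly increasing chain of matches, where
    # "increasing" means both offsets grow from one match to the next.
    def before(x, y):
        return x[0] < y[0] and x[1] < y[1]
    def extend(chain, remaining):
        yield chain
        for i, m in enumerate(remaining):
            if not chain or before(chain[-1], m):
                yield from extend(chain + [m], remaining[i + 1:])
    # Sorting guarantees any valid successor appears later in the list;
    # the empty chain from the first yield is filtered out.
    return [c for c in extend([], sorted(matches)) if c]

print(len(chains([(1, 2), (2, 3), (3, 1), (4, 4)])))   # 9 chains for "darted" / "stare"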

The distance corresponding to a Sequence is determined by the gaps it leaves open:

[figure: a sequence and the gaps it leaves open]

so once all valid Sequences have been generated, I simply pick the best one:

when
    there are no other rules left to fire
    s : set of all Sequence instances
then
    print "Found distance %s" % min(s.map(_.distance))

And that's it. From a complexity point of view, the algorithm is quite a pig. It explores the entire solution space and doesn't even use the best-known result for pruning. It isn't even easily parallelizable, with all the each-on-each semantics going on. It does, however, stick to the rule-based declarative approach so performance is the rule engine's problem ;-)


The Levenshtein puzzle

I've read a few blog posts recently that mentioned the concept of Levenshtein distance. It's a measure of the difference between two strings defined as "the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other". The definition is very straightforward but when I thought about calculating it I saw no immediately obvious way. Rather than looking it up, I decided to discover an algorithm on my own, with nothing but the definition to start from.

After a period of requisite head-scratching and ad-hoc attempts I identified two trivial corner cases: identical words (d=0) and words with no letters in common (d=max(first.length, second.length), let's call them orthogonal). Then came the crucial realization: any pair of words can be chopped up into a sequence of identical and orthogonal sub-words:

[figure: "darted" and "stare" split into identical and orthogonal sub-words]

The total distance is then the sum of the distances of the orthogonal parts. Note that an orthogonal pair may consist of one empty and one non-empty string as well, such as "t" vs. "" in the example above.
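A sketch of that bookkeeping in Python, assuming the split is handed over as an ordered list of (left_part, right_part) pairs that concatenate back into the two words: identical pairs cost nothing, orthogonal pairs cost the length of their longer part.

def split_distance(split):
    # Identical pairs contribute 0, orthogonal pairs max(len, len).
    return sum(0 if l == r else max(len(l), len(r)) for l, r in split)

# One possible split of "darted" / "stare" (consistent with the "t" vs. ""
# pair mentioned above): d|ar|t|e|d against st|ar|-|e|-
print(split_distance([("d", "st"), ("ar", "ar"), ("t", ""), ("e", "e"), ("d", "")]))
# 2 + 0 + 1 + 0 + 1 = 4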

Trouble is, there may be more than one way to slice the words:

[figure: first split of "barbara" and "abracadabra"]

[figure: second split of "barbara" and "abracadabra"]

[figure: third split of "barbara" and "abracadabra"]

and so on. The distances corresponding to these splits are 10, 11 and 8, respectively. The actual minimal distance is 6; discovering the correct split (or splits) is left as an exercise for the reader. The way my algorithm goes about it is a straightforward exhaustive trawling of the solution space. In Python:

def distance(left, right):
    # Worst case: the words have nothing in common (orthogonal),
    # so the distance is the length of the longer one.
    result = max(len(left), len(right))
    for x in range(len(left)):
        letter = left[x]
        # Try every location of this letter in the right word.
        y = right.find(letter)
        while y != -1:
            subLeft = left[x:]
            subRight = right[y:]
            # The parts before the match are treated as orthogonal.
            beforeMatch = max(x, y)
            # Length of the identical prefix shared by subLeft and subRight.
            match = 1
            while match < min(len(subLeft), len(subRight)) and subLeft[match] == subRight[match]:
                match += 1
            afterMatch = distance(subLeft[match:], subRight[match:])
            newDistance = beforeMatch + afterMatch
            result = min(result, newDistance)
            y = right.find(letter, y + 1)
    return result
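For example, assuming the transcription above is faithful to the real implementation:

print(distance("darted", "stare"))          # 4
print(distance("barbara", "abracadabra"))   # 6, the minimal distance claimed above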

As you can see, it's a recursive function not amenable to tail-call optimization, so it's prone to overflowing the stack, among other things. There's a ton of potential for performance improvement. One thing I actually have in my implementation is that when beforeMatch >= result I don't go into recursion, as it can't possibly produce a lower newDistance. This mini-optimization is omitted from the code above for clarity.

Other than that, proper ordering seems to be the key. The algorithm is asymmetric in that it "starts from" the left word and tries to "reach" the right one. Should it always start from the shorter word or from the longer one? Or from the word with fewer unique letters or more unique letters? Should the letters with most locations in the starting word be tried first? Or the ones closest to its beginning?

A proper complexity analysis (or a robust set of test pairs with known Levenshtein distances) would answer those questions, increasing the likelihood of encountering good matches early, cutting off bigger branches of the search tree. Alas, I have no time for such work, no matter how much fun it would be and how much I'd learn from it. I've solved the puzzle itself and the optimization paths are all well trodden by more capable explorers, I'm sure. Also, given that my solution completely ignores the distribution of letters in the "target" word, there's bound to be a fundamentally better, more symmetric approach. I'm looking forward to reading up on it :-)

My original implementation is in JavaScript with a web page for easy testing of correctness. I later rewrote it as command-line scripts in node.js, Python and Go in order to compare performance. Surprisingly, Go seems to be only about 33% faster than both node.js and Python. Mind you, I don't know any of those languages intimately enough to be sure I didn't screw something up performance-wise, so the comparison is very rough and not serious in any way. Tasting the different languages' flavors was great fun, though, and I'm itching for C and Drools implementations if I find the time. A functional variant in Scala or Clojure would also be nice but swapping between the imperative and functional mind-sets is really time-consuming.

