The Levenshtein rules

I continued my exploration of the Levenshtein distance by writing an implementation in the Drools Rule Language a.k.a. DRL. It turned out to be very interesting because the properties of DRL forced me to formulate the problem in a completely different way.

This isn't the right place to delve into rule engines but I'll try to explain the basics. A DRL program is executed in a working memory containing facts and rules. A rule is a simple statement of the form

when
    the working memory is in such and such state
then
    perform these actions
    (potentially affecting the working memory)

The when part contains a fact pattern. When a fact is inserted into the working memory, all rules whose when parts match the new fact get their then parts executed. This may cause other facts to appear in the working memory, triggering a cascade of rule firings. Well-written rules adhere to two important principles:

  1. The then part should contain no conditional logic (as Drools creator Mark Proctor says, it should be "if this then that", not "if this then maybe that"). All decision-making should be expressed in the when sections.
  2. The rules should have the same semantics regardless of their execution order (e.g. when a new fact matches two rules it shouldn't matter which fires first).

As you can see, a rule author gives up a lot of control over the program flow. The idea is to specify what should happen and let the rule engine figure out how to do it. The way it looks in practice is that you decompose your input into very small parts that are straightforward to reason about. From that you can formulate rules that let the engine construct the desired result.
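
For a taste of what the real thing looks like, here is a complete rule in actual DRL syntax; the Person fact and its fields are made up purely for illustration:

// A made-up example of real DRL syntax: greet every adult Person fact.
rule "Greet adults"
when
    // Pattern: match any Person fact whose age field is at least 18.
    $person : Person( age >= 18 )
then
    // Unconditional action: the decision was already made in the when part.
    System.out.println("Hello, " + $person.getName());
end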

My original Levenshtein distance algorithm used the concepts of identical and orthogonal sub-words. Those are not really suitable for a rule engine because their discovery is in itself quite complex. I replaced them with the idea of character locations. A character location is a simple object that says "there is an 'a' at offset 2 in the word 'banana'". Converting a word into character locations is trivial and I can then throw them into the working memory as new facts (the examples use pseudo-code rather than actual DRL syntax):

when
    word : String
then
    for offset from 1 to word.length
        insert CharacterLocation(word, offset)
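
Translated into actual DRL, this might look roughly like the sketch below; the CharacterLocation fact class and its constructor are assumptions about the fact model (and Java offsets are zero-based here):

// Sketch in real DRL syntax; CharacterLocation is an assumed fact class.
rule "Decompose word into character locations"
when
    $word : String()
then
    // Insert one CharacterLocation fact per character of the word.
    for (int offset = 0; offset < $word.length(); offset++) {
        insert(new CharacterLocation($word, offset, $word.charAt(offset)));
    }
end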

The rule will be triggered for each of the words as they are inserted into the working memory. Armed with a bunch of CharacterLocations, I can identify character matches:

when
    location1, location2 : CharacterLocation
    location1.character == location2.character
    location1.word != location2.word
then
    insert Match(location1, location2)
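
A rough DRL equivalent, assuming CharacterLocation exposes character and word as readable properties:

rule "Find character matches"
when
    $a : CharacterLocation()
    // Second location: same character, but taken from the other word.
    $b : CharacterLocation( character == $a.character, word != $a.word )
then
    insert(new Match($a, $b));
end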

This rule, in turn, will be triggered for each suitable pair of CharacterLocations, generating all possible Matches:

[image: invalid match set]

For the Levenshtein distance I need a combination of Matches that covers as much of the two words as possible. Not every combination makes sense:

[image: invalid match set]

so I'm actually looking for sequences of strictly ordered Matches (one Match comes before another if its offsets are smaller in both words), such as this one:

[image: valid match set]

Generating valid sequences takes a bit of induction. I first create "seeds" - sequences containing just two Matches:

when
    x, y : Match
    x < y
    not exists Sequence s (s.contains(x, y))
then
    insert Sequence(x, y)
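
In DRL this could be sketched as follows, assuming each Match keeps its two CharacterLocations (with location1 always referring to the same word) and Sequence has a boolean containsBoth method:

rule "Seed two-match sequences"
when
    $x : Match()
    // "x < y": $y must come strictly after $x in both words.
    $y : Match( location1.offset > $x.location1.offset,
                location2.offset > $x.location2.offset )
    // Skip the pair if some existing Sequence already contains both matches.
    not Sequence( containsBoth($x, $y) )
then
    insert(new Sequence($x, $y));
end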

I proceed to grow each Sequence from the middle, using the visited set to avoid creating the same one twice:

when
    x, y, candidate : Match
    s : Sequence
    x < candidate < y
    !s.visited.contains(candidate)
    !s.contains(candidate)
    s.contains(x, y)
    s.indexOf(y) == s.indexOf(x) + 1
then
    insert s.clone(visited += candidate)
    s.insert(candidate between x and y)
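
A very rough DRL sketch of this step; the matches collection, the visited set and the helper methods on Sequence are all assumptions about the fact model:

rule "Grow sequence from the middle"
when
    $s : Sequence()
    // Two matches that sit next to each other in the sequence...
    $x : Match() from $s.matches
    $y : Match( this == $s.nextAfter($x) ) from $s.matches
    // ...and a candidate that fits strictly between them in both words...
    $c : Match( location1.offset > $x.location1.offset,
                location2.offset > $x.location2.offset,
                location1.offset < $y.location1.offset,
                location2.offset < $y.location2.offset )
    // ...which this particular sequence has neither used nor ruled out.
    eval( !$s.getVisited().contains($c) && !$s.contains($c) )
then
    // One copy remembers that the candidate was deliberately skipped...
    insert($s.cloneWithVisited($c));
    // ...while this sequence takes the candidate on board.
    modify($s) { insertBetween($c, $x, $y) }
end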

The distance corresponding to a Sequence is determined by the gaps it leaves open; each gap costs the larger of its two lengths, since you can substitute characters while both words still have some left and must insert or delete the rest:

[image: sequence with gaps]

so once all valid Sequences have been generated, I simply pick the best one:

when
    there are no other rules left to run
    s : set of all Sequence instances
then
    print "Found distance %s" % min(s.map(_.distance))

And that's it. From a complexity point of view, the algorithm is quite a pig. It explores the entire solution space and doesn't even use the best result found so far for pruning. Nor is it easily parallelizable, with all the each-on-each semantics going on. It does, however, stick to the rule-based declarative approach, so performance is the rule engine's problem ;-)
