I continued my exploration of the Levenshtein distance by writing an implementation in the Drools Rule Language a.k.a. DRL. It turned out to be very interesting because the properties of DRL forced me to formulate the problem in a completely different way.
This isn't the right place to delve into rule engines but I'll try to explain the basics. A DRL program is executed in a working memory containing facts and rules. A rule is a simple statement of the form
when
the working memory is in such and such state
then
perform these actions
(potentially affecting the working memory)
The when part contains a fact pattern. When a fact is inserted into the working memory all rules whose when parts match the new fact get their then parts executed. This may cause other facts to appear in the working memory, triggering a cascade of rule firing. Well-written rules adhere to two important principles:
- The then part should contain no conditional logic (as Drools creator Mark Proctor says, it should be "if this then that", not "if this then maybe that"). All decision-making should be expressed in the when sections.
- The rules should have the same semantics regardless of their execution order (e.g. when a new fact matches two rules it shouldn't matter which fires first).
As you can see, a rule author gives up a lot of control over the program flow. The idea is to specify what should happen and let the rule engine figure out how to do it. The way it looks in practice is that you decompose your input into very small parts that are straightforward to reason about. From that you can formulate rules that let the engine construct the desired result.
My original Levenshtein distance algorithm used the concepts of identical and orthogonal sub-words. Those are not really suitable for a rule engine because their discovery is in itself quite complex. I replaced them with the idea of character locations. A character location is a simple object that says "there is an 'a' at offset 2 in the word 'banana'". Converting a word into character locations is trivial and I can then throw them into the working memory as new facts (the examples use pseudo-code rather than actual DRL syntax):
when
word : String
then
for offset from 1 to word.length
insert CharacterLocation(word, offset)
The rule will be triggered for each of the words as they are inserted into the working memory. Armed with a bunch of CharacterLocations, I can identify character matches:
when
location1, location2 : CharacterLocation
location1.character == location2.character
location1.word != location2.word
then
insert Match(location1, location2)
This rule, in turn, will be triggered for each suitable pair of CharacterLocations, generating all possible Matches:
For the Levenshtein distance I need a combination of Matches that covers as much of the two words as possible. Not every combination makes sense:
so I'm actually looking for sequences of strictly ordered Matches, such as this one:
Generating valid sequences takes a bit of induction. I first create "seeds" - sequences containing just two Matches:
when
x, y : Match
x < y
not exists Sequence s (s.contains(x, y))
then
insert Sequence(x, y)
I proceed to grow each Sequence from the middle, using the visited
set to avoid creating the same one twice:
when
x, y, candidate : Match
s : Sequence
x < candidate < y
!s.visited.contains(candidate)
!s.contains(candidate)
s.contains(x, y)
s.indexOf(y) == s.indexOf(x) + 1
then
insert s.clone(visited += candidate)
s.insert(candidate between x and y)
The distance corresponding to a Sequence is determined by the gaps it leaves open:
so once all valid Sequences have been generated, I simply pick the best one:
when
there are no more other rules to run
s : set of all Sequence instances
then
print "Found distance %s" % min(s.map(_.distance))
And that's it. From a complexity point of view, the algorithm is quite a pig. It explores the entire solution space and doesn't even use the best-known result for pruning. It isn't even easily parallelizable, with all the each-on-each semantics going on. It does, however, stick to the rule-based declarative approach so performance is the rule engine's problem ;-)