meandering journey

The Levenshtein rules

I continued my exploration of the Levenshtein distance by writing an implementation in the Drools Rule Language a.k.a. DRL. It turned out to be very interesting because the properties of DRL forced me to formulate the problem in a completely different way.

This isn't the right place to delve into rule engines but I'll try to explain the basics. A DRL program is executed in a working memory containing facts and rules. A rule is a simple statement of the form

when
    the working memory is in such and such state
then
    perform these actions
    (potentially affecting the working memory)

The when part contains a fact pattern. When a fact is inserted into the working memory all rules whose when parts match the new fact get their then parts executed. This may cause other facts to appear in the working memory, triggering a cascade of rule firing. Well-written rules adhere to two important principles:

The then part should contain no conditional logic (as Drools creator Mark Proctor says, it should be "if this then that", not "if this then maybe that"). All decision-making should be expressed in the when sections.
The rules should have the same semantics regardless of their execution order (e.g. when a new fact matches two rules it shouldn't matter which fires first).

As you can see, a rule author gives up a lot of control over the program flow. The idea is to specify what should happen and let the rule engine figure out how to do it. The way it looks in practice is that you decompose your input into very small parts that are straightforward to reason about. From that you can formulate rules that let the engine construct the desired result.

My original Levenshtein distance algorithm used the concepts of identical and orthogonal sub-words. Those are not really suitable for a rule engine because their discovery is in itself quite complex. I replaced them with the idea of character locations. A character location is a simple object that says "there is an 'a' at offset 2 in the word 'banana'". Converting a word into character locations is trivial and I can then throw them into the working memory as new facts (the examples use pseudo-code rather than actual DRL syntax):

when
    word : String
then
    for offset from 1 to word.length
    insert CharacterLocation(word, offset)

The rule will be triggered for each of the words as they are inserted into the working memory. Armed with a bunch of CharacterLocations, I can identify character matches:

when
    location1, location2 : CharacterLocation
    location1.character == location2.character
    location1.word != location2.word
then
    insert Match(location1, location2)

This rule, in turn, will be triggered for each suitable pair of CharacterLocations, generating all possible Matches:

invalid match set

For the Levenshtein distance I need a combination of Matches that covers as much of the two words as possible. Not every combination makes sense:

invalid match set

so I'm actually looking for sequences of strictly ordered Matches, such as this one:

valid match set

Generating valid sequences takes a bit of induction. I first create "seeds" - sequences containing just two Matches:

when
    x, y : Match
    x < y
    not exists Sequence s (s.contains(x, y))
then
    insert Sequence(x, y)

I proceed to grow each Sequence from the middle, using the visited set to avoid creating the same one twice:

when
    x, y, candidate : Match
    s : Sequence
    x < candidate < y
    !s.visited.contains(candidate)
    !s.contains(candidate)
    s.contains(x, y)
    s.indexOf(y) == s.indexOf(x) + 1
then
    insert s.clone(visited += candidate)
    s.insert(candidate between x and y)

The distance corresponding to a Sequence is determined by the gaps it leaves open:

sequence with gaps

so once all valid Sequences have been generated, I simply pick the best one:

when
    there are no more other rules to run
    s : set of all Sequence instances
then
    print "Found distance %s" % min(s.map(_.distance))

And that's it. From a complexity point of view, the algorithm is quite a pig. It explores the entire solution space and doesn't even use the best-known result for pruning. It isn't even easily parallelizable, with all the each-on-each semantics going on. It does, however, stick to the rule-based declarative approach so performance is the rule engine's problem ;-)

The Levenshtein puzzle

I've read a few blog posts recently that mentioned the concept of Levenshtein distance. It's a measure of the difference between two strings defined as "the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other". The definition is very straightforward but when I thought about calculating it I saw no immediately obvious way. Rather than looking it up, I decided to discover an algorithm on my own, with nothing but the definition to start from.

After a period of requisite head-scratching and ad-hoc attempts I identified two trivial corner cases: identical words (d=0) and words with no letters in common (d=max(first.length, second.length), let's call them orthogonal). Then came the crucial realization: any pair of words can be chopped up into a sequence of identical and orthogonal sub-words:

darted
stare

The total distance is then the sum of the distances of the orthogonal parts. Note that an orthogonal pair may consist of one empty and one non-empty string as well, such as "t" vs. "" in the example above.

Trouble is, there may be more than one way to slice the words:

barbara
abracadabra

and so on. The distances corresponding to these splits are 10, 11 and 8, respectively. The actual minimal distance is 6; discovering the correct split (or splits) is left as an exercise for the reader. The way my algorithm goes about it is a straightforward exhaustive trawling of the solution space. In pseudo-Python:

def distance(left, right):
  result = max(left.length, right.length)
  for each index x in left:
     letter = left[x]
     for each location y of letter in right:
       subLeft  = left[x:]
       subRight = right[y:]
       beforeMatch = max(x, y)
       match = length of the identical prefix of subLeft and subRight
       afterMatch = distance(subLeft[match:], subRight[match:])
       newDistance = beforeMatch + afterMatch
       result = min(result, newDistance)
  return result

As you can see, it's a recursive function not amenable to tail-call optimization so it's prone to overflowing the stack, among other things. There's a ton of potential for performance improvement. One thing I actually have in my implementation is that when beforeMatch >= result I don't go into recursion as it can't possibly produce a lower newDistance. This mini-optimization is omitted from the pseudo-code for clarity.

Other than that, proper ordering seems to be the key. The algorithm is asymmetric in that it "starts from" the left word and tries to "reach" the right one. Should it always start from the shorter word or from the longer one? Or from the word with fewer unique letters or more unique letters? Should the letters with most locations in the starting word be tried first? Or the ones closest to its beginning?

A proper complexity analysis (or a robust set of test pairs with known Levenshtein distances) would answer those questions, increasing the likelihood of encountering good matches early, cutting off bigger branches of the search tree. Alas, I have no time for such work, no matter how much fun it would be and how much I'd learn from it. I've solved the puzzle itself and the optimization paths are all well trodden by more capable explorers, I'm sure. Also, given that my solution completely ignores the distribution of letters in the "target" word, there's bound to be a fundamentally better, more symmetric approach. I'm looking forward to reading up on it :-)

My original implementation is in JavaScript with a web page for easy testing of correctness. I later re-wrote it as command-line scripts in node.js, Python and Go in order to compare performance. Surprisingly, Go seems to be only about 33 % faster than both node.js and Python. Mind you, I don't know any of those languages intimately enough to be sure I didn't screw something up performance-wise so the comparison is very rough and not serious in any way. Tasting the different langages' flavors was great fun, though, and I'm itching for C and Drools implementations if I find the time. A functional variant in Scala or Clojure would also be nice but swapping between the imperative and functional mind-sets is really time-consuming.

Software as knowledge repository

Summary Software is the most relevant representation of knowledge about a business. It is crucial that this knowledge be accessible to decision-makers. When managers resent refactoring as "not adding business value", they overlook the value of software as a knowledge repository.

Software is the most relevant representation of knowledge about a business. Although it only implements automated workflows, those form an ever-growing part of business activities. Source code is the only 100% accurate and up-to-date documentation of such workflows and often reveals a lot about manual processes as well.

More importantly, software is executable knowledge. It doesn't simply describe automated processes - software becomes the process once running in a production environment.

The problem of access

It follows that a decision-maker trying to truly master the processes in a business needs fast, flexible and efficient access to software-bound knowledge. What's more, she needs read-write access in order to augment, extend and deepen the knowledge. It's a non-trivial requirement:

In software, business-specific knowledge is intertwined with lots of other concerns (infrastructure, ergonomics, security, governance, ...).
Source code addresses several audiences at once (end user, other programmers, compiler).
Software tends to be heterogeneous even in modestly-sized businesses (different authors, different programming languages and platforms, ...).
Software has to be precise enough for a computer to execute (i.e. extremely precise by human measure).

The way software is usually managed presents additional difficulties. Business knowledge gets "baked in" at various stages of a complicated dance involving decision-makers, analysts, architects, programmers, testers - oh, and hopefully also end users. The transformation progresses via specification documents, UML diagrams, Scrum stories, bug reports or what have you. Given sufficient skill and will on all sides, there is reasonable certainty that the software does what decision-makers intend it to do. The process is neither flexible, fast nor efficient, however. Improving this situation surely means competitive advantage.

Distilling knowledge

Many approaches have been tried to bring stakeholders closer to software (and many more will be tried as software grows more important). They often start by isolating business-specific parts of a program (a.k.a. "business logic") from the rest. The idea is sensible enough yet fraught with pitfalls:

The boundary between generic and business-specific code is often blurred.
Business-specific algorithms can exist at many levels of abstraction.
Local peculiarities creep into unforeseen places (masked as validation criteria etc.).
Isolating parameters is much easier than isolating an algorithm.

The last point is especially relevant. Many businesses end up with an agreed set of easy-to-tweak configuration settings, complemented by a change-request process that supposedly balances flexibility with maintainability. Such a set-up makes it all too easy to make the wrong trade-offs under schedule pressure. Another possible outcome is programming by configuration, i.e. spiking a program with so many parameters as to make both the code and the configuration extremely complex (this is the shadowy domain of SAP consultants).

Matters of presentation

When it comes to giving decision-makers useful access, extracting algorithms as well as parameters is clearly essential but won't suffice unless the CEO is fluent in the language in which the code is written. Making business logic palatable to non-technical folks is therefore a major area of activity featuring its own array of methods and tools, with Business Process Management (BPM) apparently in vogue these days, complemented by Business Rule Mangement (BRM).

Such tools are definitely useful but they only go part of the way. Even with declarative approaches such as BRM, understanding software requires algorithmic thinking. It's a hurdle no programming language nor diagramming technique will overcome. Algorithmic thinking is a skill and can be learned, however. The recent debate regarding whether everyone should learn to code is, at its core, about letting people comprehend the world around them as software pervades it. The concern is especially relevant for business decision-makers.

It will take a generation shift (maybe several) for algorithmic thinking to become common among non-programmers. Mediation between decision-makers and software is therefore poised to remain a feature of the business world for the foreseeable future. As mentioned, the process tends to be rather chaotic in practice and varies immensely between businesses. What's universal is the ever-stronger imperative to make it fast, flexible and efficient.

Perils of mediation

This is where agile methodologies enter the picture. The Agile Manifesto reads like a sigh of frustration with the communication barriers between programmers and other stakeholders. Yet it tackles precisely the interface between decision-makers and the software they pay for - which, as is hopefully clear by now, is a really tough nut to crack.

Agile has endured some backlash over the years but it represents great progress nevertheless. It clarifies that there are aspects to developing software that don't have technical solutions. That's always been obvious to managers but putting it in words programmers can understand was certainly commendable. Overall, agile practices do result in improvements when applied properly.

The most common cause of failure in agile projects seems to be misunderstanding and mistrust of the agile principles on the part of key stakeholders. The principles are thus not applied consistently and the project ends up a half-hearted faux-agile mess.

Two agile values are particularly difficult yet vital from the perspective of software as knowledge repository:

Working software tends to be misunderstood by non-programmers - they don't appreciate how fragile working software is and how easy it is to introduce bugs. Hence the ignorance of test coverage and the general disregard for software quality in the presence of other priorities.
Responding to change tends to be misinterpreted by programmers and non-programmers alike as giving in to schedule pressure. Requirement analysis becomes sloppy, necessary maintenance is skipped.

Both pitfalls lead to low-quality code that makes business-specific knowledge catastrophically opaque. The problem can arise surprisingly quickly and the consequences are serious. Requirements are not met, changes take lots of time and introduce bizarre bugs, developers quitting cause undue disruption - a typical project in disarray.

Refactoring to the rescue

The agile practice of refactoring aims to achieve high software quality without slowing down the pace of development, even increasing it. Briefly, refactoring means changing software in small incremental steps and confirming the changes through automated test runs (the test runs ensure the behavior of the software stays the same as its structure improves).

Refactoring is intended to be used early and often, preventing the build-up of hard-to-understand code. The proposition is counter-intuitive because it sounds like extra work. As a result, many teams never adopt the practice. It's a shame: refactoring does speed up coding by slashing the cognitive load imposed on the programmer. In other words, there are far fewer things to worry about when modifications are modest and unit tests are robust.

Why no-one does it

What happens in practice is that software quality is left to atrophy until a developer comes to a decision-maker and says "Hey, we really need to refactor this." Time is alotted but it's not enough so big chunks of the code are cleaned up haphazardly with no test coverage. A flurry of new bugs appears and refactoring gets another blemish on its reputation even though it was never brought to bear. Overcoming this anti-pattern requires adaptation on both sides:

Software developers must get serious practical exposure to true refactoring as part of their training. The impact has to be experienced to be appreciated.
Decision-makers must admit to themselves how critical sotware is to their organization and stop treating it as a "cost center" or an obstacle to achieving business objectives.

Why you should do it

A well-maintained body of code is valuable whether or not it's directly readable by non-technical decision-makers. Knowledge is at hand, answers are quick, changes are smooth. The company becomes nimble in a way it never could without automation. At a software startup, this notion forms the core of its competitive strategy - that's why working at a startup is so alluring to programmers. It also drives startups' disruptive potential. Decision-makers at conventional companies face the tough task of integrating the value of software into their cultures even if they don't compete with startups. Once they do it may be too late.

Reputation as a measure of success

I prefer definitions of success that don't involve exclusivity. The problem with exclusivity is that it doesn't scale. When success is defined as being "the best in the world", for example, the number of successful people is limited by the number of categories in which one can be "the best in the world". Many companies thus present themselves as "the global leader" in whatever absurd bombastic-sounding niche they dream up for themselves. In addition, exclusivity is ethically questionable - in the words of Scatman John, "how can someone win when winning means that someone loses?".

I believe everyone deserves the confidence and satisfaction that comes with success. Definitions that don't require excluding anyone are preferable from this point of view. For a business, "being profitable" may be one such definition of success. "Growing consistently" may be another one, although growth does become exclusive once a market matures.

The other extreme - feel-good notions of "success" that don't require any effort - is even more problematic. Slight positive bias in one's self image is said to help achieve goals but the goals have to be there in the first place. (Interesting aside: does presence of goals indicate absence of success? Is success a state or a process? Why do we want to succeed, anyway?)

I became aware of these issues quite early in my youth and decided my definition of career success would be "achieving respect in a community of competent professionals". This was before the Internet. I can now say I've been achieving this success through most of my career if the "community" is defined as one's workplace and its circle of competent professionals.

That's no longer enough. For years, I have been standing on the sidelines of the great community that is the Internet. I would love to achieve a measure of respect there but it's quite scary. As the Red Woman says, the Net is vast and full of strangers, many of them jerks or worse. Even the sub-Internet of "competent software development professionals" is vast and full of strangers, many of them jerks or worse.

This brings up an awkward fact: when a community becomes large enough, respect of peers becomes exclusive. Respecting someone requires being aware of their existence, achievements and other attributes. Awareness is a limited resource. In my team at work there are so few of us we can comfortably judge each other's competence and award respect to everyone who deserves it. On the internet, however, I compete for the respect of my peers just as they compete for mine.

What to do about this? It's obvious that my youthful definition of success was flawed as it didn't correspond to my own ethics. I need to formulate another definition fully immune from exclusion. Perhaps something like "creating works of high quality useful to customers and delightful to users", as mundane as that sounds. (Of course, the previous definition did mention "works of high quality" between the lines: it spoke of "competent professionals" rather than "gullible fools".)

Having said that, my craving for "respect" doesn't feel like a symptom of vanity. I'd say it reflects a pretty basic human need for acceptance within the group I identify with. When such acceptance is a scarcity I can either give up on being accepted, pursue the acceptance to the exclusion of others or choose a different, smaller community to participate in. I don't feel like giving up but both of the other options involve, well, talking to strangers. Oh my...

The psychology of casual street cleaning

Summary Picking up other people's trash is an empowering gesture that turns you from a whiner into a fixer.

I've developed a peculiar habit. Sometimes when I'm approaching our apartment block I scan the ground for small pieces of trash. If I spot one not too far from my trajectory I pick it up and put it where it belongs - in a nearby dumpster. This behavior is much less common than you might think. It's also surprisingly rewarding.

Street litter has always been a common problem here in Bratislava and it used to upset me. That's the prevailing attitude in this city. People don't like dirty streets and will tell you so when asked. Few, however, do anything about it. There are several psychological barriers:

It is widely acknowledged that someone else should do the cleaning. Some - one might call them "conservatives" - would tell you it's up to those who did the littering. Others - "socialists" perhaps - would maintain it's the city administration's responsibility. Both notions are rather naive. The underlying attitude is that it's simply not fair that we, upstanding citizens who never litter the streets, should clean up after those who do.
When most people think of street cleaning they imagine removing all the litter from a substantial area. That's obviously a lot of work which needs many people if it's to be finished reasonably soon. Coordination is required and before you know it there's a project to manage.
Another mental block stems from the sheer futility of the effort. When a street does get cleaned up by municipal workers or by "spring cleaning" volunteers it doesn't take long for new litter to bloom. And it doesn't take a lot of litter to make a street look messy.
There is the simple unpleasantness of trash itself. Picking up someone else's cigarette butt and carrying it to a trash can means overcoming a hint of revulsion. It's ironic: the very feeling that motivates street cleaning also makes it difficult.

Coping with these inhibitions is all about awareness. Street litter presents no immediate threat nor opportunity so it's blocked out by the subconscious most of the time, along with many other details of the urban exterior. The blocking takes work, however. Garbage is visually loud - it mostly consists of discarded packaging designed to stand out on store shelves. Navigating dirty streets thus incurs a subliminal mental cost we're mostly unaware of. Once we recognize the full magnitude of the cost we become more willing to deal with the problem.

Another thing to realize is that removing even a single piece of trash reduces the mental cost in a tangible way. The street may be quite as dirty as before but the piece we picked up must have caught our attention which means it was somehow "more important" than the rest of the environment, amplifying the cost reduction. The immediacy of the reduction gives it even more impact (the reptilian brain is a sucker for immediate rewards).

From a more long-term perspective, the effort to remove one piece of trash is a one-time investment which pays off every time we visit the affected place. This is a delayed reward further compromised by new garbage appearing all the time, so it's not very significant. What's more important is that if we experience the immediate reward often enough a habit starts to form, lowering the mental cost of the act itself and making the reward even more attractive. A virtuous cycle forms.

All of this speculation may sound rather abstract but the psychological benefit I've experienced is real and substantial. When I notice a piece of litter these days it doesn't bother me anymore. I either pick it up and dispose of it properly or concede to myself that it's too far off my path. There's a sober honesty and clarity about it which does feel liberating. At the same time, I get regular experiences of doing noticeable good with modest effort.

In conclusion, I can recommend casual street cleaning as a worthwhile activity (given proper sanitary precautions, of course). Next time you find yourself angry at the unknown hooligans, why not try undoing their carelessness? You will help yourself more than anyone else.

The minimal patch fallacy

One of the goals of refactoring is to improve the readability of code. I have noticed, however, that I sometimes forsake a refactoring in order to preserve the readability of the resulting changeset. It tends to happen when I'm making a sensitive change in a complicated spot - the point is to make future "detective work" easier. I have evolved this approach after delving into Mercurial history way too many times while trying to understand opaque code.

At first blush, it seems there is a genuine trade-off at play. A refactoring - by definition - introduces changes that don't impact the observable behavior of a program. Mixing such modifications into a bug-fixing or feature-implementing changeset necessarily obscures its purpose. Clarity of the code is at odds with the clarity of its history.

Closer inspection reveals that the trade-off is pure nonsense. Code gets executed - not history. The present code must hence be readable on its own, in its present context. The very need to consult version control only arises when there are code smells all over the place. The urge to keep history clean is thus nothing more than a symptom of fear - fear of disrupting the fragile mess the code has become. The same fear refactoring works to obliterate.

Having said that, isolating refactorings in their own changesets is certainly a good idea - it makes the history clearer with no negative impact on the code. It may not be always possible, though (the need for a clean-up often arises in the middle of other work).

I fell prey to the minimal patch fallacy while working on an allegedly agile project. Despite the efforts of all involved there came a point when deadlines defeated test coverage, technical debt payments were postponed and all that remained were stand-ups. I started leaning on code history not long thereafter. Goes to show that cargo-cult agility ruins not just the code but also programming habits...

PIMp my calendar: Aftermath

I spent some of my free time over the past few weeks trying to get the Calendar app on my Android phone to run with a non-Google and non-MS-Exchange account (see part 1, part 2, part 3, part 4 and part 5). As with my printer woes, it turned out to be an enlightening excercise. I ended up running the Radicale CALDav server on my home server and Marten Gajda's CalDAV adapter on the little machine. I discovered fun facts about Android and its development tools along the way (the highlight was discovering that changing the minimum API level in Eclipse ADT fails to trigger a re-build).

As for the lessons learned, I guess there is a balance to be struck when troubleshooting a technical issue. One approach involves poking and prodding the system, observing responses and formulating hypotheses. Another option is to study information about the system - source code, documentation, online forums and other resources. Combining these modes optimally is rather delicate and I do tend to err on the side of experimentation. The knowledge it yields may be overly specific and not very reusable but the process fosters a problem-solving mind-set that comes in handy in other situations as well. Besides, documentation may be out of date and, as we've seen, even source code may not be quite what it seems. In fact, what hindered me in resolving the calendaring problem was that I failed to recognize the extent of the system I was working with - namely that it included the development environment.

PIMp my calendar, part 5

The self-hosted calendaring solution I had been trying to set up turned out not to work due to an incompatibility between my Android phone and the CalDAV client I tried to run on it. My options at this point consisted of

upgrading the firmware in my phone to Ice Cream Sandwich or higher
buying a new phone
porting the CalDAV Adapter to the unofficial and unsupported APIs
choosing a different calendaring solution

My handset is a Geeksphone One - a sturdy little machine well supported by Cyanogen Mod yet simply too weak to handle anything beyond 2.3.x. It's condemned to stay in the Gingerbread zombie army until I retire it. Buying a new phone is plausible in the mid- to long-term but I do want a slide-out QWERTY keyboard. Looks like I'll have to overcome my allergy to Sony.

I looked into back-porting the CalDAV adapter to the unofficial APIs available in Gingerbread. It seemed like a ton of work with dubious benefits - especially when I found out the work had already been done. You see, the adapter I'd been working with (written by Gérald Garcia) was not the only one I'd found - there is also another adapter written by Marten Gajda which I hadn't considered since it isn't distributed as open-source. It does work under Gingerbread, however, which made a proper impression on me given everything I knew at this stage.

I ended up doing something I hadn't done in years - I purchased and installed a piece of closed-source software. One thing that convinced me was Marten's website which is simple and sticks to the point; the same can be said of the software. Unfortunately, even with a bona-fide polished product on the Android side it wasn't smooth sailing. I had a Radicale instance installed on my notebook for debugging and it talked to the adapter just fine - unlike the home server. Both were running Radicale 0.7 by this point so I compared their OS-specific patches (using apt-get src on my Debian notebook and the ports tree on the OpenBSD home server). One of the Debian patches added automatic creation of calendars which was lacking in the OpenBSD version (this functionality is mentioned in current Radicale documentation but that supposedly refers to 0.7.1; the experiment with Mozilla Lightning in part 3 had worked because the Debian version was involved). Porting the patch and installing from ports was a matter of minutes. After that, everything worked like a charm.

« Page 2 / 6 »