The battle of the C5280, part 1

The other day I posted a frustrated rant about my troubles getting the printer to work under OpenBSD. This has evolved into a massive struggle which is still ongoing. On the one hand, it's ridiculous how much time I'm devoting to this. On the other, I've dusted off my C chops, built basic proficiency with gdb and discovered a bunch of curious facts about OpenBSD and USB which I'll probably never use again and forget in a matter of weeks. So it's perhaps worth writing some of them down.

First, a bit of background. My home server runs on OpenBSD. One of its duties is providing access to an HP Photosmart C5280 printer via CUPS, and it works like a charm. The version of OpenBSD I'm running is getting old, however, so I've purchased the 5.2 CD set and I'm setting it up on another machine.

Everything went well until I tried to add the printer via CUPS' web interface. No local connection was detected. I turned on extra logging (by saying LogLevel debug in /etc/cups/cupsd.conf), clicked around in the web UI and found this in /var/log/cups/error_log:

libusb_get_device_list=0

It looked like libusb was having trouble finding the printer. I looked up libusb on the web, looked at the source code and confirmed my suspicion. I also found out there was an OpenBSD-specific file (openbsd_usb.c) which looked for /dev/ugen* devices. Those were present on my system in abundance, so a missing device wasn't a problem.

The error log also mentioned /usr/local/libexec/cups/backend/usb and browsing CUPS help gave me a hint of what backends were about. So I ran the USB backend directly on the command line. To my surprise, it had no problem finding the device. Something was wrong in the way CUPS was calling the backend.

I decided to set up OpenBSD ports and compile libusb with some trace statements. I wrote a minimal program to test things:

#include <stdio.h>
#include <libusb.h>

int main(int argc, char **argv) {
    libusb_device **list;
    ssize_t count;

    libusb_init(NULL);                            /* use the default context */
    count = libusb_get_device_list(NULL, &list);  /* device count, or a negative error code */
    fprintf(stdout, "Devices: %zd\n", count);
    libusb_free_device_list(list, 1);
    libusb_exit(NULL);
    return 0;
}

I hadn't done C hacking in a while, so I spent a fair amount of time just crafting the proper cc command line, but eventually I could read a nice "Devices: 7" in the output. I then started sprinkling trace statements over libusb. I naively thought using fprintf(stderr, ...) would be easier than figuring out how to activate the existing logging functions (usbi_warn, usbi_dbg etc.). No matter what I did, the traces wouldn't show up. Eventually I did activate libusb's logging (by commenting out .ifdef DEBUG in the port's Makefile). Didn't help.

After much wailing and gnashing of teeth, it turned out to be a linking problem. I'd trusted cc to link the library statically by default (I have no idea how to override the default anyway). The bastard used dynamic linking instead, which meant that upon running the program I was talking to /usr/local/lib/libusb-1.0.so.1.0 from the installed libusb package rather than my freshly built copy. Running with

PORT_PATH=/usr/obj/ports/libusb1-1.0.9/libusb-1.0.9
LD_LIBRARY_PATH=$PORT_PATH/libusb/.libs:/usr/lib

brought a revelation. Continued in part 2.


Alice in Printerland

I've just spent about 12 hours trying to set up CUPS on my OpenBSD home server, with no success. The USB backend didn't want to recognize the local printer. After setting up OpenBSD ports, inserting trace statements into the libusb sources and forcing the runtime linker to use the newly compiled library instead of the one from the standard package, I found out that the backend stumbled on "Permission denied". I made the /dev entries world-readable and -writable to see if it would help. The printer was recognized and I could install it properly.

Then I tried printing a test page and LIBUSB_ERROR_ACCESS became LIBUSB_ERROR_OTHER.

There are 13 places in libusb where the _OTHER could come from and I have lost the will to see where the rabbit hole leads. Time for a fresh start.


Idiosyncrasies in Python's timezone handling

As I work to fix timezone logic in upsheet, I've become somewhat frustrated with date and time handling in Python. Coming from a Java background, I know this stuff is awfully difficult to get right. Python has a large community driven by sophisticated users, however, so I expected things to be pretty smooth. I was wrong:

  • datetime vs. time: Two modules with overlapping functionality and no clear separation of scope.
  • datetime.datetime: A class with the same name as its module. This pattern is probably common in the Python library, but it makes for weird expressions such as datetime.datetime.strptime(...) and kind of nudges you to drop the module prefix with from datetime import *.
  • timezone handling in datetime is a red-headed stepchild: The datetime module tactfully avoids the gritty details of coming up with the local system's current timezone. You can supply a timezone, but you have to provide your own tzinfo subclass to do it.
  • timezone handling in time is a red-headed stepchild: You get timezone for the non-DST system timezone, altzone for the DST system timezone, daylight to see if altzone is even applicable, and localtime().tm_isdst to decide (provided altzone is applicable) whether timezone or altzone is the local system's current timezone. Whew! I mean, it's wonderfully precise and flexible, but kind of all over the place.
  • inconsistent timezone offsets: While time.timezone and time.altzone specify their offsets in seconds west of UTC, datetime.tzinfo.utcoffset() must return its result in minutes east of UTC. Trivial yet irritating.
  • datetime.time vs. time: A class in one module with the same name as a related module. When you use datetime without the module prefix, this forces you to either also use time without the prefix - further polluting the naming scope - or alias it using import time as sillytime.
  • mischievous documentation: One example: the description of struct_time.tm_isdst says "0, 1 or -1; see below". Of course, there's no further mention of tm_isdst anywhere in the document. A sentence buried in the next paragraph does say "A -1 argument as the daylight savings flag, passed to mktime() will usually result in the correct daylight savings state to be filled in.". Which means the -1 is utterly irrelevant except for that one specific use-case.

Having figured all of this out, I can now construct a datetime.datetime instance equipped with the correct tzinfo for the current local timezone. This goes to show how hard it is to design a really good API.
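
For the record, here's a minimal sketch of what that construction can look like. It builds a tzinfo from the time module's data and is modeled on the tzinfo example in the Python docs; the names are mine, not upsheet's:

# Local-timezone tzinfo assembled from the time module's data
# (a sketch modeled on the Python docs example, not upsheet's code).
import time
from datetime import datetime, timedelta, tzinfo

ZERO = timedelta(0)
STD_OFFSET = timedelta(seconds=-time.timezone)
DST_OFFSET = timedelta(seconds=-time.altzone) if time.daylight else STD_OFFSET

class LocalTimezone(tzinfo):
    def utcoffset(self, dt):
        return DST_OFFSET if self._isdst(dt) else STD_OFFSET

    def dst(self, dt):
        return DST_OFFSET - STD_OFFSET if self._isdst(dt) else ZERO

    def tzname(self, dt):
        return time.tzname[self._isdst(dt)]

    def _isdst(self, dt):
        # Let mktime()/localtime() decide whether DST applies at dt.
        stamp = time.mktime((dt.year, dt.month, dt.day,
                             dt.hour, dt.minute, dt.second,
                             dt.weekday(), 0, -1))
        return time.localtime(stamp).tm_isdst > 0

print(datetime.now(LocalTimezone()).isoformat())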

I do hope it all works the same way under Windows...


Stuck in a timezone

My timesheet uploader micro-project started off as a spontaneous scratch-your-own-itch effort with very little polish. No error handling, liberal copy-paste, vague structure - you know the story. The code obviously involved some parsing and formatting of dates and times. It wasn't very complicated so I avoided the big guns, sticking to string manipulation and simple arithmetic (figuring out that 0830-1030 means 120 minutes etc.).
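
For instance, turning "0830-1030" into a number of minutes is just string slicing and arithmetic, roughly like this (a sketch, not the actual upsheet code):

# Sketch only: parse an "HHMM-HHMM" interval into its length in minutes.
def interval_minutes(interval):
    start, end = interval.split("-")
    start_minutes = int(start[:2]) * 60 + int(start[2:])
    end_minutes = int(end[:2]) * 60 + int(end[2:])
    return end_minutes - start_minutes

print(interval_minutes("0830-1030"))  # 120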

Timesheet entries are fed to JIRA as JSON documents via its REST API. JSON doesn't define a standard timestamp format but examples in the documentation use ISO8601 with a timezone suffix - something like 2012-08-14T08:30:00.000+0200. I dutifully slapped on the suffix as a constant to all my outgoing timestamp data.
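
In other words, something along these lines (illustrative only, not the actual upsheet code):

# Format an outgoing worklog timestamp with a hard-coded timezone suffix.
from datetime import datetime

entry_start = datetime(2012, 8, 14, 8, 30)
started = entry_start.strftime("%Y-%m-%dT%H:%M:%S") + ".000+0200"
print(started)  # 2012-08-14T08:30:00.000+0200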

One of the things I wanted from the script was to detect timesheet entries that were already in JIRA to prevent duplication in the worklogs. This involved downloading the worklog for the relevant issue and comparing it with the entry that was to be uploaded. When parsing timestamps in the worklog I happily ignored the timezone suffix. I figured I would always be in the same timezone as my JIRA server.

After some time I noticed this collision detection was often failing. I had no time to analyze it so I simply took care to upload my timesheet all at once and re-upload only the failed entries if there were errors. It was a pain but still much better than stuffing JIRA manually.

When I got more free time in January I discovered the true nature of the bug. Both my JIRA server and I had effectively switched timezones when Daylight Saving Time ended in October. The script, however, still had +0200 in its outgoing timestamps. This meant that I uploaded an entry with 08:30:00.000+0200 but then it came back as 09:30:00.000+0100. Ignoring the timezone, the script detected no overlap (as long as the duration was under 61 minutes) and proceeded to re-upload the entry, again with 08:30:00.000+0200 in the timestamp.
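
Comparing the two correctly means parsing the offset instead of dropping it. A sketch of what that could look like (again, not the actual upsheet code):

# Parse a JIRA-style ISO8601 timestamp, keeping its numeric UTC offset,
# so comparisons are done on offset-aware datetimes.
from datetime import datetime, timedelta, tzinfo

class FixedOffset(tzinfo):
    def __init__(self, minutes):
        self._offset = timedelta(minutes=minutes)
    def utcoffset(self, dt):
        return self._offset
    def dst(self, dt):
        return timedelta(0)

def parse_worklog_timestamp(value):
    # e.g. "2012-08-14T08:30:00.000+0200"
    naive = datetime.strptime(value[:-5], "%Y-%m-%dT%H:%M:%S.%f")
    sign = -1 if value[-5] == "-" else 1
    minutes = sign * (int(value[-4:-2]) * 60 + int(value[-2:]))
    return naive.replace(tzinfo=FixedOffset(minutes))

print(parse_worklog_timestamp("2012-08-14T08:30:00.000+0200").isoformat())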

It turns out that no, I can't ignore timezones in my calculations after all. Besides, the code is already on GitHub, and right now collision detection works only for the lucky users who are two hours east of UTC together with their JIRA servers. Good thing I haven't told anyone yet :-)


The Heroku mess

On February 12, 2013, a startup called RapGenius accused the Heroku cloud host of knowingly degrading the performance of the RapGenius Rails app, forcing RapGenius to run extra instances and pay higher service charges.

RapGenius' blog post is impressive in that it describes the issues clearly yet in sufficient detail. It is really worth a read, which is why I'm not going to reproduce or summarize it here.

I believe none of the parties involved (Heroku, RapGenius and the monitoring outfit New Relic) come out looking good:

  • Heroku failed to announce a substantial platform change to customers. The company provides Platform as a Service (several platforms, actually). A platform is a system of components working together to provide essential services for applications. Heroku switched their Rails platform from an optimal combination of components (single-request dynos, intelligent routing) to a substantially sub-optimal one (single-request dynos, random routing). They failed to notify customers of the switch in advance. They either failed to measure the performance impact of the switch or purposefully withheld knowledge of the impact from their customers. They failed to reflect the switch in their documentation.
  • New Relic simply failed to account for all latencies in their monitoring. It appears they measured many internal values within the platform but failed to cross-check them against a black-box view from the outside. It seems quite unprofessional, especially given the pricing of their service.
  • RapGenius put too much trust in an external entity. Instead of an independent monitoring service they used one sold as an add-on to their platform by the platform provider. They took more than two years to detect a major performance degradation in their platform (in fairness, the problem only becomes apparent when the number of dynos reaches a certain threshold).

What's truly puzzling is that none of Heroku's other Rails customers seems to have encountered this issue before RapGenius. One possibility is that RapGenius is the first Heroku Rails app to achieve sufficient scale for the problem to occur. Another is that some customers did indeed catch the problem right at the start, adapted to the new configuration and simply never thought to give a heads-up to the rest of the community.

The moral of the story for me is that it's foolish to use a monitoring service with business ties to the entity being monitored (or, indeed, to outsource monitoring in the first place - it's a critical QA function, after all). An independent monitoring service must still be given enough access to the monitored system to provide the necessary detail. This is not trivial, as monitoring can interfere with the monitored system. Some form of meta-monitoring is perhaps needed. Ah, the ever-fascinating world of IT :-)


Catching a boot-time kernel oops call trace

Published on Tue 13 September 2011 under HTPC, howto

Another issue my HTPC exhibited recently was a tendency to crash early in the boot process. It seemed to be suffering from an initialization issue in the iMon driver. I couldn't confirm it, though, since the call trace was too long to fit on the screen and it didn't turn up in any of the files under /var/log.

I studied various options for capturing the call trace, including netconsole and pstore, but since I had access to an RS-232 cable with switched RX/TX pins (a.k.a. a null-modem cable) I ended up choosing the good old serial console. This involved connecting the HTPC to my notebook with the serial cable, starting cutecom on the notebook and doing the following on the HTPC:

sudo vi /etc/sysconfig/grub
# add or adjust this line in the file:
#   GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,19200n8 console=tty0"
sudo update-grub
sudo reboot

Of course, "19200n8" must match the parameters set in cutecom on the other side of the link. With this setup, it only took a few reboots to reproduce the crash, capture the call trace and confirm the iMon driver as the culprit. Fortunately, a patch for the problem already existed so it was only a matter of either waiting until it reached my aptosid kernel or patching the kernel on my own. In the meantime, I'm booting into the previous kernel.


Update on Logitech diNovo Edge and iMON PAD

My HTPC lived a few months without an update to its aptosid installation. When I subsequently did perform an update, I ran into some of the same problems I used to have with my Logitech wireless keyboard and iMON remote control. With the help of my previous blog posts, I was able to set things straight rather quickly.

The solution to my original keyboard problem had been to replace /usr/sbin/hid2hci with a dummy version. In the meantime, /usr/sbin/hid2hci had been superseded by /lib/udev/hid2hci, so I simply replaced the new file with a dummy version as well.

As for the remote control, the current problem resembled a broken "nomouse" mode of the imon kernel module that I had encountered before. The module seemed to be OK this time, however, so I kept looking and eventually the LIRC manual offered a great description of a LIRC-HAL interaction that seemed to match my symptoms. I tried following the advice from the manual but it just didn't work. It turned out the DEVICE setting in /etc/lirc/hardware.conf had become obsolete. The correct new setting is

DEVICE=/dev/input/by-id/usb-15c2_ffdc-event-if00

and the remote works like a charm again (for a while, anyway...)

