More Organs → More Human

Stupid things I've figured out so that you don't have to.



Thursday, December 29, 2005

XPath Insanity

Yet another entry under the heading of "stupid stuff I figured out so that you don't have to". The other day, a fellow student in my program came to me with what seemed to be a very simple XPath problem— accessing a namespaced node. I explained to him that dealing with namespaces in XPath is pretty straightforward— you just prefix the node name or attribute name you're after with whatever prefix you assigned the namespace. E.g., if the node is described in your document as foo:someNode, you would simply use that in your XPath. He replied that he tried that, and that it wasn't working. We got the sample file loaded onto my computer, and a couple of minutes with Ruby and REXML determined that, in fact, that XPath was working. He said something to the effect of: "I'm using Java, should that matter?" I replied "Nah, XPath is XPath." Ha. Ha. Ha. He figured that he must've typed something wrong, and that he'd go back and give it another try. A little while later, he came back saying that he'd triple-checked it, and that it still wasn't working. I went down to his computer, and after several hours of cursing and re-compiling, we finally figured it out.

Without going into the gory details of a very long and heroic debugging story, I'll sum it up by saying this: XPath goes all wonky when your document has both a default namespace and other prefixed namespaces. The reasons are fantastically obscure and are rooted in the depths of a W3C specification document, but have a sort of twisted logic to them. They don't seem to apply if you're using REXML and Ruby, but most definitely do if you're using any Java library based on Jaxen (e.g., Dom4J or DOX). For a detailed description, see this article over at XML.com. I'll also walk through the gist of it below.

Consider the following XML document describing our generic friend, Joe Smith:



<?xml version="1.0" encoding="UTF-8"?>
<person>
  <name>
    <given>Joe</given>
    <family>Smith</family>
  </name>
  <contactInfo>
    <phone>503-555-1234</phone>
  </contactInfo>
</person>



An appropriate XPath expression to get the DOM node containing Joe's surname would be: /person/name/family. Simple, nice, and easy. Let's up the ante a little bit. Let's say that you're a vampire, and this snippet of XML represents an entry in your "donor list". As a discerning gourmet, you want to encode Joe's blood type (hey, some days are AB+ kind of days, some are more O-). Not only are you a discerning vampire, you're a properly-trained programmer (or lazy, take your pick) as well. Luckily for you, the International Brotherhood of Vampires has a published schema for describing blood types:



<?xml version="1.0" encoding="UTF-8"?>
<bloodType>
  <type>O</type>
  <rhFactor>-</rhFactor>
</bloodType>



Integrating this into your schema is fairly straightforward: add xmlns:vamp="http://www.ibvamps.org/Schema/blood" to your <person> tag. Of course, if you add in a second namespace, you typically want to specify a default namespace to refer to your own part of the document. This is accomplished by adding xmlns="http://www.mydomain.com/Schema/donorEntry" right before our xmlns:vamp declaration. Now, our entry looks like this:



<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://www.mydomain.com/Schema/donorEntry"
        xmlns:vamp="http://www.ibvamps.org/Schema/blood">
  <name>
    <given>Joe</given>
    <family>Smith</family>
  </name>
  <contactInfo>
    <phone>503-555-1234</phone>
  </contactInfo>
  <vamp:bloodType>
    <vamp:type>O</vamp:type>
    <vamp:rhFactor>-</vamp:rhFactor>
  </vamp:bloodType>
</person>



After doing this, try running the XPath expression we laid out earlier to get Joe's surname (/person/name/family). If you're using Dom4J, you'll find that it no longer returns any nodes. You'll be able to access our namespaced nodes if you use an expression like /*/vamp:bloodType/vamp:type, but not if you use /person/vamp:bloodType/vamp:type. If you ask your parser to return the document's root node, you'll get a node named person, which appears correct... but if you run /person against your document, you won't get any results. So, what gives?
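If you'd like to watch it fail without installing anything, here is a minimal sketch that uses the XPath engine bundled with the JDK (javax.xml.xpath) rather than Dom4J or Jaxen, so treat it as an illustration of the same behavior and not as the exact code we were debugging. The class name, the helper method, and the trimmed-down documents are all made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefaultNamespaceRepro {
    // Joe's entry with no namespaces at all (trimmed for brevity).
    static final String PLAIN =
        "<person><name><given>Joe</given><family>Smith</family></name></person>";

    // The same entry once a default namespace is declared on the root element.
    static final String NAMESPACED =
        "<person xmlns=\"http://www.mydomain.com/Schema/donorEntry\">"
        + "<name><given>Joe</given><family>Smith</family></name></person>";

    // Parses the XML with namespace awareness switched on, then counts
    // how many nodes the XPath expression selects.
    static int countMatches(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP leaves this off by default
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches(PLAIN, "/person/name/family"));      // 1
        System.out.println(countMatches(NAMESPACED, "/person/name/family")); // 0
        System.out.println(countMatches(NAMESPACED, "/*"));                  // 1
    }
}
```

The only difference between the two documents is the xmlns attribute, and that alone is enough to make the unprefixed path come back empty.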

Without going into more detail than anybody who doesn't work for the W3C cares about, node name matching is done using fully qualified node names (at least, according to the XPath spec). This means that the XPath engine internally translates vamp:type to something along the lines of {http://www.ibvamps.org/Schema/blood}type. This is all well and good for nodes that have a namespace prefix, but what about the nodes that fall under our default namespace? In the document, they have no explicit prefix, but a default namespace has been specified, so the parser treats person as {http://www.mydomain.com/Schema/donorEntry}person. An unprefixed name in an XPath expression, on the other hand, never picks up the default namespace: XPath 1.0 treats it as belonging to no namespace at all, so the person in our expression becomes {}person. Therefore, if we just pass in person, the engine doesn't see a match.
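The name matching described above can be poked at directly. The sketch below again assumes the JDK's built-in JAXP classes instead of Dom4J; it prints the root element's fully qualified name and then shows that a name test only matches once the expression carries a prefix bound to the same URI. The prefix d is an arbitrary choice for this example (the document itself never uses it), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ExpandedNameDemo {
    static final String DONOR_NS = "http://www.mydomain.com/Schema/donorEntry";
    static final String BLOOD_NS = "http://www.ibvamps.org/Schema/blood";

    static final String ENTRY =
        "<person xmlns=\"" + DONOR_NS + "\" xmlns:vamp=\"" + BLOOD_NS + "\">"
        + "<name><given>Joe</given><family>Smith</family></name>"
        + "<vamp:bloodType><vamp:type>O</vamp:type>"
        + "<vamp:rhFactor>-</vamp:rhFactor></vamp:bloodType>"
        + "</person>";

    // Prefix bindings for the XPath side only. "d" is a made-up prefix
    // standing in for the document's default namespace.
    static final NamespaceContext PREFIXES = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if ("d".equals(prefix)) return DONOR_NS;
            if ("vamp".equals(prefix)) return BLOOD_NS;
            return XMLConstants.NULL_NS_URI;
        }
        // Reverse lookups are not needed for this sketch.
        @Override public String getPrefix(String namespaceURI) { return null; }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            return Collections.emptyIterator();
        }
    };

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Renders a node's name in the {namespace-uri}local-name notation.
    static String expandedName(Node node) {
        String uri = node.getNamespaceURI();
        return "{" + (uri == null ? "" : uri) + "}" + node.getLocalName();
    }

    // Returns the text of the first matching node, or "" when nothing matches.
    static String firstMatch(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(PREFIXES);
        return xpath.evaluate(expression, parse(ENTRY));
    }

    public static void main(String[] args) throws Exception {
        // {http://www.mydomain.com/Schema/donorEntry}person
        System.out.println(expandedName(parse(ENTRY).getDocumentElement()));
        System.out.println(firstMatch("/person/name/family"));              // empty string
        System.out.println(firstMatch("/d:person/d:name/d:family"));        // Smith
        System.out.println(firstMatch("/*/vamp:bloodType/vamp:type"));      // O
        System.out.println(firstMatch("/person/vamp:bloodType/vamp:type")); // empty string
    }
}
```

Binding a prefix on the XPath side is not the fix we ended up using, but it makes the mismatch visible: the element names are identical, and only the namespace half of the name differs.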

The solution is either to not use a default namespace, or to use a way-too-complicated voodoo workaround involving the XPath local-name() function. It turns out that [local-name()="person"] gets around this difficult-to-figure-out behavior (note that the comparison is case-sensitive, so the string has to match the element name exactly), so our final XPath for determining Joe's surname becomes: /*[local-name()="person"]/*[local-name()="name"]/*[local-name()="family"], or simply //*[local-name()="family"], depending on how the document is set up. I have no idea what the performance implications of this approach are, but, then again, I have no idea how most XPath engines perform ordinarily. Presumably, this doesn't do anything too awful, especially for smaller documents. Let's just say that I wouldn't suggest trying this with your 400,000-node XML representation of SNOMED.
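For the record, here is roughly what the local-name() workaround looks like in code. As before, this is a sketch against the JDK's built-in javax.xml.xpath and not Dom4J itself, and the class name, helper, and shortened document are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LocalNameWorkaround {
    // Joe's entry with a default namespace and the vamp namespace declared.
    static final String ENTRY =
        "<person xmlns=\"http://www.mydomain.com/Schema/donorEntry\""
        + " xmlns:vamp=\"http://www.ibvamps.org/Schema/blood\">"
        + "<name><given>Joe</given><family>Smith</family></name>"
        + "<vamp:bloodType><vamp:type>O</vamp:type></vamp:bloodType>"
        + "</person>";

    // Evaluates the expression and returns the text of the first match,
    // or "" when nothing matches. No prefixes are used, so no namespace
    // context has to be registered.
    static String evaluate(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(ENTRY)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Step-by-step path that compares only the local part of each name.
        System.out.println(evaluate(
            "/*[local-name()='person']/*[local-name()='name']/*[local-name()='family']")); // Smith
        // Shorter form that searches the whole document.
        System.out.println(evaluate("//*[local-name()='family']")); // Smith
        // The string comparison is case-sensitive, so this finds nothing.
        System.out.println(evaluate("/*[local-name()='Person']")); // empty string
    }
}
```

One thing to keep in mind: local-name() ignores the namespace entirely, so //*[local-name()="family"] would also pick up a family element from any other namespace that happens to be in the document.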

So, there you have it: the end-product of several hours of hair-pulling, reduced to a couple of paragraphs. Go forth, Google, and index this page, so that others will be spared the hours of torment that we suffered.

Thursday, December 08, 2005

Vacation! (Almost)

I'm almost free— I took my last in-class final this morning, and am now working on finishing up my last take-home final. After I'm done with that, the quarter is officially over and I'm on vacation. I plan on finishing up a few half-written writing projects and posting them over the course of Winter Break, but until that happens, here's something that was just too funny to pass up. Check out last week's Science, page 1261 (or simply go here, not sure if it's open access or not). The article is about interesting observations regarding touch sensors on bat wings, and describes the research of one John Zook at Ohio University.

Zook has found what appear to be Merkel-like receptor cells which seem to be highly sensitive air turbulence sensors. He found that when he used— wait for it— Nair on bats' wings, they were unable to accurately turn in midair. They could fly in a straight line without incident, but as soon as they had to negotiate a 90-degree turn, their elevation control went haywire. As soon as their hair grew back, their flight patterns returned to normal.

While this is pretty dang cool, it's not where The Funny comes in. Consider the following quote:


Zook also described another type of receptor in the membranous part of bat's wings. Nerve recordings revealed that these receptors respond when the membrane stretches, even slightly. The most sensitive parts of the wing turned out to overlap with the "sweet spots" where the bats prefer to hit the insects they scoop up in midflight. (Zook mapped the sweet spots by videotaping the bats as they gathered mealworms shot out of an air cannon).

(From Science 310:1260-1261, doi: 10.1126/science.310.5752.1260a,
emphasis in quote is my own).


An air cannon. That shoots mealworms. For bats to catch in midair. This, ladies and gentlemen, is why I love science. What other profession would consider this to be anything other than a slightly odd pastime?