More Organs → More Human

Stupid things I've figured out so that you don't have to.



Thursday, December 29, 2005

XPath Insanity

Yet another entry under the heading of "stupid stuff I figured out so that you don't have to". The other day, a fellow student in my program came to me with what seemed to be a very simple XPath problem: accessing a namespaced node. I explained to him that dealing with namespaces in XPath is pretty straightforward: you just prefix the node name or attribute name you're after with whatever prefix you assigned the namespace. E.g., if the node is described in your document as foo:someNode, you would simply use that in your XPath. He replied that he had tried that, and that it wasn't working.

We got the sample file loaded onto my computer, and a couple of minutes with Ruby and REXML determined that, in fact, that XPath was working. He said something to the effect of: "I'm using Java, should that matter?" I replied "Nah, XPath is XPath." Ha. Ha. Ha. He figured that he must've typed something wrong, and that he'd go back and give it another try. A little while later, he came back saying that he'd triple-checked it, and that it still wasn't working. I went down to his computer, and after several hours of cursing and re-compiling, we finally figured it out.

Without going into the gory details of a very long and heroic debugging story, I'll sum it up by saying this: XPath goes all wonky when your document has both a default namespace and other prefixed namespaces. The reasons are fantastically obscure and rooted in the depths of a W3C specification document, but have a sort of twisted logic to them. They don't seem to apply if you're using REXML and Ruby, but most definitely do if you're using any Java library based on Jaxen (e.g., Dom4J or DOX). For a detailed description, see this article over at XML.com. I'll also walk through the gist of it below.

Consider the following XML document describing our generic friend, Joe Smith:



<?xml version="1.0" encoding="UTF-8"?>
<person>
  <name>
    <given>Joe</given>
    <family>Smith</family>
  </name>
  <contactInfo>
    <phone>503-555-1234</phone>
  </contactInfo>
</person>



An appropriate XPath expression to get the DOM node containing Joe's surname would be: /person/name/family. Simple, nice, and easy. Let's up the ante a little bit. Let's say that you're a vampire, and this snippet of XML represents an entry in your "donor list". As a discerning gourmet, you want to encode Joe's blood type (hey, some days are AB+ kind of days, some are more O-). Not only are you a discerning vampire, you're a properly-trained programmer (or lazy, take your pick) as well. Luckily for you, the International Brotherhood of Vampires has a published schema for describing blood types:



<?xml version="1.0" encoding="UTF-8"?>
<bloodType>
  <type>O</type>
  <rhFactor>-</rhFactor>
</bloodType>



Integrating this into your schema is fairly straightforward: add xmlns:vamp="http://www.ibvamps.org/Schema/blood" to your <person> tag. Of course, if you add in a second namespace, you typically want to specify a default namespace to refer to your own part of the document. This is accomplished by adding xmlns="http://www.mydomain.com/Schema/donorEntry" right before our xmlns:vamp declaration. Now, our entry looks like this:



<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://www.mydomain.com/Schema/donorEntry"
        xmlns:vamp="http://www.ibvamps.org/Schema/blood">
  <name>
    <given>Joe</given>
    <family>Smith</family>
  </name>
  <contactInfo>
    <phone>503-555-1234</phone>
  </contactInfo>
  <vamp:bloodType>
    <vamp:type>O</vamp:type>
    <vamp:rhFactor>-</vamp:rhFactor>
  </vamp:bloodType>
</person>



After doing this, try running the XPath expression we laid out earlier to get Joe's surname (/person/name/family). If you're using Dom4J, you'll find that it no longer returns any nodes. You'll be able to access our namespaced nodes if you use an expression like /*/vamp:bloodType/vamp:type, but not if you use /person/vamp:bloodType/vamp:type. If you ask your parser to return the document's root node, you'll get a node named person, which appears correct... but if you run /person against your document, you won't get any results. So, what gives?
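If you'd like to see this for yourself, here is a minimal sketch that runs those four expressions against the donor entry. It assumes the JDK's built-in javax.xml.xpath API with a namespace-aware DOM parser rather than Dom4J (the JDK engine implements XPath 1.0, so it should behave the same way for this case), and the class and method names (DefaultNamespaceDemo, countMatches) are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: counts the nodes an XPath expression selects
// in the donor entry. Uses the JDK XPath engine, not Dom4J (assumption).
final class DefaultNamespaceDemo {

    static final String DONOR_XML =
        "<person xmlns=\"http://www.mydomain.com/Schema/donorEntry\""
        + " xmlns:vamp=\"http://www.ibvamps.org/Schema/blood\">"
        + "<name><given>Joe</given><family>Smith</family></name>"
        + "<contactInfo><phone>503-555-1234</phone></contactInfo>"
        + "<vamp:bloodType><vamp:type>O</vamp:type>"
        + "<vamp:rhFactor>-</vamp:rhFactor></vamp:bloodType>"
        + "</person>";

    // Binds only the "vamp" prefix for use inside XPath expressions.
    static final NamespaceContext VAMP_ONLY = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "vamp".equals(prefix)
                ? "http://www.ibvamps.org/Schema/blood"
                : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String namespaceURI) { return null; }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            return Collections.emptyIterator();
        }
    };

    // Parses the donor entry (namespace-aware) and returns how many nodes match.
    static int countMatches(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DONOR_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(VAMP_ONLY);
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/person/name/family",             // selects nothing
            "/*/vamp:bloodType/vamp:type",     // selects the one type node
            "/person/vamp:bloodType/vamp:type",// selects nothing
            "/person"                          // selects nothing
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + countMatches(expression));
        }
    }
}
```

Only the expression that sidesteps the unprefixed person step with a wildcard finds anything.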

Without going into more detail than anybody who doesn't work for the W3C cares about, node name matching is done using fully qualified node names (at least, according to the XPath spec). This means that the XPath engine internally translates vamp:type to something along the lines of {http://www.ibvamps.org/Schema/blood}type. This is all well and good for nodes that have a namespace prefix, but what about the nodes that fall under our default namespace? In the document they have no explicit prefix, but a default namespace has been specified, so the root element is really {http://www.mydomain.com/Schema/donorEntry}person. In an XPath 1.0 expression, however, the default namespace is ignored: a name with no prefix is treated as belonging to no namespace at all, so person translates to {}person. Therefore, if we just pass in person, the engine doesn't see a match.
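To illustrate that the matching really is done on namespace URIs and not on prefixes, here is a rough sketch, again assuming the JDK's javax.xml.xpath API instead of Dom4J. The prefixes d and blood exist only in the expression context and are made up for this example (the document itself uses no prefix and vamp), as are the class and method names.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: expression prefixes are resolved to URIs, and only
// the URI plus the local name has to agree with the document.
final class ExpandedNameDemo {

    static final String DONOR_NS = "http://www.mydomain.com/Schema/donorEntry";
    static final String BLOOD_NS = "http://www.ibvamps.org/Schema/blood";

    static final String DONOR_XML =
        "<person xmlns=\"" + DONOR_NS + "\" xmlns:vamp=\"" + BLOOD_NS + "\">"
        + "<name><given>Joe</given><family>Smith</family></name>"
        + "<vamp:bloodType><vamp:type>O</vamp:type>"
        + "<vamp:rhFactor>-</vamp:rhFactor></vamp:bloodType>"
        + "</person>";

    // Made-up prefixes for the expression side: "d" for the document's default
    // namespace, "blood" for the namespace the document calls "vamp".
    static final NamespaceContext EXPRESSION_PREFIXES = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if ("d".equals(prefix)) return DONOR_NS;
            if ("blood".equals(prefix)) return BLOOD_NS;
            return XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String namespaceURI) { return null; }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            return Collections.emptyIterator();
        }
    };

    // Returns the string value of the first selected node, or "" if none match.
    static String stringValue(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DONOR_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(EXPRESSION_PREFIXES);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Prefixed steps resolve to {DONOR_NS}person and friends, so they match.
        System.out.println(stringValue("/d:person/d:name/d:family"));
        // "blood" differs from the document's "vamp", but the URI is the same.
        System.out.println(stringValue("/d:person/blood:bloodType/blood:type"));
        // Unprefixed steps mean "no namespace", so this prints an empty line.
        System.out.println(stringValue("/person/name/family"));
    }
}
```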

The solution is either to not use a default namespace, or to use a way-too-complicated voodoo workaround involving the XPath local-name() function. It turns out that [local-name()="person"] gets around this difficult-to-figure-out behavior, because local-name() looks only at the part of the name after any prefix and ignores the namespace entirely. Our final XPath for determining Joe's surname becomes: /*[local-name()="person"]/*[local-name()="name"]/*[local-name()="family"], or simply //*[local-name()="family"], depending on how the document is set up. (The comparison is case-sensitive, so the string has to match the element name exactly.) I have no idea what the performance implications of this approach are, but, then again, I have no idea how most XPath engines perform ordinarily. Presumably, this doesn't do anything too awful, especially for smaller documents. Let's just say that I wouldn't suggest trying this with your 400,000-node XML representation of SNOMED.
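Here is a small sketch of the local-name() workaround in action. As an assumption, it uses the JDK's javax.xml.xpath API rather than Dom4J, and the class and method names are invented for the example. The expressions use single quotes around the names only to avoid escaping inside Java strings; XPath accepts either kind.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: selects nodes by local name only, so no prefix
// bindings are needed on the expression side.
final class LocalNameDemo {

    static final String DONOR_XML =
        "<person xmlns=\"http://www.mydomain.com/Schema/donorEntry\""
        + " xmlns:vamp=\"http://www.ibvamps.org/Schema/blood\">"
        + "<name><given>Joe</given><family>Smith</family></name>"
        + "<vamp:bloodType><vamp:type>O</vamp:type>"
        + "<vamp:rhFactor>-</vamp:rhFactor></vamp:bloodType>"
        + "</person>";

    // Returns the string value of the first selected node, or "" if none match.
    static String stringValue(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DONOR_XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The long form walks the tree one level at a time.
        System.out.println(stringValue(
            "/*[local-name()='person']/*[local-name()='name']/*[local-name()='family']"));
        // The short form searches the whole document.
        System.out.println(stringValue("//*[local-name()='family']"));
        // local-name() is case-sensitive: 'Person' matches nothing (empty line).
        System.out.println(stringValue("/*[local-name()='Person']/*[local-name()='name']"));
    }
}
```

One thing to keep in mind: because local-name() ignores namespaces altogether, //*[local-name()="type"] would pick up a type element from any namespace, not just the vamp one.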

So, there you have it: the end-product of several hours of hair-pulling, reduced to a couple of paragraphs. Go forth, Google, and index this page, so that others will be spared the hours of torment that we suffered.
