Common Identifiers or a Common Data Format. What is more important?
I just read this excellent post by Tom Bruce et al. from the Legal Information Institute at the Cornell University Law School.
Tom’s post brought to mind something I have long wrestled with. (So long, in fact, that it was a key part of my job working on CAD systems in the aerospace industry years ago.) I sometimes wonder if having common identifiers isn’t more important than having a common data format. The reason is that being able to unambiguously establish relationships is both very difficult and very useful. In fact, one of the reasons you want a common format is so that you can find and establish these identifiers.
I have used a number of schemes for identifiers over the past ten years. Most of them have involved Uniform Resource Names, or URNs. A decade ago we designed a URN mechanism for use in the California Legislature. Our LegisWeb product uses a very similar URN scheme based on the lessons learned on the California project. In more recent years I have experimented with the URN:Lex proposal and the URL-based proposal that is within Akoma Ntoso. Both of these proposals are based around FRBR. I can’t say I have found the ideal solution.
I favor a URN-based mechanism for a number of reasons. URNs are names – that was their intent as defined by the IETF. They are not meant to imply a location or organizational containment (well, mostly they aren’t). In theory, these identifiers can be passed around and resolved to find the most relevant representation when needed. But there is a problem. URNs have never really been accepted. While they are conceptually very valuable, their poor acceptance and lack of supporting tools tend to undermine their value.
Akoma Ntoso takes a different approach. It uses URLs instead of URNs. Rather than using URLs as locations though, they are used as if they are identifiers. It is the duty of the webserver and applications plugged into the webserver to intercept the URLs and treat them as identifiers. In doing this, the webserver provides the same resolution functions that URNs were supposed to offer. My upcoming editor implements this functionality. I have built HTTP handlers that convert URLs into repository queries which retrieve and compose the requested documents. I have it working and it works well, at least as far as I understand the Akoma Ntoso proposal. I’m still not totally crazy about overloading identifier semantics on top of location semantics though. At least the supporting technology is already in place.
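To make the idea of a handler that turns a URL into a repository query a little more concrete, here is a minimal sketch in Python. It is not the code from my editor, and the path layout it assumes (jurisdiction, document type, date, number, then an optional language@version segment) is only loosely modeled on the FRBR work and expression levels; it is an illustrative assumption, not the Akoma Ntoso specification. The names `DocumentQuery` and `url_to_query` and the example.org URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DocumentQuery:
    """Hypothetical repository query built from an identifier-style URL."""
    jurisdiction: str
    doc_type: str
    date: str
    number: str
    language: Optional[str] = None      # expression level (assumed)
    version_date: Optional[str] = None  # expression level (assumed)


def url_to_query(url: str) -> DocumentQuery:
    """Treat the URL path as an identifier and split it into query fields.

    Assumed layout (illustrative only):
        /<jurisdiction>/<doc_type>/<date>/<number>[/<lang>@<version_date>]
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 4:
        raise ValueError(f"not enough path segments to identify a work: {url!r}")

    jurisdiction, doc_type, date, number = segments[:4]

    language = version_date = None
    if len(segments) > 4:
        # The fifth segment carries the expression: language and version date.
        lang, _, version = segments[4].partition("@")
        language = lang or None
        version_date = version or None

    return DocumentQuery(jurisdiction, doc_type, date, number, language, version_date)


if __name__ == "__main__":
    print(url_to_query("http://example.org/us-ca/bill/2011-02-18/ab123/eng@2011-05-01"))
```

The point of the sketch is that nothing in the path names a file on disk. The server is free to answer the query from a database, compose the document on the fly, or redirect to wherever the best representation currently lives.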
So what issues have I struggled with?
First of all, none of the proposals seem to adequately address how you deal with portions of documents. There are many issues in this area. The biggest of course is the inherent ambiguity within legislative documents. As Tom mentioned in his post, duplicate numbering often occurs. There are usually perfectly good and valid reasons for this, such as different operational conditions. But sometimes these are simply errors that everyone has to accommodate. Being able to specify the information necessary to resolve the ambiguity is not in any proposal I have seen. Add to that the temporal issues that come with renumbering actions. How do you refer to something that is subject to amendment and renumbering? Do you want a reference to specific wording at a specific point in time, or do you want your reference to track with amendments and renumbering?
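The two kinds of reference in that last question can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the `Reference` type, the `pinned` flag, and the renumbering log are my own assumptions for the sake of the example and do not come from any of the proposals. Dates are ISO strings so that they compare correctly as text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Reference:
    section: str   # section number as written when the reference was made
    as_of: str     # ISO date on which the reference was made
    pinned: bool   # True: wording as of `as_of`; False: follow renumbering


# Hypothetical renumbering log: (effective_date, old_number, new_number)
RENUMBERINGS: List[Tuple[str, str, str]] = [
    ("2012-01-01", "5", "5.5"),
    ("2013-01-01", "5.5", "7"),
]


def resolve(ref: Reference, on_date: str,
            renumberings: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Return the (section_number, version_date) a resolver should look up."""
    if ref.pinned:
        # Point-in-time reference: ignore everything that happened later.
        return ref.section, ref.as_of

    # Tracking reference: replay renumbering actions in date order.
    current = ref.section
    for effective, old, new in sorted(renumberings):
        if ref.as_of < effective <= on_date and old == current:
            current = new
    return current, on_date


if __name__ == "__main__":
    print(resolve(Reference("5", "2011-06-01", pinned=True), "2014-01-01", RENUMBERINGS))
    print(resolve(Reference("5", "2011-06-01", pinned=False), "2014-01-01", RENUMBERINGS))
```

Even this toy version needs information that a bare section number does not carry: the date the reference was made and whether it is meant to track. It also does nothing about duplicate numbering, which would need yet another disambiguating field.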
At this point people often ask me why a hash followed by a cleverly designed element id won’t work. The first thing you have to realize is that the # means something to the effect of “retrieve the document and then scroll to that id”. The semantic I am looking for is “retrieve the document fragment located at that id”. The importance of the difference becomes obvious when you realize that the client browser holds on to the “#” part of the request and all the server sees is the document URL, minus the hash and identifier. When your document is a thousand pages long and all you want is a single section, that distinction is quite important. Beyond that, managing ids across renumbering actions is very messy and introduces as many problems as it solves.
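The hash behavior is easy to see with Python’s standard `urllib.parse` module. The sketch below computes what an HTTP client would put on the request line for two hypothetical URLs: one that names the section with a fragment and one that names it in the path. The example.org URLs, the `sec-5` id, and the `request_target` helper are made up for illustration.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query, if any) that a client sends to the server.

    The fragment is dropped because HTTP clients keep it for themselves.
    """
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    return parts.path + (f"?{parts.query}" if parts.query else "")


hash_style = "http://example.org/us-ca/bill/2011-02-18/ab123#sec-5"
path_style = "http://example.org/us-ca/bill/2011-02-18/ab123/sec-5"

if __name__ == "__main__":
    print(request_target(hash_style))  # /us-ca/bill/2011-02-18/ab123
    print(request_target(path_style))  # /us-ca/bill/2011-02-18/ab123/sec-5
```

With the hash style the server has no way of knowing that only section 5 is wanted, so it has to send the whole document. With the section named in the path (or in a query string) the server can return just that fragment.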
Secondly, the referencing mechanism tends to be document oriented. Certainly, Akoma Ntoso uses virtual URL identifiers to refer to much more than simple documents, but the whole approach gets cumbersome and hard to explain. (If you want to appreciate this, try to explain XML Schema’s namespace URI/URL concept to an uninitiated developer.) What’s more, it’s not clear whether a common URL mechanism does enough to establish common practices for the effort to be useful. For instance, what if I want to refer to the floor vote after the second reading in the Assembly? Is there a reference to that? In California there is. That’s because the results of that vote are reported as a document. But there is nothing that says this should be the case. I have had the need to interrelate a number of ancillary documents with the legislation. How to do that in a consistent way is not all that clear-cut.
The third problem is user acceptance. The URN:Lex proposal, in particular, looks quite daunting. It uses lots of characters like @, $, and ;. While end users can be shielded from some of this, my experience has taught me that even software developers rebel against complexity they can’t understand or appreciate. So far, this has been a struggle.
I’m eagerly awaiting Part 2 of Tom’s post on identifiers. It’s a great subject to explore.