Tim Berners-Lee, inventor of the World Wide Web, defines a semantic web quite simply as “a web of data that can be processed directly and indirectly by machines”. In my experience, that simple definition quickly becomes confusing as people add their own wants and desires to it. Technologies like RDF, OWL, and SPARQL are considered key components of semantic web technology. It seems, though, that these technologies add so much confusion through abstraction that non-academic people quickly steer as far away from the notion of a semantic web as they can get.
So let’s stick to the simple definition from Tim Berners-Lee. We will distinguish a semantic web from our existing web by saying that a semantic web is designed to be meaningful to machines as well as to people. So what does it mean for a web of information to be meaningful to machines? A simple answer is that a machine needs to understand two primary things about a web: first, what each page is about, and second, what the relationships that connect the pages mean.
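To make those two things concrete: one common way to express both is as subject–predicate–object statements, which is the data model underlying RDF. Here is a minimal sketch in plain Python, with no RDF library involved; the page names and predicates are invented for illustration, not a real vocabulary:

```python
# A tiny triple store: each fact is a (subject, predicate, object) tuple.
# This mirrors the RDF data model; identifiers are invented for illustration.
triples = {
    ("page:alice", "rdf:type", "schema:Person"),        # what a page is about
    ("page:acme", "rdf:type", "schema:Organization"),
    ("page:alice", "schema:worksFor", "page:acme"),     # how pages relate
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A machine can now answer both questions directly:
print(query(s="page:alice", p="rdf:type"))         # what the page is about
print(query(s="page:alice", p="schema:worksFor"))  # what its links mean
```

The point is not the ten lines of Python but the shape of the data: once a page’s subject matter and its outbound links are stated as explicit, typed facts rather than buried in prose, no parsing or guesswork is needed to answer questions about them.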
It turns out that making a machine capable of understanding even the most rudimentary aspects of pages and the links that connect them is quite challenging. Generally, you have to resort to fragile custom-built parsers or sophisticated algorithms that analyze the document pages and the references between them. Going from pages with lots of words connected somehow to other pages to a meaningful information model is quite a chore.
What a semantic web needs to improve the situation are agreed-upon information formats and referencing schemes that machines can more readily interpret. Defining what those formats and schemes are is where the subject of semantic webs starts getting thorny. Before trying to tackle all of this, let’s first consider how it applies to us.
What could benefit more from a semantic web than legal publishing? Understanding the law is a complex task that requires extensive analysis and know-how, and a semantic web could simplify it substantially. Legal documents are an ideal fit for the notion of a semantic web. First, the documents are quite structured: even though each jurisdiction might have its own presentation styles and procedural traditions, the underlying models are quite similar around the world. Second, legal documents are rich with relationships, or citations, to other documents. Understanding these relationships and what they mean is essential to understanding the meaning of the documents themselves.
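Those citation relationships can be captured the same way, as typed links between documents, so that a machine knows not just that two documents are connected but how. The sketch below uses invented identifiers loosely modeled on U.S. copyright law; the path scheme and relationship names are illustrative assumptions, not any real standard:

```python
# Hypothetical citation graph for a few legal provisions.
# Identifiers and relationship types are invented for illustration.
citations = [
    ("us/usc/title17/s107", "cites", "us/usc/title17/s106"),
    ("us/pl/94-553", "amends", "us/usc/title17"),
    ("case:campbell-v-acuff-rose", "interprets", "us/usc/title17/s107"),
]

def incoming(doc):
    """Everything that points at a document, grouped by relationship type."""
    result = {}
    for src, rel, dst in citations:
        if dst == doc:
            result.setdefault(rel, []).append(src)
    return result

# Which documents bear on a given provision, and in what way?
print(incoming("us/usc/title17/s107"))
```

With the relationship types made explicit, a machine can distinguish a provision that merely mentions another from an act that amends it or a case that interprets it, which is exactly the kind of analysis that plain hyperlinks, let alone PDFs, cannot support.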
So let’s consider the current state of legal publishing – and from my perspective – legislative publishing. The good news is that the information is almost universally available online, free and easily accessed. We are, after all, subject to the law, and providing access to that law is the duty of the people who make the laws. However, providing readable access to the documents is often the only objective, and any means of accomplishing it is considered sufficient. Documents are often published as PDFs, which are easy to read but very difficult for computers to understand. There is no uniformity between jurisdictions, minimal analysis capability (typically word search), and the links connecting references and citations between documents are most often missing. This is a less than ideal situation.
We live in an era where our legal institutions are expected to provide more transparency into their functions. At the same time, we expect more from computers than merely allowing us to read documents online. It is becoming more and more important to have machines interpret and analyze the information within documents – and without error. Today, if you want to provide useful access to legal information by providing value-added analysis capabilities, you must first tackle the task of interpreting all the variations in which laws are published online. This is a monumental task which then subjects you to a barrage of changes as the manner in which the documents are released to the public evolves.
So what if there were a uniform semantic web for legal documents? What standards would be required? What services would be required? Would we need uniform standards, or could existing fragmented standards be accommodated? Would it all need to come from a single provider, from a group of cooperating providers, or would there be a looser way to federate all the documents being provided by all the sources of law around the world? Should the legal entities that are sources of law assume responsibility for publishing legal documents, or should this be left to third-party providers? In my coming posts I want to explore these questions.