Sunday, April 27, 2014

Confessions of an RDF agnostic, part 2: I have a dream…

"I have a dream for the Web… Machines become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A 'Semantic Web,' which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy, and our daily lives will be handled by machines talking to machines, leaving humans to provide the inspiration and intuition. The intelligent 'agents' people have touted for ages will finally materialize."

Tim Berners-Lee (1999) "Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web By Its Inventor" p.157-8.

When I was a kid, I really enjoyed reading books about robots and computers. In books like "Andy Buckram's Tin Men" and "I Robot", one constructed robots out of tin cans or whatever you had at hand, and then through the magic of positronic material or computer programming, the robot became a sentient being, capable of thinking and reasoning. I hoped that someday I would actually get to see a real computer. You can imagine my disappointment when I actually saw my first computer in college and discovered that computers only "knew" how to accomplish the things that one programmed them to do. The prospect of the emergence of an "intelligent agent" that can "discover" new information without the intervention of a human programmer is very appealing and if Tim Berners-Lee says it can be done, it certainly should be possible, right?

The prospect of using RDF and its extensions RDFS and OWL to enable machines to do semantic reasoning is very alluring, and it is easy to jump on the bandwagon and advocate for adopting it without carefully considering its limitations.  So, I'd like to take a moment to step back and summarize a few important facts about RDF. [The rest of this post presupposes some knowledge of the rudiments of RDF at the level of understanding triples and graphs.  For more background, I recommend the W3C's RDF Primer.  For background in the context of biodiversity informatics, I recommend the TDWG RDF Task Group's Beginner's Guide to RDF.  I also shamelessly promote this video upon which I spent/wasted many hours in advance of the TDWG 2013 Semantics of Biodiversity symposium.]


1. RDF is not a programming language. A set of statements in RDF doesn't "do" anything. Rather, RDF is a way of stating "facts" about things, known as "resources". A single "fact" in RDF is called a triple. A triple can describe a property of a resource. A triple can also describe how a resource is related, or linked, to other resources.
2. A set of triples is called an RDF graph. The triples in a graph describe a certain state of affairs. One cannot assume that everything is known about that state of affairs - there is always the potential to acquire additional information about the state of affairs.
3. RDF triples are not just a format for information exchange. Although they can be serialized in different formats (RDF/XML, Turtle, JSON-LD, etc.), they represent abstract relationships that are independent of the serialization.
4. Actually "doing" something with an RDF graph requires a "semantic client". A semantic client is a computer program that is designed to consume information in the form of triples. The client software is constructed to work according to rules laid out by the standards that define the various flavors of RDF. The semantic client produces some useful result based on rule-based processing of the triples it has consumed.


What does a semantic client "understand" about a triple?











Suppose I state the following:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/">
   <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/26828">
     <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2003-06-06T08:47:15-05:00</dcterms:created>
   </rdf:Description>
</rdf:RDF>


If this were simply processed as raw XML, it could be interpreted to mean that "2003-06-06T08:47:15-05:00" was some bit of string data that could be understood based on the tags in the markup and through a pre-established understanding between the sender and receiver. 

However, since this XML is valid RDF, a semantic client could understand it to mean that there is a relationship between some thing (i.e. a resource) identified by the IRI* http://bioimages.vanderbilt.edu/baskauf/26828, and the instant of time 8:47:15 AM Central Daylight Time on 6 June 2003.  Note that the relationship is NOT between the string "http://bioimages.vanderbilt.edu/baskauf/26828" and the string "2003-06-06T08:47:15-05:00", but rather between the entity identified by the IRI and the time instant encoded by the datatyped string.  The XML is just a means of serializing the abstract relationship described by the RDF triple.  The triple would "mean" exactly the same thing if it were serialized in Turtle syntax as:

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<http://bioimages.vanderbilt.edu/baskauf/26828>
     dcterms:created "2003-06-06T08:47:15-05:00"^^xsd:dateTime.
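
To make the "same triple, different serialization" point concrete, here is a minimal sketch in Python using only the standard library. It is not a real RDF parser: the function triples_from_simple_rdfxml is a hypothetical helper that handles only the narrow shape of RDF/XML shown above, and the Turtle side is written out by hand with its prefixes expanded. Representing a triple as a Python tuple is my own assumption for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
XSD = "http://www.w3.org/2001/XMLSchema#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/">
   <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/26828">
     <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2003-06-06T08:47:15-05:00</dcterms:created>
   </rdf:Description>
</rdf:RDF>"""


def triples_from_simple_rdfxml(text):
    """Extract triples from the narrow RDF/XML shape used in this post.

    Hypothetical helper, NOT a general RDF/XML parser: it only handles
    rdf:Description elements with rdf:about and literal-valued properties.
    """
    triples = set()
    root = ET.fromstring(text)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes qualified names as "{namespace}local";
            # dropping the braces gives the full predicate IRI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            datatype = prop.get(f"{{{RDF_NS}}}datatype")
            literal = ((prop.text or "").strip(), datatype)
            triples.add((subject, predicate, literal))
    return triples


# The Turtle version, written by hand with the prefixes expanded.
turtle_triple = (
    "http://bioimages.vanderbilt.edu/baskauf/26828",
    DCTERMS + "created",
    ("2003-06-06T08:47:15-05:00", XSD + "dateTime"),
)

same_graph = triples_from_simple_rdfxml(RDF_XML) == {turtle_triple}
```

Under these assumptions, same_graph is true: once the angle brackets and prefixes are gone, both serializations reduce to the same subject, predicate, and datatyped literal.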


Notice that I said that the client "could" understand the object of the triple to refer to an instant in time.  A client may (but is not required to) recognize XML Schema datatypes.  Similarly, a client might "understand" that the relationship between the resource and the time is one of creation (i.e. that the time is when the resource was created).  Such an "understanding" could occur because the Dublin Core vocabulary (of which the predicate dcterms:created is part) is well-known and commonly used. 
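
As a rough illustration of what "recognizing a datatype" might look like inside a client, here is a small Python sketch. The function interpret_literal is hypothetical, and I am assuming that Python's datetime.fromisoformat is an acceptable stand-in for xsd:dateTime parsing; it handles the literal in this post, but it does not cover every lexical form that XML Schema allows.

```python
from datetime import datetime, timezone

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def interpret_literal(lexical_form, datatype_iri):
    """Return a typed value for datatypes this client recognizes.

    Hypothetical client logic: anything with an unrecognized datatype
    is passed through unchanged as a plain string.
    """
    if datatype_iri == XSD_DATETIME:
        # Assumption: fromisoformat approximates xsd:dateTime well enough
        # for literals with an explicit "+hh:mm" / "-hh:mm" offset.
        return datetime.fromisoformat(lexical_form)
    return lexical_form


created = interpret_literal("2003-06-06T08:47:15-05:00", XSD_DATETIME)

# The same instant, expressed in UTC rather than with a -05:00 offset.
created_utc = created.astimezone(timezone.utc)

# A datatype the client does not know stays an uninterpreted string.
opaque = interpret_literal("42", "http://example.org/unknownType")
```

A client that skips the datatype check entirely is still a legitimate RDF consumer; it simply knows less about the object of the triple.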

I could also say something like:

@prefix my: <http://my.xyz/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<http://bioimages.vanderbilt.edu/baskauf/26828>
     my:x4m5dd2 "2003-06-06T08:47:15-05:00"^^xsd:dateTime.


A client that "understood" XML Schema datatypes could know that there was a relationship between the resource identified by the IRI http://bioimages.vanderbilt.edu/baskauf/26828 and the instant of time 8:47:15 AM central daylight time on 6 June 2003, but would have no idea about the nature of that relationship without further knowledge of the predicate my:x4m5dd2 .  (It is possible for a semantic client to "learn" more about what a predicate means - possibly by dereferencing the IRI, but that's a story for another blog post.)

The point here is that the ability of a client to "understand" a triple depends in part on decisions about the parts of RDF/RDFS/OWL that the client's programmer decides to implement, and in part on a significant social component: both the human responsible for producing the triple and the programmer of the client need to have a common understanding of what the predicate of the triple "means".
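
One way to picture the social component is as a lookup table that the client's programmer has to fill in ahead of time. The sketch below is only an illustration under that assumption: KNOWN_PREDICATES and describe_predicate are hypothetical names, and the description text is my paraphrase of the Dublin Core definition.

```python
# Predicates this hypothetical client has been taught about in advance.
KNOWN_PREDICATES = {
    "http://purl.org/dc/terms/created": "date of creation of the resource",
}


def describe_predicate(predicate_iri):
    """Return what the client 'knows' about a predicate, or None."""
    return KNOWN_PREDICATES.get(predicate_iri)


understood = describe_predicate("http://purl.org/dc/terms/created")
not_understood = describe_predicate("http://my.xyz/x4m5dd2")
```

Here understood holds a description, while not_understood is None: the client can still see that my:x4m5dd2 links the resource to a time instant, but nothing in its programming says what kind of link it is.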






What does a semantic client "do"?










If I create a graph of RDF triples and expose it through the Internet, what should I expect a client to do with it?  There is no requirement that any client do anything in particular with triples.  A client encountering a foaf:mbox property in a triple might under some circumstances send an email to the object email address.  A client encountering GEO namespace properties might place a point on a map visible to its user.  Presented with particular combinations of triples, a client might turn on a switch.  A client may facilitate a query or infer additional triples based on existing triples and a set of rules.  But these actions are dependent on the programmer of the client and are not controlled by the creator of the triples, who is simply creating a set of facts about the world according to the creator's perspective. 
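
Here is a hedged sketch of that idea: two hypothetical clients consume the same graph and do entirely different things with it. The example.org IRIs and the coordinate values are made up, I am assuming that "GEO namespace" refers to the W3C Basic Geo (WGS84) vocabulary, and for brevity literals are plain strings here. Neither function actually sends mail or draws a map; each just returns what it would act on.

```python
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
# Assumption: the "GEO namespace" is the W3C Basic Geo (WGS84) vocabulary.
GEO_LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
GEO_LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#long"

# A made-up graph; the subjects and values are purely illustrative.
GRAPH = {
    ("http://example.org/person/1", FOAF_MBOX, "mailto:someone@example.org"),
    ("http://example.org/place/1", GEO_LAT, "36.14"),
    ("http://example.org/place/1", GEO_LONG, "-86.80"),
}


def mail_client(graph):
    """Hypothetical client that only cares about foaf:mbox triples."""
    return sorted(o for (_, p, o) in graph if p == FOAF_MBOX)


def map_client(graph):
    """Hypothetical client that only cares about latitude/longitude pairs."""
    coords = {}
    for s, p, o in graph:
        if p in (GEO_LAT, GEO_LONG):
            coords.setdefault(s, {})[p] = float(o)
    # Only subjects with both a latitude and a longitude become map points.
    return {
        s: (c[GEO_LAT], c[GEO_LONG])
        for s, c in coords.items()
        if GEO_LAT in c and GEO_LONG in c
    }


addresses_to_mail = mail_client(GRAPH)
points_to_plot = map_client(GRAPH)
```

The graph is identical in both calls; what differs is which predicates each programmer chose to act on, and the creator of the triples has no say in either.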

Summary:

The idea of "intelligent agents" analyzing data in the form of RDF and taking action based on those data is very exciting and appealing.  However, making that happen depends critically on several factors:
- the availability of useful information in the form of RDF triples.
- decisions made by the programmers of clients about which rules the clients will use to process the triples they encounter.
- a common understanding of the meaning of predicates.
- programming decisions about the actions that will be taken by clients based upon the triples the clients encounter.

All four of these must be in place in order for RDF to become useful.  There is also a fifth factor that is primarily economic.  It is not enough to demonstrate that RDF can actually do something useful in a particular context.  One must also demonstrate that using RDF allows us to do things in that context that are impossible or ineffective with existing implemented technologies.  I believe that this may be the most important reason why little progress has been made in moving toward wider use of RDF within the TDWG community.  There is a cost associated with learning about and adopting a new technology, and that cost must be exceeded by the benefits to be gained through use of that technology.  Just being exciting isn't enough, and it isn't yet clear to me that we have demonstrated compelling things that RDF can do for us that other technologies can't.  How's that for agnosticism?

In subsequent blog posts, I plan to talk in more detail about the factors outlined above.  Next up: What does it mean to "discover new information" in an RDF context?

* "IRI" now used in preference to "URI", see http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-iri
