Categories
Semantics

The Bottoms Up RDF Tutorial

Recovered from the Wayback Machine.

When I wrote the first chapter of the book, Practical RDF I used the analogy of the blind men describing an elephant to describe how people see RDF. With the original fable, each blind man would feel a different part of the elephant and make a decision about what the elephant looked like as a whole because of their own experience. One man said the elephant would be like a wall, after feeling its side; another like a spear after feeling its tusk, and so on.

The same can be said for RDF. It was created to provide a language for describing things on the web–a simple premise. Yet to hear RDF described, it is everything from just another metamodel, like the relational model; to the future of the Web; to the most twisted piece of thinking ever to have originated from human minds. In particular, the latter belief seems to dominate most discussions about RDF.

I’m not quite sure why the Resource Description Framework (RDF) was constrained into a defensive position since birth. It could be that the long delays associated with the specification left more than a whiff of vapor. Or perhaps it was the sometimes unrealistic expectations heaped upon the poor beastie that led folks into taking a devil’s advocate position on both the concept and the implementation–positions that have now become rigidly unyielding as if the people had swallowed a shitload of concrete and their asses are firmly pinned to the ground.

I do know that those of us who appreciate how the model can facilitate interoperability, as well as discovery, have spent a great deal of our time defending the model and the implementation of the model, RDF/XML. So much so that if all that time was laid end to end, it would stretch around even the heads of even the biggest egos on the web. More importantly, if we had used that time for creating rather than explaining, the usability of RDF would be a moot point.

This last year, I watched as yet another person defended RDF in response to yet another RDF detractor and I realized reading what the detractor wrote that he didn’t really even know what RDF was, much less how it could be used. He was just repeating what so many other people said, and we were responding right on cue–saying much the same words in defense that we said a year ago, two, three, and more. I realized then that I can never successfully ‘defend’ RDF because a defense implies that decisions haven’t been already made.

Instead, I will respond to questions and I will clarify, but rather than talk about how usable RDF is and point out the many tools available, I’d rather spend most of my time creating something usable with these tools. Luckily, by design, I don’t have to convince the detractors to do the same, because the really great thing about RDF is that its use does not have to be ubiquitous for me to do fun and useful things with the available technology.

This tutorial is based on my own uses of RDF and RDF/XML. Though I’ll cover all the important components of the model, I’m focusing on what I call street RDF–RDF that can be used out of the box to meet a need rather than being targeted to some universal megadata store in the future. In addition, rather than just introduce each aspect as it comes along, and building from the simple to the complex, I’m going to take the arguments against RDF that I’ve heard in the last four years, and address them one at a time, using the components of RDF as I go.

In other words, rather than approach the elephant that is RDF, I am, instead, approaching each of the blind men’s view of RDF, and see if we can’t take what they’re seeing and fit it into the whole.

RDF is Too Complex

If you ask most of the people who don’t like RDF why they feel this way, most will answer that RDF is too complex. They will point to the fact that the specification’s description is spread across six released documents at the W3C, most of which contain an esoteric blend of formal proofs and language obscure enough to give even decided RDFophiles pause. They have a valid point, too: The RDF specifications are not meant to be consumed by the average web user.

However, one can say the same of the HTML or XHTML documents, or those for CSS or any other specification. The documents at the W3C are meant more for those implementing the technology, then the average user. More, they’re meant as a way of ensuring that implementations are consistent, which if one leaves aside anything else about a specification, consistency has to be the critical requirement.

Yet if we strip away the proofs and the checks, the core of RDF is, at heart, very simple. The basic component of RDF is a statement of the form subject predicate object; loosely translated into comparable terms from a sentence in English, the subject is the noun, the predicate the verb, and the object, the thing being acted on by the verb.

The cat ate the mouse.

The photograph’s has a height.

The height is 500 pixels

In the specification, the predicate is considered the property, but there is an implicit very of ‘has’ associated with that property. So, for instance, if this document was written by me, the statement could be:

This document has writer of Shelley

Even though the predicate is technically ‘writer’, there is an implicit verb: has.

RDF refers to these simple statements as triples, each consisting of a required subject, predicate, and object. These triples then form the basis for the for both the language and the specification. If you access the information I have stored about one of my posts, you’ll see that no matter how complex the model, it can be converted to a set of triples.

Of course, now that you have these triples, you have to pull them all together and that’s where the RDF graph comes in. Technically, an RDF model is a node and directed arc graph, where the predicates are on the arcs, and the subject and objects are in the nodes. To see what a graph looks like, access the RDF Validator hosted by the W3C, and type in the URL for one of my RDF files, such ashttp://weblog.burningbird.net/2005/10/27/perceived-barriers/. Change the output options to Graph only and then have the Validator parse the data. Or you can see the generated graph here.

Note that the predicates are given namespace identification, such as http://purl.org/dc/elements/1.1/ for the Dublin Core source predicate. The reason for this is so that I can have a ‘source’ in my schema that is safely differentiated from ‘source’ in your schema, in such a way that both can be used in the same model. A predicate is always identified with a specific namespace: either one completely spelled out, as in the model; or one given an abbreviation of the namespace (which is then defined in the namespace section of the model), such as dc:source; or if none is given, it’s assumed to be part of the default schema for the model. These namespaces can be added to the model anywhere, but are usually defined in the opening RDF element:

<rdf:RDF
xml:base=”http://weblog.burningbird.net#”
xmlns:xml=”http://www.w3.org/XML/1998/namespace”
xmlns:rdf=”http://www.w3.org/1999/02/22-rdf-syntax-ns#”
xmlns:rdfs=”http://www.w3.org/2000/01/rdf-schema#”
xmlns:xsd=”http://www.w3.org/2001/XMLSchema#”
xmlns:owl=”http://www.w3.org/2002/07/owl#”
xmlns:dc=”http://purl.org/dc/elements/1.1/”
xmlns:foaf=”http://xmlns.com/foaf/0.1/”
xmlns:image=”http://jibbering.com/vocabs/image/#”>

Returning to the snapshot of the model, note that some of the nodes are ovals, others are square. The difference is that those nodes that are resources are drawn as oval; those that are literal values are drawn with a rectangle around them. Resources in RDF are objects that are identified by a URI (Uniform Resource Identifier) reference, or URIref. A URIref is basically a way of uniquely identifying an object within a model. For instance, the URIref for this document would be the URL, since a URL is also a specific type of URI. Your weblog could be identified by the URL used to access it. I, as an individual, can be identified by a URL to an about-me page, or a mailto reference to my email address–which leads us to another aspect of RDF that gives people pause: using URIs to identify objects that don’t exist on the web.

Everything can be described using fruit

Everything can be described using fruit if you’re motivated to do so and have both imagination and a sense of consistency. I could be described as apple/lime/pineapple and you as tangerine/strawberry/kiwi. We would most likely run into duplicates, but if we use the fruit system to define locations, then specifying a person’s fruit name and location would most likely serve to identify an individual: I am apple/lime/pineapple from banana/honeydew/orange.

This isn’t as silly as it sounds. Well, yes, it is as silly as it sounds, but the point is we’ve been using arbitrary symbols to identify people and things and places since we first started scratching drawings into cliff walls. And by common agreement, we know that these forms of identification are representative–they aren’t meant to be taken for the ‘real’ thing.

Yet when RDF uses URIs to identify things, some folk tsk-tsked, saying that URIs were meant to identify things that could be fetched from the web. I can’t be fetched from the web, so how can I be identified by a URI?

How can I be identified by fruit? Just because I name myself a fruit name, doesn’t mean you can put me into a blender and make juice out of me. The same applies to using a URI to identify me: though I can’t be ‘fetched’ from the web, I can put up a page that is my representation, my avatar if you will, and use this as a way of identifying me within web-related activities. So in RDF, any object that has a URI is a resource. Any object that doesn’t have a URI is a literal value or a blank node. Speaking of blank nodes, if you think people have kittens over using URIs to access non-web objects, you should see how they feel about blank nodes.

Who am I?

Within the universe that is RDF, there are objects that have no name, because to give names for these objects is meaningless — outside of the sphere of a specific RDF model, these objects have no unique identity.

If you’re like me, you have one drawer set aside as a junk drawer. In this, you put all the crap that you can’t a place for elsewhere: pens, paper clips, rubber bands, that odd plastic knob that fell off something, pizza coupons, which you’ll never use, and so on. In our household, the junk drawer is the small top right-most drawer at the end of in the free-standing unit, which has the range and oven, facing toward the oven.

Given this information, you can identify exactly which drawer is the junk drawer. So if I ask you to please get me that odd plastic knob from the ‘junk drawer’, you won’t have to search all my drawers to find it. But if I were to go into your house and you asked me the same and I don’t know which drawer is the junk drawer in your home, this way of identifying the junk drawer is meaningless.

Oh, by happenstance, the method of identifying the drawers could be the same, but that doesn’t make them the same identical junk drawer–it’s just that, by luck, you and I have the same configurations of kitchens, and this particular drawer has a ‘junk drawer’ appeal to it.

Blank nodes are basically the junk drawers of RDF. Though they are given some ‘dummy’ identifier within the model (such as genid:ARP129805), to identify them uniquely outside of the model in which they’re found makes little sense; the same as identifying a ‘junk drawer’ outside of an individual’s home makes little sense. We don’t want to give a blank node uniqueness outside of the model–because to do so, changes its meaning. It would be the same as formally identifying my junk drawer as “Shelley’s Junk Drawer”, which implies this same junk drawer will always be Shelley’s Junk Drawer, and my junk drawer will always be the one I have now. This just isn’t true. It is only my junk drawer now, in this time, in this place.

Within the model, an identifier not only makes sense, it’s an imperative; otherwise, we have no way of being to access this object consistently in the model. Usually, this is generated by whatever tool is used to build the model. And if two models are merged, whatever tool is used to manage this merging renames the blank nodes so that each is, again, unique within the new model. As before, though, this name isn’t of interest, in and of itself. It’s just a label, a convenience.

For some reason, blank nodes, or bnodes as they are commonly called, cause consternation in some folks who resist their use; saying that bnodes add unnecessary complexity to a model. After all, there’s no reason we can’t use something such as a fragment identifier (i.e. http://weblog.burningbird.net/somepost#someobject, ‘someobject’ being the fragment) to identify the node if it’s dependent on the model.

However, to give a bnode an identifier that allows it to be uniquely identified outside of the context of the model would again change the meaning of the node–it is no longer a bnode, it is Something Else. To return to the analogy of the junk drawers, if I were to marry again someday, and my husband brought his junk drawer to our new home, and I brought mine, our home would then have two junk drawers: identified within this new context as Hubbie’s junk drawer and Shelley’s junk drawer. The latter wouldn’t be the same “Shelley’s junk drawer” I had when I was single; I would, however, treat it exactly the same.

(We could also merge the contents of our junk drawers, and have one combined His-and-Her junk drawer. This is something I just can’t contemplate, though, which is probably why I’ll remain single the rest of my life. My junk is my junk–mixed with another’s, it would no longer be my junk.)

I could merge two models and the program may or may not use the same names for the bnodes in this new, combined model, but it doesn’t matter: how each is used, and each bnode’s contribution to the model as a whole would be the same regardless of its name, because the name doesn’t matter.

If, in the interests of simplification, I’m not willing to do without my bnodes, I am quite happy to do without other aspects of RDF, such as containers and the Big Ugly: reification.

The Big Ugly

RDF containers are objects that, themselves, are assumed to contain other objects (and in fact, being a container is part of the intrinsic meaning of the object). How the contained objects relate to each other depends on the container: in a Bag, the objects can occur in any order, while in a Seq (sequence), order has meaning.

If you open a RSS 1.0 or 1.1 syndication feed you would see a container. For instance, my feed currently has the following:

<items>
<rdf:Seq>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/27/the-theory-of-relativity-explained/”/>

<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/27/perceived-barriers/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/26/that-sucking-sound-you-hear/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/26/dont-mind-me-just-carry-on-as-usual/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/26/pleasedo-evil/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/25/the-theory-of-relativity/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/25/the-heart-of-the-civil-rights-movement/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/25/lets-hear-it-for-bad-ideas/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/24/travel-confirmed/”/>
<rdf:li rdf:resource=”http://weblog.burningbird.net/2005/10/23/quiet/”/>
</rdf:Seq>
</items>

In turn, each of the items listed within this Seq container would be listed and defined later in the document.

I never use Containers in my simple RDF models (outside of my syndication feed), primarily because one can use straight RDF statements and achieve the same results. When the syndication feed is output as triples, the Seq becomes a simple statement whereby the container object is a bnode, has a predicate of http://www.w3.org/1999/02/22-rdf-syntax-ns#type, and a object of http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq. Each listed item then becomes a separate statement, with the container bnode as the object, a predicate indicating the position in the sequence, and an object identified by the permalink for each individual post. The real meaning for the container comes from the type predicate and the Seq value — no different than how one can attach any number of other statements about a resource by giving specific predicate and object values.

If there are alternatives to RDF containers, there aren’t for reification, though again, the result is a set of triples. Reification is basically making a statement about a statement. If I make a statement that Microsoft has decided to built Vista on top of Linux, just like Mac OS X, and I record this in RDF, I’ll most likely also want to record who made this statement–it acts as provenance for the statement. In RDF, attaching this provenance to the document is known as reifying the statement.

However, reifying a statement does not make an assertion about its truth. In fact, if you look up the word provenance its definition is:

1. Place of origin; derivation.
2. Proof of authenticity or of past ownership. Used of art works and antiques.

In other words, it can be considered a verification of the source, but not necessarily a verification of the truth. An important concept in RDF, but not necessary for the simple uses of RDF I incorporate into my site.

I also don’t need to incorporate OWL, the ontology language that’s built on top of RDF. OWL stands for Web Ontologoy Language. Yeah, I know — the acronym doesn’t fit. OWL adds another layer of sophistication on top of RDF. Through the use of OWL, not only can we record statements about resources, we can also record other parameters that allow us to infer additional information…without this information being specifically recorded.

For instance, I can define a RDF class called Developer. I can then create a subclass of Developer and call it OpenSourceDeveloper, to classify open source developers. Finally, I can create a subclass off this subclass called LampDeveloper, to classify open source developers who mainly work with LAMP technologies. With this structure defined, I can attach additional statements about Developer and be able to infer the same information about open source developers and developers who use LAMP.

With OWL, we can define class structures, constrain membership, define ranges, assign cardinality, establish the logical relationship between properties and so on, all of which allows us to make inferences based on the found statements. It’s very powerful, yet all of it eventually gets recorded as triples. Plain, old triples–the atom of the semantic web.

I haven’t made extensive use of OWL in version 1.0 of my Metaform extension layer, but I plan on incorporating it into version 2.0 of the plugins and extensions. Still, I have managed to capture a great deal of information. The question than becomes–where do I put it?

(Hey! I managed to work 2.0 into the conversations. That should be triggering all sorts of detectors.)

Where to put the pesky data

As you can see, RDF doesn’t have to be complicated, but it can be very sophisticated. With it, we can record specific pieces of information, which can then be queried using a specialized query language (RDQL, and now SPARQL). We can also record information about the structure of the data itself that allows us to make some rather interesting inferences. But there was one small problem with RDF: where do we put it?

RDF is serialized into RDF/XML, which we’ll get into later. For some files, the RDF/XML used to define the resource can be included directly in the file. The XMP section of a photograph is one such. For others, such as a RSS 1.0 syndication feed, the RDF/XML defines the data, as well as the metadata. However, the most common web resource are web pages, and these are created using HTML or XHTML, and these formats do not have a simple method for embedding XML within the documents.

To work around the limitation, people have used HTML comments to embed RDF/XML and this can work in a very limited way. As an example, the RDF/XML used to provide autodiscovery of trackback within weblog posts is embedded within an HTML comment. But the limitations inherent in this approach are so significant that it’s not considered a viable option.

The W3C is also working on an approach to extend XHTML to allow RDF/XML. Still, others are exploring the concept of using existing XHTML attributes to hold RDF data, which can then be interpreted into RDF/XML using XSLT or some other technology. In this approach, class and other attributes–legitimate to use in XHTML–can be used to hold some of the data

However, all of these options presuppose that web pages are, by nature, static objects. The trend, though, is for dynamic web pages. Most commerce applications now are database driven, as are most commercial sites. As for weblogs, it’s becoming increasingly rare to find static HTML pages.

These pages are served up directly from the database when a specific URL is accessed. For instance, accessing the URL http://weblog.burningbird.net/2005/10/28/truth-hurts/, in combination with a specific rule defined in my .htaccess file, triggers the web server to convert the URL into one that identifies the post name and archive information, which is then passed to a PHP program. This program uses this passed in information to locate the post, incorporating the dynamic data with a template resulting in a page that, to all intents, looks like a typical web page.

It’s then a short step to go from serving up a web page view of the data to serving up an RDF/XML view of the metadata defined for the page. That’s what I do in my site–attaching /rdf/ to the end of a post returns an RDF/XML document if formal metadata is defined for the web page. Unfortunately, this conflicts with WordPress, which determines that /rdf/ should return a RSS 1.0 view of whatever data is available for the object. As such when I converted my Wordform plugins and extensions to WordPress, I used /rdfxml/ to pull out any metadata defined for the document.

This works nicely, and with the increased use of dynamic web pages, seems to me to be the way of the future. Not only could you provide a XHTML view of data, you could provide an RDF/XML view of metadata, and even generate a microformat version of the same metadata information for inclusion within the XHTML tags.

Tag, You’re Not It

Speaking of microformats, the hot metadata technology the last year has been microformats and structured blogging. With microformats, the use of standard XHTML attributes such as class and rel are used to define the metadata and associate directly with the data, in a manner that makes the metadata visible to the web page reader. Structured blogging follows this same premise, except that it proposes to use existing XHTML structures in more meaningful manners, facilitating the process through the use of plugins and other programmatic aids.

Both approaches are useful, but limited when compared to RDF. One could easily take RDF/XML data and generate microformats and/or structured blogging, but the converse isn’t true. Even within the simple plugins that I’ve created for Wordform and WordPress, there isn’t a microformat or structured blogging approach that could replicate the metadata that I record for each of my pages. And I’ve not even tried to stretch the bubble.

In fact, the best approach would be to record the data in RDF/XML, and then use this to annotate the dynamically generated XHTML with microformats, or organize it for structured blogging. This approach then provides three different views of metadata — enough to satisfy even the greediest metadata consumer.

Where be the data

When I first created my RDF plugins for Wordform, I inserted the data into the same MySQL database that held my weblog entries. By the time I was finished, I had over 14,000 rows of data. However, about this time, I also started experimenting with storing the data in files, one file for each URL. I found that over time, the performance of the files-based system was better, and a litte more robust than that for the database. More importantly, using the file approach means that people who use my WordPress weblog plugins don’t have to modify their database to handle the RDF data.

When one considers that MySQL accesses data in the file system and that PHP caching usually makes use of files, storing one model per URL in one file makes a lot of sense. And so far, it also takes up less space overall, as all the periphery data necessary for tables in MySQL actually adds to the load.

Of course, each file is stored as RDF/XML–the topic about I left for last, as there is no other aspect of RDF that generates heated discussion than the format of RDF/XML.

The Final Answer

Ask any person why they don’t want to work with RDF and you’ll hear comments about the “RDF tax” and the complexity and most of all, that RDF/XML is ugly.

We’ve seen that the so-called RDF tax is less taxing than one assumes considering the complaints. The requirements for an RDF model are the minimum needed to ensure that data can be safely added to, and taken from a model without any detrimental impact on the integrity of that model. I can easily grab an entirely new vocabulary and throw it into a plugin, which adds data to the same model other plugins add data to and know that I don’t have to worry about a collision between elements, or loss of data. More than that, anyone can build a bot and consume this data without having to know what vocabularies I’m using. Later on, this same data can be queried using SPARQL and it’s only when searching for specific types of data that the vocabularies supported comes into play. The data exists for one purpose but can feed an infinite number of purposes in the future.

As for the complexity, well, RDF just is: make it as simple or as complex as you need. Just because the specification has some more complex components, doesn’t mean you have to use them. Dumb RDF stores–dumb meaning using RDF for immediate needs rather than long-term semantic goodness–are going to become more popular, as soon as we realize that in comparison with other data models, such as the relational, RDF is more self-contained; has a large number of programming APIs to manipulate; and are lightweight and easy to incorporate into existing applications.

Finally, the issue of the ugly RDF/XML. Oddly enough, my first exposure to XML in any depth was through Mozilla’s early use of RDF/XML. As such, I find little about the structure to be offensive.

Regardless of exposure, though, what does it matter how RDF/XML looks? There may be some unusual people who spend Saturday nights manually writing XML, but for the most part, our XML is generated, and this includes our RDF/XML. As for programmers being concerned about having to understand the syntax in order to generate it or parse it, all they have to understand is how the RDF triple works, because it’s the RDF API developers who work the issues of RDF/XML. As such, there are RDF APIs in PHP, Perl, C#, C, C++, Python, Ruby, Java, Lisp, and other languages; APIs that provide functions to generate or parse the XML, so that all a developer needs worry about is calling a function to create a new statement and add it to the model.

In fact, comparing the technologies to work with straight XML and RDF/XML there is no contest–the RDF/XML APIs handle more of the bits at the lower level, and as such, as much easier to use.

As to why we don’t just generate the triples directly, we’ve just spent the last five years convincing business that XML was the interoperable syntax of the future–why should we change now, and say they need another syntax? You can write the most amazing application using any number of tools, in any number of languages without once having to touch the XML. And, as one of my plugins demonstrates, you can also use XML parsing in addition to RDF processing. Two for the price of one.

So my final answer about the ugliness of RDF/XML is: don’t look at it.

Print Friendly, PDF & Email