Separating presentation from semantics

After all these years, we have finally reached the point where we’ve separated page organization from presentation, and now we’re about to embark on the same mistakes again, but this time with presentation and semantics.

I’ve been following the issues associated with the vocabindex Drupal module, including one where the person submitting the bug stated that vocabindex’s use of UL was incorrect. We’re supposed to, MXT writes, use definition lists rather than unordered lists for any list of terms with associated definitions.

At first glance, it does seem as if the vocabindex module is using the unordered list incorrectly. After all, look at any of my category pages (such as the one for the Semantic Web)—what you see is a list of “terms” and their associated definitions. An obvious candidate for definition lists.

Look more closely, though. In my sidebar menu I list the vocabindex terms as links to web page URIs, but there is no definition attached to any item. The description, if one is given, is, instead, added as a title attribute to the item and displayed only when the item has cursor focus. Yet, it’s the same data. Does this mean, then, I’m somehow not properly displaying my menu items? Should I have a huge sidebar, with the item description given underneath?
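To make the comparison concrete, here is a sketch of the two treatments of the same term (the URL and description are invented for illustration):

<!-- in the sidebar: an unordered list, the description tucked into the title attribute -->
<ul>
<li><a href="/category/semantic-web/" title="Posts about RDF and metadata">Semantic Web</a></li>
</ul>

<!-- on an index page: the same data, recast as a definition list -->
<dl>
<dt><a href="/category/semantic-web/">Semantic Web</a></dt>
<dd>Posts about RDF and metadata</dd>
</dl>

Same term, same optional description; only the presentation differs.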

More importantly, whether I include a text description for each item is purely optional: some have one, some don’t. Yet the items in the list have meaning without any associated description. In fact, each item in the list is really nothing more than a label for a bucket to hold content. I could just as easily use foo, bar, and sillyputty as labels, except that I’m trying to use “meaningful” labels in order to enable you all to better find past content.

In the absolutely ancient W3C page where lists are covered, a list of ingredients is given as an example of an unordered list:

  • 1 cup sugar
  • 1 cup oil
  • 2 cups flour

However, I don’t see that anyone would have a problem with adding parenthetical information to this list in order to further clarify the items:

  • 1 cup sugar (light brown granulated by C & H)
  • 1 cup oil (canola or corn, but not olive)
  • 2 cups flour (white or mixed white and whole wheat)

This is really nothing more than what vocabindex is doing with the vocabulary index terms within both the index pages and my sidebar menu: a listing of items with a (parenthetical) description to clarify what each item is. It’s only when the description is displayed as a “tooltip” that one sees it for what it is: a clarifying phrase, only. When presented in the index pages, it “looks” like a definition list, and so we want it marked up this way—mixing up semantics and presentation.

A definition list is assumed to have two pieces of information: the term or phrase, and an associated definition. Even when not present, the definition is still assumed to be forthcoming at some point—not having it is the exception, not the rule. An unordered list is just that: a list of items. They can be a list of items to buy at the store, select from in a form, or click on in a web page. There’s no assumption that any additional information is necessary for the item. If there were, Drupal would make this information mandatory rather than optional. The application and associated developers would definitely discourage the use of the items in a tag cloud or other format where only the term is given, because the term, by itself, would be meaningless.

Yet we look at how the terms are portrayed in a page like the vocabindex page I linked above, and that’s enough for us to say we should use one form of markup over another because it’s more “semantical”. Further exploration at other sites that attempt to define the differences between unordered lists and definition lists shows the same thing: if we see two pieces of data, we assume a definition list, because that’s what it looks like—not what it is.

The HTML5 document adds another key element to the discussion of definition lists by stating that definition lists are name-value pairs. Can we deduce from this, then, that the name has no meaning without the value? That’s my interpretation: a definition list is the proper semantic markup only when data is defined within a context of names and associated values, each meaningless without the other. If, however, *HTML5 allows us to list the names without values, or the values without the names, then the HTML5 document is imprecise, and we should just use whatever we want to use—semantics cannot be derived from imprecision.

Currently, in Drupal, vocabulary terms are discrete labels, nothing more. Any description is for clarification not definition, and isn’t essential to the meaning of the term. Forget how the vocabindex pages “look”, and focus on what the data means. If we can’t do that, then this whole semantic markup thing is a bit of a farce, really.

*Confirmed: it doesn’t

RDF too

Congratulations to the RDFa folks for rolling out a release candidate of RDFa for XHTML. Now that I’ve finished tweaking site designs, my next step is to see about incorporating smarts into my pages, including the use of RDFa. In addition, I also want to incorporate the RDF Drupal modules more closely into the overall functionality. The SPARQL module still seems broken, but the underlying RDF modules seem to be working now.

The RDFa release candidate is timely, as I gather the BBC has decided to forgo microformats in favor of RDFa. This has re-awakened the “microformats vs. RDFa” beast, which we thought we had lulled to sleep. I guess we need to switch lullabies.

Speaking of lullabies, I had hoped to start work on the second edition of Practical RDF later this year, but it is not to be. The powers-that-be at O’Reilly aren’t comfortable with a second edition, and have instead accepted another book proposal that covers some of what I would have covered, in order to make that book livelier. There just isn’t room for both.

I am disappointed. The first version of “Practical RDF” was difficult because the specification was undergoing change, the semantic web wasn’t generating a lot of interest, and there weren’t that many RDF-based applications available. Now, the specs are mature, we have new additions such as RDFa, increased interest in semantics, and too many applications to fit into one book. I feel as if I started a job, and now won’t be able to finish it.

One issue in the book decision is the “cool” factor. RDF and associated specifications and technologies aren’t “cool”, in that people don’t sit around at one camp or another getting hot and bothered talking about how “RDF is going to change the world!” However, the topic doesn’t necessarily have to be “cool” if the author is “cool”, and I’m not. I don’t Twit-Face-Space-Friend-Camp-Chat-Speak-Shmooze. What I do is sit here quietly in my little corner of waterlogged Missouri, try out new things, and write about them. That’s not really cool, and two not-cools do not make a hot book.

I don’t regret my choice of lifestyle, and not being “cool”. I do regret, though, leaving the “Practical RDF” job undone. Perhaps I’ll do something online with PDFs or Kindle or some such thing.

The Bottoms Up RDF Tutorial

When I wrote the first chapter of the book Practical RDF, I used the analogy of the blind men describing an elephant to describe how people see RDF. In the original fable, each blind man feels a different part of the elephant and decides what the animal as a whole must look like based on his own experience. One man said the elephant must be like a wall, after feeling its side; another, like a spear, after feeling its tusk; and so on.

The same can be said for RDF. It was created to provide a language for describing things on the web–a simple premise. Yet to hear RDF described, it is everything from just another metamodel, like the relational model; to the future of the Web; to the most twisted piece of thinking ever to have originated from human minds. In particular, the latter belief seems to dominate most discussions about RDF.

I’m not quite sure why the Resource Description Framework (RDF) was constrained into a defensive position since birth. It could be that the long delays associated with the specification left more than a whiff of vapor. Or perhaps it was the sometimes unrealistic expectations heaped upon the poor beastie that led folks into taking a devil’s advocate position on both the concept and the implementation–positions that have now become rigidly unyielding as if the people had swallowed a shitload of concrete and their asses are firmly pinned to the ground.

I do know that those of us who appreciate how the model can facilitate interoperability, as well as discovery, have spent a great deal of our time defending the model and the implementation of the model, RDF/XML. So much so that if all that time were laid end to end, it would stretch around even the heads of the biggest egos on the web. More importantly, if we had used that time for creating rather than explaining, the usability of RDF would be a moot point.

This last year, I watched as yet another person defended RDF in response to yet another RDF detractor and I realized reading what the detractor wrote that he didn’t really even know what RDF was, much less how it could be used. He was just repeating what so many other people said, and we were responding right on cue–saying much the same words in defense that we said a year ago, two, three, and more. I realized then that I can never successfully ‘defend’ RDF because a defense implies that decisions haven’t been already made.

Instead, I will respond to questions and I will clarify, but rather than talk about how usable RDF is and point out the many tools available, I’d rather spend most of my time creating something usable with these tools. Luckily, by design, I don’t have to convince the detractors to do the same, because the really great thing about RDF is that its use does not have to be ubiquitous for me to do fun and useful things with the available technology.

This tutorial is based on my own uses of RDF and RDF/XML. Though I’ll cover all the important components of the model, I’m focusing on what I call street RDF–RDF that can be used out of the box to meet a need, rather than being targeted to some universal megadata store in the future. In addition, rather than introduce each aspect as it comes along, building from the simple to the complex, I’m going to take the arguments against RDF that I’ve heard in the last four years and address them one at a time, using the components of RDF as I go.

In other words, rather than approach the elephant that is RDF, I am, instead, approaching each blind man’s view of RDF, to see if we can’t take what each is seeing and fit it into the whole.

RDF is Too Complex

If you ask most of the people who don’t like RDF why they feel this way, most will answer that RDF is too complex. They will point to the fact that the specification’s description is spread across six released documents at the W3C, most of which contain an esoteric blend of formal proofs and language obscure enough to give even dedicated RDFophiles pause. They have a valid point, too: the RDF specifications are not meant to be consumed by the average web user.

However, one can say the same of the HTML or XHTML documents, or those for CSS or any other specification. The documents at the W3C are meant more for those implementing the technology than for the average user. More than that, they’re meant as a way of ensuring that implementations are consistent; and leaving aside everything else about a specification, consistency has to be the critical requirement.

Yet if we strip away the proofs and the checks, the core of RDF is, at heart, very simple. The basic component of RDF is a statement of the form subject predicate object; loosely translated into comparable terms from a sentence in English, the subject is the noun, the predicate the verb, and the object, the thing being acted on by the verb.

The cat ate the mouse.

The photograph has a height.

The height is 500 pixels.

In the specification, the predicate is considered the property, but there is an implicit verb of ‘has’ associated with that property. So, for instance, if this document were written by me, the statement could be:

This document has writer of Shelley

Even though the predicate is technically ‘writer’, there is an implicit verb: has.
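As a minimal sketch of that statement in RDF/XML, borrowing the Dublin Core creator property for “writer” (the document URL is invented):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
<!-- subject: the document; predicate: dc:creator ("has writer"); object: the literal "Shelley" -->
<rdf:Description rdf:about="http://weblog.burningbird.net/some-document">
<dc:creator>Shelley</dc:creator>
</rdf:Description>
</rdf:RDF>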

RDF refers to these simple statements as triples, each consisting of a required subject, predicate, and object. These triples then form the basis for both the language and the specification. If you access the information I have stored about one of my posts, you’ll see that no matter how complex the model, it can be converted to a set of triples.

Of course, now that you have these triples, you have to pull them all together, and that’s where the RDF graph comes in. Technically, an RDF model is a node and directed arc graph, where the predicates are on the arcs, and the subjects and objects are in the nodes. To see what a graph looks like, access the RDF Validator hosted by the W3C, and type in the URL for one of my RDF files, such as http://weblog.burningbird.net/2005/10/27/perceived-barriers/. Change the output options to Graph only and then have the Validator parse the data. Or you can see the generated graph here.

Note that the predicates are given namespace identification, such as http://purl.org/dc/elements/1.1/ for the Dublin Core source predicate. The reason for this is so that I can have a ‘source’ in my schema that is safely differentiated from ‘source’ in your schema, in such a way that both can be used in the same model. A predicate is always identified with a specific namespace: either one completely spelled out, as in the model; or one given an abbreviation of the namespace (which is then defined in the namespace section of the model), such as dc:source; or if none is given, it’s assumed to be part of the default schema for the model. These namespaces can be added to the model anywhere, but are usually defined in the opening RDF element:

<rdf:RDF
xml:base="http://weblog.burningbird.net#"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:image="http://jibbering.com/vocabs/image/#">

Returning to the snapshot of the model, note that some of the nodes are ovals, others are square. The difference is that nodes representing resources are drawn as ovals, while those holding literal values are drawn with a rectangle around them. Resources in RDF are objects that are identified by a URI (Uniform Resource Identifier) reference, or URIref. A URIref is basically a way of uniquely identifying an object within a model. For instance, the URIref for this document would be its URL, since a URL is also a specific type of URI. Your weblog could be identified by the URL used to access it. I, as an individual, can be identified by a URL to an about-me page, or a mailto reference to my email address–which leads us to another aspect of RDF that gives people pause: using URIs to identify objects that don’t exist on the web.
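A short sketch of the difference, with invented values; the first property points to another resource (an oval in the graph), while the second holds a literal (a rectangle):

<rdf:Description rdf:about="http://weblog.burningbird.net/some-post">
<!-- the object is a resource, identified by URIref: drawn as an oval -->
<dc:source rdf:resource="http://weblog.burningbird.net/some-earlier-post"/>
<!-- the object is a plain literal value: drawn as a rectangle -->
<dc:title>Some Post Title</dc:title>
</rdf:Description>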

Everything can be described using fruit

Everything can be described using fruit if you’re motivated to do so and have both imagination and a sense of consistency. I could be described as apple/lime/pineapple and you as tangerine/strawberry/kiwi. We would most likely run into duplicates, but if we use the fruit system to define locations, then specifying a person’s fruit name and location would most likely serve to identify an individual: I am apple/lime/pineapple from banana/honeydew/orange.

This isn’t as silly as it sounds. Well, yes, it is as silly as it sounds, but the point is we’ve been using arbitrary symbols to identify people and things and places since we first started scratching drawings into cliff walls. And by common agreement, we know that these forms of identification are representative–they aren’t meant to be taken for the ‘real’ thing.

Yet when RDF uses URIs to identify things, some folk tsk-tsked, saying that URIs were meant to identify things that could be fetched from the web. I can’t be fetched from the web, so how can I be identified by a URI?

How can I be identified by fruit? Just because I name myself a fruit name, doesn’t mean you can put me into a blender and make juice out of me. The same applies to using a URI to identify me: though I can’t be ‘fetched’ from the web, I can put up a page that is my representation, my avatar if you will, and use this as a way of identifying me within web-related activities. So in RDF, any object that has a URI is a resource. Any object that doesn’t have a URI is a literal value or a blank node. Speaking of blank nodes, if you think people have kittens over using URIs to access non-web objects, you should see how they feel about blank nodes.

Who am I?

Within the universe that is RDF, there are objects that have no name, because giving these objects names is meaningless — outside of the sphere of a specific RDF model, these objects have no unique identity.

If you’re like me, you have one drawer set aside as a junk drawer. In it, you put all the crap that you can’t find a place for elsewhere: pens, paper clips, rubber bands, that odd plastic knob that fell off something, pizza coupons you’ll never use, and so on. In our household, the junk drawer is the small top right-most drawer at the end of the free-standing unit that holds the range and oven, on the side facing the oven.

Given this information, you can identify exactly which drawer is the junk drawer. So if I ask you to please get me that odd plastic knob from the ‘junk drawer’, you won’t have to search all my drawers to find it. But if I were in your house and you asked me the same, and I didn’t know which drawer was the junk drawer in your home, this way of identifying it would be meaningless.

Oh, by happenstance, the method of identifying the drawers could be the same, but that doesn’t make them the identical junk drawer–it’s just that, by luck, you and I have the same kitchen configuration, and this particular drawer has a ‘junk drawer’ appeal to it.

Blank nodes are basically the junk drawers of RDF. Though they are given some ‘dummy’ identifier within the model (such as genid:ARP129805), identifying them uniquely outside of the model in which they’re found makes little sense, just as identifying a ‘junk drawer’ outside of an individual’s home makes little sense. We don’t want to give a blank node uniqueness outside of the model, because to do so changes its meaning. It would be the same as formally identifying my junk drawer as “Shelley’s Junk Drawer”, which implies this same junk drawer will always be Shelley’s Junk Drawer, and my junk drawer will always be the one I have now. This just isn’t true. It is only my junk drawer now, in this time, in this place.

Within the model, an identifier not only makes sense, it’s an imperative; otherwise, we have no way of being able to access this object consistently within the model. Usually, the identifier is generated by whatever tool is used to build the model. And if two models are merged, whatever tool manages the merge renames the blank nodes so that each is, again, unique within the new model. As before, though, the name isn’t of interest in and of itself. It’s just a label, a convenience.
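In RDF/XML, a bnode usually shows up as a nested description with no rdf:about. A sketch, using FOAF (the post URL is invented):

<rdf:Description rdf:about="http://weblog.burningbird.net/some-post">
<foaf:maker>
<!-- no rdf:about, so this nested resource is a blank node; the parser
generates a throwaway, model-local identifier for it -->
<foaf:Person>
<foaf:name>Shelley</foaf:name>
</foaf:Person>
</foaf:maker>
</rdf:Description>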

For some reason, blank nodes–or bnodes, as they are commonly called–cause consternation in some folks, who resist their use, saying that bnodes add unnecessary complexity to a model. After all, there’s no reason we can’t use something such as a fragment identifier (i.e., http://weblog.burningbird.net/somepost#someobject, ‘someobject’ being the fragment) to identify the node if it’s dependent on the model.

However, to give a bnode an identifier that allows it to be uniquely identified outside of the context of the model would again change the meaning of the node–it is no longer a bnode, it is Something Else. To return to the analogy of the junk drawers, if I were to marry again someday, and my husband brought his junk drawer to our new home, and I brought mine, our home would then have two junk drawers: identified within this new context as Hubbie’s junk drawer and Shelley’s junk drawer. The latter wouldn’t be the same “Shelley’s junk drawer” I had when I was single; I would, however, treat it exactly the same.

(We could also merge the contents of our junk drawers, and have one combined His-and-Her junk drawer. This is something I just can’t contemplate, though, which is probably why I’ll remain single the rest of my life. My junk is my junk–mixed with another’s, it would no longer be my junk.)

I could merge two models and the program may or may not use the same names for the bnodes in this new, combined model, but it doesn’t matter: how each is used, and each bnode’s contribution to the model as a whole would be the same regardless of its name, because the name doesn’t matter.

Though, in the interests of simplification, I’m not willing to do without my bnodes, I am quite happy to do without other aspects of RDF, such as containers and the Big Ugly: reification.

The Big Ugly

RDF containers are objects that, themselves, are assumed to contain other objects (and in fact, being a container is part of the intrinsic meaning of the object). How the contained objects relate to each other depends on the container: in a Bag, the objects can occur in any order, while in a Seq (sequence), order has meaning.

If you open an RSS 1.0 or 1.1 syndication feed, you’ll see a container. For instance, my feed currently has the following:

<items>
<rdf:Seq>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/27/the-theory-of-relativity-explained/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/27/perceived-barriers/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/26/that-sucking-sound-you-hear/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/26/dont-mind-me-just-carry-on-as-usual/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/26/pleasedo-evil/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/25/the-theory-of-relativity/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/25/the-heart-of-the-civil-rights-movement/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/25/lets-hear-it-for-bad-ideas/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/24/travel-confirmed/"/>
<rdf:li rdf:resource="http://weblog.burningbird.net/2005/10/23/quiet/"/>
</rdf:Seq>
</items>

In turn, each of the items listed within this Seq container would be listed and defined later in the document.

I never use containers in my simple RDF models (outside of my syndication feed), primarily because one can use straight RDF statements and achieve the same results. When the syndication feed is output as triples, the Seq becomes a simple statement whereby the container object is a bnode, with a predicate of http://www.w3.org/1999/02/22-rdf-syntax-ns#type and an object of http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq. Each listed item then becomes a separate statement, with the container bnode as the subject, a predicate indicating the position in the sequence, and an object identified by the permalink for each individual post. The real meaning of the container comes from the type predicate and the Seq value — no different from how one can attach any number of other statements to a resource by giving specific predicate and object values.
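A sketch of that same container written out as plain statements in RDF/XML, the bnode given a model-local label; rdf:li is simply shorthand for the positional predicates rdf:_1, rdf:_2, and so on:

<rdf:Description rdf:nodeID="seq1">
<!-- the type statement is what makes this a Seq -->
<rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
<!-- each membership statement records the item's position -->
<rdf:_1 rdf:resource="http://weblog.burningbird.net/2005/10/27/the-theory-of-relativity-explained/"/>
<rdf:_2 rdf:resource="http://weblog.burningbird.net/2005/10/27/perceived-barriers/"/>
</rdf:Description>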

If there are alternatives to RDF containers, there aren’t for reification, though again, the result is a set of triples. Reification is basically making a statement about a statement. If I make a statement that Microsoft has decided to build Vista on top of Linux, just like Mac OS X, and I record this in RDF, I’ll most likely also want to record who made this statement–it acts as provenance for the statement. In RDF, attaching this provenance to the statement is known as reifying the statement.
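A sketch of the reified statement in RDF/XML; the subject, predicate, and object URIs are invented for the example, and dc:creator carries the provenance:

<rdf:Statement rdf:nodeID="stmt1">
<rdf:subject rdf:resource="http://example.org/products/vista"/>
<rdf:predicate rdf:resource="http://example.org/terms/builtOn"/>
<rdf:object rdf:resource="http://example.org/products/linux"/>
<!-- who made the claim -->
<dc:creator>An unnamed wag</dc:creator>
</rdf:Statement>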

However, reifying a statement does not make an assertion about its truth. In fact, if you look up the word provenance, its definition is:

1. Place of origin; derivation.
2. Proof of authenticity or of past ownership. Used of art works and antiques.

In other words, it can be considered a verification of the source, but not necessarily a verification of the truth. An important concept in RDF, but not necessary for the simple uses of RDF I incorporate into my site.

I also don’t need to incorporate OWL, the ontology language that’s built on top of RDF. OWL stands for Web Ontology Language. Yeah, I know — the acronym doesn’t fit. OWL adds another layer of sophistication on top of RDF. Through the use of OWL, not only can we record statements about resources, we can also record other parameters that allow us to infer additional information…without this information being specifically recorded.

For instance, I can define an RDF class called Developer. I can then create a subclass of Developer called OpenSourceDeveloper, to classify open source developers. Finally, I can create a subclass of this subclass called LampDeveloper, to classify open source developers who mainly work with LAMP technologies. With this structure defined, I can attach additional statements to Developer and be able to infer the same information about open source developers and developers who use LAMP.
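A sketch of that class chain, using the rdfs and owl namespaces declared earlier (the class URIs are invented):

<owl:Class rdf:about="http://example.org/vocab#Developer"/>

<owl:Class rdf:about="http://example.org/vocab#OpenSourceDeveloper">
<!-- every OpenSourceDeveloper is also a Developer -->
<rdfs:subClassOf rdf:resource="http://example.org/vocab#Developer"/>
</owl:Class>

<owl:Class rdf:about="http://example.org/vocab#LampDeveloper">
<!-- and every LampDeveloper is an OpenSourceDeveloper, hence a Developer -->
<rdfs:subClassOf rdf:resource="http://example.org/vocab#OpenSourceDeveloper"/>
</owl:Class>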

With OWL, we can define class structures, constrain membership, define ranges, assign cardinality, establish the logical relationship between properties and so on, all of which allows us to make inferences based on the found statements. It’s very powerful, yet all of it eventually gets recorded as triples. Plain, old triples–the atom of the semantic web.

I haven’t made extensive use of OWL in version 1.0 of my Metaform extension layer, but I plan on incorporating it into version 2.0 of the plugins and extensions. Still, I have managed to capture a great deal of information. The question then becomes: where do I put it?

(Hey! I managed to work 2.0 into the conversation. That should be triggering all sorts of detectors.)

Where to put the pesky data

As you can see, RDF doesn’t have to be complicated, but it can be very sophisticated. With it, we can record specific pieces of information, which can then be queried using a specialized query language (RDQL, and now SPARQL). We can also record information about the structure of the data itself that allows us to make some rather interesting inferences. But there was one small problem with RDF: where do we put it?

RDF is serialized into RDF/XML, which we’ll get into later. For some files, the RDF/XML used to define the resource can be included directly in the file; the XMP section of a photograph is one such place. For others, such as an RSS 1.0 syndication feed, the RDF/XML defines the data as well as the metadata. However, the most common web resources are web pages, and these are created using HTML or XHTML–formats that do not have a simple method for embedding XML within the documents.

To work around the limitation, people have used HTML comments to embed RDF/XML and this can work in a very limited way. As an example, the RDF/XML used to provide autodiscovery of trackback within weblog posts is embedded within an HTML comment. But the limitations inherent in this approach are so significant that it’s not considered a viable option.

The W3C is also working on an approach to extend XHTML to allow RDF/XML. Still others are exploring the concept of using existing XHTML attributes to hold RDF data, which can then be interpreted into RDF/XML using XSLT or some other technology. In this approach, class and other attributes–legitimate to use in XHTML–can be used to hold some of the data.
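As a rough sketch of the attribute idea (the class names are invented; at this point there is no settled convention):

<!-- class does double duty here: a styling hook, and a metadata label
that an XSLT transform could harvest into RDF/XML -->
<div class="photo">
<span class="dc-title">Old mill at dawn</span>, by
<span class="dc-creator">Shelley</span>
</div>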

However, all of these options presuppose that web pages are, by nature, static objects. The trend, though, is for dynamic web pages. Most commerce applications now are database driven, as are most commercial sites. As for weblogs, it’s becoming increasingly rare to find static HTML pages.

These pages are served up directly from the database when a specific URL is accessed. For instance, accessing the URL http://weblog.burningbird.net/2005/10/28/truth-hurts/, in combination with a specific rule defined in my .htaccess file, triggers the web server to convert the URL into one that identifies the post name and archive information, which is then passed to a PHP program. This program uses the passed-in information to locate the post, combining the dynamic data with a template, and resulting in a page that, to all intents and purposes, looks like a typical web page.

It’s then a short step to go from serving up a web page view of the data to serving up an RDF/XML view of the metadata defined for the page. That’s what I do on my site–attaching /rdf/ to the end of a post returns an RDF/XML document if formal metadata is defined for the web page. Unfortunately, this conflicts with WordPress, which determines that /rdf/ should return an RSS 1.0 view of whatever data is available for the object. As such, when I converted my Wordform plugins and extensions to WordPress, I used /rdfxml/ to pull out any metadata defined for the document.

This works nicely, and with the increased use of dynamic web pages, seems to me to be the way of the future. Not only could you provide an XHTML view of the data, you could provide an RDF/XML view of the metadata, and even generate a microformat version of the same metadata for inclusion within the XHTML tags.

Tag, You’re Not It

Speaking of microformats, the hot metadata technologies this last year have been microformats and structured blogging. With microformats, standard XHTML attributes such as class and rel are used to define the metadata and associate it directly with the data, in a manner that makes the metadata visible to the web page reader. Structured blogging follows the same premise, except that it proposes to use existing XHTML structures in more meaningful ways, facilitating the process through the use of plugins and other programmatic aids.
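hCard, one of the better-known microformats, illustrates the approach: agreed-on class names mark the visible content as contact data. A minimal sketch:

<div class="vcard">
<!-- "vcard", "url", and "fn" are class names fixed by the hCard specification -->
<a class="url fn" href="http://weblog.burningbird.net/">Shelley</a>
</div>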

Both approaches are useful, but limited when compared to RDF. One could easily take RDF/XML data and generate microformats and/or structured blogging, but the converse isn’t true. Even within the simple plugins that I’ve created for Wordform and WordPress, there isn’t a microformat or structured blogging approach that could replicate the metadata that I record for each of my pages. And I’ve not even tried to stretch the bubble.

In fact, the best approach would be to record the data in RDF/XML, and then use this to annotate the dynamically generated XHTML with microformats, or organize it for structured blogging. This approach then provides three different views of metadata — enough to satisfy even the greediest metadata consumer.

Where be the data

When I first created my RDF plugins for Wordform, I inserted the data into the same MySQL database that held my weblog entries. By the time I was finished, I had over 14,000 rows of data. However, about this time, I also started experimenting with storing the data in files, one file for each URL. I found that over time, the performance of the file-based system was better, and a little more robust, than that of the database. More importantly, using the file approach means that people who use my WordPress weblog plugins don’t have to modify their database to handle the RDF data.

When one considers that MySQL accesses data in the file system anyway, and that PHP caching usually makes use of files, storing one model per URL in one file makes a lot of sense. And so far, it also takes up less space overall, as all the peripheral data necessary for tables in MySQL adds to the load.

Of course, each file is stored as RDF/XML–the topic I left for last, as no other aspect of RDF generates more heated discussion than the format of RDF/XML.

The Final Answer

Ask any person why they don’t want to work with RDF and you’ll hear comments about the “RDF tax” and the complexity and most of all, that RDF/XML is ugly.

We’ve seen that the so-called RDF tax is less taxing than the complaints would suggest. The requirements for an RDF model are the minimum needed to ensure that data can be safely added to, and taken from, a model without any detrimental impact on the integrity of that model. I can easily grab an entirely new vocabulary and throw it into a plugin, which adds data to the same model other plugins add data to, and know that I don’t have to worry about a collision between elements, or loss of data. More than that, anyone can build a bot and consume this data without having to know which vocabularies I’m using. Later on, this same data can be queried using SPARQL, and it’s only when searching for specific types of data that the vocabularies supported come into play. The data exists for one purpose but can feed an infinite number of purposes in the future.
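A sketch of that mixing, with invented values; two vocabularies sit in one description, the namespaces keeping the elements from colliding:

<rdf:Description rdf:about="http://weblog.burningbird.net/some-post">
<!-- a Dublin Core property and a FOAF property, side by side -->
<dc:title>Some Post Title</dc:title>
<foaf:topic rdf:resource="http://example.org/topics/rdf"/>
</rdf:Description>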

As for the complexity, well, RDF just is: make it as simple or as complex as you need. Just because the specification has some more complex components doesn’t mean you have to use them. Dumb RDF stores–dumb meaning using RDF for immediate needs rather than long-term semantic goodness–are going to become more popular as soon as we realize that, in comparison with other data models such as the relational model, RDF is more self-contained, has a large number of programming APIs to manipulate it, and is lightweight and easy to incorporate into existing applications.

Finally, the issue of the ugly RDF/XML. Oddly enough, my first exposure to XML in any depth was through Mozilla’s early use of RDF/XML. As such, I find little about the structure to be offensive.

Regardless of exposure, though, what does it matter how RDF/XML looks? There may be some unusual people who spend Saturday nights manually writing XML, but for the most part, our XML is generated, and this includes our RDF/XML. As for programmers being concerned about having to understand the syntax in order to generate or parse it: all they have to understand is how the RDF triple works, because it’s the RDF API developers who work the issues of RDF/XML. As such, there are RDF APIs in PHP, Perl, C#, C, C++, Python, Ruby, Java, Lisp, and other languages–APIs that provide functions to generate or parse the XML, so that all a developer needs to worry about is calling a function to create a new statement and add it to the model.

In fact, comparing the technologies for working with straight XML and with RDF/XML, there is no contest–the RDF/XML APIs handle more of the bits at the lower level, and as such, are much easier to use.

As to why we don’t just generate the triples directly: we’ve just spent the last five years convincing business that XML is the interoperable syntax of the future–why should we change now, and say they need another syntax? You can write the most amazing application using any number of tools, in any number of languages, without once having to touch the XML. And, as one of my plugins demonstrates, you can also use XML parsing in addition to RDF processing. Two for the price of one.

So my final answer about the ugliness of RDF/XML is: don’t look at it.

Cheap Eats at the Semantic Web Cafe

It’s a rare event when several seemingly disparate items of interest all come together to form a compelling, coalescent whole. This event happened for me these past few weeks: an experience formed of discussions about digital identity and the laws of same, LID, Technorati Tags, new and old syndication formats, Google’s nofollow, and the divide between tech and user. Especially the divide between tech and user.

I’ve written about digital identity, LID, and nofollow recently, so I want to focus on Technorati Tags in this writing, and then, later, bring in the other technologies’ relationship to same. Besides, for someone who is interested in the lowercase semantic web, how can my ear not be all aquiver when I hear about a new way of ‘adding meaning’ to what can be a meaningless web at times?

Tag, you’re it

If you’re unfamiliar with Technorati Tags: they’re a new implementation of an existing concept previously enabled by other sites such as del.icio.us and Flickr. With Technorati Tags, webloggers can annotate their entries with keyword associations, forming a quasi-classification on the hoof, so to speak.

When you update your weblog and ping Technorati (or some other service that results in Technorati’s web bot consuming your post), the link to your post is added alongside the other recent entries that share the same tag. Not only that, but items from del.icio.us and Flickr are also shown on the page, as this entry labeled Folksonomy demonstrates.

From reading other webloggers, the main excitement behind Technorati Tags is its ability to socialize a classification. David Weinberger wrote the following when the concept was first rolled out:

This is exciting to me not only because it’s useful but because it marks a needed advance in how we get value from tags. Thanks to del.icio.us and then flickr in particular, hundreds of thousands of people have been introduced to bottom-up tagging: Just slap a tag on something and now its value becomes social, not individual.

Cory Doctorow shared in this enthusiasm, writing:

Technorati Tags are keywords that map to category names, keywords, and other cues in blog posts. When you bring up a Technorati Tag for “computers,” you get all relevant blog posts that Technorati knows about, presented on a page with relevant Del.icio.us links and relevant Flickr images. Technorati Tags blend three different Internet services and three services’ worth of tags to tease meaning out of the ether. Brilliant.

Ross Mayfield writes:

But below all that global heady stuff, what tags do really well is aid social discovery.

Simon Waldman jumped in with:

Smart. Smart. Smart. If a little rough round the edges.

And Suw Charman enters the lists with:

All in all, this is an interesting way of using emergent tagsonomies to pull together diverse datastreams in one place. As it happens, I’ve had a number of different conversations recently with friends about such things, and this is a useful first step along the way to creating a single entry point for a variety of sources.

It might seem at first exposure that the enthusiasm for Technorati Tags is a little difficult to understand. After all, we’ve been able to classify our writings in our weblogs for a long time; as for searching on specific topics, we’ve had considerable experience using keyword searches in Google and Yahoo. However, the interest in Technorati Tags seems to be focused on their value as a social grouping rather than as a way of categorization. Waldman referenced the term “self-organizing web” to describe the concept.

For instance, if I were using Technorati Tags in this post, I would add whatever tags I felt represented the content of this writing, such as Folksonomy, Digital_Identity, Tags, and Old_Mills. Of course, when checking Old_Mills, I find that this is fresh meat from a Technorati perspective, as there are no previously annotated weblog listings using this tag. This leads me to believe that perhaps there’s a different tag I want to use. After all, if I’m going to go through the bother of using a Technorati Tag, I would rather use one that puts me into an active social classification than one that doesn’t. So I try Missouri instead, because, after all, the photos of old mills in this writing are of Missouri mills. I see a gratifying number of entries for this tag, providing positive feedback for my choice.
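Mechanically, declaring a tag is nothing more than adding a link to the post whose rel attribute is “tag”, with the last path segment naming the tag; a sketch:

<!-- the rel="tag" attribute is what marks this link as a Technorati Tag -->
<a href="http://technorati.com/tag/Missouri" rel="tag">Missouri</a>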

This process of refining exactly which tags to use demonstrates what we’re told is the true power of Technorati Tags–not that we, as individuals, can categorize our writing any way we want, but that people will seek out existing tags that represent their material, thereby beginning a grassroots taxonomy–or folksonomy, to use what is becoming a popular term.

Returning to my ‘socialized choice’: among the other entries tagged “Missouri” are pointers in del.icio.us to a Metafilter discussion of the recent ruling allowing the KKK into the highway cleanup program, and an interesting story referencing the New Madrid fault–both stories I’ve written about, and which, had I tagged them previously, would also show in the list. This does demonstrate the positive grouping effect of these tags.

Still, there are other entries that look more like ads than entries related to Missouri, including ones for mobile DJs. This demonstrates one of the negative aspects of Technorati Tags: their vulnerability to spammers. Another vulnerability, quickly pointed out, is that material can be inappropriate to the topic, or even offensive when placed next to the other material published in the same category.

Bad tag. Bad.

Rebecca Blood was one of the first to make note of inappropriate material within the content tagged with “MLK” for Martin Luther King Day.

Now, that photo is perfectly appropriate on Flickr as part of an individual’s collection, and as documentation of Sunday’s rally. It’s perfectly appropriate as an illustration for ‘protests’, or even ‘Israel’ and ‘Palestine’, even though it surely will offend some people wherever it appears. But it is not appropriate to illustrate a category tagged ‘MLK’. I personally was offended–these sentiments reflect the polar opposite to those espoused by Dr. King. More to the point, such an illustration is inappropriate–that poster has as much to do with Dr. King as would a picture of a banana peel.

Foe Romeo also noticed this, especially when looking at the Teen tag and finding links to a pornography weblog, and suggests that Technorati has taken on new roles as both editor and moderator with the introduction of Tags. In her comments, Kevin Marks responds to her concerns with:

We have confirmed with Flickr that pictures flagged with offensive are not included in external feeds, so the advice to Rebecca to visit Flickr to warn about the picture was correct; we also removed the german porn spam blog you noticed from our database.

We are still feeling our way here, and adding community moderation is one possibility.

But another commenter, Beerzie Yoink (who links to an interesting website, btw) wrote:

I’m not a technical genius, but quite frankly don’t see how they are going to manage this. Won’t tags used by spammers, pornographers, racists, and other jerks will be hard to separate from legitimate posts? It will be interesting to see how this plays out.

(em. mine)

Within a day or so of Tags being released, questions were asked about separating ‘good’ material from ‘bad’, and about ways of altering Technorati so as to eliminate offensive material. Of course, as Julian Bond points out, there’s a mighty big chasm between here and there when it comes to this type of change:

We seem to be playing out the same old, same old pattern once more that’s been done a million times before in online communities. The Politically Correct Police (PCP) are making lots of noise about how “This isn’t right and SOMETHING SHOULD BE DONE”. The Anti-PCP come along, who love a good flame war, and are finding ways to wind them up. The poor developers get backed into a corner and end up coming up with a series of nasty hacks to sanitise what was once a nicely elegant, simple and minimalist solution. What makes me laugh in all this are the ludicrous solutions put forward by the PCP who clearly have never been anywhere near code.

One of the challenges with self-forming community efforts is that each member brings with him or her a different interpretation of why the group has formed, and what its purpose is. What’s particularly fascinating is that the same people who extol the ease with which the group can form are also the ones who then pick through the members, saying which can stay and which have to go.

While some of those who have questioned the overall goodness of Technorati Tags have focused on the correctness of the content, others have focused on the quality of the overall effort. In other words: can cheap semantics scale?

Get yer semantics here! Red hot semantics! Get ’em while they last

I took the title for this post from Tim Bray’s discussion about Technorati tags, where he wrote:

I’ve spent a lot of time thinking about metadata and have written on the subject; the most important conclusion was: There is no cheap metadata. I haven’t seen anything to make me change my mind.

Having said that, and granting the proposition that The Simplest Thing That Could Possibly Work usually wins, I still have to say that the Technorati Tags all being in a single flat namespace does seem a little, well, brittle.

Liz Lawley also wrote on her concerns about the long-term viability of tags and folksonomies; specifically, whether group consensus leads to valid, or best, results:

On the one hand, as a librarian, I understand completely the value of controlled vocabularies and taxonomies. I don’t want to have to look in six different places for information on a given topic—I want some level of confidence that the things I want are grouped together. On the other hand, I don’t share the optimism that so many of my colleagues in this field seem to have that the collective “wisdom of crowds” will always yield accurate and useful descriptors. Describing things well is hard, and often context-specific.

Bang on the money, except that I would extend this further to read, “…describing things well in such a way as to be meaningful to a great proportion of the populace…” All of us can describe things in ways easily understood by ourselves or our immediate social groups.

Both Liz and Tim reference a post by Clay Shirky in which he writes that though folksonomies (the concept to which Technorati Tags has been linked) may not have the quality of well-designed vocabularies, they’ll still persist and ultimately triumph, primarily because these efforts minimize cost and maximize user participation.

This is something the ‘well-designed metadata’ crowd has never understood — just because it’s better to have well-designed metadata along one axis does not mean that it is better along all axes, and the axis of cost, in particular, will trump any other advantage as it grows larger. And the cost of tagging large systems rigorously is crippling, so fantasies of using controlled metadata in environments like Flickr are really fantasies of users suddenly deciding to become disciples of information architecture.

Any comparison of the advantages of folksonomies vs. other, more rigorous forms of categorization that doesn’t consider the cost to create, maintain, use and enforce the added rigor will miss the actual factors affecting the spread of folksonomies. Where the internet is concerned, betting against ease of use, conceptual simplicity, and maximal user participation, has always been a bad idea.

Yet it’s interesting that those who support the concept behind folksonomies tend not to use it as effectively as they could, as pind’s dot com discovered when looking at the del.icio.us tags used by Liz and Clay. What’s needed, he then writes, is technology that helps him, and the rest of us, do a better job of classification. But then that takes us back to Julian’s statement about taking minimalistic solutions such as Technorati Tags and telling developers to ‘make them better’–make them so that they perform as well as controlled vocabularies, but without requiring any effort, expertise, or discipline on the part of the users of such technologies.

The consensus among all those who wrote on Technorati Tags seems to be that even if folksonomies are not as sophisticated as we would wish, may not scale well, or lack the quality of controlled vocabularies, they’re still based on typically simple solutions, easily applied by the user and controlled by the user, and are therefore better than nothing when it comes to trying to build this semantic web of ours. Or as Clay wrote:

The advantage of folksonomies isn’t that they’re better than controlled vocabularies, it’s that they’re better than nothing, because controlled vocabularies are not extensible to the majority of cases where tagging is needed. Building, maintaining, and enforcing a controlled vocabulary is, relative to folksonomies, enormously expensive, both in the development time, and in the cost to the user, especially the amateur user, in using the system.

I grant that tags (Technorati, Flickr, and other) and the other tools of folksonomies are better than having nothing at all; but is there a possibility that they are also worse than having nothing at all?

Bad habits are hard to break

Recently I, and others, wrote about a new single sign-on digital identity system called Light-Weight Digital Identity (LID). What caught our attention wasn’t necessarily that LID was the best digital identity system proposed–there are a lot of unanswered questions inherent in the current implementation–but that it was the first that actually delivered code into the hands of the user, empowering us to control our own identities.

When I wrote on LID, I was asked in several emails what I thought of the Identity Commons’ effort with XRI (eXtensible Resource Identifiers) and XDI (XRI Data Interchange)–universal identification and data exchange protocol specifications, respectively–particularly since I am such an adherent of RDF, both are dependent on URIs (Uniform Resource Identifiers) to identify objects of interest, and the implementations of the two could be made interchangeable through existing technologies. I answered that I was ‘briefly’ familiar with them, the ‘briefly’ based on the fact that both are still primarily in the specification stage, and there is no implementation I can put my hands on. I could agree that many of the issues about digital identity, and the problems associated with it, have been addressed by the documentation for XRI/XDI–but where’s the goodies?

In other words, XRI/XDI may be the more robust solution, but there’s nothing I can work with (pre-alpha SourceForge projects notwithstanding); whereas LID, though perhaps not as robust, provides something I can not only use immediately, but can use without any form of centralized architecture being in place to support it.

Or as was noted in the mailing list for the Identity Commons efforts, sometimes the … “simplest thing that could possibly work” is very attractive indeed.

While I was being questioned about XRI/XDI, several people emailed Kim Cameron to ask his opinion of it. Kim has become somewhat of a leader in the digital identity community, through his interest in the field and not least because of a set of ‘laws’ he has been defining for digital identity implementations.

Rather than address it directly, Kim released a sixth law of digital identity, which reads as follows:

The Law of Human Integration

The universal identity system MUST define the human user to be a component of the distributed system, integrated through unambiguous human-machine communications mechanisms offering protection against identity attacks.

This law references one of the difficulties inherent in much of the digital identity movement: most of the solutions are focused on organizations protecting themselves from abuse and fraud, rather than on individuals being able to safely and easily use whatever solution is provided. This would seem to support LID. However, Kim also provided a scenario earlier, in his lead-up to the sixth law, that plays more subtly on this issue:

To take a very simple example, suppose you have a browser with an address bar showing you the DNS name of the site you are visiting. And suppose there is a “lock icon” which appears when a “secure connection” is in place. What is to prevent a piece of code running on your machine from overwriting the DNS name and throwing up a fake lock icon – so you are convinced you are visiting one secure site when you are actually visiting another insecure one? And so on.

Of course our usual immediate reaction to this type of problem is to find the most expedient single thing we can do to fix it. In the example just given, the response might be to write a new “safe address bar”. And who am I to criticise this, except that in the end, the proliferation of address bars makes things worse. By inventing one, we have unintentionally made possible the new exploit of getting people to install an address bar with evil intent built right into it. Further, who now can tell which address bar is evil and which one is not?

The point I am trying to make is that the new distributed identity system needs to be something other than an “expedient compensation”, something beyond a tactical riposte in the fight for security. And since the identity system has to work on all platforms, it must be safe on all platforms. The properties that lead to its safety can’t be obscurantist or derive from the fact that the underlying platform or software still has a small adoption.

In other words, the expedient solution may not be the best overall solution.

Whether LID can be seen as an ‘expedient solution’ or not, if LID had implementations in PHP or Python that were simple to install and use, and there were more clarity on the license, it would have fired up enough grassroots support to make it a contender for the de facto digital identity implementation, thus making it that much more difficult for other, perhaps more ‘robust’, solutions to find entry into the community at a later time.

This also applies to the concept of metadata. If people become used to receiving value, even if only limited value, from folksonomies based on very little effort on their part, they’re going to become reluctant when other, more robust solutions are offered, if these require more effort on their part. Especially if these more robust or effective solutions take time to become accessible ‘to the masses’ because their creators are ensconced behind walls built of scholarly interest, with no practical means of entry for the likes of you and me.

Clay expands on his general theme of the suckiness of ontologies as compared to folksonomies: the former forces a prediction of future structure, while the latter allows for dynamic growth; the former is based on a graph, with predefined nodes, each requiring a progenitor, while the latter is based on sets, where the only barrier to entry is a decision to belong.

Ontology is a good way to organize objects, in other words, but it is a terrible way to organize ideas, and in the period between the invention of the printing press and the invention of the symlink, we were forced to optimize for the storage and retrieval of objects, not ideas. Now, though, we can scrap the stupid hack of modeling our worldview on the dictates of shelf space. One day the concept of creativity can be a subset of a larger category, and the next day it can become a slice that cuts across several categories. In hierarchy land, this is a crisis; in tag land, it’s an operation so simple it hardly merits comment.

The move here is from graph theory (arrange everything in a tree graph, so that graph traversal becomes the organizing principle) to set theory (sets have members, and the overlap or non-overlap of those memberships becomes the organizing principle.) This is analogous to the change in how we handle digital data. The file system started out as a tree graph. Then we added symlinks (aliases, shortcuts), which said “You can organize things differently than you store them, and you can provide more than one mode of access.”

Yet, as we’ve already started to see with Technorati Tags, as with other implementations such as del.icio.us tags and Flickr, a low barrier to entry usually doesn’t scale well. Something like the Missouri tag may have few enough entries to make finding the meaningful data easy, but something like Weblog has so many members as to make it difficult to differentiate from the populace as a whole. The same applies to social networks, where people collect so many ‘friends’ as to make being a ‘friend’ of the person inherently meaningless.

So then we start exploring ways and means to make these simple systems and folksonomies more effective. In the case of Google, the developers create algorithms that try to add meaning to the results returned on a search by basing the results on number of links and popularity of a site, with an assumption that popularity equates to authority. In the case of Flickr, social behavior is incorporated into the tags, and members can label photos as ‘offensive’, in which case the photo is excluded from external feeds. However, without having a clear, not to mention shared, idea of what ‘offensive’ means, the results will always be suspect. After all, some would say that photos of a woman’s bare breasts or a man’s penis are offensive; others would say any photo of President Bush is offensive.

All of these solutions and the tricks to make them work better are based on the fact that the rich context of the data is not captured along with the data, and therefore there is only so much good we can wring out of these ‘cheap’ semantic web solutions before they’re wrung dry and spit out like overchewed tobacco cud. Or before they’re gamed by people such as the comment spammers, and then we, the blades of grass within the grassroots efforts, have to add more effort to our input in order to ‘refine’ (read that ‘fix’) the results, as witnessed by the recent release of Google’s nofollow attribute.
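
Since nofollow is nothing more than a value in a link’s rel attribute, the ‘fix’ is easy to show; the URL below is invented for illustration:

<a href="http://example.com/some-page" rel="nofollow">a link that buys no Google juice</a>

A link marked this way is no longer counted toward the target’s popularity, which is to say: more human effort added after the fact, to repair a cheap solution that has been gamed.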

(One could say that Peter Kaminski was prescient when he remarked on January 15th about annotating links in a similar manner to Technorati tags, so that Google could also participate in the new, more meaningful web.)

It is the structure, the future prediction, the careful classification, and the directed graph nature that Clay disdainfully rejects that allow us to capture the rich nuances of data, nuances that will persist longer than the quick, transitory interest that meets efforts such as Technorati Tags. One only has to compare the Technorati Tag for Terrorism with the Weapons of Mass Destruction, Terrorist, and Terrorist Type ontologies, and their associated instance database, to see where the discipline to apply more robust metadata concepts can result in much more controlled, and specific, result sets. And since the data is defined in a universally understood model, RDF, you don’t even have to use the ontology creator’s own search tool (try who, what, where for the three values, in that order)–you could use my much more crude, but quickly hacked together, Query-o-Matic, based on existing technologies.
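
The difference shows up in the markup itself. A Technorati tag attaches a single flat token to a post, using Technorati’s rel="tag" convention, while an ontology-backed statement carries its context along with the data. In the sketch below, the first line is the real tagging convention; the ex: vocabulary and URIs are invented purely for illustration:

<a href="http://technorati.com/tag/terrorism" rel="tag">terrorism</a>

<rdf:Description rdf:about="http://example.com/reports/42">
<ex:incidentType rdf:resource="http://example.com/terms#Bombing" />
<ex:location>Madrid</ex:location>
</rdf:Description>

The tag can only ever answer ‘which posts mentioned terrorism’; the triples can answer who, what, and where.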

Louis Rosenfeld discusses the strength of searches among controlled data sources as compared to that of folksonomies:

Lately, you can’t surf information architecture blogs for five minutes without stumbling on a discussion of folksonomies (there; it happened again!). As sites like Flickr and del.icio.us successfully utilize informal tags developed by communities of users, it’s easy to say that the social networkers have figured out what the librarians haven’t: a way to make metadata work in widely distributed and heretofore disconnected content collections.

Easy, but wrong: folksonomies are clearly compelling, supporting a serendipitous form of browsing that can be quite useful. But they don’t support searching and other types of browsing nearly as well as tags from controlled vocabularies applied by professionals. Folksonomies aren’t likely to organically arrive at preferred terms for concepts, or even evolve synonymous clusters. They’re highly unlikely to develop beyond flat lists and accrue the broader and narrower term relationships that we see in thesauri.

Returning to Kim Cameron’s sixth law, which states there must be an unambiguous and non-corruptible interface between the user and the technology, we can apply the same thinking to metadata: the costs to support controlled vocabularies/ontologies and uncontrolled vocabularies/folksonomies are the same. At some point a human has to intervene with the technology to refine and validate the result. With ontologies, the intervention occurs before the data is captured; with folksonomies, the intervention occurs with each search.

I put my money on the ‘refine and validate just once’ solution.

Isgood but…is good?

Though Rosenfeld and most others I’ve listed here support folksonomy efforts (some with caveats, others unreservedly) as just one of a variety of technologies that help people find what they need, I tend to be of the camp that believes focusing on easy solutions will make it more difficult to gain acceptance for ‘better’ solutions that may require a little more effort. This puts me in the exact **opposite camp from Clay Shirky.

Clay believes that ultimately ontologies will fall to folksonomies, as the latter gain rapid acceptance because of their low cost and ease of use; I believe that ultimately interest in folksonomies will go the way of most memes: they’re fun to play with, but eventually we want something that won’t splinter, crack, and stumble the very first day it’s released.

What we don’t need are more cheap solutions, and ultimately, I find that Technorati Tags are a ‘cheap’ solution, though a compelling one, and useful if for no other reason than generating conversation. And I don’t want to denigrate Technorati’s efforts with this, because I feel that, in the end, Technorati is going to play a major role in our semantic efforts. Still, no matter how many tricks you play with something like tags, you can only pull out as much ‘meaning’ as you put into them.

What we need, instead, is a way of making richer solutions more accessible to people, and in that I do agree with Clay: lower the barrier to participation. On the email list for the Identity Commons effort, the members talked about how the URL that serves as an identifier within LID is also a URI, which forms the basis for XRIs, and how the group should look at ways of achieving synergy with this new effort. Rather than being disdainful, they sought to turn LID into an opportunity.

This type of attitude is what we need more of–how can we make the richer, more robust solutions available to folks like you and me? In some ways, FOAF, the ontology used to identify ourselves and whom we know, is an example of this, because it’s very accessible to ‘regular folk’; yet it’s also based on a robust and highly interchangeable data model, which means it can easily be merged with other data that shares the same identity.
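
As a sketch of just how accessible FOAF is (the names and mailbox here are invented), a complete description of a person and someone they know fits in a few lines of RDF/XML:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person>
<foaf:name>Jane Doe</foaf:name>
<foaf:mbox rdf:resource="mailto:jane@example.com" />
<foaf:knows>
<foaf:Person>
<foaf:name>John Doe</foaf:name>
</foaf:Person>
</foaf:knows>
</foaf:Person>
</rdf:RDF>

Simple enough to write by hand; yet because foaf:mbox uniquely identifies a person, two FOAF files sharing a mailbox can be merged automatically, which is the interchangeability at work.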

One hell of a ride

Clay states that whether we’re supportive of folksonomies or not, they’re going to happen–we are in a kayak floating along a river of change:

It doesn’t matter whether we “accept” folksonomies, because we’re not going to be given that choice. The mass amateurization of publishing means the mass amateurization of cataloging is a forced move. I think Liz’s examination of the ways that folksonomies are inferior to other cataloging methods is vital, not because we’ll get to choose whether folksonomies spread, but because we might be able to affect how they spread, by identifying ways of improving them as we go.

To put this metaphorically, we are not driving a car, with gas, brakes, reverse and a lot of choice as to route. We are steering a kayak, pushed rapidly and monotonically down a route determined by the environment. We have a (very small) degree of control over our course in this particular stretch of river, and that control does not extend to being able to reverse, stop, or even significantly alter the direction we’re moving in.

I consider the difference between the ‘web’ and the ‘semantic web’ to be one based on ‘meaning’ alone, not on toys and attachments. If my opinion holds true, is the transformation of the web into the semantic web equivalent to a ride in a kayak? Pulled along by forces, with little control over direction and speed?

I will concede to Clay the challenging, swift nature of the transport, but argue that only a fool would put themselves into a narrow sliver of wood, hide, or plastic on a raging river without training, trusting to fate to ensure they don’t end up smashed, bloodied, and drowned. And it’s equally foolish to believe that we can, somehow, with the right use of technology, exponentially derive complex meaning out of what is, essentially, flat data.

I agree with Clay that the semantic web is going to be built ‘by the people’, but it won’t be built on chaos. In other words, 100 monkeys typing long enough will NOT write Shakespeare; nor will 100 million people randomly forming associations create the semantic web.

* No, ‘enclosured’ is not a real word, but it should be, because it describes the effect better than ‘enclosed’ does.

** Of ontologies, Clay writes …don’t get me started, the suckiness of ontology is going to be my ETech talk this year…, which is probably one reason my own proposal, diametrically opposite to Clay’s talk, was not accepted. Well, that and I mentioned the ‘p’ word.

Archived, with comments, at the Wayback Machine

Walking in Simon’s Shoes

The editor for my book, Practical RDF, was Simon St. Laurent, well known and respected in XML circles. Some might think it strange that a person who isn’t necessarily fond of RDF, and especially RDF/XML, would edit a book devoted to both, but this is the way of the book publishing world.

Simon was the best person I’ve worked with on a book, and I’ve worked with some good people. More importantly, though, Simon wasn’t an RDF fanatic, pushing me into making less of the challenges associated with RDF, or more of its strengths. Neither of us wanted a rah-rah book, and Practical RDF is anything but.

I’ve thought back on many of the discussions about RDF/XML that have happened here and there this last year. Simon has usually been on the side less than enthusiastic toward RDF/XML, along with a few other people whom I respect, and a few whom I don’t. The blanket response from me and others has usually been along the lines of, “RDF/XML is generated and consumed by automated processes, and therefore people don’t have to look at the Big Ugly.” This is usually accompanied by a great deal of frustration on our part, because if people would just move beyond the ‘ugliness’ of RDF/XML, we could move on to creating good stuff.

(I say ‘good stuff’ rather than Semantic Web because the reactions to this term are best addressed elsewhere.)

However, the situation isn’t really that simple, or that easily dismissed. If pro-RDF and RDF/XML folks like myself are ever going to see this specification gain some traction, we need to walk a mile in the opponents’ shoes, and acknowledge and address their concerns specifically. Since I know Simon the best, I’ve borrowed his shoes to take a closer look at RDF/XML from his perspective.

Simon has, as far as I know, three areas of pushback against RDF: he doesn’t care for the current namespace implementation; he’s not overly fond of the confusion about URIs; and he doesn’t like the syntax for RDF/XML, and believes other approaches, such as N3, are more appropriate. I’ll leave URIs for another essay I’m working on, and leave namespaces for other people to defend. I wanted to focus on concerns associated directly with RDF/XML, at least from what I think is Simon’s perspective (because, after all, I’m only borrowing his shoes, not his mind).

The biggest concern I see with RDF/XML from an XML perspective is its flexibility. One can use two different XML syntaxes and still arrive at the same RDF model, and this must just play havoc with the souls of XML folks.

As an example of this flexibility, most implementations of RDF/XML today are based on RSS 1.0, the RDF/XML version of the popular syndication format. You can see an example of this in the RSS 1.0 file for this weblog.

Now, the XML for RSS 1.0 isn’t all that different from the XML for that other popular RSS format, RSS 2.0 from Userland — seen here. Both are valid XML; both have elements called channel and item, and title, and description, and so on; and both assume there is one channel, but many items contained in that channel. From an RSS perspective, it’s hard to see why anyone would have so much disagreement with using RDF/XML, because it really doesn’t add much to the overhead for the syndication feed. In fact, I wrote in the past about using the same XML processing for RSS 1.0 as you would for RSS 2.0.
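
For comparison, a typical RSS 2.0 rendering of the same kind of entry might look like the following; this is a sketch built from the standard RSS 2.0 elements, not an excerpt from an actual feed:

<item>
<title>PostCon</title>
<link>http://rdf.burningbird.net/archives/001856.htm</link>
<description></description>
<pubDate>Thu, 25 Sep 2003 16:28:55 -0500</pubDate>
</item>

Same channel-and-item shape, and nearly the same element names; the differences only begin to surface once RDF enters the picture.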

However, the compatibility between the RDF/XML and XML versions of RSS is much thinner than my previous essay might lead one to believe. In fact, looking at RSS as a demonstration of the “XMLness” of RDF/XML causes you to miss the bigger picture, which is that RSS is basically a very simple, hierarchical syndication format that’s quite natural for XML; its very nature tends to bring out the most XML-like behavior within RDF/XML, creating a great deal of compatibility between the two formats. Compatibility that can be busted in the blink of an eye.

To demonstrate, I’ve simplified the index.rdf file down to one element, and defined an explicit namespace qualifier for the RSS items rather than use the default namespace. Doing this, the XML for item would look as follows:

<rss:item rdf:about="http://rdf.burningbird.net/archives/001856.htm">
<rss:title>PostCon</rss:title>
<rss:description></rss:description>
<rss:link>http://rdf.burningbird.net/archives/001856.htm</rss:link>
<dc:subject>From the Book</dc:subject>
<dc:creator>shelleyp</dc:creator>
<dc:date>2003-09-25T16:28:55-05:00</dc:date>
</rss:item>

Annotating all of the elements with the rss namespace qualifier does add to the challenge for RSS parsers that use simple pattern matching, because ‘title’ must now be accessed as ‘rss:title’, but the change still validates as proper RSS using the popular RSS Validator, as you can see with an example.

Next, we’re going to simplify the RDF/XML for the item element by using a valid RDF/XML shortcut technique that allows us to collapse simple, non-repeating predicate elements, such as title and link, into attributes of the resource they’re describing. This change is reflected in the following excerpt:

<rss:item rdf:about="http://rdf.burningbird.net/archives/001856.htm"
rss:title="PostCon"
rss:link="http://rdf.burningbird.net/archives/001856.htm"
dc:subject="From the Book"
dc:creator="shelleyp"
dc:date="2003-09-25T16:28:55-05:00" />

Regardless of which syntax is used, the longer, more widely used approach or the shortcut, the resulting N-Triples are the same, and so is the RDF model. However, from an XML perspective, we’re looking at a major disconnect between the two versions of the syntax. In fact, if I were to modify my index.rdf feed to use the more abbreviated format, it wouldn’t validate with the same RSS Validator I used earlier. It would still be proper RSS 1.0, proper RDF/XML, and valid XML — but it sings a discordant note with the existing understanding of RSS, whether RSS 1.0 or RSS 2.0.
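
Here are a few of the N-Triples that both excerpts produce, assuming the standard RSS 1.0 and Dublin Core namespace URIs:

<http://rdf.burningbird.net/archives/001856.htm> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/rss/1.0/item> .
<http://rdf.burningbird.net/archives/001856.htm> <http://purl.org/rss/1.0/title> "PostCon" .
<http://rdf.burningbird.net/archives/001856.htm> <http://purl.org/dc/elements/1.1/creator> "shelleyp" .

Element or attribute, the model doesn’t budge; only the XML does.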

More complex RDF/XML vocabularies that are less hierarchical in nature stray further and further away from more ‘traditional’ XML, even though, technically, they’re all valid XML. In addition, since there are variations of shortcuts that are proper RDF/XML syntax, one can’t even depend on the same XML syntax being used to generate the same set of triples from one RDF/XML document to the next. And this ‘flexibility’ must burn, veritably burn, within the stomach of XML adherents, conjuring up memories of the same looseness of syntax that existed with HTML, and that led to XML in the first place.
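
For instance, a third rendering of the same item is equally proper RDF/XML: the rdf:Description form, which simply spells out the type triple that the typed-node shorthand above implies:

<rdf:Description rdf:about="http://rdf.burningbird.net/archives/001856.htm">
<rdf:type rdf:resource="http://purl.org/rss/1.0/item" />
<rss:title>PostCon</rss:title>
<dc:creator>shelleyp</dc:creator>
</rdf:Description>

Three serializations, one model; an XML tool, though, sees three unrelated documents.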

It is primarily this that leads many RDF proponents, as well as RDF/XML opponents, to prefer N3 notation. There is one and only one set of N3 triples for a specific model, and one and only one RDF model generating the same set of N3 triples.
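
To illustrate, the item from the earlier excerpts has a single, straightforward N3 rendering, with prefixes for the standard RSS 1.0 and Dublin Core namespaces assumed:

@prefix rss: <http://purl.org/rss/1.0/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://rdf.burningbird.net/archives/001856.htm> a rss:item ;
rss:title "PostCon" ;
rss:link "http://rdf.burningbird.net/archives/001856.htm" ;
dc:subject "From the Book" ;
dc:creator "shelleyp" ;
dc:date "2003-09-25T16:28:55-05:00" .

Whichever RDF/XML variation you start from, this is the N3 you end with.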

Aye, I’ve walked a mile in Simon’s shoes and I’ve found that they’ve pinched, sadly pinched indeed. However, I’ve also gained a much better understanding of why the earnest and blithe referral to automated generation and consumption of RDF/XML, when faced with criticism of the syntax, isn’t necessarily going to appease XML developers, now or in the future. The very flexibility of the syntax must be anathema to XML purists.

Of course, there are arguments in favor of RDF/XML that arise from the very nature of the flexibility of the syntax. As Edd Dumbill wrote relatively recently, RDF is failure friendly, in addition to being extremely easy to extend, with its built-in understanding of namespace interoperability. And, as a data person, not a syntax person, I also find the constructs of RDF/XML to be far more elegant and modular, more cleanly differentiated, than the ‘forever and a limb’ tree structure of XML.

But I’m not doing the cause of RDF and RDF/XML any good by not acknowledging how easy it is to manipulate the XML in an RDF/XML document, legitimately, and leave it virtually incompatible with XML processes working with the same data.

I still prefer RDF/XML over N3, and will still use it for all my applications, but it’s time for different arguments in this particular debate, methinks.