Why a processor rather than a transform

I have spent a little time looking at other approaches to mapping RDF to a web document created as XHTML; approaches such as GRDDL, which uses XSLT to transform structured (X)HTML into RDF/XML, with the document itself providing a link to the transform.

(RAP just released a GRDDL parser, though it’s based on PHP 5.x, which means don’t expect it out on the streets too soon.)

This works, if all you’re doing is pulling out data that can be mapped to valid XHTML structure elements. But it doesn’t work if you want to capture meaning that can’t be constructed from headers and paragraphs, or DIV blocks with specific class names. Still, it meets the criterion of minimal human intervention, which finds favor among Semantic Web developers. If the user is providing a page anyway, might as well glean the meaning of it.

However, as we’ve found with Google, which does basically the same thing except it performs its magic after the material is accessed, automated mechanisms only uncover part of the story. This is why I get people searching on the oddest things coming to my site – accidental groupings of words pulled from my pages just happen to meet a word combination on which they’re searching.

In other words, hoping to discover semantics accidentally only goes so far.

One reason I use a poetry finder as a test of any new semantic web technology or approach is that any solution that actually helps people find the right subset of poetry won’t do so through accidental semantics.

Let’s look at two popular RDF vocabularies: RSS and FOAF. RSS is an accidentally semantic application. The same data that drives an application such as a weblogging tool can be used to create RSS without much intervention on the part of the user. I could also use the same mechanism that drives RSS to drive out something like my Post Content vocabulary, PostCon.
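To see just how accidental this is, here’s a sketch of a single item in RDF-based RSS 1.0, the kind of thing a weblogging tool can emit from data it already holds (the URL and text are made up for illustration):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <!-- one item, generated from the same post data the tool already has:
       title, permalink, excerpt - no extra work from the author -->
  <item rdf:about="http://example.com/2004/03/some-post">
    <title>Some Post</title>
    <link>http://example.com/2004/03/some-post</link>
    <description>An excerpt pulled straight from the entry.</description>
  </item>
</rdf:RDF>
```

Every element here already exists in the weblog’s database; the semantics come along for free.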

(Though one bit of information I capture in PostCon, the fact that a page has been pulled and why it’s been pulled, cannot be captured in RSS; RSS implies a specific state for a document: “I exist.”)

FOAF, on the other hand, requires that the user sit down and identify relationships. There really is little or no accidental semantics for this vocabulary, unless you follow some people’s idea that FOAF and blogrolls are one and the same (a hint: they’re not).
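Compare that with a minimal FOAF fragment (the names here are invented): nothing in a weblog’s database supplies the foaf:knows assertion, so a person has to state it deliberately.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Jane Example</foaf:name>
    <!-- the part no tool can guess: the relationship itself -->
    <foaf:knows>
      <foaf:Person>
        <foaf:name>John Example</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
```

The markup is trivial; the human judgment behind “knows” is the work.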

So what drives out the need for FOAF? Well, much of it is driven out by people attracted to bright, new, shiny objects. Still, one can see how something like FOAF could be used to drive out systems of social networks, or even *shudder* webs of trust, so there is an added benefit to doing the work for FOAF beyond it being cool and fun.

The key to attracting human intervention, beyond getting someone influential and well known to push it, is to make it easy for the end user (the non-XML, non-RDF end user) to provide the necessary data, and then to provide good reasons why they would do so. The problem with this approach, though, is that many Semantic Web technologists don’t want to work on approaches that require the human as an initial part of the equation. Rightfully so: a solution that requires effort from people, and that won’t have a payback until critical mass is reached, is not something that’s easy to sell.

Still, I think FOAF has shown a direction to follow – keep it simple, uncomplicated, and perhaps enough people will buy in at first to reach the critical mass needed to bring in others. The question, though, is whether it can attract the interest of the geeks, because it’s not based on XSLT.

With GRDDL, one can attach a class name to a DIV or SPAN element, and then use XSLT to generate matching RDF/XML. This removes some of the accidental discovery by explicitly stating something of interest with that DIV element. More, this doesn’t require that the data be kept separate from the document – it would be embedded directly in the document.
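A sketch of what that looks like in the host document, per the GRDDL approach: the head links to an XSLT transform (the stylesheet name here is hypothetical), and the class name on the DIV is the explicit hook the transform keys on.

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>A Post</title>
    <!-- a GRDDL-aware agent fetches this transform and applies it
         to the page to produce RDF/XML -->
    <link rel="transformation" href="post2rdf.xsl" />
  </head>
  <body>
    <!-- "review" is no longer an accident of layout; it's a
         deliberate statement the XSLT can match on -->
    <div class="review">I liked this book quite a bit.</div>
  </body>
</html>
```

The meaning is now explicit, but someone still had to decide that this DIV is a review and mark it as such.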

However, rather than making this less complicated, the whole thing strikes me as making the discovery of information much more complicated than it need be.

Now, not only would the end user have to write the text itself, they would also have to go through that text and mark specific classes of information about each element within the XHTML. This then exposes the end user to the XHTML, unless one starts getting into a fairly complicated user interface.

Still, this is another approach that could be interesting, especially when one considers the use of Markdown and other HTML transforms used in weblogging tools. How to do something like this and have it map to multiple data models could be challenging.

Don’t mind me, still thinking out loud.
