Maxwell’s Silver Hammer: RDFa and HTML5’s Microdata

Being a Beatles fan, I must admit to being intrigued about the new Beatles box set that will be available in September. I have several Beatles albums, but not all. None of the CDs I own have been re-mastered or re-mixed, including one of my favorite songs, from Abbey Road: Maxwell’s Silver Hammer:

Joan was quizzical; Studied pataphysical
Science in the home.
Late nights all alone with a test tube.
Oh, oh, oh, oh.

Maxwell Edison, majoring in medicine,
Calls her on the phone.
"Can I take you out to the pictures,
Joa, oa, oa, oan?"

But as she's getting ready to go,
A knock comes on the door.

Bang! Bang! Maxwell's silver hammer
Came down upon her head.
Bang! Bang! Maxwell's silver hammer
Made sure that she was dead.

I love the chorus, Bang! Bang! Maxwell’s silver hammer came down upon her head…

Speaking of Bang! Bang! Jeni Tennison returned from vacation, surveyed the ongoing, and seemingly unending, discussion on RDFa as compared to HTML5’s Microdata, and wrote HTML5/RDFa Arguments. It’s a well-written look at some of the issues, primarily from the viewpoint of a heavy RDFa user, working to understand the perspective of an HTML5 advocate.

Jeni lists all of the pushback against RDFa that I’m aware of, including the reluctance to use namespacing, because of copy and paste issues, as well as the use of prefixes, such as FOAF, rather than just spelling out the FOAF URI. Jeni also mentions the issue of namespaces being handled differently in the DOM (Document Object Model) when the document is served as HTML, rather than XHTML.
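To make the prefix complaint concrete, here is a minimal Python sketch of what a consumer has to do with a prefixed name such as foaf:name. The FOAF namespace URI is the published one; the PREFIXES table and the expand_curie function are hypothetical names made up for this illustration, not part of any RDFa library.

```python
# Hypothetical prefix table; only the FOAF namespace URI itself is a published value.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}


def expand_curie(curie, prefixes):
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        # This is the copy-and-paste problem: the snippet arrived
        # without the prefix binding it depends on.
        raise KeyError("no binding for prefix in %r" % curie)
    return prefixes[prefix] + reference


print(expand_curie("foaf:name", PREFIXES))
```

Calling expand_curie with a prefix that has no binding, which is what happens when markup is pasted without its namespace declaration, raises KeyError rather than guessing.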

The whole namespace issue goes beyond just RDFa, and touches on the broader issue of distributed extensibility, which will, in my opinion, probably push back the Last Call date for HTML5. It may seem like accessibility issues are the real kicker, but that’s primarily because no one wants to look at the elephant in the corner that is extensibility. Right now, Microsoft is tasked to provide a proposal for this issue—yes, you read that right, Microsoft. When that happens, an interesting discussion will ensue. And unlike other issues, whatever happens will take more than a few hours to integrate into HTML5.

I digress, though. At the end of her writing, Jeni summarizes her opinion of the RDFa/namespace/HTML5/Microdata situation with the following:

Really I’m just trying to draw attention to the fact that the HTML5 community has very reasonable concerns about things much more fundamental than using prefix bindings. After redrafting this concluding section many times, the things that I want to say are:

  • so wouldn’t things be better if we put as much effort into understanding each other as persuading each other (hah, what an idealist!)
  • so we will make more progress in discussions if we focus on the underlying arguments
  • so we need to talk in a balanced way about the advantages and disadvantages of RDF

or, in a more realistic frame of mind:
  • so it’s just not going to happen for HTML5
  • so why not just stop arguing and use the spare time and energy doing?
  • so why not demonstrate RDF’s power in real-world applications?

My own opinion is that I don’t care that RDFa is not integrated into HTML5. In fact, I don’t think RDFa belongs in HTML5. I think a separate document detailing how to integrate RDFa into HTML5, as happened with XHTML, is the better approach.

Having said that, I do not believe that Microdata belongs in the HTML5 document, either. The HTML5 document is already problematical, bloated, and overly complex. It encompasses too much, a fault of the charter, as much as anything else. Removing the entire Microdata section would help, as well as several other changes, but we’ll focus on the Microdata section for the moment.

The problem with the Microdata section is that it is a competing semantic web approach to RDFa. Unlike competition in the marketplace, competition in standards will actually slow down adoption of the standards, as people take a sit-back-and-see-what-happens approach. Now, when we’re finally seeing RDFa incorporated into Google, into a large CMS like Drupal 7, and other uses, now is not the time to send a message to people that “Oops, the W3C really doesn’t know what the fuck it wants. Better wait until it gets its act together.” Because that is the message being sent.

“RDFa and Microdata” is not the same as “RDFa and Microformats”. RDFa, or I should say, RDF, has co-existed peacefully with microformats for years because the two are really complementary, not competitive, specifications. Both can be used at a site. Because Microformat development is centralized, it will never have the extensibility that RDF/RDFa provides, and the number of vocabularies will always, by necessity, be limited. Microformats, on the other hand, are easier to use than RDFa, though parsing Microformats is another thing. They both have their strengths and weaknesses. Regardless, there’s no harm to using both, and no confusion, either. Microformats are managed by one organization, RDFa by the W3C.

Microdata, though, is meant to be used in place of RDFa. But Microdata is not implemented in any production capable tool, has not been thoroughly checked out or tested, has not had any real-world implementation that I know of, has no support from any browser or vendor, and isn’t even particularly liked by the HTML WG membership, including the author. It provides only a subset of the functionality that RDFa provides, and necessitates the introduction of several predefined vocabularies, all of which could, and most likely will, end up out of sync with the organizations responsible for the extra-HTML5 vocabulary specification. And let’s not forget that Microdata makes use of the reversed DNS identifier that sprang up, like a plague of locusts, in HTML5, based on the seeming assumption that people will find the following:

com.example.xn--74h

easier to understand and use than the following:

http://example.com/xn--74h

Which, heaven knows, is not something any of us are familiar with these last 15-20 years.
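For what it’s worth, here is a rough Python sketch of the translation a reader has to do in their head to get from the first form to the second. The function name reverse_dns_to_url and its domain_labels parameter are hypothetical, invented for this illustration. The parameter exists because the identifier itself gives no hint where the domain stops and the rest begins, so the caller has to supply that as an assumption.

```python
def reverse_dns_to_url(identifier, domain_labels=2):
    """Turn a reversed DNS identifier into an http URL.

    domain_labels is an assumption supplied by the caller: the number of
    leading labels that make up the (reversed) domain name.
    """
    labels = identifier.split(".")
    # Flip the leading labels back into normal domain order.
    host = ".".join(reversed(labels[:domain_labels]))
    # Whatever is left over becomes the path.
    path = "/".join(labels[domain_labels:])
    return "http://" + host + "/" + path


print(reverse_dns_to_url("com.example.xn--74h"))
```

With the default of two domain labels, the identifier above maps to the URL above. An identifier under a three-label domain, such as uk.co.example.thing, only comes out right if the caller already knows to pass domain_labels=3.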

RDFa and HTML5/Microdata, otherwise known as Issue 76 in the HTML 5 Tracker database. I understand where Jeni is coming from when she writes about finding a common ground. Finding common ground, though, presupposes that all participants come to the party on equal footing. That both sides will need to listen, to compromise, to give a little, to get a little. This doesn’t exist with the HTML5 effort.

Where the RDFa in XHTML specification was a group effort, Microdata is the product of one person’s imagination. One single person. However, that one single person has complete authorship control over the HTML 5 document, and so what he wants is what gets added: not what reflects common usage, not what reflects the W3C guidelines, and certainly not what exists in the world, today.

While this uneven footing exists, I can’t see how we can find common ground. So then we look at Jeni’s next set of suggestions, which basically boil down to: because of the HTML WG charter, nothing is going to happen with HTML5, so perhaps we should stop beating our heads against the wall, and focus, instead, on just using RDFa, and to hell with HTML5 and microdata.

Bang! Bang!

I am very close to this. I had started my book on the issues I have with HTML5, and how I would change the specification, but after a while, a person gets tired of being shut out or shut down. I’m less interested in continuing to “bang my head against the wall”, as Jeni so eloquently put it.

But then I get an email this week, addressed to several folks, asking about the introduction of Microdata: so what does the W3C recommend, then? What should people use? Where should they focus their time?

Confusion. Confusion because the HTML5 specification is being drafted specifically to counter several initiatives that the W3C has been nurturing over the last decade: Microdata over RDF/RDFa; HTML over XHTML; Reverse DNS identifiers over namespaces, and URIs; the elimination of non-visual cues, not only for metadata, but also for the visually challenged. And respect. There is no respect for the W3C among many in the HTML Working Group. And I know I lose more respect for the organization the closer we get to HTML5 Last Call.

In fact, HTML Working Group is a bit of a misnomer. We don’t have HTML anymore, we have a Web OS.

We don’t have a simple HTML document, we have a document that contains the DOM, garbage collection, the Canvas object and a 2D API, a definition for web browser objects, interactive elements, drag and drop, cross-document communication, channel messaging, Microdata, several pre-defined vocabularies, probably more JavaScript than the ECMAScript standard, and before they were split off, client-side SQL, web worker threads, and storage. I’m sure there’s a partridge in a pear tree somewhere in there, but I still haven’t made it completely through the document. It’s probably in Section 10. I know there’s talk of extending the document to include a 3D API, and who knows what else.

There’s a lot of stuff in HTML5. What isn’t in the HTML5 document is a clean, straightforward description of the HTML or XHTML syntax, and a clearly defined path for people to move to HTML5 from other specifications, as well as a way of being able to cleanly extend the specification—something that has been the cornerstone of both HTML and XHTML in the past. There’s no room for the syntax, in HTML5. It got shoved down by Microdata and the 2D API. There’s no room for the past, the old concepts of deprecated and obsolete have been replaced by such clear terms as “Conforming but obsolete”. And there’s certainly no room for future extensibility. After all, there’s always HTML6, and HTML7, …, HTMLN—all based on the same open, encompassing attitude that has guided HTML5 to where it is today.

If we don’t like what we see, we do have options. We can create our own HTML5 documents, and submit “spec text” for a vote. But what if it’s the whole document that needs work? That many of the pieces are good, but don’t belong in the parent document, or even in the HTML WG?

The DOM should be split out into its own section and should take all of the DOM/interactive and browser object stuff with it. The document should be re-focused on HTML, without this mash-up of HTML syntax, scripting references, and API calls that exists now. The XHTML section should be fleshed out and separated out into its own section, too, if for no other reason than to reassure people that no, XHTML is not going away. We should also be reminded that XHTML is not just for browsers; in fact, the eBook industry is dependent on XHTML. And it doesn’t need Canvas, browser objects, or drag and drop.

Canvas should also be split out, to a completely separate group whose interest is graphics, not markup. As for Microdata, at this point, I don’t care if Microdata is continued or not, but it has no place in HTML5. If it’s good, split it out, and let it prove itself against RDFa, directly.

The document needs cleaning up. There are dangling and orphaned references to objects from Web Workers and Storage still littering the specification. It hops around between HTML syntax and API call, with nothing providing any clarity as to the rhyme or reason for such jumping about. Sure there’s a lot of good stuff in the document, but it needs organization, clean up, and a good healthy dose of fresh air, and even a fresher perspective.

Accessibility shouldn’t be added begrudgingly, woodenly, resentfully. It should be integrated into the HTML, not just pasted on in order to quiet folks because LC is coming up.

The concepts of deprecated and obsolete should be returned, to ensure a sense of continuity with HTML 4. And no, these did not originate with HTML. In fact, the use of deprecated and obsolete has been fairly common with many different technologies. I can guarantee nothing but the HTML5 document has a term like “conforming but obsolete”. I know, I searched high and low in Google for it.

And we need extensibility, and no, I don’t mean Microdata and reverse DNS identifiers. If extensibility was part of the system, folks who want to use RDFa could use RDFa, and not have to beg, hat in hand, to be allowed to sit at the HTML 5 table. This endless debate wouldn’t be happening, and everyone could win. Extensibility is good that way. Extensibility has brought us RDFa, SVG, and MathML in past specifications, and will continue to bring whatever the future may bring.

whatever the future may bring…

Finding common ground? Walk a mile in each other’s moccasins? Meet mano a mano? Provide alternative specification text?

Bang! Bang!

Jeni’s a pretty smart lady.

Arbitrary Vocabularies and Other Crufty Stuff

I went dumpster diving into the microformats IRC channel and found the following:

singpolyma – Hixie: that’s the whole point… if you don’t have a defined vocabulary, you end up with something useless like RDF or XML, etc
@tantek – exactly
Hixie – folks who have driven the design of XML and RDF had “write a generic parser” as their #1 priority
@tantek – The key piece of wisdom here is that defined vocabularies are actually where you get *user* value in the real world of data generated/created by humans, and consumed eventually by humans.
Hixie – i’m not talking about this being a #1 priority though — in the case of the guy i mentioned earlier, it was like #4 or #5
Hixie – but it was still a reason he was displeased with microformats
@tantek – Hixie – ironically, people have written more than one generic parser for microformats, despite that not being a priority in the design
Hixie – url?
@tantek – mofo, optimus
@tantek – http://microformats.org/wiki/parsers
@tantek – not exactly hard to find
@tantek – it’s ok that writing a generic parser is hard, because not many people have to write one
Hixie – optimus requires updating every time you want to use a new vocabulary, though, right
@tantek – OTOH it is NOT ok to make writing / marking up content hard, because nearly far more people (perhaps 100k x more) have to write / mark up content.
Hixie – yes, writing content should be easy, that’s clear
Hixie – ideally it should be even easier than it is with microformats 🙂
singpolyma – Of course you have to update every time there’s a new vocabulary… microformats are *exclusively* vocabularies
Hixie – there seems to be a lot of demand for a technology that’s as easy to write as microformats (or even easier), but which lets people write tools that consume arbitrary vocabularies much more easily than is possible with text/html / POSH / Microformats today
singpolyma – Hixie: isn’t that what RDFa and the other cruft is about?
Hixie – RDFa is a disaster insofar as “easy to write as microformats” goes
singpolyma – Not that I agree arbitrary vocabularies can be used for anything…
Hixie – and it’s not particularly great to parse either

Hixie – is it ok if html5 addresses some of the use cases that _are_ asking for those things, in a way that reuses the vocabularies developed by Microformats?

Well, no one is surprised to see such a discussion about RDFa in relation to HTML5. I don’t think anyone seriously believed that RDFa had a chance of being incorporated into HTML5. Most of us have resigned ourselves to no longer supporting the concept of “valid” markup, as we go forward. Instead, we’ll continue to use bits of HTML5, and bits of XHTML 1.0, RDFa, and so on.

But I am surprised to read a data person write something like, if you don’t have a defined vocabulary, you end up with something useless like RDF or XML. I’m surprised because one can add SQL to the list of useless things you end up with if you don’t have defined vocabularies, and I don’t think anyone disputes the usefulness of SQL or the relational data model. A model specifically defined to allow arbitrary vocabularies.

As for XML, my own experiences with formatting for eBooks have shown how universally useful XML and XHTML can be, as I am able to produce book pages from web pages, with only some specialized formatting. And we don’t have to form committees and get buy-off every time we create a new use for XML or XHTML, just as we don’t have to get some standards organization to give an official okee dokee to another CMS database, such as the databases underlying Drupal or WordPress.

And this openness applies to programming languages, too. There have been system-specific programming languages in the past, but the widely used programming languages are ones that can be used to create any number of arbitrary applications. PHP can be used for Drupal, yes, but it can also be used for Gallery, and eCommerce, and who knows what else—there’s no limiting its use.

Heck, HTML has been used to create web pages for weblogs, online stores, and gaming, all without having to redefine a new “vocabulary” of markup for each. Come to think of it, Drupal modules and WordPress plug-ins, and widgets and browser extensions are all based on some form of open infrastructure. So are REST and all of the other web service technologies.

In fact, one can go so far as to say that the entire computing infrastructure, including the internet, is based on open systems allowing arbitrary uses, whether the uses are a new vocabulary, or a new application, or both.

Unfortunately, too many people who really don’t know data are making too many decisions about how data will be represented in the web of the future. Luckily for us, browser developers have gotten into the habit of more or less ignoring anything unknown that’s inserted into a web page, especially one in XHTML. So the web will continue to be open, and extensible. And we, the makers of the next generation of the web can continue our innovations, uninhibited by those who want to fence our space in.

A battle of Beliefs: RDF, Natural Language Processing, and the future of the web

Last Week in HTML has been practicing its wicked ways, and pulled a quote from a comment I made to a post at Sam Ruby’s:

Ian is wrong. Absolutely, completely, and dead wrong.

rather than Ian shouting out “Hurrah!”, he says we must have five different solutions to the five problems, because to do otherwise is to…what? Give up control? Fail to meet the Guinness Book of World Records for largest, most pedantic specification ever derived by man?

At first glance, this seems a repetition of an argument that is growing thin with overuse, but the recent discussions in the RDFa mailing list, about RDFa in HTML5, provide a clear demonstration of the basic disconnect between the parties. Enough so to make it of value to re-visit the discussion, again.

On the one hand, you have RDFa, which is a serialization of RDF, which is a formal data model providing support for a universal form of structured data. On the other hand, you have those whose ideology for the future of the web is based on natural language processing. This is an old, old battle and one we’ve been fighting since RDF was first proposed. Earlier, really, as I remember working through ideological differences between natural language processing and structured data techniques in various projects at Boeing in the 1980s.

One would think, then, considering the age of the debate, that we wouldn’t fight this old battle in the lists for HTML5. Why? Because it exists above and beyond just HTML5. It is a debate about the fundamental nature of the web, at its most general and profound level, while HTML5 is really nothing more than the next generation of HTML. However, we are fighting this macro battle out in the micro lists of HTML5, but deceptively so.

Those who support RDFa have been continuously asked to provide use cases for RDFa, and have created a wiki page to record these use cases. But each time the use cases are proposed, we’re given a response that the use cases are inadequate, and different sets of criteria for how these use cases can be “improved”. It is frustrating to the RDFa adherents, stumbling about in the dark hoping to hit exactly the right “fit” in order to satisfy these never-ending requests.

In the new thread, though, the underlying ideological differences are peering out through the fabric of technical obfuscation, and we see the real purpose behind the demands for RDFa to justify its existence in HTML5. We’re not being asked to justify RDFa in HTML5; we’re being asked to justify RDF, and beyond that, we’re being asked to justify the concept of structured data. Not just once, but for every instance of a use case.

Ian Hickson writes in one comment in the mailing list thread:

I wouldn’t worry too much about the various solutions in each case — a list of solutions can never be complete, and people will never agree on what consists a pro and a con. What would be useful, though, is an example of how RDFa is expected to solve the problem, e.g. with sample markup showing how the relevant data might be encoded and code snippets showing how the data would then be processed; and a discussion of ways to deal with the likely problems (e.g., for this particular use case: how to deal with authors screwing up and encoding bad data, how to deal with apathy from sites that you want to scrape data from, how to deal with malicious authors encoding misleading data, how to deal with spammers, how to deal with requirements like Amazon’s desire to track per-developer usage, how to enable monetization for producers who are intentionally obfuscating the data today, etc. I expect other use cases will have different problems).

The first set of requests are reasonable and have been demonstrated. I use RDFa in my site to document each post with a formal title, author, date, and set of topics, each of which can be extracted using a PHP API that I’ve installed at my site. I plan on using this data in order to generate my front page eventually. This same data can be extracted with a Firefox toolbar, too, if I’m so inclined, and used to output an RDF document for others to consume. The data has also been extracted as part of Yahoo’s SearchMonkey effort, I do believe.
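As a rough illustration of the “sample markup plus code snippet” being asked for, here is a Python sketch that pulls property values out of RDFa-annotated markup. Everything in it is an assumption made for the example: the SAMPLE markup, the Dublin Core property names, the placeholder date, and the PropertyCollector class are hypothetical, and this is neither the PHP API mentioned above nor a conforming RDFa processor. It only handles flat elements and does not resolve prefixes or subjects.

```python
from html.parser import HTMLParser

# Hypothetical post markup. The Dublin Core property names and the values
# are assumptions for this example, not the markup actually used on the site.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/posts/example">
  <h1 property="dc:title">Example post</h1>
  <span property="dc:creator">Shelley</span>
  <span property="dc:date" content="2009-01-01">January 1, 2009</span>
  <span property="dc:subject">RDFa</span>
  <span property="dc:subject">HTML5</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collects (property, value) pairs from flat, non-nested elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # property whose element text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("property")
        if name is None:
            return
        if "content" in attributes:
            # An explicit content attribute takes the place of the element text.
            self.pairs.append((name, attributes["content"]))
        else:
            self._open = name
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the open property,
        # so child elements inside a property element are not supported.
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


collector = PropertyCollector()
collector.feed(SAMPLE)
for name, value in collector.pairs:
    print(name, "=", value)
```

The collected pairs are the title, author, date, and topics of the post, which is the data one would need to build a front page or hand to another consumer.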

Others have provided examples of the Creative Commons licenses, and FOAF, and other uses of RDF/RDFa. Not only the purpose behind the use but even demonstrations of how the data can be combined across pages. These seem to meet the requests for demonstrating code to both incorporate the RDFa in HTML5, as well as code to pull such data out.

As for authors screwing up and providing bad data, well I have to assume the same mechanisms in place, in the browser, when a person inputs bad data into an alt attribute (if it survives in HTML5) would be in place for bad data in a property attribute. And if the data is coded incorrectly, applications expecting valid RDFa wouldn’t be able to process the data, but that’s little different than applications not being able to process a bad script, or malformed piece of SVG, or even a crappy video file, embedded in the page.

The questions I just responded to are legitimate questions. They serve a purpose, and a person can determine by looking at these questions what needs to be provided to ensure the success of the use case. But then we start getting into murkier territory. Ian asks, how to deal with apathy from sites that you want to scrape data from, how to deal with malicious authors encoding misleading data, how to deal with spammers, how to deal with requirements like Amazon’s desire to track per-developer usage, how to enable monetization for producers who are intentionally obfuscating the data today, …

My god, how do we deal with these on the web today? HTML, itself, fails badly with all of these, so do we give up on HTML? If not, then why are we demanding a state of rigor from RDFa that we’re not willing to apply to HTML5, itself?

If you think this latter set of questions was tongue-in-cheek, perhaps a bit of markup levity, Ian repeats them, later, in the same thread:

Do we have reason to believe that it is more likely that we will get authors to widely and reliably include such relations than it is that we will get high quality natural language processing? Why?

How would an RDF/RDFa system deal with people gaming the system?

How would an RDF/RDFa system deal with the problem of the _questions_ being unstructured natural language?

How would an RDF/RDFa system deal with data provided by companies that have no interest in providing the data in RDF or RDFa? (e.g. companies providing data dumps in XML or JSON.)

How would an RDF/RDFa system deal with companies that do not want to provide the data free of charge?

How would an RDF/RDFa system deal with companies that want to track per-developer usage of their data?

One could ask all but the first question about HTML, and not find satisfactory answers. Yet we’re being asked to provide sufficient answers to these questions for a small subset of attributes in HTML5, which would form the basis of support for RDFa. As for the first question, Do we have reason to believe that it is more likely that we will get authors to widely and reliably include such relations than it is that we will get high-quality natural language processing?, this, again, brings us back to a fundamental difference in ideology, natural language processing as compared to structured data, and how can one deal with such profound differences in something like a use case?

To repeat what I said earlier, the issue isn’t about RDFa in HTML5. It is about the existence of structured data on the web. It is the underlying purpose behind RDF. It calls into question a decade’s worth of work, based on the input of hundreds if not thousands of developers and designers. It is questioning the fundamental separation of ideology between the web of the future based on natural language processing and the web of the future based on structured data. But where the structured data folks, those who support RDF, and RDFa, welcome natural language processing as a complementary process, the natural language processing folks seem to see the very existence of structured data woven into web documents to be anathema.

Now, someone tell me how we can break through this wall with use cases?

Dan Brickley chastises those on the RDFa group who see this as a battle, writing

This is not a battle. Battles kill people. It is a dispute amongst technologists who have varying assumptions, backgrounds, collaboration networks and agendas, and who are slowly learning to see each other’s perspective.

Please (and I am very serious here) stop using such bloody metaphors to describe what should be a civil and mutually respectful collaborative process. You will not improve anything if you foster this kind of perspective on our shared problems. Battle talk results in a battle mindset. I do not want to hear any RDFa advocates talking in such terms.

Really, enough with the battle stuff. Go find someone who works on HTML5 and be nice to them, find common ground, try out their tools.

Play nice…try out their tools.

I have tried the tools, and in fact just tried the HTML5 validator with the SVG, MathML, and RDFa (minus Curie) preset, and aside from the fact that it tossed my DOCTYPE, didn’t like my profile attribute, some of my meta elements, and the use of “none” as a value for preserveAspectRatio in my SVG, the validator had no problems with any of my RDFa. I would have to assume, then, that we have seen a demonstration of RDFa in HTML5…and found it good? And lo and behold, the RDFa extractors have also found the same page, and the same use of RDFa, to be good. Hands across the water.

But evidently, not sufficient. What else must we do to play nice? Well, Sam has laid out the “nice filter” in comments to his post that began this particular thread

What would it take for inclusion of the RDFa attributes in HTML 5 to be tracked in the W3C HTML Working Group issues list? Given the links I provided at the top of this post, I’d say that pretty much all of the pieces are in place except for a discussion on the public-html mailing list.

What work would be helpful in getting this to be resolved successfully? Fleshing out the use cases addressing as much of these concerns as are relevant.

How can you help? Join the WG and/or contribute to the wiki.

Just so that it is clear, as we move towards summer I plan to become ruthless in clearing out issues which have been raised but don’t appear to have any substantive proposals or support. There is much good work in HTML5 and it would be positively criminal for it not to advance due to procedural maneuverings. I don’t intend to let that happen either.

And this then leads us back to the questions posed by Ian, above. For each use case, must I then justify RDF? Structured data? Must I give details about how spammers will be vanquished, and evil corporations not allowed to monetize such effort? Must I provide a 12-step program in how to lure the reluctant microformat user into the fold? Does the fact that Virgin Mobile misused the Creative Commons license to publish photos of people without getting model releases, mean that the use of RDF/RDFa to document a Creative Commons license can never be a valid use case? After all, it fails the evil corporate use case requirement being demanded of RDFa.

There seems to exist a gentleman’s agreement in these specification email lists, whereby the participants humor absurd questions such as those proposed by Ian. Well, thank goodness I’m no gentleman.

If the RDFa in HTML5 adherents will be required to provide not only justification for RDFa, but also justification for RDF, as a whole, in addition to a dialog and debate about the fundamental differences between natural language processing and structured data with each and every use case, then I fail to see the “niceness” supposedly in play here. It’s difficult, too, to see exactly what we’re supposed to do to bring about this so-called “common ground”. Ultimately, structured data people see natural language processing as complementary, and that there is room on the web for both ideologies. The natural language processing folks see structured data as competitive, and that the web of the future will be based on one or the other, but not both. How do you work through that kind of difference?

Pinky and the Markup Brains

What ended up being the ultimate irritation of my brief foray into HTML5 land, is that I found out, after careful perusal of my original use of RDFa, that I wasn’t using it incorrectly. However, by the time I got through listening to all the arguments, back and forth, and round and round, I was beginning to doubt whether an angle bracket really looked like < and >. I am correct, aren’t I? These are angle brackets, right?

Of course not. I call them angle brackets, but others call them diamond brackets, and I’m sure someone else, most likely from the UK, calls them elbow brackets or the Queen’s brackets, or some such thing.

However, the back and forth, and round and round, wouldn’t be an issue, could even be a journey of discovery, if it weren’t for the arrogance of some of the participants. Or, what I perceive to be arrogance. Variations of, “But that’s wrong and here’s why”, followed up with references to other specifications that hurt, actually physically hurt just to look at, given in a tone of, “How could you think otherwise?” Or responses based on some absolutely obscene piece of markup minutia, repeated over and over again, in attempts to hammer the point home to us, the seemingly dense as bricks.

The end product of such discussions, though, is that people like myself flee the discussion, literally flee, as if the hounds of hell were chomping at our butts. The downside of running away, though, is we’re left feeling that we have no input, no control over what the web of the future will, or will not, allow. That the web of the future is designed by and for the web designers, and not thee and me.

The real problem, though, has less to do with communication style, and more to do with differing levels of expertise and interest. People like me, who are consumers of specs, are mixed in with people who create the parsers and the browsers, and live and breathe, eat and sleep this stuff. What else can we, the consumers, do, though? There seemingly is no way for those of us, on the dumb side of markup, to communicate our concerns, wishes, and desires to the other side. But when we do venture into the lists, we are quickly overwhelmed with the specs, the references, the minutia. Our interests get lost in the fact that we don’t have the language to participate. Worse, we don’t have the language to participate in a field notorious for being both competitive, and impatient.

Unbeknownst to ourselves, we have become Pinky to the markup Brains.

So we consumers flee the lists and leave them to the developers and designers, and the end result is that we have specifications, and eventually implementations, that, well, frankly, scare the shit out of most of us.

Don’t believe me? How else could you explain the Yellow Screen of Death that appears whenever you make a simple mistake in markup for the post you’re writing? Not a helpful error, or an error that gently points out where and why the problem occurs; an error that tries to work with you to correct the problem.

No, it is an ugly error, an angry error, with red on yellow, that screams, “Bad, Shelley! Bad”, before it invariably trails off to uselessness on the right side of the browser. You don’t think an actual person like you and me would have designed a specification that encourages this behavior, or a browser that implements it, do you?

The true irony, though, is when you do voice concerns, or criticism, you’re typically met with, “If you want something, you need to participate in the email lists working on the specifications”, and the cycle begins anew. Narf.

Stop justifying RDF and RDFa

update The discussion on RDFa in HTML5 is quite active on the WhatWG mailing list, and so I’m closing comments down here, and encouraging the discussion in that location. There is no restriction on joining the mailing list. A place to start would be a thread I started but I’m sure new threads will be springing up.

I did want to apologize for assuming that the XHTML errors I had recently were due to WhatWG members having fun at my expense. I’ve had people deliberately break my XHTML-based comments in the past when I’ve written about XHTML, and the break was documented with a screenshot on the website of a WhatWG member. I put 2 and 2 together and came up with 5.


I was reading the back and forth argument about the support for RDFa in HTML5, when it hit me that we, who support RDF, and its embedded serialization technique, RDFa, are going about it all wrong.

The question that gets asked, repeatedly, in the HTML5 and WhatWG mailing lists is: What problem does RDFa solve? This typically then leads to lengthy discussions about RDFa versus microformats, how one only needs rel, class, meta, and script in order to seemingly record the same information. Or that marking this information up in any way is unnecessary, as people won’t use it, use it badly or for evil purposes, and the only direction forward for the web is natural language processing…yada, yada, yada. You’ve heard it all before.

But what if we stop focusing on the perceived purpose of RDF/RDFa? What if, instead of defending RDFa as a format for discovery of semantics on the web, in competition with other techniques, we focus on RDF, as others have focused on MathML and SVG—as a rich, mature specification with its own unique purpose, and its own unique benefit? In other words, begin with the assumption that RDF has value in, and of itself, and does not need to be “justified”. Instead, let’s focus on whether HTML5 can support RDF—the rich, mature specification—as is, with the existing HTML5 extension mechanisms.

The quintessential aspect of RDF is the triple of subject, predicate, and object. For simplicity’s sake: the thing, the property of the thing, and the property’s value.
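To make the shape concrete, here is a minimal sketch of a triple as a plain Python value. The Triple class, the example.org story URI, and the title text are all hypothetical, made up for illustration; only the Dublin Core namespace comes from the markup discussed in this post.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: the thing, the property of the thing, the value."""
    subject: str
    predicate: str
    obj: str


DC = "http://purl.org/dc/elements/1.1/"
STORY = "http://example.org/stories/1"  # hypothetical story URI

# Statements about the same story, as they might be found in two places.
from_page_a = {Triple(STORY, DC + "subject", "Semantic Markup")}
from_page_b = {
    Triple(STORY, DC + "subject", "Semantic Markup"),
    Triple(STORY, DC + "title", "A hypothetical title"),
}

# Because subject and predicate are full URIs, combining the two is a set
# union, and the statement both pages share collapses into one.
merged = from_page_a | from_page_b
```

Nothing here is specific to any RDF library; the point is only that a triple is three values, and that merging works because the identifiers are global.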

For the most part, the thing is identified by a URI, a Uniform Resource Identifier, in order to distinguish it from every other thing when different instances of data are combined. To repeat the underlying basis of this particular thought experiment, disregard, for the moment, that RDF is used to record semantics. Focus, instead, on the essential structure of RDF data. Now ask yourself: can we represent RDF within an HTML5 document, using HTML5’s current mechanisms for extensibility? My assertion in this writing is that the answer is, no.

To demonstrate, let’s look at the RDF/XML output derived from an examination of the RDFa currently embedded in this page. Case in point, the following:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns0="http://www.w3.org/1999/xhtml/vocab#"
  xmlns:ns1="http://purl.org/dc/elements/1.1/"
  xmlns:ns2="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://realtech.burningbird.net/semantic-web/semantic-markup/oh-look-its-not-just-us-semantic-web-dweebs-who-noticed">
    <ns1:title rdf:parseType="Literal"><a xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" href="/semantic-web/semantic-markup/oh-look-its-not-just-us-semantic-web-dweebs-who-noticed">Oh, look. It's not just us Semantic Web Dweebs who noticed.</a></ns1:title>

    <ns1:subject rdf:parseType="Literal">Semantic Web: <a xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" href="/semantic-web/semantic-markup">Semantic Markup</a></ns1:subject>
  </rdf:Description>
</rdf:RDF>

The RDFa from which this RDF model was derived is the following:

<div id="node-572" class="node" about="/semantic-web/semantic-markup/oh-look-its-not-just-us-semantic-web-dweebs-who-noticed">
      <h2 class="node-title" property="dc:title">
      <a href="/semantic-web/semantic-markup/oh-look-its-not-just-us-semantic-web-dweebs-who-noticed">Oh, look. It's not just us Semantic Web Dweebs who noticed.</a>
    </h2>          
     <div class="taxonomy">
      Tagged: <ul class="links inline"><li property="dc:subject">Semantic Web: <a href="/semantic-web/semantic-markup">Semantic Markup</a></li></ul>    </div>
...
</div>

The triple we’ll focus on is that a given story (subject), belongs to a particular category of story (predicate), which in this case is “Semantic Markup” (object).

In the example, the subject is identified with the about attribute attached to the outer div element, which encompasses the actual text of the story. The predicate associated with the subject is identified in the property attribute, which is attached to a list element (li), and the RDF object is the text, “Semantic Markup”, contained within the list item element’s opening and closing tags. The two element attributes used in this example, which are not a part of HTML5, are “about” and “property”. The question then is: can we use HTML5’s current extensibility mechanisms to record the same data, maintaining the same essential structure, in order to derive the same RDF data model when the page is passed to some RDF extraction mechanism?
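The reading described above can be sketched in a few lines of Python using the standard library's html.parser. This is a rough illustration, not an RDFa processor: the TinyRDFaExtractor name and the shortened markup are my own, it assumes well-formed markup with every element explicitly closed, it treats the object as plain text rather than the XML literal shown in the RDF/XML above, and it neither resolves the relative about value nor expands the dc: prefix.

```python
from html.parser import HTMLParser


class TinyRDFaExtractor(HTMLParser):
    """Collects (subject, predicate, text) from `about` and `property` only."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one (subject in scope, property or None) per open element
        self.buffers = []  # [subject, predicate, text parts] per open property element
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.stack[-1][0] if self.stack else None
        # `about` sets the subject for this element and everything inside it.
        subject = attrs.get("about", inherited)
        prop = attrs.get("property")
        self.stack.append((subject, prop))
        if prop is not None:
            self.buffers.append([subject, prop, []])

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for buf in self.buffers:
            buf[2].append(data)

    def handle_endtag(self, tag):
        if not self.stack:
            return
        _, prop = self.stack.pop()
        if prop is not None:
            subject, predicate, parts = self.buffers.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.triples.append((subject, predicate, text))


# Shortened, hypothetical version of the markup above.
markup = """
<div class="node" about="/stories/1">
  <h2 property="dc:title"><a href="/stories/1">A title</a></h2>
  <ul><li property="dc:subject">Semantic Web: <a href="/tags/markup">Semantic Markup</a></li></ul>
</div>
"""

extractor = TinyRDFaExtractor()
extractor.feed(markup)
extractor.close()
```

Even this toy version only works because two rules are fixed ahead of time: the subject comes from the nearest enclosing about, and the predicate comes from property. Those rules are what the rest of this writing goes looking for in HTML5.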

Goodness knows it would seem to be a simple way to represent the RDF bits in existing HTML5 attributes. For instance, we could add “subject” as another class item and thus eliminate the need for the RDFa property. We already have the link contained within the list item, which would seem to serve the purpose of identifying the object uniquely, and therefore don’t need about. In other words, HTML5’s extension mechanism would seem to be sufficient. Except, of course, it’s not.

If the data so documented existed solely within the page, I could use the class attribute to denote the RDF property, but is the “subject” I use in my document, the same as “subject” in someone else’s document? Who knows. Other than a similarity of text, we have no idea if they mean anything. This is a critical breakdown, too, because precision of data model is also an essential element of RDF. Otherwise, we wouldn’t be able to combine documents found on the web with any degree of confidence.

However, I suppose we could annotate the “subject” class value with an abbreviation of the domain from which it derives, in this case the Dublin Core domain, or “dc:” for short. By doing so, when you have a dc:subject in your document, and I have a dc:subject in my document, and both documents attach this property to the same subject, then the data can be safely merged. There is no confusion about what each of us “means”, when we use “subject”.

Of course, we’ll then have to negotiate for a shared meaning behind “dc:”. And we’ll have to ensure that everyone in the world uses the same designation for Dublin Core. Then we’ll have to repeat this exercise for every existing and new vocabulary that comes along…
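For contrast, here is a small sketch of what an abbreviation needs in order to be safe: a mapping, declared somewhere the processor can find it, from the prefix to a full namespace URI. The expand function and the two prefix tables are hypothetical; the namespace URI and the ns1 prefix are the ones that appear in the RDF/XML output earlier.

```python
DC = "http://purl.org/dc/elements/1.1/"


def expand(name: str, prefixes: dict) -> str:
    """Turn 'prefix:local' into a full URI using a declared prefix table."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        # Without a declaration, the prefix is only a string of characters.
        raise KeyError(f"no namespace declared for {name!r}")
    return prefixes[prefix] + local


# Two documents that happen to pick different abbreviations.
my_prefixes = {"dc": DC}
your_prefixes = {"ns1": DC}

mine = expand("dc:subject", my_prefixes)
yours = expand("ns1:subject", your_prefixes)
```

With a declared mapping, the abbreviation itself stops mattering: both names above expand to the same URI. Without one, everyone has to agree on the letters, which is the negotiation described above.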

Perhaps the abbreviated designation isn’t as feasible as it would first seem. So, what we’ll do, then, is annotate the subject with the full domain name URI, and still use the class attribute:

<li class="inline node http://purl.org/dc/elements/1.1/subject">Semantic Markup</li>

Well, that’s going to be interesting to see in our web page documents. Of course, we’ll have to duplicate the domain name URI with every reference to the property, increasing the overall size of the document. And, unfortunately, the dozens, potentially hundreds of RDF parsers that already exist will have to be modified to account for the difference in handling between RDFa embedded in HTML5, and RDFa embedded in XHTML, but that’s a small price to pay for HTML5 compatibility. Really. The RDFa processors will have to look at every use of class in a document, which potentially could slow down processing, and make the applications more sluggish, but that’s also a small price to pay.

Really.
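To give a feel for that price, here is a sketch of what a processor would have to do under the class-based scheme: look at every class token on every element, and test each one to see whether it happens to be a URI. The ClassURIScanner name is hypothetical, and the "starts with http or https and has a host" test is my own simplification, not a rule from any specification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ClassURIScanner(HTMLParser):
    """Counts every class token and keeps the ones that look like URIs."""

    def __init__(self):
        super().__init__()
        self.inspected = 0  # how many class tokens had to be examined
        self.found = []     # (tag, token) pairs that look like property URIs

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "class" or not value:
                continue
            for token in value.split():
                self.inspected += 1
                parsed = urlparse(token)
                # Simplified assumption: a property is an http(s) URI with a host.
                if parsed.scheme in ("http", "https") and parsed.netloc:
                    self.found.append((tag, token))


scanner = ClassURIScanner()
scanner.feed(
    '<li class="inline node http://purl.org/dc/elements/1.1/subject">'
    "Semantic Markup</li>"
)
scanner.close()
```

Three tokens examined to find one property, and that is a single list item. Multiply by every styled element on a page.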

So, we’ve accounted for the predicate, the property in our triple. Next, we need the ability to uniquely identify the resource.

A possible HTML5 attribute we could use is the rel attribute, supplying the URI for the subject. However, a quick glance at the HTML5 Wiki for Rel and we can see that, though rel can be str-e-e-e-e-tched almost beyond recognition, there are limits. Our use of rel as a way of recording a specific URI does not fit within the HTML5 boundaries for permissible uses of the attribute, because it’s not a repeating value that we can define in a table ahead of time.

In our web pages, we can point out our sweethearts, our timesheets, our muse, and a crush. We can’t, however, use rel to point to the resource to which a specific RDF property is attached.

If not the rel, how about others of the HTML5 attributes? For instance, a likely named alternative is the id. Would id work?

Currently, the HTML5 specification supports id to identify a web page element uniquely, but only an element specific to the document and the document’s DOM, or Document Object Model. It’s handy for whizzing the element about the page using JavaScript, and playing pretty, pretty with CSS, but how will it combine with, say, the data from a hundred web pages? A thousand?

Well, it doesn’t combine at all, because the id supported in HTML5 is semantically not the same as the URI necessary for RDF. Though the name of the game in HTML5 is “overloading R us”, in this case the meaning of the term must stretch too much in order to successfully encompass both needs.

So, what is wrong with using a hypertext link to identify a resource? And convincing the HTML5 crew to add “rdf-resource” to “sweetheart” and “muse” in the list of valid rel attribute values?

Ah, now that’s where the rubber meets the road when it comes to RDF. This takes us all the way back to the beginning of the discussions about RDF, and the emphasis placed on the fact that a URI is not the same as a URL. And though a URL is an instance of a URI, not every instance of a URI can be safely used in place of a URL. In other words, we can’t depend on using a hypertext link to identify a resource.
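A quick sketch of the distinction, using Python's urllib.parse. The looks_like_link helper is a rough heuristic of my own, not a definition from any specification, and the urn:example identifier is invented for illustration.

```python
from urllib.parse import urlparse


def looks_like_link(uri: str) -> bool:
    """Rough heuristic: could a browser follow this as an http(s) link?"""
    parsed = urlparse(uri)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


identifiers = [
    "http://purl.org/dc/elements/1.1/subject",  # a URI that is also a URL
    "urn:isbn:0451450523",                      # names a book, locates nothing
    "urn:example:story:572",                    # hypothetical identifier
]

results = {uri: looks_like_link(uri) for uri in identifiers}
```

All three are perfectly good identifiers for the subject of a triple. Only the first would make sense inside an href.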

OK, then, what about limiting our RDF to those cases where the URI is a URL?

Unfortunately, this also fails to map cleanly between HTML5 and RDF. In the example, the actual hypertext link associated with the list element with the given property of “dc:subject” isn’t the RDF triple subject, at all. That link is associated with the web page leading to a list of related postings. It’s handy, but it doesn’t uniquely identify the subject being described. No, the actual resource, or subject, is the story, itself.

Now, the story is identified by a hypertext link, but the link in this case isn’t attached in any meaningful way to the element containing our “dc:subject” property attribute. More importantly, from a viewpoint of achieving a clean mapping between the RDF model and bits embedded within the HTML5 document, there is no logic or set of rules within HTML5 to associate the two; not in such a way that we can guarantee the same RDF data model with each iteration of usage within an HTML5 document.

We can assume there’s another link containing the URI within the parent block somewhere that uniquely identifies the resource. There is no formal logic, however, nor set of rules that guarantees we’ll always be able to derive the same RDF model, each and every time.

In other words, the extension mechanisms built into HTML5 can’t ensure that the embedded data can then be used to safely derive and return a consistent RDF model.

RDFa, on the other hand, does define these rules. Defines them well enough that I can make minor modifications to my Drupal template to embed the RDF data, and use a packaged PHP-based API to pull this same RDF data back out. Not just myself—anyone wanting to annotate their web pages with RDF could do so, without negatively impacting on any other aspect of the page, or its consumption by other agents, such as browsers. And any application can then pull the data out using any number of language-based APIs. Unfortunately, though, RDFa does not fit cleanly into the current HTML5 specification. It doesn’t fit, and seemingly, is not welcome.

In the recent discussions related to once again having to “prove” the worthiness of RDF/RDFa, HTML5 lead editor, Ian Hickson, wrote the following in a note posted to one of the HTML working group’s email lists:

Also, while the solutions we’re designing will almost certainly still be in use decades from now, and will almost certainly influence the solutions in use centuries from now, we are not actually designing the solutions for the problems seen decades from now.

That is to say, we are trying to solve the problems of today and the next few years, with a design that will be extensible in the future by the maintainers of HTML once they know what the problems of the future are. HTML5 is not the end of the road; when HTML5 is widely deployed and used, then we will be able to design HTML6 on top of it. And so forth.

Thus there is no need for HTML5 to have author-usable features for extensibility to solve the problems of decades from now. The extensibility mechanisms for authors (and HMTL5 has many …) should solve _today’s_ problems; and the language should be designed in such a way that the future maintainers of HTML can later extend the language to fix their problems. This is just how HTML4 was done; it’s how CSS was done; it’s how XML was done (you can’t invent new XML syntax, for instance, that would require a new version of XML).

In this writing, I’ve only looked at the most trivial aspects of the RDF model and its RDFa serialization. If HTML5 fails with something as primitive as a simple RDF triple, it will certainly continue to fail for anything more complex. However, the point of this writing isn’t to highlight the shortcomings of HTML5, as a whole, but to demonstrate that the extension mechanisms within HTML5 are not sophisticated enough to handle existing needs. Not some future extensibility, as Ian notes, but a need that exists today.

RDF is a rich data model with widespread use, documented by a mature specification, supported by any number of tools in any number of applications, in use by any number of companies, for any number of purposes. It is not some lightweight Johnny-come-lately that can be disregarded and ignored because it doesn’t satisfy a small group’s determination of what is, or is not, essential to the web. We don’t have to justify our interest in RDF, and therefore are fully within our rights to ask that it be supported in any web page markup currently under development by the W3C.

Now, if the HTML5 working group wishes to demonstrate that the RDF model can be implemented in HTML5, as is, then they should do so. They should not, though, demand that we give up the RDF model in order to support some other model, just because they don’t happen to see the need for RDF for themselves.

We in the RDF community are not asking the HTML5 working group to support …extensibility to solve problems of decades from now. We’re asking for a solution to a problem that exists today. Now. This very moment.