Maxwell’s Silver Hammer: RDFa and HTML5’s Microdata

Being a Beatles fan, I must admit to being intrigued by the new Beatles box set that will be available in September. I have several Beatles albums, but not all, and none of the CDs I own have been re-mastered or re-mixed. That includes the album with one of my favorite songs, Abbey Road's "Maxwell's Silver Hammer":

Joan was quizzical; Studied pataphysical
Science in the home.
Late nights all alone with a test tube.
Oh, oh, oh, oh.

Maxwell Edison, majoring in medicine,
Calls her on the phone.
"Can I take you out to the pictures,
Joa, oa, oa, oan?"

But as she's getting ready to go,
A knock comes on the door.

Bang! Bang! Maxwell's silver hammer
Came down upon her head.
Bang! Bang! Maxwell's silver hammer
Made sure that she was dead.

I love the chorus, Bang! Bang! Maxwell’s silver hammer came down upon her head…

Speaking of Bang! Bang!: Jeni Tennison returned from vacation, surveyed the ongoing, and seemingly unending, discussion comparing RDFa with HTML5's Microdata, and wrote HTML5/RDFa Arguments. It's a well-written look at some of the issues, primarily from the viewpoint of a heavy RDFa user working to understand the perspective of an HTML5 advocate.

Jeni lists all of the pushback against RDFa that I’m aware of, including the reluctance to use namespacing, because of copy and paste issues, as well as the use of prefixes, such as FOAF, rather than just spelling out the FOAF URI. Jeni also mentions the issue of namespaces being handled differently in the DOM (Document Object Model) when the document is served as HTML, rather than XHTML.

The whole namespace issue goes beyond just RDFa, and touches on the broader issue of distributed extensibility, which will, in my opinion, probably push back the Last Call date for HTML5. It may seem like accessibility issues are the real kicker, but that's primarily because no one wants to look at the elephant in the corner that is extensibility. Right now, Microsoft is tasked with providing a proposal for this issue. Yes, you read that right: Microsoft. When that happens, an interesting discussion will ensue. And unlike other issues, whatever happens will take more than a few hours to integrate into HTML5.

I digress, though. At the end of her writing, Jeni summarizes her opinion of the RDFa/namespace/HTML5/Microdata situation with the following:

Really I’m just trying to draw attention to the fact that the HTML5 community has very reasonable concerns about things much more fundamental than using prefix bindings. After redrafting this concluding section many times, the things that I want to say are:

  • so wouldn’t things be better if we put as much effort into understanding each other as persuading each other (hah, what an idealist!)
  • so we will make more progress in discussions if we focus on the underlying arguments
  • so we need to talk in a balanced way about the advantages and disadvantages of RDF

or, in a more realistic frame of mind:
  • so it’s just not going to happen for HTML5
  • so why not just stop arguing and use the spare time and energy doing?
  • so why not demonstrate RDF’s power in real-world applications?

My own opinion is that I don’t care that RDFa is not integrated into HTML5. In fact, I don’t think RDFa belongs in HTML5. I think a separate document detailing how to integrate RDFa into HTML5, as happened with XHTML, is the better approach.

Having said that, I do not believe that Microdata belongs in the HTML5 document, either. The HTML5 document is already problematic: bloated and overly complex. It encompasses too much, a fault of the charter as much as anything else. Removing the entire Microdata section would help, as would several other changes, but we'll focus on the Microdata section for the moment.

The problem with the Microdata section is that it is a competing semantic web approach to RDFa. Unlike competition in the marketplace, competition in standards actually slows adoption, as people take a sit-back-and-see-what-happens approach. Now, when we're finally seeing RDFa incorporated into Google, into a large CMS like Drupal 7, and elsewhere, is not the time to send a message to people that "Oops, the W3C really doesn't know what the fuck it wants. Better wait until it gets its act together." Because that is the message being sent.
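To make the competition concrete, here is a rough sketch of the same simple fact marked up both ways. The vocabulary URLs and property names are illustrative only, and Microdata's attribute names were still shifting from draft to draft as I wrote this, so treat the second half as approximate:

    <!-- RDFa: a prefix is bound to a full vocabulary URI -->
    <div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
      <span property="foaf:name">Joan Edison</span>
    </div>

    <!-- Microdata draft: no prefixes; the type is spelled out per item -->
    <div itemscope itemtype="http://example.com/vocab/Person">
      <span itemprop="name">Joan Edison</span>
    </div>

Two syntaxes and two processing models for the same job: that is exactly the kind of duplication that makes people sit back and wait.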

“RDFa and Microdata” is not the same as “RDFa and Microformats”. RDFa, or I should say, RDF, has co-existed peacefully with microformats for years because the two are really complementary, not competitive, specifications. Both can be used at a site. Because Microformat development is centralized, it will never have the extensibility that RDF/RDFa provides, and the number of vocabularies will always, by necessity, be limited. Microformats, on the other hand, are easier to use than RDFa, though parsing Microformats is another thing entirely. They both have their strengths and weaknesses. Regardless, there's no harm in using both, and no confusion, either. Microformats are managed by one organization, RDFa by the W3C.

Microdata, though, is meant to be used in place of RDFa. But Microdata is not implemented in any production-capable tool, has not been thoroughly checked out or tested, has had no real-world implementation that I know of, has no support from any browser or tool vendor, and isn't even particularly liked by the HTML WG membership, including its author. It provides only a subset of the functionality that RDFa provides, and necessitates the introduction of several predefined vocabularies, all of which could, and most likely will, end up out of sync with the organizations responsible for the original vocabulary specifications outside HTML5. And let's not forget that Microdata makes use of the reversed DNS identifiers that sprang up, like a plague of locusts, in HTML5, based on the seeming assumption that people will find the following:

com.example.xn--74h

Easier to understand and use than the following:

http://example.com/xn--74h

Which, heaven knows, is not something any of us are familiar with these last 15-20 years.

RDFa and HTML5/Microdata, otherwise known as Issue 76 in the HTML5 Tracker database. I understand where Jeni is coming from when she writes about finding a common ground. Finding common ground, though, presupposes that all participants come to the party on equal footing. That both sides will need to listen, to compromise, to give a little, to get a little. This doesn't exist with the HTML5 effort.

Where the RDFa in XHTML specification was a group effort, Microdata is the product of one person's imagination. One single person. However, that one single person has complete authorship control over the HTML5 document, and so what he wants is what gets added: not what reflects common usage, not what reflects the W3C guidelines, and certainly not what exists in the world today.

While this uneven footing exists, I can’t see how we can find common ground. So then we look at Jeni’s next set of suggestions, which basically boil down to: because of the HTML WG charter, nothing is going to happen with HTML5, so perhaps we should stop beating our heads against the wall, and focus, instead, on just using RDFa, and to hell with HTML5 and microdata.

Bang! Bang!

I am very close to this point, myself. I had started my book on the issues I have with HTML5, and how I would change the specification, but after a while, a person gets tired of being shut out or shut down. I'm less interested in continuing to "bang my head against the wall", as Jeni so eloquently put it.

But then I got an email this week, addressed to several folks, asking about the introduction of Microdata: so what does the W3C recommend, then? What should people use? Where should they focus their time?

Confusion. Confusion because the HTML5 specification is being drafted specifically to counter several initiatives that the W3C has been nurturing over the last decade: Microdata over RDF/RDFa; HTML over XHTML; reverse DNS identifiers over namespaces and URIs; the elimination of non-visual cues, not only for metadata, but also for the visually challenged. And respect. There is no respect for the W3C among many in the HTML Working Group. And I know I lose more respect for the organization the closer we get to HTML5's Last Call.

In fact, HTML Working Group is a bit of a misnomer. We don’t have HTML anymore, we have a Web OS.

We don't have a simple HTML document; we have a document that contains the DOM, garbage collection, the Canvas object and a 2D API, a definition for web browser objects, interactive elements, drag and drop, cross-document communication, channel messaging, Microdata, several predefined vocabularies, probably more JavaScript than the ECMAScript standard, and, before they were split off, client-side SQL, web worker threads, and storage. I'm sure there's a partridge in a pear tree somewhere in there, but I still haven't made it completely through the document. It's probably in Section 10. I know there's talk of extending the document to include a 3D API, and who knows what else.

There's a lot of stuff in HTML5. What isn't in the HTML5 document is a clean, straightforward description of the HTML or XHTML syntax, a clearly defined path for people to move to HTML5 from other specifications, or a way of cleanly extending the specification, something that has been a cornerstone of both HTML and XHTML in the past. There's no room for the syntax in HTML5; it got shoved down by Microdata and the 2D API. There's no room for the past; the old concepts of deprecated and obsolete have been replaced by such clear terms as "conforming but obsolete". And there's certainly no room for future extensibility. After all, there's always HTML6, and HTML7, …, HTMLN, all based on the same open, encompassing attitude that has guided HTML5 to where it is today.

If we don’t like what we see, we do have options. We can create our own HTML5 documents, and submit “spec text” for a vote. But what if it’s the whole document that needs work? That many of the pieces are good, but don’t belong in the parent document, or even in the HTML WG?

The DOM should be split out into its own specification, and should take all of the DOM/interactive and browser object stuff with it. The document should be re-focused on HTML, without the mash-up of HTML syntax, scripting references, and API calls that exists now. The XHTML section should be fleshed out and separated into its own document, too, if for no other reason than to reassure people that no, XHTML is not going away. We should also be reminded that XHTML is not just for browsers; in fact, the eBook industry is dependent on XHTML, and it doesn't need Canvas, browser objects, or drag and drop.

Canvas should also be split out, to a completely separate group whose interest is graphics, not markup. As for Microdata, at this point, I don’t care if Microdata is continued or not, but it has no place in HTML5. If it’s good, split it out, and let it prove itself against RDFa, directly.

The document needs cleaning up. There are dangling and orphaned references to objects from Web Workers and Storage still littering the specification. It hops around between HTML syntax and API calls, with nothing providing any clarity as to the rhyme or reason for such jumping about. Sure, there's a lot of good stuff in the document, but it needs organization, clean-up, a good healthy dose of fresh air, and an even fresher perspective.

Accessibility shouldn't be added begrudgingly, woodenly, resentfully. It should be integrated into the HTML, not just pasted on in order to quiet folks because Last Call is coming up.

The concepts of deprecated and obsolete should be returned, to ensure a sense of continuity with HTML 4. And no, these did not originate with HTML; in fact, the use of deprecated and obsolete has been fairly common across many different technologies. I can guarantee that nothing but the HTML5 document has a term like "conforming but obsolete". I know; I searched high and low in Google for it.

And we need extensibility, and no, I don't mean Microdata and reverse DNS identifiers. If extensibility were part of the system, folks who want to use RDFa could use RDFa, and not have to beg, hat in hand, to be allowed to sit at the HTML5 table. This endless debate wouldn't be happening, and everyone could win. Extensibility is good that way. Extensibility has brought us RDFa, SVG, and MathML, and, as it has in past specifications, will continue to bring us whatever the future may bring.

whatever the future may bring…

Finding common ground? Walk a mile in each other’s moccasins? Meet mano a mano? Provide alternative specification text?

Bang! Bang!

Jeni’s a pretty smart lady.

Forests I have loved

By accident and restless choice, I am the ultimate stone that gathers no moss, and I have lived all over this country. In each location, I've hiked whatever wilderness the area boasts, and you don't truly know how beautiful this country is until you've walked its fields and forests, beaches and rivers.

In the Northwest, the wet rainforests of the Peninsula begrudge every inch of the path, and at times you feel as if the forest will swallow you whole, so rich and close it is. If one is fanciful, and the rainforests generate fancy, one would think to look closely at the bushes, to see if a set of eyes looks back. Cold water droplets down the back of one's collar are a typical Northwest rainforest experience. Elsewhere in the region, the forests are less dense but no less wild, whether walking the foothills of the Cascades or the high hills of the Inland Empire.

As a break from the forests, one can walk the desert-like petrified forests, the rich meadows, or the beaches of the Oregon coast, getting lost among the rocks and the tidal pools, to climb sandy dunes and rocky cliffs. I have walked a thousand miles of Washington and Oregon through the years, and every mile is unique.

Roosevelt Lake, a few miles from where I grew up

In Arizona, the forests are in the north and consist mainly of Ponderosa and scrub pine. In the red rock country, the trees fight for life among the rugged rocks, their green a brilliant counterpoint to the rust reds of the ground and the azure blue of the skies. In the Arizona deserts, one can turn about once, twice, and get lost if not careful, and during the summer, the wilderness is unforgiving of fools. But, oh, the beauty of an Arizona desert in the spring, with flowering cacti and cool breezes, snakes warming themselves in the sun, lizards scampering about. And the area is so rich with minerals that one can find entire valleys literally sprinkled with jasper or black or white onyx.

One might expect fierce wilderness in Vermont, but you'd be surprised. The entire state was clear-cut at one time, and the trees are of a uniform sameness of type and size. But in the winter, when the snow is on the ground and the lakes are frozen, that's when Vermont shines for me. The irony, though, is that there are few places to hike easily in Vermont. In the winter, on Grand Isle, the local high school opened its doors in the evenings for community members to walk the corridors, get a bit of exercise, and socialize. When snow is 4 feet deep, you don't just cut across open country for a bit of a hike. Unless you're a red fox.

Once, when I stayed at a bed and breakfast in the central part of the state, I found a trail made by a snow trailer and was able to walk to the top of the hill next to the B & B. The day was sunny and cold, and the fresh snow was pure white all about me. As I walked further and further up the hill, all sounds fell away, until the only thing I could hear was my own heart.

Trail in Muir Woods, CA

In Massachusetts, there are miles of coasts to walk if you can find them. The water is warmer than the Pacific but more temperamental, and there are few experiences finer than to stand on a beach during a summer storm in New England. Wet. Truly wet.

I prefer hiking, but it's hard to resist the lure of the Emerald Necklace in Boston for walking: the series of connected parks that traverse the city. In Boston, you're always aware that the streets you walk were once walked by the likes of John Hancock, Samuel Adams, and Paul Revere. It was in one part of the Necklace that I walked along a stream and a red-tailed hawk landed on a branch only a few feet away. Right in the middle of the city.

In Montana, the green forest gives way to mile after mind-numbing mile of cattle ranches before hitting rocky mountains that tear through the earth in jagged layers, dangerous to walk, beautiful to see. And in Idaho, the lakes rest like blue sapphires nestled in verdant green velvet.

In Northern California, you can walk among Redwood trees so tall that no other life grows on the forest floor, because no sunlight ever makes it past the trees. In the distance, you can hear birds singing, but not a sound at the forest floor. As you walk, you can reach out and pat a tree that was born about the time when Abraham gave birth to Judaism, Christianity, and Islam.

Rocky Mountain forest

Here in Missouri, where mighty rivers have carved a culture unique to this region, of blues and banjos, where north and south meet and co-exist, this is a land of many faces: river fronts give way to wild mountains, which give way to city, which gives way to parks absolutely unique in this country. One can walk every day of the year and still not touch all the trails and paths this state supports.

The mountains here are smaller than in the Northwest, but no less wild and no less fierce with brambles and tangles and rocks and soft clay ready to trip you up at every step. Here is where one is likely to meet white-tailed deer and beaver and black bear, in addition to squirrel and opossum and raccoon. The bird life is as rich as the landscape, with bluebirds and red cardinals and mockingbirds and finch, hawk, eagle, and red-winged blackbird, all within a few miles of the city.

However, the real magic in this simple land is to walk the same path in all seasons; to see the land in winter, only hinted at behind lush trees and bushes in the summer; to watch a whole valley suddenly become dusted with green after a spring rain; to stand at the edge of the forest and see color that would shame the finest painters as the leaves of dozens of different trees of different heights and shapes change into their autumn colors, of gold and rust, pink and scarlet, with a hint here and there of defiant, stubborn green; to stand beneath a canopy of trees as golden leaves cascade down around you.

Trail in the fall

Arbitrary Vocabularies and Other Crufty Stuff

I went dumpster diving into the microformats IRC channel and found the following:

singpolyma – Hixie: that’s the whole point… if you don’t have a defined vocabulary, you end up with something useless like RDF or XML, etc
@tantek – exactly
Hixie – folks who have driven the design of XML and RDF had “write a generic parser” as their #1 priority
@tantek – The key piece of wisdom here is that defined vocabularies are actually where you get *user* value in the real world of data generated/created by humans, and consumed eventually by humans.
Hixie – i’m not talking about this being a #1 priority though — in the case of the guy i mentioned earlier, it was like #4 or #5
Hixie – but it was still a reason he was displeased with microformats
@tantek – Hixie – ironically, people have written more than one generic parser for microformats, despite that not being a priority in the design
Hixie – url?
@tantek – mofo, optimus
@tantek – http://microformats.org/wiki/parsers
@tantek – not exactly hard to find
@tantek – it’s ok that writing a generic parser is hard, because not many people have to write one
Hixie – optimus requires updating every time you want to use a new vocabulary, though, right
@tantek – OTOH it is NOT ok to make writing / marking up content hard, because nearly far more people (perhaps 100k x more) have to write / mark up content.
Hixie – yes, writing content should be easy, that’s clear
Hixie – ideally it should be even easier than it is with microformats 🙂
singpolyma – Of course you have to update every time there’s a new vocabulary… microformats are *exclusively* vocabularies
Hixie – there seems to be a lot of demand for a technology that’s as easy to write as microformats (or even easier), but which lets people write tools that consume arbitrary vocabularies much more easily than is possible with text/html / POSH / Microformats today
singpolyma – Hixie: isn’t that what RDFa and the other cruft is about?
Hixie – RDFa is a disaster insofar as “easy to write as microformats” goes
singpolyma – Not that I agree arbitrary vocabularies can be used for anything…
Hixie – and it’s not particularly great to parse either

Hixie – is it ok if html5 addresses some of the use cases that _are_ asking for those things, in a way that reuses the vocabularies developed by Microformats?

Well, no one is surprised to see such a discussion about RDFa in relation to HTML5. I don't think anyone seriously believed that RDFa had a chance of being incorporated into HTML5. Most of us have resigned ourselves to no longer supporting the concept of "valid" markup as we go forward. Instead, we'll continue to use bits of HTML5, bits of XHTML 1.0, RDFa, and so on.

But I am surprised to read a data person write something like, if you don't have a defined vocabulary, you end up with something useless like RDF or XML. I'm surprised because one can add SQL to the list of useless things you end up with if you don't have defined vocabularies, and I don't think anyone disputes the usefulness of SQL or the relational data model: a model specifically designed to allow arbitrary vocabularies.

As for XML, my own experience with formatting for eBooks has shown how universally useful XML and XHTML can be, as I am able to produce book pages from web pages with only some specialized formatting. And we don't have to form committees and get buy-off every time we create a new use for XML or XHTML, just as we don't have to get some standards organization to give an official okee dokee to another CMS database, such as the databases underlying Drupal or WordPress.

And this openness applies to programming languages, too. There have been system-specific programming languages in the past, but the widely used programming languages are ones that can be used to create any number of arbitrary applications. PHP can be used for Drupal, yes, but it can also be used for Gallery, and eCommerce, and who knows what else—there’s no limiting its use.

Heck, HTML has been used to create web pages for weblogs, online stores, and gaming, all without having to define a new "vocabulary" of markup for each. Come to think of it, Drupal modules and WordPress plug-ins, and widgets and browser extensions, are all based on some form of open infrastructure. So is REST, and all of the other web service technologies.

In fact, one can go so far as to say that the entire computing infrastructure, including the internet, is based on open systems allowing arbitrary uses, whether the uses are a new vocabulary, or a new application, or both.

Unfortunately, too many people who really don’t know data are making too many decisions about how data will be represented in the web of the future. Luckily for us, browser developers have gotten into the habit of more or less ignoring anything unknown that’s inserted into a web page, especially one in XHTML. So the web will continue to be open, and extensible. And we, the makers of the next generation of the web can continue our innovations, uninhibited by those who want to fence our space in.

Wolfram Alpha

Sheila Lennon asked my opinion on Nova Spivack's recent writing about Wolfram Alpha, and posted my response, as well as other notes. Wolfram Alpha is the latest brainchild of Mathematica creator Stephen Wolfram: a stealth project to create a computational knowledge engine. To repeat my response:

First of all, it's not a new form of Google. Google doesn't answer questions. Google collects information on the web and uses search algorithms to provide the best resources for given search criteria.

Secondly, I used Mathematica years ago. It's a great tool. And I imagine that Wolfram Alpha will provide interesting answers for specific, directed questions, such as "what is the nearest star" and the like. But these are the simplest of all queries, so I'm not altogether impressed.

Think of a question: who originated the concept of "a room of one's own"? Chances are the Alpha system will return the writing where the term originated, Virginia Woolf's "A Room of One's Own", and the author, Virginia Woolf. At least, it will if the data has been input.

But one can search on the phrase "a room of one's own" and get the Wikipedia entry on the same. So in a way, Wolfram Alpha is more of a Wikipedia killer than a Google killer.

Regardless, when you look via Google, you get a link to Wikipedia, but you also get links to places where you can purchase the book, links to essays about the original writing, and so on. You don't get just a specific answer; you also get context for the answer.

To me, that’s power. If I wanted answers to directed questions, I could have stayed with the Britannica years ago.

Nova Spivack’s writing on the Alpha is way too fannish. And too dismissive of Google, not to mention the human capacity for finding the exact right answer on our own given the necessary resources.

Again, though, all we have is hearsay. We need to try the tool out for ourselves. But other than helping lazy school kids, I’m not sure how overly useful it will be. If it’s free, yeah. If it’s not, it will be nothing more than a novelty.

I also beg to differ with Nova when he states that Wolfram Alpha is like plugging into a vast electronic brain. Wolfram Alpha isn't brain-like at all.

The human brain is amazing in its ability to take bits and pieces of data and derive new knowledge. We are capable of learning and extending, but we're really shite, to use the more delicate English variation of the term, when it comes to storing large amounts of data in an easily accessible form.

Large, persistent data storage with easy access is where computers excel. You can store vast amounts of data in a computer, and access it relatively easily using any number of techniques. You can even use natural language processing to query for the data.

Google uses bulk to store information, with farms of data servers. When you search for a term, you typically get hundreds of responses, sorted by algorithms that determine the timeliness of the data, as well as its relevancy. Sometimes the searches work; sometimes, as Sheila found when querying Google for directions for cooking brown rice in a crockpot, the search results are less than optimum.

Wolfram Alpha seems to take another approach, using experts to input information, which is then computationally queried to find the best possible answer. Supposedly if Sheila asked the same question of Wolfram Alpha, it would return one answer, a definitive answer about how to cook brown rice in a crockpot.

Regardless, neither approach is equivalent to how a human mind works. One can see this simply and easily by asking those around us, “How do I cook brown rice in a crockpot?” Most people won’t have a clue. Even those who have cooked rice in a crockpot won’t be able to give a definitive answer, as they won’t remember all the details—all the ingredients, the exact measurements, and the time. We are not made for perfect recall. Nor are we equipped to be knowledge banks.

What we are good at is trying out variations of ingredients and techniques in order to derive the proper approach to cooking rice in a crockpot. In addition, we’re also good at spotting potential problems in recipes we do find, and able to improve on them.

So, no, Wolfram Alpha will not be like plugging into some vast electronic brain. And we won’t know how well it will do against other data systems until we all have a chance to try the application, ourselves. It most likely will excel at providing definitive answers to directed questions. I’m not sure, though, that such precision is in our best interests.

I also Googled for a brown rice crockpot recipe, using the search term, “brown rice crockpot”. The first result was for RecipeZaar, which lists out several recipes related to crockpots and brown rice. There was no recipe for cooking just plain brown rice in a crockpot among the results, but there was a wonderful sounding recipe for Brown Rice Pudding with Coconut Milk, and another for Crocked Brown Rice on a Budget that sounded good, and economical. I returned to the Google results, and the second entry did provide instructions on how to cook brown rice in a crockpot. Whether it’s the definitive answer or not, only time and experimentation will tell.

So, no, Google doesn't always provide a definitive answer to our questions. If it did, though, it really wouldn't be much more useful than Wikipedia, or our old friend, the Encyclopedia Britannica. What it and other search engines provide is a wealth of resources for most queries, resources that not only typically answer the questions we're asking, but also offer any number of other leads, and chances for discovery.

This, to me, is where the biggest difference will exist between our existing search engines and Wolfram Alpha: Alpha will return direct answers, while Google and other search engines return resources from which we can not only derive answers but also make new discoveries. As such, Alpha could be a useful tool, but I'm frankly skeptical whether it will become as important as Google or other search engines, as Nova claims. I don't know about you all, but I get as much from the process of discovery as I do from the result.


Nova released a second article on Wolfram Alpha, calling it an answer engine, as compared to a search engine. In fairness, Nova didn't use the term "Google killer", but stating the application could be just as important as Google does lead one to make such a mental leap. After all, we have human brains, and are flawed in this way.

As for artificial intelligence, I wrote my response to it on Twitter: It astonishes me that people spend years and millions on attempting to re-create what two 17 year olds can make in the back seat of a car.

Drupal and OpenID

I have been focused on OpenID implementations lately, specifically in WordPress and Drupal. The Drupal effort is for my own sites.

Until this weekend, I had turned off new user registration at my Drupal sites, because I get too many junk user registrations. However, to incorporate OpenID into a Drupal site, you have to allow users to register themselves, regardless of whether they use OpenID or not.

I think this all-or-nothing approach actually limits the incorporation of OpenID within a Drupal site. If you limit registration to administrators only, then people can't use their OpenIDs unless the administrator gets involved. If you allow people to self-register, there's nothing to stop the spammy registrations.

I believe that OpenID should be an added, optional field attached to the comment form, allowing one to attach one's OpenID directly to a comment, which would then create a limited user account within the site specifically for the purposes of commenting. Rather than just providing options to allow users to register themselves, or not, Drupal could add another set of options specific to OpenID, and allow us to filter new registrations based on the use of OpenID.
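As a rough illustration of the first half of that idea, a minimal Drupal 6 form alteration could add the field itself. This is a sketch under my own assumptions: the module name is hypothetical, and the hard part, wiring the field into OpenID authentication and creating the limited account, is left out:

    <?php
    // openid_comment.module (hypothetical name); sketch only.
    // Adds an optional OpenID field to the comment form for anonymous users.
    function openid_comment_form_alter(&$form, $form_state, $form_id) {
      global $user;
      if ($form_id == 'comment_form' && !$user->uid) {
        $form['openid_identifier'] = array(
          '#type' => 'textfield',
          '#title' => t('OpenID'),
          '#description' => t('Optional: sign this comment with your OpenID.'),
          '#weight' => -1,
        );
        // The redirect to the provider, verification of the response, and
        // creation of a comment-only account would still need custom
        // validate/submit handlers.
      }
    }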

Currently, the new user registration options in Drupal 6.x are:

  • Only site administrators can create new user accounts.
  • Visitors can create accounts and no administrator approval is required.
  • Visitors can create accounts but administrator approval is required.

Turn on the latter two options and you’ll get spammy registrations within a day. Not many, but annoying. I believe there should be a fourth and fifth option:

  • Visitors can create accounts using OpenID, only, and no administrator approval is required.
  • Visitors can create accounts using OpenID, only, but administrator approval is required.

With these new options, I could then open up new user registration for OpenID, but without having to allow generic new user registration for the account spammers that seem to be so prevalent with Drupal.

To attempt to implement this customized functionality at my sites, I’ve been playing with Drupal hooks, but the change is a little more extensive than just incorporating a hook handler and a few lines of code, at least for someone who is relatively new to Drupal module development like I am.

Taking the simplest route that I could implement as a stand-alone module, what I’m trying now is to modify the new user registration forms so that only the OpenID registration links display. You’ll see this, currently, in the sidebar if you access the site and you’re not logged in. Unfortunately, you have to click on the OpenID link to open the OpenID field, because I’m still trying to figure out how to remove the OpenID JavaScript that hides the field (there is a function to easily add a JavaScript library, but not one to remove an added library).
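One blunt workaround I may try, sketched below: Drupal 6 hands the fully rendered script list to the page template, so a theme's preprocess function can strip the OpenID script back out. The theme name is a placeholder, and the script path is the one I believe the core OpenID module uses, so verify both:

    <?php
    // In the theme's template.php ('mytheme' is a placeholder name).
    // Removes the <script> tag for the OpenID module's JavaScript,
    // which is what hides the identifier field behind a click.
    function mytheme_preprocess_page(&$variables) {
      $variables['scripts'] = preg_replace(
        '#<script[^>]+modules/openid/openid\.js[^<]*</script>\s*#',
        '',
        $variables['scripts']
      );
    }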

With my module-based modifications, rather than a person having to click a link to create a new account, and specify a username and password, they would provide their OpenID, and I would automatically assign them a username via autoregistration. To try my new sidebar module, I decided to turn my Drupal sites into OpenID providers, as well as clients, and use one of them as a test case. Provider functionality is not built in, but there is an OpenID provider module, which I downloaded and activated with my test Drupal installation (MissouriGreen).

I tried my new module and OpenID autoregistration, but ran into a problem: the Drupal client does not like either the username or the email provided via the Drupal OpenID provider. Why? Because the OpenID identifier used in the registration consists of the URL of the provider, which is the URL of the Drupal site I used for my test, and the Drupal client does not like my using a URL as a username. In addition, the provider also didn't provide an email address.

Digging into the client-side code, I discovered that the Drupal OpenID client supports an OpenID extension, Simple Registration. Simple Registration provides for an exchange of the nine most requested pieces of information between the OpenID client and provider: nickname, email, full name, dob (date of birth), gender, postcode, country, language, and timezone. With Simple Registration, you can specify which of the items are optional and which mandatory, and the current OpenID client wants nickname and email.
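On the wire, Simple Registration is nothing more than a few extra parameters tacked onto the OpenID authentication request, roughly like this (values illustrative):

    openid.ns.sreg=http://openid.net/extensions/sreg/1.1
    openid.sreg.required=nickname,email
    openid.sreg.optional=fullname

A cooperating provider sends back matching openid.sreg.nickname and openid.sreg.email values with its positive response; it's that return leg that kept failing in my tests.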

By using Simple Registration on the provider, I could then provide the two things that my Drupal OpenID client wanted: nickname and email. Unfortunately, though, the current version of the OpenID provider doesn't support Simple Registration. I was a little surprised by this, as I had assumed that the Drupal OpenID provider would work with the Drupal OpenID client. However, OpenID is in a state of flux, so such gaps are to be expected.

Further search among the Drupal modules turned up another one, the Drupal Simple Registration module, which allows one to set the mandatory and optional fields passed as part of the OpenID authentication exchange. The only problem is that the OpenID Provider module doesn't incorporate any hooks that would allow the Simple Registration module to supply the Simple Registration data as part of the response. To add these hooks, the Simple Registration module developer also supplied a patch that can be run against the OpenID Provider code.

I applied the patch, opened the module code, and confirmed that it had been modified to incorporate the hook. I then tried using the Drupal site as OpenID provider again, but the registration process still failed. Further tests showed that the Simple Registration data still was not being sent.

All I really wanted to do was test the autoregistration process, so I abandoned the Drupal OpenID provider and decided to try out some other providers. However, I had no success with either my Yahoo account or my Google Gmail account, even though I believed both provided this functionality. The Yahoo account either didn't send the Simple Registration fields, or failed to send them in a manner that the Drupal OpenID client could understand. The Gmail account just failed, completely, with no error message specifying why.

I felt like Barbie: OpenID is hard!

I finally decided to use phpMyID, which is a dirt-simple, single-user OpenID application that you can host yourself. I had this installed at one time, pulled it, and have now re-installed it at my base burningbird.net root directory. I added the autodiscovery tags to my main web page, and uncommented the lines in the MyID.config.php file for the nickname, full name, and email Simple Registration fields. I then tried "https://burningbird.net" for OpenID autoregistration at RealTech. Eureka! Success.
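For anyone wanting to try the same thing, the two pieces involved look roughly like this. The values are placeholders reconstructed from memory, so check them against the phpMyID README:

    <!-- Autodiscovery tags in the identity page's head, pointing at phpMyID -->
    <link rel="openid.server" href="https://burningbird.net/MyID.config.php" />
    <link rel="openid.delegate" href="https://burningbird.net/MyID.config.php" />

    <?php
    // In MyID.config.php: uncomment and fill in the Simple Registration
    // fields the provider should return (illustrative values).
    $sreg = array(
      'nickname' => 'shelley',
      'fullname' => 'Shelley Powers',
      'email'    => 'shelley@example.com',
    );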

New user registration is still blocked at creation, but the site now supports autoregistration via OpenID. Unfortunately, registration spammers can still access the full account-creation page, so I can still get spammy registrations. However, I believe that this page can be blocked in my mandatory OpenID module with a little additional work (a sketch follows), at least until I can see about creating a module that actually adds the OpenID-only options I mentioned above. The people who generate spammy user account registrations could use OpenID themselves, but the process is much more complex, and a lot more controlled at the provider endpoint, so I think this will help me filter out all but the most determined spammy registrations.
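One possible shape for that block, as a sketch only: Drupal 6's hook_menu_alter can swap in a custom access callback on the registration page. The module name is hypothetical, and the assumption that the core OpenID module leaves its handshake data in $_SESSION['openid'] would need verifying against the module's source before use:

    <?php
    // openid_only.module (hypothetical): gate user/register behind OpenID.
    function openid_only_menu_alter(&$items) {
      $items['user/register']['access callback'] = 'openid_only_register_access';
    }

    // Allow the registration page only to anonymous visitors arriving
    // mid-OpenID handshake (assumes the OpenID module stashes its state
    // in $_SESSION['openid']; verify before relying on this).
    function openid_only_register_access() {
      return user_is_anonymous() && !empty($_SESSION['openid']);
    }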

Once all of this is working, I'll see about adding the OpenID login field to the comment form, rather than in the sidebar. If one wonders why there isn't more use of OpenID, one doesn't have to search far to find the answers. Luckily for Drupal users, OpenID seems to be an important focus of this week's DrupalCon in Washington, DC, including a specialized Code Sprint.