Adventures in XHTML

Recovered from the Wayback Machine.

During the recent lighthearted discussions revolving around IE8 and its faithful companion, Wonder Tag, a second topic thread broke out about XHTML. As is typical whenever XHTML is brought up, the talk circled around to the draconian error handling, or yellow screen of death, that kicks in when a browser encounters even a small, harmless-seeming discrepancy in a page’s markup.

However, the yellow screen of death is a product of how Firefox deals with problems, not handling that’s inherent to serving XHTML as application/xhtml+xml. Safari’s error handling is much less extreme: it attempts to render all of the ‘good’ markup up to the point where the ‘bad’ markup occurs.

Opera’s error handling is even more friendly. It provides the context of the error, which makes it the best tool for debugging a faulty XHTML page. You might say Opera is to XHTML as Firebug is to JavaScript. The browser also provides an option to reprocess the page as the more forgiving HTML.

To return to the discussion I linked earlier, in response to the mention of the draconian error handling, I wrote:

I can agree that the extreme error handling of the page can be intimidating, but it’s no different than a PHP page that’s broken, or a Java application that’s cracked, or any other product that hasn’t been put together right.

To which one of the commenters responded:

I don’t want to get off-topic either but I hear this nonsense a lot. You can’t simply compare a markup language with a programming language. They have very different intended authors (normal people versus programmers) and very different purposes.

I disagree. I believe you can compare a markup language with a programming language. Both are based on technical specifications, and both require an agent to process the text in a specific way to get a usable response. As with PHP or Java, you have to know how to arrange XHTML in order to get something useful. That HTML has a more forgiving processor than XHTML or PHP doesn’t make it less technical, just inherently more ‘loose’, for lack of a better term.

In my opinion, the commenter, Tino Zijdel, was in error on a second point, as well: markup isn’t specific to programmers. In fact, programmers are no better at markup than ‘normal’ people. Case in point is the error pages I’ve shown in this post.

As most of you are aware, I serve my pages up with the application/xhtml+xml MIME type. For those of you who have tried to access this site using IE, you’re also aware that I don’t use content negotiation, which tests to see if the browser is capable of processing XHTML and returns text/html if not.
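For those who haven’t run into it, content negotiation boils down to reading the browser’s Accept header and choosing a MIME type to match. The following is a minimal Python sketch of that idea, not what this site or any particular server does; the function name pick_content_type is made up for illustration, and the sketch only looks for an explicit application/xhtml+xml entry with a non-zero quality value.

```python
def pick_content_type(accept_header: str) -> str:
    """Return the MIME type to serve, based on a request's Accept header."""
    xhtml = "application/xhtml+xml"
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != xhtml:
            continue
        # Default quality is 1.0; q=0 means "explicitly not acceptable".
        quality = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            return xhtml
    # No usable XHTML entry: fall back to plain HTML.
    return "text/html"


# A browser that lists XHTML explicitly gets XHTML...
print(pick_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# ...and one that only sends a wildcard gets text/html.
print(pick_content_type("image/gif, image/jpeg, */*"))
```

A wildcard such as */* is deliberately not treated as support for XHTML here, since a browser can send it without being able to render XML.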

Before yesterday, I still served up the WordPress administration pages as text/html, rather than application/xhtml+xml. Yesterday I threw the XHTML switch on the administration pages as well, and ended up with some interesting results. For instance, both plug-ins I use that have an options page had bad markup. In fact one, a very popular plug-in that publishes del.icio.us links into a post, had the following errors:

The ‘wrap’ class name wasn’t in quotes.
Five input fields were not properly terminated.
The script element didn’t have a CDATA wrapper.
Attributes such as ‘disabled’ and ‘readonly’ were minimized: given as standalone names, with no value.
Two extraneous opening TR tags.
One non-terminated TR element.
Two terminating label elements without any starting tag.

For all of that, though, it didn’t take me more than about 15 minutes to fix the page, with a little help from Opera.
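You don’t need a browser to find this class of error, either. Any XML parser will refuse the same markup. As a rough sketch, assuming Python’s standard xml.etree.ElementTree module and a made-up helper name, find_first_error, the following reports the line and column of the first well-formedness problem, which is roughly the kind of context Opera gives you:

```python
import xml.etree.ElementTree as ET


def find_first_error(markup: str):
    """Return None if markup is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical fragments echoing the kinds of errors listed above.
samples = {
    "clean": '<div class="wrap"><input type="text" /></div>',
    "unquoted class": '<div class=wrap></div>',
    "unterminated input": '<div><input type="text"></div>',
    "standalone attribute": '<div><input disabled /></div>',
}

for label, fragment in samples.items():
    print(label, "->", find_first_error(fragment))
```

The parser stops at the first error, just as the browsers do, so a page with several problems has to be fixed and rechecked one error at a time.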

The WordPress administration pages work except for the Dashboard, where the version of jQuery that comes with WordPress didn’t seem to handle the Ajax calls to fill the page. I updated jQuery with the latest version, and the feed from the WordPress weblog shows, but not the other two items. At least, not with Firefox 3 or Safari, but all the content does show with Opera.

The Text Control plug-in had one minor XHTML error in the options page, but even when that was fixed, selecting a new text formatting option in the post doesn’t work–the selection goes back to the default. That one will end up being more challenging to fix, because I haven’t a clue what’s stopping the update.

WordPress does a decent job of generating proper XHTML content when using the default formatting. In fact the only problem I’ve had, other than when I embed SVG inline, was my own inaccurate use of markup. I used <code> elements, by themselves, when displaying block code. What I should have used is the <code> element wrapped in a <pre> element. When I do, the WordPress default formatting works without problems.

remove_filter('comment_text', 'wpautop', 30);
remove_filter('comment_text', 'wptexturize');
add_filter('comment_text', 'tc_comment');

My error, and the errors of the plug-in creators, all demonstrate that though programmers might be more familiar with the consequences of making a mistake with technical text, we don’t make fewer mistakes than anyone else when it comes to using web page markup. Our only advantage is we’re not as intimidated by pages with errors. Regardless of how they’re displayed, or of our relative technical expertise, though, these error messages aren’t necessarily a bad thing.

One of the advantages of serving the pages with application/xhtml+xml is that we catch mistakes before we serve the pages up to our readers. We definitely catch the mistakes before we release code that generates badly formed markup, or provide broken option pages to accompany our coded plug-ins. I can’t for the life of me understand why any programmer, web developer, or designer would want less than 100% accuracy from their web pages. That’s tantamount to saying, “Hire me. I write sloppy shit.”

Of course, being able to program can have advantages when working with XHTML, especially with many of today’s applications. WordPress does a good job at working in an XHTML environment, but not a great one. One example of where the application fails, badly, is in the Atom feed.

In Atom, WordPress outputs the HTML type as an attribute to many of the fields:

<summary type="<?php html_type_rss(); ?>">
<![CDATA[<?php the_excerpt_rss(); ?>]]></summary>
<?php if ( !get_option('rss_use_excerpt') ) : ?>

This is all well and good except for one thing: when the type is returned as ‘xhtml’, Atom feeds are supposed to use the following syntax for the content:

<summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">
...</div></summary>

This is an outright error in how the Atom feed is coded in WordPress. I’ve had to correct this in my own feed, and then remember not to overwrite my copy of the code whenever there’s an update. What the code should be doing is testing the type, and then providing the wrapper accordingly.
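The logic of “test the type, then provide the wrapper” is simple enough to sketch. What follows is Python rather than the PHP of the WordPress template, and it is only an illustration of the branching, not my actual patch: the function name atom_summary is made up, and the sketch assumes the content passed in for the ‘xhtml’ case is already well-formed.

```python
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"


def atom_summary(content: str, html_type: str) -> str:
    """Build an Atom summary element whose body matches its type attribute."""
    if html_type == "xhtml":
        # XHTML content goes inline, wrapped in a single namespaced div.
        # Assumption: content is already well-formed XHTML.
        return (
            f'<summary type="xhtml"><div xmlns="{XHTML_NS}">'
            f"{content}</div></summary>"
        )
    if html_type in ("html", "text"):
        # HTML and plain text are carried as escaped character data.
        return f'<summary type="{html_type}">{escape(content)}</summary>'
    raise ValueError(f"unsupported Atom text type: {html_type!r}")


print(atom_summary("<p>An excerpt.</p>", "xhtml"))
print(atom_summary("<p>An excerpt.</p>", "html"))
```

Escaping the ‘html’ case does the same job as the CDATA section in the WordPress template; either way, the markup reaches the feed reader as text rather than as child elements.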

A second issue with WordPress is more subtle, and has to do with that part of XML I don’t consider myself overly familiar with: character sets and encoding. As soon as I switched on XHTML at my old weblog, I started to have problems with certain characters in my comments, and had to adjust the WordPress comment processing to allow for UTF-8 encoding. As it is, I’m not sure that I’ve covered all the bases, though I haven’t had any recurrence of the initial problems.

However, during the XHTML discussion, Philip Taylor demonstrated another problem in the WP code, in this case sending through a couple of characters that the WP search function did not like.

I checked with one of my two XHTML experts, Jacques Distler (the other being Sam Ruby), and the characters were Unicode, specifically:

utf-8 0xEFBFBE = U+FFFE
utf-8 0xEFBFBF = U+FFFF

From Jacques I found that Philip likes the U+FFFE and U+FFFF Unicode characters because they’re not part of the W3C’s recommended regular expression for filtering illegal characters.
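To make the problem concrete: both byte sequences are perfectly valid UTF-8, but the characters they decode to fall outside the Char production in the XML 1.0 specification, so an XML parser rejects them. The Python sketch below strips anything outside that production. It is not the regular expression mentioned above, and not the code I added to WordPress; the name strip_illegal_xml_chars is made up for illustration.

```python
import re

# Everything NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# The two byte sequences decode cleanly as UTF-8...
query = b"se\xef\xbf\xbearch\xef\xbf\xbf".decode("utf-8")
# ...but the resulting U+FFFE and U+FFFF have to go before the text
# can be echoed back into a page served as XML.
print(strip_illegal_xml_chars(query))
```

A filter like this has to be applied everywhere user input is written back out, which is why one place in the code isn’t enough.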

Unfortunately, protecting against these characters in search as well as comments required code in more than one place and, in fact, hacking into the back end of WordPress. This is not an option available to someone who isn’t a programmer. However, this example doesn’t demonstrate that you have to be a coder to serve pages as XHTML; it demonstrates that applications such as WordPress have a ways to go before being technically, rather than just cosmetically, compliant with XHTML.

Having said that, I can almost hear the voices now: Why bother, they say. After all, no one uses XHTML, do they?

Why bother? Well, for one thing, XHTML served as XML provides a way to integrate other XML-based specifications into the page content, including in-line SVG, as well as MathML, and even RDF/XML if we’re so inclined. The point is, serving XHTML as XML provides an open platform on which to build. Otherwise, we’re dependent on committees to hash through what will or will not be allowed into a specification, based on one company or another’s agenda.

We can include SVG in a page using an object element, but we can’t integrate something like SVG and MathML together without the ability to include both inline. We certainly can’t incorporate SVG into the overall structure of the page, at least not easily, using separate files. There is no room in an HTML implementation for all the other XML-based vocabularies, and we can only cram so much into class attributes before the entire infrastructure collapses.
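What makes the inline mixing possible is nothing more exotic than XML namespaces: each vocabulary declares its own, and any XML processor can tell the elements apart. Here is a small Python sketch of that, using a made-up page and a made-up helper, vocabularies, that lists which namespaces a document draws on. It says nothing about how a browser renders the result, only that the structure is unambiguous.

```python
import xml.etree.ElementTree as ET

# A hypothetical page mixing XHTML, SVG, and MathML inline.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle and a fraction, inline:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="40" height="40">
      <circle cx="20" cy="20" r="15" />
    </svg>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mfrac><mn>1</mn><mn>2</mn></mfrac>
    </math>
  </body>
</html>"""


def vocabularies(markup: str) -> set:
    """Return the set of namespace URIs used by elements in the document."""
    root = ET.fromstring(markup)
    # ElementTree stores qualified names as "{namespace-uri}localname".
    return {
        el.tag[1:].split("}")[0]
        for el in root.iter()
        if el.tag.startswith("{")
    }


for uri in sorted(vocabularies(PAGE)):
    print(uri)
```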

No, we need both: an HTML implementation for those not ready to commit to an XML-based implementation, and XHTML for the rest of us.

During the recent discussions on IE8, several people asked Chris Wilson from Microsoft whether IE8 will support the application/xhtml+xml MIME type. So far, we’ve not had an answer. Whatever the company decides, though, XHTML is not going away. The HTML5 working draft, which was just released, is about a vocabulary, not a specific implementation of that vocabulary. Both HTML and XHTML implementations are covered in the document, though XHTML isn’t covered as fully because most of the aspects of processing XHTML are covered in other documents. At least, that’s what we’re being told.

What’s critical for the HTML5 effort is that browsers support both implementations. Even the smallest mobile device is not going to be so overburdened by the requirements that it can’t consume pages delivered up as proper XHTML. It’s a sure thing that handling clean markup takes fewer resources than handling a mess.

I’d also hate to think we’re willing to trade well designed and constructed web sites for pages filled with missing TR end tags, poorly nested elements, and unquoted class names, just because Microsoft can’t commit to the spec, and Firefox took the “bailing out now!” approach to error handling.
