December 13th, 2007

What does it take to convert your Wordpress weblog to XHTML?

First, the template has to be valid XHTML. One way to check is to make sure the page validates as XHTML before you actually start serving it as XHTML. I use an XHTML 1.1 DOCTYPE that supports MathML and SVG:


<!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
    "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">

I also add XHTML, SVG, and XLink namespaces:

<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:svg="http://www.w3.org/2000/svg"
      xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en">

When you validate the page, the validator will warn you that the DOCTYPE doesn't match the page's MIME type, but this shouldn't affect the validation itself. Just make sure that the validator is treating your page as XHTML.

The validator assumes the page is HTML because, at this point, the page is still served as HTML. Wordpress wants to serve pages as HTML. In fact, Wordpress fights you every step of the way when it comes to serving your page as XHTML. Luckily, there are nice people who build plug-ins to ensure your page is served as XHTML. However, not every browser supports XHTML. For those limited browsers, we have to serve the pages as HTML. If we don't, the limited browser (that would be IE) has a problem displaying the pages.

Testing to see what a browser can handle, and serving it a format it accepts, is known as content negotiation. You can implement content negotiation with .htaccess, but this approach doesn't work well with Wordpress. Instead, I use the m0n5t3's nest "content negotiation plug-in for Wordpress". I install it, activate it, and it manages the content negotiation for me, serving pages as XHTML to browsers that can handle it and as HTML to those that can't (IE).
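
For the curious, here is a minimal sketch of what this kind of negotiation might look like in PHP. It is not the plug-in's code. The function names accepts_xhtml and negotiated_content_type are made up for illustration, and the sketch assumes the only signal used is the browser's Accept request header:

```php
<?php
// Hypothetical helpers, not part of Wordpress or of the plug-in.

// Returns true when the Accept header explicitly lists
// application/xhtml+xml with a quality value greater than zero.
function accepts_xhtml($accept_header) {
    foreach (explode(',', $accept_header) as $range) {
        $parts = array_map('trim', explode(';', $range));
        $type = strtolower(array_shift($parts));
        if ($type !== 'application/xhtml+xml') {
            continue;
        }
        // The default quality is 1.0; q=0 means "not acceptable".
        $q = 1.0;
        foreach ($parts as $param) {
            if (stripos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }
        return $q > 0;
    }
    // A browser that sends only */* (or nothing) gets HTML.
    return false;
}

// Picks the Content-Type value to send for a given Accept header.
function negotiated_content_type($accept_header) {
    return accepts_xhtml($accept_header)
        ? 'application/xhtml+xml; charset=utf-8'
        : 'text/html; charset=utf-8';
}
```

To use something like this, you would call header('Content-Type: ' . negotiated_content_type($accept)) before any output is sent, with $accept taken from $_SERVER['HTTP_ACCEPT'] when it is set. Sending a Vary: Accept header as well would let caches know the response depends on the request's Accept header.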

To ensure the comments work, I added the following line to wp-comments-post.php before saving the comment:

$comment_content = mb_convert_encoding($comment_content, "UTF-8","auto");
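
The "auto" argument leaves the guess about the incoming encoding to mbstring. A slightly more cautious variation, sketched here as an assumption rather than something from my own setup, leaves text alone when it is already valid UTF-8 and only converts otherwise. The helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices:

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumes anything that is not valid UTF-8 is in $fallback encoding.
function to_utf8($text, $fallback = 'ISO-8859-1') {
    if (mb_check_encoding($text, 'UTF-8')) {
        // Already valid UTF-8 (this includes plain ASCII); leave it alone.
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}
```

This matters because a single invalid byte sequence in a comment is enough to make an XML parser reject the whole page.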

If you've followed my steps so far, congratulations! You're now serving your pages as XHTML. Now, go back through your archives. Be prepared for:

  • Yellow screen of death for Firefox
  • Opera's polite, "You're F**cked!" elegant gray
  • Safari's, "You're hurting me!" page
  • IE is reading the page as HTML, which means it doesn't care that your page is crappy.

I've had a weblog for years, other pages even longer. I have used old HTML, dated HTML, and good HTML, used badly. This means I have a lot of pages that will break when served as XHTML.

There might be *nice, automated applications that can fix all my bad uses of HTML. I've not tried to create such an application, nor have I found one. Instead, I fix pages manually when someone lets me know they've found a broken page. I also have an application that shows me which pages are broken. When I have time, I run it and fix the pages it flags.

The application I use to find bad XHTML pulls the content in from the Wordpress database:


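<?php
// Load Wordpress (for $wpdb) and the XhtmlValidator class.
require_once('./wp-config.php');
require_once('./XhtmlValidator.php');

global $wpdb;

// Fetch the ID and body of every published post, oldest first.
$sql = "SELECT ID, post_content FROM $wpdb->posts
        WHERE post_status = 'publish'
        ORDER BY ID ASC";

$lines = $wpdb->get_results($sql);
if ($lines) {
   foreach ($lines as $line) {
      // Use $post_id rather than $post, which Wordpress uses as a global.
      $post_id = $line->ID;
      // Wrap the content in a div so the fragment has a single root element.
      $data = "<div>" . $line->post_content . "</div>";
      $XhtmlValidator = new XhtmlValidator();
      if ($XhtmlValidator->validate($data) === false) {
         echo "Post $post_id <br />\n";
         $XhtmlValidator->showErrors();
      }
   }
}

?>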

As you can see from accessing the application, I still have work to do. I make use of a PHP class, XhtmlValidator, from Akelos Framework. It works nicely. Too nicely.
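
If you don't have the Akelos class handy, a rougher check is possible with PHP's bundled libxml functions. The sketch below is a swapped-in technique, not what my application does: it only tests whether a fragment is well-formed XML, which is the kind of error that triggers the browser error pages listed earlier. The function name xml_wellformedness_errors is hypothetical, and the sketch assumes the DOM extension is enabled:

```php
<?php
// Hypothetical helper: returns a list of XML well-formedness errors
// for an XHTML fragment, or an empty array when none are found.
function xml_wellformedness_errors($fragment) {
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    // Wrap the fragment so the parser sees a single root element.
    $doc = new DOMDocument();
    $doc->loadXML('<div xmlns="http://www.w3.org/1999/xhtml">' . $fragment . '</div>');

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = 'line ' . $error->line . ': ' . trim($error->message);
    }

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}
```

One caveat: because the fragment is parsed without the XHTML DTD, named entities such as &nbsp; are reported as errors here even though a browser that loads the DOCTYPE may accept them. Well-formed is also a weaker test than valid, so a post can pass this check and still fail the validator.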

Of course, the upside to all of this is that my new posts are valid XHTML, or I wouldn't be able to publish them. To keep it that way, I turn off WP formatting for those posts that Wordpress formats incorrectly. For instance, I can't use Wordpress default formatting when I use CODE elements, because WP wants to insert inappropriate paragraph tags.

Is it work? Yes, but when you're done you know, without a doubt, that all your i's are dotted, your t's crossed. You also know that you can add trees.

[SVG image: Christmas Tree, by Aaron Spike]

And cute, cuddly bears.

[SVG image: bears]

And choo-choo trains.

[SVG image: train]

Which, unfortunately, you can't see if you're using IE. They're cute, take my word for it. And semantical, too, thanks to RDF embedded with the image. All allowed, because the page is served up as XHTML.

(SVG images from Wikipedia. Artists: Aaron Spike, Richard Thompson, and Jarno Vasamaa)

(Per Sam Ruby, HTML5Lib should be able to fix the XHTML.)
Comments
1
Sam Ruby - 9:43 pm 12/13/2007

In html5lib, you can find a utility that will convert any web page into XHTML. It first builds a DOM from the page using all the quirks and implicit rules that HTML has built up over the years. Then it simply serializes the DOM.

It is available in both Ruby and Python flavors. Simply pass in an -x option. Thus: bin/html5 -x http://google.com/ and python parse.py -x http://google.com/

2
Shelley - 9:47 pm 12/13/2007

Thanks, I'll check that out Sam. If it can unquirk my HTML, it's got to be good.

3
Sam Ruby - 9:39 am 12/14/2007

By the way, your URI is right on target. This particular post caused the feedvalidator to misbehave and the universal feed parser to outright break.

In the case of the feedvalidator, it saw some elements that it thought it understood (like rdf:*, dc:*, cc:*) in contexts where it wasn't expecting them, and reported these instances as errors. I've committed the fix, but if you want to see the problem, these changes aren't scheduled to be deployed for nearly three hours. After the fix, there still are a few warnings that I need to stomp on, but those can wait.

The feed parser, on the other hand, saw dc:title and dc:description in the middle of atom:content and decided to chase its own tail, eventually resulting in stack overflow. Also fixed.

Thanks! (I mean it)

By the way, one difference between HTML and XHTML shows up on this page. If a <pre> tag in HTML is immediately followed by a newline character, that character doesn't make it into the DOM. In XHTML, it does.

4
Shelley - 10:09 am 12/14/2007

Good to hear, I had a feeling some validators and/or feed machines and planet software might have some problems with this one. This is also testing SVG for all the browsers, all of which have work they need to do. But now, they have test cases.

(you should click the comment link to this page with Firefox 2 to get a gurgle.)

5
Sam Ruby - 11:20 am 12/14/2007

I had a feeling some validators and/or feed machines and planet software might have some problems with this one.

Are you aware that your feed switched back to icky poo RSS 2.0?

6
Shelley - 11:25 am 12/14/2007

It's that damn hard coding in Wordpress. I forgot and overwrote the application where I made the change.

later…all fixed now

7
Sam Ruby - 2:58 pm 12/14/2007

It's that damn hard coding in Wordpress.

Progress is being made.

8
Shelley - 3:18 pm 12/14/2007

Good for you, Sam. That will make a significant difference.

9

I'm pretty sure html5lib doesn't belong to Google.

10
ralph - 10:20 pm 12/14/2007

And then, once you've succeeded at all this, Shiira will take your XHTML served as application/xhtml+xml and display it as an XML tree rather than as a web page.

Ah well, that's what I get for using an obscure browser….

11
Shelley - 12:07 am 12/15/2007

Thanks Hiếu.

Sorry, Ralph. The content negotiation should work with all browsers.

12
ralph - 6:35 am 12/15/2007

It's working. Shiira says it accepts application/xhtml+xml. Your site sends application/xhtml+xml. The content negotiation works. And then it all falls down thanks to a bug in Shiira.

It's probably a measure of how few sites out there send application/xhtml+xml that the bug hasn't been fixed. The first report of it in the thread I linked to was in July. :-(
