How communication fails

I need to finish my “Semiotics of I” essay with its discussion of URIs, representations, and self (“I am linked, therefore I am”). However, the weather saps my energy as we enter our fifth day of hot weather alerts. Rather than profound writing on web esoterics, I’ll be happy if I can actually manage to get my clothes to the laundry room this morning.

Speaking of semantics, there’s an interesting thread over at the Pie/Echo/Atom syntax email list. The thread started innocently enough with Simon Willison:

Tim just mentioned a mandatory order for the <issued>, <modified> and <created> elements, hence my question. Will the final Atom specification include text along the lines of “client implementations MUST reject Atom feeds if they are invalid”.

The thread then spiraled wildly into discussions of well-formed XML versus badly formed HTML, sensible suggestions interspersed with the geek equivalents of “Yo dog’s a bitch and so’s your mama”.

However, a couple of comments arose on the thread that are worth yanking out of geekland and talking about openly. The first has to do with validity of data, not just validity of syntax. The second has to do with error notification.

One suggestion being circulated is that when an aggregator tries to consume an invalid Pie/Echo/Atom syndication feed, an email or some other notice is sent to the producer of said feed, telling them to fix their broken feed. This sounds feasible until you start looking at what happens in the real world.

For many webloggers, the feeds we produce are ones we’ve added to our tools following one person or another’s instructions. Most people provide the feed primarily because they’ve been asked to, and have only a small understanding of what the template tags and the XML mean. Many of us have tweaked our feeds, such as my removal of the content:encoded element because I don’t publish my content in its entirety. Any one of these actions can introduce errors.

Now, consider the scenario: your feed is accessed by, let’s say, 100 aggregators, because you have 100 people subscribed to your feed. Each aggregator accesses the feed once per hour. Do the math: exactly how many email messages are going to be generated in one single day based on one simple, easy-to-make mistake? I wasn’t aware that spam is an effective tool for helping people correct their mistakes.
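
For anyone who would rather not do the math by hand, here is a minimal sketch of it. The function name and its parameters are made up for illustration, and it assumes the worst case: every aggregator polls on schedule and sends one notice each time it hits the broken feed.

```python
def error_notices_per_day(subscribers, polls_per_hour=1, hours=24):
    """Count the notices a broken feed would trigger in one day.

    Assumption (hypothetical behaviour): every aggregator polls on
    schedule and sends exactly one notice per failed poll.
    """
    return subscribers * polls_per_hour * hours


print(error_notices_per_day(100))   # 2400
```

That is 2,400 messages in a day for a modest feed, and the number grows linearly with both the subscriber count and the polling rate.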

Simon Willison recognized this as a problem:

There’s also the problem of what could amount to a distributed DoS on anyone with a lot of traffic who accidentally invalidates their feed. Can you imagine if someone with a thousand subscribers dropped an unescaped ampersand in to their Atom feed? Within the hour they would have 1,000 error reports to wade through (assuming all aggregators followed the report-error part of the standard).

However, Simon then proposed acceptance of another idea:

A better practical solution is probably to follow Bill Kearney’s example in having a big directory of Atom feeds which publically flags any that are broken, gently embaressing the owner in to fixing the feed.

What did Bill Kearney say? The following:

Ignorance we can help with decent documentation and friendly validators.
Laziness we can combat with a rigorous validator and, frankly, fear of exposure.
Should folks find themselves desparate to remain ignorant and lazy, well,
they’re more than welcome to use a spec that better suits them. It’s been my
experience, however, that by educating people and setting good examples they do tend to come around..

This is probably the first time I’ve ever heard ‘embarrassment’ and ‘fear of exposure’ used as effective solutions to a technical problem.

Tim Bray wrote an essay on this, but he’s confused the types of error handling, as others on the list have done, and that leads me to my next and more serious concern: validating the data rather than validating the syntax. Asserting that the syntax is valid and well-formed XML is one thing, but start validating the data delimited by the syntax, and that’s where the problems are going to arise.

Sam mentions that the Pie/Echo/Atom validator has now been extended to check for dates:

Recently, the validator was improved to check for dates like February
30th. Within days, a feed was caught with this problem.

Well, that’s cool – but what does this have to do with the syntax? What if I want to generate a feed that has February 30th, as a joke or because I’m feeling contrary? No harm to the Pie/Echo/Atom syntax, is there? Not even the RDF Validator – and we all know that RDF is complex and just full of meaning – checks the data contained within the syntax requirements.
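
The difference between the two kinds of checking is easy to see in a few lines of Python. This is only a sketch: the sample entries are invented, the `issued` element name is borrowed from the thread, and the two helper functions are hypothetical rather than anything the real validator does. The first check asks only whether the XML parses; the second goes on to judge the data inside it.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Invented sample entries, not taken from any real feed.
FEB_30_ENTRY = "<entry><title>Joke</title><issued>2003-02-30</issued></entry>"
AMPERSAND_ENTRY = "<entry><title>Salt & pepper</title></entry>"


def is_well_formed(xml_text):
    """Syntax check: does the text parse as XML at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def has_real_calendar_date(xml_text, tag="issued"):
    """Data check: is the YYYY-MM-DD value a date that exists?"""
    text = ET.fromstring(xml_text).findtext(tag)
    try:
        year, month, day = (int(part) for part in text[:10].split("-"))
        date(year, month, day)  # raises ValueError for February 30th
        return True
    except (TypeError, ValueError):
        return False


print(is_well_formed(FEB_30_ENTRY))          # True
print(has_real_calendar_date(FEB_30_ENTRY))  # False
print(is_well_formed(AMPERSAND_ENTRY))       # False
```

The February 30th entry is perfectly good XML; only the second, data-level check objects to it. The unescaped ampersand, on the other hand, is a genuine syntax error that any XML parser will refuse.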

Scott Johnson suggests we go even further because of a misuse of the language tag. He writes:

Something like 50+% of asian weblogs are set to english when they display kanji.

There are linguistic algorithms that could be put into the validator as well as a user level prompt that asks them “Is this text in english” and if they answer No, it could deny the validation.

So when is technically correct but lying invalid?

Lying? After reading this I immediately went to my RSS 2.0 syndication template and changed the language to mn – Mongolian. Why? Because I’m both arbitrary and contrary. In other words, I’m a typical technology user.
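
To get a feel for what such a check might involve, here is a deliberately crude sketch. Everything in it is an assumption made for illustration: the function names, the 30% threshold, and the idea of guessing the script from Unicode character names. It is not a real linguistic algorithm and not anything proposed on the list.

```python
import unicodedata

# Character-name prefixes treated as "not English" for this toy check.
CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA")


def looks_like_cjk(text, threshold=0.3):
    """Guess whether a meaningful share of the letters are CJK or kana."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cjk = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(CJK_PREFIXES)
    )
    return cjk / len(letters) >= threshold


def language_tag_is_suspicious(declared_lang, text):
    """Flag text declared as English that appears to be written in CJK."""
    return declared_lang.lower().startswith("en") and looks_like_cjk(text)


print(language_tag_is_suspicious("en", "今日は天気がいいですね"))   # True
print(language_tag_is_suspicious("mn", "Just another weblog."))  # False
```

The heuristic catches the mislabeled Japanese text, but English text tagged as Mongolian sails straight through. A check like this can only ever police the mismatches someone thought to write a rule for.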

Heads up, Alpha Geeks: you forgot one rule, one important lesson: know your customers. Don’t assume that the recipient of the ‘bad feed’ email is going to be a commercial feed provider, or someone who even gives a shit whether the feed is accurate or not – they’re only providing it because they were asked. Additionally, don’t assume that your rules over the syntax of the feed entitle you to impose rules on the data of the feed, outside of those that are essential for the syntax. The more rules you add to Pie/Echo/Atom, the more rules are going to be broken.

(By the way – PEAW? You all are using a word that looks suspiciously like “pugh” for a name now?)