
Smart URLs, converting from MT to WP, and die, URL, die

Recovered from the Wayback Machine.

I am in the midst of trying to salvage weblog entries that have gone through many variations of URL identification as I've passed from tool to tool, and through many variations of what is, subjectively, a good URL naming strategy. At the same time, I am also dealing with years-old links to material so far out of date that it's laughable to even think about having 'this is dead' notices for it.

The problem with old URLs started becoming extreme enough at my site for me to write an application, PostCon, which I've talked about previously. PostCon provides the ability to selectively annotate the information returned for old URLs that have been pulled, or to manage URL movement. All well and good – but I knew I would ultimately reach the point of having to just let some URLs die a natural death.

Tim Berners-Lee has stated that Cool URIs don't change, but he said this back in 1998, when the Web was only a few years old and we thought that the inherent goodness of the Web was based on accumulated knowledge. Now, over a decade after the Web's birth, we're finding that the Internet is an ocean and URLs are rocks around our necks, and with each passing year the water is getting higher.

I had a domain, yasd.com, which I'd owned for years, and I'd accumulated a vast number of URLs to funky (in the bad sense) material within that domain. The content those URLs reference is badly outdated, much of it tied to long-dead technology. There were example pages for dealing with beta versions of Navigator and IE, for handling cross-browser differences, and so on. None of the examples have worked for years, and of the few I managed to pull along from version to version, I finally gave up when Mozilla seemed to splinter into many sparkly pieces. There are now so many different browser/operating system pairs that the only way you can hope to survive is to work to the most common standards (not necessarily the newest, or even the best).

The yasd.com domain was also tainted a long time ago, because there are so many variations of what 'YASD' means. For instance, a popular meaning for YASD is 'Yet Another Sudden Death', a gaming term, and it is through this that I started getting so much of my email spam: kids were using the domain, yasd.com, as a phony sign-up address whenever they wanted a throwaway email address.

Rather than continuing to renew yasd.com, and dynamicearth.com, and p2psmoke.org year after year, just to maintain that URL ‘coolness’, this year I’m letting them go.

(The moment I released yasd.com, the email spam coming into my email system fell by 80%.)

Now, before releasing these old domains, I could have set up permanent redirects from the old domain URLs to URLs on my new domain, and I suppose this would be the proper thing to do – but why? There is no value in this old material, and there is no additional value in posting a note saying, 'This material is out of date and no longer supported.' Though that message might be more meaningful than a generic 404 error, the benefit of providing it is offset by the cost of continually maintaining these old, old, old URLs. Doing so might be 'cool' – but there is no value in it, either to myself, to the search engines, or, ultimately, to the person arriving at my site from an old, old, old link.

(Unfortunately, rather than letting the URLs lapse gracefully into 404 status (and hence letting Google clean out its database), my registrar insists on holding on to the domain for a time to try to get me to renew it. So if you search on "C# book" and follow what was the Google link to this (third down from the top), you'll get a foolish registrar-generated page instead.)

That takes care of the old and useless, but what about the relatively new and possibly useful?

For the good URLs, ones to pages that still exist, I use rewrite rules in .htaccess wherever possible, and then use PostCon for the rest.

(The .htaccess file is a file consumed by the Web server with directives telling it how to manage specific page requests, including redirects from old page URLs to new. One directive provides a pointer to an error handler file or application that handles all ‘bad’ page accesses, and I use this to point to my PostCon application.)
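As a rough illustration of both kinds of directive (the paths and the handler name here are made up for the example, not my actual setup):

    # Map one old entry URL permanently to its new home
    Redirect permanent /archives/000123.php /archives/some_entry_title.htm

    # Hand every unmatched ('bad') request to an error-handling
    # application -- this is how a PostCon-style handler gets invoked
    ErrorDocument 404 /postcon/handler.php

Redirect comes from Apache's mod_alias, and ErrorDocument can point at a file or a script, which is what makes the PostCon approach possible.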

As for the many lives of my weblog URLs: I used .htaccess when I went from individual entry pages ending in .php to ones ending in .htm, and I used PostCon to manage the redirects when I went from numbered pages to 'cruft-free' URLs – URLs based on archival data and the post title. But now I'm faced with an interesting challenge.

When moving from Movable Type to WordPress, I went from a category-based archive to one based on the date. I could generate .htaccess entries for each file using Movable Type, and since I'm moving the archive location, the only .htaccess file that would be impacted by such a large number of redirects is the one in my old archive location.
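As an illustration of the idea, an MT index template along these lines could spit out one redirect per entry (the /weblog/ paths are placeholders, and I'm recalling the tag attributes from memory, so treat this strictly as a sketch):

    <MTEntries lastn="9999">
    Redirect permanent /weblog/<$MTEntryCategory dirify="1"$>/<$MTEntryTitle dirify="1"$>.htm /weblog/archives/<$MTEntryDate format="%Y/%m/%d"$>/<$MTEntryTitle dirify="1"$>/
    </MTEntries>

Notice, though, that MT's dirify gives you underscores on both sides, which leads straight to the next problem.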

However, a second problem arises with the conversion from MT to WordPress: the two products default to different separator characters when generating 'dirified' URLs. Movable Type uses the underscore ('_') for all of the replaced characters in a title, such as the spaces; WordPress uses the dash ('-').
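For example, an entry titled 'Smart URLs' would come out of the two tools along these lines (the paths are illustrative only):

    Movable Type (category-based): /weblog/technology/smart_urls.htm
    WordPress (date-based):        /weblog/archives/2004/02/smart-urls/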

(Though I appreciate the efforts undertaken, in my opinion the Atom effort would have paid for itself ten times over by now if, instead of focusing first on the syndication track (which, I hasten to point out, is now my new default syndication feed, so don't get pissy with me), it had focused on porting behavior – including an agreed-upon definition among the tools of what a 'cruft-free URL' is.)

There are page-specific and programming-specific ways of working around this issue, none of which entirely satisfies me, because I don't want to maintain all of the old files at the old location over time. What I can do is write code to create .htaccess entries (or PostCon entries) that map between the different filenames, including managing the underscore-to-dash conversion.
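Here's a minimal sketch of that kind of script, in Python rather than anything tool-specific. The CSV file and its columns are hypothetical, and the dirify function only approximates what the tools actually do:

    import csv
    import re

    def dirify(title, sep="-"):
        # Approximate the 'dirified' name the tools build from a title:
        # lowercase it, collapse non-alphanumerics to the separator
        slug = re.sub(r"[^a-z0-9]+", sep, title.lower())
        return slug.strip(sep)

    # Each row: old path, post date (YYYY-MM-DD), post title
    with open("old_entries.csv", newline="") as entries, \
         open("redirects.htaccess", "w") as out:
        for old_path, post_date, title in csv.reader(entries):
            year, month, day = post_date.split("-")
            new_path = f"/weblog/archives/{year}/{month}/{day}/{dirify(title)}/"
            out.write(f"Redirect permanent {old_path} {new_path}\n")

The output is just a stack of Redirect lines ready to drop into the .htaccess file at the old archive location.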

In addition, I may be able to create a rewrite rule that handles the conversion for me, including the move from category-based to date-based URLs (by discounting the categories and handling any overlapping titles individually), not to mention the underscore-to-dash change.
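For the underscore half, at least, something along these lines should work (a sketch assuming Apache's mod_rewrite, with /weblog/ again standing in for the real path):

    RewriteEngine On

    # Replace one underscore with a dash per pass; the [N] flag restarts
    # the rule set until no underscores remain
    RewriteRule ^(weblog/[^_]*)_(.*)$ $1-$2 [N]

    # THE_REQUEST holds the original request line, untouched by the
    # rewrites above, so this fires exactly once and can't loop
    RewriteCond %{THE_REQUEST} ^GET\ /weblog/[^\ ]*_
    RewriteRule ^(weblog/.*)$ /$1 [R=301,L]

Multiple underscores in one title get converted a pass at a time, and then a single permanent redirect goes out to the browser with the dashed URL.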

But then I’m faced with the decision: do I want to use underscores, or do I want to use dashes?

Further research shows that, supposedly, search engines see the underscore as part of the search phrase, while the dash is treated as nothing more than white space. On the other hand, others swear by the underscore, and feel that it makes for a more 'attractive' URL; they also state that smart search engine bots know how to handle both dashes and underscores.

(Oddly enough, much of this discussion is encapsulated in a forum thread having to do with pMachine's new ExpressionEngine application.)

I can always alter the code for WordPress to work with underscores instead of dashes, but do I want to?

Before I finish this last URL cleanup task, managing the weblog archive URLs, I seek further opinion from others:

In intelligent URLs, is it better to go underscore or dash?
