Weblog Links: Part 2—Re-weaving the Broken Web

Recovered from the Wayback Machine.

If you change your weblogging environment, such as moving to a different tool, a different archive structure, or even a different server and domain, and others have linked to your posts, you’re going to be leaving broken links behind. What can you do? Actually, quite a bit, depending on how important the continuous linkage is, and what change you’ve made.

(Note: All web server discussions in this essay are based on the assumption that you’re using Apache at either your old or new file location. If not, you’ll need to check your server’s documentation for instructions on using these techniques.)

First, though, a look at the toolbox you’ll be using.

404, links, htaccess, mod_rewrite, meta redirect, and code — oh my

Contrary to popular opinion, a missing page isn’t a flaw within the internet. I like Edd Dumbill’s view of the 404 error, given in his essay in defense of RDF:

The oft-cited innovation of the web is the 404 error: the ability for a page not to be there, and the system still work.

The 404 status is a legitimate web page status, and the web is designed to work with it. It is a tool rather than a point of damage, and should be treated as such.

Other simple tools are the hypertext link, as well as the web-page based META tag, which can be set up to redirect to a new location. The advantage of these is that you don’t need direct file or web server configuration access in your old environment, something you’re not likely to have with hosted solutions such as Blogger, Bloghorn, or TypePad.

However, some tools require the appropriate environment to work, including the aforementioned file access. For instance, web servers have the ability to include redirection within their configuration — something I won’t talk about here because most webloggers have no access to their web server’s configuration files. Most, though, have all the control they need as long as they can host a file — one file — at the old file location. Unfortunately, this is usually as inaccessible as the web server configuration with hosted environments.

If you’re moving pages within a server, or you have the ability to host files in your old location, then you can consider the use of the .htaccess file for most of your redirection needs. The .htaccess file (and note that the beginning period (‘.’) is required) is a text file containing instructions that override the web server’s normal page-serving behavior. The instructions can range from something as simple as sending 404 errors to a specific document, to something as complicated as preventing what is known as ‘hot linking’: people embedding your photos directly in their own web pages, and thereby increasing your bandwidth use.

Of course, if you do have access to web server configuration, you’re better off using this rather than .htaccess; once you turn on .htaccess support, the web server will then look for these in every directory, and every page request within a directory also causes the .htaccess document in that directory to be loaded. If your .htaccess file is large, and documents in the directory are requested frequently, the performance hit can be significant.

Most webloggers share home space with several other people, and direct web server configuration file access is usually discouraged. Additionally, it’s rare that our archived pages are accessed so frequently that the use of .htaccess is an issue. Still, you’ll always want to do what you can to keep the .htaccess files small.

Even within the .htaccess file, there are multiple ways to accomplish the same task, usually depending on whether your host has a certain extension, mod_rewrite, compiled into your web server.

If neither the web-page nor the web-server configuration solutions appeal, or you need more sophisticated file redirection, you can use code such as JavaScript, PHP, or Perl to redirect people or to handle missing pages.

Rather than write a tutorial of all these techniques, I’ll demonstrate uses of each in different circumstances. At a minimum, this will give you some idea of what your options are.

Changing Domain, Same filenames and archive structure

If you’re maintaining the same filenames and archive structure, but changing domains, you have two solutions depending on what your environment supports.

If you don’t have the ability to create an .htaccess file on your old host, then you can use the HTML META tag, with its http-equiv attribute set to refresh, to redirect people to the new location.

Mike Golby recently moved to the Wayward Weblogger co-op from Blogspot hosting, and he could use this approach with his old archives, which are maintained both at Blogspot and at his new location. For instance, this post at http://pagecount.blogspot.com/2002_12_01_pagecount_archive.html#90127536 could be redirected to the new location by adding the appropriate META tag to each of the old archive files. Since this is a template-based system, he would need to add a bit of code to the template and then regenerate the archives at the old site to generate the META tag:

<meta http-equiv="refresh" content="5;url=http://pagecount.burningbird.net/pagecount/2002_12_01_pagecount_archive.html">

When the page with this at the top is loaded into a browser, it automatically redirects the person to the new location after 5 seconds. The delay is necessary because, without it, the back button wouldn’t work. You’ve all been to sites that refresh to a different page, but then won’t let you use the back button to get out. You don’t like it, and your readers won’t either.
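If the old host lets you edit the archive files but offers no usable template tag, one option is to build the tag for each archive file with a small script and paste it in. The following is only a sketch of that idea in Python: the function name meta_refresh_tag and its parameters are made up for illustration, and it assumes the archive file names stay the same at the new location.

```python
from html import escape


def meta_refresh_tag(archive_file, new_base, delay_seconds=5):
    """Build a META refresh tag that points an old archive page at its new home.

    archive_file and new_base are illustrative parameters, not part of any
    weblogging tool's API.
    """
    # Join the new base URL and the unchanged archive file name.
    new_url = new_base.rstrip("/") + "/" + archive_file.lstrip("/")
    # Escape the URL so characters such as & or " cannot break the attribute.
    content = f"{delay_seconds};url={escape(new_url, quote=True)}"
    return f'<meta http-equiv="refresh" content="{content}">'


if __name__ == "__main__":
    print(meta_refresh_tag(
        "2002_12_01_pagecount_archive.html",
        "http://pagecount.burningbird.net/pagecount/",
    ))
```

Run against the archive file named above, this prints the same tag shown earlier, with the 5 second delay as the default.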

(I would show you the exact template tag and format to use, but I can’t find any that work. Blogger’s tags are very limited in their use, as well as not being especially well documented. Both items are covered in detail in part 3 of this series. If someone has working template code for this, please add it to the comments or send me a link.)

This won’t help with specific items, but at least you can get redirects at the file level. You should only use META tags if you have no other option: they require that the page be downloaded, and they don’t give your readers a smooth transition. A better approach would be to use a server-side technique, such as the .htaccess file.

If you have the ability to create a file in your old environment, you can create an .htaccess file in the top-level directory and use pattern or URL matching to redirect requests to the new location.

For instance, in part 1 of this series I mentioned John Robb’s weblog and its disappearance from its old location on the Userland servers. However, recent accesses show that the old server location has been modified to redirect requests from the old location to the new. Now, when I access an old link, such as http://jrobb.userland.com/2001/11/05.html, I’m redirected to the new location at http://jrobb.mindplex.org/2001/11/05.html.

To redirect all directories at the old site to the new one, the .htaccess file would contain the following line:

Redirect permanent / http://jrobb.mindplex.org/

When a page is accessed at the old site, this directive lets Apache know to send the request to the new URL, with the rest of the request maintained as is, which means that http://jrobb.userland.com/2001/11/05.html gets redirected to http://jrobb.mindplex.org/2001/11/05.html.
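To make the "rest of the request is maintained" part concrete, here is a rough Python sketch of the mapping the Redirect line performs. It is a simplification for illustration only, not how Apache is implemented: the function name redirect_prefix is invented, and Apache’s own matching has details (such as matching on whole path segments) that this ignores.

```python
def redirect_prefix(request_path, old_prefix, new_base):
    """Mimic a prefix redirect: swap the matched prefix, keep the remainder."""
    if not request_path.startswith(old_prefix):
        return None  # the directive does not apply to this request
    # Everything after the matched prefix is carried over unchanged.
    remainder = request_path[len(old_prefix):]
    return new_base + remainder


if __name__ == "__main__":
    print(redirect_prefix("/2001/11/05.html", "/", "http://jrobb.mindplex.org/"))
```

With the old prefix set to / every request on the old site matches, so the whole path is appended to the new address.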

As long as the .htaccess file is located on the old server, and John maintains the same archive structure, the page requests will be redirected to the correct location. But what happens if John changes the archive structure, such as moving to a different tool? Well, the options then depend on exactly what you’re starting with, and where you’re going.

Changing Archive Structure

Changing archive structure is either a very simple process, or virtually impossible to deal with. All of it depends on whether a pattern can be established between the old and the new locations.

As an example of a very simple change, when I first started with Movable Type I used to have my archive files sent to /archives, and my individual pages all had a .php extension. When one of the items was slashdotted, it came to my attention that putting a .php extension on a page, when I’m not really using PHP for anything specific, isn’t a good idea. When it comes to the CPU, waste not, want not.

I ended up regenerating my individual archive items to a new location, /fires, and also changed all their extensions to .htm. However, I’m still getting hits at the old location. What’s to do?

Again, .htaccess comes to the rescue. At my original host, mod_rewrite was not compiled into Apache, so I couldn’t use this feature, but I could use the RedirectMatch directive, which brings regular expressions to .htaccess redirects. It matches the request against a pattern and substitutes whatever the pattern captured into the address of the redirected page.

Easier just to demonstrate. The line added to .htaccess is:

RedirectMatch permanent ^/archives/(.*)\.php$ /fires/$1.htm

This line instructs the web server to redirect all requests for any PHP file located in the archives subdirectory to the same named file (that $1 parameter), but with an HTM extension, in the fires directory. Since the move is permanent, a status (301) is returned to the requesting browser or web bot saying that the move is permanent and that it should adjust accordingly.
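If regular expressions are new to you, the same pattern can be tried out in Python before it goes anywhere near the server. This is just a sketch for experimenting: redirect_match is a made-up helper, and the sample file names are invented.

```python
import re

# The same pattern used in the RedirectMatch line.
OLD_PATTERN = re.compile(r"^/archives/(.*)\.php$")


def redirect_match(request_path):
    """Return the new path for an old /archives/*.php request, else None."""
    match = OLD_PATTERN.match(request_path)
    if match is None:
        return None  # not an old-style archive page, so leave it alone
    # group(1) plays the role of $1 in the Apache directive.
    return f"/fires/{match.group(1)}.htm"


if __name__ == "__main__":
    for path in ("/archives/000123.php", "/fires/000123.htm", "/index.php"):
        print(path, "->", redirect_match(path))
```

Only the first sample path matches; the other two come back as None because they don’t sit under /archives with a .php extension.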

When I moved to the Wayward Weblogger co-op, I made sure the web server was compiled with mod_rewrite, and then converted the .htaccess file to:

RewriteEngine On
RewriteCond %{REQUEST_URI} \.php$
RewriteRule ^(.*)\.php$ http://weblog.burningbird.net/fires/$1.htm [R=301,L]

Why use mod_rewrite over the Redirect? Performance improvements, plus mod_rewrite gives you the ability to do more sophisticated manipulations than the other .htaccess directives such as redirect — including that aforementioned hot-link prevention to keep people from linking directly to your bandwidth expensive photos.

What happens, though, if the pattern matching can’t be done? Well, in that case what you can do is find a way to generate, or hand write, individual redirect statements for each page.

Liz Lawley was faced with this when she not only moved from her university address to her own domain, but also changed from MT’s default numeric file names to a format that has considerable popularity: archives in named subdirectories, with each file named after its entry title.

There was no regular expression matching that could handle this, so what she did, with advice from Mark Pilgrim, was to create an MT template to run on the new server that looked like the following:

<MTEntries lastn="999999">
Redirect permanent /archives/<$MTEntryID$>.html <$MTEntryLink$>
</MTEntries>

Once the .htaccess file was generated, Liz then moved it to her old server.
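If your tool can’t generate the file for you, the same list of Redirect lines can be produced by any script that knows the old entry IDs and the new addresses. The sketch below assumes you can export those pairs somehow; the entry IDs, the example.com addresses, and the function name redirect_lines are all hypothetical.

```python
def redirect_lines(entries, old_dir="/archives"):
    """Build one Redirect line per (old entry ID, new URL) pair."""
    lines = []
    for entry_id, new_url in entries:
        lines.append(f"Redirect permanent {old_dir}/{entry_id}.html {new_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data; real pairs would come from your weblogging tool.
    sample = [
        ("000101", "http://example.com/archives/tools/moving_day.html"),
        ("000102", "http://example.com/archives/links/broken_web.html"),
    ]
    print(redirect_lines(sample), end="")
```

The output can be saved as the .htaccess file for the old server, just as with the template approach.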

I wince at the size of the .htaccess file; remember, the web server loads it every time a page is accessed. However, since access to weblog archive pages drops off exponentially as soon as an item rolls off the front page, the performance shouldn’t be a problem.

Error Handler

What if you’re like me and started with a weblogging environment that allows no export? Or if you decide not to port your archive files, or to remove them? What are your options then?

If your tool doesn’t support any form of export, the only way you’ll be able to port these is manually. At this point, you may want to consider leaving some of your writing behind, or be in for a long, tedious effort of copy and paste. There are no other options with this type of tool, which is why I don’t recommend the use of such tools by anyone for any reason.

For existing posts that you want to remove, one option is to remove the content of the file, but leave the file itself and then add a message that the content has been removed. You might direct the person to the front page, or mention why the content was removed, and so on.

If you want to remove the file itself, or you don’t port it, you can create what is known as an error document and use the ErrorDocument directive in your .htaccess file:

ErrorDocument 404 /missing.php

This .htaccess file is located in the top-level directory, and so applies to all directories underneath it, unless overridden by local .htaccess files. Whenever a file is requested that doesn’t exist, the person is sent to a file instead, in this case missing.php. My Missing page has a form for searching, as well as a listing of recent entries across the network. Others have other useful information, but the point is to give your readers somewhere to go when they access a page that no longer exists.

Though the directive shows that I’m accessing the missing.php file directly with the ErrorDocument directive, in actuality, I’m accessing an application that does more than just redirect to a specific page. Note that this solution is only for those programmatically inclined.

Programmatic Solutions

Though not for everyone, and not for every circumstance, you can use programming languages to manage missing or moved files.

I am finishing up an application, PostCon, that I use to generate RDF/XML files that track the movement of files on my server. Because I’ve had web sites for several years, I’m still getting requests for very old code that I pulled long ago because it was obsolete. In addition, I’ve also moved files around to different locations, renamed them, and so on.

When I want to move or remove a file, I log into PostCon and pull the file’s RDF file up (found by inputting the current file name). I then annotate the current event — move, remove, or add for a new resource — and provide the location of the resource once the event occurs. PostCon generates an updated RDF file with the new history. More than that, though, it also annotates an event file — an RDF file of events that’s hosted on the server and that is used to kick off server side processing every hour.

For instance, a move event will result in the file being copied to the new location. Since I’m running Linux, though, it does a bit more: it creates what is called a soft link at the old file location that points to the new location, using the Unix command “ln -s newlocation oldlocation” (the file being pointed to comes first, and the name of the link second).

When the file is accessed at the old location, the link causes the file to be served from its new location, transparently to the client.
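The move-then-link step is easy to sketch outside of PostCon. The Python below is not PostCon’s code, just an illustration under a few assumptions: it moves the file rather than copying it, it runs on a Unix-like system where soft links are available, and the function name move_and_link and the demo paths are invented. Apache also has to be configured to follow soft links (the FollowSymLinks option) for the old address to keep working.

```python
import os
import shutil
import tempfile


def move_and_link(old_path, new_path):
    """Move a file, then leave a soft link at the old path pointing to it."""
    parent = os.path.dirname(new_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    shutil.move(old_path, new_path)
    # os.symlink(target, link_name): the link lives at old_path
    # and points to the file's new home.
    os.symlink(os.path.abspath(new_path), old_path)


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so no real files are touched.
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "archives", "post.htm")
        new = os.path.join(root, "fires", "post.htm")
        os.makedirs(os.path.dirname(old))
        with open(old, "w") as handle:
            handle.write("an old post")
        move_and_link(old, new)
        with open(old) as handle:  # reading the old path follows the link
            print(handle.read())
```

Reading the file through its old path still works after the move, which is the whole point of the link.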

A removal is a bit more complex because there are reasons for the removal, and these are added to the RDF file. In server-side processing, the file is removed, which is simple, but in addition, the removed file URL is added to a super fast database on the server that handles all 404 errors. When a request comes in for a file that no longer exists, the handler looks up the file URL in this database and either redirects the request to a newer resource that supersedes the old file, or loads the RDF file for the removed file and prints out some useful information about why the file was removed. You can see this in action using the URL /samples/scripting/TYPEOF.HTM (note that the look is an old one, and isn’t too pretty). What’s important is the information pulled from the RDF file giving more details about what happened to the file.
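The decision the 404 handler makes can be boiled down to a lookup. The following Python is a loose sketch of that logic only, not PostCon itself: two dictionaries stand in for the database and the RDF files, the paths and reasons are invented, and answering with a 410 (Gone) status for removed files is my assumption rather than something PostCon is known to do.

```python
# Hypothetical records; a real handler would query a database instead.
SUPERSEDED = {"/samples/old_script.htm": "/samples/new_script.htm"}
REMOVED = {"/samples/retired.htm": "Pulled because the code was obsolete."}


def handle_missing(request_path):
    """Return (status, headers, body) for a request that reached the 404 handler."""
    if request_path in SUPERSEDED:
        # A newer resource replaces the old one, so redirect to it.
        return 301, {"Location": SUPERSEDED[request_path]}, ""
    if request_path in REMOVED:
        # Deliberately removed: explain why instead of a bare error.
        return 410, {}, "This page was removed. " + REMOVED[request_path]
    # Nothing known about this address: fall back to the generic missing page.
    return 404, {}, "Page not found. Try the search form or the recent entries."


if __name__ == "__main__":
    for path in ("/samples/old_script.htm", "/samples/retired.htm", "/nowhere.htm"):
        print(path, handle_missing(path))
```

Keeping the lookup in one place means a move or removal only has to be recorded once, and every stale link to that file gets a useful answer.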

PostCon is an open source application with freely available source code, and its launch will coincide with an upcoming O’Reilly Network article. It’s built with a combination of PHP and Perl, and uses MySQL.

A programmatic solution isn’t for everyone. And sometimes the best solution for handling broken links is…do nothing.

Do Nothing

Doing nothing is a highly underrated solution when faced with moving from one weblogging tool or environment to another. When Mike Golby moved, he left his old archives at Blogspot and created new archives at the new location. On the old site, he posted a link to his new location. Simple, easy, and uncomplicated.

By the time that his old archive pages are purged from blogspot, references to his old location will probably be few and far between.

If you’re changing tools, but staying on the same server, again, just consider not doing anything. Requests for old pages go to the old pages, new requests go to the new, and life goes on. The only time that this becomes a problem is if you’re using comments and the comment systems of the two are incompatible. In this case, then, you might consider turning comments off for your old files.

Consider this: on average, I get about 10 comments per week on older archived entries, those more than a couple of weeks old. I get fewer than a couple a month for anything older.

Not doing anything can be the simplest, easiest, most efficient, least heartbreaking, least limiting approach you can use to deal with old and new archives. It can be the one that gets you going in your new home/tool faster, which means that you’re out writing rather than fussing with the tool.

And isn’t that why we’re here?

Addendum

If you have other broken-link, archive-redirect, or porting solutions, please leave a comment, ping this entry with your solution, or send me an email and I’ll include it here in this section.