Morphing URLs

Recovered from the Wayback Machine.

I signed up at Blogrolling.com to manage my blogroll, and you can see the results on this page. Scroll down and you’ll see the ten most recently active webloggers in my virtual neighborhood. Click the “more…” link and you’ll go to my Blogroll page.

I’m using the blogrolling.com feed in a couple of different ways. I’m using the raw PHP feed in this page, because it’s simple to process. However, I modified the code of the feed to display only the ten most recent updates. I’d create another list, instead, and limit it to the most recent ten (a feature at blogrolling.com), but that’s only for those who have paid, and money is in short supply at the moment. So I tweaked the code on my own.

In the Blogroll page, I’m accessing the feed as RSS, and then using the PHP XML classes to process the data. By doing so, I can access the individual elements of the feed, such as the URL of the weblog, which I then use with my new Talkback feature.
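To give a rough idea of what pulling individual elements out of the feed looks like, here is a minimal sketch. It is written in Python rather than the PHP XML classes the Blogroll page actually uses, and the sample feed, its item layout, and the function name blogroll_entries are all assumptions made up for illustration; the real blogrolling.com feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the real feed's layout is not shown in this post.
SAMPLE_RSS = """<rss version="0.91">
  <channel>
    <title>Sample blogroll</title>
    <item>
      <title>My Weblog</title>
      <link>http://www.myweblog.com/</link>
    </item>
    <item>
      <title>Another Weblog</title>
      <link>http://anotherweblog.com</link>
    </item>
  </channel>
</rss>"""


def blogroll_entries(rss_text):
    """Return a list of (title, url) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        # findtext returns None when the element is missing, so fall back to "".
        title = (item.findtext("title") or "").strip()
        url = (item.findtext("link") or "").strip()
        entries.append((title, url))
    return entries


for title, url in blogroll_entries(SAMPLE_RSS):
    print(title, url)
```

Once the URL is available as its own value, it can be handed to something like Talkback as a lookup key, which is exactly where the URL variations discussed below start to matter.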

(I’m thinking of accessing the RSS feed in this page and then caching the feed locally, to be used by the blogroll page, and lower the number of hits against the blogrolling.com site. We’ll see.)

Blogrolling.com makes use of changes.xml from weblogs.com to check for recently updated weblogs, a feature I incorporated into my blogroll. I really appreciate this, because it lets me see who’s updated without having to use an RSS aggregator, something I’m not fond of.

The problem, though, is that we’re inconsistent in how we format URLs. For instance, a person might update weblogs.com as “http://www.myweblog.com/”, but a blogrolling.com customer adds them as “http://myweblog.com”. These are two different URLs, syntactically, even though they point to the same weblog. Unfortunately, then, when the person updates their weblog, they’re not floating to the top of my blogroll.

The root cause is that we all have different understandings of how a URL works, and of what a URL does and doesn’t need. Time for URL 101, I think.

First, the ‘www’ that is so common in most URLs today. Originally, the ‘www’ part of a URL stood for the hostname of the server on which the website lived. The complete name, ‘www.myweblog.com’ then translated into a specific IP (via DNS lookup of the domain) and a specific server.

Things have changed quite a bit and we now have something called virtual hosting. Among other things, virtual hosting lets you create a sub-directory, such as (web server basepath)/weblog, and have the web server map weblog.domainname.com to that sub-directory. For instance, I have the following sub-directories, each of which is paired with the mapped subdomain:

basepath/weblog – weblog.burningbird.net
basepath/rdf – rdf.burningbird.net
basepath/articles – articles.burningbird.net
basepath/www – www.burningbird.net
and so on.

The last one in the listing shows www.burningbird.net, but I don’t have to use “http://www.burningbird.net” to get to my top-level web site; I can use “http://burningbird.net”. The reason is that, within my web server configuration files, the URLs “http://burningbird.net” and “http://www.burningbird.net” map to the exact same sub-directory, the one named ‘www’. You’ll find with most modern web installations that “http://www.domainname.com” and “http://domainname.com” map to the same sub-directory on the server (something you can easily check through your browser).
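The host-to-directory mapping can be pictured as a simple lookup table. The sketch below is only an illustration in Python, not real web server configuration: the VHOSTS table and the document_root function are hypothetical names, and the entries simply restate the listing above, with the bare domain added as an alias for the ‘www’ directory.

```python
# Hypothetical table standing in for the web server's virtual host settings.
VHOSTS = {
    "weblog.burningbird.net": "basepath/weblog",
    "rdf.burningbird.net": "basepath/rdf",
    "articles.burningbird.net": "basepath/articles",
    "www.burningbird.net": "basepath/www",
    "burningbird.net": "basepath/www",  # bare domain aliased to the same directory
}


def document_root(host):
    """Return the sub-directory serving this host name, or None if unmapped."""
    # Host names are case-insensitive; drop any ":port" suffix before the lookup.
    name = host.split(":")[0].lower()
    return VHOSTS.get(name)


print(document_root("www.burningbird.net") == document_root("burningbird.net"))
```

A host with no entry in the table returns None, which corresponds to the case where leaving off the ‘www’ gets you an error instead of a page.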

Just think: All that time you’ve been typing in ‘www’, when you could have saved keystrokes. Why, you probably could have saved enough time to go and buy a Krispy Kreme.

(Note, though, that this mapping isn’t consistent, and you may actually get errors if you omit the ‘www’. Don’t you love individualism in web access?)

So the use of ‘www’ isn’t mandatory. Neither is the trailing forward slash (‘/’) that you’ll see some people put at the end of a URL.

In olden times, a trailing slash at the end of the URL signalled that you were accessing a directory, not a file, and it saved the browser a second trip to the server to determine this. All modern browsers now treat “http://yourdomain.com” and “http://yourdomain.com/” as the same, so at the top level you don’t get any performance benefit from the slash. If your weblog is in a sub-directory, though, such as “http://yourdomain.com/somedirectory/”, you will usually still get a performance benefit from the trailing slash, because without it the server typically answers with a redirect to the slashed URL.

However, the use of the trailing slash is one more difference in our URLs. At this point we have the following variations all pointing to the same web page:

http://www.yourdomain.com
http://www.yourdomain.com/
http://yourdomain.com
http://yourdomain.com/

But there’s yet another variation — specifying a file, explicitly.

For most of us, our weblogs are located in a page named ‘index.someextension’. It could be ‘index.html’ or ‘index.htm’ or ‘index.php’ and so on, but it is the index file, which is the default file to load when a directory is specified without a file name (this differs slightly based on web server and configuration).

To load my weblog, you can access “http://weblog.burningbird.net”, and you’ll get “http://weblog.burningbird.net/index.php”, because my web server is configured to look for files in the following order:

index.html
index.htm
index.php
and so on

As long as I don’t accidentally include an ‘index.html’ file in my directory, you’ll get the index.php page instead.
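The lookup order can be sketched in a few lines. This is an illustrative Python model of the behavior described above, not the web server's actual code; INDEX_ORDER and pick_index are hypothetical names, and the order is the one listed for my server (yours may differ).

```python
# The search order described above; real servers make this configurable.
INDEX_ORDER = ["index.html", "index.htm", "index.php"]


def pick_index(files_in_directory, order=INDEX_ORDER):
    """Return the first default file name found in the directory, or None."""
    present = set(files_in_directory)
    for name in order:
        if name in present:
            return name
    return None


# With only index.php present, that is the page served.
print(pick_index(["index.php", "about.html"]))
# An accidental index.html wins, because it comes earlier in the order.
print(pick_index(["index.php", "index.html"]))
```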

Because you’re only specifying the directory, not the file name explicitly, I can change the type of file, from index.html to index.php, and you don’t have to change your links to me. In fact, if a person is using the default ‘index’ file name, you shouldn’t include it in your blogroll link to them, because the link will break if they move to a new file format.

However, we now have yet another variation of the URL:

http://www.yourdomain.com
http://www.yourdomain.com/
http://www.yourdomain.com/index.html
http://yourdomain.com
http://yourdomain.com/
http://yourdomain.com/index.html

All in all, our use of URLs is about as distinct as we are, and I’m amazed that the bubble-up feature of blogrolling.com works at all.
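One way a tool could cope with these variations is to reduce each URL to a comparison key before matching. The following is a minimal sketch in Python, under loud assumptions: that ‘www.’ and the bare domain point to the same site (which, as noted earlier, isn’t always true), and that the default file is one of the three index names mentioned above. The names weblog_key and INDEX_NAMES are made up for this example, and nothing here describes how blogrolling.com or weblogs.com actually compare URLs.

```python
from urllib.parse import urlsplit

# Assumed default file names; a real server may use others.
INDEX_NAMES = {"index.html", "index.htm", "index.php"}


def weblog_key(url):
    """Reduce a weblog URL to a key for comparison only, not for fetching."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""  # urlsplit lowercases the host name
    # Assumption: 'www.' and the bare domain serve the same site.
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    # Drop a default index file name from the end of the path.
    head, _, last = path.rpartition("/")
    if last.lower() in INDEX_NAMES:
        path = head
    # Ignore the trailing slash.
    path = path.rstrip("/")
    return host + path


variants = [
    "http://www.yourdomain.com",
    "http://www.yourdomain.com/",
    "http://www.yourdomain.com/index.html",
    "http://yourdomain.com",
    "http://yourdomain.com/",
    "http://yourdomain.com/index.html",
]
print({weblog_key(u) for u in variants})
```

The key is deliberately lossy: it throws away the scheme, query string, and fragment, so it should only be used to decide whether two links refer to the same weblog, never as a URL to request.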

To attempt to work around these challenges, I added people to my blogrolling.com list when they showed up on weblogs.com, using the URL format they used with their pings. In addition, I checked the person’s perma-links, to see if they used ‘http://www.domainname.com’ or ‘http://domainname.com’, and so on. It became a treasure hunt in a way, but the golden egg in this hunt is a URL that correctly bubbles up when the person updates.

BUT…

This has left my Talkback feature in a difficult state. The reason is that the URL you use to ping weblogs.com, usually generated by your weblogging tool, isn’t always the same URL you used in my comments. So, you might bubble up to the top of my blogroll, but querying Talkback for the URL supplied by blogrolling.com results in no comments showing.

Pain in the butt.

What we need is consistency. Perhaps we need a URL cleanup day, to clean up the URLs we use in our blogrolls. And a common guideline for URL usage, such as the following:

Use ‘www’ only if you need to. You don’t need to use ‘www’ unless your page doesn’t resolve without it.
Use the default ‘index.extension’ filename for your weblog main page.
If the default filename is used, don’t include it in the blogroll link. Otherwise you’re putting a burden on the weblogger to use redirection if they want to change to a different page format.
Use the same URL in your comments that you use when pinging weblogs.com or blo.gs. In fact, be consistent with your weblog URL regardless of where you use it.