Controlling your data

Popular opinion is that once you publish any information online, it’s online forever. Yet the web was never intended to be a permanent snapshot, embedding past, present, and future in unbreakable amber, preserved for all time. We can control what happens to our data once it’s online, though it’s not always easy.

The first step is, of course, not to publish anything online that we really want to keep personal. However, times change and we may decide that we don’t want an old story to appear in search engines, or MySpace hotlinking our images.

I thought I would cover some of the steps and technologies you can use to control your online data and media files. If I’ve missed any, please add in the comments.

Robots.txt

The grandaddy of online data control is robots.txt. With this you can control which search engine web bot can access what directory. You can even remove your site entirely from all search engines. Drastic? Unwelcome? As time goes on, you may find that pulling your site out of the mainstream is one way of keeping what you write both timely and intimate.

I discussed the use of robots.txt years ago, before the marketers discovered weblogging, and most people were reluctant to cut themselves off from the visitors arriving from the major search engines. We used to joke about the odd search phrases that brought unsuspecting souls to our pages.

Now, weblogging is much more widely known, and people arrive at our pages through any form of media and contact. In addition, search engines no longer send unsuspecting souls to our pages as frequently as they once did. They are beginning to understand and manage the ‘blogging phenomena’, helped along by webloggers and our use of ‘nofollow’ (note from author, don’t use, bad web use). Even now, do we delight in the accidental tourists as much as we once did? Or is that part of a bygone innocent era?

A robots.txt file is a text file with entries like the following:

User-agent: * Disallow: /ajax/ Disallow: /alter/

This tells all webbots not to traverse the ajax or alter subdirectory. All well behaved bots follow these, and that includes the main search engines: Yahoo, Google, MSN, Ask, and that other guy, the one I can never remember.

The place to learn more about robots.txt is, naturally enough, the robots.txt web site.

If you don’t host your own site, you can achieve the same effect using a META element in the head section of your web page. If you’re not sure where this section is, use your browser’s View Source capability: anything between opening and closing “head” tags is the head section. Open mine and you can see the use of a META element. Another example is:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This tells web bots to not index the site and not harvest links from the site.

Another resource you might also want to protect is your images. You can tell search engines to bypass your images subdirectory if you don’t want them picked up in image search. This technique doesn’t stop people from copying your images, which you really can’t prevent without using Flash or some other strange web defying move. You can, however, stop people from embedding your images directly in their web pages, a concept known as hot linking.

There are good tutorials on how to prevent hotlinking, so I won’t cover it here. Search on “preventing hotlinking” and you’ll see examples, both in PHP code and in .htaccess.

Let’s say you want to have the search engines index your site, but you decide to pull a post. How can you pull a post and tell the search engines you really mean it?

410 is not an error

There is no such thing as a permanent, fixed web. It’s as fluid as the seas, as changeable as the weather. That’s what makes this all fun.

A few years back, editing or deleting a post was considered ‘bad form’. Of course, we now realize that we all change over time and a post that seemed like a good idea at one time may seem a terrible idea a year or so later. Additionally, we may change the focus of our sites: go from general to specific, or specific back to general. We may not want to maintain old archives.

When we delete a post, most content management tools return a “404” when the URL for the page is accessed. This is unfortunate because a 404 tells a web agent that the page “was not found”. An assumption could be made that it’s temporarily gone; the server is having a problem; a redirect is not working right. Regardless, there is an assumption that 404 assumes a condition of being cured at some point.

Another 4xx HTTP status is 410, which means that whatever you tried to access is gone. Really gone. Not just on vacation. Not just a bad redirect, or a problem with the domain–this resource at this spot is gone, g-o-n-e. Google considers these an error, but don’t let that big bully fool you: this is a perfectly legitimate status and state of a resource. In fact, when you delete a post in your weblog, you should consider adding an entry to your .htaccess file to note that this resource is now 410.

I pulled a complete subdirectory and marked it as gone with the following entry in .htaccess:

Redirect gone /category/justonesong/

I tried this on an older post and sure enough, all of the search engines pulled their reference to the item. It is, to all intents and purposes, gone from the internet. Except…

Except there can be a period where the item is gone but cache still remains. That’s the next part of the puzzle.

Search Engine Cache and the Google Webmaster Toolset

Search on a term and most results have a couple of links in addition to the link to the main web page. One such link is for the cache for the site: a snapshot of the the last time the webbot stopped by.

Caching is a handy thing if you want to ensure people can access your site. However, caching can also perpetuate information that you’ve pulled or modified. Depending on how often the search engine refreshes the snapshot, it could reflect a badly out of date page. It could also reflect data you’ve pulled, and for a specific reason.

Handily enough as I was writing this, I received an email from a person who had written a comment to my weblog in 2003 and who had typed out his URL of the time and an email address. When he searched on his name, his comment in my space showed in the second page. He asked if I could remove his email address from the comment, which was simple enough.

If this item still had been cached, though, his comment would have remained in cache with his email address until that comment was refreshed. As it was, it was gone instantly, as soon as I made the change.

How frequently older pages such as these are accessed by the bots really does depend, but when I tested with some older posts of other weblogs, most of the cached entries were a week old. Not that big a deal, but if you want to really have control over your space, you’re going to want to consider eliminating caching.

To prevent caching, add the NOARCHIVE meta tag to your header:

To have better control of caching with Google, you need to become familiar with the Google Web tools. I feel like I’ve been really picking on Google lately. I’m sure such will adversely impact on share price, and bring down searching as we know it today–bad me. However, I was pleased to see Google’s addition of a cache management tool included within the Google Webmaster tool set. This is a useful tool, and since there are a lot of people who have their own sites and domains, but aren’t ‘techs’, in that they don’t do tech for a living or necessarily follow sites that discuss such tools, I thought I’d walk through the steps in how to control search engine caching of your data.

So….

To take full advantage of the caching tool, you’ll need a Google account, and access to the Webmaster tools. You can create an account from the main Google page, clicking the sign in link in the upper right corner.

Once you have created the account and signed in, from the Google main page you’ll see a link that says, “My Account”. Click on this. In the page that loads, you can edit your personal information, as well as access GMail, Google groups, and for the purposes of this writing, the Webmaster toolset.

In the Webmaster page, you can access domains already added, or add new domains. For instance, I have added burningbird.net, shelleypowers.com, and missourigreen.com.

Once added, you’ll need to verify that you own the domain. There’s a couple of approaches: add a META tag to your main web page or you can create a file given the same name as a key generated for you from Google. The first approach is what you want to use if you don’t provide your own hosting, such as if you’re hosted in Blogger, Typepad, or WordPress.com. Edit the header template and add the tag, as Google instructs. To see the use of a META tag, you can view source for my site and you’ll see several in use.

If you do host your site and would prefer another approach, create a text file with the same name as the key that Google will generate for you when you select this option. That’s all you need with the file: that it be named the name Google provides–it can be completely empty. Once created, FTP or use whatever technique to upload it to the site.

After you make either of these changes, click the verify link in the Webmaster tools to complete the verification. Now you have established with Google that you are, indeed, the owner of the domain. Once you’re verified the site, clicking on each domain URL opens up the toolset. The page that opens has tabs: Diagnostic, Statistics, Links, and Sitemaps. The first three links most likely will have useful information for you right from the start.

Play around with all of the tabs later, for now, access Diagnostic, and then click the link “URL Removal” in the left side of the page. In the page that opens, you’re given a chance to remove links to your files, subdirectories, or your entire site at Google, including removing the associated cache. You can also use the resource to add items back.

You’ve now prevent webbots from accessing a subdirectory, told the webbots a file is gone, and cleaned out your cache. Whatever you wrote and wish you didn’t is now gone. Except…

Removing a post from aggregation cache

Of course, just because a post is removed from the search engines, doesn’t meant that it’s gone from public view. If you supply a syndication feed, aggregators will persist feed content for some period of time (or some number of posts). Bloglines persists the last 100 feeds, and I believe that Google reader persists even more.

If you delete a post, to ensure the item is removed from aggregator cache, what you really need to do is delete the content for the item and then re-publish it. This ‘edit’ then overwrites the existing entry in aggregator cache.

You’ll need to make sure the item has the same URL as the original posting. If you want, you can write something like, “Removed by author” or some such thing — but you don’t have to put out an explanation if you don’t want to. Remember: your space, your call. You could, as easily, replace the contents with a pretty picture, poem, or fun piece of code.

Once the item is ‘erased’ from aggregation, you can then delete it entirely and create a 410 entry for the item. This will ensure the item is gone from aggregators AND from the search engines. Except…

That pesky except again.

This is probably one of the most critical issues of controlling your data and no one is going to be happy with it. If you publish a fullcontent feed, your post may be picked up by public aggregators or third party sites that replicate it in its entirety. Some sites duplicate and archive your entries, and allow both traversal and indexing of their pages. If you delete a post that would no longer be in your syndication feed (it’s too old), there’s no way to effectively ‘delete’ the entry for these sites. From my personal experience, you might as well forget asking them to not duplicate your feeds — with many, the only way to prevent such is to either complain to their hosting company or ISP, or to bring in a lawyer.

The only way to truly have control over your data is not to provide fullcontent feeds. I know, this isn’t a happy choice, but as more and more ‘blog pirates’ enter the scene, it becomes a more viable option.

Instead of fullcontent, provide an excerpt, as much of an excerpt as you wish to persist permanently. Of course, people can manually copy the post in its entirety, but most people don’t. Most people follow the ‘fair use’ aspect of copyright and quote part of what you’ve written.

There you go, some approaches to controlling you data. You may not have control over what’s quoted in other web sites based on fair use, but that’s life in the internet lane; returning us back to item number one in controlling your data–don’t publish it online.