Controlling your data

Popular opinion is that once you publish any information online, it’s online forever. Yet the web was never intended to be a permanent snapshot, embedding past, present, and future in unbreakable amber, preserved for all time. We can control what happens to our data once it’s online, though it’s not always easy.

The first step is, of course, not to publish anything online that we really want to keep personal. However, times change and we may decide that we don’t want an old story to appear in search engines, or that we don’t want MySpace hotlinking our images.

I thought I would cover some of the steps and technologies you can use to control your online data and media files. If I’ve missed any, please add in the comments.


The granddaddy of online data control is robots.txt. With this you can control which search engine web bots can access which directories. You can even remove your site entirely from all search engines. Drastic? Unwelcome? As time goes on, you may find that pulling your site out of the mainstream is one way of keeping what you write both timely and intimate.

I discussed the use of robots.txt years ago, before the marketers discovered weblogging, and most people were reluctant to cut themselves off from the visitors arriving from the major search engines. We used to joke about the odd search phrases that brought unsuspecting souls to our pages.

Now, weblogging is much more widely known, and people arrive at our pages through any form of media and contact. In addition, search engines no longer send unsuspecting souls to our pages as frequently as they once did. They are beginning to understand and manage the ‘blogging phenomenon’, helped along by webloggers and our use of ‘nofollow’ (note from author: don’t use it, it’s bad web use). Even now, do we delight in the accidental tourists as much as we once did? Or is that part of a bygone innocent era?

A robots.txt file is a text file with entries like the following:

User-agent: *
Disallow: /ajax/
Disallow: /alter/

This tells all webbots not to traverse the ajax or alter subdirectories. All well-behaved bots follow these rules, and that includes the main search engines: Yahoo, Google, MSN, Ask, and that other guy, the one I can never remember.
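To check how a well-behaved bot would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com host and the /weblog/ path are placeholders made up for illustration, not anything from my site.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt example in the text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ajax/
Disallow: /alter/
"""


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if a bot that honors ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and the paths below are hypothetical.
    for path in ("/ajax/demo.html", "/alter/", "/weblog/post.html"):
        print(path, is_allowed("http://example.com" + path))
```

Keep in mind this only models the polite bots; nothing in robots.txt forces a badly behaved one to comply.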

The place to learn more about robots.txt is, naturally enough, the robots.txt web site.

If you don’t host your own site, you can achieve the same effect using a META element in the head section of your web page. If you’re not sure where this section is, use your browser’s View Source capability: anything between opening and closing “head” tags is the head section. Open mine and you can see the use of a META element. Another example is:

<meta name="robots" content="noindex, nofollow">


This tells web bots to not index the site and not harvest links from the site.
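If you want to confirm which robots directives a page actually carries, a rough sketch with Python's standard html.parser module follows. The class and function names are my own invention for illustration, and the sample page is a stand-in rather than real markup from any site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without a value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only to exercise the parser.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body></body></html>"
)

if __name__ == "__main__":
    print(sorted(robots_directives(PAGE)))
```

The same check works for other directives carried in a robots META element, such as noarchive.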

Another resource you might also want to protect is your images. You can tell search engines to bypass your images subdirectory if you don’t want them picked up in image search. This technique doesn’t stop people from copying your images, which you really can’t prevent without using Flash or some other strange web defying move. You can, however, stop people from embedding your images directly in their web pages, a concept known as hotlinking.

There are good tutorials on how to prevent hotlinking, so I won’t cover it here. Search on “preventing hotlinking” and you’ll see examples, both in PHP code and in .htaccess.

Let’s say you want to have the search engines index your site, but you decide to pull a post. How can you pull a post and tell the search engines you really mean it?

410 is not an error

There is no such thing as a permanent, fixed web. It’s as fluid as the seas, as changeable as the weather. That’s what makes this all fun.

A few years back, editing or deleting a post was considered ‘bad form’. Of course, we now realize that we all change over time and a post that seemed like a good idea at one time may seem a terrible idea a year or so later. Additionally, we may change the focus of our sites: go from general to specific, or specific back to general. We may not want to maintain old archives.

When we delete a post, most content management tools return a “404” when the URL for the page is accessed. This is unfortunate because a 404 tells a web agent that the page “was not found”. An assumption could be made that it’s temporarily gone; the server is having a problem; a redirect is not working right. Regardless, a 404 implies a condition that will be cured at some point.

Another 4xx HTTP status is 410, which means that whatever you tried to access is gone. Really gone. Not just on vacation. Not just a bad redirect, or a problem with the domain–this resource at this spot is gone, g-o-n-e. Google considers these an error, but don’t let that big bully fool you: this is a perfectly legitimate status and state of a resource. In fact, when you delete a post in your weblog, you should consider adding an entry to your .htaccess file to note that this resource is now 410.

I pulled a complete subdirectory and marked it as gone with the following entry in .htaccess:

Redirect gone /category/justonesong/
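If your pages are served by application code rather than Apache, the same idea can be sketched as a tiny WSGI application in Python. This is only an illustration under assumptions: GONE_PREFIXES, status_for, and app are hypothetical names, the prefix match is a simplification of what Apache does, and a real application would serve its existing pages before falling through to 404.

```python
# Hypothetical list of removed URL prefixes; the single entry mirrors
# the .htaccess line in the text.
GONE_PREFIXES = ("/category/justonesong/",)


def status_for(path: str) -> str:
    """Pick 410 for deliberately removed paths, 404 for anything else."""
    if path.startswith(GONE_PREFIXES):
        return "410 Gone"
    return "404 Not Found"


def app(environ, start_response):
    """Minimal WSGI app that only answers with 410 or 404."""
    status = status_for(environ.get("PATH_INFO", "/"))
    body = status.encode("ascii")
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server can serve app on a port of your choosing.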

I tried this on an older post and sure enough, all of the search engines pulled their reference to the item. It is, to all intents and purposes, gone from the internet. Except…

Except there can be a period where the item is gone but cache still remains. That’s the next part of the puzzle.

Search Engine Cache and the Google Webmaster Toolset

Search on a term and most results have a couple of links in addition to the link to the main web page. One such link is for the cache for the site: a snapshot of the last time the webbot stopped by.

Caching is a handy thing if you want to ensure people can access your site. However, caching can also perpetuate information that you’ve pulled or modified. Depending on how often the search engine refreshes the snapshot, it could reflect a badly out of date page. It could also reflect data you’ve pulled, and for a specific reason.

Handily enough as I was writing this, I received an email from a person who had written a comment to my weblog in 2003 and who had typed out his URL of the time and an email address. When he searched on his name, his comment in my space showed in the second page. He asked if I could remove his email address from the comment, which was simple enough.

If this item still had been cached, though, his comment would have remained in cache with his email address until that comment was refreshed. As it was, it was gone instantly, as soon as I made the change.

How frequently older pages such as these are accessed by the bots really does depend, but when I tested with some older posts of other weblogs, most of the cached entries were a week old. Not that big a deal, but if you want to really have control over your space, you’re going to want to consider eliminating caching.

To prevent caching, add the NOARCHIVE meta tag to your header:

<meta name="robots" content="noarchive">

To have better control of caching with Google, you need to become familiar with the Google Web tools. I feel like I’ve been really picking on Google lately. I’m sure such will adversely impact on share price, and bring down searching as we know it today–bad me. However, I was pleased to see Google’s addition of a cache management tool included within the Google Webmaster tool set. This is a useful tool, and since there are a lot of people who have their own sites and domains, but aren’t ‘techs’, in that they don’t do tech for a living or necessarily follow sites that discuss such tools, I thought I’d walk through the steps in how to control search engine caching of your data.


To take full advantage of the caching tool, you’ll need a Google account, and access to the Webmaster tools. You can create an account from the main Google page, clicking the sign in link in the upper right corner.

Once you have created the account and signed in, from the Google main page you’ll see a link that says, “My Account”. Click on this. In the page that loads, you can edit your personal information, as well as access GMail, Google groups, and for the purposes of this writing, the Webmaster toolset.

In the Webmaster page, you can access domains already added, or add new domains. For instance, I have added several of my own domains.

Once added, you’ll need to verify that you own the domain. There are a couple of approaches: add a META tag to your main web page, or create a file with the same name as a key Google generates for you. The first approach is what you want to use if you don’t provide your own hosting, such as if you’re hosted at Blogger or Typepad. Edit the header template and add the tag, as Google instructs. To see the use of a META tag, you can view source for my site and you’ll see several in use.

If you do host your site and would prefer another approach, create a text file with the same name as the key that Google will generate for you when you select this option. That’s all you need with the file: that it be named the name Google provides–it can be completely empty. Once created, FTP or use whatever technique to upload it to the site.

After you make either of these changes, click the verify link in the Webmaster tools to complete the verification. Now you have established with Google that you are, indeed, the owner of the domain. Once you’ve verified the site, clicking on each domain URL opens up the toolset. The page that opens has tabs: Diagnostic, Statistics, Links, and Sitemaps. The first three tabs most likely will have useful information for you right from the start.

Play around with all of the tabs later; for now, access Diagnostic, and then click the link “URL Removal” in the left side of the page. In the page that opens, you’re given a chance to remove links to your files, subdirectories, or your entire site at Google, including removing the associated cache. You can also use the resource to add items back.

You’ve now prevented webbots from accessing a subdirectory, told the webbots a file is gone, and cleaned out your cache. Whatever you wrote and wish you didn’t is now gone. Except…

Removing a post from aggregation cache

Of course, just because a post is removed from the search engines doesn’t mean that it’s gone from public view. If you supply a syndication feed, aggregators will persist feed content for some period of time (or some number of posts). Bloglines persists the last 100 feeds, and I believe that Google Reader persists even more.

If you delete a post, to ensure the item is removed from aggregator cache, what you really need to do is delete the content for the item and then re-publish it. This ‘edit’ then overwrites the existing entry in aggregator cache.

You’ll need to make sure the item has the same URL as the original posting. If you want, you can write something like, “Removed by author” or some such thing — but you don’t have to put out an explanation if you don’t want to. Remember: your space, your call. You could, as easily, replace the contents with a pretty picture, poem, or fun piece of code.
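Most weblogging tools handle the re-publish for you, but as a rough sketch of what the overwritten feed entry amounts to, here is a Python example using the standard xml.etree.ElementTree module. It assumes a plain RSS 2.0 feed with un-namespaced item, link, and description elements; the blank_item function, the sample feed, and its URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item RSS 2.0 feed used only for illustration.
FEED = """\
<rss version="2.0"><channel>
<title>Example weblog</title>
<item><title>First</title><link>http://example.com/post-1</link>
<description>First post body</description></item>
<item><title>Second</title><link>http://example.com/post-2</link>
<description>Second post body</description></item>
</channel></rss>
"""


def blank_item(feed_xml: str, link: str,
               notice: str = "Removed by author") -> str:
    """Replace the description of the item with the given link.

    The link is left untouched so aggregators treat the result as an
    edit of the existing entry rather than as a new post.
    """
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        if item.findtext("link") == link:
            description = item.find("description")
            if description is not None:
                description.text = notice
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(blank_item(FEED, "http://example.com/post-2"))
```

Feeds that carry the full post in a namespaced element such as content:encoded would need that element replaced as well, which this sketch does not attempt.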

Once the item is ‘erased’ from aggregation, you can then delete it entirely and create a 410 entry for the item. This will ensure the item is gone from aggregators AND from the search engines. Except…

That pesky except again.

This is probably one of the most critical issues of controlling your data and no one is going to be happy with it. If you publish a fullcontent feed, your post may be picked up by public aggregators or third party sites that replicate it in its entirety. Some sites duplicate and archive your entries, and allow both traversal and indexing of their pages. If you delete a post that would no longer be in your syndication feed (it’s too old), there’s no way to effectively ‘delete’ the entry for these sites. From my personal experience, you might as well forget asking them to not duplicate your feeds — with many, the only way to prevent such is to either complain to their hosting company or ISP, or to bring in a lawyer.

The only way to truly have control over your data is not to provide fullcontent feeds. I know, this isn’t a happy choice, but as more and more ‘blog pirates’ enter the scene, it becomes a more viable option.

Instead of fullcontent, provide an excerpt, as much of an excerpt as you wish to persist permanently. Of course, people can manually copy the post in its entirety, but most people don’t. Most people follow the ‘fair use’ aspect of copyright and quote part of what you’ve written.
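Most weblogging tools have an excerpt setting for feeds, but if you generate your own, the trimming can be as simple as the following Python sketch. The excerpt function and its 50 word default are arbitrary choices for illustration, not a recommendation.

```python
def excerpt(text: str, max_words: int = 50) -> str:
    """Return at most max_words words of text, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " [...]"


if __name__ == "__main__":
    post = "Popular opinion is that once you publish any information online, it is online forever."
    print(excerpt(post, max_words=8))
```

Note that this collapses runs of whitespace and ignores any HTML in the post, so markup would need to be stripped first.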

There you go, some approaches to controlling your data. You may not have control over what’s quoted in other web sites based on fair use, but that’s life in the internet lane; returning us back to item number one in controlling your data–don’t publish it online.


Traced Botanical tour

I grabbed my camera and my GPS handheld yesterday to get some test data for application development. The recorded track isn’t the best, primarily because the containing field was small, and the terrain was flat. I also realized from the track that I tend to meander when walking around, taking photos.

screenshot from Google Earth

I can assure you, though, that I didn’t actually go through the lily pond.

Garmin provides its own export format, MPS, but I used g7towin on the PC to export the data directly from the device into GPX, another popular format. I also loaded the same track from my device using Ascent on my Mac. All these formats are XML-based–GPX, GML, MPS, KML, and GeoRSS. Not to mention the embedded photo info contained in the EXIF data section, which is formatted as RDF/XML. In the debate between JSON and XML, when it comes to the geoweb, JSON has nowhere to go.

I converted the GPX files into KML files using GPS Visualizer. I also generated a topographic map of the area. With the GPX and KML, I can use any number of GPS applications and utilities, including Google Earth, and the GPSPhotoLinker, which provides the capability to upload a GPS track and any number of photos and geotags the photos, either individually or as a batch. Works on the Mac, too, which is unusual, as most GPS applications are Windows-based.

I can do any number of things with the GPS data since all of it is standard XML and accessible by any number of XML-based APIs and applications. I can generate SVG to draw out the track, as well as create an application to determine from the variance in slope between waypoints or trackpoints, whether the hike is arduous or not. I can use ‘geotagging’ in the images to incorporate photographs into mapping. Or mapping into photography.
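As a rough sketch of the slope idea, the following Python reads trackpoints from a GPX file and estimates how steep the walk is. It rests on several assumptions: the file uses the GPX 1.1 namespace and carries an ele element on each trackpoint, the mean absolute grade stands in for a proper variance measure, and the 10 percent threshold is an arbitrary cutoff I picked for illustration. All of the function names and the sample track are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Assumes GPX 1.1; a GPX 1.0 export would use a different namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Hypothetical two-point track, flat and roughly 111 meters long.
SAMPLE_GPX = """\
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1"><trk><trkseg>
<trkpt lat="38.000" lon="-90.000"><ele>100</ele></trkpt>
<trkpt lat="38.001" lon="-90.000"><ele>100</ele></trkpt>
</trkseg></trk></gpx>
"""


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def track_points(gpx_xml):
    """Return (lat, lon, elevation) tuples for every trackpoint."""
    root = ET.fromstring(gpx_xml)
    points = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        ele = pt.findtext("gpx:ele", default="0", namespaces=GPX_NS)
        points.append((float(pt.get("lat")), float(pt.get("lon")), float(ele)))
    return points


def slopes(points):
    """Rise over run between consecutive points, skipping zero-length legs."""
    grades = []
    for (la1, lo1, e1), (la2, lo2, e2) in zip(points, points[1:]):
        run = haversine_m(la1, lo1, la2, lo2)
        if run > 0:
            grades.append((e2 - e1) / run)
    return grades


def is_arduous(points, threshold=0.10):
    """Call a track arduous if its mean absolute grade exceeds threshold."""
    grades = [abs(g) for g in slopes(points)]
    return bool(grades) and sum(grades) / len(grades) > threshold


if __name__ == "__main__":
    pts = track_points(SAMPLE_GPX)
    print(len(pts), "points; arduous:", is_arduous(pts))
```

Consumer GPS elevation readings are noisy, so a real version would probably want to smooth the track before trusting any per-leg grade.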

Lastly, I can display or otherwise use the EXIF data, though using ImageMagick to generate thumbnails can also strip this information. Eventually, I may even update my old photo/RDF work into something not dependent on a content management tool.

water lilies

Dragonfly on purple

Statue of Woman

Photography Technology

Image Magic

I’ve been experimenting around with software tools and utilities, and for this batch I used ImageMagick’s command line tools to create the thumbnails and add a signature.

The shell script to add the signature is:

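for name in *jpg
do
  echo "$name"
  # Draw the name twice, offset by two pixels, for a drop-shadow effect.
  convert "$name" -font AvantGarde-Demi -pointsize 24 \
    -gravity SouthEast \
    -fill maroon -draw "text 27,27 'Shelley'" \
    -fill white -draw "text 25,25 'Shelley'" \
    "$name"  # output name is an assumption: overwrites the file in place
done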

Puts a signature on every JPEG in a subdirectory.

I also found a lovely little online photo editor that does everything most people need, and has one of the few truly intuitive interfaces I’ve seen among these tools. It’s called Picnik and it doesn’t cost anything to give it a try, as far as I can see.

I won’t use the WordPress upload software, nor the software I had been developing. These require the ability to have global write to a directory, and this has been hacked now and I don’t consider it safe. Instead, I FTP images to a working directory, run my ImageMagick scripts to do my global changes, and then push them over to my photos subdirectory. All in all, it’s actually as fast as, or faster than, uploading one photo at a time, though it doesn’t ‘hide’ you from the environment as much.

I think that’s a mistake myself: protecting the users from the environment. All we’ve done is help them/you to open their/your environment to hacking. Not particularly responsible of us.



How I got my Harry Potter book

It was late Friday night and I had just gone down to feed the cat before getting ready for bed when I noticed the light that shines through the front window had gone out. Moments later, there was a smart tap tap tap at the door. Somewhat nervously, I looked out through the peephole to see who it was and was met with an astonishing sight.

Standing at the door was a man, not particularly young but not especially old, either, wearing the strangest outfit. He had on a UPS shirt and shorts, which in itself wouldn’t be odd except for the lateness of the hour. However, over it he wore a long black robe, which would catch in the breeze and billow out around him, like a dark cloud before a storm.

He knocked again, more impatiently this time. I called out through the door, asking him who he was and what he wanted.

“UPS, ma’am”, he replied. “I have a delivery for you.”


“It’s a little late for a UPS delivery, isn’t it?”

He responded, words mechanically delivered, as if memorized by rote (or forced by curse), “This is a special delivery created for this very special occasion, as a courtesy for you, our beloved customer, by both Amazon and UPS.”

He then held something up, a dark shape too close for me to get a good look through the tiny hole.

“I have your copy of the new Harry Potter book, Deathly Hallows.”

A book? The man was bringing me a book?

“You did order the new Harry Potter book for same day delivery from Amazon, didn’t you?” he said, with some exasperation.

I had at that, and whether it was the mad outfit or the strangeness of the event, I felt reassured by his words and opened the door. He looked relieved and handed me the box. On the outside were words, white stamped on red, something about ‘…Muggle delivery…’ and not delivering the book until 12:01, July 21st.

He smiled perfunctorily at me, and started to walk away. “Wait,” I called out. “You came this late at night, knocked on my door and for all you know woke me from a sound sleep in order to deliver a book?”

He stopped, frowning slightly. “You did order same day service, did you not?”

“Well, yes, but…”

“If we had been late, you would have been angry, wouldn’t you have?”

“Possibly, but…”

“Then I don’t see what the problem is,” he finished, and again started to walk away.

“Yes, but it’s not the 21st yet.”

He stopped. Slowly he turned around toward me, all traces of smile gone. His face had paled, and it was only at that moment that I noticed he had a really bad black wig on his head, slightly askew. And…were those glasses painted around his eyes?

“I think you are mistaken,” he said, voice so low I had to strain to hear him.

“No, no, I don’t think so,” I held up the cellphone I had in my hand when I had gone to the door, ready to call the police depending on what I found on the other side. “I grabbed this when you knocked. The time in the phone is maintained by the cellphone company and is accurate to the second.”

I beckoned him closer to look for himself. When he bent his head down to peer more closely, I pushed the button to illuminate the phone face, casting a greenish light over his now tautly drawn features, light reflecting redly in his eyes.

“See?” I said. “It’s only 11:01. It’s still the 20th of July.”

He backed away from the phone, his movements fearful, as if the phone had suddenly come alive and hissed at him. He held his watch up to his face, looked at it, shook his arm a couple of times, tapped the face and looked once more.

His arm fell to his side, and his head twisted partially away from me. I could hear sounds that sent chills down my spine. He let out a low anguished moan, and though I couldn’t see his face well, what I could see showed a man who looked to be in a state of pain. Or, perhaps, a man suddenly gone mad.

“Well, uh, thanks for the book,” I called out, as I drew back into the house and moved to shut the door, feeling suddenly afraid, of what, I had no idea.

Before I could finish shutting the door, the delivery man (moving supernaturally fast) was in the doorway, shoving the door open with his shoulder. He grabbed the book from my hand and though I fought him as best as I could, I was no match for his strength and determination. I let him have the book.

“SosorrydeliveredthisatthewrongtimeandifyoullexcusemetheressomethingImustdo”, he panted out, words sounding like gibberish in the rush. He then took off–across my lawn, bounding over the sidewalk, and sprinting through the lawn of our neighbor across the way. He pounded on my neighbor’s door, pounding with all his might, until my neighbor, a nice older guy who works in insurance I think, came out, wearing a gray robe and looking more than a little peeved.

“I must have the book!” the now seemingly insane delivery man screamed–voice high pitched, frantic, inhuman sounding. My neighbor blinked at him and started to bluster, “Now see here…” but was pushed violently aside, as the delivery person dashed into his home. I heard a faint scream from within the apartment, and the delivery man returned a moment later, another Harry Potter book in his hands.

“Sorry!” he shouted and with a feral grin, raced down the walk to the next apartment, this one rented by a young woman who is a hair dresser, and whose mother is a truck driver (she and I having had a comfortable chat in the laundry room one winter day). The young lady, hearing the commotion and not having much sense (as her mother confided to me), had already opened her door.

She had something in her hand. It was a book. It was another copy of the Harry Potter book. Oh, no.

She froze in terror as he approached her, but when he made to grab the book (perhaps being a bigger fan than I), she held on for dear life. Abandoning the other two books he held, he grasped hers with both his hands and they formed an oddly graceful ballet as they struggled for possession, dancing about in a circle, neither willing to let go.

Other neighbors now appeared, attracted to the noise and the movement. We watched the delivery man and the hair dresser struggle up and down the sidewalk, into trees, and through bushes. I could see both were scratched and bleeding, but neither was willing to give an inch. The two college students down the way from us started laying odds with each other as to who would triumph in the end.

“Well, obviously he’s mad, but that should make him stronger.”

“Yeah, but that cloak can’t be helping.”

“Oooo! That must have hurt!”

“Wow, remind me to never grab anything from a hair dresser.”

“Yeah, not without a protective cup.”

It was the shrub near the sidewalk that was her undoing. The young woman backed into it and tripped. Trying to recover her balance, she flailed about with her arms, letting go of the book. The delivery man–wig now half knocked off, cape torn with UPS uniform showing through, blood dripping down his hairy calves–held the book aloft in triumph.

At that moment, out of the darkness came a light, a blinding white light.

“This is the police,” came a disembodied voice. “Put down the book, put your hands behind your head, and lay down on the ground.”

The scene held together, like a picture on a wall. Everyone froze. Everyone but the delivery man. Slowly, oh so slowly, he began to lower the arm holding the book. When the arm was straight out from his body, he pointed the book at the light, at the police man who held it, and his partner who stood by him, weapon in hand.

“Let me go,” the delivery man said.

“I can’t do that,” said the cop holding the gun.

“Let me go, or I’ll shout out the ending.”

We all sucked in our breaths and released them in unison, in a sort of collective gasp at the implied threat. A woman two doors down from the hair dresser cried out softly, “No! No! Think of the children”, as she moved to put her hands over her daughter’s ears. The child was crying softly, words bubbling out through the tears. “Want book. Want book.”

Emboldened, the delivery man started to move slowly towards the two policemen, stopping to pick up the other two books he had dropped.

“I’ve read the book, you know,” he said, voice calm. “We weren’t supposed to, but I couldn’t stand it, I couldn’t stand the not knowing.” He had reclaimed one book.

“Last night I broke into the warehouse where the books were being kept. There was no one around. No one around at all.”

He had now picked up the second dropped book.

“I grabbed one of the book boxes and oh so carefully slit it open.”

He was now standing between the insurance man and myself, both of us transfixed at the drama unfolding in front of us (such not happening all that often when one lives in the suburbs of St. Louis).

“I then read it, read it right there, sitting on the cement in a storage room filled with treasures.”

He continued to creep forward, coming closer and closer to the police.

“I’ll do it,” he said, determination shining through the madness. “I’ll tell everyone here whether Harry Potter lives or dies if you don’t let me go.”

The hair dresser raised her arm in supplication. “No”, she whimpered. “Please, please don’t.”

“I’m afraid I can’t let you do it,” the policeman said again, tightening his grip on his gun.

The delivery man stopped. He must have found some inner strength because he seemed to stand taller. He calmly put two of the books under one arm and used his newly freed hand to straighten the wig on his head. “So,” he said, bringing the arm holding the one book closer to his chest. “You refuse to consider my offer, is that it?”

Without warning, he flung his arm back out again, pointing the book directly at the policeman holding the gun.

“Be it on your head then,” he screamed. “In the book Harry Potter…GACK! gAck!”

The ends of the taser stuck out of the delivery man’s chest like horns from some mythical beast. He jerked about and fell to the ground, the books in his arms flying wide into the bushes around him. He twitched and twitched and twitched on the ground before finally lying still.

It was all over. Disaster prevented. Spoiler narrowly averted.

More police came, helping the now docile delivery man to his feet where he was unceremoniously stripped of wig and cloak, but thankfully allowed to keep his brown UPS shirt and shorts. Then he was gone, the police were gone, and most of the neighbors had returned to their homes.

All but three of us: the insurance man across the way, the hair dresser, and myself.

Without speaking we left our homes and each moved to a different section of the bushes. The insurance man bent down, hand reaching out, but I called out to wait.

I looked at my cellphone.

It was 11:59.

It was 12:00.

It was 12:00 and thirty seconds.

It was 12:01.

“OK,” I called softly, and we all reached down and picked up our Harry Potter books and without saying another word, returned to our homes.

JavaScript Technology

Clever document.write killer

Recovered from the Wayback Machine.

Some ad networks and other script-based entities, such as Google Ads, I believe, use JavaScript document.write to write out the content to our web pages. The organizations use this because they want the content embedded in the page at the point where the item is placed, and there isn’t a lot of control where this can happen.

brothercake at SitePoint has come up with a way of eliminating the need for document.write, which is to give the script element an identifier and use it as a point of reference for where to place the content using the proper DOM functions.

It’s something that should have been obvious, but wasn’t until it was pointed out. As one commenter wrote, a real slapping the head moment.

(via Simon Willison)