Making peace with Google

I can’t wait until I get up in the morning and pop on to my machine so I can download 50+ spam emails. One of the funnest games of the day is to try and find “real” email among all of the junk. When I find one, I holler out “email whack!”

As you can tell, I am being facetious. I don’t know of anyone who likes spam, or wants to spend time on it, or wants to waste email bandwidth on it.

So why do we all like the crazy hits we get from Google?

Dave Winer pointed out a posting from Jon Udell discussing a posting from Dave Sims at O’Reilly. In it, Dave Sims wrote:

Google’s being weakened by its reliance on webloggers and their crosslinks
…
If Google wants to evolve into a functional resource for all users, it will have to work itself off this current path, or it will open up an opportunity for The Next Great Search Engine.

Jon responds with:

In the long run, the problem is not with Google, but with a world that hasn’t yet caught up with the web. I’m certain that in 10 years, US Senators and Inspectors General will leave web footprints commensurate with their power and influence. I hope that future web will, however, continue to even the odds and level the playing field.

Sorry, Jon. I’m with Dave Sims on this one. Weblogs are weakening Google.

When I ported the Burningbird to Movable Type and moved to the new location, I also created a robots.txt file that disallowed any web bot other than the blogdex or Daypop bots. And the Googlebot, being a well-behaved critter, has honored this (as have several other bots; my referrer log is getting sparkly clean).
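For anyone who wants to try the same thing, here is a minimal sketch of what such a robots.txt could look like, with a quick check of how a rule-following crawler would read it. The user-agent tokens `blogdex` and `daypop` are assumptions, as is the sample page path; check each bot's documentation for the name it really sends. The check uses Python's standard `urllib.robotparser`, which is not necessarily how any given bot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The tokens "blogdex" and "daypop" are assumed;
# substitute the user-agent names those bots actually announce.
ROBOTS_TXT = """\
User-agent: blogdex
Disallow:

User-agent: daypop
Disallow:

User-agent: *
Disallow: /
"""

def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# An empty Disallow line means "nothing is off limits" for that bot,
# while every other bot falls through to the catch-all "Disallow: /".
print(parser.can_fetch("Googlebot", "/archives/000123.html"))  # False
print(parser.can_fetch("daypop", "/archives/000123.html"))     # True
```

Keep in mind that robots.txt is only a request. It works here because Googlebot and friends choose to honor it.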

In the meantime, I’ve left my old site as is, bot-beaten poor little thing that it is. As a result, in the referrer log I’ve found the following searches:

rufus wainright shrek
devonshire tea graphics
missouri point system drivers license
bill gates popular science
entrenched in hatred
richard ashcroft money to burn
shelley bird
pictures of terrorists burning american flags
south carolina state patrol fishing
pictures of women in afghanistan
we start fire billy joel
fairy tale blue bird
beautiful outlook pictures
fighting fishies
high blood pressure burning
hacking statistics in Australia
lord of the rings pictures and drawings sting sword
add morpheus node

…and on and on

And all of these Google searches happened in three days' time. Three days.
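If you want to pull the same kind of list out of your own referrer log, a rough sketch follows. It assumes the referrer is a plain URL with the search phrase in a `q` parameter, and that you have already extracted the referrer field from each log line. The sample URLs are made up, shaped like the searches above, and the function name is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def google_query(referrer):
    """Return the search phrase from a Google referrer URL, or None."""
    parts = urlparse(referrer)
    host = parts.hostname or ""
    # Only google.com hosts are handled in this sketch; country domains are not.
    if host != "google.com" and not host.endswith(".google.com"):
        return None
    terms = parse_qs(parts.query).get("q")
    return terms[0] if terms else None

# Made-up referrer values for illustration only.
sample_referrers = [
    "http://www.google.com/search?q=fighting+fishies",
    "http://www.google.com/search?hl=en&q=add+morpheus+node",
    "http://example.com/some/page.html",
]

for ref in sample_referrers:
    query = google_query(ref)
    if query is not None:
        print(query)
```

Judging whether a search was "accurate" for the page it landed on is still a job for a human reading the list.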

By my usage estimates, Google was effectively chewing up over 30% of my web site CPU and bandwidth on searches that were, on average, accurate only 3% of the time.

My regular web sites (Dynamic Earth, YASD, P2P Smoke, and Burningbird Network) have on average seven times the traffic of my weblog, with half the Google traffic and an accuracy of over 98%. This figure means that Google searches resulting in hits to the regular web sites are finding resources matching their searches. People may still continue looking at other sites, but the topic of the search is being met by the topic covered in the page.

Weblogs — might as well call us Google Viruses.

This isn’t to say that Google and weblogs can’t work together, but it isn’t up to Google to make this happen. Google is a web bot and an algorithm; we’re supposed to be the ones with the brains.

Weblogs that focus on one specific topic are ideal candidates for Google scanning. For instance, zem is a weblog focusing on topics of cryptography, security, and copyrights. Because he consistently stays on topic, he’s increasing his accuracy ratio — people are going to find data on the page that meets their search.

Victor, who’s as interested in Google as I am, is trying to work with Google by creating a new weblog that focuses purely on web development resources, Macromedia products, and browser development. It’s early days yet, but as time goes by and more people discover Victor’s weblog, he should increase his Google page rank, resulting in an increase in the number and accuracy of his Google hits.

So what’s a weblogger who just wants to have fun to do? Well, if you don’t mind the crazy searches and the waste of your bandwidth and CPU, don’t do anything. Let all those little bots just crawl all over your weblog’s butt. Google’s bandwidth and accuracy are Google’s problem (time for smarter algorithms, perhaps).

However:

-if you’re saving up to add some nice graphics or MP3 files to your weblog and your bandwidth is restricted, as it is on most servers, or

-if you’re getting tired of crawling through the bizarre Google searches or

-if you’re getting tired of not being able to put “xxx” on your weblog page

then you might want to consider providing a few helpful aids to Google.

Google Helpful Aids

1. Create a robots.txt file and restrict Googlebot’s search to specific areas of your weblog web site, excluding your weblog page and archives.

2. If possible, create individual archive pages for each post. Otherwise, for all posts that deserve to stand alone, copy the generated HTML into a separate file.

3. For weblog posts that you think will make a great resource, and that stay on topic and don’t meander all over the place, copy or hard link them (if you’re using Unix) to a directory that bots are allowed to crawl.

4. Avoid the use of ‘xxx’ in any shape or form in any of your Googlized pages.
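Aids 1 and 3 can be sketched together, under some loud assumptions: the directory names `/weblog/`, `/archives/`, and `/resources/` are hypothetical, as are the function names, and the rules are checked with Python's standard `urllib.robotparser` rather than with Googlebot itself. The hard link uses `os.link`, which needs both paths to be on the same file system.

```python
import os
from urllib.robotparser import RobotFileParser

# Hypothetical layout: the weblog page and archives are closed to Googlebot,
# and everything else (such as /resources/) stays open to it.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /weblog/
Disallow: /archives/
"""

def googlebot_may_fetch(path):
    """Check a site path against the rules above (aid 1)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("Googlebot", path)

def publish_for_bots(post_path, public_dir):
    """Hard link a finished post into the crawlable directory (aid 3)."""
    os.makedirs(public_dir, exist_ok=True)
    target = os.path.join(public_dir, os.path.basename(post_path))
    if not os.path.exists(target):
        # A hard link is a second name for the same file, not a copy.
        os.link(post_path, target)
    return target

print(googlebot_may_fetch("/archives/2002_05.html"))       # False
print(googlebot_may_fetch("/resources/google-aids.html"))  # True
```

Calling `publish_for_bots("archives/000123.html", "resources")` would then expose that one post to the bots while the rest of the archive stays off limits. Note that a hard link tracks later edits only if the file is changed in place; a tool that regenerates the page by replacing the file would leave the link pointing at the old version.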

Over time, we’ll add to these aids.

Now, if only I can figure out what to do with all these XML and RDF aggregators that are now crawling all over my server….