Categories
Diversity, Technology

So What?

Recovered from the Wayback Machine. Part of the O’Reilly Women in Tech series.

A few weeks back, the book Beautiful Code: Leading Programmers Explain How They Think hit the streets. What a terrific concept: get several prominent programmers to write about their own unique perspective on programming and donate the money to a good cause (Amnesty International). It was, and still is, a good idea and book. What problem could I possibly have with it?

A quick look at the Table of Contents gives you a hint: of the 38 programmers who contributed to the book, only one was a woman. Just one woman. Even by today’s standards with few women staying in the field—and even fewer entering—the ratio of men to women in this book, frankly, sucks.

A discussion arose about the lack of women authors in this book, which included the fact that some of the women who were invited to contribute declined. There were the usual statements, the usual questions asked: “Give us lists of relevant women,” “Who should we have invited?” Yada, yada, yada, business as usual.

However, there were several comments that I found disquieting because they reflected other discussions we’ve had in the last year about the declining numbers of women in technology. The wording has differed, but the views basically reduced to, “So what?”

There is only one woman who contributed to the book. So what?

There are no women presenting at the conference. So what?

No women are listed among these top designers/developers/experts. So what?

Before this last year—regardless of the situation and the participants, regardless of the reasons people give for the growing lack of diversity in the tech field, regardless of solutions offered—one thing all participants in these discussions seemed to agree on was that this lack of women was not a good thing. Lately, though, I’m seeing indifference to the whole issue; an increasingly vocal opinion that it just really doesn’t matter.

I never lack for opinion, but this one has me stumped. Here we are in 2007, in an era where the numbers of women in “non-traditional” professions have been increasing, sometimes even past the 50 percent mark. No longer do women have to stay at home or choose only “soft” professions. We now have more choices, and the only limits we seemingly face are those we bring with us. Women serve in the military and die in action, lead major corporations, argue cases in the Supreme Court, and are anything from rocket scientists to neurosurgeons.

Yet in the IT fields, our numbers are dwindling. Significantly. We all have ideas why this is occurring, but nothing concrete that we can point at and say, “There, that’s why!” It’s a true puzzle. What’s more puzzling, though, is how many in the technology field just don’t care. They don’t see a field that is becoming increasingly male-only as a problem.

Is it a problem? Probably not, if only men use the gadgets, only men use the software, only men are impacted by the applications, and so on. Yet, we know that women typically use software as much or more than men. Women use the Internet, as much or more than men. Women buy and use the gadgets. What’s happening is that all the population is using an increasing number of applications that are architected, designed, developed, quality tested, and documented by only half the population. Less than half, because the tech industry lacks diversity when it comes to race, too.

Maybe I’m just being a woman and all, but I look at this and I think to myself: are we really creating the best software? Are we really designing the best gadgets, the most useful web sites, the superior applications? How can we be, if more than half the population has no input in any aspect of the development and design process?

So, so what.

I’ve long felt that the IT field is one of the few where the participants are focused on the tools, rather than the tasks. I believe that integrating IT into the engineering field as a complete and separate discipline was a huge mistake—not the least of which is that engineering is the only other discipline where the numbers of women are dropping (big hint, there).

Our field would be better if it were integrated with the library sciences, psychology, business, English, art—associated with tasks and topics, rather than grouped around the tools and processes. This makes even more sense when you realize that many people who enter the field do so with no degrees or with degrees in completely unrelated disciplines. It’s not unusual to hear from both sexes that they drifted into development or design because of a growing interest that was unrelated to their initial course of study. Imagine how much stronger the IT field would be if we brought in all these diverse viewpoints right from the start.

My recommendation? Break up the computer science programs, split the participants into specialized fields within other disciplines, and stop spending all our time talking about Ruby and how cool it is. See? There’s a solution, and all it requires is basically ripping apart the entire field and rearranging the chunks.

Whatever solutions we arrive at to increase the number of women in technology, none are going to work if there isn’t general consensus that the lack of diversity is a problem. That we all, at a minimum, agree that the computer field, as it is now, is broken. That we need to find solutions. More importantly, that we all have to buy into the solutions, because whatever we come up with is going to impact all of us, including those who say, “So what?”

Categories
Web

Web 9.75

“The precision of naming takes away from the uniqueness of seeing.” (Pierre Bonnard)

Nick Carr comments on Google’s Web 3.0, pointing out the fact that Web 3.0 was supposed to be about the Semantic Web, or, as he puts it, the first step in the Machine’s Grand Plan to take over.

For all the numbers we flash about, there really are only so many variations of data, data annotation, data access, and data persistence; whatever version of “web” we name features the same concepts, rearranged. Perhaps instead of numbers, we should use descriptive terminology when naming each successive generation of the web, starting with the architecture of the webs.

Application Architectures

thin client

This type of application is old. Older than dirt. A thin client is nothing more than an access point to a server, typically managing protocols but not storing data or installing applications locally. All the terminal traditionally does is capture keystrokes and pass them along to the server-based application. The old mainframe applications were, and many still are, thin clients.

There was a variation of the thin client a while back when the web was really getting hot: the network computer. Oracle did not live up to its name when it invested in this functionality, long ago. The network computer was a machine created to do nothing more than access the internet and display web pages. In a way, it’s very similar to what we have with the iPhone and other handheld devices: there is no way to add third-party functionality to the interface device, and any functionality at all comes in through the network.

Is a web application a thin client? Well, yes and no. For something like an iPhone or Apple TV, I would say yes, it is a thin client. For most uses, though, web applications require browsers and plug-ins and extensions, all of which do something unique, and require storage and the ability to add third-party applications, as well as processing functionality on the client. I would say that a web application where most of the processing is done on the server, and little or none in the browser, is a thin client. Beyond that, though, the web application would be…

client/server

A client/server application typically has one server or group of servers managed as one, and many clients. The client could be a ‘thin’ client, but when we talk about client/server, we usually mean that there is an application, perhaps even a large application, on the client.

In a client/server application, the data is traditionally stored and managed on the server, while much of the business processing as well as user interface is managed on the client. This isn’t a hard and fast separation, as data can be cached on the client, temporarily, in order to increase performance or work offline. Updates, though, typically have to be made, at some point, back to the server.

The newest incarnation of web applications, the Rich Internet Applications (RIA), are, in my opinion, a variation of client/server applications. The only difference between these and applications built with something like Visual Basic is that we’re using the same technologies we use to build more traditional web applications. We may or may not move the application out of the browser, but the infrastructure is still the same: client/server.

However, where RIA applications may differ from the more traditional web applications is that RIA apps could be a variation of client/server: a three-tier client/server application…

n-tier

In a three-tier, or more properly n-tier, client/server application, there is separation between the user interface and the business logic, and between the business logic and the data, creating three levels of control rather than two. The reasoning behind this is that changes in the interface between the business layer and the data don’t necessarily impact the UI, and vice versa. To match the architecture, the UI can be on one machine, the business logic on a second, and the data on a third, though the latter isn’t a requirement.

Some RIA applications can fit this model, because many do incorporate a concept of a middleware component. As an example, the newer Flex infrastructure can be built as a three-tier application with the addition of a Flex server.

Some web applications, whether RIA or not, can also make use of another variation of client/server…

distributed client/server

Traditional client/server is many clients working against one set of business logic mapped to a database server, running serially. It’s the easiest type of application to create, but the one least likely to be able to scale, and from this arises the concept of a distributed client/server, or distributed computing, architecture.

The ‘distributed’ in this title comes from the fact that the application functionality can be split into multiple objects, each operating on possibly different machines at the same time. It’s the parallel nature of the application that tends to set this type of architecture apart, and which allows it to more easily scale.

J2EE applications fit the distributed computing environment, as does anything running CORBA or the older COM and the newer .NET. It is not a trivial architecture, and needs the support of infrastructure components such as WebLogic or JBoss.

This ‘distributed parallel’ functionality sounds much like today’s widget-bound sidebars, wherein a web page can have many widgets, each performing a small amount of functionality on a specific piece of data at the same time (or as parallel as can be, considering that the page is probably not running in a true multi-threaded space).

Remember, though, that widgets tend to operate as separate and individual applications, each with its own API (Application Programming Interface) and data. Now, if all the widgets were front ends to backend processes running in parallel, and working together to solve a problem, then the distributed architecture shoe fits.

There’s a variation of distributed computing–well, sort of–which is…

Service Oriented Architecture

Service Oriented Architecture (SOA), better known as ‘web services’. These are the APIs, the RESTful service requests, and the other services that run the web we seem to become more dependent on every day. Web services are created completely independently of the clients, supporting a specific protocol and interface that makes the web services accessible regardless of the characteristics of the client.

The client then invokes these services, sending data, getting data back, and does so without having any idea of how the web services are developed or what language they’re developed with, other than knowing the prototype and the service.
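As a rough sketch of what that looks like from the client side (the URL and the shape of the response here are invented for illustration, not a real service), a client written in something like PHP only needs to know the endpoint and the format of what comes back:

<?php
// Invoke a hypothetical REST service; we know the prototype and the
// service address, and nothing about how the service is implemented.
$response = file_get_contents('http://example.com/service/photos?tag=missouri&format=json');

// Decode the returned JSON and use the data.
$photos = json_decode($response, true);
foreach ($photos as $photo) {
    echo $photo['title'], "\n";
}
?>

Swap the service out for one written in Java, Python, or anything else, and the client doesn’t change: that’s the point.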

Clean and elegant, and increasingly what runs today’s web. The interesting thing about web services is that they can be anywhere from almost trivially easy to tortuously complex to implement. And no, I didn’t specifically mention the WS-* stack.

Of course, all things being equal, no simpler architecture than…

A stand alone application

A stand alone application is one where no external service is necessary for accessing data or processes. Think of something like Photoshop, and you get a stand alone application.

The application may have internet capabilities, but typically these are incidental. In addition, the data may not always be on the same machine, but it doesn’t matter. For instance, I run Photoshop on one Mac, but many of my images are on another Mac that I’ve connected to through networking. However, though I may be accessing the data over the network, the application treats the data as if it were local.

The key characteristic of a stand alone application is that you can’t split the application up across machines — it’s all or nothing. It’s also the only architecture that can’t ‘speak’ web, so we can’t look for Web 3.0 among the stand alones.

Alone again, naturally…

No joy in being alone; what we need is a little help from our friends.

P2P

P2P, or peer-to-peer, applications are built in such a way that once multiple peers have discovered each other through some intermediary, they communicate directly, sharing either processing, data, or both. A client can become a server and a server can become a client.

Joost is an example of a P2P application, as is BitTorrent. There is no centralized place for data, and the same piece of data is typically duplicated across a network. Using a P2P application, I may get data from one site, which is then stored locally on my machine. Another person logging on to the P2P network can then get that same piece of data from me.

The power of this environment is that it can really scale. No one machine is burdened with all data requests, and a resource can be downloaded from many sources rather than just one. It is not a trivial architecture, though, and requires careful management to ensure that any one participant’s machine isn’t made overly vulnerable to hacking, that downloads are complete, that data doesn’t get released into the wild, and so on. Communication and network management is a critical aspect of a P2P application.

These are the architectures; at least, the ones I can think of off the top of my head. Which, then, becomes the ‘next’ Web, the Web 3.0 we seem to be reaching for?

Web 3.0

Ew Ew Ew! The next generation of the web must be Google’s cloud thing, right? So that makes Web 3.0 a P2P application, and we call it “Google’s P2P Web” or “MyData2MyData”?

Ah, no.

The concept of ‘cloud’ is from P2P.* It is a lyrical description of how data appears to a P2P application…coming from a cloud. When we make a request for a specific file, we don’t know the exact location the file is pulled from; chances are, it’s coming from multiple machines. We don’t see all of this, though, hence the term ‘cloud’. Personally, I prefer void, but that’s just semantics.

The term cloud has been adopted for other uses. Clouds are used with ‘tags’ to describe keyword searches, the size of the word denoting the number of requests. I read once that a writer called the entire internet a cloud, which seems too generic to be useful. Dare Obasanjo wrote recently on the discussions surrounding OS clouds, which, frankly, don’t make any sense at all and, methinks, use cloud in the poetic sense: artful rather than factual.

The use of ‘cloud’ also occurs with SOA, which probably explains Google’s use of the term. And Microsoft’s. And Apple’s, if they wanted, but they didn’t–being Apple (Stickers on our machines? We don’t need no stinking stickers!). Is the next web then called, “BigCo SOA P2P Web”?

Let’s return to Google CEO Schmidt’s use of the cloud, as copied from Carr’s post, mentioned earlier:

My prediction would be that Web 3.0 would ultimately be seen as applications that are pieced together [and that share] a number of characteristics: the applications are relatively small; the data is in the cloud; the applications can run on any device – PC or mobile phone; the applications are very fast and they’re very customizable; and furthermore the applications are distributed essentially virally, literally by social networks, by email. You won’t go to the store and purchase them. … That’s a very different application model than we’ve ever seen in computing … and likely to be very, very large. There’s low barriers to entry. The new generation of tools being announced today by Google and other companies make it relatively easy to do. [It] solves a lot of problems, and it works everywhere.

With today’s announcement of Google shared space, we’re assuming that Google thinks of third-party storage as ‘cloud’, similar to Microsoft with its Live SkyDrive or Apple with its .Mac. It’s the concept of putting either data or processes out on third-party systems so that we don’t have to store them on our local machines or lease server space to manage such on our own.

In Google’s view, Web 3.0 is more than ‘just’ the architecture: it’s small, fast applications built on an existing infrastructure (think Mozilla, Silverlight, Flex, etc.) that can run locally or remotely; on phones, handhelds, and/or desktop or laptop computers; that store data locally and remotely; built on web services running on one or many machines, created by one company or many. I guess we could call Google’s web the Small, Fast, Device Independent, Remote Storage, SOA P2P Web, which I will admit would not fit easily on a button, nor look all that great with ‘beta’ stuck to its ass.

Not to mention that it doesn’t incorporate all that neat ‘social viral’ stuff. (I knew I forgot something.)

The social viral stuff

What makes people think that Facebook or MySpace or anything of the like is ‘new’? Since the very first days of the internet we’ve had sites that have enabled social gathering of one form or another. The only thing the newer forms of technology provide is a place where one can hang one’s hat without having to have one’s own server or domain. That’s not ‘social’–that’s positional.

Google mentions how we won’t be buying software at the store. I had to check the date on the talk, because we’ve been ‘spreading’ software through social contact for years. Look in the Usenet groups and you’ll see recommendations for software or links to download applications. Outside of an operating system and a couple of major applications, I imagine most of us download our software now.

What Google’s Schmidt is talking about isn’t downloaded software so much as software that has a small installation footprint or doesn’t even need to be installed at all. Like, um, just like the software it provides. (Question: What is Web 3.0? Answer: What we’re selling.)

Anyone who has ported applications is aware of what a pain this is, but the idea of a ‘platformless’ application has been around as long as Java has, which is longer than Google. It’s an attractive concept, but the problem is you’re more or less tied into the company, and that tends to wear the shininess off ‘this’ version of the web–not to mention all that ‘not knowing exactly what Google is recording about us as we use the applications’ thing that keeps coming up in the minds of us paranoid few.

Is the next web then, the Small, Fast, Device Independent, Remote Storage, SOA P2P, Proprietary Web? God, I hope not.

Though Schmidt’s bits and cloudy pieces are a newer arrangement of technology, the underlying technology and the architectures have been around for some time: the only thing that really differs is the business model, not the tech. In this case, then, ‘cloud’ is more marketing than making. Though the data could end up on multiple sites, hosted through many companies, the Google cloud lacks both the flexibility and freedom of the P2P cloud, because at the heart of the cloud is…Google. I’ve said it before and will say it again: you can’t really have a cloud with a solid iron core.

Though ‘cloud’ is used about as frequently as lipstick at a prom, I don’t see the next generation of the web being based on Google’s cloud, or Microsoft’s. Or Adobe’s or Mozilla’s or Amazon’s, or any single organization’s.

If Google’s Web 3.0, or, more properly, Small, Fast, Device Independent, Remote Storage, SOA P2P, Proprietary, Web with an Iron Butt, is a bust, does this mean, then, that the Semantic Web is the true Web 3.0 after all?

Semantic Web Clouds…and stuff

Trying on for size: a Semantic Client/Server Web. Nope. Nope, nope, nope. Doesn’t work. There is no such thing as a semantic client/server. Or a semantic thin client, or even distributed semantics, or SOA RDF, though this one comes closest, while managing to sound like something that belongs on a Boy Scout badge.

Semantics on the web is basically about metadata–data about data. Our semantic efforts are focused on how metadata is recorded and made accessible. Metadata can be recorded or provided as RDF, embedded in a web page as a microformat, or even found within the blank spaces of an image.

We all like metadata. Metadata makes for smarter searches, more effective categorization, better applications, findability. If data is one dimension of the web, then metadata is another, equally important.

The semantic web means many things, but “semantic web” is not an application architecture, or a profoundly new way of doing business. Saying Web 3.0 is the Semantic Web implies that we’ve never been interested in metadata in the past, or that we’ve been waiting for some kind of solar congruence to bring together the technology needed.

We’ve been working with metadata since day one. We’ve always been interested in getting more information about the stuff we find online. The only difference now from the good old days of web 1.0 is we have more opportunities, more approaches, more people are interested, and we’re getting better when it comes to collecting and using the metadata. Then again, we’re also getting better with just the plain data, too.

Web 3.0 isn’t Google’s cloud and it isn’t the Semantic Web and it certainly isn’t the Small, Fast, Device Independent, Remote Storage, Viral, SOA, P2P, Proprietary, Smart Web with an Iron Butt. Heck, even Web 3.0 isn’t Web 3.0. So what is the next great Web, and what the devil are we supposed to call it?

Web 9.75

It is a proprietary thing, this insistence on naming things. “From antiquity, people have recognized the connection between naming and power”, Casey Miller and Kate Swift wrote.

We can talk about Web 1.0, or 2.0, or 3.0, but my favorite is Web 9.75, or Web Nine and Three-Quarters. It reminds me of the train platform in the Harry Potter books, which could only be found by wizards. In other words, only found by the people who need it, while the rest of the world thinks it’s rubbish.

There are as many webs as there are possible combinations of all technologies. Then again, there are as many webs as there are people who access them, because we all have our own view of what we want the web to be. Thinking of the web this way keeps it a marvelously fluid and ever-changing platform from which to leap unknowing and unseeing.

When we name the web, however, give it numbers and constrain it with rigid descriptions and manufactured requirements, then we really are putting the iron into the cloud; clipping our wings, forcing our feet down paths of others’ making. That’s not the way to open doors to innovation; that’s just the way to sell more seats to a conference.

Instead, when someone asks you what the next Web is going to be, answer Web 9.75. Then, when we hear it, we’ll all nudge each other, wink and giggle, because we know it’s nonsense, but no more nonsense than Web 1.0, Web 2.0, Web 3.0 or even Google’s Web-That-Must-Not-Be-Named.

*As reminded in comments, network folks initially used ‘cloud’ to refer to that section of the network labeled “…and then a miracle happens…”

Categories
Technology

SnagIt equivalent for Mac

I love SnagIt for the PC. I’m using it for this book, and I’ve included a description of it in the book, as one of the tools covered. It’s a great screen capture tool.

Only problem: no version for the Mac.

Does anyone have any suggestions for a comparable tool for the Mac? Other than Grab? What I’m looking for is a tool that not only does great screen captures in multiple ways (window, selection, timed, desktop, paged), but also provides post-capture annotation, such as nice-looking arrows, cursors, and graphics, as well as tasks such as select and magnify, and so on.

If it has a download trial or is shareware or even free, all the better.

More about SnagIt for Mac

Categories
Diversity, Technology

Caltech: Glimmer and Glomming

Recovered from the Wayback Machine.

Susan Kitchens points out that the number of women in the freshman class at Caltech has increased from 28.5 percent last year to 37 percent this year. That’s a significant rise, even though it doesn’t match other tech colleges (42 to 47 percent), or colleges in general (with 57 percent women).

Interesting how Caltech increased the enrollment of women:

Caltech officials said the school did not lower its admission standards, but did more actively and shrewdly recruit women this year.

For example, Caltech made its female applicants more aware that they could be physics majors but also study music and literature, said Rick Bischoff, director of undergraduate admissions.

“That’s not to say men are not interested in those issues,” but those seem to resonate more with women, Bischoff said.

In other words, Caltech made a specific decision to increase women’s participation, pursued it actively, and was successful. In some circles hereabouts, the feeling seems to be that actively recruiting women as participants is equivalent to ‘lowering’ the overall quality of the participants.

Susan, and the article, both mention the concept of ‘glomming’, where groups of young men at Caltech will follow a young woman around, lie in wait for her, and sit staring at her.

Personally, I think everyone participating in this should be expelled from school. Such juvenile behavior belongs in kindergarten, not college. Perhaps if these boys were encouraged to take literature and music, they might act like well-rounded and healthy men.

The only issue I have with all of this is that I hope that bringing more women into Caltech isn’t seen as a way of making the educational experience better for the men–you know, more dates for the poor geeks. We do not exist to keep you guys from feeling lonely.

We don’t exist for you guys at all.

Categories
Web

Controlling your data

Popular opinion is that once you publish any information online, it’s online forever. Yet the web was never intended to be a permanent snapshot, embedding past, present, and future in unbreakable amber, preserved for all time. We can control what happens to our data once it’s online, though it’s not always easy.

The first step is, of course, not to publish anything online that we really want to keep personal. However, times change, and we may decide that we don’t want an old story to appear in search engines, or that we don’t want MySpace hotlinking our images.

I thought I would cover some of the steps and technologies you can use to control your online data and media files. If I’ve missed any, please add in the comments.

Robots.txt

The granddaddy of online data control is robots.txt. With this you can control which search engine webbot can access which directory. You can even remove your site entirely from all search engines. Drastic? Unwelcome? As time goes on, you may find that pulling your site out of the mainstream is one way of keeping what you write both timely and intimate.

I discussed the use of robots.txt years ago, before the marketers discovered weblogging, when most people were reluctant to cut themselves off from the visitors arriving from the major search engines. We used to joke about the odd search phrases that brought unsuspecting souls to our pages.

Now, weblogging is much more widely known, and people arrive at our pages through all forms of media and contact. In addition, search engines no longer send unsuspecting souls to our pages as frequently as they once did. They are beginning to understand and manage the ‘blogging phenomenon’, helped along by webloggers and our use of ‘nofollow’ (note from the author: don’t use nofollow; it’s bad for the web). Even now, do we delight in the accidental tourists as much as we once did? Or is that part of a bygone innocent era?

A robots.txt file is a text file with entries like the following:

User-agent: *
Disallow: /ajax/
Disallow: /alter/

This tells all webbots not to traverse the ajax or alter subdirectories. All well-behaved bots follow these rules, and that includes the main search engines: Yahoo, Google, MSN, Ask, and that other guy, the one I can never remember.
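And if you do want to remove the entire site from all the search engines, as mentioned above, the entry is even simpler:

User-agent: *
Disallow: /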

The place to learn more about robots.txt is, naturally enough, the robots.txt web site.

If you don’t host your own site, you can achieve the same effect using a META element in the head section of your web page. If you’re not sure where this section is, use your browser’s View Source capability: anything between opening and closing “head” tags is the head section. Open mine and you can see the use of a META element. Another example is:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This tells web bots to not index the site and not harvest links from the site.

Another resource you might want to protect is your images. You can tell search engines to bypass your images subdirectory if you don’t want them picked up in image search. This technique doesn’t stop people from copying your images, which you really can’t prevent without using Flash or some other strange web-defying move. You can, however, stop people from embedding your images directly in their web pages, a practice known as hotlinking.

There are good tutorials on how to prevent hotlinking, so I won’t cover it in depth here. Search on “preventing hotlinking” and you’ll see examples, both in PHP code and in .htaccess.
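As a quick sketch of the .htaccess approach (this assumes Apache with mod_rewrite enabled; example.com stands in for your own domain, and the extensions listed are just the common image types):

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]

The first condition lets direct requests (no referrer) through; the second blocks any request for an image that comes by way of someone else’s page.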

Let’s say you want to have the search engines index your site, but you decide to pull a post. How can you pull a post and tell the search engines you really mean it?

410 is not an error

There is no such thing as a permanent, fixed web. It’s as fluid as the seas, as changeable as the weather. That’s what makes this all fun.

A few years back, editing or deleting a post was considered ‘bad form’. Of course, we now realize that we all change over time and a post that seemed like a good idea at one time may seem a terrible idea a year or so later. Additionally, we may change the focus of our sites: go from general to specific, or specific back to general. We may not want to maintain old archives.

When we delete a post, most content management tools return a “404” when the URL for the page is accessed. This is unfortunate, because a 404 tells a web agent that the page “was not found”. An assumption could be made that it’s temporarily gone; the server is having a problem; a redirect is not working right. Regardless, a 404 carries the assumption that the condition will be cured at some point.

Another 4xx HTTP status is 410, which means that whatever you tried to access is gone. Really gone. Not just on vacation. Not just a bad redirect, or a problem with the domain–this resource at this spot is gone, g-o-n-e. Google considers these an error, but don’t let that big bully fool you: this is a perfectly legitimate status and state of a resource. In fact, when you delete a post in your weblog, you should consider adding an entry to your .htaccess file to note that this resource is now 410.
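For a single deleted post, such an entry is one line (the path here is purely hypothetical; use the path portion of the deleted post’s URL):

Redirect gone /2005/06/some-old-post/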

I pulled a complete subdirectory and marked it as gone with the following entry in .htaccess:

Redirect gone /category/justonesong/

I tried this on an older post and sure enough, all of the search engines pulled their reference to the item. It is, to all intents and purposes, gone from the internet. Except…

Except there can be a period where the item is gone but cache still remains. That’s the next part of the puzzle.

Search Engine Cache and the Google Webmaster Toolset

Search on a term and most results have a couple of links in addition to the link to the main web page. One such link is for the cache for the site: a snapshot of the page from the last time the webbot stopped by.

Caching is a handy thing if you want to ensure people can access your site. However, caching can also perpetuate information that you’ve pulled or modified. Depending on how often the search engine refreshes the snapshot, it could reflect a badly out-of-date page. It could also reflect data you’ve pulled for a specific reason.

Handily enough, as I was writing this, I received an email from a person who had written a comment on my weblog in 2003 and who had typed out his URL of the time and an email address. When he searched on his name, his comment in my space showed up on the second page of results. He asked if I could remove his email address from the comment, which was simple enough.

If this item had still been cached, though, his comment would have remained in the cache with his email address until that page was refreshed. As it was, it was gone instantly, as soon as I made the change.

How frequently older pages such as these are accessed by the bots really does vary, but when I tested with some older posts of other weblogs, most of the cached entries were a week old. Not that big a deal, but if you want to really have control over your space, you’re going to want to consider eliminating caching.

To prevent caching, add the NOARCHIVE meta tag to your header:
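<META NAME="ROBOTS" CONTENT="NOARCHIVE">

If you only want to keep one search engine’s cache at bay, you can name a specific bot, such as GOOGLEBOT, in place of ROBOTS.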

To have better control of caching with Google, you need to become familiar with the Google Webmaster tools. I feel like I’ve been really picking on Google lately. I’m sure this will adversely impact its share price, and bring down searching as we know it today–bad me. However, I was pleased to see Google’s addition of a cache management tool within the Google Webmaster tool set. This is a useful tool, and since there are a lot of people who have their own sites and domains but aren’t ‘techs’ (in that they don’t do tech for a living or necessarily follow sites that discuss such tools), I thought I’d walk through the steps of controlling search engine caching of your data.

So….

To take full advantage of the caching tool, you’ll need a Google account, and access to the Webmaster tools. You can create an account from the main Google page, clicking the sign in link in the upper right corner.

Once you have created the account and signed in, from the Google main page you’ll see a link that says, “My Account”. Click on this. In the page that loads, you can edit your personal information, as well as access Gmail, Google Groups, and, for the purposes of this writing, the Webmaster toolset.

In the Webmaster page, you can access domains already added, or add new domains. For instance, I have added burningbird.net, shelleypowers.com, and missourigreen.com.

Once added, you’ll need to verify that you own the domain. There are a couple of approaches: add a META tag to your main web page, or create a file with the same name as a key Google generates for you. The first approach is the one you want to use if you don’t provide your own hosting, such as if you’re hosted on Blogger, TypePad, or WordPress.com. Edit the header template and add the tag, as Google instructs. To see the use of a META tag, you can view source for my site and you’ll see several in use.

If you do host your site and would prefer another approach, create a text file with the same name as the key that Google generates for you when you select this option. That’s all you need with the file: that it be named the name Google provides–it can be completely empty. Once created, use FTP or whatever technique you prefer to upload it to the site.

After you make either of these changes, click the verify link in the Webmaster tools to complete the verification. Now you have established with Google that you are, indeed, the owner of the domain. Once you’ve verified the site, clicking on each domain URL opens up the toolset. The page that opens has tabs: Diagnostic, Statistics, Links, and Sitemaps. The first three tabs most likely will have useful information for you right from the start.

Play around with all of the tabs later; for now, access Diagnostic, and then click the “URL Removal” link on the left side of the page. In the page that opens, you’re given the chance to remove links to your files, subdirectories, or your entire site at Google, including removing the associated cache. You can also use the resource to add items back.

You’ve now prevented webbots from accessing a subdirectory, told the webbots a file is gone, and cleaned out your cache. Whatever you wrote and wish you didn’t is now gone. Except…

Removing a post from aggregation cache

Of course, just because a post is removed from the search engines doesn’t mean that it’s gone from public view. If you supply a syndication feed, aggregators will persist feed content for some period of time (or some number of posts). Bloglines persists the last 100 items, and I believe that Google Reader persists even more.

If you delete a post, then to ensure the item is removed from aggregator caches, what you really need to do is delete the content for the item and then re-publish it. This ‘edit’ then overwrites the existing entry in the aggregator cache.

You’ll need to make sure the item has the same URL as the original posting. If you want, you can write something like, “Removed by author” or some such thing — but you don’t have to put out an explanation if you don’t want to. Remember: your space, your call. You could, as easily, replace the contents with a pretty picture, poem, or fun piece of code.

Once the item is ‘erased’ from aggregation, you can then delete it entirely and create a 410 entry for the item. This will ensure the item is gone from aggregators AND from the search engines. Except…

That pesky except again.

This is probably one of the most critical issues of controlling your data, and no one is going to be happy with it. If you publish a full-content feed, your post may be picked up by public aggregators or third-party sites that replicate it in its entirety. Some sites duplicate and archive your entries, and allow both traversal and indexing of their pages. If you delete a post that would no longer be in your syndication feed (it’s too old), there’s no way to effectively ‘delete’ the entry from these sites. From my personal experience, you might as well forget asking them not to duplicate your feeds — with many, the only way to prevent such is to either complain to their hosting company or ISP, or to bring in a lawyer.

The only way to truly have control over your data is not to provide full-content feeds. I know, this isn’t a happy choice, but as more and more ‘blog pirates’ enter the scene, it becomes a more viable option.

Instead of full content, provide an excerpt: as much of an excerpt as you wish to persist permanently. Of course, people can manually copy the post in its entirety, but most people don’t. Most people follow the ‘fair use’ aspect of copyright and quote part of what you’ve written.
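In RSS 2.0 terms, the difference comes down to what you put in each item. A sketch (the title and link here are placeholders):

<item>
  <title>A sample post</title>
  <link>http://example.com/2007/a-sample-post</link>
  <description>The first paragraph or two of the post, and nothing more.</description>
</item>

A full-content feed would also carry the entire entry, typically in a content:encoded element; leave that out, and only the excerpt persists in the aggregators.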

There you go, some approaches to controlling your data. You may not have control over what’s quoted on other web sites based on fair use, but that’s life in the internet lane; which returns us to item number one in controlling your data–don’t publish it online.