June 12th, 2007

Google answered the European Union's privacy concerns with a response that includes a decision to begin to 'anonymize' search results at 18 months. Rather than allaying concerns, the company's response only raises more questions.

Let's leave aside, for a moment, what other companies do and do not do. Whatever ends up happening with Google with regard to data storage will probably greatly affect other companies. Google, in this case, becomes the bellwether when it comes to storing 'incidental data' such as searches, ad clicks, map uses, emails, and so on.

Google's reasoning for storing data doesn't really address the concerns about privacy, and neither do its vague answers to questions about the use of cookies, how long personal information is kept once the original data is deleted, and whether and how profiling is happening–not to mention what kind of data is stored relating to Reader, Analytics, and so on.

For instance, Google uses its spell checker as one example of the 'further' research that requires storing data for such long periods of time. This makes no sense, though, because if a person uses one spelling at one time, chances are they'll use alternative spellings within a short time afterwards because the original spelling won't return what they need. Do people really do multiple searches using different spellings 18 months apart?

As for the efficiency of the search, yes, clicking on the first item would show the search worked. But does a person make a search, keep the results up on their screen for 18 months, and then click through?

I was also surprised to learn that preparations for a DoS (Denial of Service) attack are visible in search engine results for months before the attack. This leads to another question: does this happen a lot? And what do the search result patterns show? A lot of people looking up "How do I create a Denial of Service Attack" or "DoS for Dummies"? I'm neither a search engine wiz nor a security expert. I guess this is all beyond my abilities and understanding.

With regard to detecting click fraud, again, I would assume something like this would become apparent at some point, and storing information related to such an event makes sense. But what about the world of data completely outside such patterns?

As for having to maintain this data for 18 months because of 'government' regulations: which ones? Google keeps mentioning these, but the ones the company references in Europe have to do with ISPs, not search engines. The US laws mentioned in the response are focused on financial transactions, and the data storage needs here have to do with storing data until invoices are paid–exactly how long does it take Google to pay people?

That the Justice Department and others in the US talk about storing data for years is just that: talk. Until and unless laws are enacted, we have to remember that the current Justice Department in this country is still acting under the shotgun reactions of a paranoid idiot who isn't smart enough to be hired to clean Google's floors. Times are changing: we won't always live in this constant state of generated fear.

What was fascinating was Google's claim that it can only support one global privacy standard. Does that mean that if Tuvalu passes a law that search engine companies must retain raw search results and other personal data for two years (or five or ten), Google is then going to use this to establish its privacy requirements for, say, the US, Japan, Europe, and all points from there?

"Well, there we go. It's out of our hands now. Time to build another dam for another data center. Say, the Mississippi looks like it would really drive some turbines–what do you people in Missouri feel about being known as 'Lake Google'?"

Search results, cookies, unknown data collection patterns, amount of profiling, types of profiling, persisting data even when accounts are deleted–instead of accusing people of 'wearing tin hats' for asking legitimate questions about data retention, it's time for Google to put away its PR and its team of lawyers and have an honest discussion with the people who helped it become the multi-billion dollar success it is. This is not asking the company to give away corporate secrets and unveil its deepest, darkest algorithms. This is asking for specifics, when all we have been given in the past is vague generalities. Better yet, this is asking to let us have some say in all of this.

These questions and concerns raised this week are not going away. They are going to persist. Probably about as long as the data Google stores.

Comments
1
2
Bud Gibson - 8:03 pm 6/12/2007

As for having to maintain this data for 18 months because of 'government' regulations: which ones? Google keeps mentioning these, but the ones the company references in Europe have to do with ISPs, not search engines. The laws mentioned in the response having to do with the US are focused on financial transactions, and the data storage needs here have to do with storing data until invoices are paid–exactly how long does it take Google to pay people?

I think you hit the nail on the head. Google is not just a search company. By my reckoning, it does all of these things. The issue is likely broader than just Google. It touches on our general control of the information cloud around us.

3
Shelley - 8:40 pm 6/12/2007

nurbick, now _that_ is an interesting comment thread. Danny Sullivan should realize when he's really lost an argument.

Bud, the biggest issue isn't so much that Google does many things. It's that it is incredibly opaque about what it does about data for each of its many operations.

4

I think you meant "bellwether" (which, by the way, is also the title of a great novel by Connie Willis) not "bell ringer".

For instance, Google uses its spell checker as one example of the 'further' research that requires storing data for such long periods of time. This makes no sense, though, because if a person uses one spelling at one time, chances are they'll use alternative spellings within a short time afterwards because the original spelling won't return what they need. Do people really do multiple searches using different spellings 18 months apart?

Part of what concerns me in this whole debate (on both sides) is the constant conflation between retained data and non-anonymized data.

For the spell checking, for example, it is perfectly plausible that for less common words and their misspellings, more than 18 months of data may be required to come to statistically significant conclusions, but that isn't to say that the necessary data can't be anonymous.

However, storing older spelling attempt clusters stripped of any identifying information presupposes that you know what it is that you're looking for and can tease that out and store it prior to the anonymization.

If you don't yet know what it is that you are going to want to look for, you can't anonymize the data without potentially losing the correlations you will need.

So, what Google is saying here is that for future, currently unanticipated R&D projects, less than 18 months of identifiable data may not be enough to tease out those correlations.

Personally, I'm actually inclined to believe them, but weighing Google's opportunity cost (i.e., new 20-percent projects needing longer lead times to collect and store enough anonymized correlations) against the potential societal cost of deliberate or accidental breaches of privacy, I'm inclined to say that Google is now more than large enough to just suck it up and deal. If it takes six more months to collect the data necessary to make some new feature work, then so be it. We actually know that Google is going to be around that long. It is no longer a small, vulnerable startup scrabbling for a toehold in the market.

Dominant vendors must be held to higher standards than the rest of the market, and this does not only apply to antitrust considerations.

5
Shelley - 8:48 pm 6/13/2007

Michael, thanks for the correction.

Good point on how you can determine whether 18 months is enough if you don't know what you're looking for.
