May 22nd, 2007

Now that I have your attention…

As part of my site upgrades, I'm exploring using OpenID in order to enable comment edits for those people who log in using the OpenID identity system. When I searched in Google on "Wordpress OpenID", to look for plugins, one of the sites returned on the front page of the results is one of the million or so sites that Google now identifies as a malware site. Clicking the link to the site could well load all sorts of crap onto your PC.

Example of malware Google Search

Identifying potentially harmful sites such as these is part of Google's new security effort. The only thing you don't know, though, is whether the site really is dangerous, or is a false positive of Google's security algorithm–you don't want to click through to find out.

Links to sites such as this, which from the URL look like a harmless weblog post, are probably the number one vulnerability in weblog comments. I want to know if I can tap into Google's database to identify the URLs for these sites so I can determine if a link should be scrubbed or highlighted as potentially risky in my comments.
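To make the idea concrete, here is a rough Python sketch of the scrubbing step, assuming a list of flagged hosts were somehow available. Everything in it is hypothetical: the `FLAGGED_HOSTS` set, the function names, and the replacement text are placeholders, and nothing here reflects an actual Google interface.

```python
import re
from urllib.parse import urlparse

# Hypothetical local set of flagged hosts. The post does not establish
# that Google exposes such a list; this stands in for whatever source exists.
FLAGGED_HOSTS = {"malware.example"}

# Deliberately simple URL matcher for plain-text comments.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def is_flagged(url, flagged_hosts=FLAGGED_HOSTS):
    """Return True if the URL's host is a flagged host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in flagged_hosts)


def mark_risky_links(comment_text, flagged_hosts=FLAGGED_HOSTS):
    """Replace flagged URLs in a comment with a warning; leave other URLs alone."""
    def replace(match):
        url = match.group(0)
        if is_flagged(url, flagged_hosts):
            return "[link removed: flagged as potentially harmful]"
        return url

    return URL_PATTERN.sub(replace, comment_text)
```

Matching on the host rather than the full URL is a design choice for this sketch: a harmless-looking post path on a flagged host is still treated as risky.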

Of course, that then raises the question: how accurate is Google's algorithm when we can't, yet again, see what factors it bases its decision on? There's a major difference between giving a site a low page rank and marking a site as potentially harmful.

Comments
1
Evan - 2:13 pm 5/22/2007

I don't know about malware, but it wouldn't be too hard to hack something together for phishing. http://jon.oberheide.org/blog/2006/11/13/google-safe-browsing/ should get you started.

2
Michael - 2:43 pm 5/22/2007

You can certainly find out if Google thinks that your sites contain malware:

http://googlewebmastercentral.blogspot.com/2006/11/badware-alerts-for-your-sites.html

I don't think it says specifically what triggers the warning, but it should be a good start.

3
Melinda - 3:41 pm 5/22/2007

Allow readers to edit their comments here? Oh that would be fantastic.

4
Shelley - 7:05 pm 5/22/2007

I had it at one time, but wanted to try something more rigorous.

5
Doug Alder - 10:33 pm 5/22/2007

Interesting - McAfee SiteAdvisor, when installed, will show you security results for each site that shows up in Google. I wonder if they are using the same database - it would be interesting to compare results.

6
Doug Alder - 10:40 pm 5/22/2007

OK I just went and compared and now it gets even harder to know - they obviously are not using the same database because for the the-notebook.org/12/01/2006/openid-comments-for-wordpress site Google says "This site may harm your computer." whereas McAfee says

??????? ??????? » OpenID Comments for Wordpress…
the-notebook.org

We've tested this site and found no significant problems.
1 green download

7
Doug Alder - 10:41 pm 5/22/2007

OK I just went and compared and now it gets even harder to know - they obviously are not using the same database because for the the-notebook.org/12/01/2006/openid-comments-for-wordpress site Google says "This site may harm your computer." whereas McAfee says

OpenID Comments for Wordpress… the-notebook.org

We've tested this site and found no significant problems.
1 green download
1 e-mail/ month
Linked to green sites

Green being McAfee's way of saying everything is OK

Your guess is probably better than mine :)

8
Mark - 5:11 pm 5/23/2007

Um, did anyone bother to view-source on the home page of the-notebook.org? It contains hidden iframes that load content from traffic-converter.biz and xhpyldpdbk.com. Dunno if that was put there by the blog author or whether the site was hacked somehow, but it's there clear as day.

I tested in Firefox with NoScript and other precautions; I don't have a virtual machine I'm willing to throw away to test in IE, but I'd bet real money that if you weren't patched to the gills (and maybe even if you were) that you'd end up with some nasty shit on your PC.

There is an argument to be made against Google policing the web for malware, but this site isn't helping your argument.
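For anyone who would rather not eyeball the page source, the view-source check can be roughly approximated in Python. This is only a sketch under stated assumptions: the notion of "hidden" used here (zero or one pixel dimensions, or an inline `display:none` / `visibility:hidden` style) is a guess at common tricks, the class and function names are made up for illustration, and it says nothing about how Google or McAfee actually test sites.

```python
from html.parser import HTMLParser


class HiddenIframeFinder(HTMLParser):
    """Collect the src of iframes that look hidden (heuristic, not exhaustive)."""

    def __init__(self):
        super().__init__()
        self.hidden_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        # Attribute names arrive lowercased; valueless attributes come as None.
        attributes = {name: (value or "") for name, value in attrs}
        style = attributes.get("style", "").replace(" ", "").lower()
        # Assumed heuristics: tiny dimensions or an inline hiding style.
        tiny = (attributes.get("width") in ("0", "1")
                or attributes.get("height") in ("0", "1"))
        styled_hidden = "display:none" in style or "visibility:hidden" in style
        if tiny or styled_hidden:
            self.hidden_sources.append(attributes.get("src", ""))


def find_hidden_iframes(html_text):
    """Return the src values of iframes that match the hiding heuristics."""
    finder = HiddenIframeFinder()
    finder.feed(html_text)
    return finder.hidden_sources
```

It only parses HTML you have already saved as text, so nothing on the page is executed. Iframes injected by script or hidden through an external stylesheet would slip past it.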

9
Shelley - 5:32 pm 5/23/2007

Sure it's a bad site. I don't have a problem with Google providing this helpful information. But again, we don't know what triggers this type of behavior, and what could inadvertently push a site into the 'harmful' zone. Google could say they don't want to provide this info, because it will aid the enemy. I think this is a different beastie, though, than those folks who 'game' Google for page rank.

Viewing the source on this one specific example really isn't the point on this post, Mark.

10
Mark - 7:26 pm 5/23/2007

Shelley said: "The only thing you don't know, though, is whether the site really is dangerous, or is a false positive of Google's security algorithm."

Doug said: "OK I just went [to SiteAdvisor] and compared and now it gets even harder to know."

My point is that it is not, in fact, "hard to know" why this site is listed as dangerous. And it is not, in fact, a false positive. And it took me all of 30 seconds to determine this, a level of effort which appears to be beyond you and your readers.

(BTW, I can't even begin to comment on the breathtaking double-talk involved in the statement, "I want to know if I can tap into Google's database to identify the URLs for these sites," when your latest post complains that you're canceling your Feedburner account because you don't want to give Google any more information. You can't have it both ways. Or did you expect Google to let you download a static list of URLs? Do you have any idea how fast that list changes? Providing anything other than an up-to-the-nanosecond web service would be even more dangerous than providing nothing at all. So which is it? Do you want to send Google the URLs of every weblog commenter on your site, in real-time? Or do you want to shut the fuck up?)

11
Shelley - 7:43 pm 5/23/2007

Mark, you're deliberately obfuscating the issue.

Yes, this site is 'bad'. If nothing else, the Russian probably gives you that idea. But what does Google use to make this determination? Does it send out Google employees, each with NOSCRIPT and ready to view source? That's a million sites. That's a lot of sites.

Wanting to know what factors it uses to determine these sites is not a bad thing to ask. In other words — open source the algorithm, let us see what is used so we can either help Google improve it, or if it can't be improved and is so good, use the same when protecting the URLs in our sites.

And if a person's site is targeted as 'harmful' they'll have a good idea of what caused such.

For a person who is so behind open source, I would expect you to applaud this.

As for your comment, "Shut the fuck up", you're no longer welcome here. Go take your Google bought soul and pander your condescending bullshit elsewhere.

Do. Not. Ever. Return. Here.

12

Shelley, I don't think there is 'an algorithm' per se. There may even be several independent systems that vote according to some weighted scheme that is dynamically adjusted. Any system like this that is capable of dealing with the quantity and variety of data that Google must is probably at least semi-opaque in its operation to its builders, and likely will cease working rather quickly if the builders are not continuously tweaking it.

Consider the contortions Netflix went through to set up a fair competition for improved movie recommendations that was objectively and measurably 'better'. And that is in a situation where there is no-one actively hostile trying to make the recommendations worse.

I really don't think that Google can 'open source' in any meaningful sense the parts that would be of most interest from a public policy perspective. At most they can contribute chunks of infrastructure code that make devising, running, and adjusting such systems possible.

13
Shelley - 10:46 pm 5/23/2007

Michael, Google provides a paper that discusses their research in this regard. But am I responding to this post about the new malware detection, or the post on Feedburner? I've had both addressed here in comments.

I think Google is providing a service with this, but I think it's ultimately less than useful if there's no way to use the same functionality to test URLs in other situations. Perhaps the company doesn't care except as regards its search, and that's fair.

But it provides a false sense of security–use Google and you need never fear again. A better approach would be to train people in how to protect themselves regardless of where they 'click that link'. But that's a social solution, not an algorithmic one. That one embeds the power within ourselves, and not an external agency.

I also see where something such as this could be very damaging to a perfectly innocent site. And I wonder if this is step 1, what will be the next step in this new security initiative?

14

Well, if we've disposed of the notion that the 'algorithm' can be open-sourced, then what remains is your request that this service be exposed as an API. This would be useful, I think, even though it does give Google more data, because a weblog system that utilized this need not test every link in a comment, just those that are suspect for some other reason (anonymous or newly registered users, for example).

One thing I'd like to see is more organizations implementing services that can be accessed by clones of various Google APIs. Regardless of how useful Google's services are, it is never smart to have a single point of failure in a system, and it ought to be trivial to switch to a new endpoint for a service.
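A rough Python sketch of both points above: only links from suspect commenters get tested, and the checking service is passed in as a plain function so it can be swapped for a different endpoint. The `Commenter` fields, the seven-day threshold, and the `check_url` callable are all assumptions made for illustration; no real Google API is being described.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Commenter:
    # Hypothetical fields; a real weblog system would have its own user model.
    is_anonymous: bool
    days_registered: int


def needs_link_check(commenter: Commenter, new_account_days: int = 7) -> bool:
    """Treat anonymous or newly registered commenters as suspect (assumed rule)."""
    return commenter.is_anonymous or commenter.days_registered < new_account_days


def risky_links(commenter: Commenter,
                links: Iterable[str],
                check_url: Callable[[str], bool]) -> List[str]:
    """Return links the checker reports as harmful, skipping trusted commenters.

    check_url is any callable returning True for a harmful URL, so the
    service behind it can be replaced without touching this code.
    """
    if not needs_link_check(commenter):
        return []
    return [link for link in links if check_url(link)]
```

With this shape, links from established commenters are never sent to the external service at all, which limits how much data leaves the site.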

15
Shelley - 12:08 am 5/24/2007

I don't think we've disposed of any such thing, Michael. I think it's only fair that if Google touts its support for open source, it put its own collective money where its mouth is.

However, this post wasn't about algorithms. It was about Google's new security initiatives, including highlighting supposed malware sites. We don't need a new web service to test every URL that comes across our path. What we do need is an understanding of how to secure our systems in such a way that clicking on any link won't cause permanent damage.

16
Karl - 6:46 am 5/24/2007

Hey Mark, "Do you want to send Google the URLs of every weblog commenter on your site, in real-time? Or do you want to shut the fuck up?)"

Do you think any employer will appreciate you talking to any customer like that?

17

Um, Shelley, as I said, I don't think Google could open source this particular feature in any meaningful way.

I do think they could release more code than they do, but in all likelihood most of the code behind their value-added inferences will be dependent on Google's unique scale to work.

In terms of educating people to secure their systems… Google sponsors StopBadWare.org, for example. Is that the sort of thing you mean?

18
Shelley - 11:28 am 5/24/2007

Michael, you're repeating your response but I'm saying different things. Feel free to continue, but don't be offended if I decide not to respond back. I believe that Google could share a hell of a lot more about how it operates, and chooses not to. You might think it acceptable, I don't. We disagree.

But honestly, I'm not so dense in technology that you have to repeat yourself so that I 'get it'.

Yes, informative sites such as stopbadware.org that provide good solid information about how one can prevent malware are helpful. It's one of several. I'm not sure about their 'enter a malware site' approach, because after all, as has been pointed out, the list of malware sites changes constantly.

Nor do I think Google's malware indicator is particularly helpful — if I really wanted to spread a link to a malware site, I'd embed it in comments such as these.

19

Shelley, I don't disagree that Google could and should share more (both information and code), I just felt this wasn't a good example of where such sharing was lacking.

As for spreading malware links retail through comments, that's mainly a tactic used to raise the rank of those URLs in search listings. Sites with enough traffic to make them attractive as direct infection vectors per se usually police, filter, or sanitize their comments fairly diligently too.

So, reducing the payoff of the higher search ranking is what Google is basically doing here. I wouldn't be surprised if increasing the cost of getting the higher ranking (probably accomplished with something like the service you suggested) in the first place came next, most likely rolled out into Blogger first.
