Categories
Social Media Weblogging

WordPress and the hidden articles

Recovered from the Wayback Machine.

An interesting story appeared today about the WordPress site, and the several thousand articles that could be found at http://wordpress.org/articles.

Disclaimer. I’m hesitant to even write about this, knowing the web’s fondness for angry mob justice, but I feel like it’s an important issue that needs to be addressed. My one request: please be calm and rational. WordPress is a great project, and Matt is a good guy. Think before piling on the hatemail and flames.

The Problem. WordPress is a very popular open-source blogging software package, with a great official website maintained by Matt Mullenweg, its founding developer. I discovered last week that since early February, he’s been quietly hosting almost 120,000 articles on their website. These articles are designed specifically to game the Google AdWords program, written by a third party about high-cost advertising keywords like asbestos, mesothelioma, insurance, debt consolidation, diabetes, and mortgages. (Update: Google is actively removing every article from its results. You can still view about 25,000 results on Yahoo. Or try this search tool, which searches multiple Google datacenters.)

(Several links within the original material.)

From comments left, it would seem that the content with the links to the articles is hidden within the WordPress main page, passing the high Google rank the site gets on to the articles themselves, while still not providing a visible indication of this on the site page.

<div style="text-indent: -9000px; overflow: hidden;">
<p>Sponsored <a href="/articles/articles.xml">Articles</a> on <a href="/articles/credit.htm">Credit</a>, <a href="/articles/health-care.htm">Health</a>, <a href="/articles/insurance.htm">Insurance</a>, <a href="/articles/home-business.htm">Home Business</a>, <a href="/articles/home-buying.htm">Home Buying</a> and <a href="/articles/web-hosting.htm">Web Hosting</a></p>

</div>

Since the words used in the pages are high ‘rate’ words within the Google AdSense program, we can assume this could be lucrative to the company that provided the articles. According to Matt’s response in a thread at the WP support forum, WordPress itself received a set fee for hosting the articles.

How much? Well, enough to hire the first employee of WordPress, Inc.

I am not one of those who believe that the only decent open source project is one where the people do the work only as a labor of love. I don’t think there’s anything wrong with people making money from their art. But of course, I would say this, as I try to put together an online store with goods featuring my photos, while still trying to find buyers for my books and articles–and after I had added, and pulled, Google Ads.

It’s all very good to say, “We should do this because we love to do it”. But it’s hard to be motivated to write and create when one is worried about what the next month holds. Noble to say, “Well, I would deliver pizza if need be, to keep my art free of contamination.” Tell me, though: how many of you have delivered pizza? Want to try it at 50?

Still, I can also see that there’s been a dimming of the joy of this medium, as more and more people turn to these pages as a way to make a buck. What did Jonathon Delacour write, in a nice twist on Talleyrand?

Those who did not blog in the years before the revolution cannot know what the sweetness of blogging was.

Very sweet, indeed. Sweet and impossible–a castle made of spun sugar.

But to return to the story, this is about WordPress and what amounts to actions that could be considered scamming Google.

Google is now removing all of the articles from its databases, but one could say that the company was hoist on its own petard (following along with English usage that Talleyrand would appreciate) with this action–its own pagerank was used against the company. Perhaps if it weren’t so easy to game, events like this wouldn’t occur.

Still, this is using weblogs to play the system, and not really different from what the comment spammers do, though at least this isn’t in our space.

I learned about the WordPress article through Stavros who wrote:

I challenge you to think about the creative output of artists and artisans whose work has touched you. Think of your favorite books, your favorite paintings. That piece of handmade furniture or that gloriously handtooled little application. The music you listen to or the writers-on-the-web you read because they get into your heart and fill you with the ineffable, simple joy of being alive and having a mind. I wonder how many of them would have done their work whether or not they eventually got paid for it. My guess is ‘most’.

I’m not saying that people shouldn’t be paid. Hell, if I could get paid for making the things I make because there’s something inside me that impels me to do it, I’d be thrilled. It’d be a dream come true, by crikey. But I do it, regardless. And so do you, probably, if you’re reading this.

For some reason I’m reminded of Michelangelo and the Sistine Chapel. Michelangelo didn’t like to paint; he preferred sculpture. He didn’t even want to do the work, and only did so after pressure from the Pope. And then there was the fee.

There’s art, and then there’s art.

Bottom line is: do you like WordPress? Do you like using WordPress? Can you still get it for free? Is it still GPL? Then perhaps that’s what should be focused on, and however or whatever Matt does with the WordPress page is between him and Google; because what matters is the code, not the purity of actions peripheral to the code, or its release.

I am also reminded of the story of the Roman general returning in triumphant parade through the city after a great victory; and the man who stood behind him in the chariot, holding the victory wreath made of leaves over his head. “Thou art mortal”, he would whisper, over and over again into the general’s ear, as a reminder that no matter how great the triumph, how beloved of the people, the general is, after all, only human.

update

WordPress, Inc.’s first employee on this issue.

Categories
Social Media

Google and bad banning

I dislike banning. I dislike blacklists based on proxy, domain, IP address, and keyword. No matter how sophisticated the applications that support blacklisting and no matter how well-intentioned the sites doing the banning, someone innocent always gets hurt.

My favorite banning story so far is from Jonas Luster’s weblog where he talks about showing some law enforcement people WordPress, only to discover that the San Diego Police Department was on the Real-Time Spam Blacklist. My less than favorite banning story was when the dedicated server I was leasing ended up on SPEWS–another blacklist.

A current favorite now is to ban comments or trackbacks that come in through open proxies, since comment spammers use these to post comments. Unfortunately, open proxies can be found at libraries and schools, and have even been used to route around censorship in countries like China.

I wouldn’t be as critical of blacklisting if it weren’t for one thing: once you’re listed, it can become almost impossible to get de-listed. Most of the blacklisting organizations assume you’re guilty until proven innocent, and you almost have to have an act of Congress to be proven innocent. Well, since our sites aren’t hooked up to a feeding tube, the latter is unlikely to happen. Then you go through weeks, months, even years, trying to get your site cleared so you can send email or post comments.

It would seem that Google also fits in the guilty until proved innocent camp. Karl Martino from paradox1x wrote the following last week:

Help me please – PhillyFuture was probably banned from Google

I’ve had the domain back for one year. Googlebot has not come to index the site. After exhausting all other reasons, I suspect that Google banned phillyfuture.org from its index. Remember – the preceding year a porn company had it and was using it for redirection.

If anyone out there can help me – please – please do.

(Philly Future is Karl’s excellent community weblog and site for Philadelphia weblogs.)

Come on Google, a whole year to fix a problem? What do we have to do, use comment spam to get it listed?

(Thanks to Rogi for pointing this out, which also reminded me to update my subscription at Bloglines to the correct feed at Karl’s. Oh, and I did the background graphics, and thanks for the compliment, Rogi!)

Categories
Social Media

Search Engine antics

Another couple of tech issues appeared several times in my overworked aggregator: Google’s AutoLink and Yahoo’s API.

As soon as I read about the Yahoo API, I knew I wanted to try it out with the new site. If you look at the bottom of the sidebar, you’ll see several links that use the API to pull back search data and then format it within the existing site look. I plan on changing the topic of each search whenever a new and interesting one comes to mind, but for now, you can see the results for searching on orchids among images; about Social Security in the news; check out what’s happening with tagback in the web; and for all of my political friends, a whole mess’a Jon Stewart videos.
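The sidebar links are just search results poured into the site’s existing markup. Here is a minimal sketch in Python of that formatting step, assuming the results have already been fetched from the API; the `url` and `title` field names are my own illustration, not Yahoo’s actual response schema:

```python
# Format already-fetched search results as a sidebar link list.
# The "url" and "title" field names are illustrative assumptions,
# not the actual Yahoo API response schema.

def format_sidebar(results, heading):
    """Render a list of search hits as an HTML heading plus link list."""
    items = "\n".join(
        f'<li><a href="{r["url"]}">{r["title"]}</a></li>' for r in results
    )
    return f"<h3>{heading}</h3>\n<ul>\n{items}\n</ul>"

hits = [{"url": "http://example.com/orchids", "title": "Orchid images"}]
print(format_sidebar(hits, "Orchids"))
```

Since the markup is generated on my side, the results inherit the site’s look instead of arriving pre-styled by the search engine.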

This capability will be built into Wordform as part of the new metadata functionality. It’s not major tech, but it’s fun.

What’s also been fun is reading all the different reactions to Google’s AutoLink. Dave Winer doesn’t like it:

The AutoLink feature is the first step down a treacherous slope, that could spell the end of the Web as a publishing environment with integrity, and an environment where commerce can take place.

Cory Doctorow loves it, though I think his analogy comparing AutoLink to a ‘beloved butler’ is a stretch. My idea of a beloved butler is someone who keeps my house clean, draws my bath after a hike, and massages my feet when I’m tired. AutoLink pales, badly, in comparison. However, Tim Bray thinks it’s evil:

Before the Web, publishing was about words and pictures. Now it’s about words and pictures and links. I’m OK with reformatting and aggregating and all sorts of other things, but I don’t want downstream software fucking with my words. Or my pictures. Or my links. A lot of us feel this way.

Robert Scoble agrees with Tim and Dave Winer, writing:

I believe that anything that changes the linking behavior of the Web is evil. Anything that changes my content is evil. Particularly anything that messes with the integrity of the link system.

One word for you, Robert: Nofollow. This little doohickey, which you love so much, is going to change the linking behavior of the Web faster than a toolbar option that only works in IE, and only when the reader clicks a button, and only if you have an ISBN, address, or other obscure piece of data embedded in your page that isn’t currently already linked. Still, as Phil Ringnalda points out in facetious response to another weblogger in a fascinating comment thread, you can’t trust them sneaky readers:

I can’t trust my readers (an unsavory lot, though I love them dearly) to understand the sacred nature of your every word (some of them *gasp* will even copy text and paste it elsewhere!), so I removed your link. Let me know when you are providing your “web”log as either a signed PDF or one large image, so that they may be trusted to behave according to your anti-web rules, and I’ll put it back.

Hey Phil, don’t remove the link: just add “nofollow” to it. (And sorry that I, um, copied and pasted your text here, which is ‘elsewhere’…but it was my evil twin’s fault! I thought she was gone for good, but she hitched a ride back with me from Florida, where she was working as a Mary Poppins Disney character; working, that is, until she hit some kid over the head with her umbrella when he whined about wanting to see Goofy, instead.)

One of the better ‘anti-AutoLink’ writeups was provided by Paul Boutin at Slate, who wrote:

I don’t think Google is evil for naively launching this feature. I do think they’ll be an accessory to evil if their tool prompts Yahoo!, Microsoft, or my ISP to start handing out similar software that’s a little more aggressive about stuffing in the links. Lots of companies have a different definition of “evil” than the Google guys—leaving money on the table is the ultimate sin.

If for no other reason, Google should yank AutoLink because it’s a poorly designed, oddly un-Googlish feature for a company that made its name on unobtrusiveness and unambiguous results. Most of all, it’s unsavvy. Google’s clever reinvention of Web ads won instant praise from both surfers and advertisers. AutoLink makes me wince. There’s got to be a better way to present map and book links than clumsily editing someone else’s HTML.

A good argument, particularly in comparison with Google’s other efforts: it is an un-Googlish form of technology–except for the fact that AutoLink is about a link, and there’s nothing more Googly than a link. In addition, if we measure every new technology against a possible evil abuse by other parties at some future time, we should have stopped email, cold, and told Tim Berners-Lee he could keep this new Web thing he’s promoting. And let’s burn Dave Winer in effigy for hooking us all on weblogs; my mama always told me to beware the pusher man.

What surprised me about this entire conversation is that people like Winer and Scoble are deathly against AutoLink, yet they push webloggers to publish their entire posts to their syndication feeds; where they can be pulled and massaged and combined with who knows what by any Tom, Dick, or Harry who comes along. I once had my writing appear in a published syndication feed at another weblogger’s site, surrounded by X-rated material, which changes the context of my writing a whole lot more than someone adding a link to a map based on an address.

And we’re talking about a toolbar that only works in Internet Explorer, the browser that’s almost guaranteed to take your carefully designed web page and muck it up so that it’s barely legible; leaving people who use it to view your site thinking that you’re the worst page designer ever. True, it doesn’t do anything with your links. Frankly, though, on balance, if we’re that worried about our pages, I think we should keep AutoLink and throw out the browser.

Now, if Google thinks about implementing a form of Hailstorm, I’ll bunny thump the ground with warnings of dire deeds and nefarious doings; but I give AutoLink a “mildly interesting” at best, and a “who cares” at worst.

Categories
Semantics Social Media Weblogging

Introducing Tagback

Recovered from the Wayback Machine (includes comments).

The purpose of Trackback initially was to ping the readers of another’s post about something they may want to know about. Of course, we immediately started using it as a referrer link (“Hi, I linked to you!”)

So, we’re dropping trackback and we need something in its place. I provided the how-tos to add Blogline citations and Technorati links in the previous post, and these will provide you a listing of who has linked to the article directly. But that’s the limitation: these solutions are dependent on a link. How can we point a person’s readers to another post or article, without linking to the post directly?

Easy: Tagback.

For each post, I create a tagback consisting of the words of the title of my individual post, stripped of white space and dashes, preceded by ‘bb’ to differentiate my posts from other people’s posts. I also include a link to the Technorati tags page for this tag, which forms my ‘tagback’. You can see the tagback for this post at the end.

Now, you can either use the tag with a photo in flickr, or you can use it in del.icio.us to annotate any bookmark: your post, another person’s post, an article, a reference to a specification, whatever.

Since Technorati scarfs up delicious tags and flickr tags, all of these items will eventually appear in my Tagback page, along with weblog posts where people have linked to the tag directly in the post. And if Technorati excludes googlebots and other bots in the tags pages, thereby denying any pagerank to the tag pages, there is no incentive for spammers to spam this page.

As long as Technorati denies pagerank for the individual tag pages. Hint. Hint.

Now, regardless of what weblogging tool you use, including Blogger, WordPress, Movable Type, Typepad, ExpressionEngine, whatever, you can participate in discussions, and without having to install any code. Just use whatever tags or function calls you use in your weblogging tool to get the title, and create your own version of a tagback. Or you can manually create a tag for each post you’re interested in designating as a ‘to be discussed’ item, and leave it off from those posts you don’t want to create a tagback page for.

So, you guys were right – tags are handy. I could get the hang of this folksonomy stuff.

I did have to update the code to strip out dashes and just create a one-word tag. I don’t like it, but flickr can’t deal with dashes, it seems like del.icio.us wants to use spaces, and Technorati seems not to care. Since there is no standardized word delimiter across all of these systems, I just stripped out anything that isn’t an alphanumeric character.
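The tag-building step described above can be sketched in a few lines of Python. This is my own illustration of the approach, not actual tool code; the lowercasing is an assumption of mine, since the post only specifies stripping non-alphanumeric characters:

```python
import re

def tagback(title, prefix="bb"):
    """Build the tagback tag: strip everything that isn't alphanumeric,
    then add the 'bb' prefix to mark the post as mine.
    Lowercasing is an assumption, not part of the described scheme."""
    return prefix + re.sub(r"[^A-Za-z0-9]", "", title).lower()

def tagback_link(title):
    """Link the tag to its Technorati tag page."""
    tag = tagback(title)
    return f'<a href="http://technorati.com/tag/{tag}" rel="tag">{tag}</a>'

print(tagback("Introducing Tagback"))       # bbintroducingtagback
print(tagback_link("Introducing Tagback"))
```

The same string can then be pasted into flickr or del.icio.us as a tag, which is what lets those bookmarks flow back into the Technorati tag page.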

Categories
Social Media Specs

I broke Nofollow

I’m still trying to write something on Technorati Tags. What’s slowing me up is that there’s been such a great deal of interesting writing on the topic that I keep wanting to add to what I write. And, well, the weather warmed up to the 60s again today, and who am I to reject an excuse to go for a nice walk. Plus I also watched Japanese Story tonight, so there went yet more opportunity to write to this weblog.

Thin excuses for sloth and neglect aside, it is interesting that a formerly obscure and rarely used attribute in X(HTML), rel, has been featured in two major technology rollouts this week: Technorati Tags and the new Google “nofollow” approach to dealing with comment spam. Well, as long as they don’t bring back blink.

Speaking of the new spam buster, after much thought, I’ve decided not to add support for rel=”nofollow” to my weblogs. I agree with Phil and believe that, if anything, there’s going to be an increase in comment spam, as spammers look to make up whatever pagerank is lost from this effort. And they’re not going to be testing whether this is implemented — why should they?
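For anyone who does want to implement it, adding rel=”nofollow” to comment links amounts to a small rewrite pass over the comment HTML before output. A regex sketch of the idea, purely illustrative; a real weblog tool would use a proper HTML parser rather than this shortcut:

```python
import re

def add_nofollow(html):
    """Add rel="nofollow" to anchor tags that don't already carry a
    rel attribute. A regex sketch for illustration, not a real HTML
    parser; it ignores edge cases a weblog tool would have to handle."""
    def annotate(match):
        attrs = match.group(1)
        if "rel=" in attrs:
            return match.group(0)  # leave existing rel attributes alone
        return f'<a{attrs} rel="nofollow">'
    return re.sub(r"<a(\s[^>]*)>", annotate, html)

comment = '<a href="http://example.com/">shameless plug</a>'
print(add_nofollow(comment))
```

Run over each comment body as it is rendered, this marks every commenter-supplied link so that search engines pass it no pagerank.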

But I am particularly disturbed by the conversations at Scoble’s weblog as regards ‘withholding’ page rank. Here’s a man who, for one reason or another, has been linked to by many people, and now ranks highly because of it: in Google, Technorati, and other sites. I imagine that among those who link, there are many who disagree with him at one time or another, but they’re going to link anyway. Why? Because they’re not thinking of Google and ‘juice’ and the withholding or granting of page rank when they write their response. They’re focusing on what Scoble said and how they felt about it, and they’re providing the link and the writing to their readers so that they can form their own opinion. Probably the last thing they’re thinking of is the impact of the link on Scoble’s rank.

Phil hit it right on the head when he talked about nofollow’s impact, but not its impact on the spammers — the impact on us:

But, again, it’s not so much the effects I’m interested in as the effects on us. Will comments wither where the owner shows that he finds you no more trustworthy than a Texas Hold’em purveyor, or will they blossom again without the competition from spammers? Will we do the right thing, and try to find something to link to in a post by someone new who leaves a comment we deem not worthy of a real link, or will new bloggers find it that much harder to gain any traction?

That Phil, he always goes right to the heart within the technology–but blinking, lime green? That’s cruel.

No, no. I don’t know about anyone else, but I’ve spent too much time worrying about Google and pageranks and comment spammers. A few additions to my software, and comment spam hasn’t been much of a problem, not anymore. I spend less than a minute a day cleaning out the spam that’s collected in my moderated queue. It’s become routine, like clearing the lint out of the dryer after I finish drying my clothes.

Of course, if I, and others like me, don’t implement “nofollow” we are, in effect, breaking it. The only way for this to be effective as a spam prevention technique is if everyone uses the modification. I suppose that eventually we could fall into “nofollow” and “no-nofollow” camps, with those of us in the latter added to the new white lists, and every link to our weblogs annotated with “nofollow”, as a form of community pressure.

Maybe obscurity isn’t such a bad thing, though; look what all that page rank power does to people. But I do feel bad for those of you who looked to this as a solution to comment spam. What can I say but…