Categories
Social Media Weblogging

WordPress and the hidden articles

Recovered from the Wayback Machine.

An interesting story appeared today about the WordPress site, and several thousand articles that could be found at http://wordpress.org/articles.

Disclaimer. I’m hesitant to even write about this, knowing the web’s fondness for angry mob justice, but I feel like it’s an important issue that needs to be addressed. My one request: please be calm and rational. WordPress is a great project, and Matt is a good guy. Think before piling on the hatemail and flames.

The Problem. WordPress is a very popular open-source blogging software package, with a great official website maintained by Matt Mullenweg, its founding developer. I discovered last week that since early February, he’s been quietly hosting almost 120,000 articles on their website. These articles are designed specifically to game the Google Adwords program, written by a third-party about high-cost advertising keywords like asbestos, mesothelioma, insurance, debt consolidation, diabetes, and mortgages. (Update: Google is actively removing every article from their results. You can still view about 25,000 results on Yahoo. Or try this search tool, which searches multiple Google datacenters.)

(Several links within the original material.)

From the comments left, it seems that the links to the articles are hidden within the WordPress main page. The articles thereby inherit the high Google rank the site gets, while the page itself gives no visible indication of them.

<div style="text-indent: -9000px; overflow: hidden;">
<p>Sponsored <a href="/articles/articles.xml">Articles</a> on <a href="/articles/credit.htm">Credit</a>, <a href="/articles/health-care.htm">Health</a>, <a href="/articles/insurance.htm">Insurance</a>, <a href="/articles/home-business.htm">Home Business</a>, <a href="/articles/home-buying.htm">Home Buying</a> and <a href="/articles/web-hosting.htm">Web Hosting</a></p>

</div>

Since the words used in the pages are high ‘rate’ words within the Google AdSense program, we can assume this could be lucrative to the company that provided the articles. According to Matt’s response in a thread at the WP support forum, WordPress itself received a set fee for hosting the articles.

How much? Well, enough to hire the first employee of WordPress, Inc.

I am not one of those who believes that the only decent open source project is one where the people do the work only as a labor of love. I don’t think there’s anything wrong with people making money from their art. But of course, I would say this, as I try to put together an online store with goods featuring my photos, as well as still trying to find buyers for my books and/or articles–and after I had added, and pulled, Google Ads.

It’s all very good to say, “We should do this because we love to do it”. But it’s hard to be motivated to write and create when one is worried about what the next month holds. Noble to say, “Well, I would deliver pizza if needs be, to keep my art free of contamination.” Tell me, though: how many of you have delivered pizza? Want to try it at 50?

Still, I can also see that there’s been a dimming of the joy of this medium, as more and more people turn to these pages as a way to make a buck. What did Jonathon Delacour write, in a nice twist on Talleyrand?

Those who did not blog in the years before the revolution cannot know what the sweetness of blogging was.

Very sweet, indeed. Sweet and impossible–a castle made of spun sugar.

But to return to the story, this is about WordPress and what amounts to actions that could be considered scamming Google.

Google is now removing all of the articles from its databases, but one could say that the company was hoist on its own petard (following along with English usage that Talleyrand would appreciate) with this action–its own pagerank was used against the company. Perhaps if it wasn’t so easy to be gamed, events like this wouldn’t occur.

Still, this is using weblogs to play the system, and not really different than what the comment spammers do, though at least this isn’t in our space.

I learned about the WordPress article through Stavros who wrote:

I challenge you to think about the creative output of artists and artisans whose work has touched you. Think of your favorite books, your favorite paintings. That piece of handmade furniture or that gloriously handtooled little application. The music you listen to or the writers-on-the-web you read because they get into your heart and fill you with the ineffable, simple joy of being alive and having a mind. I wonder how many of them would have done their work whether or not they eventually got paid for it. My guess is ‘most’.

I’m not saying that people shouldn’t be paid. Hell, if I could get paid for making the things I make because there’s something inside me that impels me to do it, I’d be thrilled. It’d be a dream come true, by crikey. But I do it, regardless. And so do you, probably, if you’re reading this.

For some reason I’m reminded of Michelangelo and the Sistine Chapel. Michelangelo didn’t like to paint; he preferred sculpture. He didn’t even want to do the work, and only did so after pressure from the Pope. And then there was the fee.

There’s art, and then there’s art.

Bottom line is: do you like WordPress? Do you like using WordPress? Can you still get it for free? Is it still GPL? Then perhaps that’s what should be focused on, and however or whatever Matt does with the WordPress page is between him and Google; because what matters is the code, not the purity of actions peripheral to the code, or its release.

I am also reminded of the story of the Roman general returning in triumphant parade through the city after a great victory; and the man who stood behind him in the chariot, holding the victory wreath made of leaves over his head. “Thou art mortal”, he would whisper, over and over again into the general’s ear, as reminder that no matter how great the triumph, how beloved of the people, the general is, after all, only human.

update

WordPress, Inc.’s first employee on this issue.

Categories
Weblogging Writing

Telling a story

Loren Webster has taken his new fascination with PhotoShop and combined it with philosophical reminiscences of cars he’s owned into a set of really lovely posts, beginning with this one about a boy and his Studebaker.

I like every form of writing I find in weblogs, being more interested in the person and/or work rather than any specific type, but there’s a special place in my heart for writings such as this: works that add art or photographs or poetry or music, sometimes with asides from history or linguistics or philosophy; all mixed in, subtly, with personal views and a little personal history. It’s the type of writing I love to do most, and enjoy reading whenever the whimsy strikes any of you. Even within the technology writings, I like those that sprinkle humor and humanity among all the angle brackets and arguments of which is better: Part A or Part B.

But as Anil Dash wrote recently, someone somewhere will say this isn’t weblogging. And though I think we can safely say that not everyone loves Anil (“Just joking Anil! Truce!”), he’s hit it dead on when he writes:

One good sign that a community is maturing is that some of the earlier or more influential members start trying to dictate how it should be done. Use more bold letters! Don’t use comments! Insert more pictures! Whatever the rule, it’s generally being used to assert authority over the nascent community, or to defend some arbitrary choices that have been made and are now being questioned.

This came up this weekend in another context, circumstances and participants withheld to protect me, because the lord knows if I don’t watch my butt no one else will; and as usual it grates on me and saddens me because we put a great deal of our creative effort into works that shouldn’t even exist according to these people. Worse, to some of these arbiters of great weblogging, doing so demeans the seriousness of this medium, yada yada yada.

Every year there is a new crop of people going out into the world armed with formal concepts and rules about how this all works; and every year we then have to follow along behind, tagging the clean, careful concepts with the purple and red graffiti of revolt and trashing the rules like the anarchists we are.

I have contributed to a book on weblogging in the past, but if I were asked to write a “Weblogging for Dummies” book now, it would look as follows:

Chapter One:

Page one:

“Begin.”

Now I’ve just saved you all a lot of money, which you can soon spend on limited edition “Burningbird” refrigerator magnets. Collect as many as you can; trade ‘em with your friends.

And stop by Loren’s and share your own car story.

Categories
Technology Weblogging

Update

Just a quick FYI on how this is going:

I need to integrate fulltext in the application. This allows people to view a single page in a multi-page posting.

I’m still trying to get the RDF meta-data component finished, using RAP (RDF API for PHP). Some troubles with data updates.

Still hunting down SQL statements that have been embedded in the process files, and isolating them in the backend.

Few other odds and ends. I had thought about not worrying about multi-blog support, but I think I will add this in, after all. I think all in all it would be easier to add it in from the beginning than to try to incorporate it after the tool’s been used.

Lots of work. Most of it fun. Really like the metadata thing, and consider the discussion about datablogging timely, too. It’s not going to be that polished, though, because the metadata functionality will be an add-on, whereby people provide a vocabulary and the functionality enables it for each post. But I agree with Danny: this is the perfect use for RDF/XML.

Categories
Weblogging Writing

The syndication feed fair warning indicator

Recovered from the Wayback Machine.

This week I’ll be posting writings that violate the concept of ‘proper weblog entry’ all to heck–either by the use of fiction or the length of the writing, or both.

As happens most times I do this, one or more people access the entry expecting to find a traditional weblog entry and, instead, find writing. Good writing, bad writing, doesn’t matter. It’s the form that disturbs them.

If the work is fictional, I almost invariably get someone who writes in comments, “This is b***s**t” or a variation on, “This is stupid.” If the work is longer, some of the commenters sound a bit tired when they leave notes, as if I’ve made them run through a marathon they weren’t expecting.

Now, the longer writings will give me a chance to test out my new Wordform Fulltext feature, but that’s not the reason for the writing. The writing is the reason for the writing.

However, in fairness to those who are expecting traditional weblog entries, otherwise known as the Slam, Bam, thank you Ma’am posts, I’m working at adding a new meta item to my syndication feeds called “The Fair Warning Indicator”. This indicator will, hopefully, get picked up in the syndication feed aggregators, letting you know whether the post is a traditional weblog entry or not. I have the meta-data part, I just have to figure out which field in the existing feed infrastructures to subjugate to my evil ways.

With the Fair Warning Indicator, when I do publish these works online, if you want to forgo a ‘non-weblogging reading experience’, you can. And, hopefully, the brave and intrepid (or bored or unknowing) souls who do venture in will then feel free to comment purely on the writing itself–not the fact that I’m not following the Blogging rules of etiquette.

Now, for any syndicators in the audience, suggestions on what would be the best modification to the feeds to incorporate the Indicator? By feed type?
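One possibility for RSS 2.0 feeds, offered only as a sketch: the existing item-level category element, with its optional domain attribute, could carry the Indicator. The domain URL, the function name add_fair_warning, and the label text below are all hypothetical, and whether any aggregator would display the value is an open question.

```python
import xml.etree.ElementTree as ET

# Hypothetical identifier for the indicator's vocabulary; not a real URL.
FAIR_WARNING_DOMAIN = "http://example.com/fair-warning"


def add_fair_warning(item: ET.Element, kind: str) -> ET.Element:
    """Attach a <category> element to an RSS 2.0 <item> describing the post type."""
    category = ET.SubElement(item, "category", {"domain": FAIR_WARNING_DOMAIN})
    category.text = kind
    return category


# Build a bare-bones item and tag it as something other than a short post.
item = ET.Element("item")
ET.SubElement(item, "title").text = "A long short story"
add_fair_warning(item, "long-form fiction")

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Reusing an element the feed format already defines means aggregators that ignore the Indicator still see a valid feed.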

Categories
Technology Weblogging

The Survival Guide to LAMP: MySQL and Saving the Pig

In the last two weeks, two WordPress weblog sites have been suspended or moved to interim servers because of performance issues. In both cases, the ISPs who hosted the sites (different companies) included snapshots of the MySQL processes that caused the problems in their emails.

I worked with one of the sites offline, but the owner of the second site, Ampersand, posted the copy of the email he received at the WordPress support site. I grabbed a sample of it, as follows:

| 2073 | theenn2_amptoons | localhost | theenn2_MT | Query |
4 | Copying to tmp table | SELECT alas_posts.*,
MAX(comment_date) AS max_comment_date FROM alas_comments,
alas_posts WHERE alas |
| 2078 | theenn2_amptoons | localhost | theenn2_MT | Query |
4 | Copying to tmp table | SELECT alas_posts.*,
MAX(comment_date) AS max_comment_date FROM alas_comments,
alas_posts WHERE alas |

Plain as dirt what the problem is, eh?

Both of the weblogs are WordPress, but the SQL that generated the performance hit differed. With one, it was the latest comment plug-in; with the other, it was the SQL to support a category listing. In addition, this isn’t specific to WordPress–it could occur with the PHP-based version of Movable Type, ExpressionEngine, or any other MySQL based tools that have dynamic access.

Ostensibly, there is something wrong with these two sites. However, they’re only representative of what we’ll most likely see more of in the future. Our weblogging tools are becoming increasingly sophisticated, the data richer and more complex, the functionality modular and extensible by every person with a text editor and a yen–all packaged up in one reusable, standard, one-size-fits-all package. To make it all even more interesting, these applications are being installed on systems where you can get all you can eat for $5.88 a month, which means that a lot of sites need to be hosted on each server in order for the company to break even.

Things are bound to start breaking. But hey, it’s all fun, until you get that email that says your site is a pig, and it’s just been sent to the butcher.

Please! Won’t someone save the pig!

Now that we’ve established that the world is out to get our weblogs, let’s focus back on these problems, and in particular, the information that the ISPs sent.

In both cases, the messages contained the phrase “Copying to tmp table” and then what looks like a standard SQL query. If you look for this phrase in any of the WordPress code, you won’t find it–it’s a MySQL thread state that only shows up when you run SHOW PROCESSLIST within MySQL, or access the same from phpMyAdmin.
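If your host sends you a snapshot like this, a few lines of script can pick out the rows worth asking about. What follows is a minimal sketch, assuming the snapshot is the usual pipe-delimited SHOW PROCESSLIST table with one row per line. The sample rows, user names, and function names are made up for illustration and are not taken from either site.

```python
from typing import NamedTuple


class ProcessRow(NamedTuple):
    thread_id: int
    user: str
    host: str
    db: str
    command: str
    seconds: int
    state: str
    info: str


def parse_processlist(text: str) -> list[ProcessRow]:
    """Parse pipe-delimited SHOW PROCESSLIST output, skipping borders and the header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines such as +----+----+
        # Info is the last column, so cap the split to keep any pipes inside it.
        cells = [cell.strip() for cell in line.strip("|").split("|", 7)]
        if len(cells) != 8 or not cells[0].isdigit():
            continue  # header row or something unexpected
        seconds = int(cells[5]) if cells[5].isdigit() else 0
        rows.append(ProcessRow(int(cells[0]), cells[1], cells[2], cells[3],
                               cells[4], seconds, cells[6], cells[7]))
    return rows


def tmp_table_queries(rows: list[ProcessRow]) -> list[ProcessRow]:
    """Return the rows whose state shows MySQL building a temporary table."""
    return [row for row in rows if row.state.startswith("Copying to tmp table")]


# Illustrative snapshot; every value here is invented.
SAMPLE = """
+----+-----------+-----------+---------+---------+------+-------+------+
| Id | User | Host | db | Command | Time | State | Info |
| 11 | blog_user | localhost | blog_db | Sleep | 30 | | NULL |
| 12 | blog_user | localhost | blog_db | Query | 4 | Copying to tmp table | SELECT ... |
| 13 | blog_user | localhost | blog_db | Query | 9 | Copying to tmp table on disk | SELECT ... |
"""

rows = parse_processlist(SAMPLE)
flagged = tmp_table_queries(rows)
for row in flagged:
    print(row.thread_id, row.seconds, row.state)
```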

Now, depending on your ISP’s tech person and how proficient they are with database optimization, you may be told that “Copying to tmp” is a ‘bad’ thing and shouldn’t happen, and therefore something is wrong and the code is crappy. Well, this isn’t true.

MySQL optimizes queries before they’re executed, to get the maximum amount of data in finished format with the minimum amount of processing and time. Part of the optimization can be to build a temporary table to hold an intermediate set of data before finishing the query. In addition, if the data is ordered (the sorting sequence) on one column but grouped on another, or on a column in a different table, MySQL uses a temporary table.

There is a statement, EXPLAIN, that provides information about how MySQL will execute a query. Developers use this in order to fine-tune the SQL so that the use of ‘expensive’ operations is avoided. If you have access to phpMyAdmin, and run a query, the option to run EXPLAIN is provided with the results. Still, you can only tweak a query so much, and sometimes even the optimal SQL results in MySQL creating a temporary table.
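To get a feel for what this kind of output tells you without needing a MySQL server, here is a small sketch that uses SQLite (bundled with Python) as a stand-in. This is a swapped-in tool, not MySQL: SQLite’s EXPLAIN QUERY PLAN reports a “TEMP B-TREE” where MySQL’s EXPLAIN would show “Using temporary” in its Extra column, and the two optimizers make different choices. The table, index, and function names are hypothetical, loosely modeled on a category count.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def uses_temp_structure(details: list[str]) -> bool:
    """True if the plan needs a temporary b-tree for grouping or sorting."""
    return any("TEMP B-TREE" in detail for detail in details)


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified post-to-category link table.
conn.execute("CREATE TABLE post2cat (post_id INTEGER, category_id INTEGER)")

QUERY = "SELECT category_id, COUNT(post_id) FROM post2cat GROUP BY category_id"

# Without an index, grouping needs a temporary structure.
before = plan(conn, QUERY)

# With an index on the grouping column, rows can be read already in order.
conn.execute(
    "CREATE INDEX idx_post2cat_category ON post2cat (category_id, post_id)"
)
after = plan(conn, QUERY)

print("before:", before)
print("after: ", after)
```

The same habit carries over to MySQL: run EXPLAIN on the query from the host’s email, look for the temporary-table marker, and check whether an index on the grouped or sorted column makes it go away.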

Where the use of a temporary table is always bad is when MySQL doesn’t have enough memory to hold all of the contents of the temporary table; the tool then needs to copy the contents to disk. Anytime MySQL has to go to the disk rather than memory, performance takes a hit. This shows up in the processes as:

“Copying to tmp table on disk”

However, what showed up with both of the weblogs that had problems is:

“Copying to tmp table”

An open question at MySQL asks whether this state means the table is being copied to disk or to memory, but from the impact to the servers, we can probably assume it’s to disk.

If it is, then the problem could be that tmp space allocated isn’t enough, and MySQL is having to write to the disk frequently. Or it could be, depending on your type of MySQL, that the maximum space allocated for memory is less than the maximum size allocated for the tmp space. Or other variations of settings at the database level.
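The interplay between the two settings can be sketched in a few lines. This assumes the server variables involved are tmp_table_size and max_heap_table_size, where MySQL keeps an internal temporary table in memory only up to the smaller of the two. The sizes below are invented for illustration, and a real server can go to disk for other reasons as well (TEXT or BLOB columns in the temporary table, for one).

```python
MB = 1024 * 1024


def in_memory_tmp_limit(tmp_table_size: int, max_heap_table_size: int) -> int:
    """The in-memory ceiling for a temporary table is the smaller of the two settings."""
    return min(tmp_table_size, max_heap_table_size)


def spills_to_disk(estimated_bytes: int, tmp_table_size: int,
                   max_heap_table_size: int) -> bool:
    """True if a temporary table of this size would no longer fit in memory."""
    return estimated_bytes > in_memory_tmp_limit(tmp_table_size, max_heap_table_size)


# Illustrative settings: the tmp size looks generous, but the heap limit undercuts it.
tmp_table_size = 32 * MB
max_heap_table_size = 16 * MB

limit = in_memory_tmp_limit(tmp_table_size, max_heap_table_size)
print(limit // MB, "MB usable in memory")
print(spills_to_disk(20 * MB, tmp_table_size, max_heap_table_size))
```

Raising only one of the two settings changes nothing if the other is the smaller value, which is worth mentioning to a host who says the tmp space was already increased.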

Or it could be a badly written plugin, or too many plugins, or a cheap host that doesn’t allocate enough space for the sites hosted, or a less than optimum SQL query, or even a trackback attack.

Get used to sleeping with the pig

Ampersand’s site, Alas, a Blog, gets between 1000 and 2000 visitors a day. It’s a popular site and gets lots of comments, and spam, of course, so several plug-ins were installed to help contain it. Evidently, it was one of these plug-ins that started to have problems, because the query is not part of the WordPress code. In particular, if you look for max_comment_date in WordPress, you don’t find it.

However, with the second weblog, the query is within list_cats, a built-in WordPress function, and looks like the following:

SELECT cat_ID, cat_name, category_nicename, category_description, COUNT( wp_post2cat.post_id ) AS cat_count
FROM wp_categories
INNER JOIN wp_post2cat ON ( cat_ID = category_id )
INNER JOIN wp_posts ON ( ID = post_id )
WHERE post_status = 'publish'
AND post_date_gmt < '2005-03-18 20:48:04'
AND cat_ID <> 1
GROUP BY category_id
LIMIT 0 , 30

While Ampersand’s weblog problem was being discussed in the WordPress forum, a third person also had the same problem — most likely in a completely different area.

Bacon

With the growing number of ‘cheap’ hosting sites, and an increasing use of sophisticated PHP applications and MySQL queries, not to mention the extensibility of the tools, we’ll see more of this problem–especially as we demand more and more from our tools. Think about it: how many plugins are you using with your WordPress installation?

So what can you do? As a starter, you might want to look at that ‘good’ deal you get from your host. Not all inexpensive hosts cut costs and have less than optimal installations, but you’re less likely to have a host that will patiently work through problems with you if you’re only there for the $5.88 special.

You could also trim the fat by dropping plugins that you really don’t need, and make sure that whatever tools or applications or plugins you use are fully baked, i.e. have been through a healthy bug fix period.

If a problem does occur, make sure to file a report with the developers of whatever tool you’re using, providing all the information the host provides. The SQL used in the tool may not be optimum, and being informed of problems provides necessary feedback.

Hopefully this hasn’t been more info than you want or need (“too much sharing!”). At the least, if this situation comes up, you’re not going to be as intimidated when your host sends you an email that tells you …your site is a pig, fix it, or we’re kicking you off.