Links not wanted

Feedster released its own version of a link ranking system, Feedster 500. It matches previous lists, but also has a number of surprises.

Unlike other lists, or even link aggregators, Feedster has been very forthcoming about how it derives its list and, more importantly, how it finds the incoming links it uses as the key component of its list: it finds them in syndication feeds. This will explain why there are some unexpected results in this list. First, blogrolls are left out of the calculation, as they are not part of syndication feeds, or at least, not traditionally part of syndication feeds. Second, and this is the kicker, if you publish a syndication feed that doesn’t provide full content, then your links are not being picked up by the service and used in its calculations.

My links weren’t picked up. In fact, when working with my Linkers tool, and the more sophisticated Talkdigger, I have found that none of my links to other sites are being picked up by any of the services. And when I went looking for how the services work, none of the tools, other than Feedster, publishes its process to find links and/or other searchable material.

This is frustrating because, even if I don’t care about lists and ranks, I do care about letting people know that I’ve written something about their posts. Since I don’t support trackback anymore, the only way another weblogger will know I’ve made comments on their work is if they read my weblog regularly, someone else tells them about my post, I put a link into their comments, or they see my URL show up in their referrer logs. And with the abuse of referrers, those logs are less than useful nowadays, or even unavailable for some webloggers.

Besides, I don’t want just the weblogger to know I’ve written about their posts–I want others to know, too.

Now I know how Feedster works, and that if I want links to show up in that service I have to provide full content. I don’t want to do this, and I’ve never wanted to do this, but either I blow off inter-weblog communication or I provide full feeds. The question then becomes: what about the other services?

Supposedly Technorati uses the syndication feed if it provides full content; otherwise it grabs the main page and scrapes the data. Because it accesses only the front page, if I use the -more- link to split a larger post into a beginning excerpt with a link to the individual page, the links in the split-off part are not included. If I want my links picked up from a post, I either have to make sure they show in the very first part of the post, or not use the -more- capability.

Even when I don’t use the -more- capability, my links are not showing up in Technorati. Nor in IceRocket, nor in Bloglines, nor in any of the other services as far as I can see. I’m beginning to suspect that most services now use only the syndication feeds, which means I’ll have to use full content for them, also. As a test, I’ve set my site to provide full feed for now, and I’m linking to several sites in and at the end of this post to see which service, if any, picks up the links.

Other factors that could influence the feed being picked up include me repeating my permanent link to a post in the title and at the bottom of a post; publishing links to webloggers’ URLs in my comments (which could trigger spam filters); not pinging weblogs.com or blo.gs; perhaps even the fact that I only support one feed type (RDF/RSS). Without knowing how each of the services processes links, your guess is as good as mine.

If I’m frustrated with the services, I also know how difficult it is to separate ‘good’ data from ‘bad’ when collecting from a site; to determine which links come from the outside (a commenter’s URL) versus from the site author; and to tell a static link (blogroll) from a dynamic one (one included in a page). I can respect the challenge involved even as I am critical of the results.

What would I do if I were creating a service like this?

First, I wouldn’t scrape weblogs off of the global services, such as weblogs.com. These are mined by spammers so badly now as to make them useless. What I would do is provide a ping service that a person could trigger manually, or through their tool if it provides this facility.

I would access the syndication feed, and if full content is provided, I would process this for data and URLs. Otherwise, I would access the individual entry pages directly to pick up links. By doing this, I’ll also be picking up URLs in comments and anything in the sidebars, which is why most services don’t want to access the individual entries, but I’d rather be more liberal than not when it comes to gathering data.
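To make the feed-first idea concrete, here is a minimal sketch in Python, not anything a real service is known to run. It assumes each feed entry has already been parsed into a dict with a 'link' and, when the feed carries full content, a 'content' field; that shape, and the names extract_links, links_for_entry, and fetch_page, are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link target found in a chunk of HTML, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def links_for_entry(entry, fetch_page):
    """Use the feed's full content when present; otherwise fetch the entry page.

    entry      -- hypothetical dict with 'link' and, optionally, 'content'
    fetch_page -- any callable that takes a URL and returns that page's HTML
    """
    content = entry.get("content")
    if content:
        return extract_links(content)
    # Excerpt-only feed: fall back to the individual page. This also picks
    # up comment and sidebar links, which is the liberal choice made above.
    return extract_links(fetch_page(entry["link"]))
```

Passing fetch_page in as a parameter keeps the sketch free of any particular HTTP library; a real crawler would plug in its own fetcher there.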

I would also like to send a bot once a day to access the main page, just to make sure updates haven’t happened that haven’t been reflected in the feed, and to access the blogroll and other more static data.

At this point, we have a lot of data. Pulling blogrolls and other static links out of content isn’t that hard if you have the storage to maintain history and can compare whether a link provided today was also provided yesterday. About the only time I would refresh this in the database is if the link changed in some way (it was there one day, not the next) or the content in which it occurred changed. The latter could require a way of annotating the context of a link, which could be pricey in storage and computation.
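The yesterday-versus-today comparison comes down to set arithmetic. A rough sketch, assuming the service keeps one snapshot of a page’s links per day (the function name and the three labels are mine, purely for illustration):

```python
def classify_links(previous, current):
    """Compare yesterday's links on a page with today's.

    Links present in both snapshots are treated as static (blogroll-like);
    the other two groups are the changes worth writing to the database.
    """
    previous, current = set(previous), set(current)
    return {
        "static": previous & current,   # there yesterday and today
        "new": current - previous,      # appeared today
        "dropped": previous - current,  # there one day, not the next
    }
```

One day of history is the bare minimum. A link in a post that stays on the front page for a week would look static under this test, so a real service would probably want a longer window.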

One interesting way of looking at this is to remove duplicate links when it comes to aggregation for lists, but to refresh the item in the most recently updated queue if it shows in fresh content at the site being scanned. With this you don’t need to have much context, and if a person is interested in finding out who is talking about a specific post, these top-level links won’t show.
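Here is one way that split between counting and freshness could look. This is a sketch under my own assumptions: timestamps are plain numbers, and the class and method names are hypothetical.

```python
class LinkIndex:
    """Counts each linking site once per target, but tracks freshness separately."""

    def __init__(self):
        self.sources = {}    # target URL -> set of sites that link to it
        self.last_seen = {}  # target URL -> latest time it showed up in fresh content

    def record(self, source, target, seen_at):
        # The set removes duplicates, so a repeated link never inflates the count.
        self.sources.setdefault(target, set()).add(source)
        # Seeing the link again still moves the target up the recency queue.
        self.last_seen[target] = max(seen_at, self.last_seen.get(target, seen_at))

    def count(self, target):
        return len(self.sources.get(target, ()))

    def most_recent(self, limit=10):
        ordered = sorted(self.last_seen, key=self.last_seen.get, reverse=True)
        return ordered[:limit]
```

Keeping the count and the recency queue in separate structures is what lets a repeated link refresh one without touching the other.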

As for links in comments: here is where the vulnerability to spam enters, but using an algorithm to find and discard multiple repeated URLs could help to eliminate these. Looking for domains that have been determined to be spamming is another approach. Sometimes, though, we have to accept that some crap gets through. I’d rather let a little crap through than discard ‘good’ stuff just because I feel I’m in some kind of war with the spammers.
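Both checks fit in a few lines. In this sketch the threshold of three repeats and the idea of a hand-maintained set of blocked domains are assumptions of mine, as is the function name:

```python
from collections import Counter
from urllib.parse import urlsplit


def filter_comment_links(urls, blocked_domains=frozenset(), max_repeats=3):
    """Drop comment URLs that repeat too often or sit on a blocked domain.

    max_repeats and blocked_domains are illustrative knobs, not tuned values.
    """
    counts = Counter(urls)
    kept = []
    for url in counts:  # a Counter iterates in first-seen order
        host = urlsplit(url).hostname or ""
        if host in blocked_domains:
            continue  # domain already determined to be spamming
        if counts[url] > max_repeats:
            continue  # the same URL pasted over and over
        kept.append(url)
    return kept
```

Erring on the liberal side here just means a generous max_repeats and a short blocked list: a little crap gets through, but the good links survive.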

It could help to annotate links for blogrolls and links for comment URLs and so on. Not with that abysmal ‘nofollow’, but with something meaningful, like ‘commenter URL’ or ‘blogroll link’ or something of that nature. We do something like this with tags, and though I don’t care much for tags in weblog posts, I don’t agree with Bloglines’ Mark Fletcher that tags generally suck, especially when it comes to effective uses of microformatting to annotate links.

(Speaking of which, what kind of a post is: “I was going to blog something about how tags are bad, evil horrible bad, and highlight the failure of existing search technology, but I couldn’t muster the energy. High level message: tags suck and are unnecessary except in cases where no other textual data exists (like photos, audio or video). Discuss amongst yourselves.” How’s this: Bloglines is indulging in evil censorship of my communication because it doesn’t pick up the links from my posts. Discuss among yourselves.)

Unfortunately, microformats generally require some technical expertise on the part of the person using them, and to base any kind of measurement on this is irresponsible.

Once I have data that is reasonably clean and fresh, if I were to create a list, I would do one based on popularity versus influence, and I would differentiate these by the number of blogroll links for a site, as compared to the number of dynamic links. To me, a person who has a large number of dynamic links compared to static blogroll-like links would be more influential (hi Karl) than one who has a fairly even ratio between the two. I wouldn’t mind seeing this ratio in a list rather than the counts; we could then find who is influential within groups, even if the groups are smaller. Regardless, I would also provide the raw data to others, and let them derive their own lists if they want.
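As a sketch of what such a ratio might look like: the version below expresses it as the dynamic share of all incoming links, which is my own choice (it avoids dividing by zero for a site with no blogroll links) and only one of several reasonable formulas.

```python
def influence_ratio(dynamic_links, static_links):
    """Share of a site's incoming links that come from posts, not blogrolls.

    Close to 1.0: referenced mostly in posts (influence, in the sense above).
    Around 0.5:   a fairly even mix of post links and blogroll links.
    """
    total = dynamic_links + static_links
    if total == 0:
        return 0.0  # no incoming links at all
    return dynamic_links / total


def rank_by_influence(sites):
    """sites: mapping of name -> (dynamic_links, static_links)."""
    return sorted(sites, key=lambda name: influence_ratio(*sites[name]), reverse=True)
```

Because it is a ratio rather than a count, a small site with 9 post links and 1 blogroll link scores the same as a large one with 900 and 100, which is what makes it usable within smaller groups.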

Why give away precious data? Because by keeping the source of the data and algorithms open, I establish credibility. In addition, flaws will be found and smart people will provide suggestions for improvement. Most importantly, I give those who would be critical of any of my processes nothing to hook on to: the algorithms are public, and mutable; the data is available to all. I have, in effect, Teflon-coated myself with Open Source. I agree with Mary Hodder a hundred percent on the advantages of openness when it comes to data gathering techniques and processing, and providing access to raw data, but not just for ranking.

As for a business model: well, knowing the algorithms and having access to the data is one thing; being able to use these effectively, consistently, and in a manner that scales is the bread and butter of this type of technology. Google never would have been Google if it were slow.

Additional links:

Joseph Duemer is teaching a class in weblogging today. Welcome to weblogging, Joe’s colleagues. Just as an FYI, I’m on the Feedster 500 list, which makes me a weblogging princess. If I were in the top 100, I would be queen. If I were in the top 10, well, I would be a lot wealthier than I am now.

Someone who is in the top 100 is the Knitty Blog. Now, this site ably demonstrates the nature of influence over popularity — it’s not that it’s linked statically by a lot of sites; but it is referenced in a large number of posts. That, to me, is influence.

Dare Obasanjo just uploaded 50 photos from his recent trip home to Nigeria. What I want to know, Dare, is why you took so many photos of billboards?

Fulton Chain carries the best b-link bar there is: with links to stories that cover a range of topics, such as a praying mantis eating a hummingbird, and how to build your own homemade flamethrower. Then there’s the Ode to Rednecks. Come on down and visit me in the Ozarks. Hear?

And that’s about enough about linking.