Categories
Semantics Web

The business of algorithms

Recovered from the Wayback Machine.

Algorithms are big business. Recently I’ve seen several jobs where the company wants someone who is “…good with algorithms”. Microsoft is competing with Google is competing with Yahoo to hire the best algorithm wranglers (which evidently, according to the article, does not mean women). IBM is releasing its unstructured data architecture (UIMA), including its concept-based search algorithms, into open source by year end. Even within weblogging the debate, and the race, is on to find the best algorithms to mine us, otherwise known as the higher-income people without lives.

Suddenly, the hip and cool kids on the block can “do” algorithms.

With all this interest, though, comes a lot of confusion and misunderstanding, starting with, but not limited to, the very concept of algorithm–a concept which is now taking on such mystical properties that those who can “do” algorithms are being vested with an almost god-like prescience. It is time, and past time, to put the brakes on the hyperbole surrounding algorithms.

Starting with the basics: what is an algorithm?

What is an algorithm

An algorithm is nothing more than the description of the steps necessary in order to reach a goal. The goal may be something as simple as baking a cake, or as complex as mapping the gene sequence of humans; however, the concept of algorithm doesn’t change with each goal–only the steps.

You have three apples, and someone asks you for one: how do you know how many apples you have left? Seriously, this is not a joke–simple addition, multiplication, division, and subtraction are algorithms, and each is represented by specific equations that introduce specialized operators. To better see this, sometimes you need to remember what it was like to learn math, and then program this knowledge into a computer.

For instance, add two numbers: 14 and 17. No, not by memorization — by working out the steps. First you line up the rightmost digits, or the ones column, and perform addition on the numbers in this column: 4 and 7. The act of addition is taking one number, breaking it down into units, and then adding these units to another number: 4 + 1 is 5; 5 + 1 is 6; 6 + 1 is 7, and so on. You know this; you remember your first exposure to a calculating device–your fingers. So add 4 and 7, just like you did when you were younger.

Start with 4, turn down your left thumb, that’s 5. Turn down your left index finger, that’s 6. Turn down your left middle finger, that’s 7. Turn down your left ring finger, that’s 8. Turn down your left pinky, and that’s 9. Then switch to the other hand, and continue. Turn down your right thumb, that’s 10. Finally, turn down your right index finger and that makes 11. Stop at this point, because you’ve turned down 7 digits. Of course, we could have started with 7 and added 4, but chances are when you were younger you started with the number on the left and added the number on the right (though this may change based on culture and language).

So now you have 11. That was an amazing accomplishment. Do it enough times, and you remember the result and you don’t have to turn down fingers when you’re asked to change some figures during, say, a board meeting.

But now you have a problem: you have a value in the ones column, but you also have a value in the tens column. So what you do is ‘carry’ that number over to the tens column, and add it to the other numbers that were already there–leading to addition on three numbers: 1 and 1 and 1. Luckily, the numbers are small or we’d probably have to start removing our shoes.
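For the programmers in the audience, here is a minimal sketch of that column-by-column procedure in Python (the function name is mine, invented for illustration):

```python
# Grade-school addition: line up the columns, add right to left, carry.
def add_by_columns(a: str, b: str) -> str:
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # line up the rightmost digits
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):  # ones column first
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))  # the digit that stays in this column
        carry = total // 10             # the value 'carried' to the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_by_columns("14", "17"))  # prints 31
```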

Combine all these steps into a sequence of actions, and you have a very complicated, multi-step algorithm. Extend these same basic steps, and you can also do subtraction, multiplication, and division. In fact, once you’ve mastered the algorithm for addition, you have the basic skills necessary to work with any algorithm. It is only a few short steps from addition to something like Newton’s Method. The only barrier to taking these few short steps is interest and intimidation: interest, because not everyone really wants to learn how to do Newton’s Method; and intimidation because after a while, it’s a lot easier to define new equations and new operators to represent higher-level algorithms, and our first exposure to these sends many of us running for the door.
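To make that concrete, here is a quick sketch of Newton’s Method, which finds where a function crosses zero by repeatedly sliding down the tangent line: just a loop of subtractions and divisions (names are mine, for illustration):

```python
# Newton's Method: improve a guess x by stepping to x - f(x)/f'(x).
def newton(f, df, x, tolerance=1e-10, max_steps=50):
    for _ in range(max_steps):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tolerance:  # close enough to the root; stop
            break
    return x

# The square root of 2 is a root of f(x) = x*x - 2.
print(newton(lambda x: x * x - 2, lambda x: 2 * x, x=1.0))  # ~1.4142135623
```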

Aside from the intimidating equations, that’s all an algorithm really is: a formalization of the steps necessary in order to reach a goal. So when I’m asked on an interview if I can “do” algorithms, in my mind I hear: can you add 17 and 14 without having to take off your shoes? I can then reply without hesitation that yes, I can.

I am now ready to work at Google.

Well, not quite.

Pattern Matching, Hypothesis, and Proofs

Any of us can follow an algorithm if we’re interested and patient enough. But it takes something more to be able to derive the algorithm in the first place.

In his book, “Vision”, David Marr writes about the steps he and his fellow researchers took to discover a computational theory of vision. The first was to create a representational framework: a hierarchy of vision, from the simplest edge detection up to the most complex visual processes. They then searched for existing algorithms that matched the observed and representational behavior:

These ideas suggest that in order to detect intensity changes efficiently, one should search for a filter that has two salient characteristics. First and foremost, it should be a differential operator, taking either a first or second spatial derivative of the image. Second, it should be capable of being tuned to act at any desired scale, so that large filters can be used to detect blurry edges, and small ones to detect sharply focused fine detail in the image.

Marr and his co-researcher Hildreth eventually came up with the Laplacian of Gaussian, also known as the Marr filter, or Marr-Hildreth filter.

Marr and Hildreth were able to derive their filter, their algorithm, because of training in math and neuroscience, as well as new research in the fields of artificial intelligence and vision. The training provided the tools they needed: a catalog of existing algorithms, as well as the necessary protocols; the research then provided the necessary new data.

If you use Photoshop’s Unsharp Mask, you can judge for yourself the success of Marr’s efforts. The point is that Marr and Hildreth established a goal, observed behavior, and then went shopping to find which algorithms came closest to matching the observed behavior — in this case a combination of algorithms: applying a Gaussian filter to blur the image and remove structure, and the Laplacian to detect the ‘edges’, or differences of intensity, that remain.
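If you’d like to play with the filter itself, SciPy ships a Laplacian-of-Gaussian; a minimal sketch (the wrapper function is mine, not part of any library):

```python
import numpy as np
from scipy import ndimage

def marr_hildreth(image: np.ndarray, sigma: float) -> np.ndarray:
    """Apply the Laplacian of Gaussian at scale sigma; edges lie along
    the zero crossings of the result."""
    return ndimage.gaussian_laplace(image.astype(float), sigma=sigma)

# A large sigma picks out blurry edges; a small sigma, fine detail.
```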

So now if I had the opportunity to work with the late Dr. Marr, and he asked me if I can ‘do’ algorithms, in my mind I hear: “Ohmigod, what am I doing here. Do you think anyone will notice if I slip out?”

Okay, but we’re not trying to invent artificial intelligence here

Now that we’ve gone from adding two apples to programming human sight, we’ll focus on algorithms located somewhere in between.

Though I wouldn’t have the background to work directly with Dr. Marr, this isn’t to say I can’t work with algorithms. Any time I write code I make use of, or even create, algorithms. When I work with data, either in RDF format or as relational data, I am using algorithms. As Dr. Marr was an ‘expert’ in computer vision and neuroscience, I am, equally, an expert in my own field of interest.

Most of us work with algorithms, though we may not be aware of the fact. When we follow a recipe for Beef Wellington, the instructions for putting together a model airplane, or a complex baby blanket knitting pattern, we’re following algorithms. And if we create a new recipe, knitting pattern, or computational theory of human behavior, we’re creating new algorithms–usually derived from existing ones, if possible. (It’s easier to work with existing, proven algorithms than to go through formal proofs with new ones.)

In other words: there is no ‘algorithm’ gene that some people have, and others don’t. One doesn’t need a PhD to work with algorithms; the ability to work with algorithms is pretty much universal.

Cool. So where was I? Oh yes, the business of algorithms.

Blogorithms

I had to do it before someone else did it. It was only a matter of time.

Now that weblogging has established its credibility (i.e. can be used to make money) and there are millions of us (“over 14 million served daily”), the interest in creating algorithms to make use of all the rich, seductive unstructured data we generate is very strong. Understandably so.

However, unlike previous research projects such as Dr. Marr’s, the current weblogging efforts seem to focus on the algorithms rather than the goal. Because of this, we’re measuring every last bit about ourselves, but not coming up with anything useful. By focusing on the tools rather than the end point, we’re mixing search with popularity, marketing with discovery, and then throwing in a little structured data–just to make things interesting.

For instance, looking at mixing search with popularity:

The Technorati 100, Blogdex, and Daypop Top 40 are all representatives of the same general type of algorithm: receive notification of an update, extract the links, increment the count for each matched link, and place the top n link holders on the list, ordered by number of links. Though the data is treated differently–Technorati persists the number of links per page, while Daypop and Blogdex are only interested in reflecting ‘fresh’ links–the concept, and hence the algorithm, is the same: when a link to a specific page is encountered, add the source to a list and increment a counter. Then adjust the list accordingly.
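Stripped to its essentials, that algorithm is little more than a counter. A sketch, assuming the services work roughly this way (none of them has published its code):

```python
# Count links as update pings come in, then rank by raw link count.
from collections import Counter

link_counts = Counter()

def on_update(post_links):
    """Called for each new post the service notices; count the pages it links to."""
    link_counts.update(post_links)

def top_list(n=100):
    """The 'Top n': link targets ordered by how many sources linked to them."""
    return link_counts.most_common(n)
```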

None of this activity has anything to do with search. Each tool may also grab data for a search component, but the algorithms for popularity are not specific to search. Where the two get confused is when popularity is used as a factor in search.

Adjusting search results based on popularity combines two different algorithms, but there is no rhyme or reason for doing so. The fact that one page is more ‘popular’ than another does not make it a better authority. It’s not the same as something like Google’s PageRank, because PageRank is not a measure of popularity.

Ian Rogers wrote a very nice writeup of the original PageRank algorithm, breaking the formula down into its various steps. Summarizing: PageRank is based on incoming links, but it is the value of those links that pushes up the PageRank, and this value depends on how many outgoing links the linking page has.
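Rogers’ breakdown reduces to the original recurrence, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where the Ti are the pages linking to A and C(Ti) is each one’s count of outgoing links. Here is a naive sketch of the iteration (my own toy version, nothing like the production system):

```python
# Iterate the original PageRank recurrence until the values settle.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(targets)             # a link's value is diluted
                for src, targets in links.items()  # by its source's out-degree
                if page in targets
            )
            for page in links
        }
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```

Note how a link from a page with many outgoing links is worth less than a link from a page with few: value, not a raw tally.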

This sounds like popularity, but it isn’t. The whole purpose of the PageRank algorithm is to approximate a random surfer, and the probability that they would end up at a given page after randomly clicking through so many pages. According to Sergey Brin and Lawrence Page’s original paper:

PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the “random surfer” will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking.

The PageRank in use today is not the same one just described; the one in use today is said to feature over 100 different variables. It’s not surprising that the computation has changed, but we have to suppose that the reasoning remains: its purpose is not to calculate popularity, but to capture the behavior of the random surfer. The only problem is, webloggers are not random surfers.

Weblogs combine the threaded chat behavior of a bulletin board system with the separate domains of more traditional web pages. As such, the genre creates a threaded web-linking behavior that must play havoc with the traditional search engine paradigms. We don’t link just out of interest or as reference; we also link based on personal attachments, likes and dislikes, as part of self- or other-promotion, and a host of other reasons, none of which has anything to do with the traditional web linking of long ago.

danah boyd discussed this in a recent post, after reviewing 400 weblogs for linking patterns. She wrote:

Linking Patterns:

The Top 100 tend to link to mainstream media, companies or websites (like Wikipedia, IMDB) more than to other blogs (Boing Boing is an exception).

Blogs on blogging services rarely link to blogs in the posts (even when they are talking about other friends who are in their blogroll or friends’ list). It looks like there’s a gender split in tool use; Mena said that LJ is like 75% female, while TypePad and Movable Type have far fewer women.

Bloggers often talk about other people without linking to their blog (as though the audience would know the blog based on the person). For example, a blogger might talk about Halley Suitt’s presence or comments at Blogher but never link to her. This is much rarer in the Top 100 who tend to link to people when they reference them.

Content type is correlated with link structure (personal blogs contain few links, politics blogs contain lots of links). There’s a gender split in content type.

When bloggers link to another blog, it is more likely to be same gender.

As danah mentioned, 400 weblogs is too few to extrapolate any global behavior, but we’ve seen one or more of these patterns ourselves–in particular, not linking to someone but giving the person’s name, or not even directly mentioning a name (a behavior that is becoming more common).

Whether there are gender differences in linking has been the subject of much debate. danah found in her examination of 400 weblogs that bloggers linked to their own gender more often than not. If this is true, and women account for about 50% of weblogs, then we should expect more weblogs by women in the popularity lists. That we don’t see this shows that we need to continue our observation before we can begin to derive algorithms related to weblogging and popularity, and weblogging and search.

As for marketing and discovery, and the thin vein of structured data (microformats, syndication feeds, FOAF, et al) that runs through all of this unstructured mess–this will be good for a follow on topic someday.

Onward

Mary Hodder had originally listed out several different metrics for consideration when it comes to developing an algorithm and asked if the approach she proposed was the right one:

So this is my first post thinking about making an open source algorithm. And I’m wondering, is this a useful approach? I think it could be worthwhile, done right, and I put it out there to the blogging community to determine what is best here. As I said, after seeing what people who want to work with smaller topic communities are doing, it may be in bloggers’ interest to think about how this might be done so that it is more in keeping with the desires and views of the blogosphere.

The approach–reviewing all the different metrics, looking at representative data, searching for repetitive behaviors–is all good, and, equally, all for nought unless the purpose of the effort is clearly understood. This is worth repeating: an algorithm is nothing more than a formalization of the steps necessary to meet a goal. And replacing the Technorati 100 because it ‘sucks’ is not a particularly good goal. So my suggestion to Mary would be to establish the end points for this effort first.

And now rumors abound that Technorati is being sold, possibly to a major search company. Well, that’s one way to get rid of the Technorati 100–convert it to the Yahoo 100. Semi-serious joking aside, if we know one thing by now about all of this, it’s that the unstructured data that weblogging sits on is this year’s hottest commodity–second only in value to the algorithms used to mine it.

Categories
Semantics Web

Snapshot in semantic time

Danny Ayers: I’m guessing Shelley’s comment was where the ‘lower-case semantic web’ thing originated.

You betcha. And I’m going after royalties, baby. I figure a dime for every mention, and my taxes will be paid and groceries bought. For a month.

Yup, I just hope that Technorati has lots of money. Lots and lots of money. Corante, too. And Marc Canter! He oughta be worth a car payment, at least. HeYAH. In the dough now, children. In the dough.

I also wish more folk would take the time to pull together the threads in a meaningful way, like Peter Van Dijck did with the early semantic web discussions. Creating a tool to make it simple to create this type of page would be an excellent programming project. A project that could be used to… generate even more money!

money money money…MONEY. money money money…MONEY!

Nobody can’t say I ain’t got my priorities straight. No siree Bob.

update

The links to my writings in the paper haven’t survived the many weblog tool moves I’ve made. I’ve since found the posts, published below:

A Semantic Conversation

Deconstructing the Syllogistic Shirky

Categories
Diversity Technology Weblogging

Technology is neither good nor evil

Recovered from the Wayback Machine.

Technorati has been coming under fire a great deal lately, including this latest by Om Malik, who writes:

So this is where I lose the plot – I tag my post, Technorati benefits, and despite all that, my tags help spammers who clog my RSS readers gain more readers. That’s absolutely rotten! So essentially the spammers can write a script, generate tags, stay high on the Technorati listings and fool people into visiting their sites. By tagging I am helping these scumbags, the RSS-link blog spammers. This is clearly not going to help Technorati (or in fact anyone’s reputation) as a good search tool.

This is a conversation we’ve been having for months, as I noted in Malik’s comments. He wrote an email asking for references, saying he’d searched on Google for the terms, and I responded with links to several posts, including some of my own — most found by going to Technorati rather than Google.

And therein lies the rub: a year ago, Technorati could do no evil. Now, Technorati can do no good. Neither is the absolute truth, because we’re applying terms such as ‘good’ and ‘evil’ to what is nothing more than technology, and technology just is.

I don’t care for Technorati’s Top 100, because I see absolutely no value in it from a technological perspective, but much harm in how the list is used to attach ‘authority’ to certain webloggers. However, I do like the ability to look at one of my posts in Technorati and see who has linked to it and, more importantly, what they’ve said. There is no inherent implication of authority in this, nor any indication of moral righteousness: it is just a reading of what is, and can’t be–shouldn’t be–misused to imply value in the post. Now that many of the scaling and implementation issues are being taken care of, this service is very helpful.

As for tags, I disagree with Technorati automatically converting categories into tags, as there is no overall value in this. I have categories that make no sense outside the context of the weblog, and all they’re doing is polluting the environment. However, I found explicit tags to be helpful when I was trying to keep up with all of the recent BlogHer writings; not to mention those on Hiroshima. As for spamming, I’ve noticed that tags that are both very active and very time-specific overwhelm the spam. BlogHer in a month may be filled with spam, but right after the conference, the links were relevant. Again, challenges in implementation aside, there is no implication of good and evil in tags; they are a service, nothing more. Not, I hasten to add, a way of ‘defining’ social groupings or any of the other glorious-sounding purposes associated all too often with ‘authority’ (read that link) based technologies.

Speaking of BlogHer, among those posts found in Technorati, I was happy to read about the pushback against lists such as the Technorati 100 that happened at BlogHer, but less than thrilled when this was, in my opinion, misconstrued as an ‘opportunity’ to replace ranking indexes such as the Technorati 100 with something better. In particular, Mary Hodder suggested …a community based algorithm, based on more complex social relationships than links. She had this idea going into the BlogHer conference, based on a dinner she attended one night with several people, including Ross Mayfield, who wrote at Many to Many:

Following Liz’s read of BlogHer, one of the more interesting points to come out of the conference is the need for constituent algorithms — ways of revealing hidden groups. For the BlogHer community, the Technorati 100 was more than a whipping boy, but an index where a group was under-represented. Mary Hodder’s approach, spot on, is to develop alternative indexes.

Ross then goes on to discuss the limitations of indexes, such as the authority implicit in each, which left me puzzled as to why he would approve of the development of alternative indexes.

Yesterday, Mary released her new effort to identify alternative algorithms, based on a dinner she had with Ross, Doc Searls, Halley Suitt, and others in Paris a couple of months ago. It’s a very detailed and thoughtful post, and I respect the amount of work she put into it, but it seems to me that no matter how much the community is involved in this effort, it’s just propagating the same problems, because the issue isn’t about technology, it’s about people and how we behave.

If women are not as visible in weblogging (or technology or politics and so on) because of some esoteric issue with technology, then our problems could be easily solved. I would personally devote my life to finding the Woman Algorithm — the algorithm to give equality to women. But, as we’ve seen with the recent linking to BlogHer reports, the issue isn’t that simple.

Even considering the fact that BlogHer was about women in weblogging, the single most linked individual post on the conference was Jay Rosen’s–one of the few men to attend the conference. Why was Jay’s the most linked? Well, some of it was because he provided a viewpoint that led to debate. He used a ‘confrontational’ term that was guaranteed to trigger furious discussion. I linked to him for that specific reason, as did other people. However, Halley Suitt also wrote a post that generated much debate, and though it was also well linked, not as much as Jay’s. Does this, then, mean that Jay’s was a better post? No, not necessarily.

If you look at those who linked to Jay, you’ll see two patterns: people who linked to Jay because of what he said, and others who linked to Jay because of who he is. What is the common characteristic of those who linked to Jay without specifically referencing the ongoing discussion? They were all men. Is this relevant? Well, considering the purpose behind BlogHer, I would say the results aren’t irrelevant. I suppose we could compensate by having all the women at BlogHer link to Halley just because–but that doesn’t solve the problem, it just ‘hides’ it in this particular instance.

Links based on the work are something that can be measured accurately with technology, and used to derive some overall value–interest, if nothing else. But the latter, this linking because of who a person is, can’t be normalized with technology, no matter how clever the algorithms or how open the process. Not unless we start adding demographic metadata to our weblogs: sex, age, economic classification, race, marital status, political party, and so on. Though there are those who wouldn’t hesitate to put this information online, most would, rightfully, look askance at the process.

Even if we tried to analyze a person’s links to another, we can’t derive from this anything other than that person A has linked to person B several times. If we use these to ‘define’ a community to which we belong, and then seek to rank ourselves within these communities, all we’ve done is create a bunch of little Technorati 100s–and communities that are going to form barriers to entry. We see this ‘communal’ behavior all too often: a small group of people who know each other link to each other frequently and to outsiders infrequently, basically shutting down the discussion outside of the community.

Continuing, Mary writes:

So the tension is, do we in the blogosphere figure out a more sophisticated, open standard based metric that reflects the way we see blogs, within and across communities, in order to score blogs? And do we do this within topic areas? Or does using a more sophisticated algorithm across all blogs make more sense? Or do we allow this all to be done for us, possibly in an opaque way by some of the blog search engines or by people who are trying to figure out blogger influence and communities for their clients, or do we write off those efforts because we know they cannot possibly understand us anyway?

I have to say, I’ve resisted this for the past year, even though many people have asked me to work on something like this, because I hate rankism. I think scoring, even a more sophisticated version of it, akin to page-rank, is problematic and takes what is delightful about the blogosphere away, namely the fun of discovering a new writer or media creator on their terms, not others. What I love is that people who read blogs are assessing them over time to see how to take a blogger and their work.

But more recently, as I said, I’m seeing these poorly done reports floating around by PR people, communications companies, journalists, advertising entities and others trying to score or weight blogs. And after hearing the degree to which people are upset by the obtuseness of the top counts, and because they do want to monetize their blogs or be included into influencer ranks, I’m at the point where I’d like to consider making something that we agree to, not some secretly held metric that is foisted upon us.

I think Mary should stop with …I hate rankism. I understand the motivations behind this work, but ultimately, whatever algorithm is derived will eventually end up replicating the existing patterns of ‘authority’ rather than replacing them. This pattern repeated itself within the links to Jay Rosen’s post; it repeated itself within the speaker list that Mary started for women (“where are the women speakers”), which had its first man within a few hours, and whose purpose was redefined within a day to include both men and women.

Rankings are based on competition. Those who seek to compete will always dominate within a ranking, no matter how carefully we try to ‘route’ around their own particular form of ‘damage’. What we need to challenge is the pattern, not the tools, or the tool results.

Of course, I realize that mine is just one opinion among many, and this work will continue regardless of what I write. As a tech, I’ll be interested to see the algorithm develop, and will even provide whatever insight I can. That is, if my rank gives me the necessary authority.

Categories
Diversity Technology Weblogging

Lurking tool hit one

Recovered from the Wayback Machine.

I was asked in the BlogHer chat if my ears were burning and I can see why — I was mentioned in the opening session debate of the conference.

Here is a liveblogging account where Mena Trott says that I dismiss her and criticize the company because there are no women in it. I don’t think I’ve mentioned Mena Trott in the longest time, but I have been critical of Six Apart because the company has not hired women to do back-end development. In fact, I don’t think any women are involved with the development of the tool at all now, though I could be wrong.

As for Meg Hourihan, I’ve never dismissed her contribution as a Blogger developer, and Caterina Fake knows I hold her in high esteem. Interesting.

What followed is a note about Marc Canter saying that if we don’t like how the tools work, we should create our own and tell the guys to fuck off.

A great idea, Marc. I’ll start something and then see if Joi will fund me. Then the blogging guys could do what they want, undisturbed by us pesky women asking embarrassing questions.

Most definitely looking forward to hearing the transcript now.

Categories
Internet

Still wired

I did not cancel my internet connection on Monday, after all. I needed to keep it up for some work I’m doing this week, but had planned on canceling it at month’s end. However, I had a nice conversation with a person from Charter Communications today, and we worked out an equitable solution to our existing contract dispute.

I am canceling the television portion of the cable, but keeping the internet at a discounted price (and it is a good price). In return, I agree to keep the connection until December, the time remaining on the contract according to their records.

Note that I expect the SciFi fans among you to keep me updated on the Stargates and Battlestar Galactica storylines.