Cheap Eats at the Semantic Web Cafe

Recovered (comments and all) from the Wayback Machine.

It’s a rare event when several seemingly disparate items of interest all come together to form a compelling, coalescent whole. This event happened for me over the past few weeks: an experience formed of discussions about digital identity and laws of same, LID, Technorati Tags, new and old syndication formats, Google’s nofollow, and the divide between tech and user. Especially the divide between tech and user.

I’ve written about digital identity and LID and nofollow recently, so I want to focus on Technorati Tags in this writing, and then, later, bring in the other technologies’ relationship to same. Besides, for someone who is interested in the lowercase semantic web, how can my ear not be all aquiver when I hear about a new way of ‘adding meaning’ to what can be a meaningless web at times?

Tag, you’re it

If you’re unfamiliar with Technorati Tags, it’s a new implementation of an existing concept previously enabled by other sites such as del.icio.us and Flickr. With Technorati Tags, webloggers can annotate their entries with keywords, forming a quasi-classification on the hoof, so to speak.

When you update your weblog, and ping Technorati (or some other service that results in Technorati’s web bot consuming your post), the link to your post is added to the list of most recent entries that share the same tag. Not only that, but items at del.icio.us and Flickr are also shown in the page, as this entry labeled Folksonomy demonstrates.

From reading other webloggers, the main excitement behind Technorati Tags is its ability to socialize a classification. David Weinberger wrote the following when the concept was first rolled out:

This is exciting to me not only because it’s useful but because it marks a needed advance in how we get value from tags. Thanks to del.icio.us and then flickr in particular, hundreds of thousands of people have been introduced to bottom-up tagging: Just slap a tag on something and now its value becomes social, not individual.

Cory Doctorow shared in this enthusiasm, writing:

Technorati Tags are keywords that map to category names, keywords, and other cues in blog posts. When you bring up a Technorati Tag for “computers,” you get all relevant blog posts that Technorati knows about, presented on a page with relevant Del.icio.us links and relevant Flickr images. Technorati Tags blend three different Internet services and three services’ worth of tags to tease meaning out of the ether. Brilliant.

Ross Mayfield writes:

But below all that global heady stuff, what tags do really well is aid social discovery.

Simon Waldman jumped in with:

Smart. Smart. Smart. If a little rough round the edges.

And Suw Charman enters the lists with:

All in all, this is an interesting way of using emergent tagsonomies to pull together diverse datastreams in one place. As it happens, I’ve had a number of different conversations recently with friends about such things, and this is a useful first step along the way to creating a single entry point for a variety of sources.

It might seem at first exposure that the enthusiasm for Technorati Tags is a little difficult to understand. After all, we’ve been able to classify our writings for a long time in our weblogs; as for searching on specific topics, we’ve had considerable experience using keyword searches in Google and Yahoo. However, the interest in Technorati Tags seems to be focused on its value as a social grouping rather than as a way of categorization. Waldman referenced the term “self-organizing web”, to describe the concept.

For instance, if I were using Technorati Tags in this post, I would add whatever tags I felt represented the content of this writing, such as Folksonomy, Digital_Identity, Tags, and Old_Mills. Of course, when checking Old_Mills, I find that this is fresh meat from a Technorati perspective, as there are no previously annotated weblog listings using this tag. This leads me to believe that perhaps there’s a different tag I want to use. After all, if I’m going to go through the bother of using a Technorati Tag, I would rather use one that puts me into an active social classification than one that doesn’t. So I try Missouri instead, because after all, the photos of old mills in this writing are in Missouri. I see a gratifying number of entries for this tag, providing positive feedback on my choice.

This process of refining exactly which tags to use demonstrates what we’re told is the true power of Technorati Tags–not that we, as individuals, can categorize our writing any way we want; but that people will seek out existing tags that represent their material, and from this begins a grassroots taxonomy–or folksonomy, to use what is becoming a popular term.

Returning to my ‘socialized choice’, among the other entries tagged “Missouri” are pointers in del.icio.us to a Metafilter discussion on the recent ruling about the KKK being allowed into the highway cleanup program, and an interesting story in reference to the New Madrid fault. Both are stories I’ve written about and, if I had tagged them previously, my posts would also show in the list. This does demonstrate the positive grouping effect of these tags.

Still, there are other entries that look more like ads than entries related to Missouri, including ones for mobile DJs. This demonstrates one of the negative aspects of Technorati Tags: their vulnerability to spammers. Another vulnerability that has been quickly pointed out is that the material can be seen as inappropriate to the topic or even offensive when placed next to the other material that’s published in the same category.

Bad tag. Bad.

Rebecca Blood was one of the first to make note of inappropriate material within the content tagged with “MLK” for Martin Luther King day.

Now, that photo is perfectly appropriate on Flickr as part of an individual’s collection, and as documentation of Sunday’s rally. It’s perfectly appropriate as an illustration for ‘protests’, or even ‘Israel’ and ‘Palestine’, even though it surely will offend some people wherever it appears. But it is not appropriate to illustrate a category tagged ‘MLK’. I personally was offended–these sentiments reflect the polar opposite to those espoused by Dr. King. More to the point, such an illustration is inappropriate–that poster has as much to do with Dr. King as would a picture of a banana peel.

Foe Romeo also noticed this, particularly when she looked at the Teen tag and found links to a pornography weblog. She suggests that Technorati has taken on new roles as both editor and moderator with the introduction of Tags. In her comments, Kevin Marks responds to her concerns with:

We have confirmed with Flickr that pictures flagged with offensive are not included in external feeds, so the advice to Rebecca to visit Flickr to warn about the picture was correct; we also removed the german porn spam blog you noticed from our database.

We are still feeling our way here, and adding community moderation is one possibility.

But another commenter, Beerzie Yoink (who links to an interesting website, btw) wrote:

I’m not a technical genius, but quite frankly don’t see how they are going to manage this. Won’t tags used by spammers, pornographers, racists, and other jerks will be hard to separate from legitimate posts? It will be interesting to see how this plays out.

(em. mine)

Within a day or so of Tags being released, questions have been asked about separating out ‘good’ material from ‘bad’, and finding ways of altering Technorati so as to eliminate offensive material. Of course, as Julian Bond points out, there’s a mighty big chasm between here and there when it comes to this type of change:

We seem to be playing out the same old, same old pattern once more that’s been done a million times before in online communities. The Politically Correct Police (PCP) are making lots of noise about how “This isn’t right and SOMETHING SHOULD BE DONE”. The Anti-PCP come along, who love a good flame war, and are finding ways to wind them up. The poor developers get backed into a corner and end up coming up with a series of nasty hacks to sanitise what was once a nicely elegant, simple and minimalist solution. What makes me laugh in all this are the ludicrous solutions put forward by the PCP who clearly have never been anywhere code.

One of the challenges with self-forming community efforts is that each member brings with him or her different interpretations of why the group has formed, and what its purpose is. What’s particularly fascinating is that the same people who extol the ease with which the group can form are also the people who then pick through the members, saying which ones can stay, and which ones have to go.

While some of those who have questioned the overall goodness of Technorati tags have focused on the correctness of the content, others focused on the quality of the overall effort. In other words, can cheap semantics scale?

Get yer semantics here! Red hot semantics! Get ’em while they last

I took the title for this post from Tim Bray’s discussion about Technorati tags, where he wrote:

I’ve spent a lot of time thinking about metadata and have written on the subject; the most important conclusion was: There is no cheap metadata. I haven’t seen anything to make me change my mind.
…
Having said that, and granting the proposition that The Simplest Thing That Could Possibly Work usually wins, I still have to say that the Technorati Tags all being in a single flat namespace does seem a little, well, brittle.

Liz Lawley also wrote on her concerns about the long-term viability of tags and folksonomies, specifically, whether group consensus leads to valid, or best, results:

On the one hand, as a librarian, I understand completely the value of controlled vocabularies and taxonomies. I don’t want to have to look in six different places for information on a given topic—I want some level of confidence that the things I want are grouped together. On the other hand, I don’t share the optimism that so many of my colleagues in this field seem to have that the collective “wisdom of crowds” will always yield accurate and useful descriptors. Describing things well is hard, and often context-specific.

Bang on the money, except that I would extend this further to read, “…describing things well in such a way as to be meaningful to a great proportion of the populace…” All of us can describe things in ways easily understood by ourselves or our immediate social groups.

Both Liz and Tim reference a post by Clay Shirky where he writes that though folksonomies (the concept to which Technorati Tags has been linked) may not have the quality of well-designed vocabularies, they’ll still persist and ultimately triumph, primarily because these efforts minimize cost and maximize user participation.

This is something the ‘well-designed metadata’ crowd has never understood — just because it’s better to have well-designed metadata along one axis does not mean that it is better along all axes, and the axis of cost, in particular, will trump any other advantage as it grows larger. And the cost of tagging large systems rigorously is crippling, so fantasies of using controlled metadata in environments like Flickr are really fantasies of users suddenly deciding to become disciples of information architecture.
…
Any comparison of the advantages of folksonomies vs. other, more rigorous forms of categorization that doesn’t consider the cost to create, maintain, use and enforce the added rigor will miss the actual factors affecting the spread of folksonomies. Where the internet is concerned, betting against ease of use, conceptual simplicity, and maximal user participation, has always been a bad idea.

Yet it’s interesting that those who support the concept behind folksonomies tend not to use it as effectively as they could, as pind’s dot com discovered when looking at the del.icio.us tags used by Liz and Clay. What’s needed, he then writes, is technology that helps him, and the rest of us, do a better job of classification. But then that takes us back to Julian’s statement about taking minimalistic solutions such as Technorati Tags and telling developers to ‘make them better’–make them so that they perform as well as controlled vocabularies, but without requiring any effort, expertise, or discipline on the part of the users of such technologies.

The consensus among all those who wrote on Technorati Tags seems to be that even if folksonomies are not as sophisticated as we would wish, may not scale well, and lack the quality that controlled vocabularies have, they’re still based on typically simple solutions that are easily applied by the user and controlled by the user, and are therefore better than having nothing when it comes to trying to build this semantic web of ours. Or as Clay wrote:

The advantage of folksonomies isn’t that they’re better than controlled vocabularies, it’s that they’re better than nothing, because controlled vocabularies are not extensible to the majority of cases where tagging is needed. Building, maintaining, and enforcing a controlled vocabulary is, relative to folksonomies, enormously expensive, both in the development time, and in the cost to the user, especially the amateur user, in using the system.

I grant that tags (Technorati, Flickr, and other) and the other tools of folksonomies are better than having nothing at all; but is there a possibility that they are also worse than having nothing at all?

Bad habits are hard to break

Recently I, and others, wrote about a new single sign-on digital identity system called Light-Weight Digital Identity (LID). What caught our attention wasn’t necessarily that LID was the best digital identity system proposed–there are a lot of unanswered questions inherent in the current implementation–but that it was the first that actually delivered code into the hands of users, code that empowered us to control our own identities.

When I wrote on LID, I was asked in several emails what I thought of the Identity Commons’ effort with XRI (eXtensible Resource Identifiers) and XDI (XRI Data Interchange)–universal identification and data exchange protocol specifications, respectively; particularly since I am such an adherent of RDF, and both are dependent on URIs (Uniform Resource Identifiers) to identify objects of interest, and the implementations of the two could be made interchangeable through existing technologies. I answered that I was ‘briefly’ familiar with them, the briefly based on the fact that both are still primarily in specification stage and there is no implementation that I can put my hands on. I could agree that many of the issues about digital identity and problems associated with it have been addressed by the documentation for XRI/XDI–but where’s the goodies?

In other words, XRI/XDI may be the more robust solution, but there’s nothing that I can work with (pre-alpha sourceforge projects notwithstanding); whereas LID, perhaps not as robust, does provide something I can not only use immediately, but also use without any form of centralized architecture being in place to support it.

Or as was noted in the mailing list for the Identity Commons efforts, sometimes the … “simplest thing that could possibly work” is very attractive indeed.

While I was being questioned about XRI/XDI, several people had emailed Kim Cameron to ask his opinion of it. Kim has become somewhat of a leader in the digital identity community through his interest and not the least because of a set of ‘laws’ he started defining for digital identity implementations.

Rather than address it directly, Kim released a sixth law of digital identities that read as follows:

The Law of Human Integration

The universal identity system MUST define the human user to be a component of the distributed system, integrated through unambiguous human-machine communications mechanisms offering protection against identity attacks.

This law references one of the difficulties inherent in the efforts behind much of the digital identity movement, in that most of the solutions are focused on organizations protecting themselves from abuse and fraud, rather than on individuals being able to safely and easily use whatever solution is provided. This would seem to support LID. However, Kim also provided a scenario earlier, in the lead-up to his sixth law, that plays more subtly on this issue:

To take a very simple example, suppose you have a browser with an address bar showing you the DNS name of the site you are visiting. And suppose there is a “lock icon” which appears when a “secure connection” is in place. What is to prevent a piece of code running on your machine from overwriting the DNS name and throwing up a fake lock icon – so you are convinced you are visiting one secure site when you are actually visiting another insecure one? And so on.

Of course our usual immediate reaction to this type of problem is to find the most expedient single thing we can do to fix it. In the example just given, the response might be to write a new “safe address bar”. And who am I to criticise this, except that in the end, the proliferation of address bars makes things worse. By inventing one, we have unintentionally made possible the new exploit of getting people to install an address bar with evil intent built right into it. Further, who now can tell which address bar is evil and which one is not?
…
The point I am trying to make is that the new distributed identity system needs to be something other than an “expedient compensation”, something beyond a tactical riposte in the fight for security. And since the identity system has to work on all platforms, it must be safe on all platforms. The properties that lead to its safety can’t be obscurantist or derive from the fact that the underlying platform or software still has a small adoption.

In other words, the expedient solution may not be the best overall solution.

Whether LID can be seen as an ‘expedient solution’ or not, if LID had implementations in PHP or Python that would be simple to install and use, and there was more clarity on the license, it would have fired enough grassroots support to make it a contender for the de facto digital identity implementation, thus making it that much more difficult for other, perhaps more ‘robust’ solutions to find entry into the community at a later time.

This also applies to the concept of metadata. If people become used to receiving value, even if it is only limited value, from folksonomies based on very little effort on their part, they’re going to become reluctant when other more robust solutions are provided if these latter require more effort on their part. Especially if these more robust or effective solutions take time to be accessible ‘to the masses’ because the creators of same are *enclosured behind walls built of scholarly interest, with no practical means of entry for the likes of you and me.

Clay expands on his general theme of the suckiness of ontologies, as compared to folksonomies because the former forces a future prediction of structure while the latter allows for dynamic growth; the former is based on a graph, with predefined nodes, each requiring a progenitor, while the latter is based on sets, and the only barrier to entry is forming a decision to belong.

Ontology is a good way to organize objects, in other words, but it is a terrible way to organize ideas, and in the period between the invention of the printing press and the invention of the symlink, we were forced to optimize for the storage and retrieval of objects, not ideas. Now, though, we can scrap of the stupid hack of modeling our worldview on the dictates of shelf space. One day the concept of creativity can be a subset of a larger category, and the next day it can become a slice that cuts across several categories. In hierarchy land, this is a crisis; in tag land, it’s an operation so simple it hardly merits comment.

The move here is from graph theory (arrange everything in a tree graph, so that graph traversal becomes the organizing principle) to set theory (sets have members, and the overlap or non-overlap of those memberships becomes the organizing principle.) This is analogous to the change in how we handle digital data. The file system started out as a tree graph. Then we added symlinks (aliases, shortcuts), which said “You can organize things differently than you store them, and you can provide more than one mode of access.”

Yet, as we’ve already started to see with Technorati Tags, as with other implementations such as del.icio.us tags and Flickr, low barrier to entry usually doesn’t scale well. Something like the Missouri Tag may have few enough entries to make finding the meaningful data easy, but something like Weblog results in so many members as to make it difficult to differentiate from the populace as a whole. The same applies to social networks, where people collect so many ‘friends’ as to make being a ‘friend’ of the person inherently meaningless.
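Both the set-based view of tags and the scaling problem with broad tags can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post names and tag assignments are invented for the example, and nothing here reflects how Technorati, del.icio.us, or Flickr actually store or serve tags.

```python
# Hypothetical data: post names and tags are invented for illustration.
from collections import defaultdict

posts = {
    "old-mills": {"missouri", "weblog", "photography"},
    "klan-highway": {"missouri", "weblog", "politics"},
    "new-madrid": {"missouri", "weblog", "geology"},
    "lid-review": {"weblog", "digital_identity"},
    "nofollow": {"weblog", "google"},
}

def build_tag_index(posts):
    """Invert post -> tags into tag -> set of posts (one flat namespace)."""
    index = defaultdict(set)
    for post, tags in posts.items():
        for tag in tags:
            index[tag].add(post)
    return dict(index)

def selectivity(index, tag, total):
    """Fraction of all posts carrying the tag; 1.0 means the tag filters nothing."""
    return len(index.get(tag, set())) / total

index = build_tag_index(posts)

# Set overlap is the organizing principle: no hierarchy is consulted.
both = index["missouri"] & index["geology"]

narrow = selectivity(index, "missouri", len(posts))
broad = selectivity(index, "weblog", len(posts))
```

In this toy data the overlap of "missouri" and "geology" picks out a single post, while "weblog" is attached to every post and so cannot distinguish any of them from the populace as a whole.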

So then we start exploring ways and means to make these simple systems and folksonomies more effective. In the case of Google, the developers create algorithms that try to add meaning to the results returned on a search by basing the results on number of links and popularity of a site, with an assumption that popularity equates to authority. In the case of Flickr, social behavior is incorporated into the tags, and members can label photos as ‘offensive’, in which case the photo is excluded from external feeds. However, without having a clear, not to mention shared, idea of what ‘offensive’ means, the results will always be suspect. After all, some would say that photos of a woman’s bare breasts or a man’s penis are offensive; others would say any photo of President Bush is offensive.

All of these solutions and the tricks to make them work better are based on the fact that the rich context of the data is not captured along with the data, and therefore there is only so much good we can wring out of these ‘cheap’ semantic web solutions before they’re wrung dry and spit out like overchewed tobacco cud. Or before they’re gamed by people such as the comment spammers, and then we, the blades of grass within the grassroots efforts, have to add more effort to our input in order to ‘refine’ (read that ‘fix’) the results, as witness the recent release of Google’s nofollow attribute.

(One could say that Peter Kaminski was prescient when he remarked on January 15th about annotating links in a similar manner to Technorati tags, so that Google could also participate in the new, more meaningful web.)

It is the structure, the future prediction, careful classification, and directed graph nature that Clay disdainfully rejects that allows us to capture the rich nuances of data that will persist longer than the quick transitory interests that meet efforts such as Technorati Tags. One only has to compare the Technorati Tag for Terrorism with the Weapons of Mass Destruction, Terrorist, and Terrorist Type ontologies, and associated instance database to see where the discipline to apply more robust metadata concepts can result in much more controlled, and specific, result sets. And since the data is defined in a universally understood model, RDF, you don’t even have to use the ontology creator’s own search tool (try who, what, where for the three values, in that order)–you could use my much more crude, but quickly hacked together Query-o-Matic, based on existing technologies.

Louis Rosenfeld discusses the strength of searches among controlled data sources as compared to that of folksonomies:

Lately, you can’t surf information architecture blogs for five minutes without stumbling on a discussion of folksonomies (there; it happened again!). As sites like Flickr and del.icio.us successfully utilize informal tags developed by communities of users, it’s easy to say that the social networkers have figured out what the librarians haven’t: a way to make metadata work in widely distributed and heretofore disconnected content collections.

Easy, but wrong: folksonomies are clearly compelling, supporting a serendipitous form of browsing that can be quite useful. But they don’t support searching and other types of browsing nearly as well as tags from controlled vocabularies applied by professionals. Folksonomies aren’t likely to organically arrive at preferred terms for concepts, or even evolve synonymous clusters. They’re highly unlikely to develop beyond flat lists and accrue the broader and narrower term relationships that we see in thesauri.
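Rosenfeld’s point about preferred terms and broader/narrower relationships can be made concrete with a small Python sketch. Everything in it is an assumption made for the example: the entries, the tag variants, and the tiny hand-built thesaurus are invented, and a real controlled vocabulary would be far larger and maintained by people, which is exactly the cost being argued over.

```python
# Hypothetical data throughout: entries, tags, and thesaurus are illustrative only.
tagged = [
    ("entry-1", "nyc"),
    ("entry-2", "new_york"),
    ("entry-3", "newyork"),
    ("entry-4", "missouri"),
]

# Hand-built thesaurus: variant spelling -> preferred term.
PREFERRED = {"nyc": "new_york", "newyork": "new_york"}

# Broader-term relationships: narrower term -> broader term.
BROADER = {"new_york": "united_states", "missouri": "united_states"}

def flat_search(tagged, tag):
    """Folksonomy-style lookup: exact string match in a flat namespace."""
    return [entry for entry, t in tagged if t == tag]

def normalize(tag):
    """Map a tag to its preferred term, or leave it unchanged if unknown."""
    return PREFERRED.get(tag, tag)

def controlled_search(tagged, tag):
    """Match on preferred terms, and include entries one level narrower than the query."""
    wanted = normalize(tag)
    hits = []
    for entry, t in tagged:
        term = normalize(t)
        if term == wanted or BROADER.get(term) == wanted:
            hits.append(entry)
    return hits

flat_hits = flat_search(tagged, "nyc")
controlled_hits = controlled_search(tagged, "nyc")
broader_hits = controlled_search(tagged, "united_states")
```

The flat lookup for "nyc" finds only the entry tagged with that exact string, while the controlled lookup gathers all three spellings and a query on the broader term reaches every entry. The catch is that someone had to write the two dictionaries first.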

Returning to Kim Cameron’s sixth law, which states there must be an unambiguous and non-corruptible interface between the user and the technology, we could also apply this to metadata: the costs to support controlled vocabularies/ontologies and uncontrolled vocabularies/folksonomies are the same. At some point a human has to intervene with the technology to refine and validate the result. With ontologies, the intervention occurs before the data is captured; with folksonomies, the intervention occurs with each search.

I put my money on the ‘refine and validate just once’ solution.

Isgood but…is good?

Though Rosenfeld and most others I’ve listed here support folksonomy efforts, some with caveats, others unreservedly, as just one of a variety of technologies that help people find what they need, I tend to be of the camp that believes focusing on easy solutions will make it more difficult to get acceptance for ‘better’ solutions that may require a little more effort. This puts me in the exact **opposite camp of Clay Shirky.

Clay believes that ultimately ontologies will fall to folksonomies, as the latter gain rapid acceptance because of their low cost and ease of use; I believe that ultimately interest in folksonomies will go the way of most memes, in that they’re fun to play with, but eventually we want something that won’t splinter, crack, and stumble the very first day it’s released.

What we don’t need are more cheap solutions, and ultimately, I find that Technorati Tags are a ‘cheap’ solution, though a compelling one, and useful for generating conversation if for no other reason. And I don’t want to denigrate Technorati’s efforts with this, because I feel in the end Technorati is going to play a major role in our semantic efforts. Still, no matter how many tricks you play with something like tags, you can only pull out as much ‘meaning’ as you put into them.

What we need, instead, is a way of making richer solutions more accessible to people, and in that, I do agree with Clay–lower the barrier of participation. In the email list for the Identity Commons effort, the members talked about how the URL which serves as identifier within LID is also a URI, which forms the basis for XRIs, and how the group should look at ways of achieving synergy with this new effort. Rather than being disdainful, they sought to turn LID into an opportunity.

This type of attitude is what we need more of–how can we make the richer, more robust solutions available to folks like you and me? In some ways, FOAF, the ontology used to identify ourselves and who we know, is an example of this because it’s very accessible to ‘regular folk’; yet it’s also based on a robust and highly interchangeable data model, which means it could be easily merged with other data that shares the same identity.

One hell of a ride

Clay states that whether we’re supportive of folksonomies or not, they’re going to happen–we are in a kayak floating along a river of change:

It doesn’t matter whether we “accept” folksonomies, because we’re not going to be given that choice. The mass amateurization of publishing means the mass amateurization of cataloging is a forced move. I think Liz’s examination of the ways that folksonomies are inferior to other cataloging methods is vital, not because we’ll get to choose whether folksonomies spread, but because we might be able to affect how they spread, by identifying ways of improving them as we go.

To put this metaphorically, we are not driving a car, with gas, brakes, reverse and a lot of choice as to route. We are steering a kayak, pushed rapidly and monotonically down a route determined by the environment. We have a (very small) degree of control over our course in this particular stretch of river, and that control does not extend to being able to reverse, stop, or even significantly alter the direction we’re moving in.

I consider the difference between the ‘web’ and the ‘semantic web’ to be one based on ‘meaning’ alone, not on toys and attachments. If my opinion holds true, is the transformation of the web to the semantic web equivalent to a ride in a kayak? Pulled along by forces with little control over direction and speed?

I will concede to Clay the challenging, swift nature of the transport, but argue that only fools would put themselves into a narrow sliver of wood, hide, or plastic on a raging river without training, trusting to fate to ensure they don’t end up smashed, bloodied, and drowned. And it’s equally foolish to believe that we can, somehow, with the right use of technology, exponentially derive complex meaning out of what is, essentially, flat data.

I agree with Clay that the semantic web is going to be built ‘by the people’, but it won’t be built on chaos. In other words, 100 monkeys typing long enough will NOT write Shakespeare; nor will 100 million people randomly forming associations create the semantic web.

* No, enclosured is not a real word, but it should be, because it describes the effect better than ‘enclosed’.

** Of ontologies, Clay writes …don’t get me started, the suckiness of ontology is going to be my ETech talk this year…, which is probably one reason my own proposal, which is diametrically opposite to Clay’s talk, was not accepted. Well that and I mentioned the ‘p’ word.

Archived, with comments, at the Wayback Machine
