Categories
Copyright RDF

The little CC license that could, or when technology is all busted up

Recovered from the Wayback Machine.

Phil Ringnalda points to the new Yahoo Creative Commons search engine and notices that because the engine is relying purely on links to CC licenses to pull out content that is supposedly licensed as CC, there is going to be a lot of confusion related to what is, or is not, CC licensed.

An issue with CC has always been how to attach CC license information in such a way that automated processes could work with it. The solution has been to use RDF/XML embedded within HTML comments to indicate what is licensed on the page. However, this is kludgy and doesn’t validate within XHTML, so people are dropping it and just including the link to the specific license. More, even if they include the RDF/XML, they do so in such a way that it looks like everything in the page is under the specific license–HTML, writing, CSS, photos, whatever.

In other words, they take the rich possibilities inherent with using RDF, and dumb them down until they’re equivalent to the link.
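To make the limitation concrete, here is a minimal sketch of link-only license detection. It is not how any particular search engine works; the class name, the function name, and the sample page are all hypothetical, and the only assumption is that a crawler treats any link to a creativecommons.org license URL as a license claim.

```python
from html.parser import HTMLParser

# Assumption: license links are recognized purely by this URL prefix.
CC_PREFIX = "http://creativecommons.org/licenses/"


class LicenseLinkFinder(HTMLParser):
    """Collects every anchor href that points at a Creative Commons license."""

    def __init__(self):
        super().__init__()
        self.license_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(CC_PREFIX):
            self.license_links.append(href)


def find_license_links(html):
    """Return the license URLs linked from a page (hypothetical helper)."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return finder.license_links


# A made-up page: one third-party photo, some commentary, one license link.
page = """
<html><body>
<img src="http://example.org/photos/orchid.jpg" alt="someone else's photo">
<p>Commentary written by the page owner.</p>
<a href="http://creativecommons.org/licenses/by-nc/2.0/">Some rights reserved</a>
</body></html>
"""

links = find_license_links(page)
print(links)
```

The result is a list of license URLs and nothing more. Nothing in it says whether the license covers the photo, the commentary, or the markup, which is exactly the information a link alone cannot carry.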

Phil then pointed out that Yahoo releasing this search that just looks for links to the license in a document, and doing so without any legal disclaimers, warnings, or asides, is about the same as somebody accidentally putting a GPL license on the next version of Windows. In other words: it’s a really dumb move:

But if I was the Yahoo! lawyer who vetted their Creative Commons search, and let it loose without any disclaimer that “Yahoo! makes no assertion about what, if any, content in these results is actually offered under a Creative Commons license” I’d be hanging my head in shame.

To make matters worse, in the associated FAQ for the new search is the following:

This search engine helps you quickly find those authors and the work they have marked as free to use with only “some rights reserved.” If you respect the rights they have reserved (which will be clearly marked, as you’ll see) then you can use the work without having to contact them and ask. In some cases, you may even find work in the public domain — that is, free for any use with “no rights reserved.”

Yup. I think this is a case for the new Corante legal weblog.

I tried the search with my weblog’s name, and found one interesting result: the bbintroducingtagback tagback in Technorati. It seems that Technorati has linked to one of the CC licenses that allows non-commercial use. But used in the way it is, it implies that all the material in the page is licensed this way. Wait a second, though: that’s my photo in the page, pulled in from Technorati via flickr. I don’t license my work as CC–it’s still too damn vague a licensing, usually applied badly (as we’re seeing now).

(Marius, what do you think about that? And this picture is still too cute for words.)

Phil calls this accidental, by-link-association form of CC licensing viral, and viral it is, indeed; through bad implementations of a vague license, I may, by allowing my photo to be copied (while holding all rights), have lost rights to that photo by implication and effect. At a minimum, who holds the copyright on the photo has been lost when it filters through both the Technorati tag and the search engine results.

I’ve been in a discussion about the CC license and the issue of how to record more specific information with Mike Linksvayer (who is on the staff at cc) at Practical RDF. I brought up the issue of lack of precision in the licensing and Mike mentioned that one approach CC is looking at is to use, again, the ‘rel’ attribute as a way of marking metadata. But this can only go so far — it’s really not much more than just linking to the license and assuming this implies usage.

(And, frankly, our use of ‘rel’ is becoming a bit of a stretch–we’re trying to stuff all the meaning in the internet in one little bitty attribute.)

The approach I’m using for complex metadata (which is what CC is) in Wordform is to generate a separate RDF/XML feed that explicitly states which element is licensed, which isn’t within a page, and exactly how the licensed element can be used (among other metadata). I link to this page through a LINK element in the header, as many of you do with auto-discovery of feeds right now. However, Mike’s response to this was:

A separate RDF file is a nonstarter for CC. After selecting a license a user gets a block of HTML to put in their web page. That block happens to include RDF (unfortunately embedded in comments). Users don’t have to know or think about metadata. If we need to explain to them that you need to create a separate file, link to it in the head of the document, and by the way the separate file needs to contain an explicit URI in rdf:about … forget about it.

But if we don’t explain to people how all this works, and provide a way for folks to be more precise, problems like the Yahoo CC search and the Technorati tag page are going to continue. By ‘protecting’ people from the technology, we are, in effect, doing more to harm them than help them.

What we should be doing is providing the tools to allow people to use rich metadata, richly; not make assumptions that “people can’t deal with it” and then dumb it down accordingly. We should be helping people understand how to use something like the CC license wisely and effectively–using clear, non-technical language to explain how all the bits work–not depend on technology to somehow ‘guess’ what a person wants and act accordingly.

Because as we’ve seen, technology almost invariably guesses wrong.

Categories
RDF Semantics

Dumbing down of America

A recent spate of postings at Planet RDF revolve around a two-day session on SPARQL that’s coming up in Europe. It was reading through these that something I noticed recently became more apparent: that most of the semantic web effort, or the effort that’s involved with RDF, is happening in Europe (with some side trips into Canada when the weather is good).

In the United States, on the other hand, most of the discussion is about folksonomies. We are a nation filled with people raising excited fingers from both coasts to point at delicious, flickr, Technorati, and Wikipedia; matched with solemn assurances that these new ‘bottom up’ systems are going to kick the butt of ‘formal’ ontologies.

Leaving aside whether one would want a doctor who learned biology the ‘folksonomic’ way, is there a geographical split to the direction of study for the semantic web? Are ‘folksonomies’ becoming the fast food of semantics–the McDonald’s of taxonomies? If so, then are we in the US going to end up with obese vocabularies, barely able to clasp the belt of understanding around their middles?

And I want to know why events like these never happen in St. Louis. Is it a European/Canadian plot to slowly dumb down America until they can quietly invade us one day, and we don’t even know it until a tag appears in Technorati labeled “AllYourMetaBelongToUs”?

All I can say is I didn’t vote for him!

As for having a meeting here, we have beer, too. Good beer. In fact, Budweiser is located her…

As for having a meeting here, we have wine, too. Good wine. Stomped by only the finest squirrel and beaver.

Categories
RDF

Update: Yahoo search

I had made an assumption that Yahoo Search was using the RDF/XML embedded with the CC license information to build its search results; Mike Linksvayer, though, was kind enough to clarify in comments that the company is using the CC license links, only, to capture this information.

This is disappointing, as I feel that there is more about the CC licensed objects that Yahoo could provide and doesn’t because it’s only after the links. That’s about the same as running a mine for rubies and tossing aside the diamonds you find.

Mike also mentioned the use of RDF-A to bypass problems with embedded RDF/XML. Trying to define yet another new syntax when there’s an option already available doesn’t make sense. The RDF/XML Syntax document stresses the use of <LINK> for linking to a separate RDF/XML document with whatever metadata is defined for the resource. This is a good approach, and I’m not sure why folks are resistant to this. It’s not as if the extra documents will take up a lot of space; for dynamic systems, such as many of the ones we’re using today for weblogging, commerce, and so on, the document can be generated on demand.
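As a rough illustration of the consuming side of the LINK approach, the sketch below finds LINK elements that advertise an RDF/XML document and resolves them against the page address. The `rel="meta"` value, the file path, and the helper names are assumptions made for the example; the only thing the code actually keys on is the `application/rdf+xml` media type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataLinkFinder(HTMLParser):
    """Collects hrefs of <link> elements typed as RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.metadata_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <LINK> works too.
        if tag != "link":
            return
        attributes = dict(attrs)
        media_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if media_type == "application/rdf+xml" and href:
            # Resolve relative hrefs against the page's own URL.
            self.metadata_urls.append(urljoin(self.base_url, href))


def find_metadata(html, base_url):
    """Return absolute URLs of linked RDF/XML documents (hypothetical helper)."""
    finder = MetadataLinkFinder(base_url)
    finder.feed(html)
    return finder.metadata_urls


# Made-up page and paths, for illustration only.
page = """<html><head>
<title>A post</title>
<link rel="meta" type="application/rdf+xml" href="/metadata/post-42.rdf">
</head><body>...</body></html>"""

metadata_urls = find_metadata(page, "http://example.org/weblog/post-42")
print(metadata_urls)
```

A bot that does this gets a pointer to a full metadata document per page, rather than a bare license link, and the page itself stays valid markup.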

A scenario for use with CC could be that when the CC license is generated, the person is told to create a file and copy in the generated RDF/XML. Then to take this LINK and add it to the header of the page. If they also want to add an icon and a link to the license in human readable format, then copy this link and put it into the page.

Is this that much more complicated for the people? Yes and No.

No in that people who host their own sites could probably do this without much problem; especially if tools start providing ways of editing pages on the site. However, for hosted sites, this is a problem – and will continue to be a limitation of these types of sites. Now, a smart hosted site will be one that eventually gets that they need to provide some mechanism to allow for this type of activity. But until then, yes this is a limitation.

But CC could solve this for the hosted sites, by hosting the license files themselves and giving the person the link to the file to put into their document. Even with a weblogging tool, you could do this just by embedding a tag for the individual file name as the name of the metadata file into the header.

Eventually, we need ways of merging data for many uses into these pages. One way would be to provide the RDF/XML document URI to these tools, and the tools would then read in the existing RDF/XML and add the additional statements. Another would be for tools to provide a way of reading in a block of RDF/XML, pull out the individual statements, and then merge into those that already exist.

There’s code everywhere to do this type of data merging, and best of all: it’s RDF/XML, which means you don’t have to worry about namespaces and collision.
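A minimal sketch of why this kind of merging is cheap: if each statement is treated as a (subject, predicate, object) triple whose subject and predicate are full URIs, combining two sets of statements is a set union. The triples, the photo URL, and the `merge` helper below are invented for the example, real code would use an RDF library to parse the RDF/XML first, and the sketch ignores blank nodes, which need extra care when graphs are merged.

```python
# Namespace URIs used to build predicate names for the example.
CC = "http://web.resource.org/cc/"
DC = "http://purl.org/dc/elements/1.1/"

PHOTO = "http://example.org/photos/orchid.jpg"  # hypothetical resource

# Statements that might come from a generated license document.
license_statements = {
    (PHOTO, CC + "license", "http://creativecommons.org/licenses/by-nc/2.0/"),
    (PHOTO, DC + "creator", "Example Author"),
}

# Statements that might come from some other tool describing the same photo.
other_statements = {
    (PHOTO, DC + "creator", "Example Author"),  # same statement, stated twice
    (PHOTO, DC + "title", "Orchid roots"),
}


def merge(*graphs):
    """Union any number of triple collections into one set of statements."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


merged = merge(license_statements, other_statements)
for statement in sorted(merged):
    print(statement)
```

Because the predicates are full URIs, a `creator` from one vocabulary cannot be confused with a `creator` from another, and a statement made twice simply collapses into one.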

All we would need, then, is nice search bots that grab this and pull all this info into a nicely consumable spot. With an API that returns individual data query results, or RDF/XML.

Yahoo! Yahoo! *knock knock knock* Opportunities knocking. Don’t blow it.

Categories
Copyright RDF

Yahoo CC Search

Yahoo released the beta of the Yahoo Creative Commons Search allowing us to search among CC licensed material. Since CC licenses are recorded using a standardized meta language and syntax *cough* RDF/XML *cough* it’s more a matter of just checking for this information in the process of their normal operations.

There’s still a lot missing. First of all, the CC license tends to be added to a page, and this can get associated with the keyword. For instance, search on “Shelley Powers” and you won’t find CC material by me, but CC material that includes something about me. (Unless it’s a publication that is CC.) Also, there’s no way to differentiate images, video, and writing with this, though with Yahoo’s separation of media in the regular search engine, this is probably a temporary issue.

Tag this under “RDF rules”.

Categories
RDF Semantics

Accidental smarts

Recovered from the Wayback Machine.

Responding to the recent discussions about folksonomies and tags, AKMA was forced to make a confession: he is tags challenged:

“…I should pause to say that I’m not a natural for “tags.” I’ve hardly ever used deli.cio.us tags. I didn’t begin tagging my pictures for flickr for ages; even now I’m liable to tag pretty cursorily (no, I don’t mean “with a computer pointing device”). I don’t use categories in my own Moveable Type posts, although the Seabury site that used to be (and may someday live again) integrated categories into its architectural rationale. And once I started thinking about tags, I felt chagrined; the folksonomized Web that David envisioned, that Kevin and Stewart and all had begun to implement, presents such a tremendous opportunity — but here I was, too lazy to tag. I had worked on my to care about valid mark-up, and I emphasized this aspect of the Seabury site. But I just wasn’t sure I had the determination to add Technorati tags to my posts. You’re too polite to complain, but I get long-winded — how would I tag my monologues without repeating most of the words? I was going to be a stick between the spokes of the organic semantic Web, when my friends were building and turning the wheels.”

Knowing AKMA for a few years now, lazy is not a word I would have used to describe him. Dave Winer responded, comparing categories/tags to the old PIMs, writing:

Users got all excited about them too, and set them up imagining how great it was going to be to finally have an orderly life. They happily entered appointments, until they spaced out or got lazy and didn’t enter one. All it takes is one for the excitement to turn to guilt…The category stuff works the same way. At first I delighted in the ease of routing stuff to categories. Eventually I would only route to one or two categories, and then I stopped altogether. Not because it wasn’t easy enough, but because the guilt had taken over.

Knowing Dave for a few years now, feeling guilty is not a phrase I would have associated with him.

All joshing aside, among those that responded to both writings, Dan Bricklin wrote that Instead of making you feel bad for “only” doing 99%, a well designed system makes you feel good for doing 1% and proposed …another design criteria for a type of successful system: Guiltlessness. Ross Mayfield had an interesting take saying that guilt is good:

Perhaps a system isn’t social if it only has first order commons dilemmas (governing the resource) and doesn’t support management of the second order (governing each other). When a group explicitly forms around a tag, guilt may come into play (for example, shame on you people for not posting really ugly and fairly pointless parking lot photos!), and that’s not necessarily a bad thing.

Though both Dan Bricklin and Ross Mayfield had excellent responses to AKMA’s and Dave’s writing, I kept returning to AKMA’s statement of I’m not a natural for “tags.” I don’t think that AKMA is lazy when it comes to tags and categories, as much as he doesn’t see the magic that will bring all the pieces together, and he later expanded on his earlier writing, saying something to that effect. In that post he agreed with the limitations of library systems, based on controlled vocabularies, and also agreed that a bottom up folks-based approach might be better, but, as he wrote, …we haven’t turned up the device that’ll kick that engine into gear, not yet.

It can cook your pop tarts without burning the edges

If you’ve ever watched the movie “Twister”, one of the better lines in the movie was delivered by Bill Paxton’s girlfriend, the reproductive therapist Melissa Reeves. When faced with the odd barrel shaped device with all sorts of gizmos on it, surrounded by beaming, happy strange people, she responded with, “Wow, it is great.” After staring at it a moment longer, she gets a pained expression on her face and asks, “What is it?”

Give that woman a weblog, she’s discovered the secret of meme!

Seriously, with the mixed bag that is weblogging you have people who see a new innovation and go, “finally!” while others look at the same thing and go: “Will it make the comment spam go away. I don’t want to hear about it unless it makes the comment spam go away. Where is the ‘Kill all spammers’ switch?” The rest fall somewhere in-between.

When I read AKMA’s statement, my first thought was, what do we want this engine to do? In other words, what does each of us expect to get out of tags and folksonomies?

Returning to David Weinberger’s after dinner speech, again, another item that caught my attention was how David perceived tags. Tags, to him, were more than a way of routing around the Dewey Decimal System–they were a way for him to keep up with as many writings as possible on specific subjects. So, for instance, David subscribes to the taxonomy tag at delicious, and with this, he’s able to see what new items pop up under this designation.

I thought this was a very compelling reason for tags–enough so that I subscribed to feeds for several tags in delicious. Through these I was able to find several new resources, including some referenced in this writing. And I found them again, and again, and again, and…

Déjà vu all over again

Tags by themselves aren’t useful for anything; it’s how they’re used in support of other services that makes them more interesting. For instance, delicious (or more properly del.icio.us) provides a bookmarking service that happens to use tags. If you’re interested in a site or a specific page, you add it to your bookmarks, annotating it with various tags. Your bookmarks are public, so anyone interested in those tags is also, then, notified about the site. A fascinating study of distributed interest, and seemingly a great way of discovering gems hidden in the shadows of the online giants, since any link has its moment in the sun, so to speak.

However, the very nature of the concept of ‘shared bookmarks’ means that the more successful a writing, the lower the signal to noise ratio you get. For instance, in the folksonomy category, a new, popular piece can effectively wipe out any other entry from the ‘top page’ because so many people add the site to their bookmarks. And if you subscribe to the feed, you can be treated to link after link to the same resource, added by different people. Additionally, if you’re like me and subscribed to many similar topics, you’re treated to the same link showing up in other feeds. After a while, you don’t think the Guardian’s new flickr article is all that particularly interesting.
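One way a feed reader could soften the repetition is to collapse entries by URL and show a count instead. This is only a sketch under assumptions: the feed is represented as a list of (url, user) pairs, and the URLs, user names, and `collapse_duplicates` helper are invented for the example rather than taken from any bookmarking service.

```python
def collapse_duplicates(entries):
    """Collapse (url, user) feed entries into (url, count) pairs.

    The first appearance of each URL decides its position in the result.
    """
    counts = {}  # dicts keep insertion order in Python 3.7+
    for url, _user in entries:
        counts[url] = counts.get(url, 0) + 1
    return list(counts.items())


# A made-up tag feed in which one popular article crowds out everything else.
feed = [
    ("http://example.org/popular-flickr-article", "alice"),
    ("http://example.org/popular-flickr-article", "bob"),
    ("http://example.org/quiet-gem", "carol"),
    ("http://example.org/popular-flickr-article", "dave"),
]

collapsed = collapse_duplicates(feed)
for url, count in collapsed:
    print(count, url)
```

The popular link still appears, once, with its count, and the quieter entry is no longer buried under the repeats.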

In addition, there is nothing in delicious to indicate whether a URL is fresh or not, as I found when a link to a five year old article appeared under the RDF tag. This could be considered more of a perk than a problem– especially if it provides visibility to older works that may no longer be on today’s radar but are still valid. However, if you’re expecting fresh content, older links could clash with your expectations and may even decrease the value of the feed.

Another issue having to do with delicious really has to do with all of the tag-based services and that’s agreement as to which tag gets attached to what item. And who decides what is the ‘right’ use of a tag, in a system with few gatekeepers?

Fleas on a Dog’s Back

del.icio.us, and Furl, are bookmarking services, used by people to publicly share their reading lists. How the lists are organized is through tags, and this is what connects these systems with other tag-based systems, such as Flickr and Technorati. These latter, though, are primarily focused on us tagging our own work; I can add ‘folksonomy’ as a tag to this post and it would show up in Technorati tags under Folksonomy. In addition, if I used folksonomy with one of my existing orchid photos in Flickr, it would also show up.

Though primarily of interest to individuals, there’s nothing inherent with either Flickr or Technorati tags that prevents others from adding tags to those we use. For instance, consider how tags are used in Flickr. Most people who use Flickr are more interested in a way of providing access to photos for friends and family, and tag the photos accordingly. Their interest isn’t in how the public perceives a tag, but how they perceive it. But since Flickr provides the ability for others to tag photos, you can, as Stewart Butterfield said in an excellent interview by Richard Koman, …upload a photo and go to sleep, and in the morning there are tags all over it.

I don’t have a large network at flickr, but since I’ve added “folksonomy” as a tag for my one orchid photo, others have since added “tag”, “tags”, and “SillyTag”. Now, I added “folksonomy” as a tag for the photo because it shows plant roots overlying the images of the plants themselves, and symbolizes the ‘grassroots’ nature of folksonomies. However, others added tag and tags probably because these are terms related to folksonomies, and most likely because they were indulging in a fit of playful mischief. We can safely assume this is the impetus behind “SillyTag”.

Now I can delete these other tags, but I won’t because I know the context of their usage, and in this context, they are meaningful to me, and to those others who share the history in regards to this example. Of course, to someone who doesn’t know me, or the others, all of these tags probably seem very puzzling. More, if this person searched on “folksonomy” in Flickr or Technorati Tags, they are going to be confused about this orchid photo showing up in the midst of conference snapshots and diagrams written on whiteboards.

One term, one understanding, and many different but legitimate reasons for attaching the tag, hence a valid folksonomy example; but to someone coming in without the proper context, I’ve just decreased the signal to noise ratio of this tag.

Still many are willing to accept the seeming chaos of expectations about tags, in favor of their dynamic and open capabilities. Though aware of the challenges associated with tags, in his in-depth essay on folksonomies and tags, Adam Mathes prefers to focus on the positive benefits of getting users involved in defining metadata:

A folksonomy represents simultaneously some of the best and worst in the organization of information. Its uncontrolled nature is fundamentally chaotic, suffers from problems of imprecision and ambiguity that well developed controlled vocabularies and name authorities effectively ameliorate. Conversely, systems employing free-‍form tagging that are encouraging users to organize information in their own ways are supremely responsive to user needs and vocabularies, and involve the users of information actively in the organizational system. Overall, transforming the creation of explicit metadata for resources from an isolated, professional activity into a shared, communicative activity by users is an important development that should be explored and considered for future systems development.

Following on this empowerment theme, Nick W from ThreadWatch writes Simply put, tags are important because they allow your users to generate content and classify that content in their own unique way.

I did it m-y-y-y-y-y w-a-a-a-a-ay-!

In the original discussions related to tags, Allan Jenkins wrote about the issue of tags and weblogger discipline:

I keep running up against two issues.

First, since tags are self-applied by tens of thousands of Flickr users and other bloggers, I suspect we are bound to end up with common categories too large to be useful (Parties, Dogs, NewYork) and, because no one need agree to any one taxonomy, a plethora of tags that refer to the same thing (insulinpump, insulin_pump, insulininjectiondevice).

Second (but related) is how we bloggers can discipline ourselves to apply tags judiciously; moreover, how will and should tags affect how we design blogs. For example, Technorati already interprets Typepad categories as tags. Does that mean Typepad bloggers should drastically expand their category lists? It would seem to be a good tagging idea, but it would also render “categories” fairly worthless.
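The spelling-variant half of the problem described above can be sketched in a few lines: fold case and punctuation so that variants land in the same bucket. The `normalize` and `group_variants` helpers and the normalization rule are assumptions for illustration, not something any tag service is known to do.

```python
import re
from collections import defaultdict


def normalize(tag):
    """Lowercase a tag and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def group_variants(tags):
    """Group raw tags by their normalized form."""
    groups = defaultdict(set)
    for tag in tags:
        groups[normalize(tag)].add(tag)
    return dict(groups)


tags = ["insulinpump", "insulin_pump", "Insulin-Pump", "insulininjectiondevice"]
groups = group_variants(tags)
for key, variants in sorted(groups.items()):
    print(key, sorted(variants))
```

The underscore and hyphen variants collapse into one group, but `insulininjectiondevice` stays on its own. True synonyms need human agreement or a mapping table, which is the part no amount of string folding solves.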

Rather than muck around with my categories, most of which would definitely generate ‘noise’ in tagging, I tried something new: I introduced a tags-based system as a way of grouping discussion about a topic and doing away with trackback. I proposed this approach for two reasons; the first is that trackback is now being badly spammed, and shopping for alternatives seems like a feasible activity, especially considering that we never really used trackback that accurately in the first place; the second is that tags can loosely join several separate resources around a specific topic, and to me that’s the original intent of trackback.

Instead of trackbacks, I said, we’ll create ‘tagbacks’ and then use Technorati, and other tags services, as a way of tracking related information about the post/discussion.

To demonstrate, in the post covering this new concept I created a unique tag called bbintroducingtagback, based on the title of the post, but also covering the basic concept of the discussion surrounding the post: it was about me introducing a new tags-based discussion thread tracking system called tagback. I then pointed a couple of photos at the topic, by using the same tag in Flickr, as well as some additional related material by creating bookmarks in delicious and Furl. Others picked up on the concept, adding new entries to delicious, as well as using the Technorati tag.

Though there is interest in this idea, others were concerned about the use of tags in this way. For instance, if we were to create individual tags for individual posts, we would be looking at running into tag name collision eventually. Even if I were to ‘namespace’ my tags, as I have by placing a ‘bb’ in front of the name, it’s still very conceivable that BoingBoing, another ‘bb’ weblog, could define a tag equivalent to one I’ve created. So intermixed with post after post about technology and hiking, would be the odd post on copyright and sex.

However, the biggest concern expressed on the use of tags in this way is that this may violate the concepts behind tags. In comments, Hans Gerwitz wrote:

This feels like an abuse of tagging, in that you are programmatically generating tags that are far too specific to contribute to the ecosystem.

If I browse the tagonomy for trackbacks on del.icio.us I find that blog, spam, and even politics are related. The bbintroducingtagback tag is not likely to ever bubble up to relevant status, though.

Moreover, these “manufactured tags” are never going to be stumbled upon by someone else tagging their own content; they will never contribute to the organic self-organizing soup of tagspace.

So, if these tags don’t play with the other tags, what purpose are they serving?

Simon Willison wrote:

Unfortunately, the very nature of tags is that they are designed to be shared rather than globally unique, which seems to make the concepts incompatible.

In Clay Shirky’s first post on folksonomies, he addressed concerns about synonym control and precision, writing:

Lack of precision is a problem, though a function of user behavior, not the tags themselves. del.icio.us allows both hierarchical tags, of the weapon/lance form, as well as compounds, as with SocialSoftware. So the issue isn’t one of software but of user behavior. As David pointed out, users are becoming savvier about 2+ word searches, and I expect folksonomies to begin using tags as container categories or compounds with increasing frequency.

In response to Clay’s writing, Thomas Vander Wal, the originator of the term folksonomy wrote:

The narrow-folksonomy, where one or few users supply the tags for information, such as Flickr, does not supply power tags as easily. One or few people tagging one relatively narrowly distributed item makes normalizing more difficult to employ an tool that aggregates terms. This situation seems to require a tool up front that prompts the individuals creating the tags to add other, possibly, related tags to enhance the findability of the item. This could be a tool that pops up as the user is entering their tags that asks, “I see you entered mac do you want to add fruit, computer, artist, raincoat, macintosh, apple, friend, designer, hamburger, cosmetics, retail, daddy tag(s)?”

Since this time Flickr has added the ability for friends and family (and possibly contacts) to add tags, which gives Flickr a broader folksonomy. But, the focus point is still one object that is being tagged, where as del.icio.us has many people tagging one object. The broad-folksonomy is where much of the social benefit can be derived as synonyms and cross-discipline and cross-cultural vocabularies can be discovered. Flickr has an advantage in providing the individual the means to tag objects, which makes it easier for the object to get found.

According to both Shirky and Vander Wal, then, a compound tag consisting of terms, such as ‘bb introducing tag back’ is not only acceptable, it’s to be encouraged because it adds to the ability to find the item so tagged. And since others can use other terms to tag the item, it’s part of a broader folksonomy that can then be traced back to the item through a query that combines these tags; or by using the specific tag, bbintroducingtagback, as an alias to a tag query, such as tag+tagback.
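The idea of a specific tag acting as an alias for a combined tag query can be sketched with a small inverted index. Everything here is hypothetical: the index contents, the item names, and the `query` helper are made up, and the `+` syntax is simply read as "items carrying all of these tags".

```python
# A toy inverted index: tag -> set of items carrying that tag.
index = {
    "tag": {"post-1", "post-2", "photo-9"},
    "tagback": {"post-1", "photo-9"},
    "bbintroducingtagback": {"post-1", "photo-9"},
}


def query(index, expression):
    """Return the items that carry every tag in a 'tag1+tag2' expression."""
    tags = expression.split("+")
    # A tag that was never used matches nothing.
    item_sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*item_sets)


combined = query(index, "tag+tagback")
alias = query(index, "bbintroducingtagback")
print(sorted(combined))
print(sorted(alias))
```

In this toy data the compound tag and the combined query return the same items, which is the sense in which one can stand in for the other. In real use the two would drift apart as other people tag the same items in their own ways.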

But this does highlight a problem with folksonomies and tags, and one that may be leading to AKMA’s, and others’, wariness of their usage: no one knows exactly what are the rules related to these objects and their aggregations. We’re making all this stuff up as we go along. Or, as the wise man said, “She who gets there first, wins”.

Path Cutters

I once wrote on an ingenious experiment in social-driven architecture, when the architects of a new building planted grass but did not put in sidewalks. Over time, paths were cut into the grass, and these paths were eventually cemented over. The premise behind the effort was that the people would determine the best, and most effective way to approach the building.

However, you don’t see this approach used elsewhere, and it isn’t just because builders are concerned about liability and access of the buildings while the paths are being trampled; it’s also in that these paths may not be optimum for all people. In fact, the paths may be optimum only for a certain segment of the population. For instance, men will more likely create more of an impression in the grass than women because of their heavier weight and stronger shoes; women may not attempt to tread in anything approaching unmarked grass because women’s shoes tend to be high heeled or less sturdy than men’s. Older people will also more likely follow even a hint of a trail over non-trail because it’s just plain easier, which means younger people will also dominate in the trail cutting. Finally, society as a whole frowns on marking paths into unmarked landscape, and the ones who are most likely going to cut the path are either early comers, who have no choice but to walk on unmarked grass; or people who don’t care, either about society, or about the appearance of the landscape.

Ultimately, in the end you have paths marked by young guys who don’t give a shit.

What could be said of the paths could also be said for the use of tags and folksonomies. Either people will search out and follow existing tag usage, or they’ll go their own way; if their way has enough appeal, they then become the path cutter. The aggregations that result in tags, then, may not arise from a true representation of the people forming these aggregations. In other words, rather than represent a collective intelligence, folksonomies may reflect the tag equivalent of young guys, who don’t give a shit.

About the dominance of path cutters

One of the first actual conflicts related to tags was Rebecca Blood’s issue with the use of the MLK tag with a possibly offensive photo. This represented a conflict in culture between the person who tagged the photo and Rebecca. Addressing issues of folksonomies, classification, and culture, Danah Boyd wrote:

What makes the tagging phenomenon utterly fascinating is that there is a collective action component to it. We love to see how people will come to common consensus on relevant terms. But part of what makes it valuable is that, right now, most of the people tagging things have some form of shared cultural understandings. The “in the know” groups using these services are very homogenous and often have shared values and thus offers valuable related links. This helps explain why Rebecca Blood is concerned about the MLK tags – they signify a lack of shared common ground. In tagging, quality is not just about ‘accuracy’, but about what cultural assumptions dominate.

Design questions then emerge. How do we deal with conflicting cultural norms as more people are engaged in the act of tagging? How useful are tags across cultures? Do we only gain value from collective-action tagging amongst groups of shared values? If so, how do we implement that? And what are the social consequences for explicitly delimiting culture online?

Since the use of tags is so new and folksonomies so limited, does this seem like a minor problem? A favorite example of controlled taxonomies, the Dewey Decimal System, is infamous for its Christian-dominated classification system, which both AKMA and David Weinberger discussed. There are 88 numbers reserved for Christian topics, while Jews and Moslems get 1 number each. The only reason the current system hasn’t failed by now is that topics can be added as ‘decimal numbers’ within the system.

Leaving aside the offensiveness of a system that is so biased against non-Christian faiths, the DDS is an inherently imprecise and misleading system. It’s somewhat like the current pagerank system within Google, with an implication that more numbers implies a greater authority.

And leaving aside cultural differences, how do folksonomies scale in a multi-language environment? One of the most popular Technorati Tag pages is the one for Weblog. If you access the page, the first thing you notice is that many of the entries are in Chinese. Providing support for different languages then becomes an issue with folksonomies that are intended to go beyond one country’s borders, or beyond a single language.

Even if tag systems follow Wikipedia’s use of different language domains, there are issues within the different languages that may make the formation of folksonomies from simple tags difficult or even impossible. Peter van Dijck wrote:

This post is about folksonomies (tagging), and how it might be really hard in Japanese. This is mostly speculation at this point, please comment or email me if you speak Japanese.

On the Sigia-L list, Fiona Bradley writes: “I don’t know Cantonese, but I have just started to learn Japanese and it’s not necessarily that the definitions of emotions are different, just that they are a lot more complex than in English once you factor in politeness levels and directness. And then there’s all the complications that arise from having many Kanji to choose from and many readings for each. If you’re just assigning a single word to a photo for instance, with no other words to define context, that may make the system quite difficult to search.

ButtUgly expanded on this:

There are three cases of “language collision” on tags (I’m using English and Finnish as an example only here).

1. The tag is different in English and in Finnish. For example “fishing” and “kalastus”. This should pose no problem, as the folksonomies grow on each of the tags independently.
2. The tag is the same in English and in language Finnish, but the meaning of the tag is different. In this case, the dominant mass of the users will “hijack” the tag.
3. The tag is the same in both languages, but the web pages will be in different languages. This is the case with things like trade marks (Apple, Macintosh, Nokia), or when people like to tag Finnish pages with English tags (like me: I use the word “blog” to mark any significant articles about blogs, regardless of the language). This reduces the usefulness of tags for people who do not understand Finnish.

There is also an additional tagging problem with languages such as Finnish: the same word can be conjugated and written in multiple ways, depending on the context. It is somewhat the same as the problem of using different words for the same concept, but it does make the number of potential strings increase three-fourfold.

The discussion has been centered around the cultural bias in tags. However, the very concept of folksonomies–spontaneous aggregations of keywords–is itself based on bias, formed from a specific culture, which tends to be male, western, with English as a native language.

Scary stuff, when you consider people, such as Jeff Jarvis, have become interested in tagging people:

It’s time to tag people.

This comes out of David Galbraith’s one-line bio and out of arguments I’ve made over time that the real future of classifieds is a generation beyond Craig and Monster: It’s a distributed world where resumes and jobs (or men seeking women and women seeking men) live anywhere and they are found and matched by some specialized successor to Google that uses tags (e.g., work status, education, location, languages…. or smoker, nonsmoker, single, divorced, great personality). In that world, in essence, people, ads, and content are all tagged.

Finding the love of your life through tagging. I would rather gnaw my own leg off than live in this world. And I think we can safely assume this isn’t the folksonomy engine that AKMA is seeking.

If we reject the idea of folksonomies bringing us closer to potential mates, the patterns reflected in the popular aggregation of words are seen as the next step in bringing us closer to artificial intelligence–a concept I call accidental smarts.

Accidental Smarts

I liked what Joshua Porter had to say on folksonomies and tags:

Tagging, by itself, does not a folksonomy make. It is possible, as Clay Shirky has pointed out, to tag things without creating a folksonomy. Tagging is simply an explicit activity that people can do to add metadata to content. It is common. Information architects tag things. I tag things on my computer. Every web page consists of dozens of tags (albeit with little meaning). In general, creating metadata includes a lot of tagging. Tagging as an activity is neither unique nor special.

Because we can aggregate tags, however, we can build a taxonomy out of them. More specifically, we can build a taxonomy out of the patterns we see in how people use tags. It is this act of aggregation, and not the act of tagging, that give folksonomies their power. Without aggregation, tags are just tags, with no meaning beyond the local meaning that each user gives to their own set.

It is the aggregation of tags that gives folksonomies their power. Yet aggregations of tags are based on certain understandings of language, and there is a great deal of imprecision with language–even after discounting culture and focusing primarily on English.
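To make "aggregation" a little more concrete, here is a minimal sketch in Python. The users and their tags are made up for illustration, and no real service is being modeled; the point is only that the pattern comes from counting across people, not from any one person's tags.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: the tags three made-up users gave the same bookmark.
user_tags = {
    "alice": ["folksonomy", "tags", "metadata"],
    "bob": ["tags", "tagging", "folksonomy"],
    "carol": ["tags", "graffiti"],
}

def aggregate(user_tags):
    """Count how many users applied each tag, and how many used each pair of tags together."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in user_tags.values():
        unique = sorted(set(tags))  # one vote per user per tag
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return tag_counts, pair_counts

tag_counts, pair_counts = aggregate(user_tags)
print(tag_counts.most_common(3))
print(pair_counts.most_common(3))
```

Notice that carol's "graffiti" sits in the same pile as everyone else's "tags". The counts say nothing about whether the three users meant the same thing by the word.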

A recent Slashdot article discussed the work of two scientists who are looking at Google results as a way of communicating with machines. According to the New Scientist article on their work:

Computers can learn the meaning of words simply by plugging into Google. The finding could bring forward the day that true artificial intelligence is developed.

The problems associated with natural language processing of the English language have to do with certain types of homonyms, words that sound and are spelled alike but have different meanings, and heteronyms, words that are spelled the same but sound different and have different meanings. For instance, “wind” could mean “wind the clock” or it could mean “a strong wind was blowing”; a “rider” could be associated with a horse, or with an insurance policy. The only way to differentiate these words is the context, and computers don’t handle context well.

The two scientists, Paul Vitanyi and Rudi Cilibrasi, have tapped into Google’s database, analyzed search results, and created what they call the Normalized Google Distance, or NGD. This is a factor that measures the logical distance between two words. The more closely associated the words, the smaller the number, all based on searching for the pair of words in Google.

Within this system, “hat” has a greater number of matches with “head” than with “banana”, so the context of this search supplies information that a hat usually has more to do with a head than a banana. Combine enough of these searches and eventually, it is supposed, a computer could search its way to smarts.
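For the curious, the arithmetic behind NGD is short. The sketch below uses the formula as Cilibrasi and Vitanyi published it: take the larger of the two single-word log counts, subtract the log of the pair count, and divide by the log of the index size minus the smaller single-word log count. Every hit count here, and the index size N, is an invented placeholder chosen for easy arithmetic, not a real Google number.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: pages containing each word alone
    fxy:    pages containing both words
    n:      assumed total number of pages indexed
    """
    if fxy == 0:
        return math.inf  # the words never appear together
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Invented placeholder counts, for illustration only.
N = 10_000_000_000
HITS = {"hat": 100_000_000, "head": 1_000_000_000, "banana": 10_000_000}
PAIR_HITS = {("hat", "head"): 50_000_000, ("hat", "banana"): 100_000}

hat_head = ngd(HITS["hat"], HITS["head"], PAIR_HITS[("hat", "head")], N)
hat_banana = ngd(HITS["hat"], HITS["banana"], PAIR_HITS[("hat", "banana")], N)
print(f"hat/head:   {hat_head:.2f}")
print(f"hat/banana: {hat_banana:.2f}")
```

Two words that always appear together come out at a distance of zero, and the distance grows as they co-occur less often. Nothing in the formula knows what a hat is; it only knows how often the strings share a page.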

Of course, there is more to language than associations of words returned from a Google search. For instance, if a computer is researching the context of a word, “bush”, it is more likely to assume that a bush is a human than a plant: there are 4.5 million matches for “bush” and “tree”, but 9.8 million matches for “bush” and “leader”. Not only that, but the “bush” is a “bad” (over 10 million) person, at that.

It is this attempt to extract semantics out of incidental associations, to get more meaning out of a system than we put into it, that is the basis for accidental smarts.

In addition to their own challenges with homonyms, folksonomies have an additional problem, and that’s synonyms: different words with the same or similar meanings. An example could be ‘cat’ and ‘kitty’. However, Clay Shirky rejected this as an issue early on, writing:

Synonym control is not as wonderful as is often supposed, because synonyms often aren’t. Even closely related terms like movies, films, flicks, and cinema cannot be trivially collapsed into a single word without loss of meaning, and of social context. (You’d rather have a Drain-O® colonic than spend an evening with people who care about cinema.) So the question of controlled vocabularies has a lot to do with the value gained vs. lost in such a collapse. I am predicting that, as with the earlier arc of knowledge management, the question of meaningful markup is going to move away from canonical and a priori to contextual and a posteriori value.

Interesting reading; however, this is nothing more than verbal sleight of hand. Clay is breezily questioning the concept of synonyms rather than directly facing the fact that synonyms are one of the many issues facing folksonomies, and he rejects any concerns with promises of the gold ring – the value gained.
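Here is a small sketch of why the synonym problem doesn't go away just because we decline to solve it. The bookmarks and the synonym table are hypothetical, invented only to show the trade-off.

```python
# Hypothetical bookmarks and the tags their owners chose.
bookmarks = {
    "photo-1": {"cat"},
    "photo-2": {"kitty"},
    "photo-3": {"cat", "kitten"},
}

# Hypothetical synonym table; deciding what belongs in it is the hard part.
SYNONYMS = {"kitty": "cat", "kitten": "cat"}

def search(tag, collapse=False):
    """Return the ids of bookmarks carrying the tag, optionally collapsing synonyms first."""
    wanted = SYNONYMS.get(tag, tag) if collapse else tag
    hits = []
    for item, tags in bookmarks.items():
        if collapse:
            tags = {SYNONYMS.get(t, t) for t in tags}
        if wanted in tags:
            hits.append(item)
    return sorted(hits)

print(search("kitty"))                 # exact tag match only
print(search("kitty", collapse=True))  # after collapsing synonyms
```

Without the table, a search for "kitty" misses two of the three photos. With it, the search finds all three, but "kitten" has been flattened into "cat", which is the loss of meaning Clay describes. Both costs are real; neither one cancels the other.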

The same issues that affect the precision of the Google searches will eventually affect the precision of folksonomy searches. Yet we think we can look at how people use tags, and from there map human thinking. In a new paper, Jakob Lodwick wrote:

According to Scientific American, in 1966 Ben-Ami Lipetz concluded that:

…breakthroughs in information retrieval would come when researchers gained a deeper understanding of how humans process information and then endowed machines with analogous capabilities.

Well, Ben was right, as you’ll soon see for yourself. By looking at how we tag photos on Flickr, we can understand how humans process information. Once we understand that, we can understand how to model it with computers, thereby creating better information retrieval systems.

What Ben was unable to predict all those years ago was that we will not only develop better information retreival systems, but also model our own brains on the lowest levels, and eventually create artificial intelligence.

One only has to glance at the use of ‘tag’ in Technorati itself to see the impact of our different interpretations, and contexts, of this simple, single word. In delicious, a ‘tag’ is usually associated with folksonomies, but not always; in Flickr, it’s almost always associated with graffiti, but not always. And in weblogs, through automatic translation of categories into tags, it’s associated with a recipe using asparagus. Perhaps before we teach computers how to think using folksonomies, we might want to take a closer look at how we think with folksonomies.

tag=the end

If there was an award for longest weblog post, I think this one might be a contender. And that’s after I finally deleted four sections because I could no longer manage the post within the weblog tool. A little bit of work and I have the start of a book.

My first reaction to tags and folksonomies was, “Oh what silly new thing have they come up with now.” Associating keywords with bookmarks in a publicly shared venue and hoping to extract the meaning of the universe from the cross-section of terms does stretch one’s credulity to the max.

However, when I was thinking about what I could use to replace trackbacks after I pulled support for them, the concept of tags was the first thing that came to mind; not at a macro level, with global significance, but at a small level, an intimate use of the concept.

Of course, not everyone agrees with my use of tags and folksonomy for something so specific and mundane, as demonstrated by some of the entries associated with the delicious bbintoducingtrackback tag (one playful, one less so). But even these entries demonstrate the benefits of the approach, as these imprecise uses of delicious tags impact more on the credibility of the entries than on my idea or me–a perk of tagback over trackback.

I said in the previous posting on tags and folksonomies that a million monkeys randomly typing were not going to write Shakespeare, and a hundred million people randomly assigning tags to objects were not going to create the semantic web. I still believe this–the semantic web will never arise spontaneously from random acts on random data. But I think that tags and folksonomies can be useful, all the same, if we stop jumping up and down about what they’ll do in the future and focus on making them work now.

Now if I can only convince AKMA to use tagbacks…