November 9th, 2006

Danny Ayers started an interesting thread about a talk that Mårten Mickos, CEO of MySQL, gave at the Web 2.0 Summit. According to Greg Linden, who was at the conference:

The idea is that "structured data should be open sourced", linked, and easily accessible. The idea is to do something like Google does for unstructured data (web documents) for structured data (database records).

This is not a new idea. People usually talk about this as querying heterogeneous distributed databases. The trick is matching up disparate data definitions and smoothing over bad data. And that is quite a trick.

As Danny and others mention, what Mickos is describing is a structured data format for interoperability (can we say 'RDF'?), plus a query method that works across heterogeneous data sources (can we say 'SPARQL'?). Ian Davis wrote:

This is where it became evident that there is a deep disconnect between the traditional database community and the semantic web community. Mårten’s response was rather vague, that this wasn’t as broad as the semantic web and that the semweb includes unstructured data so wasn’t appropriate.

What a shame and what a failure of the semantic web community if the CEO of MySQL AB cannot see how his vision for an interconnected web of data is the same as ours! We must try harder and demonstrate at all levels the value of the semantic web approach to people like Mårten. SWEO and SWIG will help, but the convincing arguments will come from the practical applications of the semantic web being developed to solve real world problems.

A big Amen, Ian.

Danny, in the interests of 'any data is better than no data', is willing to take Mickos' big mother MySQL database for a spin:

Ok, so how would you go about making a distributed RDBMS that might work on a global scale? Well you might want to start with keys that will work in such an environment. No need to invent any GUIDs, just reuse the web's ID field, URIs. What about table (relation) structure? There's obviously going to be a problem trying to create top down schema that could work in such a diverse environment as the world. So you need to break things down into a minimal form, i.e. binary relations, and allow them to be interconnected. How can you enable interlinking on such a scale? Identify the relations with URIs too. Keep going for 5 minutes and you've got RDF. You'd probably want a query language that worked against it too, and maybe even like it to look like SQL. Go on, call it SPARQL. Deploy these on HTTP (which is also based on URIs) and you've got a Web of Data, the Semantic Web.

Hee. Sneaky semantic web people.
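
To make Danny's recipe a bit more concrete, here is a minimal Python sketch of the idea: take one relational row, key it with a URI, and break it into binary relations (triples) whose relation names are URIs too. Everything named here is my own assumption for illustration: the `example.org` URIs, the column names, and the `row_to_triples` and `match` helpers. The `match` function only imitates the flavor of a single SPARQL triple pattern; it is not SPARQL.

```python
# Hypothetical identifiers, invented for this sketch.
PERSON = "http://example.org/people/beethoven"
VOCAB = "http://example.org/vocab#"


def row_to_triples(subject_uri, row, vocab_base):
    """Break one relational row into (subject, predicate, object) triples.

    Each column becomes a binary relation identified by a URI.
    NULL (None) columns are simply left out: no fact, no triple.
    """
    return {
        (subject_uri, vocab_base + column, value)
        for column, value in row.items()
        if value is not None
    }


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# One row from an imaginary "people" table.
row = {"name": "Ludwig van Beethoven", "born": 1770, "died": 1827, "spouse": None}
triples = row_to_triples(PERSON, row, VOCAB)

# Ask "what is this resource's name?" without knowing anything about a table.
names = match(triples, s=PERSON, p=VOCAB + "name")
for triple in names:
    print(triple)
```

Note that nothing in the triples refers back to a table or a schema. That is the point of the decomposition: each fact stands on its own and can be linked to from anywhere.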

Danny just scratched the surface of the issues associated with a global relational data store. One major difference between the relational model and RDF is that the relational model assumes data agreement before mapping; RDF assumes that data agreement will happen sometime, but isn't too terribly worried about it, because any data is welcome and we can use the data we have now while we work things through.

Comments
1
Phil - 4:39 am 11/10/2006

RDF assumes that data agreement will happen sometime, but isn't too terribly worried about it because any data is welcome, and we can use the data we have now while we work things through.

Naah, RDF is the Semantic Web, innit, and the Semantic Web is all about the big top-down everything-agreeing-with-everything-else thing, which is why it will never happen, ever. Ever. Sorry, no, not listening. La la la. Never, ever, ever.

Ahem. I am actually facing both ways on this one - having done a bit of work with EDI, way back, I know how important it is to get message formats nailed down at both ends before you start communicating, and I have found some SemWeb literature to be far too optimistic about the effort involved in doing the nailing-down. But the turtles-all-the-way-up sceptical position systematically confuses 'unsolved' with 'insoluble', and 'partially solved' with 'unsolved'. The project I've been working on for the last year is all about building logical representations of imprecise clusters of concepts, and then mapping one fuzzy set of concepts onto another. (And yes, it can be done.)

2
Danny - 12:18 pm 11/10/2006

Shelley - thanks.

Phil, I don't disagree, but do think there's still a role for naive optimism :-)
Case in point, not long ago I had a real eye-opener (which I'm likely to repeat all over the place until I have another…). I'd installed Longwell, the facetted browser, curious to see whether it'd be useful for my blog data. Longwell will eat any RDF files you dump in the appropriate directory. On a whim I collected a few random files from the web, including one about famous people. I'd glanced at the source and knew there was an entry for Beethoven. So I did a (plain text) search for the guy in Longwell. Sure enough it picked up his bio material. But in the same results I also had a blog post I'd forgotten about, pointing to some audio files of his symphonies. So ok, I might have got something similar from completely text-oriented data+tool. But both the blog post and the bio info were quite well linked to other resources, and I could have made the link between the two more explicit had I wanted, adding value to both entries. Dunno, it thrilled me…

3
Shelley - 9:32 pm 11/10/2006

Phil, I've worked with EDI for years, first with PDES and then with POSC. Hear you on the importance of data agreement and how difficult it is.

When you have heterogeneous relational databases, before you can throw the data together, you have to understand how the data is identified. Is it a dummy identifier? If so, is it character or number? Or is it an identifier made up of meaningful data, or even multiple fields of meaningful data concatenated?

You have to know this before you can join the data together, or you corrupt all the data.

That's where RDF changes things. The data identifier is defined, the syntax is understood, and there can never be corruption of data. You may have duplicates, such as Shelley Powers being identified by both http://weblog.burningbird.net and http://just.shelleypowers.com, but that causes no harm to the data. It's just facts added to the data store, which can eventually be mapped to each other.

The data always moves in a positive direction: always building, never taking away.

The semantic web community will have fancier ways to describe all of this, and most likely not think much of my poor efforts, but I have absolute trust in RDF: when two or more RDF data sources are combined, even without any prior arrangement, there is never a chance of losing data.
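
Here is a rough Python sketch of what I mean, under some loud assumptions: each RDF source is modeled as a plain set of (subject, predicate, object) triples, which ignores blank nodes and other subtleties; the `example.org` predicate and the `store_a` and `store_b` names are invented for illustration. The two subject URIs are the ones from my comment, and the FOAF `name` and OWL `sameAs` URIs are the real published ones.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
TOPIC = "http://example.org/vocab#topic"  # hypothetical predicate

# Two independent sources that never agreed on anything in advance.
store_a = {
    ("http://weblog.burningbird.net", FOAF_NAME, "Shelley Powers"),
}
store_b = {
    ("http://just.shelleypowers.com", FOAF_NAME, "Shelley Powers"),
    ("http://just.shelleypowers.com", TOPIC, "RDF"),
}

# Combining the sources is a set union: facts are only ever added.
merged = store_a | store_b

# Nothing from either source was lost or overwritten.
assert store_a <= merged and store_b <= merged

# The two identifiers are "duplicates" for now, and that is harmless.
# Later, the mapping between them is itself just one more fact.
merged.add(
    ("http://weblog.burningbird.net", OWL_SAME_AS, "http://just.shelleypowers.com")
)

print(len(merged))
```

Contrast this with two relational tables that both use a bare integer `1` as a key for different things: a naive merge would collide. Here the identifiers are global, so the worst case is an unmapped duplicate, not a corrupted record.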
