Categories
RDF Semantics

I love you 25% of the time

Oooo. This is fun. *claps hands*

David Weinberger asks:

Let’s say I want to express in an RDF triple not simply that A relates to B, but the degree of A’s relationship to B. E.g.:

Bill is 85% committed to Mary

The tint of paint called Purple Dawn is 30% red

Frenchie is 75% likely to beat Lefty

Niagara Falls is 80% in Canada

Other than making up a set of 100 different relationships (e.g., “is in 1%,” “is in 2%,” etc.), how can that crucial bit of metadata about the relationship be captured in RDF?

In my opinion, there is no one way to record a percentage in RDF. That’s the same as saying that being faithful to a lover 50% of the time is equivalent to eating only 50% of a banana split.

So let’s take just one of the examples David gives us: Niagara Falls is 80% in Canada. At first glance, if we wanted to limit ourselves to recording this fact using one and only one triple, we could do the following:

Niagara Falls — has an 80% existence — in Canada.

That records the fact. If I were specifically looking this information up, I would have it. The only point is, that’s all I would have. I could continue this, as David says, with an 81% existence, an 82% existence, and so on. How tedious. Humans don’t work this way. We don’t memorize every single number in existence. No, we memorize ten digits and devise a numeric system to derive the rest, learning how to use that system instead of memorizing all possible numbers.

What we need is a way of capturing that ability to derive new concepts from existing facts using a set of triples in the form: subject predicate object.
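That derivation ability can be sketched with triples held as plain JavaScript arrays, plus a query helper that answers questions by pattern matching instead of requiring every answer to be stored. The vocabulary below is made up for illustration; it is not a real RDF schema:

```javascript
// Facts as subject–predicate–object triples, as plain arrays.
const triples = [
  ["NiagaraFalls", "type", "PhysicalEntity"],
  ["Canada", "type", "Country"],
  ["Country", "subClassOf", "PoliticalEntity"],
];

// Match a pattern; null acts as a wildcard.
function query(s, p, o) {
  return triples.filter(([ts, tp, to]) =>
    (s === null || ts === s) &&
    (p === null || tp === p) &&
    (o === null || to === o));
}

// Derive a fact never stated directly: anything whose type is a
// subclass of some class is also an instance of that class.
function isA(subject, cls) {
  for (const [, , type] of query(subject, "type", null)) {
    if (type === cls) return true;
    if (query(type, "subClassOf", cls).length > 0) return true;
  }
  return false;
}
```

Nothing above says "Canada is a political entity," yet `isA("Canada", "PoliticalEntity")` comes back true, which is the derive-rather-than-enumerate point.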

Rather than dive straight into the triples, let’s look at the question from the perspective of David being David, me being me, and this being April 2006. In other words, let’s look at what David is really saying when he gives the sentence: Niagara Falls is 80% in Canada.

When David said Niagara Falls is 80% in Canada, what he’s saying, in an assumed shorthand way, is the following:

Niagara Falls exists 80% in Canada.

This statement was made in 2006.

Canada is a country.
A country is a political entity, which may, or may not have, a fixed physical location.

Niagara Falls is a physical entity.
Niagara Falls has a physical location.
Niagara Falls has an area, bounded by longitude and latitude.

Niagara Falls’ physical location has a northern terminus latitude of ____.
Niagara Falls’ physical location has a southern terminus latitude of ____.
Niagara Falls’ physical location has an eastern terminus longitude of _____.
Niagara Falls’ physical location has a western terminus longitude of _____.

In 2006 Canada’s southernmost border is at a latitude of ____.
In 2006 Canada’s westernmost border is at a longitude of ____.
In 2006 Canada’s northernmost border is at a latitude of ____.
In 2006 Canada’s easternmost border is at a longitude of ____.

Why all of the different sentences? Because there’s more to the statement “Niagara Falls is 80% in Canada” than first appears from just the words. We want to capture not only the essence of the words, but also the assumptions and inferences that we, as humans, make based on the words.

Given David’s statement that Niagara Falls is 80% in Canada, what can we infer?

That the statement about Niagara Falls being 80% in Canada was made in 2006.
That Niagara Falls has an area bordered by such and such latitude and such and such longitude. This is a physical, fixed location (though not immutable).
That in 2006, Canada has an area bordered by such and such latitude and such and such longitude. This is a mutable political border, though one that rarely changes.

Based on all of these, we can determine that 80% of Niagara Falls is in Canada.

The semantic web means capturing information so that we can draw conclusions from it by inference. Since wetware is still experimental, and we haven’t yet created machines that can build inferences without a little help from us’ons, we provide enough of the other details to reach a point where we can infer all the facts from a given statement.

Therefore, we have the following triples (using English syntax rather than Turtle or some other mechanistic format, since I’m writing for people not machines right at the moment):

A geographical object has a physical existence at a point in time.
A geographical object’s physical existence can be measured in area.
The area of a geographical object’s physical existence is found by taking the length of one side and multiplying it by the length of the other (broadly speaking).
The length of one side can be found by taking the difference of its boundaries, as measured by its southern and northern latitudes.
The length of the other side can be found by taking the difference of its boundaries, as measured by its western and eastern longitudes.
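Those area rules can be sketched in a few lines of JavaScript, treating coordinates as points on a flat plane (real geodesic area would need spherical geometry, which the "broadly speaking" above waves away):

```javascript
// A side length is the difference of two boundary coordinates.
function sideLength(a, b) {
  return Math.abs(a - b);
}

// Area is one side times the other: north–south extent comes from
// latitudes, east–west extent from longitudes.
function boundingArea(box) {
  // box: { north, south, east, west }, all in degrees
  return sideLength(box.north, box.south) * sideLength(box.east, box.west);
}
```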

A geopolitical object is also a geographical object.
A country is a geopolitical object.
Canada is a country.

Canada’s 2006 border has a northernmost latitude of ____.
Canada’s 2006 border has a southernmost latitude of ____.
Canada’s 2006 border has a westernmost longitude of _____.
Canada’s 2006 border has an easternmost longitude of ______.

Niagara Falls has a northernmost latitude of ______.
Niagara Falls has a southernmost latitude of ______.
Niagara Falls has a westernmost longitude of _____.
Niagara Falls has an easternmost longitude of _____.
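With boundary triples like these in hand, the 80% figure becomes derivable rather than memorized. A sketch of that inference using bounding boxes; the coordinates are invented purely for illustration, since the real values are left blank above:

```javascript
// Overlap of two bounding boxes, or null if they don't intersect.
function intersection(a, b) {
  const north = Math.min(a.north, b.north);
  const south = Math.max(a.south, b.south);
  const east = Math.min(a.east, b.east);
  const west = Math.max(a.west, b.west);
  if (north <= south || east <= west) return null;
  return { north, south, east, west };
}

function area(box) {
  return (box.north - box.south) * (box.east - box.west);
}

// What fraction of `inner` lies within `outer`?
function percentInside(inner, outer) {
  const overlap = intersection(inner, outer);
  return overlap ? (100 * area(overlap)) / area(inner) : 0;
}

// Hypothetical boxes, chosen so 80% of the falls' box overlaps Canada's:
const falls = { north: 10, south: 0, east: 10, west: 0 };
const canada = { north: 100, south: 2, east: 100, west: 0 };
// percentInside(falls, canada) → 80
```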

Seems like a lot, but this is actually capturing what David is saying; he just doesn’t know he’s saying it. If we just recorded the fact that Niagara Falls is 80% in Canada, we would be leaving all the important bits behind.

There are better schema folk than I, and they can, most likely, come up with better triples. The point is that RDF doesn’t record facts. We have existing models that do a dandy job of recording facts. Given an infinitely long, one-dimensional flat plane where all facts have a single point of existence, we have systems that can capture snapshots of this plane far more efficiently than RDF.

Consider instead, a model of knowledge that consists of an infinite number of finite planes of information, intersecting infinitely. That’s RDF’s space, recording these points of intersection.

Categories
JavaScript RDF

We interrupt your regular thinking

I wrote a while back about putting RDF files out on Amazon’s S3 file storage. Why, I was asked. After all, I don’t have enough files, I have room on my server, and so on. Yup, I agreed. Other than S3 being nifty tech and wanting to be a cool kid, why would one want to use it?

One reason: it forces one to think differently about application development and data storage when you’re restricted to using web services rather than traditional file or database I/O to access the data.

Les Orchard wrote today about his S3 Wiki work:

One of the mind-bending concepts behind this whole thing for me is that both the documents and the authoring application are all resident on the S3 servers, loaded and run on the fly in a browser. The magic S3 lends is simple file I/O, taken for granted by applications in nearly every other development environment. Otherwise, JavaScript in a browser is a very capable system.

I agree that JavaScript in the browser is a very capable front end. Oh, I don’t agree with replacing Word with Ajax–why do we always see Office as the only killer app in the world and systems have to ‘replace’ it to be considered viable? But JavaScript in browsers, as we progress closer to true cross-browser compatibility, is a very powerful application development system.

However, the part that caught my interest specifically is what Les wrote about the data storage of his wiki application. He is spot on in that S3 changes how you think of I/O (Input/Output). It forces you to challenge your data storage assumptions–all the golden rules you’ve learned since you were knee high to a grasshopper. When you do, you get this sudden burst of ideas. It’s like biting into a SweeTarts candy–you’re not sure if you like the experience, but it sure gets your attention.

In my copious spare five minutes a week, I’m loading RDF into S3. I have an idea. It came to me in a burst. It made my face pucker.

Categories
RDF

‘allo Jena

Both Leigh Dodds and Danny Ayers published notes that the Jena User Conference schedule has been posted.

(Danny also made mention of dreaming about Jena and white-velvet pouffes, but don’t let that scare you away from RDF.)

This has all the makings of being a pretty damn good conference. The sessions are focused either on specific applications or about the practicalities of Jena and RDF. This is not a ’scholastic paper’ conference–this is people making stuff. And fun stuff, too, from the descriptions. (The papers and presentations most likely will be published online, and I’ll post notes to these when they are.)

With SPARQL (the RDF query language) in release candidate status, one of the last necessary missing pieces is now in place, so we’ll be seeing more implementations in the future. The ‘can merge any kind of data effortlessly’ nature of RDF is going to start attracting larger corporate interest. With Jena being the dominant Java implementation of RDF, and Java being so popular in corporations, I expect to see more RDF and Jena in corporate development.

Speaking of Jena and SPARQL, Andy Seaborne, who wrote ARQ, the Jena-compatible SPARQL engine, has a new weblog. He’s not listed in Planet RDF just yet, but I’m assuming this will change fairly soon.

I’m particularly interested in SPARQL because of some of the work I want to do at my site. I’m using PHP and Ruby, though; I can’t use Jena for any personal web development I want to publish online because my server isn’t set up for Java. What we need is a Jena-enabled server where people can host their personal development projects, somewhat like wordpress.com hosts WordPress weblogs, typepad.com hosts TypePad weblogs, and so on.

Not that I’m hinting or anything, but publicly accessible servers set up for specific RDF environments, such as Jena, where one can affordably lease space, aren’t such a bad idea.

Disclaimer: I’m working with the HP folks who are involved with Jena, and presenting at the conference. I thought I should probably mention this, otherwise, you might be confused as to why I’m talking enthusiastically about RDF.

Categories
RDF

S3 S404 RDF OK WRKS

First of all–isn’t there anything in any of the syndication feed specs that says when a syndicated item returns a 404 or the like, some indication should be made? Shouldn’t there be?

In the meantime, I’ve been putting some thought into what I can do with S3. If you’ve been living under a rock (or conversely, are non-tech and go out to the park and stuff on the weekend), S3 is a very cheap mass storage system that Amazon is providing. You pay a few bucks a month and get bunches of space and bandwidth. The only thing is, you have to store data using web services–it’s not a regular hosting system.

I thought this would be a perfect place to put my RDF files. You can’t store database data at S3, which limits the types of data you can store. But I don’t store my RDF data in a database. Each model is stored as a separate file, which would be simple to move to the storage. The only thing is, I have plenty of space between the two servers I have now–my shared system for the weblog, and my development server.

I could put my pictures on S3, but it took time for me to find a way to pull all of these back from Flickr AND modify the URLs in my posts. I’m not of a mind to do the URL thing again.

I could store my gmail email on S3, but I deleted the account. Actually, I’ve deleted most of my centralized accounts.

That space demands media files. Only problem is, I’m not a real media person–outside the pics. I don’t think I’m going to get heavily into podcasts. I don’t have a video camera.

As for storing my personal computer data at S3, I have a DVD burner; I have blank discs.

The more I think on it, the more I think S3 would be a good spot for RDF data. Not just the RDF that helps run my site–RDF I download, or RDF I scrape from other sites, or RDF I pick up here and there. Then, when I need the data, since the models are stored as separate files, it would be easy to access the data, and update it if necessary.
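The one-model-per-file layout maps naturally onto one S3 object per model. A sketch of that idea, with an in-memory map standing in for the bucket so the access-and-update cycle is visible; all the names here are invented for illustration:

```javascript
// Map a model name to a stable, URL-safe S3 object key.
function modelKey(name) {
  return "rdf-models/" + encodeURIComponent(name) + ".rdf";
}

const bucket = new Map(); // stand-in for S3 PUT/GET calls

// Store one RDF model (serialized RDF/XML) as one object.
function saveModel(name, rdfXml) {
  bucket.set(modelKey(name), rdfXml);
}

// Pull a model back when the data is needed; null if never stored.
function loadModel(name) {
  return bucket.get(modelKey(name)) || null;
}
```

Updating a model is then just load, modify, save: no database, and no shared filesystem.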

This doesn’t work with the microformat stuff, as this type of metadata is stored directly in the pages. RDF, on the other hand, can be associated with our web pages or other files, but stored in an external location.

The key is not to provide public access to the data on S3. I don’t control the domain name, I am unaware of how one can assign a domain name for an individual piece of storage, and there is no guarantee the data will live there forever. It’s hard enough preventing 404 errors when I do host the files, much less when I don’t.

Instead, I’ll mine the data from my server, and then serve it directly from my domains. If I then decide to move the files, I just pull the data, put it somewhere else.

As for security and confidentiality of data–heck, people have been bitching about how unreadable RDF/XML is for years. Now when they say it, I can smile, tell them it’s a perk.

Categories
JavaScript RDF

Asking permission first

Recovered from the Wayback Machine.

Tim Bray has an interesting take on the use of AJAX: rather than have your server do the data processing, use AJAX to grab the data and then have the clients do the work:

A server’s compute resources are usually at a premium, because it’s, you know, serving lots of different client computers out there. Lots of different computers, you say; and how busy are they? Not very; your average Personal Computer usually enjoys over 90% idle time. So you’ve got a network with a small number of heavily-loaded servers and a huge number of lightly-loaded clients. Where do you think it makes sense to send the computation?

The thing is, you know what’s happening on your server, but you don’t know what’s happening on each individual client machine. In addition, you don’t know what each client can and cannot support. You don’t even know if your client has JavaScript turned on to allow you to do the processing.
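That uncertainty is exactly why 2006-era scripts probed for capabilities before relying on them. A sketch of the usual check, with `env` standing in for the browser’s window object so the fallback path can be exercised anywhere:

```javascript
// Probe for a working request object before assuming the client can
// do the processing; return null to signal "fall back to the server."
function makeRequester(env) {
  if (env.XMLHttpRequest) return new env.XMLHttpRequest();
  if (env.ActiveXObject) {
    try {
      return new env.ActiveXObject("Microsoft.XMLHTTP"); // older IE
    } catch (e) {
      // ActiveX present but blocked; fall through
    }
  }
  return null;
}
```

In a real page this would be called as `makeRequester(window)`, with the null result triggering a plain server-rendered path.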

I agree that we can do some interesting work with Ajax and grabbing data from the server and processing it on the clients. Perhaps we need to explore some newer uses of JavaScript and RDF in light of the new server-client interoperability.

However, a developer’s first priority is to do no harm to those using their application. The second priority is to ensure their pages are accessible by their target audience. If we start assuming that the client’s machine is ours to do with as we will, we won’t need hackers to inject harm into our scripts–we’ll do a fine job of it ourselves.