Categories
HTML5 RDF

Holding on effort for HTML5

I have discontinued my efforts to re-examine Ian Hickson’s semantic microdata use cases, as Ian has just published another use case, and added a microdata section to the HTML5 specification. (Update: see the end of this writing.)

Announcement at WhatWG. New section in HTML5 draft.

First glance:

I am not an expert at RDFa, so read what I write accordingly. In my opinion, though, this customized microdata section in HTML5 is not compatible with RDFa. Yes, one can extract RDF out of the text, but one can’t use an RDFa extractor to extract RDF out of the page. This means that people will have to use one syntax when incorporating RDFa into XHTML 1.1 and XHTML 2.0, and another for HTML5.

More importantly, where now we can use RDFa in HTML5, though not “validly”, with this change in the HTML5 spec, this will no longer be possible. One specific issue is with the “property” attribute.

The property attribute is defined as follows in the new HTML5 section:

The property attribute, if specified, must have a value that is an unordered set of
unique space-separated tokens representing the names of the name-value pairs that it adds. 
The attribute's value must have at least one token.

Each token must be either:

    * A valid URL that is an absolute URL, or
    * A valid reversed DNS identifier, or
    * If its corresponding item's item attribute has no tokens: a string containing neither a U+003A COLON character (:) nor a U+002E FULL STOP character (.), or

It does seem like the last item describing valid values was cut off, and a bit garbled, so I can’t make any interpretation based on it.

Now compare this with the description provided in the RDFa specification:

@property
    a whitespace separated list of CURIEs, used for expressing relationships
 between a subject and some literal text (also a 'predicate');

A CURIE is of the form:

curie       :=   [ [ prefix ] ':' ] reference

prefix      :=   NCName

reference   :=   irelative-ref (as defined in [IRI])

Which is really tech gobbledy for a value such as “dc:title”, where “dc” is an abbreviation for the vocabulary namespace, in this case the Dublin Core namespace of “http://purl.org/dc/elements/1.1/”.
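To make the CURIE idea concrete, here is a minimal sketch of how a prefix:reference value could be expanded into a full URI. It is not how any particular RDFa parser is written. The function names and the hard-coded prefix table are my own assumptions for illustration; a real RDFa processor reads its prefix mappings from the xmlns declarations in the page rather than from a fixed dictionary.

```python
# Prefix-to-namespace table. Only the "dc" mapping comes from the text above;
# hard-coding it here is an assumption made to keep the sketch self-contained.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand_curie(curie, prefixes=PREFIXES):
    """Return the full URI for a prefix:reference value, or None if it cannot be expanded."""
    prefix, colon, reference = curie.partition(":")
    if not colon or prefix not in prefixes:
        # No colon, or an unknown prefix: nothing to expand.
        return None
    return prefixes[prefix] + reference


def expand_property(value, prefixes=PREFIXES):
    """Expand each token of a whitespace-separated attribute value."""
    return [expand_curie(token, prefixes) for token in value.split()]


print(expand_curie("dc:title"))  # http://purl.org/dc/elements/1.1/title
```

A value without a colon, such as "title", comes back as None in this sketch, which is one simple way of modeling a token that a CURIE-based processor has no mapping for.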

According to the HTML5 spec, the use of the CURIE is invalid in HTML5. Depending upon how parsers handle invalid attribute values, the use of the CURIE could actually generate an HTML error, because the HTML5 specification requires that the value either be a full URI, or reverse DNS identifier, such as com.java.somevalue.

In addition, we can’t use RDFa parsers on the HTML5 markup, because the RDFa specification specifically states that property attribute values must be in the form of CURIEs, and anything else is ignored, as Ian notes in the WhatWG announcement of his customized microdata format handling:

An alternative is to go back to the non-URI class names we had above. This doesn’t break compatibility with the RDFa processors, because when there is no colon in the property=”” or rel=”” attributes, the RDFa processors just ignore the values (this is the “no prefix” mapping of CURIEs).

According to the RDFa syntax specification, RDFa does not define a ‘no prefix’ mapping, meaning that this form of CURIE is not supported. If the value is ignored, then whether it would break RDFa parsers is moot, because what value is there to running such a parser against a page that would return no data?

Ian’s philosophy on the use of CURIEs is that these are too hard for people to understand. However, I think that people would have an easier time with them than they would with a reversed DNS identifier, which is a pure code construct made popular in certain programming languages.

There are other what I consider to be significant dissimilarities between the HTML5 proposal, and RDFa, but I’ll hold on these for now. I’d like to see what the RDFa community has to say on the new specification addition before I go further in examining the HTML5 proposal.

Ian has provided some code which seemingly extracts RDF triples out of his own customized microdata format. Will the format hold up under rigorous testing with other test cases that have had successful results in the RDFa space? I don’t know. Regardless, I do know that all of the technology that has been adapted for use with RDFa will not work with the HTML5 microdata, and whatever works with HTML5, will not work with RDFa.

I’ve heard from someone in the RDF space who thinks the HTML5 specification is “close” to RDFa, and only needs a few tweaks. I lack the experience, or perhaps the foresight, to see the same degree of similarity. And this leads me to question whether I should continue with my own re-visiting of the use cases.

I had said once before that I am willing to put the work in, if I felt it would add value to the effort. I hope I will be forgiven for believing that my work won’t impact on the direction the HTML5 editor takes. This wouldn’t be an issue, so much, if I also felt my efforts were of value to the RDFa community. I have not received any indication from the RDFa community that they see my continuing efforts as beneficial. So, I’m not necessarily completely discontinuing my effort, but I am putting it on hold, until I ascertain whether my efforts are beneficial or not.

Update

I just sent the following to the HTML comment list, and the What WG list:

Sorry for the double emails today.

I will continue with revisiting the use cases for the microdata section. One additional component I’ll add to the use cases is applying my interpretation of how RDFa might handle the use case, as compared to how it could be handled with Ian’s new HTML5 microdata proposal. This will, of course, slow me down a bit.

Note, though, that I don’t claim to be an expert on either RDFa or Ian’s new microdata proposal. My hope is that if I make a mistake, or I’m not clear, folks will respond to my writing with corrections and/or additions. The purpose behind my effort is to open discussion. I will admit, though, that I do have a bias for RDFa, primarily because this is something that’s real, today, and that I can use, today.

Categories
HTML5

Searchcase

(This document is part of an effort to flesh out use cases for microdata inclusion in HTML5. See the original use case document, and the background material document as well as the email correspondence that best describes this process.)

————–

Original Use Case: Search

USE CASE: Site owners want a way to provide enhanced search results to the
engines, so that an entry in the search results page is more than just a
bare link and snippet of text, and provides additional resources for users
straight on the search page without them having to click into the page and
discover those resources themselves.

SCENARIOS:

* For example, in response to a query for a restaurant, a search engine
might want to have the result from yelp.com provide additional
information, e.g. info on price, rating, and phone number, along with
links to reviews or photos of the restaurant.

REQUIREMENTS:

* Information for the search engine should be on the same page as
information that would be shown to the user if the user visited the
page.

————–

The ability to add information into a web page that can enhance search reliability seems to have been mentioned in more than one communication. In addition, this enhanced search goes beyond just returning links to pages with the original material, but also includes the ability to provide relevant information directly in the search result, in addition to enabling smaller-scale aggregation of data for searching.

Generally, this capability is alluded to in several communications by Charles McCathieNevile, from the Opera Software Standards Group, dated the beginning of January, 2009. I won’t repeat all of the information, but the text that seems most relevant to the use of metadata to enhance search capability includes (but is not limited to):

There are many many small problems involving encoding arbitrary data in
pages – apparently at least enough to convince you that the data-*
attributes are worth incorporating.

There are many cases where being able to extract that data with a simple
toolkit from someone else’s content, or using someone else’s toolkit
without having to tell them about your data model, solves a local problem.
The data-* attributes, because they do not represent a formal model that
can be manipulated, are insufficient to enable sharing of tools which can
extract arbitrary modelled data.

RDF, in particular, also provides estabished ways of merging existing data
encoded in different existing schemata.

There are many cases where people build their own dataset and queries to
solve a local problem. As an example, Opera is not intersted in asking
Google to index data related to internal developer documents, and use it
to produce further documentation we need. However, we do automatically
extract various kinds of data from internal documents and re-use it. While
Opera does not in fact use the RDF toolstack for that process, there are
many other large companies and organisations who do, and who would benefit
from being able to use RDFa in that process.

I picked this quote because it effectively demonstrates the complexity of metadata enabled searching. This type of search is not just a matter of plunking any data into a web page and hoping it gets extracted by Google or Yahoo. Instead, metadata enriched searching annotates the data in such a way that the syntax is consistent and reliable, so that one only needs to generate one toolset in order to either produce the data, or extract it.

This is particularly essential for metadata enabled searching, because if the data is not based on the same underlying data model, the data cannot be merged into one store, nor can it be queried as one store at a later time.
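A small sketch may help show why a shared data model matters for merging. It assumes that both sources have already been reduced to subject, predicate, object triples. All of the URIs and values are invented for the example, and the merge and match helpers are hypothetical names of my own, not part of any RDF toolkit.

```python
# Dublin Core namespace, used here only to build predicate URIs.
DC = "http://purl.org/dc/elements/1.1/"

# Triples as they might be extracted from two unrelated pages.
# Every URI and literal below is made up for illustration.
store_page = {
    ("http://example.com/items/1", DC + "title", "Abbey Road"),
    ("http://example.com/items/1", DC + "creator", "The Beatles"),
}
review_page = {
    ("http://example.org/reviews/9", DC + "subject", "http://example.com/items/1"),
    ("http://example.org/reviews/9", DC + "creator", "A Reviewer"),
}


def merge(*sources):
    """Union any number of triple sets into one store."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


merged = merge(store_page, review_page)

# One query now runs across data that came from both pages.
for triple in match(merged, predicate=DC + "creator"):
    print(triple)
```

Because both pages use the same triple shape, merging is nothing more than a set union, and the same query works no matter which page a statement came from. Data held in two unrelated ad hoc formats would need a custom conversion step before either operation was possible.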

Charles further emphasizes the importance of a consistent data model with the following:

Many people will be able to use standard tools which are part of their
existing infrastructure to manipulate important data. They will be able to
store that data in a visible form, in web pages. They will also be able to
present the data easily in a form that does not force them to lose
important semantics.

People will be able to build toolkits that allow for processing of data
from webpages without knowing, a priori, the data model used for that
information.

And the following:

If the data model, or a part of it, is not explicit as in RDF but is
implicit in code made to treat it (as is the case with using scripts to
process things stored in arbitrarily named data-* attributes, and is also
the case in using undocumented or semi-documented XML formats, it requires
people to understand the code as well as the data model in order to use
the data. In a corporate situation where hundreds or tens of thousands of
people are required to work with the same data, this makes the data model
very fragile.

RDFa’s ability to add this search capability has been stripped out of the use case because it is considered an “implementation detail”. However, to the best of my knowledge, RDFa is the only specification that provides this capability, and for which there are actual implementations demonstrating its feasibility.

The RDFa-use-cases Wiki points to Search Monkey as an example of a Search Engine based more on extracting metadata encoded into a web page, than using some form of algorithmic alchemy.

There is also an excellent case study document at the W3C that discusses Search Monkey, and its relevance to enhancing search with metadata annotation.

Though actual implementations are not referenced in the use case document submitted by Ian Hickson, it’s important to note that such implementations do exist. Such actual implementations cast doubt on the assertions (in IRC at http://krijnhoetmer.nl/irc-logs/microformats/20090504 and elsewhere) that RDF is a failure, there are no web sites incorporating RDFa and so on.

Based on this, I would also recommend the following use case for this particular requirement:

I have a music store selling CDs, as well as original vinyl records, in addition to sheet music, posters, and other music memorabilia. I want to annotate my store listings with information such as artist name, song, and medium, as well as prices. I'm hoping this information will be picked up by search engines, and when someone is looking for something I'm selling, my store will pop up, including information about the item for sale.

This information can be derived from natural language processing. However, I also want to record recommendations, so that people searching for something such as a poster of the Beatles' Abbey Road will not only see that I sell the Beatles' Abbey Road poster, but that I also sell a poster featuring the Yellow Submarine, and one featuring a photo of Mick Jagger from the early 1970's, as well as other British Invasion memorabilia.

When people click through to my site, not only do I list items for sale, but I also list other information, derived from sites such as Music Brainz, so that people not only associate my site with things to buy, but also a fun place to visit just to learn something new, and interesting.

Lastly, I want to provide a web service where people who review music can provide a link to their web feeds, also annotated with metadata, that I process with an application I downloaded, for associations relevant to music and artists I feature, so that I can include an excerpt of their reviews, and a link for the full review. With this approach, the web site owner doesn't have to remember to link directly between a specific review and an artist and song. And by using well-defined metadata I have access to pre-built tools so that I don't have to derive a complex natural language processing algorithm in order to pull out just the pertinent information in order to create a link between my store and the review.

Extrapolate from this use case to a store featuring squid art, another selling gourmet chocolate, a third jewelry made from recycled materials, and a travel agency, but with one stipulation: the businesses use the same basic toolsets, including content management system, and web services, as well as underlying data store.

This generic data model and toolset aren't arbitrary requirements: A generic data model ensures that one set of tools can have wide use, encouraging development of more tools. Generalizing the toolset ensures that the best are available to all of the stores, and that those people who may provide reviews of music and also funky jewelry, can use the same CMS, and the same underlying metadata structure, in order to annotate both to ensure inclusion of all reviews at relevant stores.

(Though not specifically related to the requirements process, a recent publication on the use of RDFa and music data can be found in the writing, Data interchange problems come in all sizes.)

Categories
HTML5 Photography

Pack of pictures and other stuff

I’ve put together a package of photos I’ve taken earlier this year. They include photos of places around town, flowers, chimps, and other critters. This is the package of photos I’m currently using for my screen saver, so I thought I’d put it online. I don’t guarantee you’ll like any of the pictures, but if you don’t, the most it will cost you is the download time. Note, the file is 17.3MB so I hope you have broadband. If you want to look at the photos online, they’re all at MissouriGreen.

I’m in the process of butting into the HTML5 effort in regards to RDFa. You can read the history of this effort at the HTML WG list. I’m taking the HTML5 editor, Ian Hickson’s, use cases, his original raw material, and mapping the two. I’m also adding in my own use cases. In the effort to make the use cases “implementation free”, I think that the detail and the complexity of the original use cases were reduced too drastically. You can see what I mean by my first use case, and will have the same for the others by Monday.

Will this make a difference? I haven’t a clue. Probably not. I’m sure that neither the HTML5 group, nor the RDFa group, appreciate my particular style of “contributing”, but I decided to follow Sam Ruby’s advice to “put up or shut up” when it comes to HTML5. I’m just going to put up or shut up in my way.

In the meantime, I need to return to my book, which also means that I will be tearing apart my sites as part of my research. I don’t expect to be twittering much, or writing to the weblog, either, in the next few months. I need to focus on the book, and other writing/work for income. I’m also really burned out and very tired, and feeling under the weather lately, and have a need to disconnect from the social hive. Emails always welcome, but I just don’t feel like “broadcasting”.

If you do access any one of the sites at any point in time and find them either not working, or working oddly, no worries, this is just me experimenting, researching, documenting, and writing. Hopefully by the time my book is done, I’ll be more up for writing to my web sites, and they’ll be all settled down and behaving.

If you do follow along with my RDFa use case efforts, I hope you’ll make comments at the HTML WG, as that’s the appropriate place to have a discussion. However, I will also open up comments for a week, in case you just want to make more casual remarks here. Or you can just ignore the whole thing, which is also a good option.

Categories
HTML5 W3C

Annotation

(This document is part of an effort to flesh out use cases for microdata inclusion in HTML5. See the original use case document, and the background material document as well as the email correspondence that best describes this process.)

————–

USE CASE: Allow authors to annotate their documents to highlight the key
parts, e.g. as when a student highlights parts of a printed page, but in a
hypertext-aware fashion.

SCENARIOS:

* Fred writes a page about Napoleon. He can highlight the word Napoleon
in a way that indicates to the reader that that is a person. Fred can
also annotate the page to indicate that Napoleon and France are
related concepts.

—————

Ian has already provided his summary of this use case in the What WG group list. His summary:

This use case isn’t altogether clear, but if the target audience of the
annotations is human readers (as opposed to machines and readers using
automated processing tools), then it seems like this is already possible
in a number of ways in HTML5.

In conclusion, this use case doesn’t seem to need any new changes to the
language.

This use case was submitted by Kingsley Idehen, who said considerably more than was entered into the summary use case. Kingsley wrote:

When writing HTML (by hand or indirectly via a program) I want to
isolate at describe what the content is about in terms of people,
places, and other real-world things. I want to isolate “Napoleon” from a
paragraph or heading, and state that the aforementioned entity is: is
of type “Person” and he is associated with another entity “France”.

The use-case above is like taking a highlighter and making notes while
reading about “Napoleon”. This is what we all do when studying, but when
we were kids, we never actually shared that part of our endeavors since
it was typically the route to competitive advantage i.e., being top
student in the class.

What I state above is antithetical to the essence of the World Wide Web,
as vital infrastructure harnessing collective intelligence.

RDFa is about the ability to share what never used to be shared. It
provides a simple HTML friendly mechanism that enables Web Users or
Developers to describe things using the Entity-Attribute-Value approach
(or Subject, Predicate, Object) without the tedium associated with
RDF/XML (one of the other methods of making statements for the
underlying graph model that is RDF).

This use case could have used some more discussion between Ian and Kingsley, because, in my opinion, Ian’s interpretation doesn’t match what Kingsley wrote.

Kingsley wrote about annotating the information within the publication, as one would use a highlighter, but he didn’t mean that this information actually has to be highlighted and made visible to the person reading the text. I believe he meant that the annotation would be visible to processes, which could then make it available either to the individual who made the annotation (most likely at a later time, as notes), or perhaps to others when aggregated (the latter is my own interpretation).

The question, then, is this: is there a mechanism currently in HTML5 where one can annotate the data within a writing, in a non-visible manner, and which can then be used to make an assertion, such as Napoleon is the name of a person, and the person Napoleon is related to another entity, this one named France (which is the name of a country, and so on)?

So, let me take another try at this use case:

Within a writing published on the web, I want to add annotation into the text to highlight specific facts, but I don't want such highlighting to distract from the text, so I don't want it to be visible. An example of the type of annotation I may make is to highlight the word "Napoleon" and annotate this word with an assertion that Napoleon is a person, and to add further information, that the person, Napoleon, is related to France (a country).

I write on many topics, and so I may make use of several different vocabularies in order to perform my annotation. In addition, I may have to create my own vocabulary if the annotation I want to make doesn't match any of the known and previously published vocabularies. If I do, I'll do so in such a way that there can't be a possible conflict with any other vocabulary.

Once my text is documented, I want to be able to access this annotation at a later time, separate from the document. To do this, I'll process each of my writings with an application that will pull out this specialized annotation, for aggregation and later query. In addition, by using a standard metadata annotation technique and model, the data can also be accessed by search engines, making the data also available to others.
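As a rough sketch of what such an application might look like, the following pulls annotations out of a fragment marked up with a handful of RDFa-style attributes. It is deliberately simplified and is not a conformant RDFa parser: it only looks at about, typeof, rel, and resource when they appear on the same element, and it leaves the prefixed names unexpanded. The sample markup, the ex:relatedTo property, and the class and function names are all my own assumptions.

```python
from html.parser import HTMLParser


class AnnotationExtractor(HTMLParser):
    """Collect (subject, predicate, object) triples from a few RDFa-style attributes.

    Simplification: an element contributes triples only when it carries an
    'about' attribute, and only 'typeof' and 'rel' + 'resource' are read.
    """

    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        subject = attributes.get("about")
        if subject is None:
            return
        if "typeof" in attributes:
            self.triples.append((subject, "rdf:type", attributes["typeof"]))
        if "rel" in attributes and "resource" in attributes:
            self.triples.append((subject, attributes["rel"], attributes["resource"]))


def extract(html):
    """Return the annotation triples found in an HTML fragment."""
    parser = AnnotationExtractor()
    parser.feed(html)
    return parser.triples


# Hypothetical markup: the annotations add nothing visible to the rendered text.
SAMPLE = (
    '<p><span about="#napoleon" typeof="foaf:Person">Napoleon</span> ruled '
    '<span about="#napoleon" rel="ex:relatedTo" resource="#france">France</span>.</p>'
)

for triple in extract(SAMPLE):
    print(triple)
```

The reader of the page sees only "Napoleon ruled France." The assertions that Napoleon is a person, and that he is related to France, are available only to whatever process reads the attributes, which is the non-visible highlighting I have in mind.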

It would help to get concurrence from Kingsley as to the accuracy of my assessment, but I do feel comfortable that my use case is a closer approximation to what Kingsley meant. If this is so, Ian’s concluding statement about this use case, including the claim that it would require no change to HTML5, could be in error.

Categories
Semantics

Arbitrary Vocabularies and Other Crufty Stuff

I went dumpster diving into the microformats IRC channel and found the following:

singpolyma – Hixie: that’s the whole point… if you don’t have a defined vocabulary, you end up with something useless like RDF or XML, etc
@tantek – exactly
Hixie – folks who have driven the design of XML and RDF had “write a generic parser” as their #1 priority
@tantek – The key piece of wisdom here is that defined vocabularies are actually where you get *user* value in the real world of data generated/created by humans, and consumed eventually by humans.
Hixie – i’m not talking about this being a #1 priority though — in the case of the guy i mentioned earlier, it was like #4 or #5
Hixie – but it was still a reason he was displeased with microformats
@tantek – Hixie – ironically, people have written more than one generic parser for microformats, despite that not being a priority in the design
Hixie – url?
@tantek – mofo, optimus
@tantek – http://microformats.org/wiki/parsers
@tantek – not exactly hard to find
@tantek – it’s ok that writing a generic parser is hard, because not many people have to write one
Hixie – optimus requires updating every time you want to use a new vocabulary, though, right
@tantek – OTOH it is NOT ok to make writing / marking up content hard, because nearly far more people (perhaps 100k x more) have to write / mark up content.
Hixie – yes, writing content should be easy, that’s clear
Hixie – ideally it should be even easier than it is with microformats 🙂
singpolyma – Of course you have to update every time there’s a new vocabulary… microformats are *exclusively* vocabularies
Hixie – there seems to be a lot of demand for a technology that’s as easy to write as microformats (or even easier), but which lets people write tools that consume arbitrary vocabularies much more easily than is possible with text/html / POSH / Microformats today
singpolyma – Hixie: isn’t that what RDFa and the other cruft is about?
Hixie – RDFa is a disaster insofar as “easy to write as microformats” goes
singpolyma – Not that I agree arbitrary vocabularies can be used for anything…
Hixie – and it’s not particularly great to parse either

Hixie – is it ok if html5 addresses some of the use cases that _are_ asking for those things, in a way that reuses the vocabularies developed by Microformats?

Well, no one is surprised to see such a discussion about RDFa in relation to HTML5. I don’t think anyone seriously believed that RDFa had a chance of being incorporated into HTML5. Most of us have resigned ourselves to no longer supporting the concept of “valid” markup, as we go forward. Instead, we’ll continue to use bits of HTML5, and bits of XHTML 1.0, RDFa, and so on.

But I am surprised to read a data person write something like, if you don’t have a defined vocabulary, you end up with something useless like RDF or XML. I’m surprised because one can add SQL to the list of useless things you end up with if you don’t have defined vocabularies, and I don’t think anyone disputes the usefulness of SQL or the relational data model. A model specifically defined to allow arbitrary vocabularies.

As for XML, my own experiences with formatting for eBooks have shown how universally useful XML and XHTML can be, as I am able to produce book pages from web pages, with only some specialized formatting. And we don’t have to form committees and get buy-off every time we create a new use for XML or XHTML, just as we don’t have to get some standards organization to give an official okee dokee to another CMS database, such as the databases underlying Drupal or WordPress.

And this openness applies to programming languages, too. There have been system-specific programming languages in the past, but the widely used programming languages are ones that can be used to create any number of arbitrary applications. PHP can be used for Drupal, yes, but it can also be used for Gallery, and eCommerce, and who knows what else—there’s no limiting its use.

Heck, HTML has been used to create web pages for weblogs, online stores, and gaming, all without having to define a new “vocabulary” of markup for each. Come to think of it, Drupal modules and WordPress plug-ins, and widgets and browser extensions, are all based on some form of open infrastructure. So are REST and all of the other web service technologies.

In fact, one can go so far as to say that the entire computing infrastructure, including the internet, is based on open systems allowing arbitrary uses, whether the uses are a new vocabulary, or a new application, or both.

Unfortunately, too many people who really don’t know data are making too many decisions about how data will be represented in the web of the future. Luckily for us, browser developers have gotten into the habit of more or less ignoring anything unknown that’s inserted into a web page, especially one in XHTML. So the web will continue to be open, and extensible. And we, the makers of the next generation of the web, can continue our innovations, uninhibited by those who want to fence our space in.