
Searchcase

(This document is part of an effort to flesh out use cases for microdata inclusion in HTML5. See the original use case document and the background material document, as well as the email correspondence that best describes this process.)

————–

Original Use Case: Search

USE CASE: Site owners want a way to provide enhanced search results to the
engines, so that an entry in the search results page is more than just a
bare link and snippet of text, and provides additional resources for users
straight on the search page without them having to click into the page and
discover those resources themselves.

SCENARIOS:

* For example, in response to a query for a restaurant, a search engine
might want to have the result from yelp.com provide additional
information, e.g. info on price, rating, and phone number, along with
links to reviews or photos of the restaurant.

REQUIREMENTS:

* Information for the search engine should be on the same page as
information that would be shown to the user if the user visited the
page.

————–

The ability to add information to a web page that can enhance search reliability has been mentioned in more than one communication. This enhanced search goes beyond returning links to pages containing the original material: it also includes providing relevant information directly in the search result, and enabling smaller-scale aggregation of data for searching.
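
To make the restaurant scenario above concrete, here is a minimal sketch of the kind of annotation involved. It uses RDFa 1.1 attributes with the schema.org vocabulary, both of which postdate this post, purely as an illustration; the business name, rating, and phone number are invented.

    <div vocab="http://schema.org/" typeof="Restaurant">
      <h2 property="name">Luigi's Trattoria</h2>
      <p>Price range: <span property="priceRange">$$</span></p>
      <p>Phone: <span property="telephone">+1-314-555-0100</span></p>
      <div property="aggregateRating" typeof="AggregateRating">
        Rated <span property="ratingValue">4.5</span> out of
        <span property="bestRating">5</span>
        (<span property="reviewCount">213</span> reviews)
      </div>
      <a href="/luigis/reviews">Read the reviews</a>
    </div>

A search engine that understands these terms can lift the price range, rating, and phone number straight into its results page, without any page-specific scraping code.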

Generally, this capability is alluded to in several communications by Charles McCathieNevile, of Opera Software's Standards Group, dating from the beginning of January 2009. I won't repeat all of the information, but the passages that seem most relevant to the use of metadata to enhance search capability include (but are not limited to):

There are many many small problems involving encoding arbitrary data in
pages – apparently at least enough to convince you that the data-*
attributes are worth incorporating.

There are many cases where being able to extract that data with a simple
toolkit from someone else’s content, or using someone else’s toolkit
without having to tell them about your data model, solves a local problem.
The data-* attributes, because they do not represent a formal model that
can be manipulated, are insufficient to enable sharing of tools which can
extract arbitrary modelled data.

RDF, in particular, also provides established ways of merging existing data
encoded in different existing schemata.

There are many cases where people build their own dataset and queries to
solve a local problem. As an example, Opera is not interested in asking
Google to index data related to internal developer documents, and use it
to produce further documentation we need. However, we do automatically
extract various kinds of data from internal documents and re-use it. While
Opera does not in fact use the RDF toolstack for that process, there are
many other large companies and organisations who do, and who would benefit
from being able to use RDFa in that process.

I picked this quote because it effectively demonstrates the complexity of metadata-enabled searching. This type of search is not just a matter of plunking any data into a web page and hoping it gets extracted by Google or Yahoo. Instead, metadata-enriched searching annotates the data in such a way that the syntax is consistent and reliable, so that only one toolset is needed either to produce the data or to extract it.

This is particularly essential for metadata-enabled searching, because if the data is not based on the same underlying data model, it cannot be merged into one store, nor can it be queried as one store at a later time.
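
As a rough sketch of what sharing an underlying data model looks like in markup, consider fragments from two unrelated sites, both using the same vocabulary (schema.org terms and invented products, purely for demonstration):

    <!-- fragment from one site -->
    <p vocab="http://schema.org/" typeof="Product">
      <span property="name">Squid-print tote bag</span>:
      <span property="offers" typeof="Offer">
        $<span property="price" content="18.00">18</span>
      </span>
    </p>

    <!-- fragment from a second site, built by different people -->
    <p vocab="http://schema.org/" typeof="Product">
      <span property="name">Recycled-glass pendant</span>:
      <span property="offers" typeof="Offer">
        $<span property="price" content="32.00">32</span>
      </span>
    </p>

Because both fragments yield statements built from the same types and property names, the extracted data can sit in a single store and answer a single query (say, every product under $30 across both sites). If each site instead invented its own data-* attribute names, the two data sets could not be combined without custom glue code.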

Charles further emphasizes the importance of a consistent data model with the following:

Many people will be able to use standard tools which are part of their
existing infrastructure to manipulate important data. They will be able to
store that data in a visible form, in web pages. They will also be able to
present the data easily in a form that does not force them to lose
important semantics.

People will be able to build toolkits that allow for processing of data
from webpages without knowing, a priori, the data model used for that
information.

And the following:

If the data model, or a part of it, is not explicit as in RDF but is
implicit in code made to treat it (as is the case with using scripts to
process things stored in arbitrarily named data-* attributes, and is also
the case in using undocumented or semi-documented XML formats), it
requires people to understand the code as well as the data model in order
to use the data. In a corporate situation where hundreds or tens of
thousands of people are required to work with the same data, this makes
the data model very fragile.
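
A sketch of the distinction Charles is drawing, using a made-up price field (the attribute name and vocabulary are illustrative):

    <!-- implicit model: only this site's own script knows what "p" means -->
    <span data-p="25.00">$25</span>

    <!-- explicit model: the property name resolves to a shared, documented term -->
    <span vocab="http://schema.org/" typeof="Offer">
      $<span property="price" content="25.00">25</span>
    </span>

In the first case, a new employee (or a generic toolkit) has to read the site's JavaScript to learn that data-p holds a price; in the second, the meaning travels with the markup.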

RDFa’s ability to add this search capability has been stripped out of the use case because it is considered an “implementation detail”. However, to the best of my knowledge, RDFa is the only specification that provides this capability, and for which there are actual implementations demonstrating its feasibility.

The RDFa use cases wiki points to Yahoo's SearchMonkey as an example of a search engine based more on extracting metadata encoded into a web page than on some form of algorithmic alchemy.

There is also an excellent case study document at the W3C that discusses SearchMonkey and its relevance to enhancing search with metadata annotation.

Though actual implementations are not referenced in the use case document submitted by Ian Hickson, it's important to note that such implementations do exist. They cast doubt on the assertions (in IRC at http://krijnhoetmer.nl/irc-logs/microformats/20090504 and elsewhere) that RDF is a failure, that there are no web sites incorporating RDFa, and so on.

Based on this, I would also recommend the following use case for this particular requirement:

I have a music store selling CDs, as well as original vinyl records, in addition to sheet music, posters, and other music memorabilia. I want to annotate my store listings with information such as artist name, song, and medium, as well as prices. I'm hoping this information will be picked up by search engines, so that when someone is looking for something I'm selling, my store will pop up in the results, including information about the item for sale.

This basic information could be derived from natural language processing. However, I also want to record recommendations, so that people searching for something such as a poster of the Beatles' Abbey Road will not only see that I sell the Beatles' Abbey Road poster, but also that I sell a poster featuring the Yellow Submarine, one featuring a photo of Mick Jagger from the early 1970s, and other British Invasion memorabilia.
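
A sketch of what one such annotated listing might look like, using RDFa 1.1 with schema.org terms purely as an illustration (the item, price, and related-item link are all invented):

    <div vocab="http://schema.org/" typeof="Product MusicAlbum" resource="#abbey-road-vinyl">
      <span property="byArtist" typeof="MusicGroup">
        <span property="name">The Beatles</span>
      </span>:
      <span property="name">Abbey Road</span> (original vinyl pressing),
      <span property="offers" typeof="Offer">
        $<span property="price" content="34.99">34.99</span>
        <meta property="priceCurrency" content="USD">
      </span>
      <!-- a recommendation, pointing at another item in the catalogue -->
      See also the <a property="isRelatedTo" href="#yellow-submarine-poster">Yellow Submarine poster</a>.
    </div>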

When people click through to my site, not only do I list items for sale, but I also provide other information, derived from sites such as MusicBrainz, so that people associate my site not only with things to buy, but also with being a fun place to visit just to learn something new and interesting.

Lastly, I want to provide a web service where people who review music can provide a link to their web feeds, also annotated with metadata, which I process with an application I downloaded, looking for associations relevant to the music and artists I feature, so that I can include an excerpt of each review and a link to the full review. With this approach, the web site owner doesn't have to remember to link directly between a specific review and an artist and song. And by using well-defined metadata, I have access to pre-built tools, so I don't have to devise a complex natural language processing algorithm just to pull out the pertinent information needed to create a link between my store and the review.
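
A sketch of a single entry in such an annotated review feed (the reviewer, review text, and URIs are invented for illustration; in practice the item might be identified by a MusicBrainz URI):

    <article vocab="http://schema.org/" typeof="Review">
      <span property="author" typeof="Person">
        <span property="name">A. Reviewer</span>
      </span> on
      <a property="itemReviewed" typeof="MusicAlbum"
         href="http://example.org/album/abbey-road">
        <span property="name">Abbey Road</span>
      </a>:
      <blockquote property="reviewBody">Side two remains a marvel of
      sequencing, and the remaster finally does it justice.</blockquote>
      <a property="url" href="http://example.org/reviews/abbey-road">Full review</a>
    </article>

The store-side application only has to look for Review entries whose itemReviewed URI matches something in my catalogue; there is no hand-maintained link from review to artist, and no natural language processing step.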

Extrapolate from this use case to a store featuring squid art, another selling gourmet chocolate, a third selling jewelry made from recycled materials, and a travel agency, but with one stipulation: the businesses all use the same basic toolsets, including content management system and web services, as well as the same underlying data store.

This generic data model and toolset aren't arbitrary requirements. A generic data model ensures that one set of tools can have wide use, encouraging the development of more tools. Generalizing the toolset ensures that the best tools are available to all of the stores, and that people who review both music and funky jewelry can use the same CMS and the same underlying metadata structure to annotate both, ensuring their reviews show up at the relevant stores.

(Though not specifically related to the requirements process, a recent publication on the use of RDFa and music data can be found in the article Data interchange problems come in all sizes.)
