Poetry Finder: A bit more and a little geek

Recovered from the Wayback Machine.

Interrupting the play I’m having with the RDF Poetry Finder essays to see what others are saying, and also to add some geek stuff so people know that there really is a string at the end of this particular balloon.

Joseph Duemer (and Frank Paynter, indirectly) expressed some concerns about the image=abstraction view of poetry, suggesting instead a possible alternative:

 

“I would tend to think that finding ways of clustering related images around themes or subjects might be a more better way to proceed. This whole project plunges us into the depths of cognative science & the ways in which we can model human consciousness.”

This opens up a new way of looking at the core of the system, or perhaps more accurately, looking at the core of what is poetry. I am very intrigued. I am also out of my element, not being a poet and only recently becoming a poetry enthusiast. My hope is that Joseph, and the other poets/enthusiasts in the audience, will expand on this fascinating view.

I also agree with Joseph when he wrote, “I salute Bb & will happily join her project, but, boy, this is a huge undertaking!” This is a huge undertaking, but not necessarily a new one, or even an impossible one. And this is where today’s bit of geekery enters the picture.

Years ago when I worked in the Acoustical and Linguistics group at Boeing, one of the projects the group was working on was the idea of a concept search engine that could be used for more precise searching in a massive database of stored documents. We lost our funding before we could explore this concept beyond our simple prototype, which was an intelligent front end to the company’s data dictionary.

At the time, there was no Web, and there was no XML or RDF or any of the tools, technologies, and specifications available today. We take for granted the Web, XML, and even RDF, but it’s amazing how much these technologies can simplify a search of this nature compared to the technologies we had back then, in the late 1980s.

I wasn’t one of the brains in the group — my strengths were in taking their efforts and finding practical uses for same. However, right from the start I could see the power of a concept based search, one in which a person can search on an idea or a thought or a need, rather than search on keywords and exact phrases:

“I need all documents focusing on the cost of stress testing compared to the payback of same.”

ALIA’s efforts were focused on finding automated ways of performing these searches, using fairly complicated heuristics and rather intimidating technologies such as neural networking. Heavy stuff. And our organization wasn’t the only one interested in this: just search on concept based search engines in Google to find several references (here, here, here, and so on).

[I was particularly intrigued by Joseph’s discussion about “clustering related images about themes or subjects”, because concept-based searching is sometimes referred to as cluster based searching (here, here, here, here, and so on).]

Of course, we now know that completely automated systems to support concept-based search aren’t really the answer. Concept-based search engines are built on the premise of a partnership between human and machine, between the synaptic and the digital. We have the technology, we have the expertise — we just need to find a way to bring it all together. That’s what we’re doing now, here in this weblog and at the Renaissance Web dialog group: ad hoc discussions bringing together the poet, the enthusiast, and the technologist to explore ideas, test out possibilities without necessarily restricting the discussion this early to specific implementation.

Still, there is one technical implementation issue that can be discussed separate from the heuristics of the search itself: how does one take data from a closed system using RDF Poetry Finder to a more global one?

For instance, we’ll implement an ontology, an architecture, and even open APIs (Application Programming Interfaces) based in different programming languages to support a semantic search engine for poetry and we’ll provide it free of charge to sites such as poets.org and any other poetry-related site that wants to use it. We’ll also give it out to webloggers for use at their own sites.

Hopefully with our help and encouragement, these sites will begin to incorporate this technology, providing semantic search capability for their own needs. In addition, though, they’ll also be generating data that can be accessed and consumed outside of the closed system.

This is doable, practical, and even has precedent in previous technologies. The annotation of the semantic markup for each poem will be based on an ontology created using the W3C’s OWL (Web Ontology Language), itself based on a universal model of semantic data, RDF, and serialized to a file using a universal markup, XML. Accessing a poem’s semantic markup will be no different than accessing, and processing, my RSS 1.0 file, except that the structure of the data might be a bit more complex, and the information in the file persists, though it will change over time (as new interpretations of the work are incorporated).
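For the geeks in the audience, here is a rough sketch of what a poem's semantic markup might look like and how little code it takes to read it. Nothing about the vocabulary has been decided: the `poem:` namespace at example.org and the `theme` property are invented for illustration, and the theme value is a made-up annotation. Only the Dublin Core and RDF namespaces are real. A proper RDF parser would handle the full RDF/XML syntax; this uses Python's standard XML library and only copes with the simple, flat form shown.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
POEM_NS = "http://example.org/poetry#"  # hypothetical vocabulary, not a real ontology

# Hypothetical markup for one poem; the theme annotation is invented.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:poem="http://example.org/poetry#">
  <rdf:Description rdf:about="http://www.poets.org/poems/poems.cfm?45442B7C000C07040C7A">
    <dc:title>Do not go gentle into that good night</dc:title>
    <dc:creator>Dylan Thomas</dc:creator>
    <poem:theme>mortality</poem:theme>
  </rdf:Description>
</rdf:RDF>"""


def read_triples(rdf_xml):
    """Return (subject, predicate, literal) triples from flat RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree stores tags as "{namespace}local"; rejoin into a URI.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in read_triples(SAMPLE):
        print(triple)
```

Each statement about the poem comes back as a subject, a property, and a value, with the poem's URL as the subject.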

Today’s technology used for today’s needs. Have no doubts this can be done. In a closed system, such as in use at poets.org.

However, how do we globalize this system? We have access to the RDF/XML files containing the semantic markup, and it’s based on a universally used ontology — anyone with an RDF parser can access the data, and use it to build a more globalized search engine. Right?

Well, the answer is: yes and no. The technology doesn’t prevent this, but the data does. Or at least, it provides an interesting challenge.

Scenario:

The poem Do not go gentle into that good night by Dylan Thomas is almost universally known. If you search in Google for this poem, you’ll find thousands of references to it, including this one at poets.org, which also includes an audio reading of the poem by Thomas, himself.

Within the closed system that is poets.org, when this Thomas poem is identified as a poetry resource, the identifier given it, based on the RDF model, is the URL of the poem, above. This makes sense because, within that closed system, the poem and the web resource (the page) are one and the same.

However, another closed system such as Loren’s weblog, also featuring this poem, has a different identifier for it — the page where it is located.

How do we reconcile that the resource identified by “http://www.poets.org/poems/poems.cfm?45442B7C000C07040C7A” is the same as that identified by “http://www.lorenwebster.net/In_a_Dark_Time/archives/000236.html#000236”? And once we do, which ‘identifier’ do we use within the global system? We could pick one closed system over another and just use their identifiers, but which would we choose? All add their value to this poem, through a unique perspective as well as interpretations associated with the work.

This is where we start bringing the purely automated solutions back into the picture — we determine if two resources are the same by performing a relatively sophisticated pattern match on the titles of the work. Chances are, accounting for spelling differences as well as language, we’ll be able to find matches at least 98% of the time. Or more.
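As a taste of what that pattern match could look like, here is a minimal sketch using Python's standard library. The function names and the 0.9 threshold are my own assumptions, not a tested design, and this only handles case, punctuation, accents, and small spelling slips. Matching across languages would need a good deal more than this.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents.lower()
    )
    return " ".join(kept.split())


def same_work(title_a, title_b, threshold=0.9):
    """Guess whether two titles name the same work.

    The threshold is an illustrative assumption and would need tuning
    against real data.
    """
    ratio = SequenceMatcher(
        None, normalize_title(title_a), normalize_title(title_b)
    ).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(same_work("Do not go gentle into that good night",
                    "Do Not Go Gentle Into That Good Night."))
    print(same_work("Do not go gentle into that good night", "Fern Hill"))
```

A title-only match will confuse two different poems that share a title, so a real system would probably want to compare the poet's name as well.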

And rather than choose one closed system over another, we merge the information and interpretations and concepts from both systems into one data store, something that can be accomplished quite easily with RDF/XML, and generate a third, unique identifier within this higher-level system.
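The merge itself can be sketched in a few lines if we treat each system's data as a set of (subject, property, value) statements. Everything here is an assumption made for illustration: the function name, the choice of a UUID-based URN as the third identifier, and the use of OWL's sameAs property to remember where the merged resource came from.

```python
import uuid

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def merge_descriptions(triples_a, triples_b, id_a, id_b):
    """Merge two systems' statements about one poem under a new identifier.

    Returns (merged_id, merged_triples). The new identifier is derived from
    the two source identifiers, so the same pair always yields the same URN.
    """
    key = "|".join(sorted([id_a, id_b]))
    merged_id = uuid.uuid5(uuid.NAMESPACE_URL, key).urn

    merged = set()
    for subject, predicate, value in list(triples_a) + list(triples_b):
        # Re-point statements about either source page at the merged resource.
        if subject in (id_a, id_b):
            subject = merged_id
        merged.add((subject, predicate, value))  # a set drops exact duplicates

    # Keep a trail back to both closed systems.
    merged.add((merged_id, OWL_SAME_AS, id_a))
    merged.add((merged_id, OWL_SAME_AS, id_b))
    return merged_id, merged
```

Statements the two systems share, such as the title, collapse into one, while differing interpretations sit side by side under the new identifier.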

Of course, it then is nothing more than a repeat of the same processes to merge the data from several higher-level systems into an even more globally placed system. A system such as Google, if it was so inclined.

I hesitated about bringing specific technical details into the discussion on Poetry Finder at this time, because we’re at the stage where we’re letting our imaginations roam, without hindrance by technical limitations. We’re not ready yet for the constraining environment of actual implementation issues.

Still, it’s hard to dance about semantics, when you’re not sure if there’s a floor underneath you as you twirl about. So go ahead, tap your feet — there’s something there.