Tuesday, November 10, 2009

GUIDs as Conceptual Endpoints

GUIDs have been a topic of fantastic debate within the biodiversity informatics community over the last few years.

There was, and is, an obvious need for GUIDs - to link data records and data sets together, to be able to refer to the "same" object that other people are referring to, to be able to update a cached record at a later date, and many others.

These discussions have led to a rather thorough (maybe too thorough) analysis of GUID technologies, styles, perspectives, advantages, and disadvantages.

This has brought up many issues about GUIDs in general:
- what exactly does a GUID refer to? a physical object? a conceptual object?
- how do we "resolve" information about a GUID?
- which is the "official" information about that GUID? and which is annotation?

Various perspectives of GUIDs are now apparent, from very specific (eg, a database record and some data values in xml that represent the object the GUID refers to), to the fuzzy (conceptual descriptions and annotations about an idea).

The Linked Data community (http://linkeddata.org/) have needed to deal with these issues head on (due to the obvious need to link data objects together directly), and have addressed this by adopting the idea that GUIDs always refer to "abstract" representations, and always use HTTP 303 redirects to "redirect" to the data about that object. This seems a little overkill, and overly abstracted, but it is a consistent approach for the various types of objects, i.e. abstract objects and physical objects.
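To make the 303 idea a bit more concrete, here is a minimal sketch in Python of how a server might answer requests. The /id/ and /data/ path prefixes, the respond function, and the content type are my own assumptions for illustration, not something the Linked Data community prescribes.

```python
def respond(path):
    """Return (status, headers) for a GET request on `path`.

    Hypothetical layout: /id/<x> names the thing itself,
    /data/<x> is a document describing that thing.
    """
    if path.startswith("/id/"):
        # The identifier names the object, not a document, so answer
        # "303 See Other" and point at the data about the object.
        return 303, {"Location": "/data/" + path[len("/id/"):]}
    if path.startswith("/data/"):
        # The description document itself is served normally.
        return 200, {"Content-Type": "application/rdf+xml"}
    return 404, {}


status, headers = respond("/id/mug1")
print(status, headers["Location"])
```

The point of the split is that the identifier and the data about it get different addresses, whether the object is abstract or physical.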

Over time, various forms of GUID have been present in these discussions, including
- URL
- URI
- LSID
- DOI
- UUID
- HANDLE
- Integer :-)
- etc

All of these technologies are really methods of getting to the information about a GUID, EXCEPT the UUID. However UUIDs are not resolvable - so are they of little use?

I have been contemplating this issue, and have devised a slightly different approach to this conundrum. What I think we mean, when we assign a GUID to an object, is to stamp that object/concept with an ID so we can all talk about the same thing. So all we really need to satisfy this is a unique ID (without all the Internet-oriented resolution overhead piffle on top) - and a good example of this pure identifier is the UUID.

So if we just start from the basics and assign UUIDs to the objects we are describing, then add the data we are interested in, and worry about the Internet resolution stuff later, I think we will progress much quicker. This is what I call "conceptual endpoints", because the object we are talking about now has a GUID (UUID), but is not really generally retrievable, as it is not resolvable over web technologies.

We can then add resolution mechanisms on top of this approach, and could well use several standard technologies for doing this job, eg LSID, HTTP URI, DOI etc - it doesn't really matter which, or how many, so long as they connect to the same conceptual endpoint.

eg
say we have a physical object, eg a coffee mug
- we could then assign an ID to that mug, eg "72FFE2F1-A9F8-4887-A95F-11D127730879"
- and some data, eg "color:white", "height:9.6cm"


The conceptual endpoint in this case is the mug "72FFE2F1-A9F8-4887-A95F-11D127730879". Currently there is no way to find out any information, over the web, about this mug.
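As a rough sketch of this first step in Python, using the standard uuid module: the records dictionary and the register function are hypothetical names of mine, standing in for whatever local store you actually use.

```python
import uuid

# Hypothetical in-memory store: UUID -> data about the object.
records = {}


def register(data, identifier=None):
    """Stamp an object with a UUID (minting one if needed) and store its data."""
    ident = identifier if identifier is not None else uuid.uuid4()
    records[ident] = dict(data)
    return ident


# The coffee mug from the example, with its already-assigned ID.
mug_id = register(
    {"color": "white", "height": "9.6cm"},
    uuid.UUID("72FFE2F1-A9F8-4887-A95F-11D127730879"),
)
print(mug_id, records[mug_id])
```

Note that Python prints UUIDs in lower case; it is the same identifier either way. Nothing here touches the web, which is the whole point: the mug has an ID and data, but no resolution mechanism yet.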


So, we could assign resolvable technologies to our endpoint, so that the information is retrievable, eg:

- LSID - urn:lsid:example.org:mugs:72FFE2F1-A9F8-4887-A95F-11D127730879

- URI - http://example.org/mugs/72FFE2F1-A9F8-4887-A95F-11D127730879
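A small Python sketch of how the two resolvable forms above could be derived from, and traced back to, the same UUID. The function names are illustrative, and the parsing assumes the UUID is always the last segment of the identifier (so an LSID with a trailing revision part would need more care).

```python
import uuid


def to_lsid(ident, authority="example.org", namespace="mugs"):
    """Wrap a UUID in an LSID of the form shown above."""
    return f"urn:lsid:{authority}:{namespace}:{str(ident).upper()}"


def to_uri(ident, base="http://example.org/mugs/"):
    """Wrap a UUID in an HTTP URI of the form shown above."""
    return base + str(ident).upper()


def endpoint_of(identifier):
    """Recover the UUID from either form by taking the last ':' or '/' segment."""
    tail = identifier.replace("/", ":").rsplit(":", 1)[-1]
    return uuid.UUID(tail)


mug = uuid.UUID("72FFE2F1-A9F8-4887-A95F-11D127730879")
# Both resolvable identifiers lead back to the same conceptual endpoint.
print(endpoint_of(to_lsid(mug)) == endpoint_of(to_uri(mug)) == mug)
```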


So these various methods of retrieving data about an object are all valid and useful, but ultimately refer to the same conceptual endpoint, which in the end is what most people really want to make sure is in place and correct. The technology for retrieving the information is quite secondary.


So that is conceptual endpoints, or perhaps, as me mate Roger said, the "conceptual end of the point".