Metadata Basics

Identifiers

Identifiers: the rules for good identifiers

For identifiers to do their job they must adhere to some rigorous requirements.

Identifiers must be unique

If they aren't unique, they can't identify unambiguously. Uniqueness is rather tricky, however. Identifiers must be unique within the context in which they will be used. If you have a database with data that will only be used in that database, then you can give your data numbers that are only unique within that database.

If, however, you will be sharing your data with others, then your identifier must be unique within that entire context. Since identifiers have often been assigned locally before sharing, it is common to add a "home" element to the identifier to make it unique when it moves beyond its original place. For example, some communities have institution or branch codes that can be added before or after a local identifier to make it unique in the larger context.

Say there are two banks and each has 5-digit account numbers, like 45783. Although some of their account numbers may not overlap, undoubtedly some of them will. So you could have:

Bank of Martha, account 45783

Bank of James, account 45783

The way to make these unique is to add a code representing each bank. We'll make this simple and call them "BoMa" and "BoJa". In a database that combines account information from Bank of Martha and Bank of James, these accounts will now each be uniquely identified:

BoMa:45783

BoJa:45783
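The prefixing step can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `qualify`, the record layout, and the account numbers other than 45783 are made up for the example.

```python
def qualify(bank_code, account_number):
    """Prefix a local account number with the code of the bank that issued it."""
    return f"{bank_code}:{account_number}"


# Local account numbers; 45783 appears at both banks (other numbers are invented).
bank_of_martha = ["45783", "10234"]
bank_of_james = ["45783", "88001"]

combined = {}
for code, accounts in (("BoMa", bank_of_martha), ("BoJa", bank_of_james)):
    for number in accounts:
        identifier = qualify(code, number)
        # A clash here would mean the prefix did not make the identifier unique.
        if identifier in combined:
            raise ValueError(f"duplicate identifier: {identifier}")
        combined[identifier] = {"bank": code, "local_number": number}

print(sorted(combined))
```

Both accounts numbered 45783 survive the merge because their keys, BoMa:45783 and BoJa:45783, differ.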

The larger the sphere in which one wants to share data, the more difficult it can become to be sure that your identifiers are unique. Because of this you would think that creating unique identifiers for the World Wide Web would be particularly difficult. It would be, except for the fact that the Web already has a well-functioning system of identification known as the Domain Name System. No two people or companies can be in control of the same domain name, so domain names function the way the bank codes did in the example above. What's a domain name? It is a name that ends in one of the defined top-level domains, such as .edu, .com, .us, or .gov. The Library of Congress owns loc.gov and can create any identifiers using this name. I own the domain kcoyle.net, so I, too, could create unique identifiers. The Domain Name System is absolutely essential for the functioning of the Internet and therefore it is carefully managed and kept up to date. These Web-based identifiers are also used in the Semantic Web, which is covered in another unit.
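As a rough sketch of how a domain name can play the role of the "home" element, the Python below builds an identifier from a domain and a local identifier. The function name `mint` and the "/id/" path segment are assumptions made for illustration; they are not a pattern that loc.gov or kcoyle.net is known to use.

```python
from urllib.parse import quote


def mint(domain, local_id, path="id"):
    """Build a Web identifier by placing a local identifier under a domain name.

    The "/id/" path layout is hypothetical; a domain owner may choose any layout.
    """
    # Percent-encode the local identifier so it is safe inside a URI path.
    return f"http://{domain}/{path}/{quote(local_id, safe='')}"


print(mint("kcoyle.net", "45783"))
print(mint("loc.gov", "45783"))
```

The same local identifier yields two different strings because the domains differ, just as the bank codes kept the two account numbers apart.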

Identifiers must be persistent

Ideally, identifiers should persist for the entire life of the thing that they are identifying. In practice, this sometimes isn't possible, but every effort should be made to ensure a long life for identifiers. There are various ways that identifiers can get lost: the organization that assigned them can cease to exist; the identifier scheme can fall out of use; or the identifier and the thing it identifies can become separated. Imagine an identifier that is on a garment's sales tag. Once the sales tag is removed (and it must be removed before a garment can be worn), the identification is broken. This isn't a problem in most cases where clothing is concerned. Other identifiers need to have greater longevity, like the Social Security Number in the U.S., which connects a worker to the system for an entire lifetime. An automobile has a "hard coded" identifier etched into the metal of the engine which may be needed up until the time the vehicle is turned to scrap.

Anyone creating identifiers needs to think about the potential lifetime of the things they identify, and how they will be maintained over time.

Identifiers must be consistent

It occurs fairly regularly that a single item, whether a product or a book or a person, is assigned more than one identifier. You probably have an employee ID, a Social Security Number, and one or more frequent flyer numbers. While potentially confusing, this isn't a problem. What is a problem is if any one of these numbers, or any other identifier, is given to more than one person or thing. In that case, it is ambiguous what the identifier is identifying, meaning that it isn't functioning as an identifier at all. It's the same problem we have with two people named "Mary Jones."

One upshot of this is that identifiers must not be re-used. It may seem safe to take an identifier for something that is no longer being used, say a book that is out of print, and use it for another book, but there is no way to be sure that the identifier doesn't linger in a record or archive somewhere. There are no savings in re-using identifiers, only potential mis-identification.
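One way to honor the no-re-use rule is to retire identifiers rather than free them. The Python below is a minimal sketch under that assumption; the class name `IdentifierRegistry` and its methods are invented for this example and do not describe any particular system.

```python
class IdentifierRegistry:
    """Hand out identifiers that are never assigned a second time."""

    def __init__(self):
        self._next = 1       # the next number to hand out; it only ever grows
        self._records = {}   # identifier -> what it identifies and its status

    def assign(self, thing):
        identifier = self._next
        self._next += 1
        self._records[identifier] = {"thing": thing, "active": True}
        return identifier

    def retire(self, identifier):
        # Mark the identifier as no longer in use, but keep the record so
        # the number can still be looked up and is never handed out again.
        self._records[identifier]["active"] = False

    def lookup(self, identifier):
        return self._records[identifier]


registry = IdentifierRegistry()
first = registry.assign("a book that later goes out of print")
registry.retire(first)
second = registry.assign("a newly published book")
```

After these calls, `first` still points to the out-of-print book, and `second` is a different number, so an old record that mentions `first` cannot be mistaken for the new book.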

The Uniform Resource Identifier

Any string of characters can be an identifier. There are, however, some standard formats for identifiers. The one most used today is called the Uniform Resource Identifier, or URI. The general format of the URI is:

  <scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]    
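To see these parts in practice, the sketch below splits a URI with Python's standard `urllib.parse.urlsplit`. The URI itself is a made-up example on the reserved domain example.org. Note that `urlsplit` reports the hierarchical part as two pieces, the network location and the path.

```python
from urllib.parse import urlsplit

# A made-up URI used only to show the parts of the general format.
parts = urlsplit("http://example.org/catalog/item?format=xml#summary")

print(parts.scheme)    # scheme name: http
print(parts.netloc)    # hierarchical part, network location: example.org
print(parts.path)      # hierarchical part, path: /catalog/item
print(parts.query)     # query: format=xml
print(parts.fragment)  # fragment: summary
```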

It so happens that a Uniform Resource Locator (URL) is a type of URI. This means that URLs can be used as identifiers, whether or not they link to specific locations. This may sometimes cause confusion for people looking at something that begins "http://...", since they may expect it to lead to a web page. For machine queries, the HTTP response codes returned by the server indicate whether the identifier resolves to a document, redirects elsewhere, or has nothing to retrieve.

Identifiers often referred to as "http URIs" are being widely used on the Web for linked data. They have a number of advantages over ad hoc identifier systems:

This lecture continues...