Identifiers: Unique, Persistent, Global

by Karen Coyle

Preprint. Published in The Journal of Academic Librarianship, v. 32, n. 4, July, 2006, pp. 428-431.

The Importance of Identifiers Today

The human sensory system is a marvelous thing. I can recognize a face (even though I will often forget the name that goes with it). I can tell a rose from an iris by its smell as well as its look. I can name a bird that I cannot see by its distinctive call. I can search the library catalog for the string "moby dick" and wade through a retrieved set of different editions of the book and of films based on the book and can select an individual entry that meets my needs. With our senses and our brain we can identify items in the world around us with an impressive amount of accuracy.

Computers are often called "electronic brains" and are said to "think." These disembodied mechanical brains lack sensory input, however, and therefore they don't have our ability to tell a daisy from a dandelion, or Moby Dick from Dick and Jane. For a computer to act on any data about a thing, it needs to be given an identifier that represents the thing in its computational model. As our machine-to-machine activity has increased over the decades since computers became commonplace, our need for identifiers has increased as well. Identifiers that served us well in the early days of business transactions, like the ISBN, are showing their age. We not only need to apply these identifiers to a larger number of items than was originally intended, we also want to be able to refer to individual parts of those items, like chapters or pages. Suddenly it seems that there are not enough numbers in the world to identify everything that we need to identify.

In March of 2006, the National Information Standards Organization (NISO) held a roundtable discussion at the National Library of Medicine with experts from a variety of areas where identifiers are created, used, and maintained. The purpose of this meeting was to articulate needs that might be met through standards or other activities that could be led by NISO.^¹ The attendees agreed that our community generally needs more information about the technology of identifiers and their use in systems and services. Experts in the room talked of common misconceptions about identifiers, clarified definitions, and proposed solutions. I present some of those in this article.

What is an Identifier?

Identifiers, as we use them in our electronic systems, are strings of numbers, letters, and symbols that represent some thing. Identifiers are said to "name" things and "[n]aming entities makes it possible to refer to them, which is essential for any kind of processing."^² But how does a string like "039450643X" come to name, in this case, a particular edition of Remembrance of Things Past, by Marcel Proust? The book was originally published at a time when no such identifier was used. It is the assignment of that string to that thing, the book, by the publisher that makes it an identifier. Some would go so far as to say that the identifier is not the string but "an association between a string (a sequence of characters) and an information resource."^³ In either case, we recognize that an identifier is a convention that requires some person or organization to assert the relationship between the string and the thing. The string 039450643X is meaningless without the ISBN standard that defines it, and without the action of the publisher that assigns that string to a book. In this sense, identifiers are social agreements, and their value depends on the dedication of the organization that creates, maintains, and assigns them.
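To make concrete what the ISBN standard contributes to a string like 039450643X, here is a minimal sketch of the ten-digit ISBN check-digit rule. The rule itself (weights 10 down to 1, a sum divisible by 11, with "X" standing for 10 in the last position) comes from the ISBN standard rather than from this article, and the function name is an illustrative assumption.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the form of a ten-digit ISBN (hypothetical helper name)."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" is only allowed as the final check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run 10, 9, ..., 1 from left to right.
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("039450643X"))
```

A passing check shows only that the string is well formed under the standard. It says nothing about whether a publisher has actually assigned the string to a book, which is the assertion that makes it an identifier.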

Qualities of Identifiers

You often see the word "identifier" preceded by an adjective, like "unique" or "persistent." These are qualities that we often require of identifiers because they are thought to be necessary for identifier use. These qualities are indeed important, but exactly what they mean and how they help us maintain reliable systems and interoperate electronically is more nuanced than you might imagine.

Unique Identifier

Uniqueness is one of the basic requirements that is cited for identifiers. But what does it mean to say that an identifier is unique? There are at least two ways to define "unique" for identifiers:

each thing has one and only one identifier
each identifier refers to one and only one thing

In reality, uniqueness is relative to the task at hand. In terms of having one and only one identifier, think about the many identifiers that are associated with you: your name, your Social Security Number (SSN), your credit card numbers, your driver's license number, and others. Each of these is you, but each can serve a different function. When you file your taxes with the IRS you are identified by your SSN, but when you make a purchase with a credit card you are identified by that credit card number. We are complex creatures who exist and interact in a variety of contexts. Each of those contexts can, and often does, use its own identifier for us. So we can amend our statement by saying that each thing has one and only one identifier within a defined context.

In terms of an identifier referring to one and only one thing, it all depends on what "one" means, and it means different things in different contexts. Publishers consider the ISBN to be an item-level identifier, but their item is available in many copies. Libraries consider the ISBN to be at the level of a title (in the pre-FRBR sense of that word), and assign barcodes to identify each physical item on their shelves. The uniqueness in this case relates to the granularity of the need. My twelve apples are your one dozen. Both are correct, but they would require different identifiers.
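The idea that uniqueness holds within a defined context, and at a chosen level of granularity, can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the context labels, and the barcode and shelf strings are all hypothetical.

```python
from collections import defaultdict


class ContextRegistry:
    """Keeps identifiers unique within a context, not across contexts."""

    def __init__(self):
        # context name -> {identifier string -> thing it names}
        self._assignments = defaultdict(dict)

    def assign(self, context, identifier, thing):
        known = self._assignments[context]
        if identifier in known and known[identifier] != thing:
            raise ValueError(
                f"{identifier!r} already names something else in {context!r}"
            )
        known[identifier] = thing

    def lookup(self, context, identifier):
        return self._assignments[context][identifier]


registry = ContextRegistry()
# Publisher context: one ISBN names one edition, however many copies exist.
registry.assign("isbn", "039450643X", "an edition of Remembrance of Things Past")
# Library context: each physical copy gets its own (made-up) barcode.
registry.assign("barcode", "B0001", "copy 1 of the edition")
registry.assign("barcode", "B0002", "copy 2 of the edition")
# The same string can be reused in another context without conflict.
registry.assign("shelf", "B0001", "a shelf location")
```

Reassigning "B0001" to a different copy inside the "barcode" context raises an error, while the same string in the "shelf" context is accepted.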

Persistent Identifier

Persistence is often cited as a primary quality of an identifier. In simple terms, the answer to the question: "How long does an identifier need to last?" is: "As long as it is needed." The identifier for an IP packet going across the Internet has to last as long as it takes to reach its destination on the network, which may be a matter of thousandths of a second. For another kind of package, that carried by the delivery company UPS, the identifier needs to last until the package is delivered and billed. A Social Security Number persists for the life of the individual to whom it was assigned. If we do develop a registry of authors for the purposes of tracking copyrights, that identifier will need to persist for at least seventy years after the death of the author, and preferably for as long as the author's works exist in some form.

Persistence of identifiers is a particular issue for libraries and other cultural heritage institutions because we have no end date on our commitment to the resources we manage. As the above examples show, in other contexts the identifier has a term of usefulness after which it can be retired. Persistence, however, is not a characteristic of the identifier technology: "... persistence is a function of organizations, not technology."^⁴ Because identifiers are an assertion of a relationship between a string and a thing, they persist as long as the assertion is maintained. This is the difference between your average URL, which may have no expectation of persistence behind it, and the persistent URL (PURL) service at OCLC that has an organizational commitment to its longevity. The Archival Resource Key (ARK) developed at the University of California includes a commitment statement from the service managing the identifiers as part of its design.^⁵

An aspect of persistence that is particularly relevant to complex cultural resources is how a change in the resource is reflected in the identifier assignment. This is not limited to the identification of digital materials; the impact of new editions for monographs and title changes for serials was an issue for libraries long before the digital age. Decisions such as these need to be governed by clear policies on the part of the agency managing the assignment of the identifiers. Where this aspect of persistence is not managed, such as on the Web, there is no guarantee that the same identifier will point to the same resource at different moments of time. When we cite Internet resources we often feel obliged to qualify our citation with a date ("accessed Feb. 6, 2005") precisely because there is no guarantee of persistence.

Another area where identifiers may or may not persist is in the meaning and use of the identifier itself. Identifiers are used in systems and a society that are in constant change. We have seen that the Social Security number, originally intended to connect a person, her earnings, and the Social Security Administration, has become a de facto personal identifier for schools, medical insurers, and banks. The International Standard Book Number (ISBN) has been famously assigned to teddy bears and biscotti to support the needs of retail bookstores. Today you can retrieve your boarding pass at an airport using a credit card number, even if that credit card was not the one originally used to purchase the electronic ticket. Each of these is an example of the evolution of what the identifier represents, and each is also a use outside of the arena of commitment of the managing agency. The bank does not guarantee that your credit card will be recognized by the airline's ticket machine, and the Social Security Administration has no responsibility over the uses of the SSN beyond its own. Opportunistic uses of identifiers may facilitate business functions, but reliability and commitment may be sacrificed in the process.

Global Identifiers

It is often said that a certain class of identifiers must be "globally unique." That is, that they can be used anywhere in any system and will never overlap with an identifier assigned by someone else. This is a growing concern that arises out of the increasing interaction between systems in the digital and networked world. The common experience is that an identifier is created within a system or within a context, and at a later date it needs to be used in another or larger context. At that point, the identifier may no longer be unique. An example from our environment is the MARC record identifier within the local ILS. It is common that every database assigns a unique identifier to each record stored within it. But if at a later date the library wishes to participate in a union catalog, these record identifiers could very easily overlap with those of other libraries.
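The union catalog problem is easy to reproduce. The following sketch assumes two made-up libraries whose local systems each number their records independently. The library names, record numbers, and function name are illustrative only.

```python
def find_collisions(catalogs):
    """Return {record_id: [library names]} for ids used by more than one library."""
    owners = {}
    for library, records in catalogs.items():
        for record_id in records:
            owners.setdefault(record_id, []).append(library)
    return {rid: libs for rid, libs in owners.items() if len(libs) > 1}


# Each local system guarantees uniqueness only among its own records.
catalogs = {
    "Library A": {"1001": "Moby Dick", "1002": "Remembrance of Things Past"},
    "Library B": {"1001": "Dick and Jane", "2001": "Moby Dick"},
}

collisions = find_collisions(catalogs)
print(collisions)
```

Record "1001" is unique inside each system, yet in the combined context it names two different records.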

There are techniques that allow the creation of globally unique identifiers. One of these is the Universally Unique Identifier (UUID), a mathematically derived 128-bit number that is virtually guaranteed to be unique for the next millennium. Although this is a solution, it is perhaps more rigorous than most of us wish to undertake. A simpler solution is the one used by the Uniform Resource Locator (URL): because every Internet site must have a unique address assigned through the domain name system, the owner of that address can prepend it to any string, essentially saying: "what follows is my identifier." The file index.html exists in many millions of instances throughout the World Wide Web, yet each one is uniquely identified by the domain name and path that precede it. Similarly, the bibliographic record identifier from a library system can be allowed to interact in a larger bibliographic context by prefixing it with the library's code from the MARC Code List for Organizations that is managed by the Library of Congress.^⁶ In this case, "global" uniqueness is really global within a large but not universal context. There is nothing to prevent another community from creating an identifier that would be the same as one from a library database, including the organization code, but one weighs the risk of this occurring against the efficiency and cost of the solution.
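Both approaches can be sketched briefly. The example below uses Python's standard uuid module for the first, and a simple prefix for the second. The organization codes "ORG-A" and "ORG-B" are placeholders rather than real entries in the MARC Code List for Organizations, and the colon separator is an arbitrary choice made for illustration.

```python
import uuid


def new_uuid() -> str:
    """Return a randomly generated 128-bit UUID in its usual text form."""
    return str(uuid.uuid4())


def prefix_with_org(org_code: str, local_id: str) -> str:
    """Scope a local record id by the code of the organization that assigned it."""
    return f"{org_code}:{local_id}"


def build_union(catalogs):
    """Merge local catalogs, keyed by organization-prefixed identifiers."""
    union = {}
    for org_code, records in catalogs.items():
        for local_id, title in records.items():
            union[prefix_with_org(org_code, local_id)] = title
    return union


catalogs = {
    "ORG-A": {"1001": "Moby Dick", "1002": "Remembrance of Things Past"},
    "ORG-B": {"1001": "Dick and Jane"},
}
union = build_union(catalogs)
print(sorted(union))
print(new_uuid())
```

The prefixed form stays readable and keeps the local number intact, but it is unique only as long as the organization codes themselves are managed. The UUID needs no registry at all, at the cost of carrying no information about who assigned it.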

Identifiers in the Library Environment

Libraries have a long history of the use of identifiers. Incredibly, one of the more common identifiers in use in libraries today, the Library of Congress Control Number (LCCN), was first used in 1898.^⁷ ISBNs, which we now take for granted, were only first assigned in 1970.^⁸ ISSNs came into use in 1975.^⁹ The ISBN and the ISSN, along with their newer cousins the ISMN (International Standard Music Number)^¹⁰ and the ISAN (International Standard Audiovisual Number)^¹¹, are all standards agreed on through the International Organization for Standardization (ISO). Other identifiers, not unlike the LCCN, have become standard through use rather than through a formal standards process. The OCLC record number is commonly used to identify machine-readable bibliographic records that were originally obtained from the OCLC database. The PubMed record identifier (PMID), which represents the National Library of Medicine bibliographic record in the way that the LCCN represents the Library of Congress bibliographic record, is commonly listed in citations of articles in the medical field. The PMID often serves as a unique link between an article citation and the full text of the article even though this latter is in a database unrelated to PubMed. The Digital Object Identifier (DOI) is a system that resolves a standard DOI string to publisher services related to the digital resource.^¹² The DOI string has become accepted as an identifier even when resolution is not desired. Yet none of these identifiers covers the entire world of intellectual resources, and there is overlap among them: the LCCN and the ISBN both identify books and their metadata; the DOI and the PMID both identify a subset of the world of journal articles. A universal resource identifier simply does not exist.

One particular issue related to the longevity of libraries and library identifiers is the need to use identifiers from our past in the current highly-networked digital systems. There are two aspects to this: the first is that we have to specify the name space of the identifier; the second is that we have to be able to structure the identifier to meet current standards. The primary identifier standards for the networked world are the Uniform Resource Identifier (URI), the Uniform Resource Name (URN), and the Uniform Resource Locator (URL). These are defined as:

URI - The basic identifier method on the web. A URI is unique in the context of the web.^¹³
URN - An identifier of the URI type but that is not limited to the location of the resource.^¹⁴
URL - An identifier that uses the network location of the resource as its identification.^¹⁵

Each of these has a mechanism for assigning name spaces to identifiers that ensures that the identifiers created are unique. The uniqueness of the URL is guaranteed by the domain name registration process, since the first part of the URL is the domain name of the location. This means that all URLs belonging to the domain owned by the Library of Congress will begin with "…loc.gov" and URLs belonging to Microsoft Corporation will begin with "…microsoft.com." The top level of the URN^¹⁶, which determines uniqueness for those identifiers, is managed by the Internet Assigned Numbers Authority (IANA). Of the identifiers in use in the library community, the ISBN, ISSN and the ISAN are all registered as URNs. For example, an ISBN expressed as a URN would look like: urn:ISBN:0-395-36341-1. Unfortunately, those are the only identifiers used in the library community that have that registration. The URI is defined as having the format "scheme:…", and IANA-registered URI schemes include the familiar "http" and "ftp" as well as over four dozen others. Included in the URI list are schemes for Z39.50 retrieval and session.
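To see how the leading part of each identifier names the space in which the rest is unique, one can split a URL and a URN with Python's standard urllib.parse module. This is a minimal sketch: the function name and the keys of the returned dictionary are illustrative assumptions, and the URL shown is simply one from the Library of Congress domain.

```python
from urllib.parse import urlsplit


def describe(identifier: str) -> dict:
    """Split an identifier into its kind, its name space, and the local value."""
    parts = urlsplit(identifier)
    if parts.scheme == "urn":
        # A URN has the shape urn:<namespace>:<namespace-specific string>.
        namespace, _, specific = parts.path.partition(":")
        # URN namespace identifiers are case-insensitive, so normalize them.
        return {"kind": "urn", "namespace": namespace.lower(), "value": specific}
    # For a URL, the registered domain name plays the role of the name space.
    return {"kind": parts.scheme, "namespace": parts.netloc, "value": parts.path}


print(describe("urn:ISBN:0-395-36341-1"))
print(describe("http://www.loc.gov/marc/lccn.html"))
```

In both cases the part before the local value is controlled by a registry: IANA for the URN namespace, the domain name system for the URL.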

Between the URN and the URI we were still lacking standard network identifiers for most of the identifiers that are used by library systems. This has recently been rectified through the introduction of a new URI scheme called "info."^¹⁷ The "info" URI is specifically designed to allow a wide range of commonly used identifiers to be defined in the URI format. This provides a home for all of those identifiers that were developed either before or outside of the Internet identifier standards. Thus an LCCN can be expressed as "info:lccn/2002022641" and a Dewey Decimal Classification number can be "info:ddc/22/eng//004.678." Before using the "info" URI format for an identifier, the identifier and its name must be registered in the "info" URI registry.^¹⁸ The registry includes some key information, such as the contact for the agency that maintains the identifier. Already more than a dozen identifiers have been registered there including the LCCN, the PubMed identifier, the DOI, and the OCLC record number. The creation of the "info" URI means that library applications can interact over the Internet in ways that will be understood by non-library systems. We now have the capability to use our community's identifiers wherever a standard URI format is required.
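The shape of an "info" URI, a registered namespace followed by a slash and the identifier, is simple enough to build and take apart by hand. The sketch below is an illustration only: the function names are assumptions, and it checks neither the "info" registry nor any character-escaping rules from the specification.

```python
def make_info_uri(namespace: str, identifier: str) -> str:
    """Express an identifier from a registered namespace as an "info" URI."""
    return f"info:{namespace}/{identifier}"


def parse_info_uri(uri: str):
    """Return (namespace, identifier) from an "info" URI, or raise ValueError."""
    scheme, colon, rest = uri.partition(":")
    if not colon or scheme.lower() != "info":
        raise ValueError(f"not an info URI: {uri!r}")
    # Only the first slash separates the namespace from the identifier;
    # later slashes belong to the identifier itself.
    namespace, slash, identifier = rest.partition("/")
    if not namespace or not slash or not identifier:
        raise ValueError(f"missing namespace or identifier: {uri!r}")
    return namespace.lower(), identifier


print(make_info_uri("lccn", "2002022641"))
print(parse_info_uri("info:ddc/22/eng//004.678"))
```

Splitting on the first slash only is what lets the Dewey example keep its own internal slashes as part of the identifier.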

Conclusion

As our digital systems grow in complexity and in reach, the number of identifiers also grows. Many of these are internal to the systems and will not be relevant to information exchange. But others may surprise us and, like the LCCN, will become key components of functions that were inconceivable at the time the identifier was first created. Library identifiers must be able to conform to the existing network standards, in particular the use of the URI and the URN identifier formats, when library resources interact on the Internet. This is yet another way in which we are breaking down the barriers between libraries and the larger world of information resources.

1 National Information Standards Organization. NISO Identifier Roundtable, March 13-16, 2006. http://www.niso.org/news/events_workshops/ID-docs/ID-06-report.pdf (April 14, 2006)

2 Wikipedia: http://en.wikipedia.org/wiki/Identifier (April 14, 2006)

3 California Digital Library. Archival Resource Key (ARK). http://www.cdlib.org/inside/diglib/ark/ (April 14, 2006)

4 Keith Shafer, Stuart Weibel, Erik Jul, Jon Fausey. Introduction to Persistent Uniform Resource Locators. OCLC. http://purl.oclc.org/docs/inet96.html (April 14, 2006)

5 John A. Kunze. Towards Electronic Persistence Using ARK Identifiers. http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf (April 14, 2006)

6 Library of Congress. Network Development and MARC Standards Office. MARC Code List for Organizations. http://www.loc.gov/marc/organizations/orgshome.html (April 14, 2006)

7 Library of Congress. Network Development and MARC Standards Office. Library of Congress Control Number (LCCN). http://www.loc.gov/marc/lccn.html (April 14, 2006)

8 International ISBN Agency. The ISBN User's Manual. 4th edition. Berlin, 2001. p. 1. Available: http://www.isbn-international.org/en/userman/download/ISBNmanual.zip (April 14, 2006)

9 ISSN International Centre. ISSN and the ISO Standards. http://www.issn.org/node/106 (April 14, 2006)

10 International ISMN Agency. http://www.ismn-international.org/

11 International ISAN Agency. http://www.isan.org

12 International DOI Federation. DOI System Overview. http://www.doi.org/overview/sys_overview_021601.html (April 14, 2006)

13 Internet Engineering Task Force. RFC 3986: Uniform Resource Identifier. 2005 http://www.gbiv.com/protocols/uri/rfc/rfc3986.html (April 14, 2006)

14 Internet Engineering Task Force. RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. 1998 http://www.ietf.org/rfc/rfc2396.txt (April 14, 2006)

15 Internet Engineering Task Force. RFC 3406: Uniform Resource Names (URN) Namespace Definition Mechanisms. 2002 http://www.ietf.org/rfc/rfc3406.txt (April 14, 2006)

16 Internet Assigned Numbers Authority. URN Namespaces. February, 2006. http://www.iana.org/assignments/urn-namespaces (April 14, 2006)

17 Internet Engineering Task Force. RFC 4452: The "info" URI Scheme for Information Assets with Identifiers in Public Namespaces. http://www.ietf.org/rfc/rfc4452.txt (April 14, 2006)

18 "info" URI Scheme. http://info-uri.info/registry/

The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.