by Karen Coyle
Preprint. Published in Journal of Academic Librarianship, v. 31, n. 2, pp 160-163
"Metadata is cataloging done by men."1
The world of information technology is awash in talk of metadata. Everyone today seems to be creating a metadata format. There is a <meta> tag in HTML to carry metadata for Internet resources; scientists have developed metadata to describe genomes; publishers have a metadata format to facilitate the transfer of promotion and price data to retailers. What is happening in the world of technology that is leading everyone to believe that metadata is the answer? Alternatively, if metadata is the answer, what is the question, and what does it mean for libraries and library catalogs?
First we have to define what it is that we mean by metadata. The common definition is that metadata is "data about data." This definition is catchy, but it doesn't help us understand what metadata is all about. What follows is much less catchy but it does provide us with a way to understand metadata. To begin with, metadata is constructed information, which means that it is of human invention and not found in nature. A good example of constructed information is the use of longitude and latitude to describe the earth and points thereon. The real planet obviously does not have lines going around it, although we are by now very accustomed to seeing maps and globes with them, but the invention of longitude and latitude allows us to talk about locations on the planet and to navigate precisely across vast expanses with no landmarks to guide us.
This leads us to a second necessary characteristic of metadata: metadata is developed by people for a purpose or a function. So a map of a subway system that is handed out to riders uses color coding of routes and symbols to guide the riders through the maze of routes and transfer points. This map is often only barely representative of the actual scale and geography of the city that is served by the subway, but it is useful precisely because it emphasizes a subway-centric view at the expense of geographic accuracy. A road map of the same area would be more true to geography, but if that map were designed by the tourist board it would highlight hotels, museums, points of interest, and parking opportunities. A map of an area used by the hiking club would emphasize topography and natural landmarks. Just as there is no single kind of map that serves all needs, there is no one kind of metadata for documents or other information objects. This is because it is not the object itself that determines the metadata but the needs and purposes of the people who create it and those whom it will serve. Without getting too metaphysical, metadata is not the world, it is how we see the world at some moment in time for some purpose.
Metadata is also often used as a surrogate for the real thing. In a library catalog, the entries are surrogates for the books on the shelves. While it would be hard for library users to look at each book to determine which one they want, at least the physical book is there. In the digital environment, the surrogate role of metadata is key because many resources are not easily browsable and others do not carry clear data about themselves. The rise in interest in metadata is part of the effort to organize our rather messy world of digital resources and to provide access and services where none existed before. It is also a way to exchange data between disparate stores of resources, and to allow searching across digital warehouses.
Two acronyms that you will hear used alongside any mention of metadata are XML and RDF. XML is the eXtensible Markup Language and RDF is the Resource Description Framework. Some people speak of XML and RDF as if they are themselves metadata formats, but this is a confusion between form and content. Both XML and RDF are actually general data formats that can be used for any number of applications. In particular, XML is often used as a document format, and it is a close relative of HTML: both descend from the older SGML markup standard.
If you are unfamiliar with the record structure of XML, it may seem fairly complex and mysterious. In fact, in its basic form it is very simple, although it is possible to create complicated data records with it. If you think of the MARC record as having fields with tags, such as this use of "245" to mean "title":
245 $a Hamlet, Prince of Denmark
then XML is just another way to tag a piece of data, although it insists on putting a beginning tag and an ending tag (with a "/" before the tag name) around each data element:
<title>Hamlet, Prince of Denmark</title>
The tags can be anything you would like them to be, as long as you pre-define them in a data format definition structure. So if you prefer, your definition could have any of these for a title:
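<t245>Hamlet, Prince of Denmark</t245>
<ti>Hamlet, Prince of Denmark</ti>
(An XML tag name cannot begin with a digit, so a bare "245" needs a leading letter, as in "t245" here.)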
XML, like the MARC tags and subfields, is essentially hierarchical. Its advantage over MARC21 is that it can have as many hierarchical levels as are necessary, as opposed to MARC21's two levels of tag and subfield. In XML the hierarchies are "nested" like Russian dolls to whatever level is needed.
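To make the nesting concrete, here is a minimal sketch in Python that reads a small XML record and reports how deep each element sits. The record and its tag names (record, author, name, family, given, dates) are invented for illustration and do not come from any standard.

```python
import xml.etree.ElementTree as ET

# A made-up record: the tag names are illustrative, not from any standard.
RECORD = """
<record>
  <title>Hamlet, Prince of Denmark</title>
  <author>
    <name>
      <family>Shakespeare</family>
      <given>William</given>
    </name>
    <dates>1564-1616</dates>
  </author>
</record>
"""

def walk(element, depth=0):
    """Return (depth, tag, text) for the element and all of its descendants."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:
        rows.extend(walk(child, depth + 1))
    return rows

root = ET.fromstring(RECORD)
rows = walk(root)

# Print an indented outline, one line per element.
for depth, tag, text in rows:
    print("  " * depth + tag + (": " + text if text else ""))
```

The outline goes four levels deep (record, author, name, family), which a tag-and-subfield structure could not express without flattening.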
The Resource Description Framework (RDF) is a step or two beyond XML. RDF emphasizes the relationships between data elements. A key relationship in RDF is "about," where a web resource is the subject of the RDF statement, and other fields in the statement are about that resource. That is the simplest case. RDF can also make use of relationships such as:
subClassOf
subPropertyOf
member
isDefinedBy
and others. RDF is a necessary component of the effort called the "semantic web," an effort of the World Wide Web Consortium to add a semantic component to the sharing of data over the Internet. RDF is more complex than XML and less widely used, and it isn't yet clear whether it will succeed as a general language for describing the world of the web. It definitely seems to require a deeper understanding of certain philosophical concepts than does XML, and the number of people who find it inherently puzzling (and I am in that group) is much greater than the number who see it as a solution. (The example below of a Creative Commons record uses a simple RDF format.)
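One way to get a feel for RDF's emphasis on relationships is to treat each statement as a three-part tuple: a resource, a relationship, and a value. The sketch below uses plain Python tuples rather than an RDF library. The "ex:" class names and the use of rdf:type are assumptions made up for the illustration; only the document URL and title are taken from the Creative Commons record in this article.

```python
# Each statement is a (resource, relationship, value) triple.
# The "ex:" names are hypothetical classes invented for this sketch.
DOC = "http://www.kcoyle.net/meta_purpose.html"
TRIPLES = [
    (DOC, "dc:title", "Metadata: Data with a Purpose"),
    (DOC, "rdf:type", "ex:Article"),
    ("ex:Article", "rdfs:subClassOf", "ex:Document"),
    ("ex:Document", "rdfs:subClassOf", "ex:Resource"),
]

def about(resource, triples):
    """Collect every (relationship, value) pair stated about one resource."""
    return [(rel, val) for res, rel, val in triples if res == resource]

def superclasses(cls, triples):
    """Follow subClassOf links upward from a class, nearest first."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for res, rel, val in triples:
            if res == current and rel == "rdfs:subClassOf" and val not in found:
                found.append(val)
                frontier.append(val)
    return found

print(about(DOC, TRIPLES))
print(superclasses("ex:Article", TRIPLES))
```

The second function hints at why relationships such as subClassOf matter: a program that knows the document is an article can work out, without being told directly, that it is also a document and a resource.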
As librarians, we will primarily work with metadata for documents and document-like objects, although given our line of work we could find ourselves storing, organizing, and providing services around other metadata types such as scientific metadata. But for this article, I will concentrate on metadata that describes documents, with the main question being: how is this metadata different from library cataloging? Note that the metadata formats introduced in this article (Dublin Core, MODS, and METS) are only three of many that are in use today, but they are the three most commonly used in digital libraries.
Library cataloging is undoubtedly the sine qua non of document metadata. It can trace its origins back to the mid-1800s with Jewett's and Panizzi's rules. It is familiar to just about every moderately educated person in the Anglo-American world. In sheer numbers, instances of library cataloging greatly overwhelm any other metadata scheme being used for books (although possibly not for journal articles). And yet, when developers in Internet applications needed metadata for online documents, they did not adopt the library standard. In fact, the document metadata standard most often found in non-library applications is Dublin Core.2 To understand why, we need to look at purposes.
creator = Karen Coyle
title = Understanding Metadata and its Purpose
date = December, 2004
description = The first draft of an article for Journal of Academic Librarianship
subject = metadata
type = text
The hope of Dublin Core was that documents on the Internet would carry their own bibliographic descriptions and therefore would have coded data elements for information such as author, title, and date. In a sense, this represents a very librarian-like point of view, which is that it should be possible to find a document by its author or its title. On the Internet today Dublin Core is indeed heavily used, although it has not resulted in the creation of a catalog of Internet resources. Instead, Dublin Core has become the document description metadata for a variety of web-based applications. An example of this is the Creative Commons license.
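As a rough illustration of a document carrying its own description, the following Python sketch turns a Dublin Core record into HTML <meta> tags. It assumes the common convention of prefixing element names with "DC." and declaring the prefix with a schema link; the function name to_meta_tags is made up for this example, and real projects may follow a different embedding convention.

```python
from html import escape

# The sample record from this article, minus the long description.
RECORD = {
    "creator": "Karen Coyle",
    "title": "Understanding Metadata and its Purpose",
    "date": "December, 2004",
    "subject": "metadata",
    "type": "text",
}

def to_meta_tags(record):
    """Render a Dublin Core record as HTML <meta> tags (assumed 'DC.' prefix)."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        # Escape the value so quotes or ampersands cannot break the attribute.
        lines.append('<meta name="DC.%s" content="%s">'
                     % (element, escape(value, quote=True)))
    return "\n".join(lines)

print(to_meta_tags(RECORD))
```

The resulting lines would sit in the head of a web page, where software can read them without having to interpret the page's text.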
Creative Commons is both a web service and a social movement. It was developed by Larry Lessig, a Stanford law professor known for his criticism of the strengthening of copyright law at the expense of the public's rights to use and re-use the ideas of their predecessors.3 In the interest of making it possible for creators to give permission for the use of their works, a small set of licenses was developed that can be easily attached to files on the Internet. These licenses state what uses and reuses are granted by the creator of the work. In addition to the license, the Creative Commons software allows the creator to add a small amount of what librarians would call "descriptive" metadata: creator, title, date, and a short description of the item. These use the Dublin Core data elements creator, title, date, description (coded in the record as "dc:creator," "dc:title," etc.).
<!-- /Creative Commons License -->
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Work rdf:about="">
  <dc:title>Metadata: Data with a Purpose</dc:title>
  <dc:date>2004</dc:date>
  <dc:description>A general discussion of document/resource metadata and some related uses.</dc:description>
  <dc:creator><Agent>
    <dc:title>Karen Coyle</dc:title>
  </Agent></dc:creator>
  <dc:rights><Agent>
    <dc:title>Karen Coyle</dc:title>
  </Agent></dc:rights>
  <dc:type rdf:resource="http://purl.org/dc/dcmitype/Text" />
  <dc:source rdf:resource="http://www.kcoyle.net/meta_purpose.html"/>
  <license rdf:resource="http://creativecommons.org/licenses/by-nc-nd/2.0/" />
</Work>
<License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.0/">
  <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
  <permits rdf:resource="http://web.resource.org/cc/Distribution" />
  <requires rdf:resource="http://web.resource.org/cc/Notice" />
  <requires rdf:resource="http://web.resource.org/cc/Attribution" />
  <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
</License>
</rdf:RDF>
-->
Making use of the Creative Commons license requires no understanding of copyright law or of contracts, and the descriptive elements are ones that nearly anyone can easily understand. In this sense, Dublin Core has achieved its purpose by providing a core of descriptive elements that can be embedded in a variety of web applications.
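Because the license record is ordinary XML, a program can read it with very little effort. The sketch below, which assumes Python's standard xml.etree.ElementTree module and a shortened copy of the record above, pulls out the title, the license address, and the permitted, required, and prohibited uses. The helper name terms is invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the record; "cc" stands for its default namespace.
NS = {
    "cc": "http://web.resource.org/cc/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
RDF_RESOURCE = "{" + NS["rdf"] + "}resource"

# A shortened copy of the Creative Commons record, without the comment wrapper.
CC_RECORD = """<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <dc:title>Metadata: Data with a Purpose</dc:title>
    <dc:date>2004</dc:date>
    <license rdf:resource="http://creativecommons.org/licenses/by-nc-nd/2.0/" />
  </Work>
  <License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.0/">
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>"""

root = ET.fromstring(CC_RECORD)
work = root.find("cc:Work", NS)
title = work.find("dc:title", NS).text
license_url = work.find("cc:license", NS).get(RDF_RESOURCE)

def terms(kind):
    """Last path segment of each rdf:resource of the given kind in the License."""
    license_el = root.find("cc:License", NS)
    return [el.get(RDF_RESOURCE).rsplit("/", 1)[-1]
            for el in license_el.findall("cc:" + kind, NS)]

print(title, license_url)
for kind in ("permits", "requires", "prohibits"):
    print(kind, terms(kind))
```

Note that the descriptive part (dc:title, dc:date) and the rights part (permits, requires, prohibits) come from two different vocabularies that share one record.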
One of the things that makes Dublin Core easy and usable by anyone is that there are no cataloging rules involved. This is something that goes against the grain of library cataloging and it definitely reduces the re-usability of the contents of Dublin Core records. There are descriptions of each data element in the Dublin Core standard so the meaning of the data element is generally defined, but it is equally valid to say "creator = Karen Coyle" as to say "creator = Coyle, Karen." The advantage to this is that Dublin Core is likely to be useful to a number of different communities and cultures; the obvious disadvantage is that the content of the fields is not uniform across applications, making interoperability a problem.
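The interoperability problem is easy to demonstrate. In this Python sketch, two records for the same person fail a simple comparison because one uses direct order and the other inverted order. The name_key function is a deliberately crude, hypothetical normalization; it is only an assumption about how a system might paper over the difference, and it would mishandle corporate names or names that contain commas for other reasons.

```python
def exact_match(a, b):
    """What a system sees when it compares field contents as given."""
    return a == b

def name_key(value):
    """Crude normalization: treat 'Family, Given' and 'Given Family' alike."""
    if "," in value:
        family, given = [part.strip() for part in value.split(",", 1)]
        value = given + " " + family
    # Lowercase and collapse runs of whitespace.
    return " ".join(value.lower().split())

record_a = {"creator": "Karen Coyle"}
record_b = {"creator": "Coyle, Karen"}

print(exact_match(record_a["creator"], record_b["creator"]))          # False
print(name_key(record_a["creator"]) == name_key(record_b["creator"]))  # True
```

Cataloging rules remove the need for this kind of guesswork by fixing the form of the name at the time the record is created.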
The MARC format is a highly sophisticated record for encoding bibliographic information. It is well-known in the library world and supported by library systems in the US, Canada, and other countries, especially in the English-speaking world. In the networked environment where descriptive metadata can be transferred across systems and can be included in or with other kinds of metadata, it would seem to be ideal to use MARC records for this purpose. The problem for MARC, however, is that this embedding generally requires the use of the XML data structure, and MARC is not an XML record. The Library of Congress has created a way to translate the MARC record to XML, but it hasn't gained many enthusiasts, and probably for good reason: the MARC record is larger and more detailed than most systems need, and its use of numeric tags and subfield codes makes it hard to understand without considerable training. What was needed was a kinder, gentler version of MARC that could accept the key data elements from a MARC record and transmit them in an easy-to-understand XML format. So the Metadata Object Description Schema (MODS) was born.
MODS uses human-understandable tags in place of the three-digit tags and subfield codes of MARC (e.g. "title" instead of "245"). It ignores most of the fixed field data elements, with the exception of the physical format codes (from the 007) and the many codes for genre (from the 008). It also introduces some efficiencies and some innovations. MODS defines a structure called "name" that represents the fields and subfields for personal and corporate names and for conferences. This structure can be used anywhere that names would appear, either as main entries, added entries, or subjects. So a name field like:
<name type="personal">
  <namePart>Shakespeare, William</namePart>
  <namePart type="date">1564-1616</namePart>
</name>
can be used as an author field, or it can become part of a subject heading:
<subject authority="lcsh">
  <name type="personal">
    <namePart>Shakespeare, William</namePart>
    <namePart type="date">1564-1616</namePart>
  </name>
  <topic>Bibliography</topic>
  <topic>Periodicals</topic>
</subject>
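The reuse of the name structure can be sketched in a few lines of Python. This example builds the name element once and places copies of it in two roles. It is a simplification: it omits the XML namespace and the other elements that a real MODS record would carry, and the helper make_name is invented for the illustration.

```python
import copy
import xml.etree.ElementTree as ET

def make_name(name, dates):
    """Build one personal-name structure with a name part and a date part."""
    el = ET.Element("name", {"type": "personal"})
    ET.SubElement(el, "namePart").text = name
    ET.SubElement(el, "namePart", {"type": "date"}).text = dates
    return el

shakespeare = make_name("Shakespeare, William", "1564-1616")

# Simplified record: no namespace declaration, only the elements discussed here.
record = ET.Element("mods")

# First use: the name stands on its own, as an author.
record.append(copy.deepcopy(shakespeare))

# Second use: the same structure nested inside a subject heading.
subject = ET.SubElement(record, "subject", {"authority": "lcsh"})
subject.append(copy.deepcopy(shakespeare))
ET.SubElement(subject, "topic").text = "Bibliography"
ET.SubElement(subject, "topic").text = "Periodicals"

print(ET.tostring(record, encoding="unicode"))
```

Because the structure is identical in both places, software that knows how to display or index a name needs to learn that only once.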
Although it is derived from MARC21 and is much more detailed than Dublin Core, MODS has far fewer rules than MARC21. As in Dublin Core, there are no required fields and all fields are repeatable. MODS carries over many values from MARC, but it also makes radical departures from MARC21: there are no "main entry" or "added entry" concepts, so all authors are simply authors, and a record can have multiple titles without a single "main title." When MARC21 records are translated to MODS, you get a record in XML that is a kind of "MARC-lite." MODS records can also be created from bibliographic metadata that did not originate as library cataloging, such as article citations, and MODS is often used in databases that will have a mixture of library cataloging and other bibliographic data.
There is document metadata whose purpose is not "description" in the cataloging sense of that term. One example is a metadata format that is being used by digital libraries and archives called Metadata Encoding and Transmission Standard (METS). METS refers to its role as that of a "wrapper" and it serves to hold together the files that make up a digital object. Unlike a bound book, digital documents are often made up of a number of separate files representing pages or other units. And unlike a physical book, there is no visible cover or title page, nor can one thumb through the pages to find a particular place in the book. Think of METS as the binding, cover and navigation for a group of digital files. It also includes technical information that will be needed to manage and understand those files, such as the file formats, the technology used in scanning if the item began its life on paper, and the digital transformations and compression that have been used on the files. What METS does not define is the descriptive metadata. Instead, it allows those creating the METS records to embed whatever descriptive metadata they wish to use for those materials. This illustrates an important characteristic of the world of metadata, which we also saw in the Creative Commons example: metadata can be reused rather than reinvented. METS records routinely carry descriptive metadata in Dublin Core or in MODS.
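The wrapper idea can be modeled loosely in Python. The classes and field names below are hypothetical and are not the element names of the METS schema; the sketch only shows the division of labor described above, with the wrapper tracking files and their order while the descriptive metadata is carried inside it in whatever scheme the creators chose.

```python
from dataclasses import dataclass, field

@dataclass
class PageFile:
    path: str          # hypothetical file name
    file_format: str   # technical information about the file
    order: int         # position in the reading sequence

@dataclass
class DigitalObject:
    descriptive: dict        # embedded description, in any scheme
    descriptive_scheme: str  # e.g. "Dublin Core" or "MODS"
    files: list = field(default_factory=list)

    def pages_in_order(self):
        """The 'binding': file paths in reading order, however they were stored."""
        return [f.path for f in sorted(self.files, key=lambda f: f.order)]

book = DigitalObject(
    descriptive={"title": "Hamlet, Prince of Denmark"},
    descriptive_scheme="Dublin Core",
    files=[
        PageFile("page2.tif", "image/tiff", 2),
        PageFile("page1.tif", "image/tiff", 1),
    ],
)

print(book.descriptive_scheme, book.descriptive["title"])
print(book.pages_in_order())
```

Swapping the Dublin Core description for a MODS one would change only the two descriptive fields; the file list and its ordering would be untouched.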
So what does all of this have to do with library cataloging, and, most importantly, will metadata replace cataloging? Above I said that one of the main problems with the Dublin Core record is that it lacks cataloging rules and therefore there is little predictability between communities or projects in terms of the content of the fields. What library cataloging and catalogs provide is a high degree of conformity in the data captured in the records. This conformity is a service to users, who can move from one library to another comfortably. But the main value of the conformity is our ability to catalog cooperatively and exchange cataloging records between libraries and library systems. It also allows library systems vendors to create a product that can be used in any library, just as the standard sized catalog card could fit into any card catalog drawer.
The efficiencies that result from this conformity are enormous and the library community depends on this for the cataloging of its primary materials. But as libraries move into the organization of less traditional materials, neither the cataloging rules nor the library systems provide workable solutions. Imagine that you have an archive that has photographs of your city from the early 20th century, and you'd like to make these available on the web. And let's say that you have about one thousand of them. For most of them, you have no idea of the photographer, and often no date. Someone in the past has penciled on the back what the photograph represents, e.g. "Main Street, circa 1910." To catalog and produce MARC21 records of these photographs would be very time consuming, and the resulting records would have little information. Instead, you can create a Dublin Core record that simply has:
date = circa 1910
description = Main Street
This record cannot be entered into your online catalog, although records like this can be targets of metasearch technologies that allow a single search to go against multiple databases with different metadata formats. The main advantage is that these records could be quickly and easily created by library staff with a minimum of training, and therefore some metadata could be created for resources that otherwise would get none.
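A metasearch of this kind can be sketched as a single function that knows, for each database, which fields to look in. Everything here is an assumption made for illustration: the two tiny databases, the catalog record (which is invented), and the choice of which fields to search. Real metasearch products work against remote systems and are far more involved.

```python
# A MARC-like catalog keyed by tag; this record is invented for the sketch.
CATALOG = [
    {"245": "Main Street in the early days", "100": "Smith, Jane"},
]

# The minimal Dublin Core-like photograph record from this article.
PHOTOS = [
    {"date": "circa 1910", "description": "Main Street"},
]

# For each source: its records and the fields worth searching in that format.
SOURCES = {
    "catalog": (CATALOG, ["245"]),
    "photos": (PHOTOS, ["description"]),
}

def metasearch(term, sources):
    """Run one search term against every source, using each source's own fields."""
    term = term.lower()
    hits = []
    for name, (records, fields) in sources.items():
        for record in records:
            if any(term in record.get(f, "").lower() for f in fields):
                hits.append((name, record))
    return hits

for source, record in metasearch("main street", SOURCES):
    print(source, record)
```

The search succeeds across both formats only because someone has decided that "245" and "description" play comparable roles, which is a small reminder that the content of the fields still matters.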
Metadata like Dublin Core lacks the level of predictability that would allow for a broad systematic re-use of the records. In fact, these metadata formats, and those other data formats that use them, are often used in ad hoc and stand-alone systems. As these ad hoc systems begin to exchange data, much as libraries began to in the late 19th century, developers may indeed come to the conclusion that it is the content of the metadata records, not their record structure, that makes the difference between a single-system solution and a coherent bibliographic universe. In other words, we may see that when metadata grows up, it becomes cataloging.
1 This quip is alternately attributed to Tom Delsey, of the National Library of Canada ("Metadata: Cataloguing for Men"), and Michael Gorman ("… metadata is cataloging done by men.")
2 The fifteen Dublin Core elements are: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type. For more information see http://dublincore.org
3 Lawrence Lessig is the author of: Code and Other Laws of Cyberspace (New York : Basic Books, c1999); The Future Of Ideas : The Fate Of The Commons In A Connected World (New York : Random House, 2001); Free Culture : How Big Media Uses Technology And The Law To Lock Down Culture And Control Creativity (New York : Penguin Press, 2004)
The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.