DCAM Explained

... or, an attempt to unravel key concepts (and open to correction)

by Karen Coyle

Introduction

If the term Dublin Core brings immediately to your mind a set of fifteen data elements, you need to try to wipe that thought out of your head. The Dublin Core Metadata Initiative (DCMI) is putting its energy today into fundamental metadata models, not any particular set of metadata elements. The most fundamental model in the DCMI suite is the Dublin Core Abstract Model (DCAM). Few outside of the small group that worked on the DCAM profess to understand it. I myself have read and re-read the DCAM document dozens of times and have still failed to have the "aha!" moment in which I would see how the DCAM explains everything of importance to metadata creation. I am, however, beginning to have some idea of what the creators of DCAM are aiming at even though I don't yet share their world view.

It helps to understand that the Dublin Core Abstract Model defines a structure for metadata. By structure I mean something that can be acted on in a computing environment. It's not about "meaning" in the human sense of the word but about a uniformity that can facilitate the creation of programs that process the metadata. That said, having an agreement on structure is important, albeit more satisfying for machines than for humans.

The term "abstract" in the DCAM does not mean that the concepts are fuzzy or imprecise. The DCAM is very precise, in fact. It is abstract because it does not provide an actual record format for the data elements that it defines; the DCAM is concepts, not a schema or record. Yet it makes use of actual programming conventions in its definitions, such as the requirement that certain elements be represented by Uniform Resource Identifiers (URIs). The mixture of abstract concepts and programming precision is one of the things that can trip up the reader.

Another difficulty is that the DCAM is based on, and the document assumes knowledge of, the Resource Description Framework (RDF). RDF itself is poorly understood because of its own difficult concepts and obscure documentation. In this document I am not going to tackle the RDF-ness of DCAM, in part because I think that an understanding of RDF is not necessary for the implementation of metadata that takes advantage of DCAM concepts. This will be seen as heresy by some, but I liken it to natural language: we all know that we learn to speak long before we know what a "noun" or a "verb" is. In the same way, I think that we can create successful metadata without a knowledge of its deep structure, at least at an elementary level.

Why should we bother to try to understand the DCAM? It's because the DCAM provides a neat metadata archetype that can help us communicate with each other about our metadata. It could simplify crosswalking of metadata sets, and make standards more understandable across communities. Unfortunately, it has not done so yet because it uses obscure terminology and some contorted thinking (and contains no examples and little explanatory text). While everything in the DCAM may be absolutely as it should be, I'm going to suggest a simplified view that I think will make it accessible to more people. This simplified view will undoubtedly be seen as incorrect in some areas by the real cognoscenti of DCAM, but perhaps that can be clarified by discussion or additional documents.

The Metadata Record

At this point, I suggest that you open up the main DCAM diagram in a separate window so you can refer to it as you read along here.

The four boxes in the upper left of the diagram give the primary metadata structure hierarchy that the DCAM defines, which is that a record contains one or more description sets, each of which contains one or more descriptions, which in turn contain one or more statements. I find this easier to understand by looking at it from the bottom up rather than top down.
DCAM / Me
statement: A statement is a key/value pair (e.g. title = "Moby Dick") that includes a metadata term (key) and the metadata value. DCAM uses the RDF term "property" for the metadata term. The same concept is called an "attribute" in FRBR.
description: The description is the set of statements or key/value pairs that describe a resource. For example, your metadata may describe two resources: a book and its creator. The book resource would have metadata statements like title and number of pages. The creator resource would have a name, dates of birth and death, and perhaps some biographical information. The idea of metadata that describes multiple resources is a bit hard for those of us in the library world to understand: we only have one description in our metadata set because our MARC record is completely flat. FRBR suggests a different view that has multiple resources or entities, like person and topic. In a FRBR-ized view of bibliographic data there could be descriptions for each FRBR entity within our full metadata set.
description set: The set of all of the descriptions in your metadata. Even though it has only one description, the MARC21 standard is a description set. A description set for FRBR would have descriptions for all of the FRBR entities, each of which would have statements for the related FRBR attributes. I can also imagine a description set that would include bibliographic data and authority data in a single set, something in between MARC21 and a fully FRBR'd metadata world.
record: All of the above wrapped in a machine-readable package. The package allows for transmission of the data, and may include administrative data (like a record identifier and date). The nature of this package is not specified in the DCAM.
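
For readers who think more easily in code, the hierarchy above can be sketched as nested data structures. This is only my own illustration in Python, not anything the DCAM prescribes: the class names follow the DCAM terms, but the field names (property_term, resource, record_id, date), the helper count_statements, and the sample record identifier are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Statement:
    """A key/value pair: a metadata term (the DCAM "property") and its value."""
    property_term: str
    value: str


@dataclass
class Description:
    """The set of statements that describe one resource."""
    resource: str  # hypothetical label for the thing being described
    statements: List[Statement] = field(default_factory=list)


@dataclass
class DescriptionSet:
    """All of the descriptions in the metadata."""
    descriptions: List[Description] = field(default_factory=list)


@dataclass
class Record:
    """Description sets wrapped in a package, plus optional administrative data."""
    description_sets: List[DescriptionSet] = field(default_factory=list)
    record_id: Optional[str] = None  # administrative data (assumed field name)
    date: Optional[str] = None       # administrative data (assumed field name)


def count_statements(record: Record) -> int:
    """Walk the hierarchy from the top down and count every statement."""
    return sum(
        len(description.statements)
        for description_set in record.description_sets
        for description in description_set.descriptions
    )


# Two resources, a book and its creator, described in one description set.
book = Description("book", [Statement("title", "Moby Dick")])
creator = Description("creator", [Statement("name", "Herman Melville")])
record = Record([DescriptionSet([book, creator])], record_id="example-0001")
```

A flat MARC-like record would be the special case in which the description set holds exactly one description.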

 

The DCAM diagram for this section is:

 

My diagram is:

The Metadata Itself

Now we get into the real meat -- the definitions of metadata terms and values. I have some problems with some of the definitions here, but overall I think that this typology is useful.

The remainder of the DCAM diagram defines the key/value pairs that are valid in the DCAM. This part of the diagram and the description is very detailed. I will begin with a simplified, macro view.

What this attempts to show (and I apologize, I couldn't get all the arrows to work properly in the drawing program) is that each statement has a property and a value. This is the same as saying that each statement is a key/value pair. The values are of two types: strings or vocabulary terms. The strings can either be plain strings, which means that they are just character data, or they can have a structure such as being of type date or currency, called a typed string. In its simplest form, a vocabulary is a list of values from which the value in the key/value pair must be chosen. Examples are the ISO lists of languages, or the MARC list of creator roles. However, the DCAM allows the vocabulary member to be more than just a text value in a list; the vocabulary member may itself be a property/value pair, another vocabulary, or a typed string. (Diagrammed here) I find this to be confusing because it is not a commonly understood use of the term "vocabulary." It might have been clearer if another category had been included in the diagram that allowed the value of a property to be another property or other type of value.
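
To make the macro view concrete, here is a minimal sketch in Python of the three kinds of value and of the simplest use of a vocabulary, which is checking that a value comes from a list. Everything named here is my own invention for illustration: the class names, the function is_in_vocabulary, and the three-code list, which is only a tiny excerpt standing in for a real language code list. It deliberately leaves out the more elaborate case in which a vocabulary member is itself a property/value pair or another vocabulary.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class PlainString:
    """Just character data."""
    text: str


@dataclass(frozen=True)
class TypedString:
    """Character data with a declared structure, such as a date or currency."""
    text: str
    datatype: str


@dataclass(frozen=True)
class VocabularyTerm:
    """A value that must be chosen from a list."""
    term: str
    vocabulary: str  # assumed label naming the list the term comes from


# A value is one of the three kinds above.
Value = Union[PlainString, TypedString, VocabularyTerm]


@dataclass(frozen=True)
class Statement:
    """Each statement has a property and a value."""
    property_term: str
    value: Value


# Illustrative excerpt only; a real list of language codes is far longer.
LANGUAGE_CODES: FrozenSet[str] = frozenset({"en", "fr", "de"})


def is_in_vocabulary(value: VocabularyTerm, allowed: FrozenSet[str]) -> bool:
    """Return True if the term is one of the values on the list."""
    return value.term in allowed


title = Statement("title", PlainString("Moby Dick"))
language = Statement("language", VocabularyTerm("en", "language codes"))
```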

Now that we've uncovered the really basic concepts, we can look at the DCAM diagram and try to unbundle some of the difficult terminology that is used for these concepts, as well as some more details. One source of complexity in the diagram is that it includes both the expression of the value (either as a URI or a string) and the underlying value that the URI or string represents. It also divides the world of values into literals and non-literals, and refers to the expression of the value as a "surrogate." All of this adds to the difficulty in understanding the DCAM.

Here's my attempt to describe the value types that appear in DCAM:

DCAM / Me
property / property URI: This is what most of us think of as the metadata term, like "title" or "date," or the key of a key/value pair. Like everything else in the DCAM, it is identified with a Uniform Resource Identifier, a URI.
value surrogate: DCAM calls all of the metadata values "surrogates." In the mind of the authors, the "value" would be something in the real world, and what is in the metadata is a surrogate. I happen to accept that everything in metadata is contrived and I'm happy to call a metadata value a value. It's my opinion that adding "surrogate" to "value" just makes it harder to understand. You may wish to use my technique of just skipping over the term "surrogate" when it appears in the DCAM documentation. I honestly don't think this will lead to any mis-formed metadata. Just remember that what's in the metadata represents something that exists elsewhere.
literal value surrogate: This is a value that is a string of data. There are two types of literals: plain and typed. Those follow here.
plain value string: This is pretty much what it says: the metadata element's value will just be a plain string of characters. In programming this is often called "character data."
typed value string: If you've done any programming you are probably familiar with typed values. These are things like dates, currency, and numeric types (integer, long). Typed strings have the advantage that they can be checked for structural validity by programs. So if you have defined a metadata element as a date type of the format "yyyy-mm-dd", a program can reject any values that don't conform to that format, and editing interfaces can enforce the format when the data is created.
syntax encoding scheme: This refers to the rules for the typed value string above. The rule set for how your type is defined is called a "syntax encoding scheme" in DCAM, but I think of it just as a data type. An example of a data type is "gYear", one of the date data types in W3C XML Schema.
non-literal value surrogate: Anything that isn't a literal string is a non-literal value. The basic thing to remember is that the non-literal values are all defined somewhere outside of the immediate metadata record -- in controlled lists, or perhaps through an identifier. Where plain literal values are wide open and anything goes, non-literal values are defined somewhere and are therefore more amenable to computer-based quality control and accurate crosswalking. There are two essential types of non-literals: one in which the value is represented entirely with an identifier (URI), and one in which the value is a combination of an identifier and a literal string. See the examples in the entries that follow.
value URI: This element is an identifier (URI) that identifies an actual value. One type of common value that has an identifier is a member of a controlled list, such as "http://purl.org/dc/dcmitype/Image", which is one of the values in the DCMI Type vocabulary. It could also be the identifier for a property, a class, another vocabulary, or a typed string. This is the area of confusion about vocabulary terms that I mention above.
vocabulary encoding scheme URI: This is the identifier for a list of terms. It is used by lists that don't have an identifier for each value. An example would be the ISO two- and three-letter language codes. The identifier for the ISO standard would be one element, and the code itself would be carried as a related literal string. Like the value URI, this could also be the identifier for a property, a class, another vocabulary, or a typed string.
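
The practical difference between these value types is easiest to see in code. What follows is a rough sketch in Python under several assumptions of my own: the class and function names are invented, the URI "http://example.org/vocab/language-codes" is a hypothetical stand-in for the identifier of a language code list, and the date check covers only the plain "yyyy-mm-dd" form discussed above rather than everything a full date data type might allow. The data type URI is the standard XML Schema identifier for the date type.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Identifier of the XML Schema "date" data type, used here as the
# syntax encoding scheme for typed value strings.
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


@dataclass(frozen=True)
class LiteralValue:
    """A string of data; with no syntax encoding scheme it is a plain string."""
    value_string: str
    syntax_encoding_scheme: Optional[str] = None


@dataclass(frozen=True)
class NonLiteralValue:
    """A value defined outside the record: a value URI, or a scheme URI plus a string."""
    value_uri: Optional[str] = None
    vocabulary_encoding_scheme_uri: Optional[str] = None
    value_string: Optional[str] = None


def is_valid_date(text: str) -> bool:
    """Accept only zero-padded yyyy-mm-dd strings that name a real calendar date."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def is_valid_literal(value: LiteralValue) -> bool:
    """Plain strings always pass; typed strings are checked against their scheme."""
    if value.syntax_encoding_scheme is None:
        return True
    if value.syntax_encoding_scheme == XSD_DATE:
        return is_valid_date(value.value_string)
    raise ValueError("no checker for scheme: " + value.syntax_encoding_scheme)


# A value represented entirely by an identifier.
image_type = NonLiteralValue(value_uri="http://purl.org/dc/dcmitype/Image")

# A value represented by a list identifier plus a code from that list.
english = NonLiteralValue(
    vocabulary_encoding_scheme_uri="http://example.org/vocab/language-codes",  # hypothetical
    value_string="en",
)
```

With this in place, is_valid_literal(LiteralValue("2008-02-30", XSD_DATE)) is rejected because there is no such date, while a plain string such as LiteralValue("Moby Dick") passes untouched, which is exactly the "anything goes" quality of plain literals.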

Here's a simplified diagram of the statement portion of the DCAM, based on my analysis, above:

You can see that my diagram doesn't include everything in the DCAM. Where the DCAM includes the representation of the key and value in the metadata as well as what it represents in the world outside of the metadata, I have collapsed those into one for the sake of simplifying the diagram.

What Can We Do with It?

In its current state as an abstract model, the DCAM itself cannot be used. It is, however, the basis for the DCMI Description Set Profile, which is a constraint language that supports application profiles. There is a first draft of an XML schema for the DSP that illustrates how some of the concepts in the DCAM might be utilized in an application profile.

We also need a machine-readable profile for metadata terms and for vocabulary lists. The profile for metadata terms and their rules could be derived from the schema for the Description Set Profile, with the addition of fields for definitions, input instructions, and other notes and comments. (Imagine a combination of the DSP and the NSDL Metadata Registry record.) For vocabulary lists of terms we may be able to use SKOS. You can see a sample of such a list in the Metadata Registry entry for RDA base material. Adopting these standards would allow us to create metadata that is interoperable, or, to use the current terminology, available for mashups. Anyone could "borrow" terms from anyone else's metadata schema, and metadata could be exchanged with greater confidence that it will be understood by the recipients.

So if you've enjoyed this explanation and want more, the full DCAM document may be just your cup of tea.


©Karen Coyle, 2008
This work is licensed under a Creative Commons License.