Understanding Data

It is fairly obvious that a statement like: "This book is 23 centimeters tall" is a bit of text designed to be read by human beings. This statement is a bit less obvious:

300 $c 23 cm.

Is this data designed for machine use? Before we can answer that, we need to say what we mean by data.

Data defined

There is no one single definition of data, but we will adopt one for our purposes in this lecture. Our definition of data is:

information defined in a machine-actionable way (and remember that machines are very, very stupid)
... that is directly usable by algorithms or programs (if you have to normalize the data, find it in a string, or make "guesses" it isn't going to work)
... and is as unambiguous as possible

The main difference between what we are calling "data" and what we are calling "text" is that data is either quantitative or controlled in a way that it can be used in computer applications for things other than a display for humans. Text, although it is a data type under most data type definitions, is not designed for the rigor of algorithms.

Examples

Let's go back to our "23 cm."

data basics example

Why isn't our "23 cm." data under this definition? It isn't for all of the aspects of our definition:

Although this information is in a machine-readable format in a machine-readable record, it is written as a human readable text string. There is a lot that is assumed in that string, primarily that the human reader is aware that this "23 cm." refers to the height of the book, not, for example, the width of the book, the package it came in, or the length of the cataloger's hair.
The string "23 cm." would need to be parsed before it could be used because it contains two different pieces of information: one conceptual and one numeric. The data view below it, although conceptual and not actual code, separates the two bits of information: the unit of measure ("centimeters") and the value ("23").
By designating that this data refers to size, to the height of the thing being described, that it is measured in a unit called "centimeters" (and preferably that would be a controlled term that programs know about), and that the value is an integer, programs can work with this data. Something like this could be used to determine if a book will fit on a certain shelf or to select books for a particular kind of specialized storage.

Here's another example:

subject headings as data

This example is less obvious because the human-readable string is still there, and it is there for the purposes of displaying this data in a human-understandable form. The remainder of what you see in the data portion is a set of identifiers (which we will discuss later) that ensure that the human-readable string is identified uniquely for the machine, and that the relationships between this term and other terms are also coded and identified unambiguously.

relationships are also data

We tend to think of data as numbers or facts. Properly formatted, almost any information can be used as data. By creating a controlled list of relationships between creators and their creations, or between different creations, these relationships can be used in programmatic ways by applications. Relationshps of this type are key components of exploration and navigation capabilities in social networks.

Common types of data

There are some common types of data that you may be asked to provide when creating metadata. This list is not exhaustive, but be aware that most metadata has defined a data type for each field. If you use programs like Microsoft Excel or a database program like FileMaker you will have seen drop-down menus asking you to select a data type.

Integer (or other number type, like long integer)

Essentially numbers. It matters to applications whether numbers will have decimal portions, or if they will exceed certain values.

Currency

This is a type of number but defining it as currency automatically sets some defaults, like two decimal points and a particular character before the number in displays.

Date and time

It actually takes a fair amount of formatting to make dates unambiguous. If you see:

2/1/11

what date is that? To begin with, unless we know it is a recent date, we don't know for sure what century the final "11" refers to. In addition, unless you know whether it was written in the U.S. date style (month, day, year) or the European date style (day, month year) you don't know if this is the second of January or the first of February. We could make this clearer by writing "January 2, 2011" but then we run into the language problem: we have to recognize the names of months in the language of everyone who might provide information to us, and we also have to be able to translate that to the preferred languages of our users. Clearly a non-language based format is simpler.

There are standard date and time formats that, once we have indicated which standard format we are using (and there is more than one). Some common formats are:

ISO 8601

Library of Congress Extended Date and Time Format

Controlled list

One way to create data from text is to define a controlled list of terms. For example, you might want to indicate the color of an object. If this is an uncontrolled field then you can end up with an entirely open-ended list of colors that have to be interpreted semantically in order to be understood. To avoid having the same item's color being called alternately "purple," "plum," or "aubergine," you can limit the valid colors to a simple list: "red or blue or green or yellow or purple or orange." Lists are useful when there is a finite set of terms that serve the metadata needs. Lists are much less useful when the set of terms is not clear cut or cannot be predicted. And example of this latter is the set of titles of journal articles. Since authors can be as creative as they like in assigning titles, there is no possible list of titles that helps you identify future works. In any case, most lists will need to be updated as time goes on, so one of the requirements for well-constructed metadata is to have an update capability for lists.

Text

Wait a minute, isn't this about DATA? Yes, and at the same time our metdata will inevitably carry text that is intended to be read by humans. These may be explanations, the names by which things are known, helpful instructions, etc. By indicating that this particular bit of data is text you are essentially saying that this is information for humans, not for machines, and designating it such is important for applications and systems developers.

Metadata Basics