On the Web, of the Web

LITA 2011 Keynote, October 1, 2011

by Karen Coyle

I'm giving this talk as an overview because I believe that we are at the end of an era, and therefore at the beginning of something new. We are at the proverbial tipping point.

I have been speaking and writing for the last few years about the Semantic Web and Linked Data. These, however, are details that fit into this big picture, and it is vital for us to look at the context around those details so that we can make decisions as we move forward.

The New World

This is a new world. Let's do an environment scan: Today's world is:

Today's resources are:

Today's users:

The entire concept of communication is changing.

So our world is digital and interactive; information that is informal is part of the record, and text is waning. What is the library profession doing about this? I may have a skewed view because of the emphasis on metadata in my work, but I see two primary trends. The first is the development of the "FRamily": FRBR, FRAD, FRSAD. These are an attempt to create a new, modern, bibliographic model, based on the concept of relational databases, a concept that was new in the 1970's, but is being replaced by less structured data like triple stores. The new cataloging rules, RDA, are in part an implementation of these models, in particular of FRBR. The other trend is that of the Semantic Web and linked data, new general models of metadata coming out of the World Wide Web Consortium. These two trends I characterize, respectively, as relating to cataloging (FRs and RDA) and knowledge organization (linked data).

The Year of Linked Data

For libraries, this is the year of linked data. The year when it went from being a wild, harebrained idea to becoming accepted as mainstream. The British Library has issued the British National Bibliography as linked data. Sweden's national library's catalog, LIBRIS, exports bibliographic data in linked data format. Numerous German libraries have provided exports of their data in linked data format.

Here in the US, the Library of Congress is building up a set of essential bibliographic vocabularies in linked data format, including LCSH and the Name Authorities file, plus a number of the coded controlled lists from MARC21. The Virtual International Authority File, a union file of name authorities records from nearly 20 national libraries, is natively in linked data format.

The advantages of linked data are real. The primary reason to move to linked data is that it is a metadata format designed for data that lives on the Web. This in itself has a multiplicity of positives – being on the web means being able to interact with web resources; it means being visible to web users; it also means being able to take advantage of the web as a platform. This last point cannot be overstated: the web is a huge system based on solid technology that is much more reliable than anything a small organization can create in house. (Google, not a small organization, may have a private network that rivals the web, but the rest of us cannot get anywhere near that.) I'm talking not about 'web scale,' but 'on the web.'

There are other excellent reasons to move to linked data. Linked data provides a flexibility that previous technologies do not. Linked data allows expansion in a non-disruptive way; just as anyone on the web can link to documents on your web site, anyone can link to your data. And that linking has no effect on what it links to, other than providing new paths of access. Not only can others link to your data, you can link to your own data. This means that you can build up your data incrementally without having to modify your data structure or the systems that manage that data. No more waiting two years to add a code to our record format, then having to coordinate implementation on a broad basis so that all systems are in sync. Systems can ignore "new" data until they are ready to make use of it.
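As a rough illustration of that last point, here is a minimal sketch in Python, assuming data is held as simple subject-predicate-object statements. The `ex:` identifiers, the `describe` function, and the two "systems" are all hypothetical names invented for this example; they are not drawn from any real library vocabulary or software.

```python
# A tiny in-memory store: each statement is (subject, predicate, object).
# All identifiers below are made up for illustration.
triples = {
    ("ex:book1", "ex:title", "Hamlet"),
    ("ex:book1", "ex:author", "ex:shakespeare"),
}

def describe(subject, known_predicates, store):
    """Return only the statements about `subject` that this system understands."""
    return sorted(
        (p, o) for (s, p, o) in store
        if s == subject and p in known_predicates
    )

# An older system that only knows about titles and authors.
OLD_SYSTEM = {"ex:title", "ex:author"}
before = describe("ex:book1", OLD_SYSTEM, triples)

# Someone adds a new kind of statement; no record format change is needed.
triples.add(("ex:book1", "ex:translationOf", "ex:work42"))

# The older system sees exactly what it saw before.
after = describe("ex:book1", OLD_SYSTEM, triples)

# A newer system opts in to the new predicate when it is ready.
NEW_SYSTEM = OLD_SYSTEM | {"ex:translationOf"}
extended = describe("ex:book1", NEW_SYSTEM, triples)
```

In this sketch the older system's view (`before` and `after`) is unchanged by the addition, while the newer system picks up the extra statement without anyone having to coordinate a format change.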

Linked data allows some communities or some segments of communities to add more data where they need to, and still remain compatible with the larger data sharing activity. One can add new controlled lists or expand general lists to express greater detail in some areas. Others don't have to adjust to that if they don't need that detail. This flexibility also extends to internationalization. Because linked data uses identifiers instead of names or terms, those identifiers can be presented to natural-language users (i.e., humans) in any language you wish. RDA elements and controlled lists in linked data format are already being translated by interested libraries in Europe. And if there are particular needs in one country or region, those needs can be met by extending the metadata.
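The identifier-versus-label idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifier `ex:concept/cooking`, the `LABELS` table, and the `label_for` function are hypothetical and do not come from any published vocabulary.

```python
# Labels keyed by identifier, then by language code. The identifier and
# labels are hypothetical examples, not taken from any published vocabulary.
LABELS = {
    "ex:concept/cooking": {"en": "Cooking", "fr": "Cuisine", "de": "Kochen"},
}

def label_for(identifier, lang, fallback="en"):
    """Pick a display label for an identifier in the requested language."""
    by_lang = LABELS.get(identifier, {})
    # Fall back to a default language, then to the bare identifier.
    return by_lang.get(lang) or by_lang.get(fallback) or identifier

# The record itself stores only the identifier, never a display string.
record = {"subject": "ex:concept/cooking"}
```

Calling `label_for(record["subject"], "fr")` yields the French label, and a language with no translation falls back to English. The record never changes; only the presentation does.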

We only need to be the same in some areas to achieve linking. In the past, when we exchanged records, the record itself had to be a known and controlled unit. With linking, any part of the data can be compatible, but not all of it has to be. You can exchange or link to portions of someone else's data, say to an author and title in a bibliography. Linked data can be as simple or complex as you wish it to be.

There are the Semantic Web purists who have a fairly rigid notion of linked data, and admittedly the way they express their concept of linked data is abstract and obscure. I also think that the Semantic Web ideal of highly atomized, pure data is unrealistic. That doesn't negate the value of linked data. It just means that we have to have a flexible approach to the concept if we are to bring our real-world complexity and messiness into the linked data realm.

Catalog Data

Nearly all of the discussion of linked data in the library world has been focused on catalog data. In a sense this is a logical place to start, because we already have that data in a machine-readable form. We also know it well, although quite possibly too well. Library catalog data is a language and a communication system all its own. Who else creates things that look like:

xii, 356 p. ; 23 cm.

Few outside of the library world have any idea what that means. What about:

Hamlet. French. 1923

It's like the secret language of twins, but shared over generations and with tens of thousands of initiates. We can have our secret inner circle, sure, but the library catalog is supposed to be our face to the world. It hasn't been that. And it's not going to be that if we continue to see the catalog of resources as our primary interaction with users. Taking the data we have today in the catalog and converting it to linked data would be a repetition of what we did when we went from the catalog card to the MARC format: translating the same functionality into a new data format. It's not going to make our data any more user friendly or any more useful.

In addition, the catalog primarily organizes things: resources, books, CDs and DVDs. Where we need to go is to KNOWLEDGE ORGANIZATION. Not STUFF organization.

Subject access in FRBR and RDA

Our cataloging rules do not address subject access. This includes FRBR and RDA, and these are the models that are supposed to take us into the future. AACR, FRBR and RDA provide a stub in the resource description for a classification number or subject heading, if you happen to have them, but none of them helps you create organized knowledge. You may wonder why we should even bother, since today everyone uses keyword searching, and users aren't doing their knowledge-seeking in the library catalog anyway. So let's look at keyword searching for a few minutes.

Why does Google (Yahoo, Bing) use keyword searching? Because it's easy. It is mechanical. It is a match between a string in a query and a string in a database (even with all of its enhancements, that's the bottom line). It requires no knowledge of the topic, no human intervention, no experts. Keyword searching is NOT knowledge organization. With keyword searching there are no relationships between things. You can't go broader or narrower; you can't get "things like this"; it doesn't even have facets. I said before that users are accustomed to the single search box, and many see it as representing freedom – wide open, anything goes. It's not a freedom, it's a constraint. It basically constrains the user to try to guess what words will bring up the information she is seeking – which is a bit unfair since the assumption is that there is something the user doesn't know, which is why she is doing a search. The user has to translate what might be a complex information need to a couple of words. And as Elaine Svenonius notes:

"At the same time, it is known that users in
their attempts to search by subject
sometimes find themselves at a loss for

Svenonius, The Intellectual Foundation of Information Organization, p. 135

What works for keyword searching?

What doesn't work?

Google has all of the knowledge basis of a phone book. You name it, you retrieve it.
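The difference between a bare string match and a search that knows about relationships can be sketched as follows. This is a toy illustration only: the documents, the `NARROWER` table, and the function names are invented for the example, and real search engines are of course far more elaborate than a substring test.

```python
# Hypothetical mini-collection and concept scheme, for illustration only.
DOCUMENTS = {
    "d1": "An introduction to baking bread at home",
    "d2": "Grilling techniques for summer",
    "d3": "A history of cooking in France",
}

# Each concept points to its narrower concepts.
NARROWER = {
    "cooking": ["baking", "grilling"],
    "baking": [],
    "grilling": [],
}

def keyword_search(query):
    """Plain string match: a document is a hit only if it contains the query."""
    q = query.lower()
    return sorted(d for d, text in DOCUMENTS.items() if q in text.lower())

def expand(concept):
    """Return the concept plus everything narrower than it."""
    found = [concept]
    for child in NARROWER.get(concept, []):
        found.extend(expand(child))
    return found

def concept_search(concept):
    """Search for a concept and all of its narrower concepts."""
    hits = set()
    for term in expand(concept):
        hits.update(keyword_search(term))
    return sorted(hits)
```

Here `keyword_search("cooking")` finds only the one document that happens to contain that word, while `concept_search("cooking")` also returns the documents on baking and grilling, because the relationship between those topics is recorded rather than left for the user to guess.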

Did you ever wonder why so many searches turn up Wikipedia in the first few hits? Wikipedia is ORGANIZED INFORMATION. To me it is the proof that organized information is needed, works, and helps people find and learn. Wikipedia does have pages for concepts, it does have links between related subjects, it IS organized knowledge. How well does keyword searching work? Some analogies:

We tend to ignore the false hits and zoom in on the successes. But the main thing is that this imprecise retrieval puts a huge burden on the user, who has to essentially game the system to get retrievals and then has to dig through what comes back to sort wheat from chaff. In his book Everything is Miscellaneous David Weinberger talks about tagging, and says that a search on Flickr for "San Francisco" will bring up photos of a number of different places named San Francisco, but what does that matter? I think it matters, and it matters especially for the least experienced users who find such things confusing. Everything might be miscellaneous but it is also time consuming and annoying.

Where does this leave us? We've spent the last 15 years re-making cataloging (for the fourth time in 50 years), but we are still using the Knowledge Organization (Library of Congress Classification, Dewey Decimal Classification, Library of Congress Subject Headings) from a century ago: three Victorian-era knowledge schemes. They are like the information retrieval of Brideshead Revisited. Yes, cookery has just now become cooking, but that is simply too little, too late. All of these knowledge schemes precede computers and have the limitations of the analog world. Today's technology could help us do much better topical access.

With computing, a classification is no longer limited to having only one place for a topic in order to make relationships like broader and narrower; and you can give any resource as many places in the classification as you like because we aren't constrained by physical space. The other big constraint, and especially on the use of faceting, was that of notation: the entire topic, no matter how complex, had to be represented by a single notation that not only conveyed the meaning of the topic but also had to fit on the spine of a book. Facets became a topic in the 1930's with Ranganathan. The Classification Research Group in the UK in the 1960's and 70's was heavy on faceted classification. Yet it never made it into our repertoire, even though facets, based on LC subject headings which are only partially faceted, have proven to be useful. The only topical 'advance' has been FAST – by OCLC. Faceted Application of Subject Terminology. And it was just a way to make use of LCSH without adding anything. FAST would be interesting to experiment with, but is virtually unknown: it was developed without community input (internal to OCLC research) so there isn't buy-in, and it has to be licensed from OCLC, so it isn't really available for experimentation.
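To make the first point concrete, here is a small sketch of a classification in which a topic can have more than one broader topic, something a single shelf location cannot express. The topics, the `BROADER` table, and the `ancestors` function are hypothetical examples, not taken from LCC, Dewey, LCSH, or FAST.

```python
# Hypothetical topics; a topic may sit under more than one broader topic.
BROADER = {
    "medical ethics": ["medicine", "ethics"],
    "medicine": ["science"],
    "ethics": ["philosophy"],
    "science": [],
    "philosophy": [],
}

def ancestors(topic):
    """Collect every broader topic reachable from `topic`."""
    seen = set()
    stack = list(BROADER.get(topic, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(BROADER.get(t, []))
    return seen

# A resource can also be given as many topics as it needs.
RESOURCE_TOPICS = {"ex:book7": ["medical ethics", "philosophy"]}
```

In this sketch "medical ethics" is reachable from both the science branch and the philosophy branch, so a user browsing from either direction would arrive at it.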

But Wait, There's More!

Users not only need to find information, they need to use it. Eric Morgan, in his beta sprint entry for the Digital Public Library of America, lists these uses:

analyze, annotate, cite, compare & contrast, confirm, delete, discuss, evaluate, find opposite, find similar, graph & visualize, learn from, plot on a map, purchase, rate, read, review, save, share, summarize, tag, trace idea, or transform.

At the Linked Open Data - Libraries, Archives and Museums (LOD-LAM) summit in San Francisco in May of 2011 we developed the following diagram of uses in a short one-hour session:

Users and uses diagram

Use is an information activity; and users need to interact with resources during use. These are the services we need to support. We will support them partially with data from the catalog, but we will also use data from the web – because we will be of the Web. If a user asks a question, we can answer it using any resources we like. Linked data will help us get there, but not because OUR data has the answers – it will help us get there because we will be on the Web. Our data will play a role because it will connect users with resources managed by the library. But the library catalog needs to be a back-room database; the ugly stuff that keeps track of what parts of the information universe the library actually HAS, but not the only part of the information universe that the library can interact with.

The Message

We've got to move beyond the catalog. It is no longer an end in itself, and it is no longer a primary user service. Yes, we need the metadata that describes our holdings and our licensed resources, but this inventory isn't for our users but is fodder for services that will be used in a larger information environment. It needs to be like the OpenURL server database that sits between information resources on the network and the library user. This also means that FRBR and RDA will have to evolve. The catalog that they address, that they create, is no longer serving our users. Our data needs to focus on making connections outside of the library that will bring library resources to users as they interact with the world of information. Those connections can't be limited to connecting to the names of authors and titles, or to works and manifestations, but absolutely have to have a knowledge organization component. In fact, our main emphasis should be on knowledge organization, quite the opposite of where we are today.