I'm giving this talk as an overview because I believe that we are at the end of an era, and therefore at the beginning of something new. We are at the proverbial tipping point.
I have been speaking and writing for the last few years about the Semantic Web and Linked Data. These, however, are details that fit into this big picture, and it is vital for us to look at the context around those details so that we can make decisions as we move forward.
The New World
This is a new world. Let's do an environment scan: Today's world is:
- Highly computerized, and networked. A tremendous amount of our activity takes place through computers and over computer networks; activity that we used to do in the analog world.
- It is highly interactive. Everyone expects to be able to create in this virtual space (even if it is only their own Facebook page) and to have a chance to have an effect on this malleable technology, or at least to be allowed to express their opinion on it. If Moses came down from the mountain today with the ten commandments, he would be expected to have comments "enabled," he'd know he was successful if his commandments went viral, and some number of his readers would develop their own commandments in reply.
- It is a pluralistic world. The old powers of the analog world still exist, but the networked world is able to produce sometimes surprising new powerful forces, like Wikileaks, which has changed the course of politics and international diplomacy, and like the independent bloggers who have become important commentators without having the backing of standard media companies.
Today's resources are:
- either going digital (being digitized) or already born digital. This is not the end of print, but print is clearly the waning technology, not the new, modern technology. I can imagine that print (on paper) will have a role the way that live performances do today: 150 years ago if you wanted to see a performance you saw it live; today we have many other ways to experience performances virtually. The live performance hasn't disappeared, but it has become special, the exception, not the rule. Many years from now, paper will be part of those special occasions: wedding invitations will still be sent on lovely paper with beautiful printing.
- Resources today are relatively easy to find and to obtain: they can be found by searching online or in databases, and because many are online they can be accessed right on your screen. They are, however, difficult to use. You must have a device and that device must have software, and the software is imperfect and requires frequent maintenance. For all that we have been accessing and using digital objects for decades, common actions like annotating are still crude and frustrating.
Today's users:
- Expect to work independently without prior instruction. If you tell them that they have to attend a one hour class in order to access your web site, they will simply go away. They are the generation of the "single search box," the seemingly effortless access to the web and all of its resources.
- They are dependent on software tools that are not controlled by the library, and may not work all that well. There is a significant layer of software and hardware between the user and the library services, and we all know that our users do not see a bright dividing line between what the library controls and what it does not. The user may blame the library when something goes wrong, even though it may not be something that the library can fix.
- "Access" often means obtaining a copy, as when we view an article in PDF and a copy is stored on our hard drive. Users end up with substantial libraries of their own that they need to manage and there is little to help them do that, in spite of Zotero and Mendeley and other software packages that purport to help in that area. Note that those software tools would not exist if the basic technology of the hard drive were more advanced, but I am stunned at the fact that when I look at files on my drive I am still, after over 30 years of personal computing, looking at file names, and not at titles or authors.
The entire concept of communication is changing.
- Communication is increasingly remote, not face-to-face.
- It is faster. Academics talk about the "slow conversation" of books, where someone publishes his or her ideas in a book, it is then read by others, and at some point, years or decades later, another book is published that responds to or adds to the original book. This "slow conversation" of books has given way to blogs, which are short and quick, with responses generally received within days if not hours, and Twitter, which is even shorter and faster, with messages and replies happening so quickly that it's hard to say if it is asynchronous or synchronous interaction.
- I see a growing dominance of new media over old, in particular video over text. Not that long ago when you bought software you got a box with 12 or 15 diskettes and a large trade-paperback book. That book was the instruction manual for the software. Today if you want to learn how to use a program you are likely to find training in the form of videos. Performances of all kinds are no longer ethereal but are easily captured and can have the impact that the printed word had in the past. Classroom lectures, talks like this one, all can be fixed in a form that will persist beyond the original moment. And that is not to mention any activity on public streets and the antics of millions of cats, all of which become part of this civilization's record.
- Once-informal communication, like email or chat, becomes permanent, whether we like it or not. We rely on being able to go back to utterances that are years old, and even courts today consider email "evidence" while they would never assign that same importance to a conversation over cocktails or a latte at Starbucks. We've pretty much lost our ability to be "off the record."
So our world is digital and interactive; information that is informal is part of the record, and text is waning. What is the library profession doing about this? I may have a skewed view because of the emphasis on metadata in my work, but I see two primary trends. The first is the development of the "FRamily": FRBR, FRAD, FRSAD. These are an attempt to create a new, modern, bibliographic model, based on the concept of relational databases, a concept that was new in the 1970's, but is being replaced by less structured data like triple stores. The new cataloging rules, RDA, are in part an implementation of these models, in particular of FRBR. The other trend is that of the Semantic Web and linked data, new general models of metadata coming out of the World Wide Web Consortium. These two trends I characterize, respectively, as relating to cataloging (FRs and RDA) and knowledge organization (linked data).
The Year of Linked Data
For libraries, this is the year of linked data. The year when it went from being a wild, harebrained idea to becoming accepted as mainstream. The British Library has issued the British National Bibliography as linked data. Sweden's national library's catalog, LIBRIS, exports bibliographic data in linked data format. Numerous German libraries have provided exports of their data in linked data format.
Here in the US, the Library of Congress is building up a set of essential bibliographic vocabularies in linked data format, including LCSH and the Name Authorities file, plus a number of the coded controlled lists from MARC21. The Virtual International Authority File, a union file of name authorities records from nearly 20 national libraries, is natively in linked data format.
The advantages of linked data are real. The primary reason to move to linked data is that it is a metadata format designed for data that lives on the Web. This in itself has a multiplicity of positives – being on the web means being able to interact with web resources; it means being visible to web users; it also means being able to take advantage of the web as a platform. This last point cannot be overstated: the web is a huge system based on solid technology that is much more reliable than anything a small organization can create in house. (Google, not a small organization, may have a private network that rivals the web, but the rest of us cannot get anywhere near that.) I'm talking not about 'web scale,' but 'on the web.'
There are other excellent reasons to move to linked data. Linked data provides a flexibility that previous technologies do not. Linked data allows expansion in a non-disruptive way; just as anyone on the web can link to documents on your web site, anyone can link to your data. And that linking has no effect on what it links to, other than providing new paths of access. Not only can others link to your data, you can link to your own data. This means that you can build up your data incrementally without having to modify your data structure or the systems that manage that data. No more waiting two years to add a code to our record format, then having to coordinate implementation on a broad basis so that all systems are in sync. Systems can ignore "new" data until they are ready to make use of it.
Linked data allows some communities or some segments of communities to add more data where they need to, and still remain compatible with the larger data sharing activity. One can add new controlled lists or expand general lists to express greater detail in some areas. Others don't have to adjust to that if they don't need that detail. This flexibility also extends to internationalization. Because linked data uses identifiers instead of names or terms, those identifiers can be presented to natural-language users (i.e., humans) in any language you wish. RDA elements and controlled lists in linked data format are already being translated by interested libraries in Europe. And if there are particular needs in one country or region, those needs can be met by extending the metadata.
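To make the last two points concrete, here is a minimal sketch in Python of data held as subject-property-value statements. It is an illustration only: every identifier below is a made-up example.org URI, not a term from RDA or any real vocabulary, and a real system would use an RDF library rather than plain tuples.

```python
# Hypothetical identifiers; none of these URIs come from a real vocabulary.
BOOK = "http://example.org/book/1"
CARRIER = "http://example.org/element/carrierType"
VOLUME = "http://example.org/term/volume"
LABEL = "http://example.org/element/label"
SHELF_MARK = "http://example.org/local/shelfMark"

# Each statement is a (subject, property, value) triple.
# Labels are (text, language) pairs attached to an identifier.
triples = {
    (BOOK, CARRIER, VOLUME),
    (VOLUME, LABEL, ("volume", "en")),
    (VOLUME, LABEL, ("Band", "de")),
}


def label_for(identifier, language, data):
    """Return the display label for an identifier in one language, or None."""
    for subject, prop, value in data:
        if subject == identifier and prop == LABEL and value[1] == language:
            return value[0]
    return None


def carrier_of(book, data):
    """Stand-in for an older system that only understands the carrier property."""
    return [value for subject, prop, value in data
            if subject == book and prop == CARRIER]


# Another community adds a property of its own. Nothing existing is modified.
extended = triples | {(BOOK, SHELF_MARK, "QA76.9")}

print(label_for(VOLUME, "en", extended))
print(label_for(VOLUME, "de", extended))
print(carrier_of(BOOK, extended) == carrier_of(BOOK, triples))
```

The same identifier is shown as "volume" or "Band" depending on the reader's language, and the function that knows nothing about the added shelf mark returns the same answer before and after the addition.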
We only need to be the same in some areas to achieve linking. In the past, when we exchanged records, the record itself had to be a known and controlled unit. With linking, any part of the data can be compatible, but not all of it has to be. You can exchange or link to portions of someone else's data, say to an author and title in a bibliography. Linked data can be as simple or complex as you wish it to be.
There are the Semantic Web purists who have a fairly rigid notion of linked data, and admittedly the way they express their concept of linked data is abstract and obscure. I also think that the Semantic Web ideal of highly atomized, pure data is unrealistic. That doesn't negate the value of linked data. It just means that we must take a flexible approach to the concept if we are to bring our real-world complexity and messiness into the linked data realm.
Nearly all of the discussion of linked data in the library world has been focused on catalog data. In a sense this is a logical place to start, because we already have that data in a machine-readable form. We also know it well, although quite possibly too well. Library catalog data is a language and a communication system all its own. Who else creates things that look like:
xii, 356 p. ; 23 cm.
Few outside of the library world have any idea what that means. What about:
Hamlet. French. 1923
It's like the secret language of twins, but shared over generations and with tens of thousands of initiates. We can have our secret inner circle, sure, but the library catalog is supposed to be our face to the world. It hasn't been that. And it's not going to be that if we continue to see the catalog of resources as our primary interaction with users. Taking the data we have today in the catalog and converting it to linked data would be a repetition of what we did when we went from the catalog card to the MARC format: translating the same functionality into a new data format. It's not going to make our data any more user friendly or any more useful.
In addition, the catalog primarily organizes things: resources, books, CDs and DVDs. Where we need to go is to KNOWLEDGE ORGANIZATION. Not STUFF organization.
Subject access in FRBR and RDA
Our cataloging rules do not address subject access. This includes FRBR and RDA, and these are the models that are supposed to take us into the future. AACR, FRBR and RDA provide a stub in the resource description for a classification number or subject heading, if you happen to have them, but none of them helps you create organized knowledge. You may wonder why we should even bother, since today everyone uses keyword searching, and users aren't doing their knowledge-seeking in the library catalog anyway. So let's look at keyword searching for a few minutes.
Why does Google (Yahoo, Bing) use keyword searching? Because it's easy. It is mechanical. It is a match between a string in a query and a string in a database (even with all of its enhancements, that's the bottom line). It requires no knowledge of the topic, no human intervention, no experts. Keyword searching is NOT knowledge organization. With keyword searching there are no relationships between things. You can't go broader or narrower; you can't get "things like this"; it doesn't even have facets. I said before that users are accustomed to the single search box, and many see it as representing freedom – wide open, anything goes. It's not a freedom, it's a constraint. It basically constrains the user to try to guess what words will bring up the information she is seeking – which is a bit unfair since the assumption is that there is something the user doesn't know, which is why she is doing a search. The user has to translate what might be a complex information need to a couple of words. And as Elaine Svenonius notes:
"At the same time, it is known that users in
their attempts to search by subject
sometimes find themselves at a loss for
Svenonius, The Intellectual Foundation of Information Organization, p. 135
What works for keyword searching?
- nouns, especially proper nouns
- named things
- programming languages (Python, Ruby) (Note that you don't retrieve much about snakes or gems with these searches, showing a particular bias in the content of the Web itself)
- titles of books or essays (Moby Dick)
What doesn't work?
- searching for concepts
- searching for things with common terms in their names (library, catalog) (Often when I'm searching for topics relating to libraries I find myself in GitHub.)
- you can't ask a specific question: When did Melville write Moby Dick? You can only put in those terms and hope that a retrieved web page contains the answer. (Wolfram Alpha is trying to address this problem)
Google has all of the knowledge basis of a phone book. You name it, you retrieve it.
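The difference between matching strings and following relationships can be sketched in a few lines of Python. This is a toy illustration, not a description of how any search engine works; the documents, the concept scheme and the function names are all invented for the example.

```python
# Toy data, invented for illustration.
documents = {
    "doc1": "Python tutorial for beginners",
    "doc2": "Keeping a ball python as a pet",
    "doc3": "Care of boas and other constrictors",
}

# A tiny hand-built concept scheme: each concept names its broader concept.
broader = {
    "ball python": "pythons (snakes)",
    "pythons (snakes)": "snakes",
    "boas": "snakes",
    "snakes": "reptiles",
}

# Subject assigned to each document by a (hypothetical) cataloger.
subjects = {"doc2": "ball python", "doc3": "boas"}


def keyword_search(query, docs):
    """Return ids of documents whose text contains the query string."""
    needle = query.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if needle in text.lower())


def narrower_closure(concept):
    """Return the concept plus everything beneath it in the scheme."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in broader.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def concept_search(concept, assigned):
    """Return ids of documents whose subject is the concept or a narrower one."""
    wanted = narrower_closure(concept)
    return sorted(doc_id for doc_id, subject in assigned.items()
                  if subject in wanted)


print(keyword_search("python", documents))
print(keyword_search("snakes", documents))
print(concept_search("snakes", subjects))
```

The string match for "python" mixes the programming tutorial with the pet snake, and the string match for "snakes" finds nothing because no document happens to use that word. The concept search finds both snake documents because the scheme records that ball pythons and boas are kinds of snake.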
Did you ever wonder why so many searches turn up Wikipedia in the first few hits? Wikipedia is ORGANIZED INFORMATION. To me it is the proof that organized information is needed, works, and helps people find and learn. Wikipedia does have pages for concepts, it does have links between related subjects, it IS organized knowledge. How well does keyword searching work? Some analogies:
- it's like dumpster diving for information; you dig through a lot of garbage but you might find a clean, wrapped sandwich
- it's like dynamite fishing; you throw dynamite into a lake and see what gets thrown up in the air.
- it's like your grandmother's button box; you need a button and you can spend ages digging through trying to find one that matches on size and color. Or you can go to the store where they have the buttons in order by size and color, and pay a couple of bucks.
We tend to ignore the false hits and zoom in on the successes. But the main thing is that this imprecise retrieval puts a huge burden on the user, who has to essentially game the system to get retrievals and then has to dig through what comes back to sort wheat from chaff. In his book Everything is Miscellaneous David Weinberger talks about tagging, and says that a search on Flickr for "San Francisco" will bring up photos of a number of different places named San Francisco, but what does that matter? I think it matters, and it matters especially for the least experienced users who find such things confusing. Everything might be miscellaneous but it is also time consuming and annoying.
Where does this leave us? We've spent the last 15 years re-making cataloging (for the fourth time in 50 years), but we are still using the Knowledge Organization (Library of Congress Classification, Dewey Decimal Classification, Library of Congress Subject Headings) from a century ago; three Victorian-era knowledge schemes. They are like the information retrieval of Brideshead Revisited. Yes, cookery has just now become cooking, but that is simply too little, too late. All of these knowledge schemes precede computers and have the limitations of the analog world. Today's technology could help us do much better topical access.
With computing, a classification is no longer limited to having only one place for a topic in order to make relationships like broader and narrower; and you can give any resource as many places in the classification as you like because we aren't constrained by physical space. The other big constraint, and especially on the use of faceting, was that of notation: the entire topic, no matter how complex, had to be represented by a single notation that not only conveyed the meaning of the topic but also had to fit on the spine of a book. Facets became a topic in the 1930's with Ranganathan. The Classification Research Group in the UK in the 1960's and 70's was heavy on faceted classification. Yet it never made it into our repertoire, even though facets, based on LC subject headings, which are only partially faceted, have proven to be useful. The only topical 'advance' has been FAST (Faceted Application of Subject Terminology), by OCLC. And it was just a way to make use of LCSH without adding anything. FAST would be interesting to experiment with, but is virtually unknown: it was developed without community input (internal to OCLC research) so there isn't buy-in, and it has to be licensed from OCLC, so it isn't really available for experimentation.
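As a rough illustration of what computing makes possible here, the following Python sketch gives each resource several independent facets and lets a search be narrowed one facet at a time. The records, the facet names and the functions are hypothetical; this is not FAST or any existing classification, only the general idea of faceting without a single shelf notation.

```python
from collections import Counter

# Hypothetical records: each resource carries values in several independent
# facets instead of one notation that has to fit on a spine label.
records = [
    {"title": "Gothic cathedrals", "topic": "Architecture",
     "place": "France", "period": "Middle Ages"},
    {"title": "Paris rebuilt", "topic": "Architecture",
     "place": "France", "period": "19th century"},
    {"title": "Victorian railways", "topic": "Transportation",
     "place": "Great Britain", "period": "19th century"},
]


def facet_counts(items, facet):
    """Count how many items carry each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def refine(items, **selected):
    """Keep only the items that match every selected facet value."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selected.items())]


# The same resource is reachable by topic, by place, or by period.
print(facet_counts(records, "period"))
for item in refine(records, topic="Architecture", period="19th century"):
    print(item["title"])
```

Here "Paris rebuilt" has a place under Architecture, under France and under the 19th century at once, which is the freedom from physical shelving described above.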
But Wait, There's More!
Users not only need to find information, they need to use it. Eric Morgan, in his beta sprint entry for the Digital Public Library of America, lists these uses:
analyze, annotate, cite, compare & contrast, confirm, delete, discuss, evaluate, find opposite, find similar, graph & visualize, learn from, plot on a map, purchase, rate, read, review, save, share, summarize, tag, trace idea, or transform.
At the Linked Open Data - Libraries, Archives and Museums (LOD-LAM) summit in San Francisco in May of 2011 we developed the following diagram of uses in a short one-hour session:
Use is an information activity; and users need to interact with resources during use. These are the services we need to support. We will support them partially with data from the catalog, but we will also use data from the web – because we will be of the Web. If a user asks a question, we can answer it using any resources we like. Linked data will help us get there, but not because OUR data has the answers – it will help us get there because we will be on the Web. Our data will play a role because it will connect users with resources managed by the library. But the library catalog needs to be a back-room database; the ugly stuff that keeps track of what parts of the information universe the library actually HAS, but not the only part of the information universe that the library can interact with.
We've got to move beyond the catalog. It is no longer an end in itself, and it is no longer a primary user service. Yes, we need the metadata that describes our holdings and our licensed resources, but this inventory isn't for our users but is fodder for services that will be used in a larger information environment. It needs to be like the OpenURL server database that sits between information resources on the network and the library user. This also means that FRBR and RDA will have to evolve. The catalog that they address, that they create, is no longer serving our users. Our data needs to focus on making connections outside of the library that will bring library resources to users as they interact with the world of information. Those connections can't be limited to connecting to the names of authors and titles, or to works and manifestations, but absolutely have to have a knowledge organization component. In fact, our main emphasis should be on knowledge organization, quite the opposite of where we are today.