Meaning, Technology, and the Semantic Web

By Karen Coyle

Preprint. Published in the Journal of Academic Librarianship, v. 34, n. 3, May, 2008, pp. 263-264

When the World Wide Web was developed in the early 1990s at CERN, it created a linked web of documents positioned on top of the basic networking of the Internet. The Web has very few rules; text of nearly any stripe can be placed on the Web, indexed and retrieved. Yet the fact that the items on the Web consist for the most part of unstructured data puts great limitations on what we can and cannot do in terms of combining resources and providing services. The goal of the Semantic Web is to transform the Web into a web of data, rather than a web of documents. Web resources must cease to be undifferentiated strings of text; instead, they have to be able to reveal meaning within the text. Meaning in this case is, of course, meaning that machines can process. To achieve this, the Semantic Web must create interactions between the ideas and facts in web resources.

The Semantic Web was first presented as a new concept at the World Wide Web conference in 1994, but the general public learned of it through an article in Scientific American in May 2001 [1] authored by the father of the Web, Tim Berners-Lee, and colleagues James Hendler and Ora Lassila. This article "described the evolution of a Web that consisted largely of documents for humans to read to one that included data and information for computers to manipulate. The Semantic Web is a Web of actionable information…" [2] Underlying the description of the needs of the Semantic Web is the recognition that most data on today's Web is uncoded and undifferentiated; it is natural language text. Anyone who has done data processing knows that you can only write computer programs for data that has been coded for meaning ("zip code=20001") and that has certain regularities that a program can take advantage of. Simple text that poses no problem for human beings can be inert in the face of an instruction like "What is the date of this document?" if that data does not exist in a predictable structure and format.

Semantics as Technology

The semantics of the Semantic Web should not be entirely unfamiliar to librarians. One very basic example of a semantic statement is: "Herman Melville wrote the book Moby Dick," or "Moby Dick, by Herman Melville." If you want a computer to make use of this, you might create something like this in MARC21 format:

	100 $a Melville, Herman
	245 $a Moby Dick

or this in Dublin Core:

	Creator = Herman Melville
	Title = Moby Dick

Obviously more is needed than these two fields, but just this much semantic coding makes the difference between actionable and non-actionable data. With this coding you can ask the question "Who wrote Moby Dick?" and possibly get an answer. This is not thinking on the part of machines, or any kind of intelligence; it is just data processing.
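As a minimal sketch of that kind of data processing, the Dublin Core-style fields above can be held in a simple lookup structure and queried by field name. The names records and who_wrote are hypothetical, invented for this illustration; no real catalog system or library API is implied.

```python
# Hypothetical illustration: field-coded records answer "Who wrote X?"
# by plain lookup, with no understanding of the text involved.
records = [
    {"Creator": "Herman Melville", "Title": "Moby Dick"},
]


def who_wrote(title, catalog):
    """Return the Creator value of every record whose Title matches."""
    return [rec["Creator"] for rec in catalog if rec.get("Title") == title]


print(who_wrote("Moby Dick", records))
```

The same question asked of the uncoded sentence "Herman Melville wrote the book Moby Dick" would require the program to guess which words are the name and which are the title.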

The Semantic Web manages meaning in a machine-actionable way by expressing all concepts as "triples": a subject, a predicate, and an object. A triple is like a very basic sentence, and in its simplest form our statement would read:

   subject (Herman Melville)     predicate (is author of)      object (Moby Dick)
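One way to picture how software works with such statements is to store each triple as a three-part tuple and match on whichever parts are known. The sketch below is only an illustration under that assumption: the function name match and the two extra triples are made up for the example, and real Semantic Web software uses identifiers rather than plain strings.

```python
# Each statement is a (subject, predicate, object) tuple.
# The second and third triples are added purely for illustration.
triples = [
    ("Herman Melville", "is author of", "Moby Dick"),
    ("Herman Melville", "is author of", "Typee"),
    ("Moby Dick", "is a", "book"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Who is the author of Moby Dick?" leaves only the subject open.
authors = [s for (s, _, _) in match(triples, predicate="is author of", obj="Moby Dick")]
print(authors)
```

Leaving a different part open asks a different question: fixing only the subject lists everything the store says about Herman Melville.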

Actual Semantic Web statements are much more complex than this under the hood. They use the Resource Description Framework (RDF) [3] and are often coded in XML. They make extensive use of Uniform Resource Identifiers (URIs), which are essential for machine processing but very unfriendly to a human reader. From Wikipedia, [4] here is an RDF expression of the statement "the postal abbreviation for New York is NY":

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:terms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:x-states:New%20York">
    <terms:alternative>NY</terms:alternative>
  </rdf:Description>
</rdf:RDF>

This is computer code and is not intended to be human-readable. It illustrates, however, that concepts can be made machine-actionable using this technique. What we have to imagine are the information services that could be offered if the information in the many documents on the World Wide Web could be used in a meaningful way.
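To show that the RDF expression above really does reduce to a subject, a predicate, and an object, here is a rough sketch that reads it with Python's standard xml.etree.ElementTree module. It handles only this simple shape (an rdf:Description with an rdf:about attribute and plain-text properties) and is not a general RDF/XML parser; the function name extract_triples is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:terms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:x-states:New%20York">
    <terms:alternative>NY</terms:alternative>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Read simple rdf:Description blocks into (subject, predicate, object)."""
    root = ET.fromstring(xml_text)
    found = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes tags as "{namespace}local"; joining the two
            # parts gives the full URI of the predicate.
            predicate = prop.tag.replace("{", "").replace("}", "")
            found.append((subject, predicate, (prop.text or "").strip()))
    return found


print(extract_triples(rdf_xml))
```

The subject and predicate come out as URIs and only the object, "NY", is a plain string, which is why the raw form is so much harder on a human reader than the simple sentence it encodes.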

Inferring Meaning

One of the common objections to the Semantic Web idea is that it just isn't possible to consider semantically coding the many billions of items on the Web. This is not a minor point. If the implementation of the Semantic Web depends on the manual coding of all past, current and future texts, it is doomed to failure. Instead, one can take advantage of the regularity of some patterns of data. In addition, inferences can be made from similarities between coded data and plain texts.

The use of patterns to add semantics to text exists already in some software. For example, Microsoft Office detects patterns that indicate addresses, dates, place names and the names of persons. When the software encounters a string that may be a personal name it can offer to look up this name in the user's email address book and open an empty email form addressed to that person. A possible geographical name could be checked against a web-based gazetteer, and maps, directions, or related photographs could be offered. The automated assignment of meaning to texts or portions of texts can be facilitated by the existence of coded vocabularies or "ontologies" (structured vocabularies, like thesauri). Thus, if there exists on the Web a machine-actionable vocabulary that states that "dog" is a sub-type of "mammal," then software can infer that texts with the term "dog" may be relevant to a search on "mammals." Over time, just as the Web increases in value as the quantity of resources increases, a qualitative increase in value should be expected as more semantic connections are added.
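The "dog" and "mammal" inference can be sketched in a few lines, assuming a toy vocabulary in which each term points to its broader term. Everything here is hypothetical: the broader table (including the added "mammal" to "animal" link), the function names, and the crude word splitting are stand-ins for what a real ontology and text-analysis tool would supply.

```python
# Hypothetical vocabulary: each term maps to its broader term.
broader = {
    "dog": "mammal",
    "mammal": "animal",
}


def ancestors(term, vocabulary):
    """Return every broader term reachable from term, nearest first."""
    chain = []
    seen = {term}
    while term in vocabulary and vocabulary[term] not in seen:
        term = vocabulary[term]
        seen.add(term)
        chain.append(term)
    return chain


def may_be_relevant(text, query_term, vocabulary):
    """True if the text mentions the query term or any narrower term."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return any(
        w == query_term or query_term in ancestors(w, vocabulary)
        for w in words
    )


print(may_be_relevant("The dog barked at the mail carrier.", "mammal", broader))
```

The text never contains the word "mammal"; the match comes entirely from the coded vocabulary, and each new link added to that vocabulary widens what can be inferred.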

Semantic Web Adoption

Unfortunately, the Semantic Web has not come into being as rapidly as hoped. There is considerable standards development activity taking place through the World Wide Web Consortium, but actual use of Semantic Web techniques is rare. Part of the problem is the complexity of the underlying model, and the fact that as yet we do not have applications that sufficiently hide that complexity. Another part of the problem, however, is that the Semantic Web effort itself is lacking in user-friendliness.

To begin with, the Semantic Web is expressed in language that is often unique to the Semantic Web activity. One example is the use of the term "ontologies" (which in philosophy means "the study of being") to refer to what is more commonly known as thesauri. To keep the language of the Semantic Web neutral in terms of data types, rather than talking about "articles" or "authors" or "images" the Semantic Web documents use the terms "resource" or "entity" for all things that can be targeted. These concepts are so abstract in some contexts that it can be hard to understand what is meant by them.

The Semantic Web relies heavily on formal identifiers for the elements of its meaningful statements. So a term like Dublin Core's "title" becomes "http://purl.org/dc/elements/1.1/title," and the MARC21 role term "translator" becomes "http://www.loc.gov/loc.terms/relators/TRL." The meaning of these fields doesn't change, but the underlying format, which should be hidden from view behind a human-friendly interface, is quite foreign to most readers. This makes it difficult to read and understand the Semantic Web standards documents. It is perhaps because of this learning curve that libraries have not been more involved in Semantic Web development, even though we hold data that could provide a very useful basis for many Semantic Web activities. It's clear that the Semantic Web is somewhat stuck in "engineer mode." The documents on the Semantic Web site develop concepts, set rules, and illustrate code. But even the most basic explanatory document, the RDF Primer, [6] lacks examples of what services could be provided and how they might look to a user of the Semantic Web.
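The relationship between the familiar field names and their formal identifiers is mechanical, which is why an interface could hide it. The following sketch assumes a small table of namespace prefixes; the two URIs are the ones quoted above, while the short prefixes "dc" and "relators" and the function names expand and compact are choices made for this illustration only.

```python
# Namespace prefixes; the base URIs are taken from the identifiers quoted above.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "relators": "http://www.loc.gov/loc.terms/relators/",
}


def expand(label):
    """Turn a short 'prefix:name' label into its full identifier."""
    prefix, _, name = label.partition(":")
    return PREFIXES[prefix] + name


def compact(uri):
    """Turn a full identifier back into 'prefix:name' where a prefix is known."""
    for prefix, base in PREFIXES.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


print(expand("dc:title"))
print(compact("http://www.loc.gov/loc.terms/relators/TRL"))
```

A display layer built this way could show a cataloger "title" or "translator" while storing and exchanging the full identifiers underneath.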

The Semantic Web and Libraries

What does the Semantic Web have to do with libraries? Libraries are in a unique position to take advantage of the Semantic Web; they have a huge store of semantically coded public information in their library catalogs. To become part of the emerging Semantic Web, the library community needs to do two things. First, libraries' bibliographic data needs to be re-formatted from its current machine-readable structure into a Semantic Web-compatible format. Second, library data must no longer be closed away in databases, but must reside on the Web where it can interact with other Web resources. This requires a significant change to the form and the storage of library data. Work is taking place to create the first Semantic Web interpretation of standard library data elements [7] in a joint effort between the Joint Steering Committee for RDA [8] and the Dublin Core Metadata Initiative. [9]

While it is true that the Semantic Web is not positioned to take advantage of library data today, the availability of that data in a compatible form could be a stimulus for the development of friendly Semantic Web applications around a very common and popular format: the book. With the increased availability of books online through the various library digitization projects, the book is finally becoming visible on the Web. Shouldn't the library data about those books be available as well? That's a goal we should work toward.


[1] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web. Scientific American Magazine, May, 2001. http://www.sciam.com/article.cfm?id=the-semantic-web (Accessed February 14, 2008)
[2] Nigel Shadbolt, Wendy Hall, Tim Berners-Lee. The Semantic Web Revisited. IEEE Intelligent Systems. May/June 2006. p. 96
[3] http://www.w3.org/RDF
[4] Resource Description Framework http://en.wikipedia.org/wiki/Resource_Description_Framework (Accessed February 15, 2008)
[5] W3C Semantic Web Activity http://www.w3.org/2001/sw/
[6] Frank Manola, Eric Miller. RDF Primer. February, 2004. http://www.w3.org/TR/rdf-primer/ (Accessed February 15, 2008)
[7] Dublin Core Metadata Initiative: DCMI/RDA Task Group. http://dublincore.org/dcmirdataskgroup/FrontPage
[8] Joint Steering Committee for Development of RDA (Resource Description and Access). http://www.collectionscanada.gc.ca/jsc/rda.html
[9] http://dublincore.org/