Questions about Linked Data
- with a bit of a bent toward libraries and librarians
Q: What is the easiest, cheapest way to implement linked data?
The easiest, cheapest way today is probably to add schema.org markup to some fields on your page. Schema.org is a way to add meaningful metadata to your web page. As an example, your web page may display this:
Author: Melville, Herman, 1819-1891.
To a search engine or other software, this looks like:
<TR> <TH>Author:</TH> <TD> Melville, Herman, 1819-1891. </TD> </TR>
There is nothing there to tell a bit of software what that means. Like the Gary Larson cartoon about "what dogs hear," to a search engine this looks like:
<TR> <TH>blah blah</TH> <TD> blah blah blah blah</TD> </TR>
What you do to fix this is add "microdata" to your web display that tells the search engine that this is an author:
<TR> <TH>Author:</TH> <TD> <span itemprop="author">Melville, Herman, 1819-1891. </span> </TD> </TR>
However, what linked data really needs is identifiers: your "Melville, Herman, 1819-1891." has to carry a standard identifier that others also understand.
Q: How can we use existing data that is out in the "linked data cloud"?
The diagram in the OpenAgris documentation is a good visual of what one needs to do. The essential element of using linked data is to access data that has been made available with what is called a "SPARQL end-point." SPARQL is the query language that allows you to search and retrieve linked data. It is similar to SQL but is designed for the structure ("triples") of linked data.
While SPARQL can be used to search any linked data, the most efficient use is when your data contains an identifier (a "URI") that can also be found in another database. For example, if your data has the VIAF identifier for a person, you can use SPARQL to search for information about that same person in Wikipedia using the VIAF identifier. Not all persons with VIAF identifiers can be found in Wikipedia, and the addition of VIAF identifiers to Wikipedia is not complete, but when a match is found it is then possible to select structured data from Wikipedia for display or other use.
One example of this type of use is the experimental Facebook app that makes use of WorldCat data. I did a short video showing some examples of this. Another example is OpenAgris. Here is a record that brings up data from a number of different sources. Note that OpenAgris, an FAO project, pulls some of its data from other FAO databases, such as its GeoPolitical ontology. Linked data does not have to come from an outside source, but can be used to link between diverse resources in a single organization.
Q: How do you extend a vocabulary when you need a term that does not exist in that vocabulary?
First, it's important to understand that each entry in each vocabulary is owned by someone, and that someone can be identified as the owner of the domain name that forms the basis of the URI. You extend a vocabulary by creating new and preferably related terms within your own domain. You cannot create a term with a URI using a domain that you do not own.
As an example, let's say that there is a vocabulary with three terms:
http://www.fred.com/vocabulary.owl :
http://www.fred.com/vocabulary.owl#flightnumber
http://www.fred.com/vocabulary.owl#carrier
http://www.fred.com/vocabulary.owl#destination
You need to add the origin of the flight. To do this, you create a new vocabulary, with your domain:
http://www.mary.com/vocabulary.owl#
In this vocabulary you include all of the members of Fred's Vocabulary, plus any other terms you wish to use.
http://www.mary.com/vocabulary.owl :
include: http://www.fred.com/vocabulary.owl#
http://www.mary.com/vocabulary.owl#origin

This gives you these vocabulary elements to use:

http://www.mary.com/vocabulary.owl :
http://www.fred.com/vocabulary.owl#flightnumber
http://www.fred.com/vocabulary.owl#carrier
http://www.fred.com/vocabulary.owl#destination
http://www.mary.com/vocabulary.owl#origin

In RDF/XML, the include statement looks like:
<owl:Ontology rdf:about="http://www.mary.com/vocabulary.owl">
  <owl:imports rdf:resource="http://www.fred.com/vocabulary.owl"/>
</owl:Ontology>
That is an example of adding a term at the same conceptual level. You can also add terms that are conceptually broader or narrower than existing ones. Let's say that you want to add more specific destinations: domestic and international. And we'll also say that in this case you are not interested in flight numbers (you can import Fred's whole vocabulary and simply not use that term). You can therefore create a vocabulary that looks like this:
http://www.mary.com/vocabulary.owl :
include: http://www.fred.com/vocabulary.owl
http://www.mary.com/vocabulary.owl#origin
http://www.mary.com/vocabulary.owl#domestic subPropertyOf http://www.fred.com/vocabulary.owl#destination
http://www.mary.com/vocabulary.owl#international subPropertyOf http://www.fred.com/vocabulary.owl#destination
There are now relationships between Fred's destination and Mary's domestic and international.
Q: How do you publish an ontology?
The actual publication of an ontology consists of making it available at the URI of the vocabulary. Information about the vocabulary and its member classes and properties makes the vocabulary more usable. At a minimum this should include the domain and range of each property (rdfs:domain and rdfs:range) and any relationships with other terms. To facilitate use, human-readable labels, definitions, and any other explanatory information should be included.
Q: How can you embed linked data in web pages using RDFa?
First, see http://schema.org for a common vocabulary for embedding data in web pages. RDFa ("RDF in attributes") lets you add that vocabulary to your existing HTML markup through a handful of attributes; the simplified profile, RDFa Lite, needs little more than vocab, typeof, and property.
Q: What's the difference between microformats and RDFa?
In practice the comparison is usually between microdata (the syntax used in the schema.org example above) and RDFa "Lite," and the difference between those two is minimal; microformats proper (such as hCard) use their own fixed vocabularies and a different syntax. Here are two examples, the first in microdata, the second in RDFa Lite:
Q: How do you harvest linked data from other sources?
Q: What are the different serializations of linked data?
N3 (aka: Notation3) - A simple notation with human-readability in mind. Compatible with Turtle.
<#pat> <#knows> <#jo> .
<#pat> <#age> 24 .
Turtle (aka: Terse RDF Triple Language) - Represents triples, but also allows one to simplify the display by avoiding repetition of subjects (or subjects and predicates). It uses qnames to shorten the URIs. This means that instead of writing:
<http://example.com/1234> <http://purl.org/dc/terms/title> "Moby Dick" .
<http://example.com/1234> <http://purl.org/dc/terms/creator> <http://viaf.org/viaf/27068555/> .
<http://example.com/1234> <http://purl.org/dc/terms/date> "1851" .
(And imagine that going on for many more lines)
You can write this:
@prefix ex: <http://example.com/> .
@prefix dct: <http://purl.org/dc/terms/> .
ex:1234
    dct:title "Moby Dick" ;
    dct:creator <http://viaf.org/viaf/27068555/> ;
    dct:date "1851" .
RDF/XML - An encoding of RDF graphs in XML. If you are familiar with XML, RDF/XML will look very familiar, but that familiarity holds some traps, because there are non-obvious but significant factors that make RDF/XML unique. In particular, a technique called "striping" is used to represent graphs in XML. The caution here is that non-RDF XML and RDF/XML are different. Here is a short RDF/XML document from the W3C specification:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://example.org/stuff/1.0/">
  <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar"
                   dc:title="RDF/XML Syntax Specification (Revised)">
    <ex:editor>
      <rdf:Description ex:fullName="Dave Beckett">
        <ex:homePage rdf:resource="http://purl.org/net/dajobe/" />
      </rdf:Description>
    </ex:editor>
  </rdf:Description>
</rdf:RDF>
The W3C has an RDF/XML validator that will validate your code, return the data as triples, and/or draw a graph of your RDF.
JSON-LD - A way to encode linked data in JSON. New, but likely to become a very popular serialization for web services and other applications that currently use JSON. Like RDF/XML, JSON-LD is not "plain old JSON" so read the documentation. The JSON-LD site provides a web-based viewer and debugger called "The Playground." A simple example:
{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
Q: How do you establish relationships using RDFs and OWL?
Q: A demonstration of discovery of new "answers" using queries
Q: How can a web application interface with linked data?
Q: Do you need a triple store DBMS or can you use an RDBMS?
Many developers store triples in an RDBMS because that is what they know best or what they have available. If you use an RDBMS to store your triples, you must store them as triples (one subject-predicate-object statement per row), not decompose them into conventional relational tables.
Q: Do you need triples, or can you store data in XML and convert to triples as needed?
In the end, all RDF data is made up of triples. How these are serialized at any moment in time should not change that fact. There is an XML serialization of RDF known as RDF/XML. This has specific requirements that are necessary to maintain the triple structure, in particular the concept of striping. You can test your RDF/XML using the W3C RDF Validation service.
All serializations of RDF can be re-serialized into one another - in other words, they carry the same content. RDF/XML can be re-serialized to Turtle or to triples (N3), and Turtle and N3 can be serialized to RDF/XML. Which serialization you use will probably depend on the tools available to you, as well as your own comfort with the different serializations. Be aware, however, that RDF/XML differs from the XML you are currently accustomed to, and you cannot rely on some common XML thinking when working in RDF. Many RDF experts consider Turtle and N3 to be closer to the triple structure of RDF, so it is a good idea to begin exploring those serializations as early as possible in your learning.
Q: When should you query in real time, and when should you cache the data?
There are no hard and fast rules for this, but a certain amount can be determined through experience and common sense. Remember that some of your data sources may be internal to your own system. An implementation of BIBFRAME could store bf:Works and bf:Instances in separate locations and link them internally; or it could have a database of reviews and tables of contents that is queried by the primary bibliographic database. The technique for querying these sources could be identical to the technique for querying external sources. (Try retrieving something by Malmsten on LIBRIS: there is no visible difference between internal and external sources.)
On the other hand, some of your sources may be external. Then you need to think about the function of those sources and the risk you take of a real-time failure. Let's say that your bibliographic database enhances its displays with reviews and tables of contents that are not stored locally. If your experience is that your success rate in retrieving these resources is 99%, then you face the small risk that occasionally they will not display. Since you will not have this data for all bibliographic resources in your database, the lack of this enhancement will have little impact on users. If, instead, your external data is essential for retrieval and for the coherence of the display, then even a 1% failure rate is intolerable.
You will most likely store copies of any data that is included in your system search capabilities. This is not unlike today's treatment of authority data, which exists in a central, shared file, but specific entries are copied to local catalogs. The entries in the shared Name Authority files are updated through system functions that notify users of corrections to the central file. This same type of storage, update and notification would be expected regardless of the data format being used. The propagation of updates is common on the Web, with the primary example being the Domain Name System.
Q: Other than SPARQL, how do you search linked data?
I don't know of any way to search linked data other than SPARQL. There are some services that intend to act as search engines (e.g., Swoogle), but they exist mainly to help you discover linked data resources, not, as far as I can tell, to actually use them as linked data.
Q: What is the difference between classes and properties?
Q: What are some acceptable substitutes for OWL:sameAs?
Q: Need some advanced SPARQL training
Q: OWL and Inferencing and Validation
Imagine that you have a robot and you want to teach it tasks that would be useful, such as "get me a chair." This is a typical artificial intelligence problem, and it is harder than it seems. First, the robot has to learn to identify something that is called a "chair." (I'm going to assume that the robot has vision and that you have some way to communicate the query "get me a chair" to it.) Since the robot needs to operate in different environments and contexts, you can't teach it just about one single chair; it has to be able to determine "chairness" in its immediate surroundings.
The artificial intelligence solution to this is to give the robot information that it can then apply to the activities of that moment. Whereas somehow (and we do not exactly know how) humans have a generalized concept of "chair," the robot is going to need a set of rules that it can process to determine what is and what isn't a chair. Let's say that the robot has been programmed with this formula:
A chair:
- has four legs
- has a flat, horizontal surface on those four legs, called a "seat"
- the legs rest on the floor
- has a vertical surface rising up from part of the seat.
The robot has also learned about a "stool":
- has three legs
- has a flat, horizontal surface on those three legs, called a "seat"
- the legs rest on the floor
- has no vertical surface rising up from part of the seat
Right away we humans can see that these definitions do not cover all possibilities of "chair" and "stool." But for the purpose of getting a robot to perform a task, we are going to apply the "80/20" rule.
We set the robot down in a house and say: "Get me a chair." The robot is going to look at all of the things within its vision and compare those with the rule set. Hopefully, it will select a chair.
Next we teach the robot that when a person wants to sit down, it needs to sit on a chair. If there is no chair available, a stool will do. Then we tell the robot: "I want to sit down." The robot has a rule that "sit down" requires a chair, and looks for a chair. If the robot does not find a chair, then it looks for a stool. If it finds either, it brings it to us. (Admittedly, although not with chairs, a dog or a monkey could learn this lesson and probably be able to apply more flexibility in thinking. A cat, of course, would only reply, "You MUST be kidding.")
This is the kind of inferencing that has influenced the development of the Semantic Web. In fact, much of the early work on the Semantic Web came directly out of the work of the artificial intelligence community. The "semantic" of the Semantic Web relates specifically to a branch of mathematics that deals with formal rules, rules like:
if A = B, and B = C, then A = C
In fact, most rules are much, much more complex than that.
On the Semantic Web, inferencing allows you to have huge amounts of interrelated factual statements (A = B) and to derive new information from them by following rules. A simple case is if you define a rule like:
If A is child of B, then B is parent of A
With the statement: "Tom is the child of Mary," through inferencing you can also know that "Mary is the parent of Tom", and therefore you can answer the question: "Who is the parent of Tom?" (Answer: Mary)
Reasoners, the programs that apply inference rules to the data, can make use of longer chains of relationships than that example. A next step might be to add that a parent is the responsible party for a child. If you have a form that must be signed by a responsible party, then you can ask: "Who is the responsible party who must sign the form for Tom?" The reasoner can traverse the (formal) logic and conclude that for Tom the responsible party is Mary, because a parent is a responsible party and Mary is Tom's parent.
You can declare that a child can have as many as two parents and/or two guardians, that both parents and guardians are responsible parties, and include the statement "John is the guardian of Tom" in your data. Then when you ask who can sign the form, your answer will be "Mary or John."
Note that nothing in this has any enforcement function over your data. You can have one, two, or many statements that say "X is the parent of Y." What the reasoner will do with this depends on the details of how your reasoner operates. It may ignore data that is inconsistent with the rules it is applying. It may return a message that the data does not allow it to make a clear decision, just as the robot may conclude "I don't see a chair here" if the room has only a beanbag chair or an office chair on a shaft with five wheels, but nothing with four legs. The room may also have some other piece of furniture with four legs, a horizontal piece, and a vertical piece (maybe a vanity table?) that the robot would mistake for a chair. With data in a world-wide graph we probably have an easier job than trying to train a robot to understand the real world, but our results may be more approximate than precise.
This is the functionality that OWL provides: given some data and some rules, it allows you to make inferences based on that data and those rules. Now, what should you do with that? When developing an ontology you need to think about what inferences you wish to make. Will you need to answer questions like "Is John a man?" based on information such as "All fathers are men; John is a father"? (A sketch of that rule appears below.) You should not create OWL rules where there are no inferencing needs. The reason to keep your OWL definitions "light-weight" is that OWL definitions affect the semantics of your classes and properties in the open world. This means they affect everyone's use of your data in the cloud. Even you may have more than one application operating on the data, and those applications may have different requirements. The less restrictive your OWL definitions, the more likely it is that different applications will be able to operate on the same data.
Also, remember that the graph grows, and something that may be true at the first moment of metadata creation, for example, may not be true when your graph combines with other graphs. So you may say that there is one and only one main author to a work title, but that means one and only one URI. If your data combines with data from another source, and that source has used a different author URI, then what should happen? Each OWL rule makes a statement about a supposed reality, yet you may not have much control over that reality. Fewer rules ("least ontological commitment") means more possibilities for re-use and re-combining of your data; more rules makes it very hard for your data to play well in the world graph.
There are cases where OWL increases the flexibility and scope of your properties and classes, in particular declaring sub-class/sub-property relations. If we say that RDA:titleProper is a subproperty of dct:title then anyone who "knows" dct:title can make use of RDA:titleProper. This creates a situation where anyone can infer that RDA:titleProper is a dct:title, and make use of the former anywhere that the latter is appropriate.
Q: Care and feeding of triple stores
Q: Some measures and costs: server capacity, disk space, other tangibles
Q: What are the open source and commercial applications that people are using?
Q: Is there backward compatibility between linked data and XML?
In some rare instances there may be, but it is unlikely. Although you can serialize linked data as RDF/XML, the data must be designed according to the rules of RDF. Those rules include the requirement that all data can be reduced to a set of triples representing a subject, an object, and a relationship between them. The subject and the relationship must be expressed as URIs; the object may be expressed as a URI or a literal. The URIs should resolve to information about the thing or relationship that the URI represents.
Q: Is there development software that provides a user interface that hides the programming details?
Q: What tutorials exist?
Q: How do you find vocabularies to use?
The Linked Open Vocabularies project has the largest and best organized set of vocabularies. It includes only vocabularies that have at least one connection to another vocabulary -- that is, it does not include vocabularies that are unused or orphaned.

