by Karen Coyle
Column: Managing Technology
PREPRINT. Managing Technology column, Journal of Academic Librarianship v. 34, n. 5, 2008. pp 452-453
Throughout history, librarians have managed "sameness" in various aspects of the library's service. The most obvious use of sameness is in determining when two documents represent the same text. Library cataloging is designed to facilitate the identification of works from the metadata record. It also concerns itself with identifying the same author of different works, and works that are on the same topic. Sameness can also be a broad measure, such as when public libraries organize their physical shelves using the same genre ("mysteries") or audience ("young adult"). There are also measures of sameness that libraries have not traditionally provided, such as identifying texts that have citations to other texts in common.
Determining sameness today is a task that simply cannot be done using manual methods. Not only is our universe of works extremely large, but an increasing amount of it cannot be held in our hands, and much of that latter category is stored outside the library as remote digital material. Added to that, unlike hard-copy works, which are reproduced by a relatively small set of publishers, digital works "in the wild" can be reproduced in any number of copies and in uncontrolled environments. The question of determining sameness has become many times more complex than it once was, and it needs a technology solution.
When determining sameness we are usually working with metadata that represents the published work, and metadata is but a limited surrogate for the resource itself. If determining sameness were just a matter of finding two identical records, the task would be simple, but in fact rarely are two metadata records identical. OCLC's database is the largest implementation of metadata de-duplication in the library world, bringing together records from thousands of different sources. Complaints about the degree of duplication in OCLC's WorldCat are really statements about variations in the descriptions of the items, which make an algorithmic determination of sameness difficult. This imprecision means that the answer to the question "do these two records represent the same item?" could be yes, no, or maybe. A key factor in how we make decisions based on sameness is what we will do with the "maybe" case.
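The yes/no/maybe outcome can be sketched as a pair of thresholds applied to a pairwise similarity score. This is a minimal illustration, not the logic of any particular system; the threshold values here are invented for the example.

```python
def match_decision(score, yes_threshold=0.95, maybe_threshold=0.80):
    """Classify a pairwise similarity score (0.0 to 1.0) into a
    three-way match decision.  The thresholds are illustrative only."""
    if score >= yes_threshold:
        return "yes"    # confident duplicate: safe to merge
    if score >= maybe_threshold:
        return "maybe"  # gray area: flag for review, or expose to the user
    return "no"         # treat the records as distinct
```

Where the two thresholds sit embodies the design decision discussed below: a system that keeps only one record per set must place the "yes" bar very high, because a false merge cannot be undone.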
Using the familiar OCLC database as one example, the key design factor was that OCLC retained only a single record for each set of "same" records that were identified. This means that the one retained OCLC record for the set masked any visible differences between the contributed records. In a situation like this where the system keeps only one record, mismatches are essentially fatal because there is no way to look into the set for variations that might represent a different publication. It is therefore better to treat the "maybe" cases as "no" so that there is less risk of throwing away a record for a unique resource. However, this decision means that you will undoubtedly present your users with some records that they will consider duplicates.
An entirely different approach to sameness is that used by Google. In a Google search, Google will let you know that it has not shown you certain results that it deemed to be very similar to the ones displayed. Instead the user is given the option to reveal these if desired. The risk of masking useful results with this functionality is relatively low, and measures of sameness can be less precise because they can be overridden by the user.
Of course, it's not fair to compare Google and OCLC; they are very different services with different functions. But I mention the two approaches because they represent different points in the evolution of technology. The need for OCLC to reduce its database to a single version of a record for each cataloged item undoubtedly had to do with the availability of computing resources and the need to provide the most efficient results for the catalogers using the database. Google's world is one of seemingly unlimited storage, and their user service view is less concerned with precise retrievals than in ranking within a broad retrieved set. What Google's approach can show us, however, is that we have the option to make different decisions today that can give users some control over how to deal with the gray area between "same" and "different." This approach acknowledges that sameness is not an absolute, but a matter of degree.
The recent efforts to "FRBR-ize" databases [1] have brought up the concept of managing sameness in the context of bibliographic databases and for a particular definition of "work," bringing together all of the various editions and reprintings of the same text. This view of sameness allows us to expand searches for a particular printing to others that might be equivalent for the user, in addition to providing a de-duplicated set. Beginning with a view of a particular published item, the user can potentially request to view other members of the work set.[2] This view of sameness can expand beyond the immediate catalog to other data stores, so that an interlibrary loan request can be stated in terms of: user seeks this item or any other in the same work set. This of course requires that we have a shared definition of the work set, and OCLC is currently defining that in its xISBN [3] and xOCLCNUM [4] services, which return the identifiers for the items in a work set based on the identifier for any member of the set.
Where once libraries dealt with monographic items and serials at a title level, we are now involved in serving individual articles to our users through a variety of sources: indexing vendor systems, like EBSCO and ProQuest; community collections, like JSTOR and HighWire Press; the many repositories of articles and preprints; and the World Wide Web. There is duplication between these not only in that they may provide access to the same published work, but they may also provide access to other versions of the same work, such as the preprint version.
It falls to metasearch engines to avoid presenting the user with the same article many times over in a retrieved set. The crux of the problem, however, is how do you define "same"? In an environment where all of your metadata is produced by one community following a community standard, you can rely on particular fields to have much in common, with minor variations. But when you begin mixing together metadata from a variety of communities with no standard in common, the question of sameness becomes more complex. Complex for algorithms, that is. It can be the case that something that is obviously the same to a human is much less obviously so to a computer.
It may surprise most readers to learn that key bibliographic elements like authors and titles are in many cases not the best indicators of sameness, especially when the metadata has not been created based on a single standard. In the case of journal articles, the identifying elements are not the first that come to mind when we think of "bibliographic data." Instead, the best measure of sameness between articles uses the journal title or ISSN, the volume and number, and the starting page. In other words, a machine algorithm may use very different data elements to determine sameness than those used effectively by humans.
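The elements above can be combined into a normalized comparison key. The sketch below is illustrative only: the field names and the sample values (including the ISSN) are invented, and a production matcher would need more normalization than this.

```python
def article_match_key(record):
    """Build a comparison key for a journal article from the fields
    that travel best across metadata communities: journal identifier
    (ISSN), volume, issue, and starting page.  Titles and author
    names, which vary widely in transcription, are deliberately
    ignored.  Field names here are illustrative, not a real schema."""
    issn = str(record.get("issn", "")).replace("-", "").upper()
    return (issn,
            str(record.get("volume", "")).strip(),
            str(record.get("issue", "")).strip(),
            str(record.get("start_page", "")).strip().lstrip("0"))

# Two records whose titles were transcribed differently still yield
# the same key, so a metasearch engine can group them as duplicates.
a = {"issn": "1234-5678", "volume": 34, "issue": 5,
     "start_page": "452", "title": "Managing sameness"}
b = {"issn": "12345678", "volume": "34", "issue": "5",
     "start_page": "0452", "title": "Managing Sameness [column]"}
```

Note that the key ignores exactly the fields a human would look at first; that is the point of the paragraph above.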
If we can accept that sameness can be measured even when traditional bibliographic data isn't available, we can consider ways to make use of non-library created metadata. Already libraries link to outside services like Amazon for book covers, but generally such linking is limited to items with the same ISBN. To go beyond that one would have to confront the fact that the metadata in Amazon is not the same as that created using library cataloging. That doesn't mean that they have nothing in common, nor that they cannot be compared for sameness, especially if our definition of "same" does not require that all of the relevant metadata be identical.
The creators of the Open Library [5] at the Internet Archive do not come from a traditional library background. When they set out to create a Web site that is "one page per book, for every book ever printed" they did not consider limiting themselves to metadata produced by libraries. Instead they have created a combined database with library data and data from the online bookseller, Amazon. De-duplicating between Amazon data and library data is undoubtedly less accurate than de-duplicating between libraries. Even more interesting is the work to link the Amazon authors with the authors in the library data.[6] Author matching is difficult mainly because of the nature of library name headings, which do not match the form of the author's name found on the book itself (e.g., the library's form "Tolkien, J. R. R. (John Ronald Reuel)" vs. the title page's "J. R. R. Tolkien"). However, with some manipulation of the strings to be compared, and in a context where the book records already match to a high degree, it is possible to come up with a reasonable determination of sameness between authors, probably the most difficult matching problem that we have.
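The "manipulation of the strings" for the Tolkien example can be as simple as stripping the heading's parenthetical expansion and reinverting the "Last, First" order before comparing. This is a rough sketch of that idea, not the Open Library's actual matching code.

```python
import re

def normalize_heading(name):
    """Reduce a library name heading and a title-page name to a
    common comparison form: drop any parenthetical expansion,
    reinvert 'Last, First' order, and fold case and spacing.
    A sketch only; real headings have many more variations
    (dates, titles of nobility, etc.) than this handles."""
    name = re.sub(r"\s*\([^)]*\)", "", name)       # drop "(John Ronald Reuel)"
    if "," in name:
        last, _, rest = name.partition(",")
        name = f"{rest.strip()} {last.strip()}"    # "Tolkien, J. R. R." -> "J. R. R. Tolkien"
    return " ".join(name.split()).lower()
```

With both forms normalized, an exact comparison (or a string-similarity measure, as below) can then do the rest, especially when the surrounding book records already agree.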
Libraries are hardly the first to discover the need to determine sameness in their data. The Soundex indexing system used by the U.S. Census was developed in the early 1900s to locate names that could sound the same when spoken.[7] Today there are more sophisticated measures of sameness in business and science that are probabilistic in nature; items A and B can be considered to be 80% the same or 60% the same. The problem is often called "record linkage" or "duplicate record detection."[8] These measures are designed to work on written text, rather than using the phonetic basis of Soundex. For example, there is the Jaro-Winkler similarity measure between strings.[9] This technique yields a score between 0 (nothing in common) and 1 (identical) that represents how similar two written strings are, such as "Dixon" and "Dickson." Presumably, applying such a measure to a rich metadata record would produce a good measure of sameness, although once again there is the issue of determining the cut-off between "yes" and "maybe."
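The Jaro-Winkler computation itself is short enough to show. This is a minimal, unoptimized sketch of the standard algorithm: count characters that match within a window, penalize transpositions, then boost the score for a shared prefix of up to four characters.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no match."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    t = sum(a != b for a, b in zip(
        (c for i, c in enumerate(s1) if m1[i]),
        (c for j, c in enumerate(s2) if m2[j]))) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Winkler's refinement: boost the Jaro score for strings that
    share a common prefix (up to 4 characters), since names that
    begin the same way are more likely to refer to the same entity."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

This implementation scores "DIXON" against "DICKSON" at roughly 0.83; whether 0.83 counts as "yes," "maybe," or "no" is exactly the cut-off question that the algorithm cannot answer for us.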
All of these techniques are designed on the assumption that the data representing the same person or item will not match exactly. A company that specializes in data matching, Netrics, [10] has the motto: "Making Imperfect Data Perfectly Usable." Imperfect data is indeed the problem, and it is also the norm. Use of such statistical measures may make it easier to combine library data in applications with data from other sources, because this approach is less concerned with finding identical or nearly identical data in key fields than with determining an overall measure of sameness. These measures may be of use for traditional de-duplication of result sets, but the most exciting possibility in my mind is that of providing the user with "more like this" options for retrieved items. Digital texts could take us even further, into suggested further reading and possible influences.
In the end, "maybe" becomes the most interesting answer to the question: "are these the same?"
[1] http://www.oclc.org/research/projects/frbr/
[2] VTLS, inc. Enriched User Searching; FRBR as the Next Dimension in Meaningful Information Retrieval. http://vtls.com/presentations/Virtua_Enriched_User_Searching.ppt (Accessed July 1, 2008)
[3] http://www.worldcat.org/affiliate/webservices/xisbn/
[4] http://xisbn.worldcat.org/xisbnadmin/xoclcnum/
[5] http://openlibrary.org
[6] http://kcoyle.blogspot.com/2008/05/authors.html
[7] http://www.archives.gov/genealogy/census/soundex.html
[8] Fellegi IP, Sunter AB. A theory for record linkage. Journal of the American Statistical Association 1969;64:1183-1210
[9] http://en.wikipedia.org/wiki/Jaro-Winkler_distance
[10] http://www.netrics.com/
The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.