Machine Indexing

PREPRINT. Managing Technology Column, Journal of Academic Librarianship, v.34 n.6, 2008

By Karen Coyle

One of the ways that libraries excel over the many recently arrived information services is our early discovery and consistent use of controlled vocabularies. Not only do these vocabulary terms gather together disparate terminology ("fiber optics," "optical fibers") but they also overcome language differences ("fiber optics," "fibre optics," "fibre optique," "Faseroptik").

Controlled vocabularies, however, only work within environments that you can control. A library can carefully choose terms and guide its readers to the appropriate catalog entry, but those same readers will find a different set of terminology in each non-library bibliographic database that they visit. When resources are combined for a single search, such as in metasearch situations where the user’s chosen terms are sent to a number of databases simultaneously, the lack of consistency in searchable terms makes it very difficult to formulate a good search strategy.

Controlled vocabularies also come with a cost. Unlike automated keyword indexing of documents, controlled vocabularies require human effort, sometimes a great deal of human effort. We can argue that the effort pays off, but in fact, although we have some great anecdotes that illustrate the value of controlled vocabulary,i we haven’t found a way to measure those benefits. We do, however, know something about the costs because we can get a general idea of the time that indexers and subject catalogers spend on the assignment of terms and headings. It is therefore not surprising that there is an interest in applying automation to subject indexing as a way to reduce labor costs for cataloging and indexing.

Automatic indexing is not a single technology, and it is not uncommon for indexing algorithms to make use of a number of techniques simultaneously. The indexing of text often combines statistical measurements with some applied analysis of language forms. Algorithms can analyze the syntax of sentences and can make use of document structures like titles, headings and paragraphs. When the algorithms are enhanced with a list of target terms, their results have some overlap with the output of human indexers, and this fact is encouraging further research.
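
To make the combination of statistics and a target term list concrete, here is a minimal sketch in Python. It is not the method of any system named in this column. It assumes a simple TF-IDF weighting (how often a word appears in the document, balanced against how many documents in the collection contain it), and the function names, the stopword list and the `boost` parameter are all invented for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(doc, collection, target_terms=(), boost=2.0):
    """Rank the words of `doc` by TF-IDF, boosting words found in `target_terms`."""
    tf = Counter(t for t in tokenize(doc) if t not in STOPWORDS)
    tokenized = [set(tokenize(d)) for d in collection]
    n_docs = len(collection)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many documents in the collection use the word.
        df = sum(1 for tokens in tokenized if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        score = count * idf
        if term in target_terms:
            score *= boost  # the target list nudges the ranking toward known terms
        scores[term] = score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


doc = "Optical fibers carry light. Fibers are made of glass."
collection = [doc, "Glass is made of sand.", "Light is a wave."]
print(score_terms(doc, collection)[:3])
print(score_terms(doc, collection, target_terms={"light"}, boost=3.0)[:3])
```

Without a target list, the word that is frequent in this document and rare elsewhere ("fibers") rises to the top. With "light" on the target list and a large enough boost, that word moves ahead of it, which is the sense in which a list of target terms can steer a purely statistical ranking.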

Automated indexing can also be applied to digital resources that are not textual in nature; this column will, however, focus on the indexing of texts.

Indexing Machine or Machine Indexing?

In the "fiber optics" example above, I was illustrating that a keyword search might miss some documents because of different spellings or different languages used. Another common problem with keyword searching is that it is based on individual words, when in fact many concepts are expressed in more than one word. Controlled vocabularies represent concepts, not words, and many of the concepts in those vocabularies require more than one word: solar power; civil war; childbirth at home. Vocabularies also follow linguistic conventions, so that "indexing machine" and "machine indexing" are distinct concepts.

Two projects that have addressed this problem are the Keyphrase Extraction Algorithm (KEA) of the University of Waikato in New Zealand,ii and iViaiii at the University of California at Riverside. KEA uses a subject thesaurus in the general subject area of the article to select single-word and multi-word thesaurus entries. This use of a controlled vocabulary to guide the algorithmic decision-making is similar to how a controlled vocabulary guides human indexers to the correct subject assignment. The iVia research program in this area is called PhraseRate.iv PhraseRate doesn’t use a controlled vocabulary, but analyzes the text within the context of the document. Where KEA extracts a multi-word term like "ultraviolet light," PhraseRate extracts that two-word term as well as some longer phrases, like "cloud-penetrating ultraviolet light." While these longer phrases may be too long for a normal controlled vocabulary, if they appear frequently they may serve as very short abstracts of a topic covered by a larger article or book.
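
The general idea of letting a thesaurus guide phrase selection can be sketched in a few lines of Python. This is only an illustration of vocabulary-guided matching, not KEA's actual algorithm, and the tiny thesaurus below, including its preferred headings, is hypothetical. The sketch slides a window of one to three words over the text and keeps only the phrases that appear in the thesaurus, mapping each variant to its preferred heading.

```python
import re

# Hypothetical miniature thesaurus: variant phrase -> preferred heading.
THESAURUS = {
    "fiber optics": "Fiber optics",
    "fibre optics": "Fiber optics",
    "optical fibers": "Fiber optics",
    "ultraviolet light": "Ultraviolet radiation",
    "machine indexing": "Automatic indexing",
    "indexing machine": "Indexing machines",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def match_headings(text, thesaurus=THESAURUS, max_words=3):
    """Count preferred headings whose variant phrases occur in the text."""
    tokens = tokenize(text)
    counts = {}
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            # Word order is preserved, so "indexing machine" and
            # "machine indexing" are looked up as different phrases.
            phrase = " ".join(tokens[i:i + n])
            heading = thesaurus.get(phrase)
            if heading is not None:
                counts[heading] = counts.get(heading, 0) + 1
    return counts


print(match_headings("Fibre optics and optical fibers are used in machine indexing hardware."))
print(match_headings("An indexing machine sat in the corner."))
```

Because the lookup works on whole phrases in their original order, the two spellings and the inverted form of "fiber optics" all land on one heading, while "indexing machine" and "machine indexing" stay apart.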

Some of the phrases that these systems derive are not ones that would be selected by human indexers. Examples from PhraseRate are "earth’s insect population in check" and "bee venom drops." Others show an impressive analysis of human speech, like "broad range of public policy issues" or "principals of wiccan belief."

A commercial software package, Extractor,v is marketed as software for automatically summarizing text by generating lists of key words and key sentences. Its key words can be "key phrases." You can test Extractor on any web page to see its results. For me, it did extract some phrases, like "digital rights management," although it primarily gave me single-word terms.

The Machine as Indexer’s Helper

It’s clear from experimentsvii that compare the results of machine-generated indexing with human-generated indexing that we aren’t ready to step back and let the machines take over this task. That doesn’t mean that human indexing has been proven superior to machine indexing. We do know, however, that algorithmic assignments differ from human assignments.

The National Library of Medicine, an institution that indexes over 700,000 articles per year, began experimenting with automated indexing in 2000. Their Medical Text Indexerviii (MTI) is what they call "semi-automatic indexing,"ix that is, algorithmic analysis of a document that suggests possible subject headings to the human indexer.

In studies of the effectiveness of their indexing tool,x NLM found that about half of their indexers were voluntarily using the MTI tool to see suggested terms. Its greatest value seems to be in saving time, because the tool suggests actual MeSHxi subject headings, and indexers can select headings from the MTI list to avoid having to key them. Indexers also frequently use the list to make sure that they haven’t missed a heading in their own analysis. In this case the list serves as a prompt to the human indexer. There was a significant tendency of less experienced indexers to make more use of this prompting function, and those indexers felt that the tool helped them be more efficient.

The MTI tool uses only titles and abstracts, and NLM is looking into the possibility of improving MTI’s results by doing algorithmic indexing on the full text of articles.xii While full text is richer, the difficulty is in identifying the main points of the article in the larger text. Titles and abstracts, by contrast, have been crafted by a person precisely to express the main points of the text. The work to distill the essence of an article is similar to that which is needed to create automated summaries and includes some of the tricks used in determining key phrases.

As of today, the researchers at NLM have been unable to improve on the MTI results that are based on titles and abstracts by using the full text instead. It is intriguing, however, to contemplate the possibility that we will one day be able to assign – or at least suggest – controlled subject headings from full text documents.

Documents Index Documents

As we increase our body of digitized works, we increase our ability to act algorithmically on the world of knowledge. The INFOMINE project at the University of California at Riversidexiii has resulted in a database of about 100,000 digital resources of which over 26,000 were created by librarians and assigned Library of Congress Subject Headings (LCSH). The iVia project at that same institution has been able to experiment with the assignment of LCSH to new documents based on the documents’ similarity to ones that have received human subject indexing.
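
One simple way to picture similarity-based assignment is a nearest-neighbor vote: find the already-indexed documents that most resemble the new one and propose their headings. The Python sketch below assumes word-count vectors and cosine similarity, which may differ from what iVia actually does. The function names and the sample records are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_headings(new_text, indexed_docs, k=2):
    """Suggest headings from the k most similar human-indexed documents."""
    new_vec = vectorize(new_text)
    scored = [(cosine(new_vec, vectorize(d["text"])), d) for d in indexed_docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = {}
    for similarity, doc in scored[:k]:
        if similarity == 0.0:
            continue  # a neighbor with nothing in common gets no vote
        for heading in doc["headings"]:
            votes[heading] = votes.get(heading, 0.0) + similarity
    # Strongest suggestion first; ties are broken alphabetically.
    return [h for h, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical human-indexed records.
indexed = [
    {"text": "solar panels convert sunlight into electric power", "headings": ["Solar energy"]},
    {"text": "solar power plants and sunlight collectors", "headings": ["Solar energy", "Power plants"]},
    {"text": "midwives assist childbirth at home", "headings": ["Childbirth at home"]},
]
print(suggest_headings("new solar collectors capture sunlight for power", indexed))
```

The output is a ranked list of suggestions rather than a final assignment, which fits the semi-automatic approach described earlier, in which the machine proposes headings and a human indexer decides.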

Citation indexing also provides some interesting possibilities. Citation indexing is not indexing in the sense of subject assignment; it is an index of the links created when authors cite other works, either in footnotes or in bibliographies. When created on paper, citation analysis is linear and static. But when the citation links can be made between digital documents, a web of references is created that can be used to perform some automated subject analysis.
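
As one illustration of how a web of references can hint at subject, consider bibliographic coupling, a long-established technique in which two documents that cite many of the same works are presumed to treat related topics. The column does not single out a particular method, so the choice of technique here is an assumption, and the document and reference identifiers are made up.

```python
from itertools import combinations


def coupling_strength(citations):
    """Count the references shared by every pair of citing documents.

    `citations` maps a document identifier to the works it cites.
    Pairs with no shared references are omitted from the result.
    """
    strengths = {}
    for a, b in combinations(sorted(citations), 2):
        shared = set(citations[a]) & set(citations[b])
        if shared:
            strengths[(a, b)] = len(shared)
    return strengths


# Hypothetical citation data: document -> works it cites.
citations = {
    "A": ["r1", "r2", "r3"],
    "B": ["r2", "r3"],
    "C": ["r9"],
}
print(coupling_strength(citations))
```

Documents A and B share two references and so are likely candidates for the same or neighboring subject headings, while C stands apart. A subject heading already assigned to A could then be offered as a suggestion for B.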

The Eventual Future

Work on automated indexing has been taking place since the mid-20th century. Early researchers like Eugene Garfieldxiv were hindered in their work by the physicality of the documents they wished to organize. Work on digital documents is showing great promise, and could change how we do indexing. It seems obvious that we will need to turn to automation in order to handle the great increase in document production.

The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.