One Word: Digital

By Karen Coyle

Preprint. Published in The Journal of Academic Librarianship, v. 32, n. 2, March, 2006, pp. 205-207.


"I want to say one word to you. Just one word."
"Yes, sir."
"Are you listening?"
"Yes, I am."
"Plastics."
The Graduate, 1967

Possibly the most famous word in movie history, the one word "plastics" was meant to invoke modernity. The term today refers to an established industry, one that spans everything from grocery bags to the portals on the space shuttle. Like "plastics," the word "digital" is our keyword for a promising future, and it also covers a wide range of useful items. Digital is a kind of genus term for all things composed of ones and zeroes, much in the same way that mammal means warm-blooded and having live births. While useful to describe the broad category, it is less helpful when we wish to communicate about specific resources or projects. Yet, as this article will show, more precise terminology for types of digitization has not yet developed, making it hard for us to talk about specific types of digital resources.

A great deal of discussion has taken place around the recent Google project to digitize the print works held in a group of large research libraries. In many cases, participants in the discussion are talking at cross purposes because each has different expectations arising from the statement that Google is "digitizing" the books. Some complain that the books are not easy to read, others that the digital versions being created are not suitable for long-term preservation. Enthusiasts of Google's project talk about creating a "digital library where everything is available at the touch of a button." Both critics and enthusiasts are misunderstanding the Google project. Google's digitization of the books for Google Print has a fairly narrow scope that we might call: digitization for discovery.[1] Google is not intending to provide books for reading or for preservation, and it does not call its service a digital library. Google is creating an index of the terms in the books, and is displaying those words in context by showing a portion of a page.

The Google experience is evidence that we need to talk more specifically about types of digitization to accurately communicate what is happening in libraries and information technology services today. To say that you are planning to digitize some items, or that you will create a digital library, is somewhat like saying that you will buy your daughter a mammal for her birthday. Is it a hamster, or a Bengal tiger? Is your digital object an e-book or a set of statistical data? Is it optimized for long term preservation, for machine processing, or for viewing in a web browser? In this article we will explore some of the kinds of digitization that take place in libraries and archives. They are not mutually exclusive, and this list is not to be considered complete. It may, however, begin to provide some digitization species within the digital genus.

Digitization for Preservation

Digital preservation requires a particular set of decisions that look toward the future, or at least do so to the extent that we can surmise what the digital future will need. In general, digital preservation formats must be able to capture the level of detail that will render the original work as faithfully as possible at some time in the future. Ideally, the formats would be based on open and well-documented standards. With open standards, even if the format falls into disuse in the future and the programs that render the format are no longer available, new programs can be written because the format of the data is known. [2]

Digital preservation formats may be different from the file formats that the library delivers directly to users. For example, a common, high-detail format for images is TIFF. A file in this format can be very large, and the details that it holds may be lost when the file is rendered for a computer screen. TIFF is also not a format that can be opened by standard web browsers. This means that a TIFF file is good for preservation, but online users are better served by a smaller file in JPEG or GIF format. It is not uncommon that a library or archive will store one digital format for preservation purposes and will create service copies in other formats for its online user access. That said, there are no digital formats that are used exclusively for preservation; the TIFF image file is often used for quality printing of images.

Digitization for Discovery

An early keyword index was demonstrated at the 1958 International Conference on Scientific Information (ICSI), held in Washington, D.C. [3] Unlike today's keyword access, this was a print index using the Keyword In Context (KWIC) format, where short phrases are sorted by each significant word in the phrase. Since then, keyword searching in digital texts has become for many the primary mode of discovery.
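As a rough illustration of the KWIC idea, the following Python sketch lists each phrase once for every significant word it contains and sorts the entries by that word. The stop-word list, the column width, and the sample titles are assumptions made for the example; this is not a reconstruction of the 1958 index.

```python
# Hypothetical stop-word list; a real index would use a much longer one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_index(titles, width=30):
    """Return KWIC lines: one entry per significant word, sorted by that word."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            # Normalize the keyword for sorting and stop-word checks.
            key = word.lower().strip(".,;:!?\"'")
            if not key or key in STOP_WORDS:
                continue
            left = " ".join(words[:i])   # context before the keyword
            right = " ".join(words[i:])  # keyword plus context after it
            entries.append((key, left, right))
    # Sort by keyword, then by the text that follows it.
    entries.sort(key=lambda e: (e[0], e[2].lower()))
    # Right-align the left context so every keyword starts in the same column.
    return [f"{left[-width:]:>{width}}  {right[:width]}" for _, left, right in entries]


# Made-up sample titles for demonstration.
titles = [
    "Digitization for Discovery",
    "The Preservation of Digital Files",
]
for line in kwic_index(titles):
    print(line)
```

Each title appears once per significant word, with the keywords aligned in a single column so that a reader can scan down the list alphabetically.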

Digitization for keyword searching usually takes the form of scanning an analog document and performing optical character recognition (OCR) to convert the text to a machine-readable form. OCR generally reduces a book or article to its underlying text, without the formatting that exists in a printed version, and without any illustrations or graphs (although some OCR programs are able to identify structural components of a text like chapter headings). Automated discovery of non-textual items is more difficult to achieve, and research is being done on automating discovery of pictures and sound. [4] The existing picture search systems are often aimed at those doing illustration or advertising and emphasize the aesthetic qualities of color and layout over topical discovery, the latter being very hard to automate. [5] Sound and video discovery are highly desired but not yet at a marketable stage in their development. Other types of discovery are through geographical characteristics and time-based markers.

Digitization for Delivery

Today's information seekers are less likely to actually enter the library than in the past. The library must now deliver materials to the user, both in a convenient format and as close to instantaneously as possible. Digital files are ideal delivery formats because they can be placed online for user access, or faxed or e-mailed. While most digital files are delivered to users over networks, digitization specifically for delivery often takes the form of an un-enhanced facsimile of the original. The text-based digital facsimile is often destined for printing, and in this category we can include the digital files of a print on demand service. For non-textual media, delivery services like online streaming allow individual users to receive and experience content.

Digitization for Reading

Many of the items that we read on a screen were born digital: e-mail, text messages, documents in formats like Microsoft Word or Adobe PDF. In fact, we've been using digital technology in the production of non-digital works, like books and reports, for many decades. There is a reverse trend happening of digitizing printed texts, generally as a way to bring pre-digital materials into the modern information space. Digitizing a print or manuscript item doesn't always result in a text that someone would want to read from cover to cover as they would read a paper book. What we do know about screens and reading is that most people prefer to print a long text rather than read it online. The goal of digitizing for reading is to produce a viable reading experience in the digital format.

The Holy Grail of the e-book world is a device that is as pleasing for reading as the paper book. As yet this device has not been developed and the e-book market appears to be stagnant. Studies have been done on current (and defunct) devices, though, that give us some understanding of the characteristics of "readability" for digital materials. [6] [7] It's not just a matter of having a pleasant screen and well-formed type; readers of digitized works need to have many of the characteristics of paper books, such as numbered pages (so that citations can be accurate), bookmarking, and navigation to individual pages or chapters. Digital files designed for extended reading (as opposed to a quick lookup) need to be portable, as it is rarely convenient to read a lengthy text sitting at a workstation. There are other features that users of e-books appreciate, such as interlinked dictionaries and the ability to annotate and highlight. These are generally functions of the reading device, but the file formats should not prevent these from being offered.

Many projects are digitizing books, from Project Gutenberg [8] to Google, but not all of these have the characteristics that are necessary to be a readable digital version of a book. Digital versions that lack these characteristics do not meet the criteria for sustained reading and therefore fall better into the categories of Digitization for Discovery or Digitization for Research. We are hindered in our creation of digital files that support and encourage sustained reading by a lack of open standards for e-book markup. The most common e-book files today are in proprietary formats such as Microsoft Reader or Adobe E-Book format. There are few e-book formats that interact well with the web browser, although there is some work in that direction through the British Library's Page Turner and some formats being presented by the Internet Archive.

Digitization for Research

In print form, a bibliographic indexing service, a dictionary, or an encyclopedia is a continuous text made up of many individual entries. When these are reformatted as searchable databases, the ease of use and general value of these tools increase. These reference tools are actually much better suited to the digital world than they were to that of print because their value is in the discovery and display of individual entries. When these resources are digitized it seems to go without saying that they have been digitized to facilitate a research function.

Continuous texts can also be digitized for research, although it's not as easy to recognize these digital products or to categorize their use in this way. The creators of the Questia system digitize texts that will primarily be of use for college students writing papers. The subject outline of the system is called "Research Topics." Both the entire database of digital texts and the contents of each individual text can be searched by keyword. Texts are displayed on screen one page at a time. In theory a person could read these texts from the first page to the last, but only by overcoming a significant amount of inconvenience, such as limited display space and fonts that are hard to read. The system surrounding the texts features the ability to copy small quantities of text, which are then captured along with the citation that would be needed when the quote was entered into an academic paper. Similarly, the eBrary system allows users to do research within full texts and gain access to an entire book or article, but the online delivery of these books is designed more for viewing a small number of pages than for extended reading of the texts. These types of digitized texts are often referred to as "electronic books" because they are electronic versions of print books, but their functionality for research is more than a print book provides and their potential for reading is considerably less.

Digitization for Machine Manipulation

Not all digital files are destined to be viewed by humans. Large banks of data files, such as census or survey data, exist. There are also huge volumes of digitized map and satellite data that are used for weather and ecology studies. These files may be provided in file formats that are especially suitable for machine manipulation, such as the general tab-delimited format that most database programs can import. Some are produced in file formats specific to individual programs, like Microsoft Excel, but the main characteristic of these files is that their data is not to be read or viewed but will be used to produce new data after some programming is applied.
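To make the idea of producing new data from such files concrete, here is a minimal Python sketch that reads tab-delimited text and computes a total per region. The column names and the figures are invented for the illustration and do not come from any real census or survey file.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for a tab-delimited survey file.
SAMPLE = (
    "region\thouseholds\tpopulation\n"
    "North\t120\t310\n"
    "South\t80\t260\n"
    "North\t60\t150\n"
)


def population_by_region(tsv_text):
    """Sum the population column for each region in tab-delimited text."""
    totals = defaultdict(int)
    # DictReader uses the first row as column names; the delimiter is a tab.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        totals[row["region"]] += int(row["population"])
    return dict(totals)


print(population_by_region(SAMPLE))
```

The rows themselves are never displayed to a reader; the program consumes them and emits a derived table, which is the characteristic use of files digitized for machine manipulation.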

Born Digital

All of the above distinctions could also be applied to born digital materials, but not without some difficulty. The purposes of the born digital might be determined based on the programs that created them and the formats in which they are produced. However, these file formats are often program-specific, such as the .doc format of Microsoft Word or the .pdf of the Adobe Portable Document Format, and there are dozens, if not hundreds, of different formats. In addition, the people using these programs exercise varying degrees of creativity in producing their outcomes. I have seen a Microsoft Excel file that held a meeting agenda and another one that produced a geometrical drawing. Born digital files will be the hardest to characterize based on their file format, and provide the greatest challenge for long-term preservation.

The Digital Library

When we combine all of the above meanings of "to digitize" with the myriad formats of born digital materials, it becomes obvious that the "digital" in digital library can refer to a broad range of formats and content that have in common only that their fundamental carrier is a string of ones and zeroes. The distinctive aspects of these resources matter in various ways, most notably in how we communicate our digital library services to our users. The user experience will be degraded if users' expectations of digital library materials do not match the actual capabilities. I cringe at the idea that a student might be expected to read a chapter of a book from Google's digitized holdings, or will try to use Project Gutenberg's texts for a class paper. Although both are possible, the user will come away from that experience concluding that the Digital Library is difficult to use and not well suited to his needs. We can provide a better Digital Library experience when we match user needs to the appropriate digital materials and services.

Notes and References

[1] Google itself realized that their service was being misunderstood, and attempted to correct this in late November of 2005 by changing the name of the service to "Google Book Search."

[2] Recent announcements by the Microsoft Corporation that they will work to create open standards for their office products are based on the concern by businesses that their documents in Microsoft Office format will not be readable in the future. See: http://www.microsoft.com/presspass/features/2005/nov05/11-21Ecma.mspx (accessed 12/13/05).

[3] The Chemical Heritage Foundation. The Chronology of Chemical Information Science, 1950 to present. http://www.chemheritage.org/explore/timeline/CC1950.HTM (accessed 11/11/05).

[4] If you're interested in research in this area, the digital library of the Association for Computing Machinery has numerous articles on the topic. (http://portal.acm.org)

[5] Google's image search does not search on the images themselves. Instead, it appears to use keywords from the HTML "alt" text in the <img> tag, and perhaps some terms from the textual context around the images.

[6] Pietila, Katri, et al. Reading with eBooks. eFinland, 2005. http://e.finland.fi/netcomm/news/showarticle.asp?intNWSAID=43197 (accessed 12/06/05).

[7] Malama, Chrysanthi, Monica Landoni, Ruth Wilson. What Readers Want: A Study of E-Fiction Usability. In: D-Lib Magazine, v. 11, n. 5, May 2005. http://www.dlib.org/dlib/may05/wilson/05wilson.html (accessed 12/6/05).

[8] Project Gutenberg has created plain text digital versions of about 17,000 public domain documents. http://www.gutenberg.org



The copyright in this article is NOT held by the author. For copyright-related permissions, contact Elsevier Inc.