University of Virginia
University of Virginia Library
Digital Initiatives: Reports

Procedure for Selecting Pre-1923 Texts for Digitization

April 29, 2002

As put forth in the Library of Tomorrow report, the development of the Digital Library is an opportunity to bring all Library staff to a new level of understanding of library service in the twenty-first century. Policies relating to the selection of digital material have their roots in policies for traditional materials, but digital materials raise a host of complex issues, not the least of which are copyright and the need to set priorities given limited resources for an expensive process. It is therefore important that selectors understand the issues surrounding digital creation, purchase, and harvesting so as to make informed decisions about digital collection building. Beyond making selectors and Information Community coordinators responsible for allocating finite resources, involvement with digital selection will strengthen selectors' "ownership" of the Digital Library as they make decisions that build collections of digital teaching and research materials.

This procedure outlining text selection for digitization was developed by the Digital Contents Review Team (DCRT), the group that advises Digital Library Production Services (DLPS) on Library priorities for digitization. It addresses the criteria to be used for selecting books for digitization (in most, but not necessarily all, cases from our existing pre-1923 print collection), and outlines how to respond to faculty requests for digitizing books.

Assumptions
Digital Collection Building
Four Ways of Building Digital Collections
Responding to and Referring Faculty Requests
Faculty Without Funding
Faculty With Funding
Digital Page Allotments

Assumptions

Eight assumptions and guidelines govern the text selection procedure outlined here:

1. The UVA Library will build a collection of digital materials that will support programmatic needs (both teaching and research) at the University of Virginia. Within this collection, high-priority materials will meet at least one of the following criteria:
  • Are unique to UVA or the region
  • Gain added value by being digitized
  • Have interdisciplinary interest
  • Warrant reformatting due to condition and/or high use

Only books that are out of copyright, or for which we have gained permission to digitize, can be added to the collection, although instructional materials that are in copyright may also be included with access limited to UVA users. Selection of these materials is made by the subject specialist in each area in collaboration with faculty, or by an Information Community coordinator. The ECenters, subject specialists, and Information Community coordinators will work closely to provide materials requested for faculty projects, with subject specialists and Information Community coordinators responsible for "funding" these materials out of their annual page allocations (explained in the section below, Digital Page Allotments).

2. Once digitized, text may require multiple delivery options. A book may need to be searched, read or viewed on- or offline, analyzed, or printed. These multiple purposes underscore the importance of creating a flexible delivery schema. Ideally, digital texts created locally should have page images associated with them. Likewise, page images should be accompanied by searchable text. Other options are available, however, as outlined below in Four Ways of Building Digital Collections.

3. The digitization of handwritten manuscripts and other Special Collections materials is not addressed by this procedure. These materials will require different workflows, as yet undeveloped, for the handling of fragile and rare materials and for materials that require transcription.

4. The Library should follow emerging national interoperability standards as developed by the Digital Library Federation in order to be positioned to share data with other Digital Library institutions in the future.  

5. In the absence of a digital preservation policy, the Library should adopt the Digital Library Federation standard for digital masters until such time that a policy is developed locally. 

6. ARL's Copyright and Intellectual Property Strategy, still in development, will guide the Library's policies on access to copyrighted and licensed material. DLPS would like to be able to address digitizing works after 1923, but is unable to do so at the present time. The added complications of digitizing copyrighted texts may call for a different selection process and production model not yet developed or supportable. However, selectors or faculty may choose to do the legwork to obtain permission for copyrighted materials without DLPS support. 

7. Selection will be ongoing throughout the fiscal year, and DLPS will work on digitizing the selections throughout the year. Materials of a time-sensitive nature, however, will be digitized at the discretion of the DLPS Director as time and workflow permit.

8. The digitization of all texts will be accomplished without despining or otherwise damaging books. 

Digital Collection-Building

The building of the UVA Digital Library must be envisioned as an ongoing process. The decisions made today must be informed by the anticipation of collaborative ventures of tomorrow. All ARL libraries will need similar core digital collections, but these collections can be built cooperatively. DLPS and others will actively explore a future of cross-institutional partnerships with sharing of content, while today UVA selectors focus on building digital collections that support UVA teaching and research, with an emphasis on the University's own programs, region, and history. OCLC provides the precedent: the efficiency of collaboration outweighs the need for local control. Local resources are then distributed in a more cost-effective manner, and the UVA Library is positioned both to meet local user needs and to contribute unique holdings to other DL institutions across the globe. Further explanations of digital collection-building priorities are as follows:

Supporting teaching and research

As with selecting print materials, a selector's close relationship with the academic departments will largely inform digital collection-building decisions. Similarly, selecting books that are cross-disciplinary will offer the Library the opportunity to support these growing areas and maximize the use of these materials. Toward this end, each selector will receive a VIRGO report detailing the books in his/her areas, sorted by call number, that were published before 1923 and have circulated at least once. Selectors may use this list, browse the stacks, consult bibliographies, or use other methods to select books for the digital collection in their areas of responsibility. Selectors will be allotted "digital pages" with which to make these collection decisions (see Digital Page Allotments, below).

Promoting and improving access to unique and rare books

The Digital Library will support the unique research and programs of the UVA community and improve access to books that are difficult for local and distant scholars to obtain.

Repurposing

Selectors may choose to digitize works to "add value." For instance, a dictionary that is digitized can be searched by keyword or in other ways not possible in print. Some repurposing may require work with ECenter staff to address special delivery needs (interface, functionality, or tools).

Preserving Access

Access to the content of brittle books can be preserved by digitization. Selectors will be offered various options for preservation outlined in another document.

Responding to Faculty Requests

It is fully recognized that faculty will play an important role in any digital collection-building process. See Responding to and Referring Faculty Requests below for more details.

Four Ways of Building Digital Collections

Not all digitization is created equal. Selectors and Information Community coordinators can make requests of DLPS to create, purchase, harvest, or link to digital text, but only the first three are truly integrated into the Digital Library.

1. Creating Content for the DL -- fully searchable

Text is digitized to make it searchable (Modern English Collection), or to create high-quality page images for reading and studying (Gordon collection), or both (Early American Fiction). Text can be made searchable via the processes of keyboarding or OCR, but a page image alone is not searchable.

Page images are digital photocopies of each page of a book, and can be done in-house by DLPS. These images can be viewed by the end-user, but the text itself is not searchable.

OCR can be done in-house. The book is scanned to create page images, and each image is run through OCR (optical character recognition) software that attempts to interpret each character on the page image and transforms it into plain text. The many mistakes must be found and corrected manually, and XML mark-up must be added. The out-of-pocket expense per book is small, but staff time and resources are considerable. This whole process provides accuracy similar to materials in JSTOR and the Making of America.

Keyboarding is done through outsourcing to a vendor. The text and XML mark-up are typed in twice and the versions are compared to achieve a very high level of accuracy. This amount of handling is expensive: the Library pays by the keystroke, but the results are high quality.
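
The compare step in double keyboarding can be illustrated with a minimal sketch. It assumes the two independently keyed versions are available as plain strings; the function name `find_discrepancies` is hypothetical, and the vendor's actual tooling is not described in this document. The sketch uses Python's standard `difflib` module to list every place where the two passes disagree, which is where a human would then check the printed page.

```python
import difflib


def find_discrepancies(first_pass: str, second_pass: str):
    """Return (kind, text_in_first, text_in_second) for each disagreement.

    Hypothetical helper: illustrates comparing two keyed versions of the
    same page; it is not the vendor's actual process.
    """
    matcher = difflib.SequenceMatcher(None, first_pass, second_pass, autojunk=False)
    return [
        (kind, first_pass[i1:i2], second_pass[j1:j2])
        for kind, i1, i2, j1, j2 in matcher.get_opcodes()
        if kind != "equal"  # keep only the spans where the passes differ
    ]


if __name__ == "__main__":
    # Two typists key the same word; one makes a single-character slip.
    for item in find_discrepancies("cat", "cot"):
        print(item)
```

Because two typists are unlikely to make the same mistake in the same place, spans where the versions agree can be accepted with high confidence, and only the flagged spans need review.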

 2. Purchasing Content for the DL -- fully searchable

The Library has long had the ability to purchase digital access, but has had less opportunity to purchase digital content. Examples of purchased digital content are the Chadwyck-Healey databases (such as English Poetry) and the Oxford English Dictionary, which were repurposed and aggregated with other SGML-tagged databases by the Electronic Text Center. The end result for the user is a uniform interface that searches thousands of texts in one search. Whenever feasible, the Library should purchase content rather than linking to external content.

3. Harvesting Content for the DL -- fully searchable

Harvesting text will be an effective and inexpensive way to amass digital content in the future. Harvested content is produced and sometimes hosted at other DL institutions. Provided it meets (or can be made to meet) Library format and metadata standards, it will appear to all users to be fully integrated into the Digital Library, behaving like any other fully searchable text in the collection. What is lacking at this writing is an arrangement among DL powerhouses to share data at this level of cooperation. DLPS and others will follow developments as recommendations are made by the Digital Library Federation and other national groups.

4. Linking to External Content -- not searchable

The strength of a Digital Library is that it brings all digital formats together, and provides similarity in delivery, tools, and user customization. Teaching and research materials should be presented to users in this uniform, high quality manner. However, linking to existing electronic versions of text is often an expedient and inexpensive alternative to digitizing them locally or purchasing the content, and may be appropriate for materials that are not considered to be core. While linking to external content is an acceptable choice, it should be understood that each link will provide its own interface and variety of quality and will not be searchable in the Digital Library. Also, selectors should not allow fear of "duplicating effort" by digitizing titles that may already exist electronically to unduly influence the decision to digitize locally. 

Responding to and Referring Faculty Requests

Collaboration in making digitization selections will occur between selectors, ECenter staff, Information Communities, faculty, and others. The Library will depend on an active and cooperative selection community to make informed referrals and decisions. 

As stated in the above section, Digital Collection Building, a book can be digitized for numerous reasons, including faculty requests. It is hoped that the Library will be able to honor these digitization requests in the same way requests for purchasing print materials are honored (but expense will play a role). Many faculty requests will be straightforward and selectors can follow the procedures outlined below. If faculty requests for digitization are based on needs that can be met by the evolving standard interfaces, functions or tools of the Digital Library, then the selectors should be the ones to initiate the request to DLPS. 

However, selectors should refer projects that have complicated needs, for instance the integration of multiple media forms or the development of specialized tools or interfaces, to the ECenters. The ECenters should then respond as they have always done: by advising on technical and administrative matters, by helping to develop project plans, and, if appropriate, by seeking funding for the projects. In such cases, ECenter staff (in collaboration with selectors or the appropriate Information Community coordinator) will submit the digitization requests to DLPS. When digitization is complete, DLPS will place the parsed, digitized project material in the Digital Library Repository for general access through the standard interfaces and tools. ECenter staff will then be responsible for further development on the project in partnership with faculty and other project stakeholders. Procedures for dealing with new versions and error correction will be developed in the near future.

There are two basic types of faculty requests that selectors will come across: those that come with money to keyboard texts, and those that don't. The following procedures address both scenarios. DLPS is not yet ready to handle faculty requests for OCR, or faculty projects that offer student labor rather than funding. 

Faculty Without Funding

All requests to digitize books from faculty who do not have funding to offer will go through a selector or the Information Community Coordinator, who asks:

  • Is it out of copyright; or has the faculty member received written permission to digitize (if the latter, does the Library have to restrict access by IP address or in some other way)?
  • Does the Library have a print copy in the collection (if no: selector decides if the item is worthy of inclusion in the collection and submits an ILL request, if necessary, to obtain the item)?
  • Does an electronic copy exist elsewhere?
  • Does this relate to an Information Community (yes: forward to Information Community Coordinator for decision)?
  • Should the item be keyboarded or OCRed (or will a link suffice if it exists)?
  • Can the selector cover the requested item through their digital page allotment for the chosen form of digitization (keyboarding or OCR)?
  • Is there a time dependency, course-related or otherwise (if yes: forward to DLPS Director for decision)?
  • What category does the book fall into: prose, poetry, drama, letter, newspaper, journal article, dictionary/encyclopedia?
  • Are there languages other than English or non-Western characters in the text?
  • Might the Library want this work done for reasons other than those mentioned above (if yes: forward to DCRT via DLPS Director)?

Answers to all questions are forwarded to DLPS for entering the book(s) into the work queue. DCRT makes the final decision if there are considerations other than those listed above.  
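
The referral rules embedded in the checklist above can be summarized in a short sketch. The `Request` class, its field names, and the `referrals` function are all hypothetical names introduced for illustration. The sketch also assumes that a request with neither expired copyright nor written permission stops at the selector, which follows from the first assumption in this document but is not spelled out in the checklist itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of a selector's answers for one unfunded request."""
    out_of_copyright: bool
    has_written_permission: bool = False
    relates_to_information_community: bool = False
    time_dependent: bool = False
    other_library_interest: bool = False


def referrals(req: Request) -> list[str]:
    """List where the request is forwarded, in checklist order."""
    # Assumption: copyright must be cleared before anything is forwarded.
    if not (req.out_of_copyright or req.has_written_permission):
        return ["hold: copyright not cleared"]
    routes = []
    if req.relates_to_information_community:
        routes.append("Information Community Coordinator")
    if req.time_dependent:
        routes.append("DLPS Director")
    if req.other_library_interest:
        routes.append("DCRT via DLPS Director")
    # All answers are ultimately forwarded to DLPS for the work queue.
    routes.append("DLPS work queue")
    return routes


if __name__ == "__main__":
    print(referrals(Request(out_of_copyright=True, time_dependent=True)))
```

The remaining checklist questions (print copy, existing electronic copy, text category, languages, and choice of keyboarding or OCR) do not change the routing; they are recorded and passed along with the request.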

Faculty With Funding

Faculty requests for keyboarding (with funding) will go through a selector or the Information Community Coordinator, who asks:

  • Is it out of copyright, or has the faculty member received written permission to digitize (if the latter, does the Library have to restrict access by IP address or in some other way)?
  • Does the Library have a print copy in the collection (if no: selector decides if the item is worthy of inclusion in the collection)?
  • Does an electronic copy exist elsewhere?
  • What category does the book fall into: prose, poetry, drama, letter, newspaper, journal article, dictionary/encyclopedia?
  • Are there languages other than English or non-Western characters in the text?

Assumptions to be shared with faculty member:
  • Faculty member will pay for the keyboarding
  • Materials of a time-dependent nature, course-related or otherwise, will be processed as quickly as possible, but cannot always be guaranteed a quick turnaround time. The DLPS Director will communicate the expected schedule to the faculty member; scheduling modifications will be made at her discretion.
  • The faculty member will deliver a non-Library copy of the item (in the correct edition) for despining
  • DLPS will parse the keyboarded text against our standard DTDs to identify problems but will not do any training, cleanup, or interface work
  • A copy of the parsed work will go into the DL (unrestricted if possible)

Answers to all questions are forwarded to DLPS for entering the book(s) into the work queue. DCRT makes the final decision if there are considerations other than those listed above.

Digital Page Allotments

Collection area coordinators and the Information Community coordinator will receive annual "digital page allotments" from DLPS and will work with selectors to decide which books get digitized and what level of digitization is needed. DLPS staff will maintain an accounting system for these allotments, preferably online so coordinators and selectors can determine their "balance" at any time. Selectors can use their allotment to digitize pre-1923 books in their areas of the collection, to fulfill faculty requests, or both.

Area coordinators may decide whether to further disburse their allotment by selector or library, or to "pool resources" to make their digital pages go further.  

Coordinators will be credited annually with a certain number of pages to be digitized (see the example below). There will be three allotments: Keyboarding Pages, OCR Pages, and Page Images. Coordinators and selectors will decide which digitization treatment each book will receive:

Keyboarding Pages

Collection materials that require high quality search access with full integration into the Digital Library should be keyboarded. Keyboarded text will be accompanied by page images.

OCR Pages

The search quality of OCRed text will be less than that of keyboarded texts, but the texts will still be fully integrated into the Digital Library. These books will be scanned to create page images, OCRed, and marked up in XML. OCRed text will be accompanied by page images.

Page Images

Books that are digitized by page image only will not have searchable text, but can be read, studied, or printed.

Links

If a free electronic version exists, a selector may request that a link be added to the VIRGO record, but the item will not be searchable within the Digital Library. 

For fiscal year 2002/03 the digital page allotments will correspond to the 2001 monographic budget percentages for each area, as follows:  

Humanities/Social Sciences (Marshall) -- 42%
Clemons (Coleman) -- 5%
Arts (Penner) -- 7%
Sciences (O'Bryant/Hunter) -- 9%
Special Collections (Clendenning) -- 27%
Subtotal -- 90%

In addition, the Information Community coordinator and DCRT will receive an allotment:  

Information Communities (Ruotolo) -- 5%
DCRT (Baumann) -- 5%
Subtotal -- 10%

Example: if DLPS determines it can pay for keyboarding 50,000 pages in fiscal year 2002/03, Humanities would receive an allotment of 21,000 keyboarded pages; Clemons, 2,500 pages; Arts, 3,500 pages; Sciences, 4,500 pages; Special Collections, 13,500 pages; Information Communities, 2,500 pages; DCRT, 2,500 pages. DLPS may also determine that it can OCR 100,000 pages in fiscal year 2002/03, so that the allotments of OCRed pages for each area would be double those of the keyboarding pages; a similar determination will be made for page imaging.
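
The arithmetic in this example can be reproduced with a short sketch. The percentages come from the lists above; the names `SHARES` and `allot_pages` are hypothetical, and rounding down when a total does not divide evenly is an assumption, since the procedure does not say how fractional pages would be handled.

```python
from __future__ import annotations

# Percentage shares for fiscal year 2002/03, taken from the lists above.
SHARES = {
    "Humanities/Social Sciences": 42,
    "Clemons": 5,
    "Arts": 7,
    "Sciences": 9,
    "Special Collections": 27,
    "Information Communities": 5,
    "DCRT": 5,
}


def allot_pages(total_pages: int) -> dict[str, int]:
    """Split a page total across areas by percentage share.

    Hypothetical helper; integer division rounds down (an assumption).
    """
    if sum(SHARES.values()) != 100:
        raise ValueError("shares must total 100 percent")
    return {area: total_pages * pct // 100 for area, pct in SHARES.items()}


if __name__ == "__main__":
    # Keyboarding example from the text: 50,000 pages in total.
    for area, pages in allot_pages(50_000).items():
        print(f"{area}: {pages:,} pages")
```

Calling `allot_pages(100_000)` gives the OCR allotments, each double the corresponding keyboarding figure.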

These allotments will be reviewed annually. 

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, June 02, 2008
© The Rector and Visitors of the University of Virginia