Procedure for Selecting Pre-1923 Texts for Digitization
April 29, 2002
As put forth in the Library of Tomorrow report, the development of the Digital Library is an opportunity to bring all Library staff to a new level of understanding of library service in the twenty-first century. Policies relating to the selection of digital material have their roots in policies for traditional materials, but digital materials raise a host of complex issues, not the least of which are copyright and the need to prioritize, given limited resources for an expensive process. Therefore, it is important that selectors understand the issues surrounding digital creation, purchase, and harvesting so as to make informed decisions about digital collection building. Involvement with digital selection makes selectors and Information Community coordinators responsible for allocating finite resources, and it will also strengthen selectors' "ownership" of the Digital Library as they make decisions that build collections of digital teaching and research materials.
This procedure outlining text selection for digitization was developed by the Digital Contents Review Team (DCRT), the group that advises Digital Library Production Services (DLPS) on Library priorities for digitization. It addresses the criteria to be used for selecting books for digitization (in most, but not necessarily all, cases from our existing pre-1923 print collection), and outlines how to respond to faculty requests for digitizing books.
Eight assumptions and guidelines govern the text selection procedure outlined here:
1. The UVA Library will build a collection of digital materials that will support programmatic needs (both teaching and research) at the University of Virginia. Within this collection high priority materials will meet at least one of the following criteria:
The building of the UVA Digital Library must be envisioned as an ongoing process. The decisions made today must be informed by the anticipation of collaborative ventures of tomorrow. All ARL libraries will need similar core digital collections, but these collections can be built cooperatively. DLPS and others will actively explore a future of cross-institutional partnerships with sharing of content, while today UVA selectors focus on building digital collections that not only support UVA teaching and research but also emphasize the University's own programs, region, and history. OCLC provides the precedent: the efficiency of collaboration outweighs the need for local control. Local resources are then distributed in a more cost-effective manner, and the UVA Library is positioned both to meet local user needs and to contribute unique holdings to other DL institutions across the globe.
Further explanations of digital collection-building priorities are as follows:
Four Ways of Building Digital Collections
Not all digitization is created equal. Selectors and Information Community coordinators can make requests of DLPS to create, purchase, harvest, or link to digital text, but only the first three are truly integrated into the Digital Library.
1. Creating Content for the DL -- fully searchable
Text is digitized to make it searchable (Modern English Collection), or to create high-quality page images for reading and studying (Gordon collection), or both (Early American Fiction). Text can be made searchable via the processes of keyboarding or OCR, but a page image alone is not searchable.
2. Purchasing Content for the DL -- fully searchable
The Library has long had the ability to purchase digital access, but to a lesser degree has had the opportunity to purchase digital content. Examples of purchased digital content are the Chadwyck-Healey databases (such as English Poetry) and the Oxford English Dictionary, which were repurposed and aggregated with other SGML-tagged databases by the Electronic Text Center. The end result for the user is a uniform interface that searches thousands of texts in one search. Whenever feasible, the Library should purchase content rather than linking to external content.
3. Harvesting Content for the DL -- fully searchable
Harvesting text will be an effective and inexpensive way to amass digital content in the future. Harvested content is produced, and sometimes hosted, at other DL institutions. It meets (or can be made to meet) Library format and metadata standards, and it will appear to all users to be fully integrated into the Digital Library, behaving like any other fully searchable text in the collection. What is lacking at this writing is collaboration among DL powerhouses to share data with this level of cooperation. DLPS and others will follow the developments as recommendations are made by the Digital Library Federation and other national groups.
4. Linking to External Content -- not searchable
The strength of a Digital Library is that it brings all digital formats together, and provides similarity in delivery, tools, and user customization. Teaching and research materials should be presented to users in this uniform, high quality manner. However, linking to existing electronic versions of text is often an expedient and inexpensive alternative to digitizing them locally or purchasing the content, and may be appropriate for materials that are not considered to be core.
While linking to external content is an acceptable choice, it should be understood that each linked resource will have its own interface and its own level of quality, and will not be searchable in the Digital Library. Also, selectors should not allow fear of "duplicating effort" by digitizing titles that may already exist electronically to unduly influence the decision to digitize locally.
Responding to and Referring Faculty Requests
Collaboration in making digitization selections will occur among selectors, ECenter staff, Information Communities, faculty, and others. The Library will depend on an active and cooperative selection community to make informed referrals and decisions. As stated in the above section, Digital Collection Building, a book can be digitized for numerous reasons, including faculty requests. It is hoped that the Library will be able to honor these digitization requests in the same way requests for purchasing print materials are honored (but expense will play a role).
Many faculty requests will be straightforward and selectors can follow the procedures outlined below. If faculty requests for digitization are based on needs that can be met by the evolving standard interfaces, functions or tools of the Digital Library, then the selectors should be the ones to initiate the request to DLPS. However, selectors should refer projects that have complicated needs, for instance the integration of multiple media forms or the development of specialized tools or interfaces, to the ECenters. The ECenters should then respond as they have always done: by advising on technical and administrative matters, by helping to develop project plans, and, if appropriate, by seeking funding for the projects. In such cases, ECenter staff (in collaboration with selectors or the appropriate Information Community coordinator) will submit the digitization requests to DLPS.
When digitization is complete, DLPS will place the parsed, digitized project material in the Digital Library Repository for general access through the standard interfaces and tools. ECenter staff will then be responsible for further development on the project in partnership with faculty and other project stakeholders. Procedures for dealing with new versions and error correction will be developed in the near future.
There are two basic types of faculty requests that selectors will come across: those that come with money to keyboard texts, and those that don't. The following procedures address both scenarios. DLPS is not yet ready to handle faculty requests for OCR, or faculty projects that offer student labor rather than funding.
All requests from faculty who wish to have books digitized but have no funding to offer will go through a selector or the Information Community Coordinator, who asks:
Answers to all questions are forwarded to DLPS for entering the book(s) into the work queue. DCRT makes the final decision if there are considerations other than those listed above.
Faculty requests for keyboarding (with funding) will go through a selector or the Information Community Coordinator, who asks:
Collection area coordinators and the Information Community coordinator will receive annual "digital page allotments" from DLPS and will work with selectors to decide which books get digitized and what level of digitization is needed. DLPS staff will maintain an accounting system for these allotments, preferably online so coordinators and selectors can determine their "balance" at any time. Selectors can use an allotment to digitize pre-1923 books in their areas of the collection, to fulfill faculty requests, or both. Area coordinators may decide whether to further disburse their allotment by selector or library, or to "pool resources" to make their digital pages go further.
Coordinators will be annually credited with a certain number of pages (see example below) to be digitized. There will be three allotments: Keyboarding Pages, OCR Pages, and Page Images. Coordinators and selectors will decide which digitization treatment each book will receive:
Keyboarding Pages
Collection materials that require high quality search access with full integration into the Digital Library should be keyboarded. Keyboarded text will be accompanied by page images.
OCR Pages
The search quality of OCRed text will be less than that of keyboarded texts, but the texts will still be fully integrated into the Digital Library. These books will be scanned to create page images, OCRed, and marked up in XML. OCRed text will be accompanied by page images.
Page Images
Books that are digitized by page image only will not have searchable text, but can be read, studied, or printed.
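
The accounting system described above is not specified in this procedure. As an illustration only, the following minimal Python sketch shows one way a coordinator's "balance" could be tracked across the three allotments. The class name, method names, and treatment keys are hypothetical and are not part of any DLPS system.

```python
from dataclasses import dataclass, field

# The three allotments named in this procedure (the key names are assumptions).
TREATMENTS = ("keyboarding", "ocr", "page_images")


@dataclass
class AllotmentLedger:
    """Hypothetical ledger of one coordinator's annual page credits."""
    credited: dict = field(default_factory=lambda: {t: 0 for t in TREATMENTS})
    used: dict = field(default_factory=lambda: {t: 0 for t in TREATMENTS})

    def _check(self, treatment: str, pages: int) -> None:
        if treatment not in TREATMENTS:
            raise ValueError(f"unknown treatment: {treatment}")
        if pages <= 0:
            raise ValueError("pages must be a positive number")

    def credit(self, treatment: str, pages: int) -> None:
        """Record the annual credit of pages for one treatment."""
        self._check(treatment, pages)
        self.credited[treatment] += pages

    def charge(self, treatment: str, pages: int) -> None:
        """Charge a book's pages against the allotment, refusing overdrafts."""
        self._check(treatment, pages)
        if pages > self.balance(treatment):
            raise ValueError("request exceeds the remaining allotment")
        self.used[treatment] += pages

    def balance(self, treatment: str) -> int:
        """Pages still available for the given treatment."""
        return self.credited[treatment] - self.used[treatment]


# Usage: credit 21,000 keyboarding pages, then charge a 350-page book.
ledger = AllotmentLedger()
ledger.credit("keyboarding", 21000)
ledger.charge("keyboarding", 350)
print(ledger.balance("keyboarding"))  # 20650
```

Refusing a charge that exceeds the balance is a design assumption of this sketch. The procedure itself does not say what happens when an allotment is exhausted.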
For fiscal year 2002/03 the digital page allotments will correspond to the 2001 monographic budget percentages for each area, as follows:
Humanities/Social Sciences (Marshall) -- 42%
In addition, the Information Community coordinator and DCRT will receive an allotment:
Information Communities (Ruotolo) -- 5%
Example: if DLPS determines it can pay for keyboarding 50,000 pages in fiscal year 2002/03, Humanities would receive an allotment of 21,000 keyboarded pages; Clemons, 2,500 pages; Arts, 3,500 pages; Sciences, 4,500 pages; Special Collections, 13,500 pages; Information Communities, 2,500 pages; DCRT, 2,500 pages. DLPS may also determine that it can OCR 100,000 pages in fiscal year 2002/03, so that the allotments for OCRed pages for each area would be double those of the keyboarding pages; a similar determination will be made for page imaging.
These allotments will be reviewed annually.
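
The arithmetic in the example can be sketched in a few lines of Python. Only the Humanities/Social Sciences and Information Communities percentages are stated explicitly above. The other shares below are assumptions worked back from the 50,000-page example (for instance, 13,500 of 50,000 pages is 27 percent). The function name and the choice to round down are likewise illustrative.

```python
# Percentage shares per area. Values other than 42 and 5 (Information
# Communities) are inferred from the 50,000-page example, not quoted.
SHARES = {
    "Humanities/Social Sciences": 42,
    "Clemons": 5,
    "Arts": 7,
    "Sciences": 9,
    "Special Collections": 27,
    "Information Communities": 5,
    "DCRT": 5,
}


def page_allotments(total_pages: int, shares: dict = SHARES) -> dict:
    """Split a yearly page total among areas by percentage share.

    Rounds down to whole pages (an assumption; the procedure does not
    say how fractional pages would be handled).
    """
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {area: total_pages * pct // 100 for area, pct in shares.items()}


keyboarding = page_allotments(50_000)
ocr = page_allotments(100_000)

print(keyboarding["Humanities/Social Sciences"])  # 21000
print(ocr["Humanities/Social Sciences"])          # 42000
```

With a 100,000-page OCR total, every area's OCR allotment comes out at double its keyboarding allotment, matching the example.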