University of Virginia
University of Virginia Library
Digital Initiatives: Reports

Text Migration Team Report

December 18, 2002

Introduction

Electronic text has been produced by the University of Virginia Library for over a decade. During that time our methods, needs, and knowledge have changed significantly. Therefore, moving these many varied collections to a robust digital library architecture poses many challenges. However, creating a unified collection will be of enormous benefit to our users, and will better support use, reuse, and preservation of our electronic text collections.

For this exploratory project, a two-stage method was devised. The first stage, currently wrapping up, is an assessment of the data. The second is the actual production work needed to migrate the collections to the central repository, which could begin immediately or be put on hold in favor of other priority projects.

Conclusions

  • Assessment of the data is the first step in any migration project.

    This assessment should ideally be done by a high-level programmer with considerable experience with XML, TEI, PERL, and XSLT; however, a text migration committee should provide some review and guidance to ensure that Library priorities are met.

  • All data will be brought into XML.

    This has been a working assumption of our DL program since 2001. Some SGML may not migrate without loss of data.

  • All text data should be as predictable as is considered practicable.

    The most predictable data is the most reusable and sustainable. This position is supported in the Text DTD Practices report.

  • TEIxTite is to be the tightest document class.

    The data that is brought to this document class will behave most predictably and will provide the most stability and the widest range of search and display options.

  • TEIP4 is probably the lowest acceptable standard of conformity.

    TEIP4 is the broadest and loosest standard for TEI-based central repository texts. While it is advantageous to have most text data in TEI, it should be understood that this level of conformity will provide minimal search and index functions. Only by regularizing the data will the functionality of the data improve.

  • TEIxLite is a subset of TEIP4.

    Most existing TEILite (SGML) texts can be fairly easily migrated to TEIxLite (XML), but migration to TEIxTite is recommended.

  • Migration is an evolutionary process.

    Texts will move from one document class to the next as there is time and priority given to doing the work to regularize them. A long-term plan to tighten data is both manageable and sustainable, and can be directly informed by Library priorities. However, in determining document classes, the benefits must always be weighed carefully against the costs. In particular, the purchased material (such as that produced by Chadwyck-Healey) would take the most effort to migrate. It may be decided that a lesser document class (and corresponding loss in functionality) will suffice.

  • Workflows and standards will differ for different kinds of data.

    Thus far, two primary workflows have emerged: one for legacy TEILite material, which can fairly easily be brought into TEIxLite; and one for purchased material, most of which does not currently parse against TEIP4 and will need considerable programming time before it does. Other workflows may be identified in the future, based on further exploration of legacy data and the condition of newly purchased text collections.

  • Migrating texts is largely a programming task.

    The success of any text migration project depends on the programmer's ability to recognize data patterns from past projects and to assess what it would take to get the data to parse against one of the UVA document class DTDs. The programmer must also have a firm grasp of the delivery needs and Library priorities.

  • Library Administration will have to decide staffing priorities.

    Careful consideration must be given to the significant resources needed to migrate some data to the most desirable document class. Migrating some text will draw time and effort away from other priorities.

  • Over time, methods for producing electronic text should be standardized throughout the Library.

    While current projects and workflows may not permit immediate change, it must be recognized that texts produced in a standardized manner will be brought into the central repository more easily.

  • EAD documents will not be brought into TEI.

    EAD documents and TEI-based data are fundamentally different and have different purposes. There may be keyword searching across these types of text, but little sophisticated search and display functionality will be possible between TEI and EAD collections.

  • The TMT will not provide a recommendation as to what to do about migrating TEI headers.

    A newly-formed metadata working group will need to make recommendations before TEI headers can be addressed.
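
The document classes named in the conclusions above can be pictured as a ladder that texts climb over time. The following is a minimal sketch only, assuming the ordering stated in this report (TEIP4 loosest, TEIxLite a subset of TEIP4, TEIxTite tightest). The names `DocClass` and `next_class` are hypothetical and are not part of any existing UVA tool.

```python
from enum import IntEnum
from typing import Optional


class DocClass(IntEnum):
    """Document classes from this report, ordered loosest to tightest.

    The numeric values are an assumption used only to express the ordering.
    """
    TEIP4 = 1     # broadest and loosest; minimal search and index functions
    TEIXLITE = 2  # a subset of TEIP4
    TEIXTITE = 3  # tightest; most predictable behavior


def next_class(current: DocClass) -> Optional[DocClass]:
    """Return the next tighter document class, or None at the tightest."""
    if current is DocClass.TEIXTITE:
        return None
    return DocClass(current + 1)
```

Whether a given text should actually move to the next class remains a cost-benefit decision, as the conclusions note; the sketch only records which class comes next.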

Recommendations

An overview of our methodology is followed by both a recommendation for future migration procedures and a general outline of the steps needed to move beyond this first exploratory stage.

Assessment Methodology

1. Technical review and approval of chosen content (All content was considered for technical feasibility and generally approved)

2. Determined whether all files would parse against TEIP4

A. Legacy materials were already in TEIxLite (XML)

i. 182 files parsed cleanly

ii. One file was identified as needing considerable work, so was set aside

B. Chadwyck materials will not parse against TEIP4 without significant alteration

i. Chadwyck data has its own DTDs, so the DTD and files were first analyzed

ii. High-level tags for major structural divisions were altered with XSLT stylesheets and PERL scripts (e.g., replacing <chad1> with <div1>)

iii. Other, lower-level tags were assessed but left as-is for this exploratory process

3. Next, a superclass stylesheet will be developed for delivery (a lightly modified EAF stylesheet)
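
The high-level tag replacement described in step 2.B.ii can be illustrated with a short sketch. The team used XSLT stylesheets and PERL scripts; the Python below is a stand-in written for illustration, not the team's code. Only the `chad1` to `div1` pair comes from this report; the function name `rename_tags` and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

# Only chad1 -> div1 is taken from the report; a real mapping would be
# built from an analysis of the vendor DTDs.
TAG_MAP = {"chad1": "div1"}


def rename_tags(xml_text: str, tag_map: dict) -> str:
    """Rename every element whose tag appears in tag_map.

    Attributes, text, and child elements are left untouched.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root and all descendants
        if elem.tag in tag_map:
            elem.tag = tag_map[elem.tag]
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input, not actual Chadwyck-Healey markup.
sample = "<text><chad1 n='1'><p>First poem</p></chad1></text>"
print(rename_tags(sample, TAG_MAP))
```

A rename like this only handles the major structural divisions. The lower-level tags mentioned in step 2.B.iii would still need separate assessment before the files could parse against TEIP4.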

Next Steps for Migrating Priority Content Collections

1. Develop TEIxTite and other UVA DTDs

2. Modify Chadwyck data (poetry, Kafka) for delivery from cenrepo under TEIP4

3. Review all priority content files and assess what it would take to move each to the next document class (TEIxLite or TEIxTite, depending on the condition of the data)

4. Modify data again as deemed practicable

5. Review TEI headers and fix or enhance as needed (Metadata group will have to make recommendations for standards first)

6. Copy files into cenrepo

7. Create a location where migration tools and software can be placed and used for future migration projects, and document changes to all scripts and stylesheets

Proposed Migration Method for Future Projects

This procedure is suggested for future text migration projects, with the caveat that it will need further refinement as more is discovered about delivery needs and data and metadata standards.

1. DCRT provides a list of prioritized content for migration team to work through

2. Technical review and approval of chosen content

3. Notify DCRT of excluded content

4. Examine content

A. Process for legacy TEILite materials

i. Use migration scripts as needed and as available to convert SGML to XML

ii. Parse files against TEIxLite

iii. Set aside problematic files until easily migrated text goes into cenrepo

B. Process for purchased materials

i. Convert SGML to XML as needed

ii. Edit XSLT stylesheet or PERL scripts to modify the files

a. Alter the high-level tags with simple scripts

b. Examine other, lower-level tags to determine viability of programmatic fixes or creating extensions to TEIP4

iii. Parse against TEIP4

5. Determine next steps for data

A. As needed and deemed viable, other programmatic changes should be made to tighten the data to the next document class

B. A determination will be made as to when the file has reached the highest practicable tightening

6. Review TEI headers and fix/enhance as needed

7. Review problematic files

A. Repeat steps 4 and 5

8. Copy files into cenrepo
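
The "parse, then set aside problematic files" pattern in steps 4 and 7 above might be sketched as follows. This is an assumption-laden illustration: Python's standard library parser checks only that a file is well-formed XML, so it stands in here for the validating parse against TEIxLite or TEIP4 that the procedure actually calls for. The function name `triage` and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def triage(documents: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split documents into those that parse and those to set aside.

    documents maps a file name to its XML text. Well-formedness is used
    as a stand-in for DTD validation, which needs a validating parser.
    """
    ready: List[str] = []
    set_aside: List[str] = []
    for name, text in documents.items():
        try:
            ET.fromstring(text)
        except ET.ParseError:
            set_aside.append(name)  # revisit later, as in step 7
        else:
            ready.append(name)      # candidate for copying into cenrepo
    return ready, set_aside


# Hypothetical inputs: the second document has an unclosed <text> element.
docs = {
    "clean.xml": "<TEI.2><text><body/></text></TEI.2>",
    "broken.xml": "<TEI.2><text><body/></TEI.2>",
}
ready, set_aside = triage(docs)
print("ready:", ready)
print("set aside:", set_aside)
```

Keeping the set-aside list separate lets the easily migrated texts move forward first, which is the intent of step 4.A.iii.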

Recommended Tools and Software

Standard XML validating parsers and XSLT processors are already in use. Most significant will be the development, over time, of a collection of PERL scripts and XSLT stylesheets that can be used and reused to modify and regularize data.

Appendix

Priority Content Collections

Priority content collections identified by the Digital Contents Review Team and approved by the Text Migration Team were:

  • Jeffersonian Cyclopedia

    Showpiece work; discrete file

  • Chadwyck-Healey 20th Century African American Poetry

    Paid for but unavailable

    Chadwyck data represents much of the legacy collection; therefore, rules and patterns may emerge

    Note: a format problem required sending the CD-ROMs back, so we substituted the Chadwyck-Healey Kafka database (also currently unavailable)

  • A subset of the Modern English collection

    Large, heavily-used collection with subcollections (the chosen subset includes those Modern English files that were put up in the past few years in XML: about 180 titles)

Charge

• Selection and prioritization of collections will occur in the Digital Contents Review Team, but will be reviewed by the Text Migration Team for technical feasibility. DCRT will recommend preliminary text collections for investigation by 11/1/02.

• The team should anticipate technical issues with these preliminary texts and create a work plan for migration.

• The team's report should make recommendations on tools and software, prioritize the needs, and identify what can and should be created in-house.

• The team is not charged with doing an inventory of existing texts or off-line collection masters.

• Implementation responsibilities will be assigned once this team's recommendations have been received.

Submitted by the Text Migration Team

Melinda Baumann, Chair
Matthew Gibson
Greg Murray
Perry Roland
Chris Ruotolo 

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, June 02, 2008
© The Rector and Visitors of the University of Virginia