Introduction
Electronic text has been produced
by the University of Virginia Library for over a decade. During that time
our methods, needs, and knowledge have changed significantly. Therefore, moving
these many varied collections to a robust digital library architecture poses
many challenges. However, creating a unified collection will be of enormous
benefit to our users, and will better support use, reuse, and preservation
of our electronic text collections.
For this exploratory project, a
two-stage method was devised. The first stage, now wrapping up, is an
assessment of the data. The second is the production work needed to
migrate the collections to the central repository, which could begin immediately
or be put on hold in favor of other priority projects.
Conclusions
- Assessment of the data is the
first step to any migration project.
This assessment should ideally
be done by a high-level programmer with considerable experience with XML,
TEI, PERL, and XSLT; however, a text migration committee should provide
some review and guidance to ensure that Library priorities are met.
- All data will be brought into
XML.
This has been a working assumption
of our DL program since 2001. Some SGML may not migrate without loss of
data.
- All text data should be as predictable
as is considered practicable.
The most predictable data is
the most reusable and sustainable. This position is supported in the Text
DTD Practices report.
- TEIxTite is to be the tightest
document class.
The data that is brought to
this document class will behave most predictably and will provide the
most stability and the widest range of search and display options.
- TEIP4 is probably the lowest
acceptable standard of conformity.
TEIP4 is the broadest and loosest
standard for TEI-based central repository texts. While it is advantageous
to have most text data in TEI, it should be understood that this level
of conformity will provide minimal search and index functions. Only by
regularizing the data will the functionality of the data improve.
- TEIxLite is a subset of TEIP4.
Most existing TEILite (SGML)
texts can be fairly easily migrated to TEIxLite (XML), but migration
to TEIxTite is recommended.
- Migration is an evolutionary
process.
Texts will move from one document
class to the next as there is time and priority given to doing the work
to regularize them. A long-term plan to tighten data is both manageable
and sustainable, and can be directly informed by Library priorities. However,
in determining document classes, the benefits must always be weighed carefully
against the costs. In particular, the purchased material (such as that
produced by Chadwyck-Healey) would take the most effort to migrate. It
may be decided that a lesser document class (and corresponding loss in
functionality) will suffice.
- Workflows and standards will
differ for different kinds of data.
Thus far, two primary workflows
have emerged: one for legacy TEILite material, which can fairly easily
be brought into TEIxLite; and one for purchased material, most of which
does not currently parse against TEIP4 and will take considerable programming
time to bring into conformance. Other workflows may be identified in the future, based
on further exploration of legacy data and the condition of newly purchased
text collections.
- Migrating texts is largely a
programming task.
The success of any text migration
project depends on the programmer's ability to recognize data
patterns from past projects and to assess what it would
take to get the data to parse against one of the UVA document class DTDs.
The programmer must also have a firm grasp of the delivery needs and Library
priorities.
- Library Administration will
have to decide staffing priorities.
Careful consideration must
be given to the significant resources needed to migrate some data to the
most desirable document class. Migrating some texts will draw time and
effort away from other priorities.
- Over time, methods for producing
electronic text should be standardized throughout the Library.
While current projects and
workflows may not permit immediate change, it must be recognized that
texts that are produced in a standardized manner will more easily be brought
into the central repository.
- EAD documents will not be brought
into TEI.
EAD documents and TEI-based
data are fundamentally different and have different purposes. There may
be keyword searching across these types of text, but little sophisticated
search and display functionality will be possible between TEI and EAD
collections.
- The Text Migration Team (TMT) will not provide a recommendation
as to what to do about migrating TEI headers.
A newly-formed metadata working
group will need to make recommendations before TEI headers can be addressed.
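The document classes named in the conclusions above form a ladder from loosest to tightest, and texts are expected to climb it one step at a time. The following is a minimal sketch of that ordering in Python. The class name, the numeric ranks, and the helper next_class are illustrative assumptions, not part of any UVA tooling.

```python
from enum import IntEnum


class DocumentClass(IntEnum):
    """Document classes ordered from loosest to tightest.

    The ordering follows the report; the numeric ranks are assumptions
    chosen only so that the classes can be compared.
    """
    TEIP4 = 1      # broadest and loosest acceptable standard
    TEIXLITE = 2   # a subset of TEIP4
    TEIXTITE = 3   # tightest document class


def next_class(current):
    """Return the next tighter document class, or None at the top."""
    if current is DocumentClass.TEIXTITE:
        return None
    return DocumentClass(current + 1)


# A text that currently parses only against TEIP4 would next be
# tightened toward TEIxLite, if the cost is judged worthwhile.
target = next_class(DocumentClass.TEIP4)
print(target.name)
```

Whether a given text should move up at all remains a cost and benefit decision, as the conclusions note. The sketch only records the order in which the classes are tightened.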
Recommendations
An overview of our methodology
is followed by both a recommendation for future migration procedures and a
general outline of the steps needed to move beyond this first exploratory
stage.
Assessment Methodology
1. Technical review and approval
of chosen content (All content was considered for technical feasibility and
generally approved)
2. Determined whether all files would
parse against TEIP4
A. Legacy materials were already
in TEIxLite (XML)
i. 182 files parsed cleanly
ii. One file was identified
as needing considerable work, so was set aside
B. Chadwyck materials will not
parse against TEIP4 without significant alteration
i. Chadwyck data has its own
DTDs, so the DTD and files were first analyzed
ii. High-level tags for major
structural divisions were altered with XSLT stylesheets and PERL scripts
(e.g., replacing <chad1> with <div1>)
iii. Other, lower-level tags
were assessed but left as-is for this exploratory process
3. Next, a superclass stylesheet for delivery
will be developed (a lightly modified EAF stylesheet)
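Step 2.B.ii above describes renaming high-level structural tags, such as replacing <chad1> with <div1>. The team did this with XSLT stylesheets and PERL scripts. The sketch below shows the same kind of rename in Python using only the standard library, as an illustration rather than the team's actual code. The mapping table, function name, and sample document are assumptions. Only the chad1 to div1 pair comes from the report.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from vendor structural tags to TEI division tags.
# Only chad1 -> div1 is mentioned in the report; further pairs would have
# to come from an analysis of the vendor DTD.
TAG_MAP = {"chad1": "div1"}


def rename_structural_tags(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in tag_map.

    Attributes, text, and child elements are left untouched, which
    mirrors leaving lower-level tags as-is during the assessment.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():          # iter() includes the root element
        if elem.tag in tag_map:
            elem.tag = tag_map[elem.tag]
    return ET.tostring(root, encoding="unicode")


# Made-up sample input, not real Chadwyck data.
sample = "<text><chad1><head>Poems</head><l>A line of verse</l></chad1></text>"
converted = rename_structural_tags(sample)
print(converted)
```

A rename like this only makes the high-level structure look like TEI. The result would still have to be parsed against TEIP4 with a validating parser to confirm that it conforms.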
Next Steps for Migrating
Priority Content Collections
1. Develop TEIxTite and other UVA
DTDs
2. Modify Chadwyck data (poetry,
Kafka) for delivery from cenrepo under TEIP4
3. Review all priority content
files and assess what it would take to move each to the next document class
(TEIxLite or TEIxTite, depending on the condition of the data)
4. Modify data again as deemed
practicable
5. Review TEI headers and fix or
enhance as needed (Metadata group will have to make recommendations for standards
first)
6. Copy files into cenrepo
7. Create a location where migration
tools and software can be placed and used for future migration projects, and
document changes to all scripts and stylesheets
Proposed Migration Method
for Future Projects
This procedure is suggested for
future text migration projects, with the caveat that it will need further
refinement as more is discovered about delivery needs and data and metadata
standards.
1. The Digital Contents Review Team (DCRT) provides a list of prioritized
content for the migration team to work through
2. Technical review and approval
of chosen content
3. Notify DCRT of excluded content
4. Examine content
A. Process for legacy TEILite
materials
i. Use migration scripts as
needed and as available to convert SGML to XML
ii. Parse files against TEIxLite
iii. Set aside problematic
files until easily-migrated text goes into cenrepo
B. Process for purchased materials
i. Convert SGML to XML as needed
ii. Edit XSLT stylesheet or
PERL scripts to modify the files
a. Alter the high-level tags
with simple scripts
b. Examine other, lower-level
tags to determine viability of programmatic fixes or creating extensions
to TEIP4
iii. Parse against TEIP4
5. Determine next steps for data
A. As needed and deemed viable,
other programmatic changes should be made to tighten the data to the next
document class
B. A determination will be made
as to when the file has reached the highest practicable tightening
6. Review TEI headers and fix/enhance
as needed
7. Review problematic files
A. Repeat steps 4 and 5
8. Copy files into cenrepo
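Steps 4 and 7 of the procedure above amount to a triage loop: parse each file, pass the clean ones along, and set the problematic ones aside for a later pass. The following is a minimal sketch of that loop in Python. It is an assumption-laden illustration: the function names and sample documents are invented, and the default check tests only XML well-formedness, which is much weaker than parsing against a DTD such as TEIxLite or TEIP4. A real workflow would pass in a check backed by a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Stand-in check: True if the text is well-formed XML.

    This does NOT validate against any DTD; it is a placeholder for
    the validating parse the procedure calls for.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def triage(documents, check=is_well_formed):
    """Split documents into those that pass the check and those set aside.

    documents maps a file name to its XML text. Returns two sorted
    lists of names: (ready, set_aside).
    """
    ready, set_aside = [], []
    for name, xml_text in documents.items():
        (ready if check(xml_text) else set_aside).append(name)
    return sorted(ready), sorted(set_aside)


# Made-up sample documents; the second has an unclosed <div1>.
docs = {
    "poem1.xml": "<TEI.2><text><body><div1/></body></text></TEI.2>",
    "poem2.xml": "<TEI.2><text><body><div1></body></text></TEI.2>",
}
ready, set_aside = triage(docs)
print("ready:", ready)
print("set aside:", set_aside)
```

Keeping the check as a parameter means the same loop could serve each document class: the files in the set-aside list are the ones that would repeat steps 4 and 5 after further programmatic fixes.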
Recommended Tools and Software
Standard XML validating parsers
and XSLT processors are already in use. Most significant will be the development,
over time, of a collection of PERL scripts and XSLT stylesheets that can be
used and reused to modify and regularize data.
Appendix
Priority Content Collections
Priority content collections identified
by the Digital Contents Review Team and approved by the Text Migration Team
were:
Paid for but unavailable
Chadwyck data represents
much of the legacy collection; therefore, rules and patterns may emerge
Note: a format problem required
sending the CD-ROMs back, so we substituted the Chadwyck-Healey Kafka
database (also currently unavailable)
- A subset of the Modern English
collection
Large, heavily-used collection
with subcollections (the chosen subset includes those Modern English
files that were put up in the past few years in XML: about 180 titles)
Charge
Selection and prioritization
of collections will occur in the Digital Contents Review Team, but will
be reviewed by the Text Migration Team for technical feasibility. DCRT will
recommend preliminary text collections for investigation by 11/1/02.
The team should anticipate
technical issues with these preliminary texts and create a work plan for
migration.
The team's report should
make recommendations on tools and software, prioritize the needs, and identify
what can and should be created in-house.
The team is not charged
with doing an inventory of existing texts or off-line collection masters.
Implementation responsibilities
will be assigned once this team's recommendations have been received.
Submitted by the Text Migration
Team
Melinda Baumann, Chair
Matthew Gibson
Greg Murray
Perry Roland
Chris Ruotolo