Recommendation 1: Adopt
a "document class" approach for texts to be included in the
Digital Library.
Because multiple levels of encoding
practice cannot be mechanically validated against a single DTD, the committee
recommends that a "document class" view be adopted. A document
class is a group of documents that have a high degree of markup and functional
similarity.
The committee recommends three
document classes at this time:
1. The TEIxKeyboard class corresponds
to the low-level practice requirements suggested by the committee charge.
It may be used either for in-house markup or for sending texts out to vendors
for markup. The DTD for this class is described in the DTD specifications
under II.
2. The TEIxTite class corresponds
to the high-level practice requirements suggested by the committee charge.
The DTD for this class is described in the DTD specifications under I.
3. The TEIxLite class accommodates
the legacy markup that exists in TEIxLite form. This class uses the unmodified
TEIxLite DTD (see the DTD specifications under III).
This approach offers two advantages
over a single-DTD approach. First, specific requirements can be enforced and
encoding conveniences can be allowed at different points in a document's life
cycle. For example, the requirements of the keyboarding stage do not have
to be the same as those of the delivery stage. This allows tighter control
over the encoding of a document at each stage. Second, because tighter control
has been exercised over the creation of each document, the entire class of
documents is more predictable, making integrated searching and delivery more
practical.
The list of document classes above
is not exhaustive; new classes of documents will be necessary. For example,
the Library already has a great amount of non-TEI conformant markup. However,
the proliferation of document classes should be avoided. A new document class
should be created only after a determination that a group of documents does
not fit an existing class. If the number of document classes is kept to a
minimum, a higher level of integration can be achieved. Since the creation
of a new document class may have far-reaching effects on production, delivery
and meta-data processes, this should not be an ad hoc decision. Decisions
regarding the creation of new classes should be handled in the periodic review
process described below.
In order to enforce the Library's
practices, a DTD for each document class must be made available. The committee
has begun this work by listing some technical requirements for the two new
DTDs, but has not yet created the DTDs themselves. If the committee's recommendations
are approved, then the DTDs should be finalized as quickly as possible. Because
it will prove difficult to tighten up the DTDs later, they should enforce
the tightest restrictions possible at this time. Additional restrictions
may become apparent when the DTDs are actually written. The committee recommends
an evaluation period before any DTD is used for production material.
In the specifications of the DTDs
provided below, deviations from TEIxLite have been minimized, but not entirely
avoided. Wherever possible, requirements have been met by constraining TEIxLite
and thereby preserving compliance with it. In those cases where this was
not possible, the committee has ensured that programmatic conversion to TEIxLite
without loss of data can be accomplished.
Because TEIxLite is designed for
document production when the markup requirements are loose and for document
interchange between users of TEI with different markup requirements, it is
recommended that no new document production take place using unmodified TEIxLite.
While TEIxLite is highly flexible and serves a large number of users most
of the time, it is not rigorous enough for the production, storage and delivery
processes of the Digital Library, which must enforce rules in order to achieve
a high degree of consistency for production and delivery.
Recommendation 2: Implement
a quality assurance process that goes beyond parsing.
Since DTDs can only be used to
validate the syntax and not the semantic content of a document, the committee
recommends implementing a quality assurance process in addition to parsing.
Some examples of items that could be checked include the formulation of the value
attribute on the date element in ISO format, the existence of name parts
within author, editor, and other personal name elements, and the proper
use of milestone elements and "follow-able" attributes on referencing
elements such as pb, ref, and ptr.
For example, the following XSLT
code could be used to generate a warning if an author element in the titleStmt
of the fileDesc lacks the name parts necessary for the provision of sort-able and
searchable author meta-data.
<xsl:template match="teiHeader/fileDesc/titleStmt/author">
  <xsl:choose>
    <!-- Neither a personal nor a corporate name element is present. -->
    <xsl:when test="not(persName or orgName)">
      WARNING: A NAME ELEMENT IS NEEDED HERE: <xsl:value-of select="."/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Name parts are checked only when a personal name is present. -->
      <xsl:if test="persName and not(persName/foreName)">
        WARNING: A FORENAME ELEMENT MAY BE NEEDED HERE: <xsl:value-of select="persName"/>
      </xsl:if>
      <xsl:if test="persName and not(persName/surName)">
        WARNING: A SURNAME ELEMENT MAY BE NEEDED HERE: <xsl:value-of select="persName"/>
      </xsl:if>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
In addition to general-use "semantics
checkers", scripts may be written to accommodate project-specific semantics.
Any programming language may be used; however, XSLT is efficient, is relatively
easy to learn, and provides a method for addressing individual parts of a
document through its use of XPath syntax.
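The ISO date check mentioned above could also be sketched outside XSLT. The following Python fragment is only an illustration under stated assumptions: the texts are well-formed XML without namespaces, and "ISO format" is taken to mean YYYY, YYYY-MM, or YYYY-MM-DD. The function names `is_iso_date` and `check_dates` are hypothetical and not part of any Library tool.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumed subset of ISO 8601: YYYY, YYYY-MM, or YYYY-MM-DD.
ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_iso_date(value):
    """Return True if value is a plausible ISO 8601 calendar date."""
    match = ISO_DATE.fullmatch(value or "")
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # A missing month or day defaults to 1 so that partial dates validate.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def check_dates(xml_text):
    """Return one warning per <date> whose value attribute is missing or malformed."""
    root = ET.fromstring(xml_text)
    warnings = []
    for elem in root.iter("date"):
        value = elem.get("value")
        if not is_iso_date(value):
            warnings.append("WARNING: date value %r is not in ISO format" % value)
    return warnings


# Hypothetical sample document used only for illustration.
sample = """<TEI.2><text><body><div1 type="letter">
<p>Written <date value="1862-04-07">April 7, 1862</date> and
posted <date value="April 9">two days later</date>.</p>
</div1></body></text></TEI.2>"""

for warning in check_dates(sample):
    print(warning)
```

A check of this kind complements the DTD, which can require the value attribute but cannot constrain its format.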
Recommendation 3: Institute
a periodic review process.
In order to control the creation
of new classes and to manage the migration of documents between existing classes,
it is recommended that a periodic review of the document class structure and
the accompanying DTDs be conducted.
Recommendation 4: Address the
migration of existing electronic texts later.
The issue of migration of existing
texts presents complex problems that should be addressed in a more in-depth
fashion than the committee had the time to do. However, the committee recommends
that data analysis be performed on the existing texts in order to determine
which elements and attributes have actually been used and what their values
are before starting the migration process. Also, as stated above, no new
texts should be created using unmodified TEIxLite so that the body of texts
that may be candidates for conversion does not grow larger.
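The data analysis recommended above amounts to an inventory of elements, attributes, and attribute values across the existing texts. A minimal Python sketch follows, assuming the legacy texts have first been normalized to well-formed XML without namespaces. The function name `inventory` and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def inventory(xml_texts):
    """Count element usage and tally attribute values across documents."""
    element_counts = Counter()
    attribute_values = defaultdict(Counter)
    for xml_text in xml_texts:
        for elem in ET.fromstring(xml_text).iter():
            element_counts[elem.tag] += 1
            for name, value in elem.attrib.items():
                # Key on (element, attribute) so values are reported in context.
                attribute_values[(elem.tag, name)][value] += 1
    return element_counts, attribute_values


# Hypothetical sample documents showing inconsistent attribute values.
docs = [
    '<TEI.2><text><body><div1 type="chapter"><p>One <hi rend="italic">word</hi></p></div1></body></text></TEI.2>',
    '<TEI.2><text><body><div1 type="Chapter"><p>Two <hi rend="i">words</hi></p></div1></body></text></TEI.2>',
]

elements, attributes = inventory(docs)
for (tag, name), values in sorted(attributes.items()):
    print(tag, name, dict(values))
```

A report of this kind shows variant spellings of the same value (here "chapter" and "Chapter"), which would need to be reconciled before any controlled vocabulary could be enforced.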
Ideally, the DTD specifications
below should be the model for conversion efforts. However, recent conversion
efforts have demonstrated that conversion to these specifications cannot be
accomplished entirely programmatically.
Recommendation 5: Address the
teiHeader element later.
Since it is anticipated that the
meta-data within the Digital Library will be collected mechanically
from each document's header, the committee recommends a closer look into the
Library's header-specific practices and requirements. It is highly recommended
that representatives from Cataloging be involved in the process.
DTD Specifications
I. TEIxTite DTD
A. Extensions to TEIxLite
1. Add entity attribute on
<pb> for referencing page image. Use <figure> only for illustrations
embedded in the document's text. This change is endorsed in "TEI
Text Encoding in Libraries: Guidelines for Best Encoding Practices".
Reverting to TEI-conformant markup can be accomplished very easily. Intuitively,
the <pb> element is the place where one would expect the page image
information to be located, so this change actually clarifies the distinction
between page images and illustrations for novice encoders.
2. Modify <pb> element's
content model to allow for page image meta-data: (pgDesc?). TEI-conformance
may be obtained by simply omitting the sub-elements within <pb>.
The new <pgDesc> element will have the same model as <figDesc>
plus the <fw> (forme work) element for capturing running headers
and footers, catchwords, etc. A similar change has been implemented by
the Brown University Women Writers Project.
3. Substitute the XHTML table
model in place of the TEI model. This is suggested in the TEI Guidelines
because the XHTML model is a well-known tag set and the practice anticipates
the end result of translation for display.
4. Modify models of <author>,
<editor>, <sponsor>, <funder>, <principal> and
<respStmt> to allow <persName> and <orgName> sub-elements
which are provided by the full TEI DTD. This is recommended at <http://www.lib.umich.edu/staff/ocu/teiguide.html>.
This change will provide the ability to control the presence of name parts,
such as <surName> and <foreName>, provide additional control
over sorting within a name, and enforce the differentiation of personal
and corporate names.
B. Restrictions of TEIxLite
1. Constrain global rend attribute
to behavioral values, i.e. inline|block|none. Doing this will encourage
typographic markup to be nested within structural markup, e.g. <p><hi
rend="bold"></hi></p>.
2. Modify <text> content
model -- <!ELEMENT text (front?, (body|group), back?)>
3. Modify <group> content
model -- <!ELEMENT group (text|group)>
4. Modify <body> content
model -- <!ELEMENT body ((div1|divGen), (div1|divGen)*)>
5. Eliminate mixed content
model for <note>, <item>, <author>, <editor>,
<sponsor>, <funder>, <principal>.
6. Modify <note> ATTLIST
-- require id, and eliminate the target and targetEnd attributes in order
to encourage the use of <note> only for the actual text of a note,
not the referencing text. Use <ref> for referencing text or <ptr>
(with n attribute, see no. 14 below) when there is no referencing text.
7. Disallow <gi> at the
phrase level, but allow within tagset documentation elements, e.g. <tagsDecl>
and its descendants, <tagUsage>, <rendition>, <creation>,
<classCode>.
8. Require <pb> as first
child of <titlePage>: (pb, (index | interp | interpGrp | lb | milestone
| pb | gap | anchor)*, (byline | docAuthor | docDate | docEdition | docImprint
| docTitle | epigraph | titlePart), (byline | docAuthor | docDate | docEdition
| docImprint | docTitle | epigraph | titlePart | index | interp | interpGrp
| lb | milestone | pb | gap | anchor)*) Use the QA process to ensure that
<titlePage> contains at least one <pb> and one <docImprint>
element.
9. Require reg attribute on
<orig>.
10. Require corr attribute
on <sic>.
11. Require sic attribute on
<corr>.
12. Require orig attribute
on <reg>.
13. Require lang attribute
on <foreign>.
14. Require n attribute on
<ptr> in order to record a label for the pointer.
15. Require type attribute
on <title>. The TEI Guidelines suggest the following values: main,
subordinate, parallel, and abbreviated. Default to "main".
16. Require type attribute
on <titlePart>. The TEI Guidelines suggest the following values:
main, sub, alt, desc. Default to "main".
17. Require type attribute
on <divN>.
18. Require type attribute
on <lg>.
19. Require type attribute
on <num>. The presence of a <num> element within fileDesc/extent
should be checked by the QA process.
20. Make "illegible"
default reason attribute on <unclear>. Require desc and reason attributes
on <gap>.
21. Require value attribute
on <date>. Check for ISO format using QA process.
22. Require type attribute
on <idno>.
23. Require type attribute
on <list>. Suggested values are: simple, gloss. It is a semantic
error for a glossary list not to contain <label><item> pairs.
This should be checked via the QA process.
24. Require type attribute
on <stage>. The TEI Guidelines suggest the following values: setting,
entrance, exit, business, novelistic, delivery, modifier, location, mixed.
No default value.
25. Require id attribute on
<term>. Also, target attribute on <gloss>.
26. Employ controlled vocabulary
liberally, especially for required attributes. Review EAF files, Brown's
WWP DTD, and markup guidelines from other institutions for potential controlled
vocabulary lists.
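Several of the restrictions above defer to the QA process. Two of them, the label/item pairing in glossary lists (no. 23) and the presence of pb and docImprint within titlePage (no. 8), could be checked along the following lines. This is a Python sketch only, assuming well-formed XML without namespaces. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_gloss_lists(root):
    """Warn when a <list type="gloss"> is not made of <label>, <item> pairs."""
    warnings = []
    for lst in root.iter("list"):
        if lst.get("type") != "gloss":
            continue
        # Only label and item children are considered; head, pb, etc. are ignored.
        tags = [child.tag for child in lst if child.tag in ("label", "item")]
        paired = len(tags) % 2 == 0 and all(
            tags[i] == "label" and tags[i + 1] == "item"
            for i in range(0, len(tags), 2)
        )
        if not tags or not paired:
            warnings.append("WARNING: gloss list lacks label/item pairs")
    return warnings


def check_title_pages(root):
    """Warn when a <titlePage> lacks a <pb> or a <docImprint> descendant."""
    warnings = []
    for title_page in root.iter("titlePage"):
        for required in ("pb", "docImprint"):
            if title_page.find(".//" + required) is None:
                warnings.append("WARNING: titlePage lacks <%s>" % required)
    return warnings


# Hypothetical sample with an unpaired label and an incomplete title page.
sample = """<TEI.2><text><front><titlePage>
<docTitle><titlePart type="main">Poems</titlePart></docTitle>
</titlePage></front><body><div1 type="glossary">
<list type="gloss"><label>pb</label><item>page break</item><label>lb</label></list>
</div1></body></text></TEI.2>"""

root = ET.fromstring(sample)
for warning in check_gloss_lists(root) + check_title_pages(root):
    print(warning)
```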
II. TEIxKeyboard DTD
A. Extensions to TEIxTite
1. Encode typographic changes
using "convenience" elements (<i>, <b>, <u>,
<sub>, <sup>, <smcap>) for common typographic features.
These elements will be converted to <hi rend="italics">,
etc. after keyboarding.
2. Add a <quotedletter>
element with a content model that is the same as <divN> for encoder
convenience for cases where <divN> contains a letter and other text.
This change will allow for more intuitive markup and eliminate the need
to artificially segment the <divN>. <quotedletter> will be
converted to <q><text><body><div1>[text of letter
here]</div1></body></text></q> after keyboarding.
3. Add an empty <cols>
element as a "convenience element" for marking beginning of
columnar layout and an empty <cb> element to mark the start of an
individual column. The <cols> element's #REQUIRED n attribute will
be used for recording the number of columns. <cols> will be converted
to <milestone unit='columnation' n='x' /> after keyboarding. Following
<cols n="[>1]"/>, the number of columns indicated by
the n attribute should match the number of <cb /> elements before
the next <cols> element. A return to single-column layout (<cols
n="1" />) doesn't require a following <cb />. This
will need to be checked by the QA process. All <cb> elements will
be converted to <milestone unit="column" /> after keyboarding.
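The cols/cb rule in item 3 cannot be expressed in a DTD, so it falls to the QA process. A minimal Python sketch of that check follows, assuming well-formed XML without namespaces and a numeric n attribute on every cols element (it is #REQUIRED). The function name `check_columns` and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_columns(root):
    """Compare each <cols n="k"/> with the number of <cb/> elements before the next <cols>."""
    runs = []  # one [declared, seen] pair per <cols>, in document order
    for elem in root.iter():
        if elem.tag == "cols":
            runs.append([int(elem.get("n")), 0])
        elif elem.tag == "cb" and runs:
            runs[-1][1] += 1
    warnings = []
    for declared, seen in runs:
        # A return to single-column layout (n="1") needs no following <cb/>.
        if declared > 1 and seen != declared:
            warnings.append(
                "WARNING: cols n=%d followed by %d cb element(s)" % (declared, seen)
            )
    return warnings


# Hypothetical sample: the three-column run contains only two <cb/> elements.
sample = """<div1 type="chapter">
<cols n="2"/><cb/><p>a</p><cb/><p>b</p>
<cols n="1"/><p>c</p>
<cols n="3"/><cb/><p>d</p><cb/><p>e</p>
</div1>"""

for warning in check_columns(ET.fromstring(sample)):
    print(warning)
```

The sketch follows the rule as worded above. A cb element that appears before any cols element is ignored here, although a production check might flag it.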
B. Restrictions of TEIxTite
1. Encode basic structure using
level 3 recommended elements from "TEI Text Encoding in Libraries:
Guidelines for Best Encoding Practices". Remove unused elements,
probably: ident, kw, rs, corr, reg, sic, emph, foreign, mentioned, soCalled,
term, title*, s, gi, eg, listBibl, index, interp, interpGrp. (*Remove <title>
only from the phrase-level class.) [Greg: The keyboarding DTD should disallow
the use of logical elements for changes in typeface. Typical encoding
practices encourage the use of logical elements to identify the meaning
or purpose of a change in typeface, rather than simply recording its physical
appearance. For example, if a phrase is italicized in the print source,
the encoder is normally expected to determine whether the phrase is italicized
because it is a title, a foreign phrase, a subject-specific term, a rhetorical
emphasis, etc. However, practical experience with keyboarding vendors
suggests that expecting the vendor to make such distinctions is not the
best approach. It is preferable to require the vendor to mark changes
in typeface as physical changes (italic, bold, etc.) and to enhance the
markup with logical elements on our end if we deem it valuable to do so.
Such enhancement is not a necessary step for general production texts.]
2. Constrain model for <respStmt>:
((index | interp | interpGrp | lb | milestone | pb | gap | anchor)*, resp,
(index | interp | interpGrp | lb | milestone | pb | gap | anchor | name)+
). This model requires <resp>. The existence of <resp> and
<name> element pairs within <respStmt> should be checked by
the QA process.
3. Constrain model for <cit>:
((q, (lb|pb|milestone)*, (bibl | biblFull), (lb|pb|milestone)*) |
((bibl | biblFull), (lb|pb|milestone)*,
q, (lb|pb|milestone)*)). This model more rigorously enforces the concept
of a citation as a container for a quote and a bibliographic reference
pair.
4. Constrain model for <figure>:
((head, (lb|pb|milestone)*)?, (p, (lb|pb|milestone)*)*, (figDesc, (lb|pb|milestone)*)?).
This model captures the essential parts of a figure and disallows the
(treacherous) ability to embed a distinct text within a figure.
5. Remove the level attribute
on <title> due to past abuse by contracted encoders.
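The respStmt check mentioned in item 2 could be sketched as follows. This Python fragment assumes well-formed XML without namespaces and interprets "pairs" loosely as: every respStmt contains at least one resp and at least one name child. The function name `check_resp_stmts` and the sample header are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_resp_stmts(root):
    """Warn when a <respStmt> lacks a <resp> or a <name> child."""
    warnings = []
    for stmt in root.iter("respStmt"):
        missing = [tag for tag in ("resp", "name") if stmt.find(tag) is None]
        if missing:
            warnings.append(
                "WARNING: respStmt lacks " + ", ".join("<%s>" % tag for tag in missing)
            )
    return warnings


# Hypothetical sample header: the second respStmt has no <name>.
sample = """<teiHeader><fileDesc><titleStmt>
<title type="main">Poems</title>
<respStmt><resp>Transcribed by</resp><name>Example Vendor</name></respStmt>
<respStmt><resp>Markup checked by</resp></respStmt>
</titleStmt></fileDesc></teiHeader>"""

for warning in check_resp_stmts(ET.fromstring(sample)):
    print(warning)
```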
III. TEIxLite
This class utilizes the unmodified
TEIxLite DTD.
----------
DTD Committee Addendum
The DTD committee's results and
recommendations for the UVa Library's digital text production cover the basic
level of textual creation to be provided by the Digital Library Production
Services (DLPS). However, this committee did not and could not address several
issues: 1) the method by which the Electronic Text Center continues to encode
material, 2) the problems regarding the migration of legacy licensed data
and partner projects to the repository, and 3) the effect that using a non-TEI
standard DTD will have on the TEI community, a community in which UVa plays
an integral part.
1) Present encoding policies
at the Etext Center vary from project to project, primarily because each
project requires a different level of markup, even
beyond TEIxLite.dtd, to describe content and to provide different text
functionality; the NSF German project is an excellent example of this.
The future text-encoding policies at the Electronic Text Center will continue
to ensure individuated project functionalities. Until the Digital Library
can provide a delivery system for text, the Etext Center will be unable
to accommodate the new markup requirements, because doing so would require
the ETC to modify all delivery tools currently in use, and there
are currently no resources for that work. Where possible we will follow the guidelines
set forth by the DTD Committee.
2) As far as legacy licensed
data is concerned, several questions arise that need to be answered before
any attempt at migration can begin:
a. Does the UVa Library have
permission to edit the markup of these SGML databases (e.g. Chadwyck Healey's
English Poetry)? Before we can begin the migration of these
licensed databases into any form of XML, we will need to obtain
rights clearance to do so.
b. Questions regarding functionality
and cost-effectiveness will also undoubtedly arise. If we do get permissions
to change the markup of SGML licensed data, does it make sense to invest
additional time, energy, and money if the text functionality of these
databases cannot be enhanced through such a migration strategy? Or should
we strive for a text system, like the one we have now, that can handle
both SGML and XML?
c. How will the library approach
the creators and partners of user-based projects regarding the conversion
of those projects to fit the constraints of the repository? What system
do we have in place to ensure that our partners have authorized the conversion
or change in markup? What system do we have in place for our partners
to continue to work on their texts and projects once they have been migrated
to the repository?
3) UVa is a primary partner
in the TEI Consortium. We feel strongly that any deviations from the current
TEI.dtd or TEIxLite.dtd need to be presented to the Consortium for discussion
and possible ratification in future TEI Guidelines. We think that because
of our position in the Consortium, we need to be sensitive to the political
message that we will send by using "non-TEI standard" DTDs.