University of Virginia
University of Virginia Library
Digital Initiatives: Reports

Text DTD Practices Group Report

October 2, 2002

Recommendation 1: Adopt a "document class" approach for texts to be included in the Digital Library.

Because multiple levels of encoding practice cannot be mechanically validated against a single DTD, the committee recommends that a "document class" view be adopted.  A document class is a group of documents that have a high degree of markup and functional similarity.

The committee recommends three document classes at this time:

1.  The TEIxKeyboard class corresponds to the low-level practice requirements suggested by the committee charge.  It may be used for either in-house markup or for sending texts out to vendors for markup.  The DTD for this class is described in the DTD specifications under II.

2.  The TEIxTite class corresponds to the high-level practice requirements suggested by the committee charge.  The DTD for this class is described in the DTD specifications under I.

3.  The TEIxLite class accommodates the legacy markup that exists in TEIxLite form.  This class utilizes the unmodified TEIxLite DTD.

This approach offers two advantages over a single-DTD approach.  First, specific requirements can be enforced and encoding conveniences can be allowed at different points in a document's life cycle.  For example, the requirements of the keyboarding stage do not have to be the same as those of the delivery stage.  This allows tighter control over the encoding of a document at each stage.  Second, because tighter control has been exercised over the creation of each document, the entire class of documents is more predictable, making integrated searching and delivery more practical.

The list of document classes above is not exhaustive; new classes of documents will be necessary.  For example, the Library already has a great amount of non-TEI conformant markup.  However, the proliferation of document classes should be avoided.  A new document class should be created only after a determination that a group of documents does not fit an existing class.  If the number of document classes is kept to a minimum, a higher level of integration can be achieved.  Since the creation of a new document class may have far-reaching effects on production, delivery and meta-data processes, this should not be an ad hoc decision.  Decisions regarding the creation of new classes should be handled in the periodic review process described below.

In order to enforce the Library's practices, a DTD for each document class must be made available.  The committee has begun this work by listing some technical requirements for the two new DTDs, but has not yet created the DTDs themselves.  If the committee's recommendations are approved, then the DTDs should be finalized as quickly as possible.  Because it will prove difficult to tighten up the DTDs later, they should enforce the tightest restrictions possible at this time.  Additional restrictions may become apparent when the DTDs are actually written.  The committee recommends an evaluation period before any DTD is used for production material.

In the specifications of the DTDs provided below, deviations from TEIxLite have been minimized, but not entirely avoided.  Wherever possible, requirements have been met by constraining TEIxLite and thereby preserving compliance with it.  In those cases where this was not possible, the committee has ensured that programmatic conversion to TEIxLite without loss of data can be accomplished.

Because TEIxLite is designed for document production when the markup requirements are loose and for document interchange between users of TEI with different markup requirements, it is recommended that no new document production take place using unmodified TEIxLite.  While TEIxLite is highly flexible and serves a large number of users most of the time, it is not rigorous enough for the production, storage and delivery processes of the Digital Library, which must enforce rules in order to achieve a high degree of consistency for production and delivery.

Recommendation 2: Implement a quality assurance process that goes beyond parsing.

Since DTDs can only be used to validate the syntax and not the semantic content of a document, the committee recommends implementing a quality assurance process in addition to parsing.  Examples of items that could be checked include whether value attributes on the date element are expressed in ISO format, whether name parts exist within author, editor, and other personal name elements, and whether milestone elements and "follow-able" attributes on referencing elements such as pb, ref, and ptr are used properly.

For example, the following XSLT code could be used to generate a warning if an author element within the titleStmt of the fileDesc lacks the name parts necessary for the provision of sort-able and searchable author meta-data.

<xsl:template match="teiHeader/fileDesc/titleStmt/author">
   <xsl:choose>
      <!-- Neither a personal nor an organizational name element is present. -->
      <xsl:when test="not(persName) and not(orgName)">
WARNING: A NAME ELEMENT IS NEEDED HERE: <xsl:value-of select="." />
      </xsl:when>
      <xsl:otherwise>
         <!-- Name parts are checked only when a persName is present. -->
         <xsl:if test="persName and not(persName/foreName)">
WARNING: A FORENAME ELEMENT MAY BE NEEDED HERE: <xsl:value-of select="./persName"/>
         </xsl:if>
         <xsl:if test="persName and not(persName/surName)">
WARNING: A SURNAME ELEMENT MAY BE NEEDED HERE: <xsl:value-of select="./persName"/>
         </xsl:if>
      </xsl:otherwise>
   </xsl:choose>
</xsl:template>

In addition to general-use "semantics checkers", scripts may be written to accommodate project-specific semantics.  Any programming language may be used; however, XSLT is efficient, is relatively easy to learn, and provides a method for addressing individual parts of a document through its use of XPath syntax.
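
As an illustration of a semantics checker in another language, the ISO date check mentioned under this recommendation might be sketched in Python as follows.  This is only a sketch under stated assumptions: the function names are hypothetical, the input is assumed to be well-formed XML without namespaces, and "ISO format" is assumed to mean YYYY, YYYY-MM, or YYYY-MM-DD.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: only YYYY, YYYY-MM and YYYY-MM-DD are accepted (a subset of ISO 8601).
ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_iso_date(value):
    """Return True if value has an accepted ISO shape and names a real calendar date."""
    match = ISO_DATE.fullmatch(value)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # A missing month or day defaults to 1 so that partial dates still validate.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def check_dates(xml_text):
    """Return one warning per <date> whose value attribute is missing or malformed."""
    warnings = []
    for elem in ET.fromstring(xml_text).iter("date"):
        value = elem.get("value")
        if value is None:
            warnings.append("WARNING: <date> lacks a value attribute")
        elif not is_iso_date(value):
            warnings.append("WARNING: <date> value is not in ISO format: " + value)
    return warnings
```

Calling check_dates() on a document's text returns a list of warning strings; an empty list means every date element passed.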

Recommendation 3: Institute a periodic review process.

In order to control the creation of new classes and to manage the migration of documents between existing classes, it is recommended that a periodic review of the document class structure and the accompanying DTDs be conducted.

Recommendation 4: Address the migration of existing electronic texts later.

The migration of existing texts presents complex problems that should be addressed in more depth than the committee had time for.  However, the committee recommends that, before the migration process starts, data analysis be performed on the existing texts in order to determine which elements and attributes have actually been used and what their values are.  Also, as stated above, no new texts should be created using unmodified TEIxLite so that the body of texts that may be candidates for conversion does not grow larger.
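
The data analysis recommended here could begin with a simple survey of element and attribute usage.  The following Python sketch is illustrative only: the function name is hypothetical, and it assumes each text is available as well-formed XML (legacy SGML texts would first need to be converted or read with an SGML-aware parser).

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def survey_markup(xml_text):
    """Count element usage and tally attribute values for one document."""
    element_counts = Counter()
    # Maps (element name, attribute name) to a Counter of the values seen.
    attribute_values = defaultdict(Counter)
    for elem in ET.fromstring(xml_text).iter():
        element_counts[elem.tag] += 1
        for name, value in elem.attrib.items():
            attribute_values[(elem.tag, name)][value] += 1
    return element_counts, attribute_values
```

Summing the returned counters over all existing texts would show which elements and attribute values are actually in use, for instance every distinct rend value found on hi.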

Ideally, the DTD specifications below should be the model for conversion efforts.  However, recent conversion efforts have demonstrated that conversion to these specifications cannot be accomplished entirely programmatically.

Recommendation 5: Address the teiHeader element later.

Since it is anticipated that the meta-data within the Digital Library will be collected mechanically from each document's header, the committee recommends a closer look into the Library's header-specific practices and requirements.  It is highly recommended that representatives from Cataloging be involved in the process.

DTD Specifications

I. TEIxTite DTD

A. Extensions to TEIxLite

1. Add entity attribute on <pb> for referencing page image. Use <figure> only for illustrations embedded in the document's text.  This change is endorsed in "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices".  Reverting to TEI-conformant markup can be accomplished very easily. Intuitively, the <pb> element is the place where one would expect the page image information to be located, so this change actually clarifies the distinction between page images and illustrations for novice encoders.

2. Modify <pb> element's content model to allow for page image meta-data: (pgDesc?).  TEI-conformance may be obtained by simply omitting the sub-elements within <pb>.  The new <pgDesc> element will have the same model as <figDesc> plus the <fw> (forme work) element for capturing running headers and footers, catchwords, etc.  A similar change has been implemented by the Brown University Women Writers Project.

3. Substitute the XHTML table model in place of the TEI model.  This is suggested in the TEI Guidelines because the XHTML model is a well-known tag set and the practice anticipates the end result of translation for display.

4. Modify models of <author>, <editor>, <sponsor>, <funder>, <principal> and <respStmt> to allow <persName> and <orgName> sub-elements which are provided by the full TEI DTD.  This is recommended at <http://www.lib.umich.edu/staff/ocu/teiguide.html>.  This change will provide the ability to control the presence of name parts, such as <surName> and <foreName>, provide additional control over sorting within a name, and enforce the differentiation of personal and corporate names.

B. Restrictions of TEIxLite

1. Constrain global rend attribute to behavioral values, i.e. inline|block|none.  Doing this will encourage typographic markup to be nested within structural markup, i.e. <p><hi rend="bold"></hi></p>.

2. Modify <text> content model --  <!ELEMENT text (front?, (body|group), back?)>

3. Modify <group> content model -- <!ELEMENT group (text|group)>

4. Modify <body> content model --  <!ELEMENT body ((div1|divGen), (div1|divGen)*)>

5. Eliminate mixed content model for <note>, <item>, <author>, <editor>, <sponsor>, <funder>, <principal>.

6. Modify <note> ATTLIST -- require id, and eliminate the target and targetEnd attributes in order to encourage the use of <note> only for the actual text of a note, not the referencing text. Use <ref> for referencing text or <ptr> (with n attribute, see no. 14 below) when there is no referencing text. 

7. Disallow <gi> at the phrase level, but allow within tagset documentation elements, e.g. <tagsDecl> and its descendants, <tagUsage>, <rendition>, <creation>, <classCode>.

8. Require <pb> as first child of <titlePage>: (pb, (index | interp | interpGrp | lb | milestone | pb | gap | anchor)*, (byline | docAuthor | docDate | docEdition | docImprint | docTitle | epigraph | titlePart), (byline | docAuthor | docDate | docEdition | docImprint | docTitle | epigraph | titlePart | index | interp | interpGrp | lb | milestone | pb | gap | anchor)*).  Use the QA process to ensure that <titlePage> contains at least one <pb> and one <docImprint> element.

9. Require reg attribute on <orig>.

10. Require corr attribute on <sic>.

11. Require sic attribute on <corr>.

12. Require orig attribute on <reg>.

13. Require lang attribute on <foreign>.

14. Require n attribute on <ptr> in order to record a label for the pointer. 

15. Require type attribute on <title>.  The TEI Guidelines suggest the following values: main, subordinate,  parallel, and abbreviated.  Default to "main".

16. Require type attribute on <titlePart>.  The TEI Guidelines suggest the following values: main, sub,  alt, desc.  Default to "main".

17. Require type attribute on <divN>.

18. Require type attribute on <lg>.

19. Require type attribute on <num>.  The presence of a <num> element within fileDesc/extent should be  checked by the QA process.

20. Make "illegible" the default value of the reason attribute on <unclear>. Require desc and reason attributes on <gap>.

21. Require value attribute on <date>.  Check for ISO format using QA process.

22. Require type attribute on <idno>.

23. Require type attribute on <list>.  Suggested values are: simple, gloss.  It is a semantic error for a glossary  list not to contain <label><item> pairs.  This should be checked via the QA process.

24. Require type attribute on <stage>.   The TEI Guidelines suggest the following values: setting,  entrance, exit, business, novelistic, delivery, modifier, location, mixed.  No default value.

25. Require id attribute on <term>.  Also require target attribute on <gloss>.

26. Employ controlled vocabulary liberally, especially for required attributes. Review EAF files, Brown's WWP DTD, and markup guidelines from other institutions for potential controlled vocabulary lists.
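
Several of the restrictions above defer to the QA process.  As one illustration, the glossary-list rule in no. 23 might be checked with a Python sketch like the one below.  The function name is hypothetical, the input is assumed to be well-formed XML without namespaces, and other children of the list (such as <head> or <pb>) are assumed to be irrelevant to the pairing.

```python
import xml.etree.ElementTree as ET


def check_gloss_lists(xml_text):
    """Return a warning for each gloss list not made of <label><item> pairs."""
    warnings = []
    # Lists are numbered in document order, counting lists of every type.
    for number, lst in enumerate(ET.fromstring(xml_text).iter("list"), start=1):
        if lst.get("type") != "gloss":
            continue
        # Only <label> and <item> children matter; other children are skipped.
        tags = [child.tag for child in lst if child.tag in ("label", "item")]
        paired = (
            len(tags) > 0
            and len(tags) % 2 == 0
            and all(tag == ("label", "item")[i % 2] for i, tag in enumerate(tags))
        )
        if not paired:
            warnings.append(
                "WARNING: gloss list %d does not consist of <label><item> pairs" % number
            )
    return warnings
```

An empty result means that every list with type="gloss" alternates strictly between <label> and <item>, starting with a <label>.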

II. TEIxKeyboard DTD

A. Extensions to TEIxTite

1. Encode typographic changes using "convenience" elements (<i>, <b>, <u>, <sub>, <sup>, <smcap>) for common typographic features.  These elements will be converted to <hi rend="italics">, etc. after keyboarding.

2. Add a <quotedletter> element with a content model that is the same as <divN> for encoder convenience for cases where <divN> contains a letter and other text.  This change will allow for more intuitive markup and eliminate the need to artificially segment the <divN>.  <quotedletter> will be converted to <q><text><body><div1>[text of letter here]</div1></body></text></q> after keyboarding.

3. Add an empty <cols> element as a "convenience element" for marking beginning of columnar layout and an empty <cb> element to mark the start of an individual column.  The <cols> element's #REQUIRED n attribute will be used for recording the number of columns.  <cols> will be converted to <milestone unit='columnation' n='x' /> after keyboarding.  Following <cols n="[>1]"/>, the number of columns indicated by the n attribute should match the number of <cb /> elements before the next <cols> element. A return to single-column layout (<cols n="1" />) doesn't require a following <cb />.  This will need to be checked by the QA process.  All <cb> elements will be converted to <milestone unit="column" /> after keyboarding.
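
The column check described in no. 3 could be carried out along the following lines.  This Python sketch is a reading of the rule as stated above rather than a definitive implementation: the function name is hypothetical, the input is assumed to be well-formed XML without namespaces, and a <cb /> that appears before any <cols> is simply ignored.

```python
import xml.etree.ElementTree as ET


def check_columns(xml_text):
    """Compare each <cols n="k"/> (k > 1) with the number of <cb/> that follow it."""
    warnings = []
    runs = []  # one [expected, seen] pair per <cols>, in document order
    # iter() walks the tree in document order, which is what the rule needs.
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "cols":
            try:
                expected = int(elem.get("n", ""))
            except ValueError:
                expected = None
            if expected is None or expected < 1:
                warnings.append("WARNING: <cols> needs a positive integer n attribute")
                expected = None
            runs.append([expected, 0])
        elif elem.tag == "cb" and runs:
            # Count this column break against the most recent <cols>.
            runs[-1][1] += 1
    for expected, seen in runs:
        # A return to single-column layout (n = 1) requires no <cb/>.
        if expected is not None and expected > 1 and seen != expected:
            warnings.append(
                'WARNING: <cols n="%d"/> is followed by %d <cb/> element(s)'
                % (expected, seen)
            )
    return warnings
```

Each <cols> opens a new run that lasts until the next <cols> or the end of the document, so the counts are compared only after the whole document has been read.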

B. Restrictions of TEIxTite

1. Encode basic structure using level 3 recommended elements from "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices". Remove unused elements, probably: ident, kw, rs, corr, reg, sic, emph, foreign, mentioned, soCalled, term, title, s, gi, eg, listBibl, index, interp, interpGrp.  Note: remove <title> only from the phrase-level class.

[Greg: The keyboarding DTD should disallow the use of logical elements for changes in typeface.  Typical encoding practices encourage the use of logical elements to identify the meaning or purpose of a change in typeface, rather than simply recording its physical appearance.  For example, if a phrase is italicized in the print source, the encoder is normally expected to determine whether the phrase is italicized because it is a title, a foreign phrase, a subject-specific term, a rhetorical emphasis, etc. However, practical experience with keyboarding vendors suggests that expecting the vendor to make such distinctions is not the best approach. It is preferable to require the vendor to mark changes in typeface as physical changes (italic, bold, etc.) and to enhance the markup with logical elements on our end if we deem it valuable to do so.  Such enhancement is not a necessary step for general production texts.]

2. Constrain model for <respStmt>: ((index | interp | interpGrp | lb | milestone | pb | gap | anchor)*, resp, (index | interp | interpGrp | lb | milestone | pb | gap | anchor | name)+ ).  This model requires <resp>.  The existence of <resp> and <name> element pairs within <respStmt> should be checked by the QA process.

3. Constrain model for <cit>: ((q, (lb|pb|milestone)*, (bibl | biblFull), (lb|pb|milestone)*) |

   ((bibl | biblFull), (lb|pb|milestone)*, q, (lb|pb|milestone)*)).  This model more rigorously enforces the concept of a citation as a container for a quote and a bibliographic reference pair.

4. Constrain model for <figure>: ((head, (lb|pb|milestone)*)?, (p, (lb|pb|milestone)*)*, (figDesc, (lb|pb|milestone)*)?).  This model captures the essential parts of a figure and disallows the (treacherous) ability to embed a distinct text with a figure.

5. Remove the level attribute on <title> due to past abuse by contracted encoders.
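
The constrained <respStmt> model in no. 2 requires <resp> but, as written, appears to allow a statement with no <name>, which is why the pairing is left to the QA process.  A minimal Python sketch of that check follows; the function name is hypothetical, the input is assumed to be well-formed XML without namespaces, and a "pair" is taken to mean at least one <resp> and at least one <name> as direct children.

```python
import xml.etree.ElementTree as ET


def check_resp_statements(xml_text):
    """Return a warning for each <respStmt> lacking a <resp> or a <name> child."""
    warnings = []
    statements = ET.fromstring(xml_text).iter("respStmt")
    for number, stmt in enumerate(statements, start=1):
        # find() with a bare tag name looks only at direct children.
        missing = [tag for tag in ("resp", "name") if stmt.find(tag) is None]
        if missing:
            warnings.append(
                "WARNING: respStmt %d lacks: %s" % (number, ", ".join(missing))
            )
    return warnings
```

The <resp> half of the check is redundant for documents that have already been validated against the constrained model, but it keeps the script usable on unvalidated input.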

III.  TEIxLite

This class utilizes the unmodified TEIxLite DTD.

----------

DTD Committee Addendum

The DTD committee's results and recommendations for the UVa Library's digital text production cover the basic level of textual creation to be provided by the Digital Library Production Services (DLPS).  However, this committee did not and could not address several issues: 1) the method by which the Electronic Text Center continues to encode material, 2) the problems regarding the migration of legacy licensed data and partner projects to the repository, and 3) the effect that using a non-TEI-standard DTD will have on the TEI community, a community in which UVa plays an integral part.

1) Present encoding policies at the Etext Center vary from project to project, primarily because each project requires a different level of markup, even beyond the TEIxLite.dtd, to describe content and to provide different text functionality; the NSF German project is an excellent example of this.  The future text-encoding policies at the Electronic Text Center will continue to ensure individuated project functionalities.  Until the Digital Library can provide a delivery system for text, the Etext Center will be unable to accommodate the new markup requirements, because doing so would require the ETC to modify all delivery tools currently in use, and there are currently no resources for doing so.  Where possible we will follow the guidelines set forth by the DTD Committee.

2) As far as legacy licensed data is concerned, several questions arise that need to be answered before any attempt at migration can begin:

a.  Does the UVa Library have permission to edit the markup of these licensed SGML databases (e.g. Chadwyck-Healey's English Poetry)?  Before we can begin the migration of these licensed databases into any form of XML we will need to obtain rights clearance to do so.

b.  Questions regarding functionality and cost-effectiveness will also undoubtedly arise.  If we do get permission to change the markup of SGML licensed data, does it make sense to invest additional time, energy, and money if the text functionality of these databases cannot be enhanced through such a migration strategy?  Or should we strive for a text system, like the one we have now, that can handle both SGML and XML?

c.  How will the library approach the creators and partners of user-based projects regarding the conversion of those projects to fit the constraints of the repository?  What system do we have in place to ensure that our partners have authorized the conversion or change in markup?  What system do we have in place for our partners to continue to work on their texts and projects once they have been migrated to the repository?

3)  UVa is a primary partner in the TEI Consortium.  We feel strongly that any deviations from the current TEI.dtd or TEIxLite.dtd need to be presented to the Consortium for discussion and possible ratification in future TEI Guidelines.  We think that because of our position in the Consortium, we need to be sensitive to the political message that we will send by using "non-TEI standard" DTDs.

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, June 02, 2008
© The Rector and Visitors of the University of Virginia