University of Virginia
University of Virginia Library
Digital Initiatives: Reports

Repository Text Delivery Committee Report

December 10, 2003

This report summarizes the accomplishments of the Text Delivery Committee as of the end of November, 2003. The original charge of the committee was to have Phase I functioning completely within the Fedora framework by December 1, 2003. This was not possible due to a range of issues that the committee uncovered during the course of its work and that are addressed in this document. The scope of the newly created Fedora Implementors Committee encompasses several of the charge areas of the Text Delivery Committee, so merging the charges of the two committees is the more efficient course. As of December 15, 2003, the work of the Text Delivery Committee is to be considered complete. Any outstanding issues not addressed by the Text Delivery Committee will be absorbed into the charge of the Fedora Implementors Committee.

Fedora Repository Software

At the time of the Text Delivery Committee’s inception, the Fedora software was at release 1.1. The committee identified several features lacking in the Fedora 1.1 release that would be helpful in designing and developing workflow tools, including a solution to the problem of PID (Persistent Identifier) substitution in data objects and the ability to modify batches of objects. Fedora 1.2, scheduled for release on December 10, 2003, will include significant enhancements, among them a new feature that enables the pre-generation of PIDs and a retainPIDs option in the Fedora configuration file. These two new features will greatly simplify the design and development of workflow tools and resolve many of the difficulties cited with the previous PID substitution issue. In addition, Fedora 1.2 will include full content versioning and enhanced GUI (Graphical User Interface) tools for creating Behavior Definition and Behavior Mechanism objects and for manipulating and editing data objects. Fedora 1.2 will also include a migration utility that eases the process of migrating repository content from previous versions of Fedora.
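To illustrate why pre-generated PIDs simplify workflow tools, the following Python sketch assigns PIDs to a set of files before ingest and then rewrites filename references to those PIDs. It is a rough illustration only and does not use Fedora's own PID generator. The function names, the "test" namespace, and the assumption that references appear in href attributes are all hypothetical.

```python
import re

def pregenerate_pids(namespace, n, start=1):
    """Return n PIDs of the form 'namespace:number'.

    Hypothetical stand-in for a repository-side PID generator.
    """
    return [f"{namespace}:{i}" for i in range(start, start + n)]

def substitute_pids(xml_text, pid_by_filename):
    """Replace href values that name a known file with that file's PID.

    Assumes (for illustration) that references live in href="..." attributes.
    Unknown href values are left untouched.
    """
    def swap(match):
        name = match.group(2)
        return match.group(1) + pid_by_filename.get(name, name) + match.group(3)
    return re.sub(r'(href=")([^"]*)(")', swap, xml_text)

# Usage: assign PIDs first, then rewrite references before ingest.
files = ["b001.xml", "b001_p1.tif"]
mapping = dict(zip(files, pregenerate_pids("test", len(files))))
rewritten = substitute_pids('<page href="b001_p1.tif"/>', mapping)
```

Because every PID is known before any object is ingested, references between objects can be resolved in a single pass rather than patched after ingest.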

The addition of a batch modify utility is not slated until Fedora 1.3, which will be released sometime in early 2004. In the interim, modifications can be accomplished interactively or as a batch of deletes and adds.
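The interim approach of treating a batch modification as deletes followed by adds can be sketched as below. This is a minimal illustration in Python that uses an in-memory dictionary as a stand-in for the repository. It does not call any Fedora interface, and it assumes each re-added object keeps its original PID.

```python
def batch_modify(repository, updated_objects):
    """Apply modifications as a batch of deletes followed by a batch of adds.

    repository      -- dict mapping PID -> object content (hypothetical stand-in)
    updated_objects -- dict mapping PID -> new object content
    Returns the sorted list of PIDs that were replaced.
    """
    # Refuse to start if any target is missing, so the batch is all-or-nothing.
    missing = [pid for pid in updated_objects if pid not in repository]
    if missing:
        raise KeyError(f"cannot modify objects not in repository: {missing}")

    for pid in updated_objects:
        del repository[pid]              # delete step

    for pid, obj in updated_objects.items():
        repository[pid] = obj            # add step, re-using the same PID

    return sorted(updated_objects)
```

Checking for missing objects before deleting anything keeps a bad batch from leaving the repository half modified.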

The committee decided to delay the development of workflow tools related to object creation/ingestion until after the Fedora 1.2 release for several reasons.

  • New features in Fedora 1.2 would greatly simplify the design and development of workflow tools.
  • Content created in earlier Fedora 1.x releases would require migration before it could be ported to Fedora 1.2.
  • Fedora 1.2 provided enhanced support for modification of existing objects.
  • Fedora 1.2 provided fixes for several known bugs in Fedora 1.1.

Fedora 1.1 was configured and installed on the Library unix server known as hatbox.lib.virginia.edu. A DNS alias named repo.lib.virginia.edu was created to serve as the permanent hostname for the Fedora repository. The alias will facilitate future migrations if the Fedora software ever needs to move to a new machine. The Fedora software on hatbox was also upgraded to the latest CVS version of Fedora to enable testing of the new features of Fedora 1.2.

The committee recommends waiting for the release of Fedora 1.2 in December before developing workflow tools and ingesting final versions of objects into the repository.

XPat Search Software

The XPat search software was installed on the unix server named scripta.lib.virginia.edu. The committee demonstrated that the new XPat software was functionally identical to the previous version of OpenText running on etext.lib.virginia.edu. The new XPat software included some bug fixes to the earlier code, but the basic code and the way the scripts are invoked and used remained the same. Tests using OpenText scripts that had been running on etext.lib were conducted successfully on scripta.lib, indicating that few, if any, changes would be necessary in porting OpenText scripts to XPat scripts. Although a rigorous analysis was not performed, it was noted that creating indexes on scripta.lib was approximately twice as fast as performing the same task on etext.lib.

The DLXS middleware that comes bundled with XPat was also installed on scripta.lib. The middleware contained updated versions of the MRSID retrieve and tif2gif programs for manipulating MRSID and bitonal TIFF images. These new programs replaced the earlier versions being used in the new delivery scripts.

Library System Infrastructure

A number of system infrastructure issues were identified related to the implementation of a large-scale production Fedora repository. These issues include disk storage capacity and planning, backup and archival capacity and planning, and server load and performance. These issues are being addressed as part of long term Library Infrastructure planning, but funding resources for many of these issues are not presently identified. The committee notes that while most of these issues are not critical for the current Phase I implementation, they will become more critical as content is added to the repository and will require addressing in the near future.

Since the Phase I content and delivery mechanisms are being used for demonstrations, it was necessary to protect the existing Phase I content and programs from changes required to bring the content and programs into Fedora. Necessary changes to the content involve using persistent identifiers (PIDs) to identify objects rather than filenames or simple URLs. These changes in the content may also necessitate changes in the delivery programs and stylesheets. To prevent the Fedora work from impacting the current Phase I repository, we created a mirror copy of the Phase I Text content. Once the Fedora implementation is complete and functional, the current Phase I repository in CENREPO will be replaced with the content of the mirror. In the interim, both can coexist without affecting one another.

The machine known as hatbox.lib.virginia.edu is also currently designated as the “text application server” which means that is where most text-based applications like XSLT engines and XML parsers and validators reside. Hatbox.lib is also the machine housing the Fedora repository and the MySQL database used by Fedora. As the repository grows, Fedora and the MySQL server will require additional system resources that may mean moving these applications to other machines or moving the text applications to another server. A DNS alias of text.lib.virginia.edu has been created to facilitate future machine name changes or software relocations. Any text-based applications running on hatbox.lib should be invoked using the alias of text.lib.virginia.edu. There is a similar image alias of image.lib.virginia.edu, which currently points at the server named iris.lib.virginia.edu. All image-based applications on iris.lib should be invoked using this alias.
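As a small illustration of invoking applications through the aliases rather than the physical machine names, the Python sketch below builds service URLs from a single alias table. The function name, the http scheme, and the example path are assumptions made for illustration. Only the two alias hostnames come from this report.

```python
# Alias hostnames taken from the report; physical hosts are never referenced.
SERVICE_ALIASES = {
    "text": "text.lib.virginia.edu",    # currently points at hatbox.lib
    "image": "image.lib.virginia.edu",  # currently points at iris.lib
}

def service_url(kind, path):
    """Build a URL for a text- or image-based application via its DNS alias.

    kind -- "text" or "image"
    path -- application path on the server (hypothetical in the usage below)
    """
    host = SERVICE_ALIASES[kind]
    return f"http://{host}/{path.lstrip('/')}"

# Usage: scripts call service_url() instead of hard-coding a machine name.
example = service_url("text", "/cgi-bin/transform")
```

If an application later moves to another machine, only the DNS alias changes and scripts written this way need no edits.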

XPat is currently installed on the server named scripta.lib. Eventually, all of the text-based applications should be co-located on the same server. This would most likely mean that the text-based applications on hatbox.lib would be moved to scripta.lib. Since the XPat software requires a unix filesystem path in order to build its indexes, all text files would have to be visible from the machine where XPat is installed. This is currently being accomplished by remote mounting the CENREPO filesystem to scripta.lib. These logistical issues do not directly impact the development of Phase I or Phase II, but are important to keep in mind in planning for future expansion of the repository.

The committee also noted some inconsistencies in the directory naming structure of CENREPO. Part of this is due to standards and best practices that have evolved over time. The mirror of CENREPO should have its directory structure modified to bring the directory naming structure up to the current best practice standard. When the final move occurs, the new directory structure would replace what is currently in CENREPO. It is vital that the directory naming structure used for CENREPO be consistent to facilitate the development of workflow tools and limit potential migration issues in the future.

Metadata

The Metadata Steering Group (MSG) finished the final mappings between the TEIHeader elements and uvaDescMeta elements before the Thanksgiving holidays, and programming has begun on an XSL stylesheet to generate the descriptive metadata records for texts. The MSG’s next task is to work on the mappings for uvaAdminMeta. Once those mappings and the XSL programming are completed, it will be possible to create both the descriptive and administrative metadata necessary for each text object. Mappings for the EADHeader to uvaDescMeta are slated for review after the administrative metadata mappings are completed.

Text Object Model

The committee reviewed the report from the Text Image Object Model Committee and decided that a single text object model for TEI-encoded texts would not be feasible because of the variation that can occur in different types of TEI texts. The committee recommends a slightly modified model that adds specialized text disseminators for each unique class of texts and provides a general text disseminator with a limited number of behaviors that are true for all texts. The committee also recommends adding a static datastream that would contain table of contents information. The decision to handle this statically rather than dynamically was made because of the potential variation in document tagging structures. The resulting TEI-Book object model is shown in Figure 1.

Figure 1.  TEI Book Object Model

Disseminators

Although the committee’s charge involved only Texts, there were several additional common disseminators that would eventually be required. Other groups are handling some of these disseminators, such as the uvaImage disseminator. Others, like the metadata disseminator and default disseminator, have not been assigned to a group. The committee decided to develop working models for any missing disseminators required by texts until other groups have made final decisions on the desired sets of behaviors. Disseminators developed and/or reviewed by the committee include:

uvaDefault Disseminator

The committee made a first pass at defining the uvaDefault disseminator since it is a required disseminator on all objects. The idea behind the Default Disseminator is a disseminator that contains common behaviors that are desirable for every object. The current behaviors included in the Default Disseminator are:

  • getPreview – returns a view of the object in a preview context. There will be separate implementations (Behavior Mechanisms) for each type of object. In the case of texts, getPreview will return a bibliographic citation extracted from the descriptive metadata.
  • getDefaultView – returns the preferred default view of an object. There will be separate implementations for each type of object. In the case of texts, getDefaultView will return a table of contents.

Additional behaviors may need to be added to this disseminator in the future, but these two behaviors should provide a functional starting point for the Phase I content.
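The relationship between the two default behaviors and their per-type implementations can be sketched as follows. This is an illustrative Python analogy, not Fedora's Behavior Definition or Behavior Mechanism format. The class names and the metadata field names (author, title, date) are hypothetical.

```python
from abc import ABC, abstractmethod

class DefaultBehaviors(ABC):
    """Plays the role of the behavior definition: the two common behaviors."""

    @abstractmethod
    def get_preview(self):
        """Return a view of the object in a preview context."""

    @abstractmethod
    def get_default_view(self):
        """Return the preferred default view of the object."""

class TextDefaultMechanism(DefaultBehaviors):
    """Plays the role of the mechanism for text objects."""

    def __init__(self, desc_meta, toc):
        self.desc_meta = desc_meta  # descriptive metadata (hypothetical fields)
        self.toc = toc              # table of contents entries

    def get_preview(self):
        # For texts: a bibliographic citation drawn from descriptive metadata.
        m = self.desc_meta
        return f"{m['author']}. {m['title']}. {m['date']}."

    def get_default_view(self):
        # For texts: the table of contents.
        return list(self.toc)
```

Other object types, such as images, would supply their own class implementing the same two methods.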

uvaMeta Disseminator

The committee also made a first pass at defining the uvaMeta disseminator since it is a required disseminator on all objects. It may be desirable to merge the uvaMeta disseminator behaviors in with the uvaDefault disseminator behaviors in the future, but for now they are broken out into separate disseminators.

  • getDescMeta – returns the complete descriptive metadata record
  • getAdminMeta – returns the complete administrative metadata record
  • transDescMeta(format) – returns the complete descriptive metadata record transformed into the specified format (e.g., MARC, Dublin Core). The specific set of formats that can be specified is still under discussion.
  • transAdminMeta(format) – returns the complete administrative metadata record transformed into the specified format (e.g., MARC, Dublin Core). The specific set of formats that can be specified is still under discussion.

Initial discussions included a behavior that allowed extraction of one or more descriptive and/or administrative metadata elements. This behavior was removed to simplify the metadata object model and to allow for faster prototyping. The primary use of the uvaMeta disseminator will be to allow objects to disseminate their metadata to an external search engine for indexing. Additional behaviors may be necessary for the uvaMeta disseminator once we have a clearer picture of how metadata will be accessed and used within the repository. This minimal set of behaviors seems sufficient for the short term.
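One way to picture a behavior such as transDescMeta(format) is as a lookup from a format name to a transformation. The Python sketch below is a hedged illustration: the record fields, the "dc" format key, and the simplified output element names are assumptions, and the output is not a schema-valid Dublin Core record. Since the supported formats are still under discussion, an unsupported format simply raises an error here.

```python
import xml.etree.ElementTree as ET

def to_simple_dc(record):
    """Serialize a few fields into a simplified, Dublin Core-like XML string.

    Illustrative only: no namespaces, and the field list is hypothetical.
    """
    root = ET.Element("dc")
    for field in ("title", "creator", "date"):
        if field in record:
            ET.SubElement(root, field).text = record[field]
    return ET.tostring(root, encoding="unicode")

# Registry of supported formats; more could be added as they are agreed on.
TRANSFORMS = {"dc": to_simple_dc}

def trans_desc_meta(record, fmt):
    """Return the descriptive metadata record transformed into fmt."""
    try:
        transform = TRANSFORMS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return transform(record)
```

Keeping the formats in a registry means a new target format is added by registering one function, without changing the behavior's signature.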

uvaGenText Disseminator

The uvaGenText disseminator is a new disseminator extracted from the original text disseminator proposed by the Text Object Model Committee. The different types of texts vary enough that a single text disseminator would be problematic to design. The general text disseminator is a simplified version intended to apply to any type of text in the repository.

  • getPreview – returns a bibliographic citation that is extracted from the descriptive metadata using an XSL stylesheet. The stylesheet that performs the extraction needs to be written once metadata mappings for the citation are defined. This is a new behavior that does not currently exist as a standalone behavior in Phase I.
  • getTreeView – returns a table of contents view of the text by transforming the static table of contents XML datastream into HTML. This behavior currently exists as a dynamic behavior in Phase I, but it has problems with some documents that have minimal hierarchy, which this new behavior will address. Implementation will require generating the static datastream for the table of contents from the raw XML and writing an XSL stylesheet to perform the rendering. (e.g., Table of Contents for Vol VII of Lewis & Clark)
  • getChunk(element) – returns an XML fragment of the specified element. The non-Fedora version exists in Phase I; Fedora-ized version needs to be developed.
  • getChunks(expression) – returns an XML fragment of the specified XPat expression. The non-Fedora version exists in Phase I; Fedora-ized version needs to be developed.
  • getStaticView – returns a static HTML-encoded translation of the text. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
  • getDeliveryMaster – returns the original XML-encoded text. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
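The getTreeView behavior described above turns a static table of contents datastream into HTML. The report calls for an XSL stylesheet. The sketch below uses Python's ElementTree instead, purely to illustrate the transformation. The toc/item structure, the label and target attributes, and the fallback message for texts with no hierarchy are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def render_toc(node):
    """Render the <item> children of node as a nested HTML list.

    Returns an empty string when node has no <item> children.
    """
    items = node.findall("item")
    if not items:
        return ""
    parts = ["<ul>"]
    for item in items:
        label = escape(item.get("label", ""))
        target = escape(item.get("target", ""))
        # Recurse so that nested sections become nested lists.
        parts.append(f'<li><a href="#{target}">{label}</a>{render_toc(item)}</li>')
    parts.append("</ul>")
    return "".join(parts)

def get_tree_view(toc_xml):
    """Transform a static table-of-contents XML string into HTML."""
    root = ET.fromstring(toc_xml)
    # Assumed fallback for documents with no usable hierarchy.
    return render_toc(root) or "<p>No table of contents available.</p>"
```

Because the datastream is generated once and stored, a document with minimal hierarchy yields a short or empty list rather than a failure at request time.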

uvaBook Disseminator

The uvaBook disseminator is a specialized text disseminator for texts that are books and that have transcribed text and/or page images. Since texts can exist that have only page images or only transcribed text, exceptions will need to be handled for these cases. For example, if a text has no page images, invoking the getImagePageTurner behavior would need to return an informative message indicating that no page images are available for that particular text.

  • getDynamicView – returns a dynamic view of the text. Currently in Phase I this is a page displaying the table of contents with hyperlinks on each major section, which is the same behavior as the uvaGenText behavior getTreeView. The non-Fedora version exists in Phase I; Fedora-ized version needs to be developed. (e.g., Table of Contents for Vol VII of Lewis & Clark)
  • getTextPageTurner – returns a view of the transcribed text enabling the user to page through the transcription one page at a time or to jump to any specified page. The non-Fedora version partially exists in Phase I; Fedora-ized version needs to be developed. (e.g., Transcription Page Turner)
  • getImagePageTurner – returns a view of the page images enabling the user to page through the page images of the text one image at a time or to jump to any specified page. The non-Fedora version exists in Phase I; Fedora-ized version needs to be developed. (e.g., Page Image Turner)
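The exception handling described for books that lack page images or transcriptions might look like the Python sketch below. It is a rough illustration under stated assumptions: the BookObject class, the dictionary return shape, and the message wording are hypothetical. In line with the note above about texts with no page images, get_image_page_turner is taken to operate on page images.

```python
class BookObject:
    """Hypothetical stand-in for a book object with optional content."""

    def __init__(self, page_images=None, transcription_pages=None):
        self.page_images = page_images or []
        self.transcription_pages = transcription_pages or []

def turn_page(pages, page_number, kind):
    """Return the requested page plus previous/next page numbers.

    Returns an informative message instead when the content is missing
    or the page number is out of range. Pages are numbered from 1.
    """
    if not pages:
        return {"message": f"No {kind} are available for this text."}
    if not 1 <= page_number <= len(pages):
        return {"message": f"Page {page_number} is out of range (1-{len(pages)})."}
    return {
        "page": pages[page_number - 1],
        "previous": page_number - 1 if page_number > 1 else None,
        "next": page_number + 1 if page_number < len(pages) else None,
    }

def get_image_page_turner(book, page_number=1):
    return turn_page(book.page_images, page_number, "page images")

def get_text_page_turner(book, page_number=1):
    return turn_page(book.transcription_pages, page_number, "transcribed pages")
```

Returning a message rather than raising an error lets a delivery script show the user something meaningful for a transcription-only or image-only text.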

uvaTextSearch Disseminator

Behaviors for the uvaTextSearch disseminator have not yet been defined. The non-Fedora version exists in Phase I; Fedora-ized version needs to be developed once the necessary behaviors are defined. (e.g., Full Text Search)

The creation and updating of full text indexes should also be a behavior of the uvaTextSearch disseminator; however, the XPat software does not provide a web services based interface for the creation and updating of indexes. As a result, indexing will have to be an external process that happens outside of Fedora. This will complicate the ingest of texts, since each ingest will also need to trigger an external process to automatically update the indexes. Keeping the objects and the external text indexes in sync will also be a challenge, particularly as we begin developing aggregations (collections) of texts and want to do things like automatically updating an index each time a new object is added to a collection.
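One possible shape for keeping external indexes in step with ingests is to record what needs reindexing at ingest time and let a separate process work through that list. The Python sketch below is only an illustration of that idea. It does not invoke XPat or Fedora, and every name in it is hypothetical.

```python
def ingest(repository, pending_reindex, pid, text, collections=()):
    """Store an object and note which indexes are now out of date.

    repository      -- dict standing in for the repository
    pending_reindex -- set of ("object" | "collection", name) targets
    """
    repository[pid] = text
    pending_reindex.add(("object", pid))
    for name in collections:
        # A set means a collection is queued once, however many objects join it.
        pending_reindex.add(("collection", name))

def run_external_indexer(pending_reindex, build_index):
    """Rebuild every pending index by calling build_index(target).

    build_index is a caller-supplied function that would wrap the external
    indexing command. Returns the targets processed, in sorted order.
    """
    done = []
    for target in sorted(pending_reindex):
        build_index(target)
        done.append(target)
    pending_reindex.clear()
    return done
```

Queuing targets instead of indexing inside the ingest call keeps ingest fast and lets several additions to one collection share a single index rebuild.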

uvaImage Disseminator

Although images are not part of the Text Delivery Committee’s charge, bitonal and/or color page images are an integral part of text objects. To begin testing the text object models, we adopted the early image object models proposed by the Image Object Model Committee as a prototype. When the final image object models are decided upon, these disseminators may change.

uvaImage Disseminator (bitonal TIFFs)

  • getPreview – returns the thumbnail resolution of the image. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
  • getScreen – returns the screen-sized resolution of the image. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
  • getSizedImage(x,y) – returns the specified size resolution of the image. The non-Fedora version does NOT exist in Phase I for specific x, y pixel dimensions; Fedora-ized version will require rewriting the tif2gif.pl perl script or changing the definition of getSizedImage.
  • getDeliveryMaster – returns the bitonal TIFF master. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.

uvaImage Disseminator (color MRSIDs)

  • getPreview – returns the thumbnail resolution of the image. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
  • getScreen – returns the screen-sized resolution of the image. The non-Fedora version exists in Phase I; Fedora-ized version requires no changes.
  • getSizedImage(x,y) – returns the specified size resolution of the image. The non-Fedora version does NOT exist in Phase I for specific x, y pixel dimensions; Fedora-ized version will require rewriting the get_image.pl perl script or changing the definition of getSizedImage.
  • getDeliveryMaster – returns the TIFF/MRSID master or a page describing where the master is located. The non-Fedora version does NOT exist in Phase I; Fedora-ized version will require writing a new script to handle this behavior.
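Since the definition of getSizedImage(x,y) is still open, the Python sketch below shows just one possible interpretation: treat x and y as a bounding box and scale the image to fit inside it while preserving its aspect ratio. This interpretation and the function name are assumptions made for illustration. The sketch computes target dimensions only and does not touch the tif2gif.pl or get_image.pl scripts.

```python
def fit_within(width, height, max_x, max_y):
    """Return (new_width, new_height) that fit inside max_x by max_y.

    The aspect ratio of the source image is preserved, and each
    dimension is at least one pixel.
    """
    if min(width, height, max_x, max_y) <= 0:
        raise ValueError("all dimensions must be positive")
    # The tighter of the two constraints decides the scale factor.
    scale = min(max_x / width, max_y / height)
    return max(1, round(width * scale)), max(1, round(height * scale))

# Usage: a 2000 x 3000 page image requested at 400 x 400.
size = fit_within(2000, 3000, 400, 400)
```

An alternative definition would stretch the image to exactly x by y, which is simpler to implement but distorts page images.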

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, June 02, 2008
© The Rector and Visitors of the University of Virginia