Early American Fiction Project Workflow
NOTE: These procedures are taken from training guidelines used by EAF digitization staff during the project. These are not current imaging workflows or standards.
Book Handling
Prior to scanning, selected volumes were pulled from the stacks, inspected, and relocated to the digital lab. The books remained in the lab until they had been digitized, a TEI header had been created, and the JPEG derivatives had been checked.
Workflow Database
Each book was given a record in a FileMaker Pro database.
With the book in hand, the EAF staff recorded information into the database:
- bibliographical information
- EAF project number
- call number
- notes on form and condition
- digitization dates
- camera operators
When imaging of a volume was complete, the FileMaker Pro record was converted by a filter into a TEI header.
Parsing, TIFF header integration, and quality assurance were done by the Electronic Text Center staff. AACR-2 compliance and MARC record generation were coordinated by the UVa Special Collections Cataloging Department.
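The original filter is not documented here. As a minimal sketch of the idea, assuming the database record can be exported as a simple set of named fields, the conversion might look like the following. The function name, the field names, the choice of TEI elements, and the sample values are all hypothetical placeholders, not the project's actual export.

```python
import xml.etree.ElementTree as ET


def record_to_tei_header(record):
    """Build a minimal <teiHeader> string from a dict of database fields.

    The keys used here are assumptions for illustration only.
    """
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # Bibliographical information goes in the title statement.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = record["title"]
    ET.SubElement(title_stmt, "author").text = record["author"]

    # The project number identifies the electronic edition.
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "idno", type="EAF").text = record["eaf_number"]

    # Call number and condition notes describe the physical source.
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    bibl = ET.SubElement(source_desc, "bibl")
    ET.SubElement(bibl, "idno", type="callNo").text = record["call_number"]
    ET.SubElement(bibl, "note").text = record["condition_notes"]

    return ET.tostring(header, encoding="unicode")


# Placeholder record; none of these values come from the project database.
sample_record = {
    "title": "Example Title",
    "author": "Surname, Forename",
    "eaf_number": "123",
    "call_number": "EXAMPLE-CALL-NO",
    "condition_notes": "Example note on form and condition.",
}
print(record_to_tei_header(sample_record))
```

Keeping the mapping in one function means a change to the database fields only needs to be reflected in one place.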
Digital Image Creation
File-naming convention: [xxx-001]
- For images, the suffix is always .tif, supplied by Photoshop (.jpg after the conversion); for texts, it is .xml, supplied by our web forms.
- The first three numbers are a project number assigned to the book, followed by a dash.
- In the event that a volume has more than 1000 pages, the next four slots are free for sequential digital image numbers. This means that the image number does not reflect the pagination scheme, but it avoids the need to deal with unnumbered pages, preliminary numbering, repeated numbers due to printer error, etc.
- The eighth character remains blank as a safeguard against a missed image that needs to be numbered after the fact.
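One possible reading of this convention can be sketched as follows. The helper name is hypothetical, and the zero-padding to three digits (widening to four for long volumes) is an assumption based on the [xxx-001] pattern; the sketch does not model the reserved eighth character.

```python
def image_filename(project_number, sequence, extension="tif"):
    """Return a name such as '123-001.tif'.

    The sequence number is the order of capture, not the printed page
    number. It is padded to at least three digits and grows to four
    digits for volumes with 1000 or more images.
    """
    return f"{project_number:03d}-{sequence:03d}.{extension}"


# Printed page labels can be missing, roman, or repeated; the file
# names below ignore them and simply count images in order.
page_labels = ["[unnumbered]", "i", "ii", "1", "2", "2"]
for sequence, label in enumerate(page_labels, start=1):
    print(image_filename(123, sequence), "<-", label)
```

Because the names depend only on capture order, a repeated or missing printed page number never produces a duplicate or ambiguous file name.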
See the EAF Digital Image Scanning Procedures for a detailed description of camera operation, software settings, imaging, batching, and database tracking.
Conversion to JPEG
Batch-processing scripts were run in Photoshop to produce a large JPEG file, which was used to generate GIF thumbnails and two other levels of JPEG files.
From the large JPEG version:
- GIF: mogrify -format gif -interlace plane -geometry 5% *.jpg
- MEDIUM JPEG: mogrify -geometry 75% -quality 75% *.jpg
- SMALL JPEG: mogrify -geometry 50% -quality 75% *.jpg
The aim is to keep the JPEGs a known and predictable percentage of the original, so that they maintain relative size differences (e.g., an image of a small book looks smaller than an image of a large one).
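The effect of fixed percentages can be illustrated with a little arithmetic. This is a sketch only: the function name and the pixel dimensions are made up for the example, and the exact rounding ImageMagick applies may differ by a pixel.

```python
# Scale factors taken from the mogrify commands above, as percentages
# of the large JPEG.
SCALES = {"medium_jpeg": 75, "small_jpeg": 50, "gif_thumbnail": 5}


def derivative_sizes(width, height):
    """Return approximate (width, height) for each derivative level."""
    return {
        name: (round(width * pct / 100), round(height * pct / 100))
        for name, pct in SCALES.items()
    }


# Two hypothetical large JPEGs: a small book and a large book.
small_book = derivative_sizes(1200, 1800)
large_book = derivative_sizes(2000, 3000)

for level in SCALES:
    print(level, small_book[level], large_book[level])
```

At every level the small book remains 60% of the width of the large one, which is the relative size difference the fixed percentages are meant to preserve. Scaling every image to a fixed pixel width would lose that relationship.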
Text Processing
JPG files were uploaded to the vendor's FTP site for processing according to a "Data Conversion Design Document." The goal for the vendor was to reproduce the source in every aspect, including capturing line breaks and page breaks at exactly the same locations as in the source.
Every <divx> has a <head> in the Chadwyck-Healey (C-H) scheme, as in TEI, but the head is numbered along with the <divx>: a <div0> takes a <comhd0>, a <div1> takes a <comhd1>, etc. At present, we think we will use the n= attribute to record this information: <head n="comhd1">. This will be easy to change to <comhd1> for C-H purposes.
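The change from <head n="comhd1"> to <comhd1> could be automated along these lines. This is a minimal sketch, assuming the text is available as well-formed XML; the function name and the sample fragment are illustrative and are not the project's actual conversion script.

```python
import xml.etree.ElementTree as ET


def heads_to_comhd(root):
    """Rename every <head n="comhdN"> to <comhdN>, dropping the n attribute."""
    # Collect the matches first so renaming does not disturb iteration.
    for head in list(root.iter("head")):
        label = head.attrib.get("n", "")
        if label.startswith("comhd"):
            head.tag = label
            del head.attrib["n"]
    return root


# Illustrative fragment, not taken from an EAF text.
sample = '<div1><head n="comhd1">Chapter I</head><p>Text</p></div1>'
root = heads_to_comhd(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

Heads without a comhd label in n= are left untouched, so the same pass is safe to run over a whole document.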
The <text> tag in TEI cannot take a <head> itself, but its C-H equivalent needs a <head> and an <attrib> field. One solution is to add a <div1 type="chad"> at the top of every <front> before the real <front> matter, and move it up before the <front> for the C-H format. Its <head> -- <head n="comhd0"> -- contains a <bibl> containing the full, inverted author name (<author>) and the volume short title (<title>), including the date of publication in parentheses.
We still need to decide the precise form of the tags in the <text> that correspond to the C-H <attribs> group: <attauth>, <attgend>, <attgenre>, <attdate>, and <attbal> for full author name, author sex, genre of work, date of publication, and Bibliography of American Literature number. A <ref type="attribs"> containing a <bibl> is possible as a container for this information, within the <div1 type="chad">.
The end result was a parsed TEI document that could be automatically re-shaped to meet Chadwyck-Healey encoding standards.
Guide for image description: <figDesc>
Book illustrations and other figurative content will be described as to their content, for searching purposes, using the TEI <figure> tag.
Procedures for parsing, indexing, and testing completed texts when returned from the keyboarders
This process followed Etext Center practices of the time. For this project in particular, the process checked for unintentionally minimized tags during parsing. The TEI.DTD allows minimization, but this project did not allow the practice.
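The check itself was done with the parser of the day and is not described here. As a rough illustration of what "minimized tags" means, the sketch below scans a text for two common SGML shortcuts: empty tags such as </> and omitted end tags. The function name and the default list of empty elements are assumptions, and the scan does not understand comments, CDATA, or a DTD, so it is no substitute for a real parser.

```python
import re

# Matches <name ...>, </name>, and the empty forms <> and </>.
TAG = re.compile(r"<(/?)([A-Za-z][\w.:-]*)?([^<>]*)>")


def find_minimized_tags(text, empty_elements=("pb", "lb")):
    """Return (line, message) pairs for tags that look minimized."""
    problems = []
    open_tags = []  # stack of (name, line) for elements still open
    for m in TAG.finditer(text):
        is_end, name, rest = m.group(1) == "/", m.group(2), m.group(3)
        line = text.count("\n", 0, m.start()) + 1
        if name is None:
            if rest.strip() == "":
                problems.append((line, f"empty tag {m.group(0)}"))
            continue  # otherwise a declaration or similar; skip it
        if not is_end:
            # Empty elements and self-closing tags never need an end tag.
            if name not in empty_elements and not rest.endswith("/"):
                open_tags.append((name, line))
            continue
        if not any(n == name for n, _ in open_tags):
            problems.append((line, f"unexpected end tag </{name}>"))
            continue
        # Anything opened after the matching start tag was never closed.
        while open_tags:
            open_name, open_line = open_tags.pop()
            if open_name == name:
                break
            problems.append((open_line, f"omitted end tag for <{open_name}>"))
    for open_name, open_line in open_tags:
        problems.append((open_line, f"omitted end tag for <{open_name}>"))
    return problems


sample = "<div1>\n<p>First paragraph\n<p>Second paragraph</p>\n</div1>"
for line, message in find_minimized_tags(sample):
    print(f"line {line}: {message}")
```

In the sample, the first <p> is never closed, which an SGML parser would accept silently where the DTD permits omitted end tags; flagging it by line number makes the slip easy to correct by hand.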