Digital Formats Working Group
Report
April 2002
Purpose:
1. For all media types, what formats are saved as master copies and what formats
are delivered?
2. How much standardization of format will we be doing?
3. What are the minimum compatibility standards that new collections must meet
before we purchase them?
The Group took as its point of departure the report created by the Digital Content
Production and Standards Planning Team as part of our Library of Tomorrow
(LoFT) planning effort. After reviewing that earlier report, the Digital
Library Working Group made recommendations that update and augment the LoFT
document. We have extended the scope to include a consideration of geographic
and statistical information, noticeably absent from the first report.
General Recommendations:
The Digital Formats Group offers three general recommendations regarding the
determination of technical specifications for our digital library holdings:
1. The Library's efforts in building our central repository should,
for now, focus on collection development, not preservation or archiving.
Long-term preservation of digital master files requires a strategy of
identification, storage and migration to new media, as well as policies
for long-term use and access. We do not yet have such a comprehensive
plan for the preservation of Library resources, either digital or analog.
The determination of technical specifications for preservation is pointless
until the Library creates and is ready to implement such a plan. The
Digital Library Working Group does not attempt to recommend preservation/archiving
specifications.
2. The type, quality, rights status and anticipated use of the data should
affect the determination of technical specifications for the capture of
digital material. This Group recommends the use of baseline measures
for the creation of content, but also recommends that the final determination
of technical specifications should rest on an assessment of the above
considerations. Production and content specialists should jointly review
baseline standards on a yearly basis.
3. Decisions about technical specifications for digitization of resources of
various media types should be made in consultation with media or content
experts. The need for this is well demonstrated by this example: In the
digitization queue are two images of maps. One contains very little graphical
information: the map consists of simple lines tracing political boundaries.
The other is a highly detailed map, rich with geographic information of
various kinds. After a brief consultation with the relevant media expert,
one that includes a review of the nature of the graphical information and an
assessment of the likely uses of the images, the production specialist may
well decide to scan the first image at our baseline master copy
specifications and the second image at a much higher resolution to allow
for telescoping and the extraction of high resolution details. The first
image might be 25 MB and the second might be 200 MB.
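The size gap in an example like this follows from simple arithmetic: an uncompressed bitmap needs roughly width x height x bytes per pixel. The sketch below is illustrative only. The 8 x 10 inch original is a hypothetical dimension (the report gives none), and it assumes that scanning resolution is expressed in pixels per inch and that color depth is 24 bits.

```python
def uncompressed_size_mb(width_in, height_in, ppi, bit_depth=24):
    """Approximate uncompressed image size in megabytes (10**6 bytes)."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bit_depth / 8 / 1_000_000


# Hypothetical 8 x 10 inch original scanned in 24-bit color.
baseline = uncompressed_size_mb(8, 10, 300)  # 2400 x 3000 pixels
detailed = uncompressed_size_mb(8, 10, 600)  # 4800 x 6000 pixels
print(f"{baseline:.1f} MB at 300 ppi, {detailed:.1f} MB at 600 ppi")
```

Doubling the resolution quadruples the pixel count, which is why a decision to scan one map in greater detail has such a large effect on storage.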
QUESTIONS:
1. FOR ALL MEDIA TYPES, WHAT
FORMATS ARE SAVED AS MASTER COPIES AND WHAT FORMATS ARE DELIVERED?
The following tables provide
answers to those questions.
Definitions:
Master copy: The digital
copy from which other copies may be derived.
Service/Deliverable quality:
representative and output qualities are reduced; emphasis is on efficient
delivery to users of digital information. Multiple files may be created
at this level for multiple purposes or user communities. When Service
and Deliverable qualities are outlined separately, the distinction resides
in the assumption that Service versions are editable by the user, but
Deliverable versions are not.
Preview/Thumbnail quality:
Highly reduced in quality and size or duration, functions as an identifier;
has little or no output or editing value.
IMAGES - BITMAP
Purpose | Type | Format | Compression | Resolution - bit depth | Resolution - no. of pixels | Enhancements? | Comments
Master copy | Printed text | TIFF | Lossless (ITU G4 or CCITT4?) | Bitonal | 300-600 | Cropping, rotating, sharpening, descreening, deskewing, despeckling |
Master copy | Graphical content (photography, painting, mss., etc.) | TIFF | Lossless (LZW) | 24 bit | 300-600 | Cropping, rotating | Include color reference whenever appropriate and feasible
Service/Deliverable | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Lossless or lossy, appropriate to delivery needs | 24 or 8-bit color; grayscale | Current screen dimensions or use-based | Cropping, rotating, sharpening, descreening, deskewing, despeckling, brightness and tonal adjustments | Include color reference whenever appropriate and feasible
Thumbnail | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Lossless or lossy, appropriate to delivery needs | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all |
IMAGES - VECTOR
Purpose | Format | Compression | Resolution - bit depth | Resolution - no. of pixels | Enhancements? | Comments
Master copy | EPS, SVG, proprietary formats, e.g. Adobe Illustrator | NA | NA | NA | NA | Include color reference whenever appropriate and feasible
Service/Deliverable | EPS, SVG, SWF, GIF, JPEG, PNG | NA | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all | Vector images may be converted to raster formats for delivery.
Thumbnail | GIF, JPEG, PNG | | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all |
TEXT
Purpose | Format | Comments
Master copy | Plain text; XML encoded with accompanying DTD and other dependent files |
Service | Attributes: maintains most, if not all, structural and formatting characteristics as well as "searchability" of the archival file. File format(s): XML, RTF, PDF, proprietary word processor formats | Open standards are preferred. PDF in this context should have embedded, searchable text.
Deliverable | Attributes: maintains few, if any, structural and/or formatting characteristics. Maintains "searchability" of text, but not in conjunction with structure. File format(s): HTML, PDF, unstructured ASCII, e.g. OCR output |
Thumbnail | NA |
AUDIO
Purpose | Format | Compression | Resolution / sample rate | Comments
Master copy | AIFF, WAV, SND | | 44.1 kHz, 16 bits per sample | Maintain channel pattern of original, e.g. stereo, mono, multi-channel.
Service | AIFF, WAV, SND | | 11 or 22.05 kHz, 8 or 16 bits per sample | Maintain channel pattern where practical.
Deliverable | QuickTime, MPEG | Apply compression appropriate for delivery needs of target community. | |
Preview | QuickTime, MPEG | | | Reduce duration to create a representative sample: a clip.
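The storage consequences of the audio specifications can be estimated directly, since uncompressed audio size is sample rate x bytes per sample x channels x duration. The following is a rough sketch that assumes uncompressed PCM encoding (the AIFF and WAV containers can also hold other encodings); the function name is illustrative.

```python
def pcm_bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Size in bytes of one minute of uncompressed PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


master = pcm_bytes_per_minute(44_100, 16, 2)  # stereo at master-copy quality
service = pcm_bytes_per_minute(22_050, 8, 1)  # mono at a reduced service quality
print(master, service)  # 10584000 1323000
```

On these assumptions a stereo master copy runs to about 10.6 MB per minute, eight times the size of a mono 22.05 kHz, 8-bit service copy.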
VIDEO
Purpose | Type | Format | Compression | Comments
Master copy | User-created | NTSC DV or DV-Cam tape | None | Tapes should be stored in environmentally stable location, e.g. vault.
Master copy | Purchased content | DVD | DVD |
Service | | Select as appropriate for task. | Appropriate to format | Service, i.e. editable, versions produced as required by dubbing; implies change of storage medium and/or format. Very large file sizes; not network distributable.
Deliverable | | QuickTime, MPEG | Appropriate to format and use | Only highly compressed forms network distributable.
Preview | | QuickTime, MPEG | Appropriate to format and use | Reduce duration to create a representative sample: a clip.
Thumbnail | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Any, appropriate to content and use. | Representative frame: indication of content.
STATISTICAL/NUMERIC AND GEOSPATIAL DATA:
Definitions of Terms
Quality Level | Characteristics
Master | Preservation storage, ASCII format where possible. Data will probably need further processing before use. Ties to proprietary software formats are minimal but exist.
Service | Storage formats for near immediate use by patron or DL disseminator. Probably a proprietary software format.
Deliverable | Ready for immediate use or use with minimal processing by end-user. Probably a proprietary software format.
Preview | Web-ready low quality image or random sample of material
STATISTICAL/NUMERIC DATA
Purpose | Format | Comments
Master copy | ASCII columnar format; SPSS, STATA, SAS program code and/or machine-readable text-based documentation to define data for analysis | ASCII delimited preferred. DDI standard metadata is the preferred documentation format. Follows the ICPSR standard for data archiving and preservation.
Service | Data stored in some statistical package format (SAS, SPSS, STATA) or in queryable SQL database system | Storage for access, retrieval or extraction.
Deliverable | SAS, STATA, SPSS, Excel or delimited ASCII format with data map or variable list. | Excel not advised for very large files. All users get documentation built from DDI records.
Preview | Screen dump of 5% of records, no more than 100 | Practice not currently in place.
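The preview rule for statistical data (5% of records, no more than 100) is easy to state as code. The sketch below is one possible reading, not an established practice: it rounds the 5% figure up, and it draws a random sample because the definitions above describe a preview as a "random sample of material". The function names and the optional seed are hypothetical.

```python
import random


def preview_size(record_count, percent=5, cap=100):
    """Number of records to show: `percent` of the file, rounded up, at most `cap`."""
    wanted = (record_count * percent + 99) // 100  # integer ceiling
    return min(wanted, cap)


def preview_rows(rows, seed=None):
    """Return a random preview sample, kept in original file order."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), preview_size(len(rows))))
    return [rows[i] for i in picked]


print(preview_size(400), preview_size(10_000))  # 20 100
```

Files of more than 2,000 records hit the cap, so previews of large data sets stay a fixed, web-friendly size.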
SPATIAL DATA - RASTER
Purpose | Format | Comments
Master copy | Photography or remote sensing imagery: non-compressed TIFF + world file or GeoTIFF (preferred), BIL, IMG (Erdas Imagine) | Also applicable for geo-referenced maps. GeoTIFF retains geographic information in TIFF header; world file does same as separate file.
Master copy | Non-image raster data: ASCII-based storage and exchange format (Arc Exchange .e00; ArcGenerate .gen; Spatial Data Transfer Standard (SDTS)) | SDTS is federal standard, but not widely adopted in commercial industry or government; format is cumbersome for further processing.
Service | Photography or remote sensing imagery: GeoTIFF, BIL, IMG, SID + world file |
Service | Non-image raster data: Arc Exchange; GeoTIFF; native data formats (.cdo); native software data models (ArcGRID) | Users will almost always need to process stored data. TIFFs can store pixel value as color value and be converted in GIS software; native data formats are common in federal data. GRID data model is directory, not file-based but could be stored for access purposes.
Deliverable | Photography or remote sensing imagery: GeoTIFF, BIL, IMG, SID + world file, JPG + world file |
Deliverable | Non-image raster data: Arc Exchange, native formats or models, GeoTIFF |
Preview | JPG, GIF, or SID | Sizes may need to be slightly larger than those outlined for other types of images
SPATIAL DATA - VECTOR
Purpose | Format | Comments
Master copy | ASCII-based exchange format such as SDTS, Arc Exchange (.e00), ArcGenerate (.gen), or delimited text. | Note that two of these are tied to proprietary software formats and are not available for all data models. SDTS is available but rarely used in data distribution.
Service | Industry standard formats such as ESRI shape (.shp) or ArcInfo Coverage model, or CAD format such as Microstation (.dgn) or AutoCAD (.dwg). Possible storage in SQL-based system through proprietary middleware (ArcSDE, Oracle Spatial) | Note that ESRI's shapefile model consists of several related files. The ArcInfo Coverage model is directory-based. RDBMS models are still relatively new.
Deliverable | Industry standard formats such as ESRI shape, Arc Exchange, or CAD formats. |
Preview | GIF, JPG or other raster image format. | Preview graphics need to be large enough to convey the general look of the data.
2. HOW MUCH STANDARDIZATION
OF FORMAT WILL WE BE DOING?
Our central repository should
allow for the integration of and access to collections across media types.
We can assume, therefore, that standardization of format will be an important
goal in the development of a coherent collection. However, we cannot make
blanket statements about technical requirements or criteria for reformatting.
These decisions will have to be made on a case by case basis after evaluating
the relative worth of the resources on both content and technical grounds;
selectors and technical experts should be involved in this evaluation.
The following questions might help in making reformatting decisions:
Content issues:
1. How valuable is this material to our user community when fully integrated
with other resources in the central repository? How valuable is it as
a standalone product?
2. Do we own it? What copyrights
do we hold?
3. Is it good quality (measurable
by various criteria relating to content and technical quality)?
4. Is it unique (or even distinctive)?
5. Is there a straightforward
conversion route, and will the necessary programming be useful to other
projects?
6. If the original format
is proprietary, can we envision the data being needed in other formats
in the future? Is the proprietary format likely to endure?
7. Will there be significant
data loss in the conversion?
8. IS IT WORTH THE TIME
AND EFFORT?
Reformatting should take place
if the answer to question no. 8 is YES.
Text:
a) The "Decision Tree for Text," produced as part of the Digital
Content Selection Team Report, should be reviewed when making text
reformatting decisions (especially Q11-Q14) (Appendix A).
b) Files should be converted to XML with a standard public DTD whenever
feasible.
c) Since PDF migration is difficult and time-consuming, PDF conversion should
be attempted only when the need for full functionality within the Digital
Library is pressing.
Images:
a) If the images were created or purchased in a lossy, open standard format,
e.g. JPEG, that format should be stored with no further edits as the master
copy.
b) If the images were created or purchased in a proprietary format, regardless
of the nature of the compression, the images should be immediately converted
to TIFF and saved, unedited, as the master copy. E.g. PhotoCDs should
be converted and saved as unedited TIFFs.
Statistical Data:
a) New datasets, purchased in a proprietary format, should be converted to
ASCII columnar to ensure maximum flexibility and backward compatibility.
Spatial Data:
a) A single vendor, ESRI, rules the industry and has created a de facto standard
from its own proprietary format. However, long term viability of this
material depends on its availability in an open standard. DLPS should begin
experimenting with the conversion of spatial data to ASCII, working closely
with Geostat staff to determine priorities and to solve technical problems.
Audio and Video:
a) The technical landscape for audio and video is unsettled: every year
sees the emergence of new formats, codecs and tools. Conversion of highly
compressed files is neither desirable nor, in some cases, possible, so
the reformatting of most motion media materials should be approached with
caution.
b) We recommend that the technical specifications outlined in the audio and
video tables, above, be considered guidelines only, with the understanding
that we will need to regularly evaluate new solutions and options.
c) We further recommend that production for these media types should remain
in the Digital Media Lab for the near future so that the technical aspects
of production can be closely monitored by motion media experts. This policy
should be frequently revisited, at least on a yearly basis.
3. WHAT ARE THE MINIMUM
COMPATIBILITY STANDARDS THAT NEW COLLECTIONS MUST MEET BEFORE WE PURCHASE
THEM?
Again there is no simple answer
to this question. The relative worth of the materials must be evaluated
on both content and technical grounds as described above. The "Decision
Tree for Purchase of Digital Media" (Appendix B) and the "Decision
Tree for Text" (Appendix A), parts of the Digital Content Selection
Team Report, should be used as assessment tools in this endeavor.
Technical standards should be
considered recommendations, NOT requirements. Some valuable materials simply
will not be available in the formats that fit best within our DL system.
However, they may well be valuable to our user community and so should
be included in our central repository despite their technical inferiority.
Selectors should, of course, be closely involved in this decision-making.
Text:
Text generally comes in the following
forms:
a) PDF (electronic dissertations);
b) page images (e.g. a form of "digital microfilm" such as EEBO,
or documentation of texts from Special Collections);
c) page images with hidden, uncorrected, but nonetheless searchable ASCII
text (JSTOR, MOA);
d) full-text products without page images, which are unnecessary given the
accuracy of the text transcription (OED);
e) full-text products as in d) but with the addition of page images as well
(Early American Fiction; HarpWeek).
Options d) and e) are appropriate
for a central repository that is XML-based and fully searchable. However,
searchability is not always the desired outcome of text digitization. The
interests of Special Collections in object documentation argue for the
viability of page images without searchable ASCII. In the case of their
collections, Special Collections librarians should help assess, on a case
by case basis, the relative value of making searchable any specific text.
In the case of purchased or otherwise
acquired collections, options b) and c) should be considered minimally acceptable
if the resource is deemed valuable enough. We should attempt, whenever
feasible, to bring these resources up to standards.
PDF stands as the only clear
exception to this principle of inclusion. The obstacles to reformatting
PDF are so formidable that we should avoid, whenever possible, the purchase
of PDF-based collections. If we must purchase a PDF collection, we should
attempt to obtain the page images (e.g. TIFFs) from which the PDF files
were originally created.
Images:
Images should be evaluated on
the following formal characteristics: file format (proprietary or open
standard), bit depth (full color or indexed color/grayscale); pixel dimension;
compression (lossless or lossy). The ideal combination is open standard,
full color, many pixels, lossless compression.
However, other combinations should
be considered minimally acceptable if the content is deemed valuable enough
and if the formats can be converted without significant loss to the desirable
open standard. We should avoid purchasing or acquiring images that don't
successfully convert to a non-proprietary standard.
Statistical Data:
Because statistical data depends
for its usefulness on the analytical features of specific software packages,
the minimal acceptable standard should be that the data is exportable to
ASCII columnar format. This will ensure that the data can be manipulated
by different packages over time.
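A columnar ASCII file carries no structure of its own: the column positions come from accompanying documentation or program code. The sketch below shows, under stated assumptions, how such a file can be read once that data map is known. The variable names, column positions and sample record are all hypothetical, and positions are taken to be 1-based and inclusive.

```python
# Hypothetical data map: variable name -> (first column, last column),
# 1-based and inclusive. A real map would come from the data set's documentation.
DATA_MAP = {
    "respondent_id": (1, 4),
    "age": (5, 6),
    "state_code": (7, 8),
}


def parse_record(line, data_map=DATA_MAP):
    """Slice one fixed-width record into a dict of stripped string values."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in data_map.items()
    }


print(parse_record("00013451"))
# {'respondent_id': '0001', 'age': '34', 'state_code': '51'}
```

Because the parsing depends only on the text file and its data map, the same records remain readable whichever statistical package is in use.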
Spatial Data:
The ESRI suite of formats is
the proprietary standard in this area. New spatial data should be convertible
to a useful delivery format.
Audio and Video:
It is a sad fact that today's
wisdom is tomorrow's folly in the world of digital dynamic media.
We should definitely attempt to collect materials in standard, non-proprietary
formats, for which players are readily available across platforms; however,
we also need to realize that redigitization of motion media materials will
occasionally be inevitable. We should strenuously avoid creating or
purchasing motion media materials that depend on the tools or players of
a single operating system.
APPENDIX A
Decision Tree for Text
A decision tree for building
or buying a trade publication or journal.
Q1: Does this item meet our
general selection criteria?
If yes -- Q4
If no -->
Q2: If no -- has it
nonetheless been requested by a faculty, staff, or student member?
If no -- and no other special
case -- end.
If yes -->
Q3: If yes -- does
the combination of need plus price make it an item we can afford to consider
for purchase or local creation?
If no -- end.
If yes -->
Q4: Can we afford it and
are we willing to buy it or to build it?
If yes -- Q5
If no -- end
Q5: What forms is it available
in?
1) print only
2a) print plus 2b) electronic
3) electronic only
If (1) purchase -- end.
If (2) choose [a] or [b] or both
If (2b) or (3) -- electronic
form --:
Questions for consideration
of electronic content.
Q6: Is it offered as something
we can take delivery of and use/aggregate locally, or is it a remote service
only? That is, do we have a local/remote choice (this could also be
expressed as an own/rent choice)?
Q7: if desirable to have locally
-- for aggregation with other products and use through a common interface
-- will the vendor entertain such an arrangement on request, or is it already
offered in this fashion?
If yes, Q11
If no -->
Q8: if remote service only,
do we pay by the year (subscription service) or do we buy it outright even
though we do not ever house the product locally (e.g. EEBO, Netlibrary,
etc.)?
Q9: if we buy a remote service
outright, is there any other charge (an annual "maintenance" fee?)
and what happens if the vendor goes out of business or cancels the product
after a year or two? Is the data in escrow with a third party data vault,
for example?
Q10: Buy outright despite
lack of guarantee that it will be available over the long term?
If no, repeat Q7, then ask for
data escrow guarantee, and then end.
If yes, purchase and end.
Q11: Standard or non-standard
data?
If we are building or buying
content for local use, is it long-term (standardized or non-proprietary)
content or is it in a proprietary data format (a Folio database, for example)?
That is, is it a standalone product -- on a CD perhaps with a proprietary
data format and custom-made search/display software -- or will it "aggregate"
-- will it be able to work interactively as part of a larger collection?
[There is a need to decide if
we consider PDF as a robust non-standard or as a standard that can be migrated
by us and therefore that will survive as part of a permanent library]
Q12: If proprietary,
non-standard content, is it nonetheless worth having for a short number
of years (for support of a particular research or teaching need) until it
ceases to work? Example: customised software on CD products in particular
often dies when one updates an operating system -- CP/M to MS-DOS; Win 3.1
to Win 95/98.
Q13: If proprietary, non-standard
content, is it worth getting the publisher's permission to re-format it
at our own cost into something more permanent?
Q14: If standardized data,
what do we prefer in each data type?
TEXT: STRUCTURE
Important questions to ask:
Is it encoded in XML or SGML
markup according to a known DTD (e.g. EAD, TEI)?
If a publisher-specific DTD,
can we see the range and type of tags ahead of the purchase?
If SGML, are
tags left unclosed (minimization)? This is highly undesirable.
TEXT: TRANSCRIPTIONAL ACCURACY
Most important questions to ask:
What transcriptional error
rate should we expect in the text? Does the publisher guarantee a minimum
average error rate?
Does their error count include
layout and typography as errors?
How has the text been created
-- OCR or keyboarding?
As a guide, uncorrected or
lightly corrected OCR is often quoted as 99.9% (1 error in 1,000)
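Accuracy percentages and "one error in N" figures are two ways of stating the same rate, and it helps to convert between them when comparing vendor claims. The following is a minimal sketch, assuming the rate is counted per character; the function names are illustrative only.

```python
def one_error_in(accuracy_percent):
    """Characters per error implied by an accuracy percentage below 100."""
    if not 0 <= accuracy_percent < 100:
        raise ValueError("accuracy must be at least 0 and less than 100 percent")
    return round(100 / (100 - accuracy_percent))


def accuracy_percent(one_in):
    """Accuracy percentage implied by one error in `one_in` characters."""
    return 100 * (1 - 1 / one_in)


print(one_error_in(99.9))  # 1000
print(one_error_in(99.8))  # 500
```

At one error per 1,000 characters, a page of a few thousand characters would still be expected to contain several errors, which is why a guaranteed minimum rate matters.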
TEXT: FORM
Important questions to ask:
Is it one of the following:
14a) a page-image collection
-- a form of "digital microfilm" -- EEBO
14b) a page image collection
with hidden, uncorrected, but nonetheless searchable ASCII text -- JSTOR,
MOA
14c) a full-text product without
page images, which are unnecessary given the accuracy of the text transcription --
OED
14d) a full-text product as
in 14c) but with the addition of page images as well -- Early American
Fiction; HarpWeek
APPENDIX B
APPENDIX C
Data Formats Working Group
Recommendations
for Statistical and Spatial Data Collection and Storage
Mike Furlough, November 2001
Overview of Statistical and
Spatial Data Formats
Characteristics of Use
Users of electronic texts or
images of objects are usually well served by delivery through a web-based
medium. While in some cases users may wish to transfer these objects to
another medium or incorporate them into other digital objects, most of the
tools for analysis of these objects are in the user's head, or can
be fairly readily delivered via the web.
Delivery of spatial and statistical
data sets, however, is somewhat more complicated. In many cases a user
will be satisfied with a table of numbers reporting statistics, an image
of an aerial photograph, or basic map. But most frequently these data are
retrieved for further analysis in a proprietary software tool. In this
tool users may resample, subset, edit, recode, recalculate or otherwise
manipulate the data to answer their research questions.
It has been increasingly possible
to perform these analyses on the web for statistical data sets, and, more
recently, spatial data. However, it is still not possible, or even desirable,
to provide a complete range of even the most basic analysis functions for
statistical or spatial data on the web. This means that for digital library
production, acquisition, dissemination and archiving we must rely more on
proprietary software formats and industry driven standards than we might
ordinarily choose to do.
Data Models
STATISTICAL DATA
Data models for statistical data
are relatively simple, consisting of tabular data stored in rows and columns,
and sharing most of the characteristics familiar from relational database
models. However, the primary tools for analysis are not RDBMS software
tools but SAS, SPSS or STATA, which typically handle only one file and one
table at a time. Statistical data sets are likely to include multiple files
describing related data; these are analogous to the multiple tables in a database
storing related data. Usually one row in a table corresponds to one record
or point of observation. However, older data from the punch card and mainframe
days may contain multiple observations per row and require special processing
in contemporary tools.
SPATIAL DATA
Spatial data is frequently multi-format
and industry driven, and its data models are still evolving. It draws on relational
data models to link the graphical representation of geographic features
with their geographic coordinates and a host of attribute data stored in
tabular form. Typically a single spatial data "object" in the
repository will consist of multiple files, many of which are interpretable
or knowable only by the proprietary software.
The simplest spatial data are
raster models: digital geo-referenced images of the earth's surface
or paper maps. Coordinates are stored in the header of the image, as in
the GeoTIFF variation of TIFFs, or as a separate "world file" sharing
the same filename prefix as the image and a suffix based on the image extension.
For example, an aerial photograph of Charlottesville might be called Charlottesville.tif
and the world file would be Charlottesville.tfw.
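The naming rule can be expressed as a small helper. The sketch below assumes the commonly used convention in which the world file suffix is built from the first and last letters of a three-letter image extension plus "w" (so .tif gives .tfw and .jpg gives .jgw). The function name is hypothetical, and other extension lengths are rejected rather than guessed at.

```python
from pathlib import Path


def world_file_name(image_name):
    """Derive a world file name from an image file name.

    Assumes a three-letter image extension; the world file suffix is the
    first and last letters of that extension followed by "w".
    """
    path = Path(image_name)
    ext = path.suffix.lstrip(".")
    if len(ext) != 3:
        raise ValueError(f"expected a three-letter extension: {image_name!r}")
    return path.with_suffix(f".{ext[0]}{ext[2]}w").name


print(world_file_name("Charlottesville.tif"))  # Charlottesville.tfw
```

Since the image and its world file are linked only by this shared prefix, both must be stored, renamed and delivered together.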
Other raster models exist to
store attributes of regularly spaced point observations. These are often
called lattices, grids, or surfaces. The values
of each cell might be elevation, precipitation, or population density.
Vector based models are the most
complicated and most efficient means of storing and representing spatial
information in a GIS software tool. Because vector data draws points, lines,
and polygons based on coordinate location, it is well suited to representing
geographically referenced cartographic information. Tabular data stores
attributes for the features represented by vector graphics.
Emerging data models for storing
and representing geographic features are based on relational database models,
although this has been implemented by only one software company (ESRI), and only very
recently. Development around emerging XML models for storing and representing
geographic information is taking place, but these models are not stable
enough for software development.
Different GIS software tools
may implement the vector and non-image raster models in unique ways.
Proprietary Software and Formats
STATISTICAL ANALYSIS
Contemporary software for statistical
analysis has made the transition from the mainframe to the PC. Packages
like SAS and SPSS, which were once entirely driven by program coding and
command line syntax, now offer GUIs. There is no obvious advantage to one
format over another, and most of the major packages for statistical analysis
will export data from one format to another, and translation tools are available.
GEOGRAPHIC INFORMATION SYSTEMS
(GIS)
Geographic Information Systems
software has been around since the 1960s, but unlike statistical packages,
none of the early tools have survived. ESRI has been the primary driver
of the market for GIS software since the introduction of ArcInfo in the
early 1980s, followed by ArcView in the 1990s, and now ArcGIS. With each
of these software packages has come a new data format for representing the
vector and raster models outlined above. ESRI formats are the industry
standard for GIS, and every competing product on the market can
import at least one of ESRI's file formats. Within ESRI's universe
of products, succeeding generations of software can read older data models,
although the older software generally cannot read the newer models.
Most of the users of GIS at the University of Virginia use ESRI software,
so this discussion focuses on ESRI formats.
ArcInfo: Coverage and Grid
Models
The ArcInfo coverage is
a directory-based data structure that bears the imprint of its age. The
coverage itself represents all features of a certain type in a geographic
area (e.g., roads in Charlottesville might be called "cvroads").
Data for the coverage is stored in multiple files in at least two different
directories, one named for the coverage, and one called "info"
storing the attribute data. The coverage and its two directories are stored
in a directory called a "workspace." The workspace may contain
more than one coverage, and the info directory would in such cases contain
the attribute data table files for all the coverages in the workspace. The
Grid format for raster data is similar in its complexity. Because of the
complexity of the data model, ESRI developed text-based exchange formats
for data sharing.
ArcView: Shapefile Model
The shapefile is a file-based
model that incorporates at least three, and as many as seven, separate files
that share the same filename prefix. Shapefiles store vector graphics in
the official shape file with a .shp extension; attribute data
is stored in the tabular format developed for dBase software in the 1980s
(.dbf); indexes linking these two files are stored in the shape index file
(.shx); other indexes to the attribute data may be stored in other index
files with the extension .sbn, .sbx. The shapefile is an open format described
in a white paper available at: http://www.esri.com/WhitePapers/XXX
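Because a shapefile is really a set of files sharing one prefix, a repository needs some way to confirm that every set is complete before storing or delivering it. The following sketch groups file names by prefix and reports any of the three core files (.shp, .shx, .dbf) that are missing. The function name and the sample file names are hypothetical.

```python
from pathlib import Path

REQUIRED = {".shp", ".shx", ".dbf"}  # geometry, shape index, attribute table
OPTIONAL = {".sbn", ".sbx"}          # additional index files


def shapefile_report(filenames):
    """Map each shapefile prefix to a sorted list of its missing required parts."""
    found = {}
    for name in filenames:
        path = Path(name)
        ext = path.suffix.lower()
        if ext in REQUIRED | OPTIONAL:
            found.setdefault(path.stem, set()).add(ext)
    return {stem: sorted(REQUIRED - exts) for stem, exts in found.items()}


files = ["cvroads.shp", "cvroads.shx", "cvroads.dbf", "parcels.shp", "parcels.dbf"]
print(shapefile_report(files))  # {'cvroads': [], 'parcels': ['.shx']}
```

An empty list means the set is complete; anything else identifies a data set that should not be accepted or delivered as it stands.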
ArcGIS: GeoDatabase model
The GeoDatabase model
is heavily dependent upon yet more proprietary software, relying on
storage in a commercial relational database software system such as Oracle,
DB2, SQL Server, or Microsoft Access. ESRI has developed middleware, called
ArcSDE, to enable its software to speak to these tools. All attribute
and coordinate data is stored in relational tables. Theoretically storage
in open source packages like MySQL or PostgreSQL is possible, but ESRI has
no definite plans to develop for these tools.
Status of Data Archiving and
Preservation
STATISTICAL DATA
Within the Social Science data
community, there is a strong ethic of data preservation for public access,
most obviously exemplified in the United States by the Inter-university
Consortium for Political and Social Research (ICPSR). This member-based
consortium has been archiving data for long-term preservation and access
since the 1960s. Their method is to convert data to ASCII formats, produce
machine-readable documentation, and write program code for SPSS and SAS
to read these files. Some working files are delivered in these formats
already. Other significant data archives exist at the US National Archives
and Records Administration (NARA) and in various European organizations.
On the whole, digital library research activity related to statistical data
discovery, access, and analysis is strong and widespread.
SPATIAL DATA
In a research community so strongly
driven by the commercial sector, it is no surprise that there is a weaker
ethic for public data preservation and access. While ICPSR has archived
some spatial data, and a few collections are found in NARA, archiving and
preservation of spatial data are less active research areas. A federally
endorsed standard for archiving and translation known as the Spatial Data
Transfer Standard (SDTS) was endorsed first in the early 1990s and has since
been emended. While it could be considered the primary archival and platform
independent format for spatial data distribution, vendors do not make it
easy to convert this data to their working formats. Converting the Library's
collections to SDTS format is not a useful avenue for storage, access and
delivery. On the other hand, the widespread adoption of ESRI software ensures
that any text-based exchange format created for ESRI's data models
will be readily usable, even if it is not a true platform-independent
format.
Methods of Acquisition
Except in rare cases, Geostat
has acquired spatial and statistical data for the Library from vendors,
government agencies, or consortia, rather than producing data on its
own. This is not to say that data might not be produced in-house, given
the right set of circumstances, but there is a wealth of useful data being
produced that can usually be acquired more easily by some means than by
producing it.
This means that proprietary data
formats tend to dominate the collection, especially for spatial data. Government
produced data, in particular, is produced for immediate use at that agency,
with little regard for long term distribution and archival costs. At some point the Library,
and the entire research community, will be faced with the problem of migrating
that data to extend its usefulness to future scholars. But in the meantime,
we have little choice but to accept data in the formats that are most useful
for our patron base, or that the suppliers offer to us.