University of Virginia
University of Virginia Library
Digital Initiatives: Reports

Digital Formats Working Group Report

April 2002

Purpose:

The Digital Formats Working Group was charged with answering the following questions:

1.        For all media types, what formats are saved as master copies and what formats are delivered?
2.        How much standardization of format will we be doing?
3.        What are the minimum compatibility standards that new collections must meet before we purchase them?

The Group took as its point of departure the report created by the Digital Content Production and Standards Planning Team as part of our Library of Tomorrow (LoFT) planning effort.  After reviewing that earlier report, the Digital Library Working Group made recommendations that update and augment the LoFT document. We have extended the scope to include a consideration of geographic and statistical information, noticeably absent from the first report.

General Recommendations:

The Digital Formats Group offers three general recommendations regarding the determination of technical specifications for our digital library holdings:

1.        The Library’s efforts in building our central repository should, for now, focus on collection development, not preservation or archiving.  Long-term preservation of digital master files requires a strategy of identification, storage and migration to new media, as well as policies for long-term use and access. We do not yet have such a comprehensive plan for the preservation of Library resources, either digital or analog.  The determination of technical specifications for preservation is pointless until the Library creates and is ready to implement such a plan.  The Digital Library Working Group does not attempt to recommend preservation/archiving specifications.

2.        The type, quality, rights status and anticipated use of the data should affect the determination of technical specifications for the capture of digital material.  This Group recommends the use of baseline measures for the creation of content, but also recommends that the final determination of technical specifications should rest on an assessment of the above considerations. Production and content specialists should jointly review baseline standards on a yearly basis.

3.        Decisions about technical specifications for digitization of resources of various media types should be made in consultation with media or content experts. The following example demonstrates the need:  In the digitization queue are two images of maps.  One contains very little graphical information: the map consists of simple lines tracing political boundaries.  The other is a highly detailed map, rich with geographic information of various kinds. After a brief consultation with the relevant media expert, including a review of the nature of the graphical information and an assessment of the likely uses of the images, the production specialist may well decide to scan the first image at our baseline “master copy” specifications and the second image at a much higher resolution to allow for telescoping and the extraction of high-resolution details.  The first image might be 25 MB and the second might be 200 MB.
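The size difference in the map example follows from simple arithmetic: uncompressed raster size grows with the square of the scanning resolution. The sketch below is illustrative only. The map dimensions (10 x 8 inches) and the two resolutions are assumptions chosen to land near the 25 MB and 200 MB figures; they are not values taken from this report.

```python
def uncompressed_size_mb(width_in, height_in, ppi, bit_depth):
    """Approximate uncompressed raster size in megabytes (1 MB = 1,000,000 bytes)."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bit_depth / 8 / 1_000_000


# Hypothetical 10 x 8 inch maps, 24-bit color.
simple_map = uncompressed_size_mb(10, 8, ppi=300, bit_depth=24)    # baseline scan
detailed_map = uncompressed_size_mb(10, 8, ppi=900, bit_depth=24)  # high-resolution scan

print(f"simple map:   {simple_map:.1f} MB")
print(f"detailed map: {detailed_map:.1f} MB")
```

Tripling the resolution multiplies the pixel count, and therefore the uncompressed size, by nine.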

QUESTIONS:

1.  FOR ALL MEDIA TYPES, WHAT FORMATS ARE SAVED AS MASTER COPIES AND WHAT FORMATS ARE DELIVERED?

The following tables provide answers to those questions.

Definitions:

Master copy: The digital copy from which other copies may be derived.

Service/Deliverable quality: Representative and output qualities are reduced; emphasis is on efficient delivery of digital information to users.  Multiple files may be created at this level for multiple purposes or user communities.  When Service and Deliverable qualities are outlined separately, the distinction resides in the assumption that Service versions are editable by the user, but Deliverable versions are not.

Preview/Thumbnail quality: Highly reduced in quality and size or duration, functions as an identifier; has little or no output or editing value.

IMAGES - BITMAP

| Purpose | Type | Format | Compression | Resolution - bit depth | Resolution - no. of pixels | Enhancements? | Comments |
|---|---|---|---|---|---|---|---|
| Master copy | Printed text | TIFF | Lossless (ITU G4 or CCITT4?) | Bitonal | 300-600 | Cropping, rotating, sharpening, descreening, deskewing, despeckling | |
| | Graphical content (photography, painting, mss., etc.) | TIFF | Lossless (LZW) | 24 bit | 300-600 | Cropping, rotating | Include color reference whenever appropriate and feasible |
| Service/Deliverable | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Lossless or lossy, appropriate to delivery needs | 24 or 8-bit color; grayscale | Current screen dimensions or use-based | Cropping, rotating, sharpening, descreening, deskewing, despeckling, brightness and tonal adjustments | Include color reference whenever appropriate and feasible |
| Thumbnail | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Lossless or lossy, appropriate to delivery needs | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all | |

 

IMAGES - VECTOR

| Purpose | Format | Compression | Resolution - bit depth | Resolution - no. of pixels | Enhancements? | Comments |
|---|---|---|---|---|---|---|
| Master copy | EPS, SVG, proprietary formats, e.g. Adobe Illustrator | NA | NA | NA | NA | Include color reference whenever appropriate and feasible |
| Service/Deliverable | EPS, SVG, SWF, GIF, JPEG, PNG | NA | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all | Vector images may be converted to raster formats for delivery. |
| Thumbnail | GIF, JPEG, PNG | | 24 or 8-bit color; grayscale | Appropriate for display of necessary information | Any and all | |

 

TEXT

| Purpose | Format | Comments |
|---|---|---|
| Master copy | Plain text; XML encoded with accompanying DTD and other dependent files | 300-600 |
| Service | Attributes: maintains most, if not all, structural and formatting characteristics as well as "searchability" of the archival file. File format(s): XML, RTF, PDF, proprietary word processor formats | Open standards are preferred. PDF in this context should have embedded, searchable text. |
| Deliverable | Attributes: maintains few, if any, structural and/or formatting characteristics.  Maintains "searchability" of text, but not in conjunction with structure. File format(s): HTML, PDF, unstructured ASCII, e.g. OCR output | |
| Thumbnail | NA | |

 

AUDIO

| Purpose | Format | Compression | Resolution / sample rate | Comments |
|---|---|---|---|---|
| Master copy | AIFF, WAV, SND | | 44.1 kHz, 16 bits per sample | Maintain channel pattern of original, e.g. stereo, mono, multi-channel. |
| Service | AIFF, WAV, SND | | 11 or 22.05 kHz, 8 or 16 bits per sample | Maintain channel pattern where practical. |
| Deliverable | QuickTime, MPEG | Apply compression appropriate for delivery needs of target community. | | |
| Preview | QuickTime, MPEG | | | Reduce duration to create a representative sample: a “clip”. |
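To give a sense of the storage implied by the master and service specifications above, the following sketch computes the size of one minute of uncompressed PCM audio. It is a rough estimate under stated assumptions: the channel counts (stereo for the master, mono for the service copy) are illustrative, and container overhead in AIFF or WAV files is ignored.

```python
def pcm_bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Bytes needed for one minute of uncompressed PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


# Master copy: 44.1 kHz, 16 bits per sample, assumed stereo.
master = pcm_bytes_per_minute(44_100, 16, channels=2)

# Service copy: 22.05 kHz, 8 bits per sample, assumed mono.
service = pcm_bytes_per_minute(22_050, 8, channels=1)

print(f"master:  {master / 1_000_000:.2f} MB per minute")
print(f"service: {service / 1_000_000:.2f} MB per minute")
```

Under these assumptions the master copy needs eight times the storage of the service copy.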


VIDEO

| Purpose | Type | Format | Compression | Comments |
|---|---|---|---|---|
| Master copy | User-created | NTSC DV or DV-Cam tape | None | Tapes should be stored in an environmentally stable location, e.g. vault. |
| | Purchased content | DVD | DVD | |
| Service | | Select as appropriate for task. | Appropriate to format | Service, i.e. editable, versions produced as required by “dubbing”; implies change of storage medium and/or format.  Very large file sizes; not network distributable. |
| Deliverable | | QuickTime, MPEG | Appropriate to format and use | Only highly compressed forms network distributable. |
| Preview | | QuickTime, MPEG | Appropriate to format and use | Reduce duration to create a representative sample: a “clip”. |
| Thumbnail | | MrSID, JPEG, GIF, PDF, PNG, TIFF | Any, appropriate to content and use. | Representative frame: indication of content. |


STATISTICAL/NUMERIC AND GEOSPATIAL DATA:

Definitions of Terms

| Quality Level | Characteristics |
|---|---|
| Master | Preservation storage, ASCII format where possible. Data will probably need further processing before use. Ties to proprietary software formats are minimal but exist. |
| Service | Storage formats for near immediate use by patron or DL disseminator. Probably a proprietary software format. |
| Deliverable | Ready for immediate use or use with minimal processing by end-user. Probably a proprietary software format. |
| Preview | Web-ready low quality image or random sample of material |

STATISTICAL/NUMERIC DATA

| Purpose | Format | Comments |
|---|---|---|
| Master copy | ASCII columnar format. SPSS, STATA, SAS program code and/or machine readable text based documentation to define data for analysis | ASCII delimited preferred. DDI standard metadata preferred documentation format. Following the ICPSR standard for data archiving and preservation. |
| Service | Data stored in some statistical package format (SAS, SPSS, STATA) or in queryable SQL database system | Storage for access, retrieval or extraction. |
| Deliverable | SAS, STATA, SPSS, Excel or delimited ASCII format with data map or variable list. | Excel not advised for very large files.  All users get documentation built from DDI records. |
| Preview | Screen dump of 5% of records, no more than 100 | Practice not currently in place. |
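The preview rule in the table above (5% of records, capped at 100) can be sketched as follows. This is one possible reading, not an established practice: drawing the records at random follows the "random sample of material" wording in the quality-level definitions, and the function names and the fixed seed are hypothetical.

```python
import math
import random


def preview_size(n_records, percent=5, cap=100):
    """Number of records to show: `percent` of the file, rounded up, at most `cap`."""
    return min(math.ceil(n_records * percent / 100), cap)


def preview_sample(records, seed=0):
    """Draw a reproducible random preview sample from a list of records."""
    rng = random.Random(seed)
    return rng.sample(records, preview_size(len(records)))


small = preview_sample(list(range(400)))     # 5% of 400 records
large = preview_sample(list(range(10_000)))  # 5% would be 500, so the cap applies
print(len(small), len(large))
```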

 

SPATIAL DATA - RASTER

| Purpose | Format | Comments |
|---|---|---|
| Master copy | Photography or remote sensing imagery: Non compressed TIF+world file or GeoTIFF (preferred), BIL, IMG (Erdas Imagine) | Also applicable for geo-referenced maps. GeoTIFF retains geographic information in TIFF header; world file does same as separate file. |
| | Non-image raster data: ASCII based storage and exchange format (Arc Exchange .e00; ArcGenerate .gen; Spatial Data Transfer Standard (SDTS)) | SDTS is federal standard, but not widely adopted in commercial industry or government; format is cumbersome for further processing. |
| Service | Photography or remote sensing imagery: GeoTIFF, BIL, IMG, SID + world file | |
| | Non-image raster data: ArcExchange; GeoTIFF; native data formats (.cdo); native software data models (ArcGRID) | Users will almost always need to process stored data.  TIFFs can store pixel value as color value and be converted in GIS software; native data formats are common in federal data.  GRID data model is directory-based, not file-based, but could be stored for access purposes. |
| Deliverable | Photography or remote sensing imagery: GeoTIFF, BIL, IMG, SID+world file, JPG+world file | |
| | Non-image raster data: Arc Exchange, native formats or models, GeoTIFF | |
| Preview | JPG, GIF, or SID | Sizes may need to be slightly larger than those outlined for other types of images |

 

SPATIAL DATA - VECTOR

| Purpose | Format | Comments |
|---|---|---|
| Master copy | ASCII-based exchange format such as SDTS, Arc Exchange (.e00), ArcGenerate (.gen), or delimited text. | Note that two of these are tied to proprietary software formats and are not available for all data models. SDTS is available but rarely used in data distribution. |
| Service | Industry standard formats such as ESRI shape (.shp) or ArcInfo Coverage model, or CAD format such as Microstation (.dgn) or AutoCAD (.dwg).  Possible storage in SQL based system through proprietary middleware (ArcSDE, Oracle Spatial) | Note that ESRI’s shapefile model consists of several related files.  The ArcInfo Coverage model is directory-based.  RDBMS models are still relatively new. |
| Deliverable | Industry standard formats such as ESRI shape, Arc Exchange, or CAD formats. | |
| Preview | GIF, JPG or other raster image format. | Preview graphics need to be large enough to convey the general “look” of the data. |

 

2.  HOW MUCH STANDARDIZATION OF FORMAT WILL WE BE DOING?

Our central repository should allow for the integration of and access to collections across media types.  We can assume, therefore, that standardization of format will be an important goal in the development of a coherent collection.   However, we cannot make blanket statements about technical requirements or criteria for reformatting.  These decisions will have to be made on a case by case basis after evaluating the relative worth of the resources on both content and technical grounds; selectors and technical experts should be involved in this evaluation.  The following questions might help in making reformatting decisions:

Content issues:

1.  How valuable is this material to our user community when fully integrated with other resources in the central repository? How valuable is it as a standalone product?

2.  Do we own it? What copyrights do we hold?

3.  Is it good quality (measurable by various criteria relating to content and technical quality)?

4.  Is it unique (or even distinctive)?

Technical issues:

5.  Is there a straightforward conversion route, and will the necessary programming be useful to other projects?

6.  If the original format is proprietary, can we envision the data being needed in other formats in the future?  Is the proprietary format likely to endure?

7.  Will there be significant data loss in the conversion?

8.  IS IT WORTH THE TIME AND EFFORT?

Reformatting should take place if the answer to question no. 8 is YES.

Text:  

a)       The "Decision Tree for Text," produced as part of the Digital Content Selection Team Report, should be reviewed when making text reformatting decisions (especially Q11-Q14) (Appendix A).

b)        Files should be converted to XML with a standard public DTD whenever feasible. 

c)       Since PDF migration is difficult and time-consuming, PDF conversion should be attempted only when the need for full functionality within the Digital Library is pressing.

Images:

a)       If the images were created or purchased in a lossy, open standard format, e.g. JPEG, that format should be stored with no further edits as the master copy.

b)       If the images were created or purchased in a proprietary format, regardless of the nature of the compression, the images should be immediately converted to TIFF and saved, unedited, as the master copy.  For example, PhotoCD images should be converted and saved as unedited TIFFs.

Statistical Data:

a)       New datasets, purchased in a proprietary format, should be converted to ASCII columnar to ensure maximum flexibility and backward compatibility.
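As a minimal illustration of what "ASCII columnar" storage with an accompanying data map involves, the sketch below writes one record to fixed-width columns and reads it back. The variable names, column positions and sample values are hypothetical; a real data map would come from the dataset's documentation.

```python
# Hypothetical data map: variable name -> (start column, width), 0-based,
# listed in column order with no gaps.
DATA_MAP = {"id": (0, 4), "age": (4, 3), "income": (7, 8)}


def to_columnar(record):
    """Render one record as a fixed-width ASCII line, right-justified."""
    parts = []
    for name, (_start, width) in DATA_MAP.items():
        text = str(record[name])
        if len(text) > width:
            raise ValueError(f"{name!r} does not fit in {width} columns")
        parts.append(text.rjust(width))
    return "".join(parts)


def from_columnar(line):
    """Recover the variables (as strings) from a fixed-width line."""
    return {
        name: line[start:start + width].strip()
        for name, (start, width) in DATA_MAP.items()
    }


line = to_columnar({"id": 17, "age": 42, "income": 51000})
restored = from_columnar(line)
print(repr(line), restored)
```

Because the line carries no delimiters or type information, the data map must travel with the file, which is why the master-copy specification pairs the data with program code or machine-readable documentation.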

Spatial Data:

a)       A single vendor, ESRI, rules the industry and has created a de facto standard from its own proprietary format.  However, long  term viability of this material depends on its availability in an open standard.  DLPS should begin experimenting with the conversion of spatial data to ASCII, working closely with Geostat staff to determine priorities and to solve technical problems.

Audio and Video:

a)       The technical landscape for audio and video is unsettled:  every year sees the emergence of new formats, codecs and tools.  Conversion of highly compressed files is neither desirable nor, in some cases, possible, so the reformatting of most motion media materials should be approached with caution.

b)       We recommend that the technical specifications outlined in the audio and video tables, above, be considered guidelines only, with the understanding that we will need to regularly evaluate new solutions and options.

c)       We further recommend that production for these media types should remain in the Digital Media Lab for the near future so that the technical aspects of production can be closely monitored by motion media experts.  This policy should be frequently revisited, at least on a yearly basis. 

3.  WHAT ARE THE MINIMUM COMPATIBILITY STANDARDS THAT NEW COLLECTIONS MUST MEET BEFORE WE PURCHASE THEM?

Again there is no simple answer to this question.  The relative worth of the materials must be evaluated on both content and technical grounds as described above.  The "Decision Tree for Purchase of Digital Media" (Appendix B) and the "Decision Tree for Text" (Appendix A), parts of the Digital Content Selection Team Report, should be used as assessment tools in this endeavor.

Technical standards should be considered recommendations, NOT requirements. Some valuable materials simply will not be available in the formats that fit best within our DL system. However, they may well be valuable to our user community  and so should be included in our central repository – despite their technical inferiority.  Selectors should, of course, be closely involved in this decision-making. 

Text:

Text generally comes in the following forms:

a)       PDF (electronic dissertations)

b)       page-images (e.g. a form of "digital microfilm" – EEBO; or documentation of texts from Special Collections)

c)       page images with hidden, uncorrected, but nonetheless searchable ASCII text (JSTOR, MOA)

d)       full-text products in which page images are unnecessary because of the accuracy of the text transcription (OED)

e)       full-text products as in d) but with the addition of page images as well (Early American Fiction; HarpWeek)

Options d) and e) are appropriate for a central repository that is XML-based and fully searchable.  However, searchability is not always the desired outcome of text digitization.  The interest of Special Collections in object documentation argues for the viability of page images without searchable ASCII.  In the case of their collections, Special Collections librarians should help assess, on a case by case basis, the relative value of making any specific text searchable.

In the case of purchased or otherwise acquired collections, options b) and c) should be considered minimally acceptable if the resource is deemed valuable enough.  We should attempt, whenever feasible, to bring these resources up to standards.

PDF stands as the only clear exception to this principle of inclusion.  The obstacles to reformatting PDF are so formidable that we should avoid, whenever possible, the purchase of PDF-based collections.  If we must purchase a PDF collection, we should attempt to obtain the page images (e.g. TIFFs) from which the PDF files were originally created.

Images:

Images should be evaluated on the following formal characteristics:  file format (proprietary or open standard), bit depth (full color or indexed color/grayscale); pixel dimension; compression (lossless or lossy).  The ideal combination is open standard, full color, many pixels, lossless compression. 

However, other combinations should be considered minimally acceptable if the content is deemed valuable enough and if the formats can be converted without significant loss to the desirable open standard. We should avoid purchasing or acquiring images that don’t successfully convert to a non-proprietary standard. 

Statistical Data:

Because statistical data depends for its usefulness on the analytical features of specific software packages, the minimal acceptable standard should be that the data is exportable to ASCII columnar format.  This will ensure that the data can be manipulated by different packages over time. 

Spatial Data:

The ESRI suite of formats is the proprietary standard in this area.  New spatial data should be convertible to a useful delivery format.

Audio and Video:

It is a sad fact that today’s wisdom is tomorrow’s folly in the world of digital dynamic media.  We should definitely attempt to collect materials in standard, non-proprietary formats, for which players are readily available across platforms; however, we also need to realize that redigitization of motion media materials will occasionally be inevitable.  We should strenuously avoid creating or purchasing motion media materials that depend on the tools or players of a single operating system.


APPENDIX A

Decision Tree for Text

A decision tree for building or buying a trade publication or journal.

Q1: Does this item meet our general selection criteria?

If yes -- Q4

If no -->

Q2:         If no -- has it nonetheless been requested by a faculty, staff, or student member?

If no -- and no other special case -- end.

If yes -->

Q3:         If yes -- does the combination of need plus price make it an item we can afford to consider for purchase or local creation?

If no -- end.

If yes -->

Q4:  Can we afford it and are we willing to buy it or to build it? 

If yes -- Q5

If no -- end
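Read as a procedure, questions Q1 through Q4 above form a simple gate in front of Q5. The sketch below is one way to express that flow; the function and parameter names are hypothetical, and the "other special case" exception mentioned under Q2 is not modeled.

```python
def selection_gate(meets_criteria, requested=False,
                   worth_considering=False, affordable_and_willing=False):
    """Walk Q1-Q4 and return 'Q5' if the item proceeds, otherwise 'end'."""
    if not meets_criteria:            # Q1: general selection criteria
        if not requested:             # Q2: requested by faculty, staff or student?
            return "end"
        if not worth_considering:     # Q3: need plus price make it worth considering?
            return "end"
    # Q4: can we afford it, and are we willing to buy or build it?
    return "Q5" if affordable_and_willing else "end"


print(selection_gate(True, affordable_and_willing=True))
print(selection_gate(False, requested=True, worth_considering=True,
                     affordable_and_willing=True))
print(selection_gate(False))
```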

Q5: What forms is it available in?

1) print only

2a) print plus 2b) electronic

3) electronic only

If (1) purchase -- end.

If (2) choose [a] or [b] or both

If (2b) or (3) -- electronic form --:

Questions for consideration of electronic content.

Q6: Is it offered as something we can take delivery of and use/aggregate locally, or is it a remote service only?  That is, do we have a local/remote choice (this could also be expressed as an own/rent choice)?

Q7: If desirable to have locally -- for aggregation with other products and use through a common interface -- will the vendor entertain such an arrangement on request, or is it already offered in this fashion?

If yes, Q11

If no -->

Q8: If remote service only, do we pay by the year (subscription service) or do we buy it outright even though we do not ever house the product locally (e.g. EEBO, Netlibrary, etc.)?

Q9: if we buy a remote service outright, is there any other charge (an annual "maintenance" fee?) and what happens if the vendor goes out of business or cancels the product after a year or two?  Is the data in escrow with a third party data vault, for example?

Q10: Buy outright despite lack of guarantee that it will be available over the long term?

If no, repeat Q7, then ask for data escrow guarantee, and then end.

If yes, purchase and end.

Q11:       Standard or non-standard data?

If we are building or buying content for local use, is it long-term (standardized or non-proprietary) content or is it in a proprietary data format (a Folio database, for example)?  That is, is it a standalone product -- on a CD perhaps with a proprietary data format and custom-made search/display software -- or will it "aggregate" -- will it be able to work interactively as part of a larger collection?

[There is a need to decide if we consider PDF as a robust non-standard or as a standard that can be migrated by us and therefore that will survive as part of a permanent library]

Q12:       If proprietary, non-standard content, is it nonetheless worth having for a short number of years (for support of a particular research or teaching need) until it ceases to work?  Example: customised software on CD products in particular often dies when one updates an operating system -- cpm to ms-dos; win3.1 to win95/8.

Q13:  If proprietary, non-standard content, is it worth getting the publisher's permission to re-format it at our own cost into something more permanent?

Q14: If standardized data, what do we prefer in each data type?

TEXT: STRUCTURE

Important questions to ask:

Is it encoded in XML or SGML markup according to a known DTD (e.g. EAD, TEI)?

If a publisher-specific DTD, can we see the range and type of tags ahead of the purchase?

If SGML, are tags left unclosed (minimization)?  This is highly undesirable. 

TEXT: TRANSCRIPTIONAL ACCURACY

Most important questions to ask:

What transcriptional error rate should we expect in the text?  Does the publisher guarantee a minimum average error rate? 

Does their error count include layout and typography as errors? 

How has the text been created -- OCR or keyboarding?

As a guide, uncorrected or lightly corrected OCR is often quoted as 99.9% accurate (1 error in 1,000).
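When comparing vendor guarantees, it can help to translate an accuracy percentage into errors per page. The sketch below does that with exact fractions to avoid rounding surprises. The page length of 2,000 characters is an assumption for illustration only, and the two accuracy figures are examples rather than quoted vendor rates.

```python
from fractions import Fraction


def characters_per_error(accuracy_percent):
    """Average number of characters between errors for a given accuracy."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


def expected_errors(accuracy_percent, characters):
    """Expected number of errors in a text of the given length."""
    return characters / characters_per_error(accuracy_percent)


PAGE_CHARACTERS = 2000  # assumed page length, for illustration

for accuracy in ("99.9", "99.8"):
    print(accuracy,
          characters_per_error(accuracy),
          expected_errors(accuracy, PAGE_CHARACTERS))
```

Even a small change in the quoted percentage doubles the number of corrections needed per page, which is why it matters whether the error count also includes layout and typography.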

TEXT: FORM

Important questions to ask:

Is it one of the following:

14a) a page-image collection -- a form of "digital microfilm" -- EEBO

14b) a page image collection with hidden, uncorrected, but nonetheless searchable ASCII text -- JSTOR, MOA

14c) a full-text product in which page images are unnecessary because of the accuracy of the text transcription -- OED

14d) a full-text product as in 14c) but with the addition of page images as well -- Early American Fiction; HarpWeek


APPENDIX B

 

 

APPENDIX C

Data Formats Working Group
Recommendations for Statistical and Spatial Data Collection and Storage
Mike Furlough, November 2001 

Overview of Statistical and Spatial Data Formats

Characteristics of Use

Users of electronic texts or images of objects are usually well served by delivery through a web-based medium. While in some cases users may wish to transfer these objects to another medium or incorporate them into other digital objects, most of the tools for analysis of these objects are in the user’s head, or can be fairly readily delivered via the web.

Delivery of spatial and statistical data sets, however, is somewhat more complicated.  In many cases a user will be satisfied with a table of numbers reporting statistics, an image of an aerial photograph, or a basic map.  But most frequently these data are retrieved for further analysis in a proprietary software tool.  In this tool users may resample, subset, edit, recode, recalculate or otherwise manipulate the data to answer their research questions.

It has been increasingly possible to perform these analyses on the web for statistical data sets, and, more recently, spatial data.  However, it is still not possible, or even desirable, to provide a complete range of even the most basic analysis functions for statistical or spatial data on the web.  This means that for digital library production, acquisition, dissemination and archiving we must rely more on proprietary software formats and industry driven standards than we might ordinarily choose to do. 

Data Models

STATISTICAL DATA

Data models for statistical data are relatively simple, consisting of tabular data stored in rows and columns and sharing most of the characteristics familiar from relational database models.  However, the primary tools for analysis are not RDBMS software tools but SAS, SPSS or STATA, which typically handle only one file and one table at a time.  Statistical data sets are likely to include multiple files describing related data; these are analogous to the multiple tables in a database storing related data.  Usually one row in a table corresponds to one record or point of observation.  However, older data from the punch card and mainframe days may contain multiple observations per row and require special processing in contemporary tools.
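The "special processing" needed for legacy rows that pack several observations together usually amounts to cutting each row into fixed-width records before analysis. The sketch below shows the idea; the record width and the sample row are hypothetical, and real files would take the width from their codebook.

```python
def split_observations(line, record_width):
    """Split one legacy row into its fixed-width observations."""
    line = line.rstrip("\n")
    if len(line) % record_width != 0:
        raise ValueError("row length is not a multiple of the record width")
    return [line[i:i + record_width] for i in range(0, len(line), record_width)]


# Hypothetical row holding two 5-character observations.
observations = split_observations("0011200212\n", record_width=5)
print(observations)
```

After splitting, each observation occupies its own row and can be read by contemporary tools in the usual one-record-per-row layout.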

SPATIAL DATA

Spatial data is frequently multi-format, industry driven, and data models are still evolving.   It draws on relational data models to link the graphical representation of geographic features with their geographic coordinates and a host of attribute data stored in tabular form.   Typically a single spatial data “object” in the repository will consist of multiple files, many of which are interpretable or knowable only by the proprietary software.                                                                                   

The simplest spatial data are raster models:  digital geo-referenced images of the earth’s surface or paper maps.  Coordinates are stored in the header of the image, as in the GeoTIFF variation of TIFFs, or as a separate world file sharing the same filename prefix as the image and a suffix based on the image extension.  For example, an aerial photograph of Charlottesville might be called “Charlottesville.tif” and the world file would be “Charlottesville.tfw.”
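The world file convention can be sketched in a few lines. Two things here go beyond what this report states and should be treated as assumptions: the naming rule (first and last letters of a three-letter image extension plus "w") and the six-line layout of the world file (x pixel size, two rotation terms, negative y pixel size, then the map coordinates of the center of the upper-left pixel), both as commonly documented for ESRI-style world files. The coordinate values are invented for illustration.

```python
def world_file_name(image_name):
    """Derive the world file name, e.g. 'Charlottesville.tif' -> 'Charlottesville.tfw'."""
    stem, ext = image_name.rsplit(".", 1)
    if len(ext) == 3:
        suffix = ext[0] + ext[2] + "w"
    else:
        suffix = ext + "w"
    return f"{stem}.{suffix}"


def parse_world_file(text):
    """Read the six coefficients in file order: A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(value) for value in text.split())
    return a, d, b, e, c, f


def pixel_to_map(coefficients, col, row):
    """Map coordinates of the center of pixel (col, row)."""
    a, d, b, e, c, f = coefficients
    return a * col + b * row + c, d * col + e * row + f


# Hypothetical world file: 2-unit pixels, no rotation, origin at (1000, 5000).
coefficients = parse_world_file("2.0\n0.0\n0.0\n-2.0\n1000.0\n5000.0\n")
name = world_file_name("Charlottesville.tif")
x, y = pixel_to_map(coefficients, col=10, row=20)
print(name, x, y)
```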

Other raster models exist to store attributes of regularly spaced point observations.  These are often called “lattices,” “grids,” or surfaces.  The values of each cell might be elevation, precipitation, or population density. 

Vector based models are the most complicated and most efficient means of storing and representing spatial information in a GIS software tool.  Because vector data draws points, lines, and polygons based on coordinate location, it is well suited to representing geographically referenced cartographic information.   Tabular data stores attributes for the features represented by vector graphics.  

Emerging data models for storing and representing geographic features are based on relational database models, although this approach has been implemented by only one software company (ESRI), and only very recently.  Development around emerging XML models for storing and representing geographic information is taking place, but these models are not stable enough for software development.

Different GIS software tools may implement the vector and non-image raster models in unique ways.

Proprietary Software and Formats

STATISTICAL ANALYSIS

Contemporary software for statistical analysis has made the transition from the mainframe to the PC.  Packages like SAS and SPSS, which were once entirely driven by program coding and command line syntax, now offer GUIs.  There is no obvious advantage to one format over another: most of the major packages for statistical analysis will export data from one format to another, and translation tools are available.

GEOGRAPHIC INFORMATION SYSTEMS (GIS)

Geographic Information Systems software has been around since the 1960s, but unlike statistical packages, none of the early tools have survived. ESRI has been the primary driver of the market for GIS software since the introduction of ArcInfo in the early 1980s, followed by ArcView in the 1990s, and now ArcGIS.    With each of these software packages has come a new data format for representing the vector and raster models outlined above.   ESRI formats are the industry standard for GIS, and no competing software is on the market that can’t import at least one of ESRI’s file formats.  Within ESRI’s universe of products, succeeding generations of software can read older data models, although the older software generally cannot read the newer models.      Most of the users of GIS at the University of Virginia use ESRI software, so this discussion focuses on ESRI formats.

ArcInfo:  Coverage and Grid Models

The ArcInfo coverage is a directory based data structure that bears the imprint of its age. The coverage itself represents all features of a certain type in a geographic area (e.g., roads in Charlottesville might be called “cvroads”).   Data for the coverage is stored in multiple files in at least two different directories, one named for the coverage, and one called “info” storing the attribute data.  The coverage and its two directories are stored in a directory called a “workspace.”  The workspace may contain more than one coverage, and the info directory would in such cases contain the attribute data table files for all the coverages in the workspace. The Grid format for raster data is similar in its complexity.  Because of the complexity of the data model, ESRI developed text-based exchange formats for data sharing.

ArcView:  Shapefile Model

The shapefile is a file-based model that incorporates at least three, and as many as seven, separate files that share the same filename prefix.  Shapefiles store vector graphics in the official “shape” file with a .shp extension; attribute data is stored in the tabular format developed for dBase software in the 1980s (.dbf); indexes linking these two files are stored in the shape index file (.shx); other indexes to the attribute data may be stored in other index files with the extensions .sbn and .sbx.  The shapefile is an open format described in a white paper available at:  http://www.esri.com/WhitePapers/XXX
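Because a shapefile is really a bundle of files, a repository ingest step would need to confirm that the bundle is complete. The following sketch checks for the three files named above as the minimum (.shp, .shx, .dbf) and notes the optional index files. The function name, the sample filename "cvroads" and the temporary directory are hypothetical and serve only to make the example self-contained.

```python
import tempfile
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")
OPTIONAL = (".sbn", ".sbx")


def shapefile_parts(shp_path):
    """Return (found, missing): component files present, and required ones absent."""
    base = Path(shp_path)
    found = {
        ext: base.with_suffix(ext)
        for ext in REQUIRED + OPTIONAL
        if base.with_suffix(ext).exists()
    }
    missing = [ext for ext in REQUIRED if ext not in found]
    return found, missing


# Demonstration with an incomplete bundle: the attribute table is absent.
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".shx"):
        (Path(tmp) / f"cvroads{ext}").touch()
    found, missing = shapefile_parts(Path(tmp) / "cvroads.shp")

print(sorted(found), missing)
```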

ArcGIS:  GeoDatabase model

The GeoDatabase model is heavily dependent upon yet more proprietary software, relying on storage in a commercial relational database system such as Oracle, DB2, SQL Server, or Microsoft Access.  ESRI has developed middleware, called ArcSDE, to enable its software to speak to these tools. All attribute and coordinate data is stored in relational tables.  Theoretically, storage in open source packages like MySQL or PostgreSQL is possible, but ESRI has no definite plans to develop for these tools.

Status of Data Archiving and Preservation

STATISTICAL DATA

Within the Social Science data community, there is a strong ethic of data preservation for public access, most obviously exemplified in the United States by the Inter-university Consortium for Political and Social Research (ICPSR).  This member-based consortium has been archiving data for long-term preservation and access since the 1960s.   Their method is to convert data to ASCII formats, produce machine-readable documentation, and write program code for SPSS and SAS to read these files.  Some working files are delivered in these formats already.   Other significant data archives exist at the US National Archives and Records Administration (NARA) and in various European organizations.  On the whole, digital library research activity related to statistical data discovery, access, and analysis is strong and widespread.

SPATIAL DATA

In a research community so strongly driven by the commercial sector, it is no surprise that there is a weaker ethic for public data preservation and access.  While ICPSR has archived some spatial data, and a few collections are found in NARA, archiving and preservation of spatial data are less active research areas.  A federal standard for archiving and translation, the Spatial Data Transfer Standard (SDTS), was first endorsed in the early 1990s and has since been emended.  While it could be considered the primary archival and platform-independent format for spatial data distribution, vendors do not make it easy to convert this data to their working formats.  Converting the Library’s collections to SDTS format is not a useful avenue for storage, access and delivery.  On the other hand, the widespread adoption of ESRI software ensures that any text-based exchange format created for ESRI’s data models will be readily usable, even if it is not a true platform-independent format.

Methods of Acquisition

Except in rare cases, Geostat has acquired spatial and statistical data for the Library from vendors, government agencies, or consortia, rather than producing data on its own.  This is not to say that data might not be produced in-house, given the right set of circumstances, but there is a wealth of useful data being produced that can usually be acquired more easily by some means than by producing it.

This means that proprietary data formats tend to dominate the collection, especially for spatial data.  Government-produced data, in particular, is produced for immediate use at that agency, with little regard for long-term distribution and archival costs.  At some point the Library, and the entire research community, will be faced with the problem of migrating that data to extend its usefulness to future scholars.   But in the meantime, we have little choice but to accept data in the formats that are most useful for our patron base, or that the suppliers offer to us.

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, August 03, 2009
© The Rector and Visitors of the University of Virginia