Building an American Studies Information Community
A Proposal to The Andrew W. Mellon Foundation
By The University of Virginia Library
Libraries of the twenty-first century must take on the challenge of making
sense of the flood of print and digital information now overwhelming students
and scholars. The University of Virginia Library has just scratched the surface
of a great potential to build and sustain a comprehensive program to support
excellence in teaching and research. Our objective in creating the Library
of Tomorrow is to make all forms of information easily accessible to our faculty
and students as well as an international audience of academic and non-academic
users.
Information Communities
Building on its strengths in user service, special collections,
and digital initiatives, the Library is working to create the model University
research library for the twenty-first century. The foundation of this model
is a concept we are calling Information Communities.
An Information Community is a group of scholars, students,
researchers, librarians, information specialists and citizens from similar
or dissimilar fields, whose common link is a shared information need. This
information need can be oriented around a subject, a field, a methodology,
or a data type. The information can include text, data, digitized media, images,
and formal and informal scholarly exchanges of ideas. Information Communities
exist as a medium for bringing people together and making them aware of opportunities
and resources. Community is fostered by personal communication, shared interests,
shared research materials, shared tools, and shared standards. Information
Communities add value to information, and offer opportunities for using information
in new and different ways. Activities of the community can include creation
of web-based materials, development of portable tools for enhancing access
to the materials, and management of conferences and publications. Information
Communities foster innovation and spark new areas of research, and usually
result in a tangible body of knowledge for consumers.
UVa Religious Studies professor David Germano, who has worked
with the Library to build the Tibetan and Himalayan Digital Library, has described
the basic structure: "an Information Community consists of the people
(authors, publishers, and users), the collections (texts, images, videos,
audio, and maps), and the tools provided for interacting with those collections.
The Library is providing the technological, administrative, and organizational
infrastructure for these collections, but relies on individual scholars and
collaborative projects. Multimedia digital publication is at the heart of
the Tibetan Information Community, which includes providing scholars and scholarly
groups with digital tools as a framework enabling collaborative research that
can then be published within the Library."
The Library is working to reshape its existing library services
to support Information Communities and the expanded capabilities they will
bring to instruction and research. The initial Information Communities will
support the development of the initiatives recommended by the Virginia 2020
Commissions on the future of the University of Virginia. Our planning calls
for developing the underlying structure and portals for up to five Information
Communities in the coming year.
American Studies
The UVa Library holds one of the world's finest collections
of rare books and manuscripts in American Literature and History, has been
an international leader in digitizing documents and images from these collections,
and hosts several innovative academic centers which are busy advancing this
work. Our next major Information Community will be a cooperative endeavor
across all the disciplines central to American Studies, gathering collections
close at hand and building online scholarship with University faculty. When
fully developed, the American Studies Information Community will bring together
collections, students, and researchers from around the world, both online
and onsite, reshaping the future of scholarship.
The Community initially will be based on existing local physical
collections such as the Clifton Waller Barrett Library of American Literature,
the Tracy W. McGregor Library of American History, and a host of other rare
book, manuscript, and archive collections in the Special Collections Department.
A number of digital text and image collections and projects
have already been created, such as:
There are also these map and dataset collections that will
be valuable to the community:
We expect to partner with the University's Institute for Advanced
Technology in the Humanities and the Virginia Center for Digital History,
both located in Alderman Library, and work closely with allied departments
and centers, including an Institute still in the planning stages that is designed
to bring American Studies scholars from around the world to Charlottesville
for research and educational development.
Other possible early-stage collaborators include the Thomas
Jefferson Foundation, the University's Bayly Art Museum, Virginia Tech, the
Smithsonian American Art Museum, SOLINET, and the Digital Library Federation.
While we have no illusions that what we are creating will be the only portal
into the world of American Studies, there is no reason at this point to set
a limit to the numbers and range of eventual participant members of the Community.
Collaborative expansion is one of the prime strategies for growth for each
Information Community.
Phase I Proposal: Creating an American Studies Portal to an Integrated Collection
While an Information Community will grow according to the
nurturing efforts of its participants, the first stage must be the gathering
of collections. In order to investigate possibilities for harvesting metadata
and using the resources that they describe, we must first create an integrated
collection of digital resources from which an American Studies community can
be served. The whole concept of information communities depends on having
a large integrated general collection upon which rule-based software "lenses"
can be focused, giving a specific community its best access to the resources.
We have long been creating and acquiring appropriate digital collections that
should provide a firm foundation upon which to build. What has been missing
until recently is the infrastructure that will make true integration of collections
possible and the necessary provision of customized views practical.
We have built a working digital object repository based on
the Flexible Extensible Digital Object Repository Architecture (Fedora)
protocol. This protocol, developed by Carl Lagoze and Sandy Payette at Cornell
University, can be used to manage and deliver a broad variety of digital resources.
For details of our architecture and implementation of the system, see our
article "Virginia Dons Fedora: A Prototype for a Digital Repository"
in the July/August 2000 issue of D-Lib Magazine (http://www.dlib.org/dlib/july00/staples/07staples.html).
Currently, we have 250,000 digital objects in the repository, including many
electronic texts. Some of these are fully transcribed and marked up, and others
are sets of page image objects bound together by a more minimal structural
metadata object. We also have a collection of EAD-encoded finding aids, some
of which contain references to the etext image and text objects described above.
In addition to the images of text pages, we have several collections
of images in our testbed. These include documentary photographs from our special
collections and art, archeology, and architecture images that we purchased
from vendors. The purchased images came with databases that we have been experimenting
with processing and reformatting into a set of XML objects that provide the
structural metadata to organize access in the digital library. The General
Descriptive Modeling Scheme (GDMS) XML DTD that we are using is one that we
developed. It is intended to model collections of digital resources in ways
that are natural to the collections: architecture and archeological site image
collections are organized around the structure of the site; collections of
art images are organized by creator or other classification in the case of
unknown creators.
After our catalogers took the original databases and massaged
them a bit, we processed the data into the XML form with a script. We are
working towards a model where we can pre-process data from outside sources,
apply the human judgment needed, then post-process it into a usable form.
Using an XML format makes it easy to retain the source of the information,
so we can refresh the metadata from the original source while retaining our
added information. We are very interested in experimenting with pre-processing
techniques that allow a human to make judgments about large groups of resources
that can be applied mechanically, working towards supporting a "bionic
cataloger" who can process large collections of resources efficiently
by using a variety of software aids.
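The idea of refreshing metadata from its source while retaining our added information can be sketched in Python. This is a minimal illustration under stated assumptions, not our processing script: the `Record` class, the `refresh` function, the field names, and the sample values are all hypothetical, and plain dictionaries stand in for the XML objects.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One metadata record; source-supplied and locally added fields are kept apart."""
    source_id: str                                      # identifier in the original database
    source_fields: dict = field(default_factory=dict)   # as delivered by the outside source
    local_fields: dict = field(default_factory=dict)    # added by a cataloger

    def merged(self) -> dict:
        """Return the combined view; local judgments override source values."""
        return {**self.source_fields, **self.local_fields}


def refresh(record: Record, new_source_fields: dict) -> Record:
    """Replace the source-derived fields and leave the local additions untouched."""
    return Record(record.source_id, dict(new_source_fields), dict(record.local_fields))


# Made-up sample data, for illustration only.
old = Record(
    source_id="vendor-0042",
    source_fields={"title": "Monticello, west front", "creator": "unknown"},
    local_fields={"community": "American Studies"},
)
new = refresh(old, {"title": "Monticello, West Front", "creator": "Jefferson, Thomas"})
print(new.merged())
```

Keeping the two sets of fields separate is what makes the refresh safe: a new delivery from the source can overwrite everything it originally supplied without touching what was added by hand.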
Plan of Work
Our proposed plan is to proceed on three fronts: expanding
our digital library testbed, developing a set of tools and procedures for
an American Studies community, and developing the portal for that community.
It is critical to activities on all three fronts that we have a community
coordinator to support the activities of the community. We see teams of library
staff forming around communities to provide various expertise, as well as
advisory groups of faculty and other community members, and know that we will
need one person to act as the coordinator. In this startup phase we think
that we need someone who is conversant both with the content and with at least
some of the technical issues of digital content development and delivery.
The activities of the grant project will be coordinated by the Digital Library
Research and Development group, with Thornton Staples, the director of the
group, acting as the principal investigator.
Expanding the Testbed
To date, our digital library activities have centered on building
a testbed that demonstrates solutions to particular technical problems. Our
next priority is to apply what we have learned from our testbed activities
to bring in a large collection that can support an American Studies community
view. We plan to bring our Early American Fiction and Modern English etext
collections into the repository and, where necessary, enhance metadata to
ensure they are recognized as specifically American. This involves converting
the texts to XML and refining the XSL stylesheets and other scripts from our
initial testbed. It also involves restricting access for some of the collection,
which requires that we develop an authentication and rights management infrastructure.
Our main focus for further repository system development for the next year
will be to do this by adding policies to objects and enforcing them through
the system.
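As a rough illustration of what adding policies to objects and enforcing them could look like, consider the Python sketch below. It does not describe the repository's actual design: the `Policy` and `DigitalObject` classes, the `may_access` function, the group names, and the object identifiers are all hypothetical, and the rule that an object without policies is public is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical access rule: the user groups that may see an object."""
    allowed_groups: frozenset


@dataclass
class DigitalObject:
    object_id: str
    policies: list = field(default_factory=list)  # assumption: no policies means public


def may_access(obj: DigitalObject, user_groups: set) -> bool:
    """Allow access only if every policy on the object admits one of the user's groups."""
    return all(policy.allowed_groups & user_groups for policy in obj.policies)


# Made-up objects, for illustration only.
public_text = DigitalObject("etext:public-001")
restricted_text = DigitalObject("etext:licensed-001", [Policy(frozenset({"uva"}))])

print(may_access(public_text, set()))
print(may_access(restricted_text, {"guest"}))
print(may_access(restricted_text, {"uva"}))
```

Attaching the rule to the object, rather than to a collection or a directory, means the restriction travels with the object wherever a community view happens to surface it.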
Our entire collection of EAD finding aids was included in
the testbed, so those will require minimal work on the metadata. Our image collections
from Special Collections are currently organized using Filemaker Pro databases.
We propose to create EAD finding aids for those collections, add to the metadata,
and develop some new XSL stylesheets and scripts. We would also like to consolidate
a number of architectural image collections developed by faculty, normalize
and enhance the accompanying databases, and convert them to GDMS
XML objects.
To build up the art side of our collection we would turn to
some outside sources. We are members of AMICO and we have had preliminary
conversations with both Jennifer Trant of AMICO and Ricky Erway of RLG about
including that collection in our repository. We would convert the database
to GDMS objects and register the images as formal objects in our repository
that would point to the files resident on the RLG server. We would do the
same for the collections data from the Bayly museum here at UVA. For both
collections, this would mean that we were integrating a broad collection of
art from around the world into our digital library, on which the community
could draw for American Studies.
We have also had conversations with Elizabeth Broun, at the
Smithsonian American Art Museum (SAAM), and Rachel Allen, the director of
the Research and Scholars Center at SAAM, about establishing a relationship
with our American Studies community. From the digital collection point of
view, that would include integrating their collection records as well as their
Inventories of American Painting and Sculpture. The collections data would
be handled in the same way as AMICO. The Inventory data would provide us with
an interesting research opportunity to figure out how to integrate Z39.50
databases.
Developing Procedures and Collecting Tools
The second front for our work would be to take the first steps
to develop procedures and tools to be used in developing a community view
of our digital library collection. This requires: 1) developing a profile
of a subject area that can be used to recognize a resource as related to the
community in question and 2) testing methods of analyzing the resources and
enriching the metadata for that community.
We would continue working on the "bionic cataloger"
model that we have started using for our purchased resources. This work would
concentrate on mechanically pre-analyzing data, presenting a relatively compact
summary to a human for judgment, then feeding the results back to enhance
the original data, storing the final results in our repository such that the
original metadata record could be updated from its source while retaining
our additional data. We definitely plan to provide some simple tools that
the cataloger could use to do this, based on sorting and boiling down data
so that judgment could be made and applied to large sets of metadata.
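The sorting and boiling down described above might look something like the following Python sketch. It is only an illustration under stated assumptions: the function names `summarize` and `apply_judgments`, the `place` and `american_studies` fields, and the sample records are hypothetical, and lists of dictionaries stand in for the metadata objects.

```python
from collections import defaultdict


def summarize(records, field_name):
    """Group record ids by the distinct values of one field, largest group first."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field_name, "")].append(record["id"])
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


def apply_judgments(records, field_name, judgments, target_field):
    """Copy each per-value decision onto every record that shares that value."""
    for record in records:
        decision = judgments.get(record.get(field_name, ""))
        if decision is not None:
            record[target_field] = decision
    return records


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "place": "Charlottesville, Va."},
    {"id": 2, "place": "Athens"},
    {"id": 3, "place": "Charlottesville, Va."},
    {"id": 4, "place": "Richmond, Va."},
]

# The cataloger reviews three distinct values rather than four records.
for value, ids in summarize(records, "place"):
    print(f"{len(ids):3d}  {value}")

# Each decision is recorded once per value, then applied mechanically.
judgments = {"Charlottesville, Va.": True, "Richmond, Va.": True, "Athens": False}
apply_judgments(records, "place", judgments, "american_studies")
```

The saving grows with the size of the collection: a field with a few hundred distinct values across tens of thousands of records needs only a few hundred human decisions.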
One possibility that we will explore is to apply natural language
expert systems software to the problem. The Alembic Workbench software (http://www.mitre.org/technology/alembic-workbench/)
developed by the Mitre Corporation (and available to be used for free for
internal purposes) appears to provide the basic functionality that we would
need. Essentially, the system uses natural language processing to analyze
textual data and build a knowledge base. A human user "trains" the
system in steps by giving feedback, which is then used to enhance the knowledge
base. We would like to exploit such a system to develop the profiles for each
community to identify resources with a high probability of relevance. We will
then add another profile for each community that could be used to enhance
the metadata and increase its usefulness.
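To make the notion of a trained community profile concrete, here is a deliberately simple Python sketch. It is not the Alembic Workbench and does no real natural language processing; it swaps in a plain term-weight profile that is adjusted by human feedback. The `Profile` class, its methods, and the sample texts are all hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a crude stand-in for real language analysis."""
    return re.findall(r"[a-z]+", text.lower())


class Profile:
    """Hypothetical community profile: a weight for each term seen in training."""

    def __init__(self):
        self.weights = Counter()

    def train(self, text, relevant):
        """Raise the weight of each term in a relevant text, lower it otherwise."""
        step = 1 if relevant else -1
        for term in set(tokens(text)):
            self.weights[term] += step

    def score(self, text):
        """Sum the weights of the distinct terms that appear in the text."""
        return sum(self.weights[term] for term in set(tokens(text)))


# Made-up training feedback and candidates, for illustration only.
profile = Profile()
profile.train("Letters of Thomas Jefferson written at Monticello", relevant=True)
profile.train("Photographs of the Virginia Piedmont", relevant=True)
profile.train("Temple architecture of ancient Athens", relevant=False)

candidates = [
    "Jefferson and the architecture of Monticello",
    "Pottery of ancient Athens",
]
ranked = sorted(candidates, key=profile.score, reverse=True)
for text in ranked:
    print(profile.score(text), text)
```

A real system would need far better language analysis than word counting, but the loop is the same: a person judges a few examples, the profile is adjusted, and the adjusted profile ranks the resources that nobody has looked at yet.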
The bionic cataloger will be an important tool for continuing
our work with enhancing the databases that accompany the digital collections
that we buy. But it will be even more interesting to use these techniques
on the pool of metadata being developed by the Open Archives Initiative (OAI).
The OAI will finally give us a systematic way of locating publicly available
digital resources, and we believe that our system will give us a way to integrate
those resources and make good use of them in a manageable, scalable process.
In this manner, we expect to be able to establish access to a large collection
of publicly available electronic texts that we assume will be made available
through OAI metadata from other Digital Library Federation members. We should
also be able to tap into a rich lode of information about the current location
of museum objects, and in some cases digital representations thereof, by harvesting
the metadata that will be made available by the Consortium for the Interchange
of Museum Information. In both of these cases, the harvested data would provide
a very interesting application for our bionic cataloging system to recognize
and classify resources that are of interest to an American Studies community.
Building the Portal
The third area that our work will concentrate on is building
a portal for the community. We see a web-based portal as being the primary
tool that the Library can use to host each of many information communities.
Each community will have communication services, access to relevant collections
and tools, and reference services provided through a portal. American Studies,
by its very nature, is a decentralized field.
Much of the work in this area will center on making the portal
as manageable as possible for the community coordinator. We plan to use a
portal software package as the basis for our work, but we will need to build
processes and databases that help automate some of the community services.
For example, we have been experimenting with an XML database of bibliographic
objects that represent on-line reference resources that can be used to give
the most up-to-date list when people go to our reference page on the web.
We will explore associating those reference resources with a community, making
it possible not only to provide community-specific reference pages automatically,
but also to feature new references in a special section that automatically updates
after a certain time period.
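A community-specific reference page with a self-updating section for new references could be generated along these lines. The Python sketch below is illustrative only: the `reference_page` function, the 30-day window, the community labels, and the sample resources are assumptions, and a list of dictionaries stands in for the XML database of bibliographic objects.

```python
from datetime import date, timedelta
from operator import itemgetter

NEW_FOR = timedelta(days=30)  # assumed length of the "new" period


def reference_page(resources, community, today):
    """Split a community's reference resources into new and established lists."""
    mine = [r for r in resources if community in r["communities"]]
    new = [r for r in mine if today - r["added"] <= NEW_FOR]
    established = [r for r in mine if today - r["added"] > NEW_FOR]
    by_title = itemgetter("title")
    return sorted(new, key=by_title), sorted(established, key=by_title)


# Made-up sample resources, for illustration only.
resources = [
    {"title": "Dictionary A", "communities": {"American Studies"},
     "added": date(2001, 1, 5)},
    {"title": "Gazetteer B", "communities": {"American Studies", "Tibetan Studies"},
     "added": date(2001, 3, 20)},
    {"title": "Glossary C", "communities": {"Tibetan Studies"},
     "added": date(2001, 3, 25)},
]

new, established = reference_page(resources, "American Studies", today=date(2001, 4, 1))
print("New:", [r["title"] for r in new])
print("Established:", [r["title"] for r in established])
```

Because the page is computed from the date each resource was added, nothing has to be moved by hand: a reference leaves the new section simply because time has passed.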
Some of the most important services that we would like to
develop around the portal are those that support collaborative work of various
kinds. Members of the community would be able to log in and be given the appropriate
access and tools for adding objects to the collection, adding to metadata,
posting notices, etc. We would like to provide specialized tools for teaching
and research that are relevant to the member of the community. We envision
the portal as a primary conduit for sharing resources among scholars from
various departments, here at Virginia, and among members of the Information
Community worldwide.