The Fedora Andrew W. Mellon Foundation Grant
Phase 1 of the Open-Systems Fedora Repository Development Project (funded December, 2001)
Introduction
The University of Virginia Library has been building digital collections since 1992. We have amassed a large collection that includes a variety of SGML-encoded etexts, digital still images, video and audio files, and social science and geographic data sets that are being served to the public from a collection of independent web sites that have very little cross-integration. We began searching in 1998 for a digital library management system that could effectively meet both our current and future digital content needs. Like many other libraries, we initially sought a vertical vendor solution that provided a complete, self-contained package for delivering and managing all digital content needs. We investigated a number of commercial solutions, including IBM's Digital Library Software system (later renamed Content Manager) and SIRSI's Hyperion digital media archive system. We started our investigation with the requirement for a digital content repository with a wide variety of features, including scalability to handle hundreds of millions of digital resources, flexibility to handle the ever-expanding list of digital media formats, and extensibility to facilitate the building of customizable tools and services that can interoperate with the repository. Our view is that such repository functionality is the core of a digital library system, providing a means of uniquely identifying each piece of digital content as well as identifying groups of related content or collections. The remaining services and functionality of a digital library system would then be built on top of this core. Our investigations revealed a number of shortcomings in commercial digital library products:
Based on these investigations we decided to embark on an in-house development effort. Modularity and use of open-system standards is fundamental to our design strategy. Such modularity is essential for future evolution through component replacement. We are convinced that an object-oriented design is most appropriate, allowing us maximum flexibility, scalability and, eventually, interoperability with other repositories. We are also convinced that the Library should be providing tools to our users to give them sophisticated access to our collections and to help them manage their own collections. In the summer of 1999, early in our design process, we discovered a paper about the Flexible Extensible Digital Object Repository Architecture (Fedora) written by Carl Lagoze and Sandra Payette at Cornell's Digital Library Research Group, describing the architecture that they had designed. Fedora is a modular architecture built on the principle that interoperability and extensibility is best achieved by the clean separation of data, interfaces, and mechanisms (i.e., executable programs). A Fedora Repository provides a general-purpose management layer for digital objects. In their simplest form, digital objects are containers that aggregate mime-typed streams of data (e.g., digital images, XML files, metadata), known as datastreams. It should be noted that datastreams can be references to external data - either disseminations of other Fedora digital objects, or service requests to remote data sources. This capability allows Fedora digital objects to serve as aggregators and value-added surrogates for existing on-line digital content. In addition to behaving in a generic manner, digital objects must be able to mirror real-world entities by providing access methods that make an object behave in a content-specific manner. For example, a natural behavior for a book would be "Get Table of Contents." 
Fedora allows the association of rich and extensible behaviors with digital objects by "plugging in" generic components known as disseminators. Each disseminator aggregates references to: (1) a formally defined behavior interface that defines a set of methods for a particular kind of digital library resource (e.g. a Book interface), (2) an executable mechanism that runs these methods, and (3) the datastreams that the execution mechanism should use to fulfill specific method requests. These interfaces and mechanisms can, themselves, be stored as digital objects, laying the foundation for unlimited extensibility of the architecture. A major strength of the Fedora extensibility model is that clients can use the generic methods (of the default API) to discover and invoke content-specific methods defined on disseminators. The digital object facilitates the invocation of these extended methods, returning customized disseminations of content to the client. With the Cornell group's help, we installed their research reference software version of Fedora and began experimenting with some of our digital collections. We pretty quickly found that the reference implementation, elegant piece of software that it is, was not what we needed for a large-scale digital library. But we were convinced that Fedora was exactly the conceptual framework that we were looking for. So with the authors' help, we reinterpreted the architecture and implemented it using an SQL database as the backend. Since that time we have built a testbed that includes 500,000 data objects including digital images and a wide variety of XML objects. We have developed a variety of disseminators that provide a rich set of functionality for electronic finding aids, TEI-encoded etexts of letters and books, and for XML-encoded structured collections of art, architecture and archeology images. 
We have also implemented three different object models for images, one for multiple files for the various resolutions of a single scan, one for single-file wavelet-encoded images and one for page images that uses a single compressed TIFF file. In all three cases, the user sees the images from one abstract point of view and is spared the requirement of knowing their format. Most recently, we have begun to do some stress testing of our implementation using software that simulates simultaneous users requesting a realistic mixture of different requests. We have been quite pleased with the results. On a Sun Ultra80 two-processor workstation, simulating 20 simultaneous users making requests with an average delay of 300 milliseconds, response time averages are approximately one half second per request. Note that for most of the XML object transactions this includes a server-side rendering of the XML into HTML, a relatively processor-intensive action. We are in the process of moving the repository to a four-processor, dedicated server, where we will continue our testing. We plan to start scaling our testbed up by duplicating the existing objects repeatedly, running the user tests at 1,000,000 and 10,000,000 objects. We believe that a repository that provides fast access to 10,000,000 objects is a very good starting point for a practical digital library.
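The relationship among digital objects, datastreams, and disseminators described in this introduction can be illustrated with a minimal sketch. It is not the Fedora implementation or its API: the class names, the `mechanism` dictionary, the `list_methods` and `disseminate` methods, the PID "demo:1", and the sample table of contents are all hypothetical, chosen only to show how a generic method can discover and invoke a content-specific one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Datastream:
    """A MIME-typed stream of data aggregated by a digital object."""
    stream_id: str
    mime_type: str
    content: bytes


@dataclass
class Disseminator:
    """Ties a behavior interface to a mechanism and the datastreams it uses."""
    interface: str  # name of the behavior interface, e.g. "Book"
    mechanism: Dict[str, Callable[[List[Datastream]], str]]  # method name -> code
    datastream_ids: List[str]  # datastreams handed to the mechanism


@dataclass
class DigitalObject:
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    disseminators: Dict[str, Disseminator] = field(default_factory=dict)

    def list_methods(self) -> List[str]:
        """Generic method: discover the content-specific methods on this object."""
        return sorted(
            f"{d.interface}.{name}"
            for d in self.disseminators.values()
            for name in d.mechanism
        )

    def disseminate(self, interface: str, method: str) -> str:
        """Generic method: run a content-specific method and return its result."""
        d = self.disseminators[interface]
        streams = [self.datastreams[i] for i in d.datastream_ids]
        return d.mechanism[method](streams)


def get_table_of_contents(streams: List[Datastream]) -> str:
    # Hypothetical mechanism: the first datastream already holds the contents list.
    return streams[0].content.decode("utf-8")


# Hypothetical sample object: a book with one plain-text datastream.
book = DigitalObject(pid="demo:1")
book.datastreams["TOC"] = Datastream("TOC", "text/plain", b"1. Preface\n2. Letters")
book.disseminators["Book"] = Disseminator(
    interface="Book",
    mechanism={"get_table_of_contents": get_table_of_contents},
    datastream_ids=["TOC"],
)

print(book.list_methods())
print(book.disseminate("Book", "get_table_of_contents"))
```

A client in this sketch needs to know only the two generic methods. Everything specific to books is reached through the disseminator, which is the separation of data, interfaces, and mechanisms that the architecture relies on.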
Project Description
We believe that it is time both to start developing a practical implementation of the work that we have been prototyping, and to explore and prototype some of the more complex issues related to the more complete implementation. We propose to do that with input from other members of our community (see Phase 1), so that we develop a good general solution as quickly as possible. We also would like to get the repository software that we produce into the hands of other people who are ready to use and evaluate it. We are convinced that we are on the right track with our implementation of Fedora. It gives us the basic approach that we need to manage all of the digital resources that we are accumulating, while delivering a very high level of service to our users. And we believe that the extensibility of the architecture will allow us to adapt to the rapid technological changes and new content forms that are inevitable. We request funding for this project from the Andrew W. Mellon Foundation based on our belief that the project is closely aligned with the mission of the Foundation of promoting the broad dissemination of scholarly content. In the digital age, such broad dissemination is dependent on core technical developments, the roots of which lie in the research community. The original research and development related to Fedora was undertaken under the auspices of DARPA- and NSF-funded research at Cornell University. Per the general understanding in such research projects, the funding was available for initial concept development, prototype demonstration, and reportage in the form of conference and journal papers. On the other hand, such government-sponsored research funding is not available for the subsequent stages necessary for moving from research to deployable implementation and other aspects of technology transfer including packaging and support.
The Foundation's possible funding of Fedora work by Virginia and Cornell would consequently leverage several years of successful government funded basic research and facilitate the availability of the fruits of that research to the broadest community. Such funding would also benefit from the fact that the NSF funding continues at Cornell and would dovetail into the project as it matures. We are confident that such pairing of funding mechanisms is the best possible model for fostering state-of-the-art advances in digital libraries and scholarly communication. This project also would build from and directly support Mellon-funded projects already underway at Virginia. The Supporting Digital Scholarship (SDS) project, which concentrates on collecting the digital scholarly projects that are being created by humanities scholars in the Institute for Advanced Technology in the Humanities at Virginia, is built around our prototype Fedora implementation. That project has already informed the design that underlies the project described in this proposal and will continue to do so. The version of the repository that results from phase 1 of this project (described below) should become available right at the time that the SDS project delivers policy and technical guidelines for collecting digital projects. This should allow us to implement those policies immediately and begin formally collecting scholarly projects into the digital library. In the same manner, the basic working repository created with this grant will deliver a full suite of management utilities to other Mellon-funded projects underway or envisioned at Virginia. Our work will dovetail with the digital imprint project approved for the University Press of Virginia, and will be immensely useful for the American Studies Information Community project (itself bringing in the Mellon-funded Early American Fiction collection) that is one of the Open Archives Initiative projects recently funded by the Foundation. 
The startup phases of these two projects coincide with the detailed design and implementation phases of the repository project, providing an opportunity for influencing the details of the initial product by providing different content collection and delivery issues to resolve. Then both projects would be able to move directly to concentrating on using the repository to meet the specific needs of publishers and American Studies scholars, respectively. We believe this project is best undertaken in collaboration with our colleagues at Cornell. We find the missions of our two groups to be synergistic, spanning a continuum from basic research, through prototyping, to eventual deployment of reference implementations. The Cornell group works mainly in the basic research and prototype mode. Fedora was originally developed within this research framework, and NSF DLI2 funding currently supports the basic work on policy enforcement and context sensitive behaviors, which we will leverage as described later in this proposal. The Virginia group sees itself functioning as a bridge between the computer scientists doing digital library research and the libraries that are building large digital collections. We believe that the collaborative activities of this project will effectively demonstrate how digital library research can be more immediately deployed in the libraries that it is intended to serve. We propose that we will form a research and development team composed of
people from Virginia and Cornell, with 1.0 FTE added at Cornell to Lagoze's
group, and 2.5 FTEs added at Virginia. The principal investigators will
be Thornton Staples, the director of the Digital Library Research and Development
group in the Library at Virginia, and Carl Lagoze, co-director of the Cornell
Digital Library Research Group. Also, people from the Institute for Advanced
Technology in the Humanities and from the Advanced Technology Group in the
Information Technology and Communications Department, both at Virginia,
will continue their work with Fedora as members of this team. The team will
pursue a three-phase project, as detailed below, with the goal of producing
an open-source reference implementation, which will be available to other
libraries and practitioners as they construct digital library systems. The
first phase involves taking a strong proof of concept (already done) and
producing a package that can be distributed and used in a variety of settings.
The later phases propose extending the results of ongoing research in order
to fill out the system with important functions that a sophisticated digital
library system needs.
Phase 1
This phase will involve finalizing the specifications of the basic Fedora system, implementing that system, and testing it in a variety of deployment scenarios. The resulting product will be an efficient, scalable reference implementation that can be the basis for many different development efforts, one that libraries with a reasonably sophisticated technical staff can use to begin to build their digital library systems. It will include a set of generic modular tools that provide a full set of basic repository management functions. The time period for this phase is assumed to be one year, probably from the time that the programming effort begins. We will continue to build our testbed at Virginia and we anticipate having at least 1 million digital objects of a variety of types ready to test the system that we will deploy in phase 1. An essential part of this phase will be the participation of a select deployment group (distinct from the development group) that will deploy testbeds of their own materials at the same time. In each case, the participants listed below either heard our presentations at the Association for Computing in the Humanities and Digital Library Federation conferences or read the article in DLIB Magazine where we described our work and came to us to find out more. Two of the participating institutions are also digital library groups which will be evaluating the system from that point of view. The rest of the participants are project-oriented humanities groups who will be testing the repository system as a basis for supporting projects rather than for building a general digital library. As a group we will be evaluating the system specifications and planning the system evaluation as the programming is being carried out. At the end of this phase we expect to have at least six implementations of working digital object repositories to evaluate.
We also expect that many if not all of these repositories will continue to give us a rich testbed for later phases of the project. The success of phase one will be determined by the success of the deployment group (consortium participants) in deploying separate testbeds in each of their institutions. Feedback from the consortium members and other users after the public release of the software will be used to evaluate version 1.0 of the software and will guide future enhancements to the repository software. We would like to keep the number of participants to ten or fewer, to make the process more manageable. Currently, the participants include:
Following initial drafting of the specifications by the development group and dissemination to the deployment group in the summer of 2001, the work in this phase will include:
Phase 2
The second phase of this project will concentrate on adding the functionality needed for a repository that supports large-scale digital content creation, storage and delivery efforts. This will involve enhancing and extending the management utilities developed in phase 1, in addition to concentrating on the development areas listed below. We expect that some of the participants from phase 1 will be interested in the problems associated with large-scale production and will be interested in developing a new testbed definition as this phase develops and deploying the new version when it is complete. We will solicit new participants who are well situated to evaluate the work as it progresses. We will also be interested in continuing to work with groups that are interested in deploying smaller repositories to evaluate how these additional functionalities can be used effectively in those settings.
Security and Policy Enforcement
We assume that each digital object in the repository should be able to have a variety of policies associated with it. First among these policies must be those associated with access control. But many other policies are possible, for example preservation policies that describe the events and actions necessary to maintain objects over time. In the area of access control, we recognize the need to specify policies that are both general-purpose and object-specific. Some policies may be defined at the repository level and may address high-level operations such as who can create or delete objects. Other policies may be tailored to the nature of individual objects in the repository. Initially, we will focus investigation in two areas:
Collection Objects
We believe that item-level granularity is not appropriate for all the functionality that we want to build into our system. Indeed, there are a variety of repository functions that should be implemented at an aggregated level, within what we call collection objects. These objects would represent a group of related digital objects and provide a place to describe and document a collection as a whole, as well as to attach computer programs to be used for manipulating and analyzing it. The relationship of collection objects with related items will be either rule-based (for which criteria associated with the collection object are used to locate the objects which are members of the collection) or explicit (in which objects that are members of the collection are enumerated). Collection objects might be used to generalize a specific function across a class of digital objects; for example, a collection object might be used to implement a function such as metadata searching and indexing across its set of constituent objects by accessing a specific datastream in those objects. We will also develop collection objects that can act as templates for large classes of objects, providing a way to streamline the process of updating large classes of objects.
Storage Management
We will develop a storage management system that would allow the repository to control access to one or more file systems that house local datastreams. The processes that create or update a local datastream would, in addition to updating the repository, be responsible for accepting the contents of the datastream from the user and passing it to the file server.
The goal of phase two is to use the results of evaluations of version 1.0
of the software (conducted in phase one) to add new functionality to make
the software usable in large-scale production environments. The features
outlined here are a first impression of what those additional features may
need to include. We expect many of the repositories deployed by consortium
participants in phase one will provide valuable testbeds to conduct additional
testing of the new enhancements. We also expect these testbeds to provide
valuable feedback for both evaluating the new features and for suggesting
additional enhancements for the future.
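The two membership styles proposed for collection objects in this phase, rule-based and explicit, can be sketched as follows. This is an illustration under stated assumptions rather than a design: the `Item` and `CollectionObject` classes, the `members` and `search` methods, the metadata keys, and the sample PIDs are all hypothetical, and a plain dictionary stands in for the metadata datastream that a real collection object would read.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Item:
    """Stand-in for a digital object; metadata plays the role of one datastream."""
    pid: str
    metadata: Dict[str, str]


@dataclass
class CollectionObject:
    pid: str
    description: str
    explicit_members: Set[str] = field(default_factory=set)  # enumerated PIDs
    rule: Optional[Callable[[Item], bool]] = None  # membership criterion

    def members(self, repository: List[Item]) -> List[Item]:
        """Return items that are enumerated explicitly or that satisfy the rule."""
        return [
            item
            for item in repository
            if item.pid in self.explicit_members
            or (self.rule is not None and self.rule(item))
        ]

    def search(self, repository: List[Item], field_name: str, term: str) -> List[str]:
        """Search one metadata field across the collection's members only."""
        needle = term.lower()
        return [
            item.pid
            for item in self.members(repository)
            if needle in item.metadata.get(field_name, "").lower()
        ]


# Hypothetical sample repository contents.
repository = [
    Item("demo:1", {"type": "image", "title": "Sample image A"}),
    Item("demo:2", {"type": "etext", "title": "Sample letters"}),
    Item("demo:3", {"type": "image", "title": "Sample image B"}),
]

# Rule-based collection: membership is computed from a criterion.
images = CollectionObject(
    "demo:images",
    "All image objects",
    rule=lambda item: item.metadata.get("type") == "image",
)

# Explicit collection: membership is an enumerated set of PIDs.
picked = CollectionObject(
    "demo:picked",
    "Hand-selected objects",
    explicit_members={"demo:2", "demo:3"},
)

print([item.pid for item in images.members(repository)])
print([item.pid for item in picked.members(repository)])
print(images.search(repository, "title", "image a"))
```

In this sketch a rule-based collection picks up newly added objects without any further bookkeeping, while an explicit collection changes only when its list of PIDs is edited.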
Phase 3
The third phase of this project will concentrate on extending the facilities in the repository that provide more sophisticated delivery of end-user experiences in a large-scale digital library. This will include extending the functionality of disseminators, adding services that are important for collecting scholarly projects and publications, as well as overall optimization of the system. As with phase 2, we hope that the deployers from earlier phases of the project will be interested in continuing with this phase. We will also solicit new participants who are well situated to evaluate the work as it progresses.
Editions and Versioning of an Object
The repository must make it possible to retain and provide on demand every version of an object if desired. We propose to offer a standard way to make a new edition of an object available as a separate object in the repository, as well as to make it possible to track every change to an object within the object itself. A new edition of an object will be a completely new object. It will have a new PID, its own metadata, etc. There will be a field in the system metadata that contains the PID of the object from which it was derived. Versions of the components of an object will be kept in the object. The create date for each version of each component of the object will indicate the date and time that the version became current. The version of the whole object on a given date and time could be disseminated by giving the date and time as an extension of the PID. The version of each component in the object with a create date and time most nearly previous to the given date and time would be used in the dissemination.
Dynamic, Context Sensitive Behaviors
We envision scenarios where the predefined disseminations on an object will not be appropriate to a given usage context. In certain cases, our collaborators may wish to reuse each other's objects in new ways.
One option is for repository managers to create new disseminators on objects to meet such needs as they arise. Another interesting approach is to provide a mechanism for exposing a special kind of structural metadata about an object that enables a third party to: (1) learn about the nature of the object's raw content, and (2) access relevant parts of that content in a format that facilitates reuse. In a way, we can think of this as enabling "just-in-time" disseminators for an object. We envision implementing this scheme by introducing a new service into the repository architecture: a context broker service. We anticipate a time when all of our collaborators are running repositories using the repository software we develop. Each site can also run a context broker service whose purpose is to contextualize the experience of objects in other collaborators' repositories.
Efficiency and Scale Optimization
Though we will be attempting to optimize each module of the system as we develop it, we believe that we need to devote part of the last phase of the project to optimizing the integrated system. We need to ascertain that the repository can support hundreds of millions of objects with 50 simultaneous users, in a realistic combination of user requests and repository management processes. If the proposed scale proves to be impossible, we will investigate other strategies, such as coordinated, multi-repository installations.
The goal of phase three is to continue to evaluate and enhance the software,
building upon input received from consortium participants and others in
the open source community who are actively using the repository software.
We expect the version of software that emerges at the end of phase three
to be capable of supporting large-scale digital content and delivery efforts.
We also expect the software to be capable of providing the necessary services
that are important for collecting scholarly projects and publications and
provide tools for end-users to discover and manipulate content in the repository.
We anticipate the success of phase three and the project as a whole will
be judged by the experiences of consortium members and others in implementing
the software to manage large scale digital collections. We also envision
that the various implementations of the software will offer rich testbeds
for future projects.
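The date-based selection rule described under Editions and Versioning of an Object (for each component, use the version whose create date is most nearly previous to the requested date and time) can be sketched as below. This is a hedged illustration only. The `ComponentVersion` class, the `version_at` and `object_at` functions, and the sample component names and dates are hypothetical, the sketch treats a version created exactly at the requested time as current, and it leaves open how the date and time would be written as an extension of the PID.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ComponentVersion:
    created: datetime  # date and time this version became current
    content: str


def version_at(
    versions: List[ComponentVersion], when: datetime
) -> Optional[ComponentVersion]:
    """Return the version with the latest create date not after `when`, if any."""
    eligible = [v for v in versions if v.created <= when]
    return max(eligible, key=lambda v: v.created) if eligible else None


def object_at(
    components: Dict[str, List[ComponentVersion]], when: datetime
) -> Dict[str, str]:
    """Assemble the object as it stood at `when`, one version per component."""
    snapshot: Dict[str, str] = {}
    for name, versions in components.items():
        chosen = version_at(versions, when)
        if chosen is not None:  # the component did not exist yet otherwise
            snapshot[name] = chosen.content
    return snapshot


# Hypothetical object with two components and their version histories.
components = {
    "text": [
        ComponentVersion(datetime(2001, 1, 1), "draft"),
        ComponentVersion(datetime(2001, 6, 1), "revised"),
    ],
    "image": [
        ComponentVersion(datetime(2001, 3, 1), "scan-v1"),
    ],
}

print(object_at(components, datetime(2001, 2, 1)))
print(object_at(components, datetime(2001, 4, 1)))
print(object_at(components, datetime(2001, 12, 31)))
```

Because every component keeps its own history, the object as of any past moment can be reassembled without storing whole-object snapshots. That is the trade-off the proposal makes by keeping versions inside the object and reserving new PIDs for new editions.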
Implementation Plan
The project will extend over a three-year period, anticipating approximately a year to complete each of the three phases outlined above. Evaluative input from the consortium participants about software features and performance may necessitate changes to the planned Phase Two and Three activities. Obviously, delays in hiring or unplanned technical issues may require adjustments, but an approximate timeline of events includes the following:
Budget
The costs associated with this project will predominantly be the costs of personnel. We are requesting $1,000,128 from the Mellon Foundation to provide 3.5 full time equivalent staff, including a Technical Coordinator and staff to work on the design and programming of the system proposed, plus funding for equipment for those people and funding to provide travel expenses for 4 meetings of the development team per year for each of three years. The Technical Coordinator position will be in the Digital Library Research and Development (DLRD) group and will report to Thornton Staples. We expect this person to participate in design and implementation discussions with the research and development team and to be the primary point of contact for the deployment group as they begin to deploy the software. He or she will test the software as it evolves using the Virginia testbed. This person will not necessarily be a high-level programmer but will need to be very technically sophisticated, as well as a good communicator and organizer. We see this position as key to organizing the project and keeping all of the participants in sync. This person will coordinate all of the activities associated with the project, including organizing meetings, conference calls and other communications, as well as overseeing any administrative needs. The 2.5 FTEs of programming time will be divided between Virginia and Cornell. These will be high-level programming positions that we believe are necessary to design and develop the software required for the proposed system. 1.5 FTE programmers/Virginia: The 1.5 FTE at Virginia will be divided between the DLRD (1.0 FTE to be supervised by Ross Wayland) and the Advanced Technology Group (ATG) (.5 FTE to be supervised by Tim Sigmon, director of the group). We believe that by placing these positions in these three software development groups we strengthen the connections among them to continue the collaboration that has produced the prototype.
Note that 25% of Staples' time and 50% of Wayland's that has been committed to developing the prototype will continue to be devoted to this project. Also, the ATG will match the .5 FTE from the grant, plus Tim Sigmon will continue to dedicate 10% of his time to the project. The commitment that has been made by the participants in the deployment
group was to cover expenses of their participation themselves. We certainly
will use the Technical Coordinator position to make that commitment as easy
as possible. Also, we have a verbal commitment from Daniel Greenstein, the
director of the Digital Library Federation, to cover the expenses of the
fall meeting.
Appendix A.