University of Virginia
University of Virginia Library
Digital Initiatives: Research and Development

The Fedora™ Phase 2 Andrew W. Mellon Foundation Grant

Phase 2 of the Open-Systems Fedora™ Repository Development Project (funded June, 2004)

1. Motivation

The Fedora project has been devoted to the goal of providing open-source repository software that can serve as a foundation for many types of information management systems including institutional repositories, digital libraries, content management systems, and others. Positioned as a general-purpose repository service, Fedora is not constrained by the particulars of any one type of institution or use-case. Instead, the system’s goal is to be flexible and extensible enough to serve multiple purposes. We believe that we have succeeded in this goal, as evidenced by the array of institutions who have been early adopters including academic computing groups, research libraries, archives, publishing societies, government agencies, and commercial vendors. We have been fortunate to receive valuable feedback and suggestions from many of these institutions, and now seek to incorporate their insights into a second phase of development.

The major goal in the next phase of our project is to evolve the Fedora software to a state in which it makes the following guarantees:

  1. Fedora will make it easy for institutions to create and maintain digital objects in the context of current and future workflows.

  2. Fedora will ensure performance and longevity for repositories of 10 million objects.

  3. Fedora will facilitate cross-institutional collaboration by providing the ability to federate repositories and create distributed, virtual collections (“Fedorations”).

  4. Fedora will provide improved searching for individual repositories and multiple repositories. This will include the ability to use external search engines to selectively index content from datastreams and disseminations within objects.

These areas of focus are proposed as the result of careful consideration of what we have learned from our Fedora partners, from the recent Mellon retreat, and from workshops and discussion at major forums including DLF, CNI, OAI, and Library of Congress NDIIPP. The resounding conclusion is that Fedora is a well-designed and flexible system that can solve a broad range of problems. Also, it is observed that Fedora’s extensibility features position it extremely well, relative to other solutions, to be adaptive to changing requirements over time. This is particularly notable in the areas of content repurposing, content interoperability, and associating new services with content.

Despite its success thus far, Fedora can still better target the needs of three very important constituencies: institutional repositories, research libraries and archives with very large, complex collections, and providers of educational software (e.g., E-Portfolio, Lionshare, Sakai, VUE). Fedora Phase 2 is targeted to eliminate barriers to entry for these contexts and to provide additional functionality to meet advanced requirements of each. We discuss the requirements for each context below.

Institutional Repositories:

In the area of institutional repositories, Fedora is gaining recognition as one of several viable open-source solutions. In January 2004, Fedora was reviewed in the influential Open Source Society’s Guide to Institutional Repository Software. In February 2004, Fedora was represented at a major workshop on institutional repository software at OAI3 in Geneva. Although DSpace has made the most significant impact in this area by breaking through social and political barriers to adoption, we now see institutions evaluating their technical options for implementing institutional repositories. Fedora is one of the most robust open-source repository systems currently available; however, to date, it has not been as widely adopted for this use case as were early players such as DSpace and Eprints.org. Fedora has now begun generating great interest due to its extensibility, web-services orientation, and general forward-thinking design.

To solidify Fedora’s position in the institutional repository domain, we must focus on a few missing pieces: better content submission interfaces, better support for searching, and new preservation features. The institutional repository context demands front-end workflow interfaces and tools to enable easy end-user submission of content into a repository. Fedora Phase 2 proposes the development of tools, utilities, and new interfaces for creating objects and submitting content. One thread of work is to develop a DSpace-like workflow on top of a Fedora repository. Other work is proposed in the area of tool development, especially configurable “data workbenches” to assist scholars and other users in creating and submitting more complex objects. Once objects are stored in an institutional repository, search and discovery become major requirements. Fedora Phase 2 proposes better integration of Fedora with third-party search engines. Also, we propose supporting alternative search interfaces to the native Fedora indexes that conform to emerging international standards such as SRW. Finally, long-term preservation of content is a fundamental requirement of institutional repositories. Fedora Phase 2 proposes to develop a set of integrity and management features that will directly address this requirement (described in section 2).

Research Libraries:

Early adopters of Fedora have recognized its unique capabilities in being able to “normalize” heterogeneous digital resources in one system, re-purpose content for different audiences, and integrate information that was previously disaggregated. Building an integrated digital library upon Fedora requires an institution to engage in information modeling, object design, and development of tools and utilities for migrating existing digital collections into Fedora. In the long run the payoff is substantial; however, the effort requires devoting technical resources that are in short supply in research library organizations. To lower the barrier to entry for the average research library, Fedora Phase 2 proposes to develop a suite of tools to support migration of digital collections into Fedora, and to support the large-scale production of new data by these institutions. The complex, heterogeneous nature of digital collections in research libraries demands flexible content submission processes that handle batches of content and metadata to form the basis of multiple classes of objects with a variety of inter-relationships.

Migrating existing collections into a managed repository is the first step towards building large, virtual collections among collaborating institutions. In November 2003, the Digital Library Federation (DLF) announced its intent to commit to the vision of a collective and collaborative digital library under the auspices of the “Distributed Open Digital Library” initiative:

Participants in the Digital Library Federation's (DLF's) 2003 Fall Forum heard Michael Keller, chair of the DLF Steering Committee, announce the organization's intent to create something first envisioned at the time of the DLF's founding in 1995—a collaborative digital library that will provide global electronic access to collections in multiple institutions. The collaborative library—tentatively called the Distributed Open Digital Library, or DODL—will begin with public-domain materials in the humanities and social sciences and will incorporate numerous service layers, including an extensive finding service. [Report by Jerry George, CLIR, http://www.clir.org/pubs/issues/issues36.html#dlf]

The Fedora repository system is excellently positioned to become a major component of the DLF envisioned architecture. For example, a plan for a multi-institutional federation of image collections is currently being discussed by several Fedora partners (UVA, Tufts, Northwestern, and ArtSTOR). As part of Fedora Phase 2, we propose enhancements to the base Fedora system to better enable the creation, management, and delivery of distributed, virtual collections. A virtual collection is seen by the end-user as a seamlessly integrated collection of digital objects; behind the scenes the content and services that make up the objects are stored and hosted by separate repositories, but are sharable among them all. Although there have been prior attempts to integrate content from different institutions, the technical approach has often been ad hoc, and mostly focused on distributed searching (e.g., Z39.50 searching over distributed catalogs). The Fedora approach provides a deeper infrastructure for true federation where content and services can be flexibly shared, composed, and re-composed to form complex objects made up of distributed components, and ultimately entire collections made up of distributed objects.

Content Sharing and Education Applications:

The recent Mellon Retreat (February 2004) unveiled synergistic opportunities between Fedora and a variety of projects focused on content sharing and education. The VUE project at Tufts has already made significant headway in using Fedora as an underlying repository for storing educational content as “content maps.” Two other Mellon-funded projects, E-Portfolio and LionShare, have expressed direct interest in Fedora as an underlying content store. We anticipate further opportunity in combining Fedora with a wide range of educational and courseware applications, for example Sakai and UPortal.

The E-Portfolio vision is to create an application architecture for storing and managing life-long portfolios of personal and professional records. E-Portfolio presents three key requirements that are relevant to Fedora: (1) distributed, interoperable repositories, (2) portability, and (3) content integrity and preservation. Not only will E-Portfolios comprise information from multiple institutions and systems, E-Portfolio content must be able to move seamlessly among such institutions and systems. Also, since the goal of E-Portfolio is to maintain records from “K to grey,” longevity of content is a key success factor. The E-Portfolio project provides an excellent context for demonstrating distributed Fedora repositories. Fedora can provide a repository infrastructure to enable the creation of virtual E-portfolios that are composed of pieces of content stored among many repositories. Fedora can also provide a standard, interoperable means of exchanging E-portfolio content. The E-Portfolio project can inform the Fedora project by defining requirements in the area of preservation and content longevity.

The Lionshare project also presents opportunity for collaboration. Lionshare is developing a peer-to-peer architecture for sharing educational materials in a secure manner. Digital repositories are identified as a major source of content to be shared in Lionshare networks. Fedora’s open APIs can be exploited within the Lionshare network to provide secure sharing of content datastreams and disseminations of Fedora digital objects. Also, the OKI-to-Fedora adaptor developed by the Tufts VUE project can provide a bridge to Fedora repositories from within the network. The Fedora and Lionshare project teams have already begun discussions on how to build authentication and authorization schemes that can enable Fedora and Lionshare to work together in a secure manner. Furthermore, both projects have similar requirements for security within peer-to-peer federations, particularly in how peer services authenticate to each other. In Fedora Phase 2, we propose increased technical sharing and coordination of the development of security architectures between Fedora and Lionshare. Also, there is work to be done in the area of searching across peer-to-peer networks - a problem shared by Lionshare and Fedora federations.

Summary:

In Phase 2 of the Fedora project, we propose continued development of the Fedora technical architecture including: adding new services, tools, and utilities; optimizing for scale and performance; adding new integrity and preservation features; and enabling the creation of peer-to-peer networks of repositories (“Fedorations”). This work will be motivated by the specific requirements of institutional repositories, extremely large digital collections and archives, and distributed educational applications. In the next two sections we describe the proposed technical work in more detail, and a development plan for a three-year period. The development plan is prioritized to first focus on functionality that will make it easier for institutions to get a jump start in using Fedora – specifically by easily loading heterogeneous digital collections into Fedora repositories. As the project proceeds we will add functionality that helps Fedora users move towards building large-scale, highly dependable repositories. This will provide the basis for a shared, seamless information space in which virtual collections and networked objects can be fully realized.

2. Proposed Technical Work

In the three-year project described here, we propose to serve our targeted constituencies by focusing our efforts on core functionality in four main areas: (1) object creation utilities and workflow support, (2) repository federation and distributed collections, (3) performance and longevity for content and services, and (4) enhanced search and index. The anticipated deliverable of this effort will be a robust and mature Fedora that will be ready for a permanent home in the open-source community. The $1.4 million that we will be requesting will make it possible to fund the existing Fedora development team, with the addition of one full-time programmer, bringing the programming staff up to 3.5 FTE. Additionally, we request funds to expand the server to be capable of hosting a test repository of the necessary size to support the performance testing and tuning aspects of the project. A sketch of work to be done in the four areas follows.

Object Creation and Workflow Support

This area covers some of the biggest missing pieces for easy startup and use by research libraries with complex collections and relatively sophisticated academic users. As we said above, we are aiming to provide broad, generic functionality and we believe that we are in a position to make major progress on this front. We have been developing the idea of content models which are, in essence, class descriptions for objects that share the same kinds of datastreams and the same set of disseminators. We plan to formalize these content models and build the submission and modification processes around them. The work that we have done on object relationships also provides an advantage here. Relationship metadata provides a formal, rule-based way to refer to embedded references from a parent object to its children. There are a variety of approaches to organizing data that use an XML file containing references to data in other media, as well as to other XML files, that lend themselves to a general framework for workflow.

Our proposed work will focus on two areas: content submission interfaces and industrial-strength batch processing. The first would include a range of functions, from the submission of simple documents with no disseminators to submitting objects with custom disseminators. This functionality will be delivered through a web interface that allows the user to easily define a content model for their object or set of objects and to upload their data. Simple, configurable workflows will be enabled, most notably a multi-stage submission process with review points (similar in concept to the DSpace workflow). As a general principle, Fedora interfaces and tools will be built with the metaphor of a “workbench” where users compose information entities in a familiar way, and Fedora objects are transparently created behind the scenes. Consider the following examples:

  • Simple document submission: there are many situations where it is desirable to submit individual documents with a single Dublin Core record to a Fedora repository. An actual use case that has been identified is an administrative archive where all electronic communications of a university office are to be submitted into a Fedora repository. A general-purpose content model can be pre-defined for this purpose that sets up a template for selecting a document from a user’s local file system, and adding simple metadata fields. A web form can be developed that prompts a user for this basic information and creates a Fedora object behind the scenes.


  • Simple workflow: there are many cases in which a simple document submission is not sufficient to meet organizational requirements. A simple workflow model must be supported that allows a document to go through multiple stages of review and approval before it is submitted into Fedora. This is a notable requirement of the institutional repository use case, as has been observed in the DSpace project. DSpace has a simple workflow system, which models the workflows as 5 steps: SUBMIT, three intermediate steps (STEP1, STEP2, STEP3), and ARCHIVE. The DSpace workflow model is useful and simple. Unfortunately, the DSpace code is not constructed in a manner that allows it to be lifted out and plugged into another system like Fedora. However, it will be a relatively straightforward task to implement equivalent functionality that will interface with a Fedora repository. We plan to develop a workflow module that is configurable and pluggable. Ideally, it can work with Fedora, and be easily adapted to use with other systems.


  • Data workbenches: there are many interesting cases where scientists and scholars want to create complex objects composed of data, scripts, images, text, and dynamic behaviors. The information assets that are produced by the scholarly process are typically rich and multi-dimensional. A typical example is an object that represents the results of an experiment. Fedora is well-positioned to accommodate such complex objects. The challenge is to make it easy for scholars to create them. We envision a special set of tools and interfaces that can be used to assemble multiple forms of data and content, establish the relationships among them, and define different presentations of the content. Data workbenches are the means of presenting the scholar with an intuitive workspace to assemble their work and have Fedora objects transparently created for them. The underlying modeling scheme will be based on the formalization of Fedora content models.
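
The five-step review model mentioned in the simple workflow example (SUBMIT, three intermediate steps, ARCHIVE) can be pictured as a small state machine. The following is a minimal sketch only, not DSpace or Fedora code: the `Submission` class, the `approve` and `reject` method names, and the policy that a rejection returns the document to SUBMIT are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Step names follow the five-step model described in the text.
STEPS = ["SUBMIT", "STEP1", "STEP2", "STEP3", "ARCHIVE"]


@dataclass
class Submission:
    """Hypothetical record tracking one document through review."""
    title: str
    step: str = "SUBMIT"
    history: list = field(default_factory=list)

    def approve(self, reviewer: str) -> str:
        """Advance to the next step; ARCHIVE is terminal."""
        if self.step == "ARCHIVE":
            raise ValueError("submission is already archived")
        self.history.append((self.step, "approved", reviewer))
        self.step = STEPS[STEPS.index(self.step) + 1]
        return self.step

    def reject(self, reviewer: str) -> str:
        """Assumed policy: a rejection sends the document back to SUBMIT."""
        if self.step in ("SUBMIT", "ARCHIVE"):
            raise ValueError(f"cannot reject at step {self.step}")
        self.history.append((self.step, "rejected", reviewer))
        self.step = "SUBMIT"
        return self.step
```

In a pluggable module of the kind proposed, the point at which a submission reaches ARCHIVE would be where the object is actually ingested into the repository; that hook is deliberately left out of this sketch.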

The above examples focus on how to get a single object of a certain nature into Fedora. The generalization of this problem is how to get many such objects into a repository. Batch processing will also be designed with a “workbench” approach. Utilities and tools will be built around the assumption that a content model is the basis for defining a batch. A batch workbench will provide a control station for: (1) defining content models, batch inputs, outputs, (2) declaring relationships among objects, (3) running batch submissions and updates to groups of objects, and (4) monitoring and tracking progress of the batch. We will build the system around a basic concept of macro workflow that will leave hooks that can be used to plug in elaborate workflow processes for specific applications.

“Fedoration”: Federated Repositories and Distributed Collections

The Fedora architecture, as originally designed by Payette and Lagoze, was intended to support distributed, federated repositories. This vision has already been described in the Fedora Specification and placeholders exist in the current software to build this functionality.  Repository federation is important for several reasons. First, federation is a natural requirement for delivering integrated access to digital resources that are owned or managed by several institutions. Second, federation makes it easy for digital library and other applications to interface with multiple information sources in a seamless manner. Third, federation can help with scalability or performance issues for very large repositories. Specifically, a local federation of repositories can be established as a means of distributing load and object storage among several running repository instances. Together these separate instances can be treated as one ‘virtual repository.’

There are two basic technical approaches to achieving federation. The first would be to take a peer-to-peer approach. This would entail implementing a set of features that would allow repositories to have co-awareness, and the ability to forward requests to each other for content and services. In this scenario, distributed collections could be defined through the use of “collection objects” that enumerate the PIDs of member objects – irrespective of what repository the member objects were hosted by. New technical developments required for this scenario include: (1) a shared name resolution service, (2) a secure trust scheme that enables repositories to authenticate to each other (e.g., via PKI certificates), and (3) better support for external search services, including enhancements that make it easier for external indexers to disseminate or harvest metadata from a set of federated repositories.
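
To make the peer-to-peer scenario more concrete, the sketch below shows how a collection object that enumerates member PIDs might be combined with a shared name resolution service to find out which repository hosts each member. It is an illustration under stated assumptions, not part of the Fedora API: the resolver table, the PIDs, the repository URLs, and the function name `group_members_by_repository` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical name-resolution table: PID -> base URL of the hosting repository.
# The PIDs and URLs are invented for illustration.
RESOLVER = {
    "uva:1001": "http://repo.uva.example/fedora",
    "uva:1002": "http://repo.uva.example/fedora",
    "tufts:77": "http://repo.tufts.example/fedora",
}

# A "collection object" here is just a named list of member PIDs,
# irrespective of which repository hosts each member.
collection = {
    "pid": "fedoration:images",
    "members": ["uva:1001", "tufts:77", "uva:1002", "nwu:5"],
}


def group_members_by_repository(collection, resolver):
    """Return ({repository URL: [PIDs]}, [unresolvable PIDs])."""
    by_repo = defaultdict(list)
    unresolved = []
    for pid in collection["members"]:
        repo = resolver.get(pid)
        if repo is None:
            unresolved.append(pid)
        else:
            by_repo[repo].append(pid)
    return dict(by_repo), unresolved
```

Grouping members by host lets a repository forward one batch of requests per peer rather than one request per object; members that cannot be resolved are reported separately instead of being silently dropped.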

A second approach to federation is the development of an explicit federation service that would aggregate several Fedora repositories and act as a proxy or mediator to a set of distributed Fedora repositories. The service would directly handle distributed searching over the set of repositories in the federation, and would be responsible for routing incoming requests to the appropriate repository. To support distributed searching it may be desirable to enhance the existing search API for Fedora repositories, or consider implementing an alternative search interface such as SRW/Z39.50. Since the development of a federation service is a significant undertaking (i.e., it introduces a major new component into the Fedora architecture), we propose first pursuing the peer-to-peer approach, and later developing an explicit federation service if the features of the peer-to-peer federation prove incomplete for meeting the requirements of our Fedora partners.

Power Server: Performance and Longevity

The next phase of work on the Fedora server should directly address the issue of robustness for large-scale operations, as well as preservation. The current repository software has been designed to perform for 1 million objects. Many Fedora partners and evaluators expect their repositories to grow to levels significantly above this threshold. Therefore, we will conduct a series of simulations and controlled experiments to establish benchmark data for repositories containing 10 million objects or greater.

Our performance testing will focus on three critical areas: (1) the ingest process, (2) the dissemination process, and (3) searching. As part of this project, we will analyze data collected from performance measurements and simulations, detect bottlenecks, and set goals for performance improvement. We will then choose from several strategies for meeting the performance goals. For code-oriented bottlenecks, we will selectively re-engineer modules as necessary. We do not anticipate major re-engineering but are open to the possibility. More likely strategies will include database tuning and caching strategies. If run-time performance suffers due to concurrency issues during times of high stress on the system (i.e., many concurrent requests), we will introduce a load balancing mechanism into the architecture. A load balancer would distribute requests among several Fedora server instances that act as a single repository.

In the realm of longevity, we propose the introduction of a set of preservation features targeted at maintaining the integrity of digital objects. A basic requirement is the development of a format registry to record the identity and key attributes of known content formats. Ideally, we can position Fedora to take advantage of the emerging Global Digital Format Registry (GDFR) that is currently being specified by the DLF. A format registry can form the basis for a suite of utilities for monitoring digital assets. We propose integrating third-party tools when available, for example JHOVE, which is an open-source utility for validating the integrity of byte streams based on format identity and attribute profiles. Another important area of work is the exploitation of the Fedora audit trail (event history). We propose further development of the Fedora event history schema, and the integration of an event monitor into the base Fedora architecture. To support preservation management, we propose developing a set of analysis tools to help detect patterns of decay, obsolete or at-risk content formats, and referential integrity problems among related objects and services.

Support for Search and Index Services

In the first Fedora project we delivered a default search service. The index for this service includes Fedora system metadata fields and OAI-compliant Dublin Core metadata records. The Fedora search API provides simple field-based searching, and also supports the built-in OAI Provider module. By the conclusion of Phase 1, we will enhance the default index to include relationship metadata, and we will support extended OAI-PMH features that include harvesting “sets” of items, harvesting by metadata format, and harvesting incrementally by last modification date.

All of the target use cases discussed in section 1 will require advanced searching beyond the default search that is currently provided. It is essential to enable indexing of metadata and content beyond the fields in the default Fedora index. Rich metadata can exist in any Fedora datastream, as can content such as narrative text. It may also be available dynamically as the result of a Fedora dissemination. The proposed work of Fedora Phase 2 will provide direct support for indexing these arbitrary datastreams and disseminations using external third-party search engines. Fedora will support a plug-in capability so that indexing software can be easily configured with a Fedora repository. A notification module will be built to receive and dispatch metadata and content change events. The notification module will be responsible for sending messages to subscribing search services to indicate that certain datastreams or disseminations have changed and should be re-indexed. The proposed design provides both a loose coupling of external search services with Fedora, as well as an event-based communication model to enable incremental updates of external indices.

Although some applications require a single index of all objects in a repository, many other applications may require multiple indices supporting different types of queries or different collections in a given repository. In Fedora Phase 2, each plug-in search service will be able to request the particular kinds of datastreams or disseminations that it should index. We intend to provide one out-of-the-box plug-in search service that will index XML metadata. This service will be built upon an open-source XML search engine (e.g., Lucene or eXist).
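
The event-based design described above, in which each plug-in search service subscribes only to the kinds of datastreams it wants to index, resembles a publish/subscribe pattern. The sketch below illustrates that pattern in-process; it is not the proposed Fedora module. The class names `NotificationModule` and `ReindexQueue`, their methods, and the datastream IDs used in the usage note are hypothetical.

```python
from collections import defaultdict


class NotificationModule:
    """Hypothetical dispatcher for datastream change events."""

    def __init__(self):
        # Maps a datastream ID (e.g. "DC") to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, datastream_id, callback):
        """Register a search service's callback for one kind of datastream."""
        self._subscribers[datastream_id].append(callback)

    def publish(self, pid, datastream_id):
        """Notify every subscriber of this datastream kind; return how many."""
        callbacks = self._subscribers.get(datastream_id, [])
        for callback in callbacks:
            callback(pid, datastream_id)
        return len(callbacks)


class ReindexQueue:
    """Stand-in for an external search service: records what to re-index."""

    def __init__(self):
        self.pending = []

    def on_change(self, pid, datastream_id):
        self.pending.append((pid, datastream_id))
```

For example, a metadata index could call `subscribe("DC", queue.on_change)` and would then be told about changes to Dublin Core datastreams only; a change to an image datastream would reach no subscriber. The repository never needs to know how any index works, which is the loose coupling the design aims for.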

An important area of work for Phase 2 is providing searching against RDF-based relationship metadata. Fedora relationship metadata (planned for Fedora version 1.3) is the means of expressing a variety of object-to-object relationships such as member/collection, part/whole, parent/child, equivalence, and derivation. The National Science Digital Library (NSDL) will be working with the Fedora team to exploit the potential of relationship metadata. We will investigate the use of an RDF-based indexing system (e.g., Jena) to provide advanced search capabilities over relationship metadata expressed in RDF. This will enable a range of interesting questions to be answered pertaining to both explicit and implied relationships among objects. For example, in the NSDL context we may want to ask: “What objects are members of the 9th-grade Math Collection, are certified by Authority A, and are also members of the Best Curriculum Collection?”
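
The NSDL question above is a conjunction of three relationship patterns, which can be illustrated without an RDF store. The sketch below swaps in plain Python sets of (subject, predicate, object) triples in place of an RDF index such as Jena; the object identifiers, collection identifiers, and predicate names (`isMemberOf`, `certifiedBy`) are invented for the example and are not Fedora's relationship vocabulary.

```python
# Relationship metadata as (subject, predicate, object) triples.
# All identifiers and predicate names are invented for illustration.
TRIPLES = {
    ("obj:1", "isMemberOf", "coll:math9"),
    ("obj:1", "isMemberOf", "coll:best"),
    ("obj:1", "certifiedBy", "auth:A"),
    ("obj:2", "isMemberOf", "coll:math9"),
    ("obj:2", "certifiedBy", "auth:A"),
    ("obj:3", "isMemberOf", "coll:math9"),
    ("obj:3", "isMemberOf", "coll:best"),
    ("obj:3", "certifiedBy", "auth:B"),
}


def subjects(triples, predicate, obj):
    """Return the set of subjects that have the given predicate and object."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def certified_members_of_both(triples):
    """Objects in both collections that are certified by Authority A."""
    return (
        subjects(triples, "isMemberOf", "coll:math9")
        & subjects(triples, "isMemberOf", "coll:best")
        & subjects(triples, "certifiedBy", "auth:A")
    )
```

Each pattern yields a set of matching objects and the answer is their intersection. A real RDF query engine would do the same matching over an indexed graph and could also follow implied relationships, which this sketch does not attempt.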

Searching across a federation of repositories is also an important requirement in the next phase of work. We will enhance the index building functionality such that it functions appropriately across a federation, allowing indexes to be built from distributed content and metadata, as well as supporting the federation of indices supported by each of the members in the federation.

Federation searching becomes more interesting when relationship information is available federation-wide. As a final piece of work, we propose the development of a search service that represents the entire federation as a graph of objects. This work can build upon the RDF-based search service described above and can provide an integrated view of all objects in all repositories of a federation. This would provide a means of running queries to support complex analysis and inferencing about the objects that are members of a federation, ultimately making it possible to provide interesting browsing choices based on the results of the query. It would also make it easier to examine dependencies among distributed objects and services, which is important for many curatorial and preservation activities.

Sustainability

A goal of Fedora Phase 2 will be to devise a sustainability model for the Fedora open-source software. To guarantee sustainability of Fedora, we will initiate a two-pronged approach, one being a maintenance organization, the other a development consortium.

Maintenance:

To ensure that Fedora has a home for the foreseeable future, the University of Virginia Library will commit to continuing basic maintenance of the open-source code for five years after the proposed project. This would entail making sure the Fedora software is available to the public on a well-managed server, maintaining the Fedora web site, and fixing bugs as they arise. At Cornell University, researchers and developers involved in Fedora are part of a grant-funded research group in the Information Science program. All members of the group are 100% grant-funded. Therefore, Cornell Information Science is not able to commit a line to long-term maintenance of Fedora. However, Cornell intends to remain active in the Fedora project by continuing to seek funding for ongoing research and development around Fedora, the results of which will be contributed back to the Fedora open-source project. It should also be said here that VTLS has committed to their customers (and publicly stated) that, if needed, they would pick up the maintenance of the code and keep it open source.

Development Consortium:

The ideal long-term sustainability model for Fedora is a development consortium which would take shared responsibility in maintaining the software, evolving the code base, defining and prioritizing new requirements, and coordinating development. We plan to investigate a variety of models for a development consortium that is balanced among academic and corporate partners. In devising the Fedora sustainability model, we will examine several existing organizations, including Sakai, W3C, OASIS, and alphaWorks. We will also consult with Kevin Guthrie of Ithaka for guidance. We have already begun discussions with several Fedora partners who have expressed an interest in offering code, services, or seed money to support the long-term maintenance of Fedora. Both Cornell and the UVA Library intend to provide leadership in the design and incubation of the envisioned organization, and to continue as active participants upon completion of the Fedora Phase 2 development project. We have already had a number of institutions indicate their interest in being part of a development consortium, including Northwestern, University of Delaware, the Australian ARROW affiliates, VTLS, Harris Corporation, Cornell Information Science, and UVA. We envision the development consortium as being the long-term sustainability solution for Fedora.

Fedora Open-Source Philosophy:

For organizations such as libraries and archives that are developing systems to manage and preserve information over very long timeframes, open-source software is invaluable. Academic institutions with a moderate level of technical expertise can use Fedora to build a variety of information management applications. Fedora provides a well-designed, well-integrated architectural substrate that would be expensive and difficult for these organizations to develop for themselves. By providing an open-source system, Fedora enables institutions to gain a substantial leg-up on development and to focus their software development effort on specialized services, custom applications, and end-user tools.

The Fedora software has been made freely available under the Mozilla 1.1 open-source license. This arrangement allows the Fedora code base to be freely used by any person or organization for any purpose. The license very clearly states that any enhancements to the software must be shared just as freely, even if they are developed as part of a for-profit product. Commercial software vendors are free to develop proprietary products that can be used in conjunction with Fedora, but they must contribute their work on the core system back to the public.

Fedora provides both an open specification and public, non-proprietary software. For academic institutions, our intent is to mitigate the effects of heavy dependence on commercial software packages. These include migration problems due to rigid proprietary technologies, information problems due to vendors hoarding the technical details of their proprietary solutions, and integration problems due to closed APIs. For commercial vendors, our intent is to influence them to build products around an open-source core, thereby making their products more appealing to libraries, archives, museums, and educational institutions. For systems designed to manage and preserve information over very long time scales, such as those run by archives and libraries, it is hard to overvalue having a core of public, non-proprietary software for which the specifications are clearly discernible.

Potential for Commercial/Corporate Contributions and Support:

A side benefit of making Fedora freely available has been that several corporations have adopted Fedora, either as a basis for their products, or as a piece of software for which they can provide value-added services. As mentioned above, the Mozilla license obligates commercial entities to make their changes and enhancements to the Fedora core software available to the public (i.e., as open-source). We are already seeing examples of corporate users of Fedora that are both selling products and developing large scale custom solutions that embrace the open-source nature of our product.

Our experience with corporations that are interested in Fedora gives us good reason to believe that commercial entities will take an increasing interest in ensuring the long-term evolution and sustainability of Fedora. Already, we are in active conversations with vendors who want to contribute additional software to the Fedora open-source offering. Others have indicated their interest in contributing seed money toward the creation of a development consortium to maintain the Fedora software in the long term.

One of the corporations, VTLS Inc., is a long-time vendor of first-generation online systems for libraries. They are already selling both services and software built around our product for next-generation digital library solutions. They will help organizations install our software, plan how to use it locally, and provide training; they also optionally sell workflow utilities that allow organizations to get started quickly. The open-source core makes it possible for them to sell these services and utilities at an affordable price. We know of one academic user of Fedora (the University of Delaware, led by Carl Jacobsen, the PI of the U-Portal Project) that has already purchased the VTLS package as a way to get a jump start on their own planned use of Fedora. There are several other academic projects that intend to develop their own local solutions but see the VTLS solution as a relatively inexpensive way to get started.

The Harris Corporation is a government contractor that bids to develop large-scale information management systems, for which the open-source core is a selling point for agencies that are not interested in tying themselves to proprietary solutions. Harris personnel have already contributed research and development back to our project on a casual basis. They have also begun conversations with us about contributing software to the open-source package and about funding a more permanent home for Fedora, possibly along the lines of the alphaWorks project that IBM funded to encourage open-source development of XML tools.

We plan to formalize these existing conversations later this year and begin to explore the requirements, governance model, operating procedures, and funding models with interested organizations. This will position us well to begin putting a consortium in place over the course of Fedora Phase 2, thus ensuring the long-term custodianship of Fedora after the completion of the Phase 2 development project.

Fedora Source Code Standards and Practices:

To promote flexibility and adaptability, the Fedora source code has been written in a modular manner. The overall system is intended to be simple and straightforward, and is based on 100% open-source technology (both the code we have written and the included 3rd party libraries). From an external perspective, Fedora is exposed as a set of web services. The web services layer is defined using the standard Web Services Description Language (WSDL). Internally, the system is constructed as a set of modules written in Java. Each module represents a core piece of functionality (e.g., management, access, storage, validation) and is designed to be configurable. Each module is exposed via a well-defined Java interface. Fedora delivers an implementation of each internal module, and institutions are free to replace the default implementations with new ones that implement the pre-defined Java interfaces.
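
The replaceable-module arrangement described above can be illustrated with a minimal Java sketch. All names in it (StorageModule, InMemoryStorage, CountingStorage, ModuleRegistry) are hypothetical and chosen only for illustration; they are not Fedora's actual interfaces or classes, and the registry is just one assumed way of selecting which implementation is active.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical module contract (an assumption, not Fedora's real interface).
interface StorageModule {
    void store(String objectId, String content);
    String retrieve(String objectId);
}

// Stand-in for a default implementation delivered with the system.
class InMemoryStorage implements StorageModule {
    private final Map<String, String> objects = new HashMap<>();

    public void store(String objectId, String content) {
        objects.put(objectId, content);
    }

    public String retrieve(String objectId) {
        return objects.get(objectId);
    }
}

// A locally written replacement: wraps another implementation and counts writes.
class CountingStorage implements StorageModule {
    private final StorageModule delegate;
    private int writes = 0;

    CountingStorage(StorageModule delegate) {
        this.delegate = delegate;
    }

    public void store(String objectId, String content) {
        writes++;
        delegate.store(objectId, content);
    }

    public String retrieve(String objectId) {
        return delegate.retrieve(objectId);
    }

    int writeCount() {
        return writes;
    }
}

// Hypothetical registry: maps a module interface to the configured implementation.
class ModuleRegistry {
    private final Map<Class<?>, Object> modules = new HashMap<>();

    <T> void register(Class<T> role, T implementation) {
        modules.put(role, implementation);
    }

    <T> T lookup(Class<T> role) {
        Object implementation = modules.get(role);
        if (implementation == null) {
            throw new IllegalStateException("No module configured for " + role.getName());
        }
        return role.cast(implementation);
    }
}

class PluggableModulesSketch {
    public static void main(String[] args) {
        ModuleRegistry registry = new ModuleRegistry();

        // Start with the default implementation.
        registry.register(StorageModule.class, new InMemoryStorage());
        registry.lookup(StorageModule.class).store("demo:1", "first object");

        // Swap in a replacement; callers still depend only on the interface.
        CountingStorage counting = new CountingStorage(new InMemoryStorage());
        registry.register(StorageModule.class, counting);
        registry.lookup(StorageModule.class).store("demo:2", "second object");

        System.out.println(registry.lookup(StorageModule.class).retrieve("demo:2"));
        System.out.println("writes seen by replacement: " + counting.writeCount());
    }
}
```

Because calling code depends only on the interface, an institution could substitute its own implementation without touching the callers, which is the property the modular design is meant to provide.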

The Fedora software comes packaged with extensive system documentation. Additionally, the source code is documented internally. Javadocs are generated so that there are descriptions of all classes in the code base. While the code base is evolving, developers have been documenting major classes using Javadoc. By the end of Phase 1, and again by the end of the proposed Phase 2 project, the development team will make a final pass over all source code to make sure documentation is clear and complete for all methods in all classes.

The goal of our open-source effort is to provide simple, clean, easily adaptable code. As of Fedora 1.2, all of the Fedora source code achieves its functional goals and has been well-tested. By the end of Phase 1 (ending October, 2004), the development team will have made several enhancements to improve the ease of maintaining the code by abstracting the generation of XML responses to Fedora services and the SQL calls to the underlying Fedora database. These modifications will not change the functionality of the code, but will improve its overall packaging and make it even easier for other programmers to modify and extend it. We also plan to upgrade all 3rd party libraries to their latest versions. In addition, we will offer an alternative deployment package of the Fedora software that enables institutions to install Fedora into different pre-existing application servers. (Currently Fedora is bundled together with Tomcat/Axis as the default out-of-the-box installation.) These changes will position the code base optimally for the new development that will take place in Fedora Phase 2.

As part of our quality control process, the Fedora team continually monitors its own development, and also solicits independent code reviews. The recommendations of our internal and external reviewers are incorporated into the open-source code base. This process will continue throughout the proposed Phase 2 project, with the goal of completing the project with source code that remains elegant, simple, and consistent with well-known coding practices.

3. Implementation Plan

Work on the proposed project will be carried out over a three-year period beginning on October 1, 2004. The lag time gives us the opportunity to create, post, and hire for the new position, a process that is always very time consuming. In the past we have not built in this lag time, and the realities of university hiring procedures have always caused us to ask for an extension. We hope to eliminate that need on this project.

Below is a list that roughly details the percentages of the teams’ time to be devoted to work in specific areas in each of the three years. As always, this plan will be continuously adjusted based on experience as the project develops.

Year 1

5% Sustainability/Development Consortium

  • Investigate sustainability models
  • Solicit participants and nurture partnerships for development consortium
  • Plan for cooperative development under consortium
  • Create initial mechanism for code contribution and distribution via consortium
  • Open CVS to selected members of consortium

35% Object Creation

  • Formalize expression of content models
  • Adapt and augment current batch processes for creating and updating objects and create workbench utility to serve as batch control station.
  • Develop user-oriented content submission interfaces configurable for different content models
  • Develop DSpace-like workflow support that can be configured for use with new content submission interfaces (e.g., multi-stage process, review points)

15% “Fedoration”

  • Shared name resolution service for federated repositories.
  • Support for multiple copies of objects among federated repositories (e.g., for backup, preservation redundancy, mirror service)

30% Power Server

  • Performance: benchmark testing
  • Performance: tuning for at least 5 million objects
  • Publication of performance data with recommendations
  • Preservation: enhance audit trail (event history) schema; event history database; event monitoring module.

15% Search Support

  • Enhance existing search capabilities and provide plug-in capability for associating external 3rd party search service with Fedora
  • Develop means for selectively indexing datastreams or disseminations
  • Reference implementation of search plug-in for XML datastreams (e.g., Lucene)

Year 2

5% Sustainability/Development Consortium

  • Put development consortium governance body in place
  • Conduct consortium meetings under UVA and Cornell leadership
  • Draft consortium operating procedures for prioritization, planning and co-development
  • Open Fedora CVS to all members of consortium

15% Object Creation

  • Improving the batch processes to support more complex batches of objects.

35% “Fedoration”

  • Security architecture for peer-to-peer repository federations.
  • Peer-to-peer operations where any repository can fulfill any dissemination request to federation.
  • Full access functionality for virtual, distributed collections

30% Power Server

  • Preservation features including format registry, content format validation, checksums
  • Preservation analysis tools

15% Search Support

  • Generalize the indexing capability developed in year 1.
  • Develop notification module to send messages to subscribing services of changes to trigger incremental reindexing.

Year 3

5% Sustainability/Development Consortium

  • Continue to nurture development consortium
  • Conduct development consortium meetings under UVA and Cornell leadership
  • Move Fedora code base to permanent home (UVA server, SourceForge, other)
  • Finalize initial 5-year maintenance plan and ratify long-term consortium plan

20% Object Creation

  • Refine base functionality of year 1 and 2
  • Provide workflow services that can be used to support a variety of specific workflow scenarios.

25% “Fedoration”

  • Refine base functionality of year 1 and 2
  • Develop Fedora federation proxy service

25% Power Server

  • Performance: final testing and tuning for 10 million objects
  • Preservation: Integration with Global Digital Format Registry (GDFR) if available
  • Preservation: Content validation (e.g., JHOVE)

25% Search Support

  • Refine base functionality of year 1 and 2
  • Provide for distributed search services across repositories in a federation
  • Searching of relationships, and graph-based search of distributed, federated repositories.

Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, August 03, 2009
© The Rector and Visitors of the University of Virginia