Fedora Implementation Model
Our implementation of the FEDORA architecture using a relational database deviates slightly from the original FEDORA model developed by Carl Lagoze's group at Cornell, but does not violate any of the key principles of the FEDORA architecture. Our one deviation is an additional constraint: for a given digital object, a signature can have only one disseminator. This constraint makes it possible to resolve the disseminator name automatically given only the signature name for a given object. Hence, a generic signature (e.g., web_default), with a set of generic methods associated with it, can be created and shared by all objects to define a default web behavior for any object.
Because of the object-oriented nature of the FEDORA architecture, a database management system that is both object-oriented and XML-savvy would be the ideal choice for implementing the repository. However, at the time of our initial implementation we were unable to find an existing database management system with all the XML features we required and with licensing and pricing that fit within our constraints. We chose instead to implement the system using a relational database to store the object representations and a Java servlet to provide a web-based communication interface to the repository. We chose the MySQL relational database management system because of its widespread acceptance in the web user community, its ease of use, its "academic friendly" licensing, and its robustness in handling very large tables.
Repository System Configuration
We wanted a communication protocol that was familiar to our web designers and that did not require any knowledge of the underlying relational database table structure. Making requests of the repository should require knowledge only of an object's primitive FEDORA components, such as the object identifier (URN), signature name, and method name.
We chose to use a Java servlet that uses JDBC to run SQL queries against the underlying relational database. We chose a Java servlet rather than a Java applet for performance reasons and because we wanted the bulk of the processing to be handled on the server rather than at the client. Repository requests are encoded as URLs that invoke the Java servlet with a number of parameters specifying the type of request to be made. The syntax of an encoded URL consists of the path to the Java servlet executable followed by a question mark ('?') and one or more argument keyword/value pairs (e.g., keyword=value) separated by ampersands ('&'). The argument keyword named "action" defines the type of communication request and is required on every request. The value of the action keyword then determines the number of additional arguments (if any) that are required to complete the request. Currently, only three action types are available. We plan to expand the types of requests to include a full range of functions that enable an end user to easily ascertain information about any FEDORA primitive component, including object, signature, disseminator, method, and datastream.
For example:
http://dl.lib.virginia.edu/servlets/ObjectServlet? <action-type> [ & <argument> ]
<action-type> (required)
    action =
        list - list all objects in the repository; requires no additional arguments
        select - list all behaviors of a given object; requires the following additional argument: doid = <URN of object>
        dissem - execute a specific behavior of a given object; requires the following additional arguments: doid, sigName, methName, and parameter name=value pairs
<argument> (optional/required depending on action type)
    doid = <URN of object>
    sigName = <signature name>
    methName = <method name>
    parameter_name=parameter_value (zero or more pairs)
Consider, for example, an object with a URN of 1007.lib.dl.test/text_ead/viu00003 that subscribes to a signature named "web_ead".
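Before continuing with this object, the argument rules above can be made concrete with a small sketch. It is an illustration only, not the repository's servlet code: the class name RepositoryRequest and its methods are hypothetical, the query string is parsed by hand so the example stays self-contained (a servlet container would normally supply the parsed parameters), and URL decoding is omitted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of checking a repository request against the syntax above. */
class RepositoryRequest {
    // Arguments each action requires in addition to "action" itself.
    private static final Map<String, List<String>> REQUIRED = Map.of(
        "list", List.of(),
        "select", List.of("doid"),
        "dissem", List.of("doid", "sigName", "methName"));

    /** Splits a raw query string such as "action=select&doid=..." into keyword/value pairs. */
    static Map<String, String> parse(String query) {
        Map<String, String> args = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return args;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) { // skip fragments that have no keyword
                args.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return args;
    }

    /** Returns the action name, or throws if the action or a required argument is missing. */
    static String validate(Map<String, String> args) {
        String action = args.get("action");
        if (action == null || !REQUIRED.containsKey(action)) {
            throw new IllegalArgumentException("missing or unknown action: " + action);
        }
        for (String name : REQUIRED.get(action)) {
            if (!args.containsKey(name)) {
                throw new IllegalArgumentException(action + " requires argument: " + name);
            }
        }
        return action;
    }
}
```

Under these assumptions, a query of action=select&doid=1007.lib.dl.test/text_ead/viu00003 validates as a select request, while action=dissem with only a doid is rejected because sigName and methName are missing.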
The method named "get_web_default", which requires no additional parameters, is the method used to disseminate this object's default web behavior. To invoke a dissemination that produces the default web behavior for this object, the encoded URL passes action=dissem together with the object's URN as doid, "web_ead" as sigName, and "get_web_default" as methName.
To execute the dissemination, the Java servlet parses the incoming URL to extract the various arguments and then issues the appropriate SQL query to the repository database using JDBC to extract the desired information. In the case of a dissemination request, the SQL query returns the value of the "Action" column in the Servlet table and the names of the object's associated datastreams. The Java servlet then assembles the outgoing URL from the action and associated datastreams and redirects the servlet's output stream to the newly assembled URL. Currently, an action must take the form of an HTTP request (i.e., the action could be an HTTP GET, an HTTP POST, a call to another Java servlet, a call to a Perl CGI script, etc.). In the example above, the "Action" returned from the database looks like:
http://dl.lib.virginia.edu/cgi-dl/eaddoc.pl?file=<datastream>
The Java servlet replaces <datastream> with the appropriate datastream of the object based on the results of the SQL query and then executes the fully formed URL. If the action executable had additional parameters other than the location of the datastream, these would be passed as method parameters and appended to the outgoing URL. In this example, the Perl script named eaddoc.pl requires no additional parameters.
Relational Database FEDORA Model
The goal of the relational database implementation of FEDORA was to preserve the FEDORA object model without sacrificing performance in the relational model. The current implementation consists of 10 relational tables that comprise the repository.
The first five tables (URN_ID, SIG_ID, DISS_ID, DS_ID, and METHOD_ID) are simple lookup tables that enforce referential integrity by providing a unique numeric identifier for each primitive FEDORA component: object name, signature name, disseminator name, datastream name, and method name. Given a name for any of these primitive components, you can determine its associated numeric identifier, which is used as an index into the other tables.
The SIG table defines each signature, its associated methods, and a high-level description of each method. Given a signature name, you can find all of the methods available to that signature. This table is relatively small, with a maximum size defined by the number of signatures times the maximum number of methods per signature.
The DISS table uniquely defines each disseminator given the URN, signature name, and disseminator name. This is one of the three larger tables in the database; its size is bounded by the number of objects (O) times the maximum number of signatures (S) for each object. The maximum number of signatures per object is anticipated to be relatively small (currently fewer than 5). If the maximum number of signatures is allowed to grow large, it could have adverse effects on performance. [Note: If we had not imposed the additional restriction that for a given object a signature can have only one disseminator, then this table would have a maximum size of O x S x D, where D is the maximum number of disseminators per signature.]
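The effect of the one-disseminator-per-signature restriction can be illustrated with a short sketch. This is an in-memory stand-in for the DISS table, not the actual schema or SQL: the class DisseminatorIndex, its method names, and the disseminator name used in the usage note are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the DISS table. */
class DisseminatorIndex {
    // Key is the (URN, signature) pair; the restriction allows one disseminator per key.
    private final Map<List<String>, String> rows = new HashMap<>();

    /** Adds a row, rejecting a second, different disseminator for the same object and signature. */
    void register(String urn, String signature, String disseminator) {
        String existing = rows.putIfAbsent(List.of(urn, signature), disseminator);
        if (existing != null && !existing.equals(disseminator)) {
            throw new IllegalStateException(
                "object " + urn + " already has disseminator " + existing
                    + " for signature " + signature);
        }
    }

    /** Resolves the disseminator from the object and signature alone; null if there is none. */
    String resolve(String urn, String signature) {
        return rows.get(List.of(urn, signature));
    }

    /** Number of rows, which can never exceed objects times signatures per object. */
    int size() {
        return rows.size();
    }
}
```

For instance, after registering a made-up disseminator "ead_dissem" for object 1007.lib.dl.test/text_ead/viu00003 and signature "web_ead", resolve needs only the URN and the signature name to return it, and an attempt to register a second disseminator for the same pair fails.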
The Servlet table provides the pairing between a disseminator and its signature/servlet pair. In our implementation, the column called "Action" in the Servlet table holds the servlet's executable code in the form of an HTTP request. Given a signature name, a disseminator name, and a method name, one can determine the action to be performed (the program that implements the method), the return type of the method, and a description of how this particular method is implemented. The size of this table is bounded by the number of signatures times the maximum number of disseminators for each signature times the maximum number of methods for each signature, all of which should be relatively small numbers (e.g., fewer than 20).
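As a rough sketch of how an "Action" value from this table might be turned into an outgoing URL, the example below substitutes the datastream for the <datastream> placeholder and appends any method parameters. The class ActionUrlBuilder and its build method are hypothetical, and percent-encoding the appended parameters is an assumption of this sketch rather than something described for the servlet.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch of assembling the outgoing URL for a dissemination. */
class ActionUrlBuilder {
    static final String PLACEHOLDER = "<datastream>";

    /**
     * Substitutes the datastream into the action template verbatim, then appends
     * each method parameter as name=value.
     */
    static String build(String actionTemplate, String datastream,
                        Map<String, String> methodParams) {
        StringBuilder url =
            new StringBuilder(actionTemplate.replace(PLACEHOLDER, datastream));
        for (Map.Entry<String, String> p : methodParams.entrySet()) {
            // Use '?' for the first argument of the URL and '&' for every later one.
            url.append(url.indexOf("?") >= 0 ? '&' : '?')
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
        }
        return url.toString();
    }

    // Assumption: parameter names and values are percent-encoded (needs Java 10 or later).
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

With the action http://dl.lib.virginia.edu/cgi-dl/eaddoc.pl?file=<datastream>, a made-up datastream name such as viu00003.xml, and no method parameters, build returns the action with the placeholder replaced and nothing appended.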
The PARM table defines any parameters that are required by a given method and provides a description of the parameter and how it is used. The primary purpose of the description fields in this and other tables is to provide descriptive information that can be used by an end user to assist in selecting an appropriate behavior. The size of this table is bounded by the number of methods times the maximum number of parameters per method.
The DS table defines the mapping between an object and its associated datastreams. This table is one of the three larger tables. Its size is bounded by the number of objects (O) times the maximum number of datastreams (D) per object. The number of datastreams per object is anticipated to be small (fewer than 5). If the number of datastreams per object is allowed to grow large, it could impact performance. Typically, an object will have only a single datastream. An example of an object with multiple datastreams is an image object, where one might have multiple datastreams representing different sizes and/or resolutions of the same image. All of these datastreams are part of the single image object but would exist as separate datastreams associated with that object.
The relationships among the various tables are shown in the figure below.
Database Table Relationships
The size of a digital repository is governed primarily by the number of objects included. With the exception of the URN_ID, DISS, and DS tables, the table sizes are independent of the total number of objects in the repository, so those tables are relatively small. The URN_ID table has a maximum size determined by the number of objects (O), and the DISS and DS tables are bounded by a small multiple (currently 5) of the total number of objects. One obvious concern was the performance of table joins on large tables as the number of objects grew large. We conducted a crude benchmark by loading a million dummy objects into the tables and found no significant degradation in performance with a million rows in each of the three larger tables. The benchmark did not test concurrent user load, nor did it test what happens when the number of signatures per object or the number of datastreams per object becomes large (i.e., much greater than 5). Further testing will be needed to determine the upper limit at which performance may begin to seriously degrade, but we are confident that this implementation will allow us to include all of the digital objects that we now have on hand. This should give us a sufficient testbed to determine the feasibility of investing the resources needed to implement this architecture for a collection of millions of digital objects.
