Rutgers University Digital Repository
Operating Assumption # 8 - Fedora Web-based Management Functions for the Rutgers Digital Library Repository
Fedora (Flexible Extensible Digital Object and Repository Architecture) is an open-source digital repository management system now being tested and deployed at the Rutgers University Libraries. It is the basis for the Rutgers Digital Library Repository currently under development.
The current Fedora distribution includes a Java-based administrator client and a search function based on a MySQL database of cached descriptive metadata (Dublin Core). The administrator client allows a Fedora administrator to search or browse the repository, create new digital objects, modify existing digital objects, ingest objects into the repository, export objects, and create or ingest batches of objects. The administrator client, however, is a standalone program that runs only on the machine hosting the particular Fedora installation. (In the case of the SCC, this is the Linux machine known as sallie.) Similarly robust web-based administrative functions can nevertheless be built by using the Fedora management API to interact with the repository via SOAP (the Simple Object Access Protocol). Robust, XML-aware search functionality is not provided by the Fedora implementation, but it can also be built using an XML search engine such as amberfish or XPat. This document describes the basic operating assumptions necessary to implement web-based management and XML search functionality for the Fedora implementation at Rutgers.
Fedora makes use of a flat file system for storing and retrieving XML digital objects and associated datastreams in various formats, as well as an ancillary MySQL database for caching Dublin Core descriptive metadata about each object.
Fedora's digital objects are flat XML-encoded files that conform to an extension of the Metadata Encoding and Transmission Standard (METS) schema, which is located at http://www.fedora.info/definitions/1/0/mets-fedora-ext.xsd. Fedora's digital objects act as containers for datastreams (bitstreams of content or metadata) and Disseminators (linkages to services for transformation of content or computation).
A datastream is the component of a Fedora digital object that represents content and metadata. The content may physically reside inside the repository, or it may reside outside the repository and be pointed to (by-reference). The Fedora repository system supports content of any MIME type. (Currently we are assuming that all datastreams will reside inside the repository.)
Fedora comes with a default disseminator for every object providing a generic set of service methods for viewing the contents of a digital object, but Fedora also provides the ability to associate custom disseminators with digital objects to expose a set of behaviors beyond those offered by the default disseminator.
Fedora comes with three open APIs that are exposed as web services:
Of these, the first is primarily important for the administrative interface, and the second two for the search interface.
The Fedora content versioning system is enabled with the release of Fedora 1.2. Any modifications made to a Datastream or Disseminator through the Fedora management interface (API-M) will automatically result in a new version of that Datastream or Disseminator being created by Fedora. The Fedora repository maintains all versions of all Datastreams and Disseminators, thereby creating a history of how objects change over time. Additionally, Fedora maintains an audit trail record of the nature of the object change events (e.g., who, what, when, why).
The integrity of ingestible or edited Fedora objects is maintained by the Saxon XML parser. Objects are parsed when they are first ingested and at any time when edited sections are being checked in.
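A web-based interface could run a cheap pre-check before handing an object to Fedora. The following is a minimal sketch, assuming only that well-formedness is tested; it uses Python's standard library rather than the parser Fedora itself uses, it does not validate against the Fedora METS extension schema, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise.

    This checks well-formedness only. It does NOT validate the object
    against the Fedora METS extension schema.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<object><datastream/></object>"))
    print(is_well_formed("<object><datastream></object>"))
```

A check like this only catches gross editing mistakes (unclosed tags, stray characters) early; schema-level validation would still be left to Fedora at ingest time.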
A web-based Fedora management interface will tie together native Fedora functionality (e.g., ingesting and purging objects, editing object metadata, adding or purging datastreams) and additional functionality needed for the Rutgers implementation, e.g., XML search engine indexing, handle and digital signature management. It will include a method for authorizing and appropriately limiting the functional access of different management users, e.g., administrative superusers, individual collection managers, metadata editors.
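The idea of limiting functional access by type of management user can be sketched as a simple lookup table. The role names follow the examples given above, but the action names and the assignment of actions to roles are hypothetical placeholders, not a decided policy.

```python
# Hypothetical mapping of management roles to permitted actions.
# The actual policy has not been defined; these sets are placeholders.
PERMISSIONS = {
    "superuser": {
        "ingest", "purge", "edit_metadata",
        "add_datastream", "purge_datastream", "index",
    },
    "collection_manager": {
        "ingest", "edit_metadata", "add_datastream", "index",
    },
    "metadata_editor": {"edit_metadata"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed("metadata_editor", "edit_metadata"))
    print(is_allowed("metadata_editor", "purge"))
```

Unknown roles fall through to an empty permission set, so access is denied by default.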
Such a management web site should include at least four main functions:
sha1sum utility, must be managed outside Fedora because sha1sum can only read and sign an object after it has been ingested or modified. The signatures of archival datastreams must also be managed outside Fedora, as they are included in the objects themselves. These signatures will typically be created as part of the pipeline function. Creation of digital signatures of the objects should be transparent to the user.
We need to establish authentication procedures for collection managers and others who will be ingesting or editing objects and indexing collections. We assume that for the time being at least the authentication database to be used by the Workflow Management System (WMS) will also be adequate for handling Fedora management permissions.
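For the signature step, a web-based pipeline could compute the same hexadecimal SHA-1 digest that the sha1sum utility reports without shelling out to it. This is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and where the digest is stored afterwards is left open.

```python
import hashlib


def sha1_hex(data):
    """Return the hexadecimal SHA-1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks.

    Reading in chunks avoids loading large datastreams into memory.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(sha1_hex(b"abc"))
```

Because the digest must be taken after ingestion or modification, a function like this would be called on the exported or stored object file once Fedora has finished writing it.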
At present there is no means of editing the METS header and other wrapping elements in Fedora without editing the entire object directly. This can be done by exporting the object, editing it with a text editor such as vi or an XML editor such as Morphon, re-ingesting the object, and only then purging the older version of the object. This order needs to be followed so that the new object can inherit the external datastreams associated with the old object from within the repository before the older object and its datastreams are purged. When edits like this are done, there is no audit trail of the change, though the new object can inherit whatever audit trails were in the old object.
At present there is no means of selectively turning off regular audit trails, which means that any edit to, say, a descriptive metadata field, no matter how trivial, will generate a new copy of the complete metadata section. It is possible to purge these sections by hand, but only if they have no dependencies. It is not yet clear whether the versioning and audit trails provided by Fedora are useful. Further research and discussion of this issue are desirable.
There should be a function to create management reports about the objects that will include information about digital signatures.
We may wish to consider exporting XML from Fedora to the WMS format for external editing and re-ingestion, though it should be noted that important differences between the external and internal structures of Fedora objects make such a function problematic. It would entail a re-translation of the Fedora METS to LOC METS, as well as of certain Fedora-specific attributes of various internal XML elements.
For the time being at least, we assume that editing during the period of a project's preparation will take place within the WMS database system. This may include edits made for a short period of time after initial ingestion. After a certain time, however, the WMS database and the work area for datastreams will need to be flushed, and any edits should be done on the Fedora objects themselves. We assume that such editing in Fedora will be "archival editing" and relatively rare once an object has been ingested. We need to create a policy document to formalize these procedures.
For editing in Fedora we need a mechanism for syncing MODS and DC elements, so that edits made to one metadata section will appear in the other. Edits should be done on MODS and reflected in the DC, which will always be based on the MODS elements.
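One way to keep the DC derived from the MODS is to regenerate it from the MODS after every edit. The sketch below assumes a deliberately small, hypothetical crosswalk and an oai_dc:dc wrapper for the Dublin Core record; the real mapping would need to be agreed on, and the function name is illustrative only.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Hypothetical crosswalk: (path inside the MODS record, DC element name).
CROSSWALK = [
    ("mods:titleInfo/mods:title", "title"),
    ("mods:name/mods:namePart", "creator"),
    ("mods:subject/mods:topic", "subject"),
    ("mods:originInfo/mods:dateIssued", "date"),
    ("mods:abstract", "description"),
]


def mods_to_dc(mods_xml):
    """Build a Dublin Core element tree from a MODS record string."""
    mods = ET.fromstring(mods_xml)
    dc_root = ET.Element("{%s}dc" % OAI_DC_NS)
    namespaces = {"mods": MODS_NS}
    for path, dc_name in CROSSWALK:
        for node in mods.findall(path, namespaces):
            text = (node.text or "").strip()
            if text:  # skip empty MODS elements
                dc_field = ET.SubElement(dc_root, "{%s}%s" % (DC_NS, dc_name))
                dc_field.text = text
    return dc_root


if __name__ == "__main__":
    sample = (
        '<mods xmlns="http://www.loc.gov/mods/v3">'
        "<titleInfo><title>Farm Diary</title></titleInfo>"
        "<subject><topic>Agriculture</topic></subject>"
        "</mods>"
    )
    print(ET.tostring(mods_to_dc(sample), encoding="unicode"))
```

Regenerating the whole DC record, rather than patching individual fields, keeps the MODS as the single source of truth, which matches the assumption that edits are made on the MODS only.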
The prototype Fedora management website may be found at: http://www2.scc.rutgers.edu/dlr/EDIT/.
The collection objects for host collections will contain structure maps that look something like this:
<METS:mets>
  <METS:structMap TYPE="logical">
    <METS:div ORDER="1" TYPE="collection" LABEL="NJDH"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="NJHS"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="NJSA"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="Roosevelt"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="SBFarms"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="Swedesboro"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="SWFarm"></METS:div>
  </METS:structMap>
</METS:mets>
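A management page could list the collections named in a structure map like the one above. The following is a minimal sketch; it assumes the full object declares the METS prefix with the standard namespace http://www.loc.gov/METS/ (the excerpt above omits the declaration), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the METS prefix is bound to the standard METS namespace.
METS_NS = {"METS": "http://www.loc.gov/METS/"}


def collection_labels(mets_xml):
    """Return the LABEL of each collection div in the logical structMap."""
    root = ET.fromstring(mets_xml)
    divs = root.findall(
        "METS:structMap[@TYPE='logical']/METS:div[@TYPE='collection']",
        METS_NS,
    )
    return [div.get("LABEL", "") for div in divs]


if __name__ == "__main__":
    sample = (
        '<METS:mets xmlns:METS="http://www.loc.gov/METS/">'
        '<METS:structMap TYPE="logical">'
        '<METS:div ORDER="1" TYPE="collection" LABEL="NJDH"/>'
        '<METS:div ORDER="2" TYPE="collection" LABEL="NJHS"/>'
        "</METS:structMap>"
        "</METS:mets>"
    )
    print(collection_labels(sample))
```

Labels are returned in document order, so any ordering based on the ORDER attribute would need to be applied separately.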
Here is an example of a Fedora Audit section:
<METS:digiprovMD ID="AUDIT1" CREATED="2004-07-29T09:22:23" STATUS="A">
  <METS:mdWrap MIMETYPE="text/xml" MDTYPE="OTHER" OTHERMDTYPE="FEDORA-AUDITTRAIL" LABEL="Audit record for 'modifyDatastreamByReference' action by fedoraAdmin at 2004-07-29T09:22:23">
    <METS:xmlData>
      <fedoraAudit:record>
        <fedoraAudit:process type="Fedora API-M"/>
        <fedoraAudit:action>modifyDatastreamByReference</fedoraAudit:action>
        <fedoraAudit:responsibility>fedoraAdmin</fedoraAudit:responsibility>
        <fedoraAudit:date>2004-07-29T09:22:23</fedoraAudit:date>
        <fedoraAudit:justification>Edited DC at 2004-07-29T12:07:22-05:00</fedoraAudit:justification>
      </fedoraAudit:record>
    </METS:xmlData>
  </METS:mdWrap>
</METS:digiprovMD>
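For management reports, the who, what, when, and why of an audit section like the one above can be pulled out by element name. This sketch matches on local names so that it does not depend on the exact fedoraAudit namespace URI, which is not shown in the excerpt; the URI in the sample is a placeholder, the function names are hypothetical, and the code assumes a single audit record per input.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def audit_summary(record_xml):
    """Return a dict summarizing one audit record (any namespace URI)."""
    root = ET.fromstring(record_xml)
    summary = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name == "process":
            # The process is recorded in the 'type' attribute.
            summary["process"] = element.get("type")
        elif name in ("action", "responsibility", "date", "justification"):
            summary[name] = (element.text or "").strip()
    return summary


if __name__ == "__main__":
    # The namespace URI below is a placeholder, not Fedora's actual URI.
    sample = (
        '<fedoraAudit:record xmlns:fedoraAudit="urn:example:fedora-audit">'
        '<fedoraAudit:process type="Fedora API-M"/>'
        "<fedoraAudit:action>modifyDatastreamByReference</fedoraAudit:action>"
        "<fedoraAudit:responsibility>fedoraAdmin</fedoraAudit:responsibility>"
        "<fedoraAudit:date>2004-07-29T09:22:23</fedoraAudit:date>"
        "<fedoraAudit:justification>Edited DC</fedoraAudit:justification>"
        "</fedoraAudit:record>"
    )
    print(audit_summary(sample))
```

If an input contained several records, later values would overwrite earlier ones, so a real report would iterate over each record element separately.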