Rutgers University Digital Repository
Operating Assumption # 8 - Fedora Web-based Management Functions for the Rutgers Digital Library Repository
August, 2004

1. Introduction

Fedora (Flexible Extensible Digital Object and Repository Architecture) is an open-source digital repository management system now being tested and deployed at the Rutgers University Libraries. It is the basis for the Rutgers Digital Library Repository currently under development.

The current Fedora distribution includes a Java-based administrator client and a search function based on a MySQL database of cached descriptive metadata (Dublin Core). The administrator client allows a Fedora administrator to search or browse the repository, create new digital objects, modify existing digital objects, ingest objects into the repository, export objects, and create or ingest batches of objects. The administrator client, however, is a standalone program that runs only on the machine hosting the particular Fedora installation. (In the case of the SCC, this is the Linux machine known as sallie.) Similarly robust web-based administrative functions can nevertheless be built using the Fedora management API to interact with the repository via SOAP (the Simple Object Access Protocol). Robust, XML-aware search functionality is not provided by the Fedora implementation, but it can also be built using an XML search engine such as amberfish or XPat. This document describes the basic operating assumptions necessary to implement web-based management and XML search functionality for the Fedora implementation at Rutgers.

2. Definitions and Descriptions

Fedora makes use of a flat file system for storing and retrieving XML digital objects and associated datastreams in various formats, as well as an ancillary MySQL database for caching Dublin Core descriptive metadata about each object.

Fedora's digital objects are flat XML-encoded files that conform to an extension of the Metadata Encoding and Transmission Standard (METS) schema, which is located at http://www.fedora.info/definitions/1/0/mets-fedora-ext.xsd. Fedora's digital objects act as containers for datastreams (bitstreams of content or metadata) and Disseminators (linkages to services for transformation of content or computation).

A datastream is the component of a Fedora digital object that represents content and metadata. The content may physically reside inside the repository, or it may reside outside the repository and be pointed to (by-reference). The Fedora repository system supports content of any MIME type. (Currently we are assuming that all datastreams will reside inside the repository.)

Fedora comes with a default disseminator for every object providing a generic set of service methods for viewing the contents of a digital object, but Fedora also provides the ability to associate custom disseminators with digital objects to expose a set of behaviors beyond those offered by the default disseminator.

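Fedora comes with three open APIs that are exposed as web services:

  1. The Management API (API-M), for administering the repository and its objects.
  2. The Access API (API-A), for discovering and accessing objects via SOAP.
  3. The Access-Lite API (API-A-Lite), a streamlined version of the Access API over plain HTTP.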

Of these, the first is primarily important for the administrative interface, and the second two for the search interface.

The Fedora content versioning system is enabled with the release of Fedora 1.2. Any modifications made to a Datastream or Disseminator through the Fedora management interface (API-M) will automatically result in a new version of that Datastream or Disseminator being created by Fedora. The Fedora repository maintains all versions of all Datastreams and Disseminators, thereby creating a history of how objects change over time. Additionally, Fedora maintains an audit trail record of the nature of the object change events (e.g., who, what, when, why).
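
The versioning behavior described above can be pictured with a small in-memory model. This is only an illustrative sketch in Python, not Fedora's own implementation; the class, method, and field names (VersionedDatastream, modify, audit_trail) and the version ID scheme are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VersionedDatastream:
    """Toy model: every modification appends a version and an audit record."""
    datastream_id: str
    versions: list = field(default_factory=list)     # (version_id, content), oldest first
    audit_trail: list = field(default_factory=list)  # who / what / when / why

    def modify(self, content: bytes, who: str, why: str) -> str:
        # Never overwrite: keep the old versions and add a new one.
        version_id = f"{self.datastream_id}.{len(self.versions)}"
        self.versions.append((version_id, content))
        self.audit_trail.append({
            "who": who,
            "what": f"modified {self.datastream_id}",
            "when": datetime.now(timezone.utc).isoformat(),
            "why": why,
        })
        return version_id

    def current(self) -> bytes:
        # The most recent version is the last one appended.
        return self.versions[-1][1]
```

Calling modify twice on the same datastream leaves two versions and two audit records, which is the history that the repository is described as keeping.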

The integrity of ingestible or edited Fedora objects is maintained by the Saxon XML parser. Objects are parsed when they are first ingested and at any time when edited sections are being checked in.

3. Managing and Interacting with Fedora

A web-based Fedora management interface will tie together native Fedora functionality (e.g., ingesting and purging objects, editing object metadata, adding or purging datastreams) and additional functionality needed for the Rutgers implementation, e.g., XML search engine indexing, handle and digital signature management. It will include a method for authorizing and appropriately limiting the functional access of different management users, e.g., administrative superusers, individual collection managers, metadata editors.

Such a management web site [1] should include at least four main functions:

  1. To manage object ingesting, including the automatic creation of persistent IDs, such as CNRI Handles and digital signatures. Ingesting is a generic Fedora function, and thus is not limited to a particular collection. The Fedora distribution comes with a command line program for ingesting which accesses objects in the file system and an API function that accesses objects via web sockets and SOAP. The current ingest function of the management website uses the API function, and thus requires that objects to be ingested reside in a web accessible area. This ingest function is flexible enough to access non-Fedora routines in a seamless fashion, and thus is able to create CNRI Handles and digital signatures for the object automatically. The creation of persistent IDs is done through a separate CNRI Handle server running Java, but should be transparent to the user. There is also a separate page to manage the creation or deletion of CNRI Handles apart from the context of Fedora ingestion. The digital signatures of objects, created using the sha1sum utility, must be managed outside Fedora because sha1sum can only read and sign an object after it has been ingested or modified. The signatures of archival datastreams must also be managed outside Fedora, as they are included in the objects themselves. These signatures will typically be created as part of the pipeline function. Creation of digital signatures of the objects should be transparent to the user. We need to establish authentication procedures for collection managers and others who will be ingesting or editing objects and indexing collections. We assume that for the time being at least the authentication database to be used by the Workflow Management System (WMS) will also be adequate for handling Fedora management permissions.
  2. To manage search engine indexing with an open source, XML-aware search engine such as amberfish. Unlike Fedora ingestion, which is a generic process applicable to all objects, the search engine indexes for both Fedora objects and full text datastreams are limited by individual "surrogate" collections, which are identified by metadata in the objects and datastreams. Note: the aggregate "host" collections will not need to be indexed separately, but will be searched through combinations of these indexes as determined by information in the collection object [2]. The "dynamic" collections will also make use of these same indexes based on individual search algorithms. The collection indexes may be refreshed all at once or individually; individual refreshes are expected to be the common practice. For indexing purposes, we make use of special search objects, which are filtered versions of the complete Fedora objects containing only the most recent descriptive metadata sections, as well as source and technical metadata. The filtered objects are also given a special tag that allows null searches to return all the objects in a collection. To ensure the integrity of search objects, amberfish uses the Xerces XML parser. We recently tested amberfish on a set of 200,000 search objects, and were able to search the test set without an appreciable performance penalty. For the time being at least, ingesting and indexing should remain uncoupled, as it may be advantageous to ingest multiple objects before refreshing the index for a particular collection. It may also be desirable to re-index collections automatically on a regular schedule using the cron utility.
  3. To allow the searching, browsing, and viewing of existing objects with links allowing full metadata editing, datastream additions or purges, and the export and purging of full objects. In-place editing of Fedora objects involves the ability to alter descriptive or other metadata in an existing object, the ability to change existing datastreams or add new datastreams to an object, the ability to purge metadata sections and datastreams from an object, and the ability to export or purge the object itself. These functions all currently use the management API via SOAP. Metadata sections within the object are modified by value, and typed within SOAP as base64Binary data. External datastreams are modified by reference and typed within SOAP as a string referencing a datastream ID or an external URL. Whenever an object is changed in any way, a new digital signature for the object is created. If an object is purged, its Handle and digital signatures are deleted. These processes, which access systems maintained outside of Fedora, are coupled with the editing functions but must be separate from native Fedora functionality. Access to all of the object editing functions is given through hypertext links generated by the management searching, browsing, and viewing functions. The search function uses amberfish to allow robust queries of information in the XML fields of an object or full text datastream with options to limit searches by metadata field, collection, type, and format. The browse function returns lists of Fedora objects with options to limit output by collection, type, and format. The browse function allows flexible sorting of output by most recent or by various metadata fields.
  4. To allow ongoing management, including versioning, usage tracking, migration, digital provenance, and exporting for OAI harvesting. By default Fedora creates an internal audit trail in the digital provenance section for each section of metadata that has been changed, and also preserves the older versions as part of the object. Older versions of metadata that do not have dependencies may be purged completely if desired using the purge metadata function. Fedora's internal object info function currently allows users to view audit trails that include version numbers, modification times, and descriptive labels. It may be desirable at some point in the future to create a special audit trail interface keyed more specifically to the needs of Rutgers users [3]. The Fedora management page should also provide a mechanism for viewing and testing the digital signatures of objects and archival datastreams. At some time, the Fedora management page should allow for the display of suitable usage statistics. These may include statistics about search queries and the number of hits a given query has generated. They should also include statistics about the number and IDs of objects and datastreams that have been downloaded. If possible the statistics should be organized by individual collection. These statistics will need to be generated from a number of different sources, e.g., the Tomcat log files, which record all Fedora transactions, the regular Apache log files, which record search and browse transactions, and the search programs themselves, which will need to record information about the numbers of hits returned for each query. By default Fedora supports OAI harvesting of Dublin Core records from the repository. Such harvesting is not currently active, as the repository has not yet been registered, but it will become an issue in the future. Fedora does not currently provide a mechanism for exposing selective records for harvesting, but a number of functions of this sort are under discussion for future releases.
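
The by-value editing described in function 3 can be sketched as follows. This is a minimal Python illustration of how a metadata section might be packed into a SOAP 1.1 message as base64Binary text. It is not the management website's actual code: the operation name is modeled on the by-reference action named in the audit example, and the API namespace, the parameter element names (pid, datastreamID, dsContent, logMessage), and the sample object ID are all assumptions that would need to be checked against the API-M WSDL.

```python
import base64
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
API_NS = "urn:example:fedora-api-m"  # placeholder, NOT the real API-M namespace


def build_modify_by_value(pid: str, datastream_id: str,
                          xml_bytes: bytes, log_message: str) -> bytes:
    """Return a SOAP 1.1 envelope carrying the metadata as base64 text."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Operation and parameter element names are hypothetical.
    call = ET.SubElement(body, f"{{{API_NS}}}modifyDatastreamByValue")
    ET.SubElement(call, "pid").text = pid
    ET.SubElement(call, "datastreamID").text = datastream_id
    # By-value content travels inside the message itself, base64-encoded.
    ET.SubElement(call, "dsContent").text = base64.b64encode(xml_bytes).decode("ascii")
    ET.SubElement(call, "logMessage").text = log_message
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    # "demo:1" is a made-up object ID used only for illustration.
    print(build_modify_by_value("demo:1", "DC1", b"<dc/>", "Edited DC").decode("utf-8"))
```

A by-reference edit would differ only in the content parameter, which would carry a plain string (a datastream ID or an external URL) rather than encoded bytes.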

4. Some Issues with Fedora

At present there is no means of editing the METS header and other wrapping elements in Fedora without editing the entire object directly. This can be done by exporting the object, editing it with a text editor such as vi or an XML editor such as Morphon, re-ingesting the object, and only then purging the older version of the object. This order needs to be followed so that the new object can inherit the external datastreams associated with the old object from within the repository before the older object and its datastreams are purged. When edits like this are done, there is no audit trail of the change, though the new object can inherit whatever audit trails were in the old object.

At present there is no means of selectively turning off regular audit trails, which means that any edits of, say, descriptive metadata fields, no matter how trivial, will generate a new copy of the complete metadata section. It is possible to purge these sections by hand, but only if they have no dependencies. It is not yet clear whether the versioning and audit trails provided by Fedora will prove useful in practice. Further research and discussion of this issue is desirable.

There should be a function to create management reports about the objects that will include information about digital signatures.
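
A report of this kind could start from something like the following Python sketch. It assumes, purely for illustration, that objects have been exported as .xml files into a single directory and that previously recorded signatures are available as a dictionary keyed by file name; neither the layout nor the function names come from the current management site.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path) -> str:
    """Hex SHA-1 digest of a file, the same value the sha1sum utility prints."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        # Read in chunks so large objects do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def signature_report(object_dir: Path, recorded: dict) -> list:
    """Compare each exported object file against its recorded signature."""
    rows = []
    for path in sorted(object_dir.glob("*.xml")):
        actual = sha1_of_file(path)
        expected = recorded.get(path.name)
        if expected is None:
            status = "unsigned"
        elif expected == actual:
            status = "ok"
        else:
            status = "MISMATCH"
        rows.append((path.name, actual, status))
    return rows
```

Each row gives the file name, its current signature, and whether that signature matches the recorded one, which is the minimum a collection manager would need in order to spot an object that has changed outside the normal editing functions.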

We may wish to consider exporting XML from Fedora to the WMS format for external editing and re-ingestion, though it should be noted that important differences between the external and internal structures of Fedora objects make such a function problematic. It would entail a re-translation of the Fedora METS to LOC METS as well as of certain Fedora-specific attributes of various internal XML elements.

For the time being at least, we assume that editing during the period of a project's preparation will take place within the WMS database system. This may include edits made for a short period of time after initial ingestion. After a certain time, however, the WMS database and the work area for datastreams will need to be flushed, and any edits should be done on the Fedora objects themselves. We assume that such editing in Fedora will be "archival editing" and relatively rare once an object has been ingested. We need to create a policy document to formalize these procedures.

For editing in Fedora we need a mechanism for syncing MODS and DC elements, so that edits made to one metadata section will appear in the other. Edits should be done on MODS and reflected in the DC, which will always be based on the MODS elements.
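
One way to keep the two sections in step is to regenerate the DC record from the MODS record after every edit, rather than editing both. The Python sketch below assumes this approach. The crosswalk is a small illustrative subset rather than an agreed Rutgers mapping, and the wrapper element name is a placeholder.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
DC = "http://purl.org/dc/elements/1.1/"

# Illustrative subset of a MODS-to-DC crosswalk: (MODS path, DC element).
CROSSWALK = [
    ("m:titleInfo/m:title", "title"),
    ("m:name/m:namePart", "creator"),
    ("m:abstract", "description"),
    ("m:originInfo/m:dateIssued", "date"),
]


def mods_to_dc(mods_xml: str) -> ET.Element:
    """Rebuild the whole DC record from MODS so the two cannot drift apart."""
    mods = ET.fromstring(mods_xml)
    dc_record = ET.Element("dc")  # wrapper element name is a placeholder
    for mods_path, dc_name in CROSSWALK:
        for node in mods.findall(mods_path, {"m": MODS}):
            # Skip empty MODS elements instead of emitting empty DC fields.
            if node.text and node.text.strip():
                ET.SubElement(dc_record, f"{{{DC}}}{dc_name}").text = node.text.strip()
    return dc_record
```

Because the DC record is always derived and never edited directly, an edit to MODS followed by a call to mods_to_dc is enough to bring the two sections back into agreement.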

[1] The prototype Fedora management website may be found at: http://www2.scc.rutgers.edu/dlr/EDIT/.

[2] The collection objects for host collections will contain structure maps that look something like this:

<METS:mets><METS:structMap TYPE="logical">
<METS:div ORDER="1" TYPE="collection" LABEL="NJDH"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="NJHS"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="NJSA"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="Roosevelt"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="SBFarms"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="Swedesboro"></METS:div>
<METS:div ORDER="2" TYPE="collection" LABEL="SWFarm"></METS:div>
</METS:structMap></METS:mets>
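
A search across a host collection would begin by reading the member collections out of such a structure map. The following Python sketch shows one way to do that with the standard library. It assumes that the METS prefix is bound to the standard METS namespace, which the fragment above omits, and it uses a shortened copy of the map.

```python
import xml.etree.ElementTree as ET

METS_NS = {"METS": "http://www.loc.gov/METS/"}

# Shortened copy of a host collection's structure map, with the namespace declared.
HOST_COLLECTION = """
<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="logical">
    <METS:div ORDER="1" TYPE="collection" LABEL="NJDH"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="NJHS"></METS:div>
    <METS:div ORDER="2" TYPE="collection" LABEL="NJSA"></METS:div>
  </METS:structMap>
</METS:mets>
"""


def member_collections(collection_xml: str) -> list:
    """Return the LABEL of every collection div in the logical structure map."""
    root = ET.fromstring(collection_xml)
    divs = root.findall(
        "METS:structMap[@TYPE='logical']/METS:div[@TYPE='collection']", METS_NS
    )
    return [div.get("LABEL") for div in divs]


if __name__ == "__main__":
    print(member_collections(HOST_COLLECTION))
```

The returned labels could then be used to select which surrogate collection indexes to combine for the search.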

[3] Here is an example of a Fedora Audit section:

<METS:digiprovMD ID="AUDIT1" CREATED="2004-07-29T09:22:23" STATUS="A">
  <METS:mdWrap MIMETYPE="text/xml" MDTYPE="OTHER" OTHERMDTYPE="FEDORA-AUDITTRAIL" LABEL="Audit record for 'modifyDatastreamByReference' action by fedoraAdmin at 2004-07-29T09:22:23">
    <METS:xmlData>
      <fedoraAudit:record>
        <fedoraAudit:process type="Fedora API-M"/>
        <fedoraAudit:action>modifyDatastreamByReference</fedoraAudit:action>
        <fedoraAudit:responsibility>fedoraAdmin</fedoraAudit:responsibility>
        <fedoraAudit:date>2004-07-29T09:22:23</fedoraAudit:date>
        <fedoraAudit:justification>Edited DC at 2004-07-29T12:07:22-05:00</fedoraAudit:justification>
      </fedoraAudit:record>
    </METS:xmlData>
  </METS:mdWrap>
</METS:digiprovMD>
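
A Rutgers-specific audit trail interface would need to pull the who, what, when, and why out of records like this one. The Python sketch below shows the idea for a single record. The namespace URI bound to the fedoraAudit prefix is an assumption, since the fragment above does not declare it, and the function name and output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI is assumed for this sketch; check it against real exported objects.
AUDIT_NS = {"fedoraAudit": "http://www.fedora.info/definitions/1/0/audit/"}

RECORD = """
<fedoraAudit:record xmlns:fedoraAudit="http://www.fedora.info/definitions/1/0/audit/">
  <fedoraAudit:process type="Fedora API-M"/>
  <fedoraAudit:action>modifyDatastreamByReference</fedoraAudit:action>
  <fedoraAudit:responsibility>fedoraAdmin</fedoraAudit:responsibility>
  <fedoraAudit:date>2004-07-29T09:22:23</fedoraAudit:date>
  <fedoraAudit:justification>Edited DC at 2004-07-29T12:07:22-05:00</fedoraAudit:justification>
</fedoraAudit:record>
"""


def summarize_audit(record_xml: str) -> dict:
    """Reduce one audit record to its who / what / when / why fields."""
    record = ET.fromstring(record_xml)

    def text_of(name: str) -> str:
        # Missing elements yield an empty string rather than an error.
        return record.findtext(f"fedoraAudit:{name}", default="", namespaces=AUDIT_NS)

    return {
        "who": text_of("responsibility"),
        "what": text_of("action"),
        "when": text_of("date"),
        "why": text_of("justification"),
    }


if __name__ == "__main__":
    print(summarize_audit(RECORD))
```

A list of such summaries, one per audit section, could be sorted by date to give a readable change history for an object.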
Posted September 13, 2004
URL: http://www.libraries.rutgers.edu/rul/staff/groups/dig_infrastructure/reports/assumption_08.shtml
© Copyright 1996-2006, Rutgers University Libraries   (Further Copyright Information)