Report from Digital Preservation Sub-Committee - August 17, 2005

1. Introduction

To date, our focus has been largely on preserving brittle books (i.e., books that are out of copyright yet not considered rare). Much of what we have done will also work for rare books and born-digital objects. If the subcommittee is to continue, we should decide on our priorities.

2. Process and Workflow

Our objective is to create a digital preservation process that will preserve materials across many human generations and technological epochs (more than 100 years). This requires well-defined human processes and workflow. Figure 1 illustrates the highest level of the basic digital preservation process. We have decomposed each of the major steps in Figure 1 (P1.0 - P3.0) into more detailed steps, which will need to be refined for the specific area of focus, e.g. rare artifacts, brittle books, or born-digital objects.

Figure 1 - The Digital Preservation Process

3. Infrastructure

Much of the technological infrastructure is in place to do digital preservation at RUL. The infrastructure consists of Fedora and software developed by RUL to assist in life cycle management. Some of the critical features include: automatic assignment of persistent IDs, creation and automatic checking of digital signatures, collection manager alerting services, audit trails/versioning, and a capability to validate each ingested object according to our metadata standards.
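
The creation and automatic checking of digital signatures can be illustrated with a minimal sketch. It assumes that "digital signature" here means a cryptographic digest computed over the file at ingest and recomputed later to detect change. The SHA-256 algorithm and the function names are assumptions chosen for illustration; the report does not state which algorithm or code the RUL infrastructure uses.

```python
import hashlib
from pathlib import Path


def compute_signature(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file (algorithm is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large archival masters do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_signature(path: Path, recorded: str) -> bool:
    """Recompute the digest and compare it with the value recorded at ingest."""
    return compute_signature(path) == recorded
```

In a periodic check of this kind, a mismatch between the recorded and recomputed values would be the event that triggers an alert to the collection manager.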

4. Book Preservation Experiment

For a controlled experiment, we used four books from the RUL general collection. The books were sent to IDI so that we could take advantage of their Kirtas page-turning scanner. We also provided IDI with RUL requirements for digitizing books (we have since updated these requirements based on our experience with IDI). Our recommendation at this point is that vendors provide us with the archival master (a TIFF file), a PDF file, a DjVu file, and the OCR text. For these four books, the cost was $0.37 per page to produce these files and FTP them to our Linux server. Note that this cost corresponds to process P1.0 in Figure 1.

The second step in the process (P2.0) includes preparing metadata via the Workflow Management System (WMS), including the structure map; ingesting the object; automatically generating the signature and PID; re-indexing the collection; and doing basic sanity testing. These steps averaged about $0.16 per page, resulting in a total for the two major steps of $0.55 per page. It should be noted that we were using a prototype WMS that did not yet handle books, so the structure map had to be created directly in XML with a text editor. This step and others related to books are now part of the WMS to be released in September, which should reduce the $0.16 per page significantly. Volume book digitization should also reduce vendor costs, so the committee believes that $0.55 per page is an upper bound on the costs. These costs should not be compared to other published analyses that include only the digitization step.
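
The per-page figures can be applied to a single book as a rough worked example. The sketch below uses only the three figures quoted above. The 200-page book length is an illustrative assumption, and the variable and function names are hypothetical. Note that the two component figures as quoted sum to $0.53, so the stated $0.55 total is kept as a separate value and not derived.

```python
from decimal import Decimal

# Per-page figures quoted in the report (US dollars).
DIGITIZATION_PER_PAGE = Decimal("0.37")    # P1.0: vendor scanning, derivative files, FTP delivery
INGEST_PER_PAGE = Decimal("0.16")          # P2.0: metadata, structure map, ingest, testing
REPORTED_TOTAL_PER_PAGE = Decimal("0.55")  # total as stated in the report (upper bound)


def book_cost(pages: int, per_page: Decimal) -> Decimal:
    """Cost of processing one book of `pages` pages at a given per-page rate."""
    return pages * per_page


# Sum of the two quoted components (0.53), shown next to the stated total.
component_sum = DIGITIZATION_PER_PAGE + INGEST_PER_PAGE

pages = 200  # illustrative book length, an assumption
print("Component sum per page:", component_sum)
print("Book cost at component sum:", book_cost(pages, component_sum))
print("Book cost at stated upper bound:", book_cost(pages, REPORTED_TOTAL_PER_PAGE))
```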

5. Book Printing and Binding

In this controlled experiment, we wanted to examine the quality and costs of printing and binding a book that has been cut and digitized. All four books are currently being run through this process, and we estimate the average cost at about $25 per book.

6. Links in IRIS

The IRIS record for each book that we have digitized will carry an 856 link to the book's PID. This step has not yet been completed.

7. Getting Started with Digital Preservation at RUL

To get started with digital preservation at RUL, we should decide that digital preservation is part of our strategy and then proceed to make decisions in the following areas:

8. Recommendations

There is still much work to do; however, the committee feels that it is appropriate to publish our work as an interim report and begin the discussion of the major areas identified above. At this point, the committee would like to make three recommendations:

1) The committee should continue its work and specifically examine the process and supporting technology for born-digital objects. The RUL repository will shortly contain three types of born-digital objects: Eagleton Public Opinion Polls, electronic journals, and digital dissertations. Of the three, the first two are already being ingested into the repository. Under the assumption that these types of materials are a high priority for preservation, we recommend that the process and infrastructure be further defined and that RUL allocate staff and budget to digitally preserve these materials.

2) The committee recommends that we trial the brittle book digital preservation process. In addition to staff resources, RUL should allocate an initial book digitization budget of $6,000. This budget will cover the outsourcing of digitization for approximately 100 books with an average of 200 pages per book.

3) We should examine what level of preservation we want to provide for the collections that are now being ingested into the repository, specifically the NJ Digital Highway and the Institute of Jazz Studies Oral History collection.
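
As a rough check on the proposed digitization budget, the arithmetic can be written out. The budget, book count, and average page count come from the recommendation, and the $0.37 pilot rate comes from the Book Preservation Experiment section. The implied per-page rate is derived here and is not a figure stated by the committee; the variable names are hypothetical.

```python
from decimal import Decimal

BUDGET = Decimal("6000")            # proposed initial digitization budget (US dollars)
BOOKS = 100                         # approximate number of books
AVG_PAGES_PER_BOOK = 200            # average pages per book
PILOT_RATE = Decimal("0.37")        # vendor cost per page in the four-book experiment

total_pages = BOOKS * AVG_PAGES_PER_BOOK      # 20,000 pages
implied_rate = BUDGET / total_pages           # per-page rate the budget can support
cost_at_pilot_rate = total_pages * PILOT_RATE # what the same volume costs at the pilot rate

print("Total pages:", total_pages)
print("Implied vendor rate per page:", implied_rate)
print("Cost at pilot rate:", cost_at_pilot_rate)
```

On these figures the budget supports about $0.30 per page, below the $0.37 pilot rate. That is consistent with the expectation that volume digitization will reduce vendor costs, but it suggests the estimate depends on that reduction.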

The Sub-Committee - Ian Bogus, Tom Glynn, Ron Jantz, Darryl Voorhees. Consultants included Isaiah Beard, Anne Marie Lyons, and Jeffery Triggs.

Last updated: August 30, 2005