Rutgers University Digital Repository
Operating Assumption # 7 - Persistent Identifiers - Operating Assumptions
May 12, 2004
It is surprising that the scholarly community has not risen up to loudly protest citation failures in the digital realm. For scholars to confidently cite a digital object, they must be assured that the object will be accessible via the citation for many years, a property sometimes referred to as "referential integrity" or "citation persistence". As Spinellis' (2002) data demonstrate, web references are distinctly untrustworthy. The evolving and rapidly changing digital environment in which digital objects reside suggests that references to these objects have a high probability of becoming inoperable in a few short years.
In an attempt to address this problem, the concept of persistent identifiers has been developed. The concept is fairly simple. We would like to be able to assign a globally unique name to a digital object, a name that can be used, in perpetuity, to refer to and retrieve the digital object. Uniform resource names (URNs) refer to a generic class of persistent identifiers and are defined by the Internet Engineering Task Force (http://www.ietf.org/rfc/rfc1737.txt) as follows:
"It is intended that the lifetime of a URN be permanent. That is, the URN will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name."
However, there are several real-world practicalities which suggest that this vision of a persistent identifier is achievable only in a limited fashion. Ultimately, we must rely on the persistence of organizations and their ability to implement sustainable preservation policies and human workflow processes and to maintain continuity in the computing infrastructure. The practical obstacles in our digital environment are twofold. First, the digital object is created and resides in an environment with many technological dependencies including computing hardware, operating systems, file systems, and servers. Ultimately, any persistent identifier has to be resolved to a specific address (e.g. a URL). Servers and file systems are routinely moved, renamed, or retired, resulting in the familiar link failure syndrome we see all too often. Second, some organization has to take on the responsibility of managing the assignment of names. Although much of the routine work of assigning names can be automated, we must realize that this organization must ultimately do its work on behalf of the world community and must be able to do this work in perpetuity.
Having raised these fundamental issues, it should be noted that similar systems are working today, albeit not perfectly. Telephone numbers, IP addresses, and social security numbers share some of the same properties as persistent identifiers. Active management and organizational persistence are required in all of these systems. Given these caveats, there are two approaches available today which allow us to get started with the process of providing persistent identifiers: ARK and CNRI Handle. Before exploring these two approaches, there are three conventions that must be clarified.
The PID should point to a "place" that will provide the user some context. For example, if it points directly to an image, the user will not have any descriptive metadata to explain what the image is about. For this reason, we believe the PID should either point to a bibliographic record which, in turn, allows one to access the digital object or, alternatively, the PID should carry some descriptive metadata with it.
Many of our digital objects will have several representations for presentation on the Web (e.g. djvu, pdf, jpeg, mp3) in addition to an archival, uncompressed format such as a tiff or wav file. If the PID is used for citation and points at a specific object, then we must implement a system that has PIDs for all representations. There are two methods of doing this: a) the PID can point to a single bibliographic record, which in turn allows access to all formats. This approach inserts more structure (e.g. the application displaying the bibliographic record) and is therefore more subject to being lost due to infrastructure changes. b) There can be multiple PIDs, each pointing to a separate rendition of the object. Note that the ARK format allows this by identifying variants using "." (period) in the name.
Although the PID will need to be embedded in a URL in order to serve it on the Web, the PID naming convention should generally be free of technology dependencies, protocols, and local naming conventions. The premise behind this convention is that these technologies and naming conventions will change over time, so the name should be chosen to be perpetual.
We first present some simple examples to familiarize readers with the syntax.
The CNRI Handle syntax is of the following form: prefix/suffix, where the prefix (e.g. 1782.3) is assigned by the CNRI Global Registry and is globally unique. The suffix can be any user-assigned character string. For the DLR in general, we have chosen a user-assigned string with the following syntax: [collection].[format].[unique-id within collection]. (How much intelligence do we want in the name?) An example, embedded within a URL, is shown below.
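To make the prefix/suffix structure concrete, the following is a minimal Python sketch of how a Handle string could be taken apart. It is not the original example from this page. The function and class names are hypothetical, and the sample value is the NJDH Handle quoted later in this document. The sketch assumes the suffix follows the three-part DLR convention and that the unique id is whatever follows the second period.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    prefix: str  # naming authority, assigned by the CNRI Global Registry
    suffix: str  # local name, assigned by the user


def parse_handle(handle: str) -> Handle:
    """Split 'prefix/suffix' at the first slash (hypothetical helper)."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return Handle(prefix, suffix)


def split_dlr_suffix(suffix: str) -> dict:
    """Read a suffix as [collection].[format].[unique-id within collection]."""
    parts = suffix.split(".", 2)  # at most three pieces
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"suffix is not in three-part form: {suffix!r}")
    collection, fmt, unique_id = parts
    return {"collection": collection, "format": fmt, "unique_id": unique_id}


h = parse_handle("1782.3/njdh.photograph.192")
print(h.prefix, h.suffix)  # 1782.3 njdh.photograph.192
print(split_dlr_suffix(h.suffix))
```

Note that the Handle system itself attaches no meaning to the periods in the suffix; the three-part reading is purely our local convention.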
In the sample ARK below, the number "15230" is a globally unique number assigned by the folks who maintain the ARK infrastructure at the California Digital Library. On the surface, the "15230" is roughly equivalent to the "1782.3" of the CNRI Handle. The ARK name (i.e. njdhPhotograph492) cannot have embedded periods used as separators since the periods have meaning in the ARK syntax.
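As a rough illustration of why periods cannot be used as separators inside an ARK name, the sketch below splits an ARK into its number, its name, and any variants. It assumes the general form ark:/NAAN/Name with variants appended after a period. The hostname and the ".tiff" variant are hypothetical, the function name is ours, and hierarchy within the name is deliberately not handled.

```python
def parse_ark(ark: str) -> dict:
    """Split an ARK, optionally embedded in a URL, into its parts.

    Hypothetical helper: everything after the first period in the
    name is treated as a list of variants.
    """
    marker = "ark:/"
    start = ark.find(marker)
    if start == -1:
        raise ValueError(f"no 'ark:/' label found in {ark!r}")
    body = ark[start + len(marker):]
    naan, sep, rest = body.partition("/")
    if not sep or not naan or not rest:
        raise ValueError(f"expected ark:/NAAN/Name, got {ark!r}")
    name, dot, variant_part = rest.partition(".")
    variants = variant_part.split(".") if dot else []
    return {"naan": naan, "name": name, "variants": variants}


# example.org and the "tiff" variant are made up for illustration.
print(parse_ark("http://example.org/ark:/15230/njdhPhotograph492.tiff"))
print(parse_ark("ark:/15230/njdhPhotograph492"))
```

Had the name been written as njdh.photograph.492, this kind of parser would read "photograph" and "492" as variants of an object named "njdh", which is why the ARK name is run together.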
Handle: Corporation for National Research Initiatives, Library of Congress, Digital Object Identifier, Department of Defense, MIT's DSpace
ARK: National Library of Medicine, California Digital Library, MIT's DSpace
Handle: Original public implementation in 1994. There is a fairly stable and long-standing organization supporting the Registry. Responses to our questions, including those about setting up a Handle server, have been quite good.
ARK: Original implementation in 2001. Though the CDL has been responsive and eager to work with us, the concern here is that there is no mature, supported organization to operate the registry for ARKs.
Handle: Very simple, no structure.
ARK: Allows for representation of hierarchy, variants and versioning in structure. For example:
Handle: The Naming Authority prefix, a necessary part of the Handle, is used both as part of the identifier and as the access point. Transferring content necessitates a change of PID, so institutions must think about using multiple NAs, and thus running multiple instances of the same software, in order to facilitate future transfers. For example, if we transferred NJDH to the U. of Michigan, a new prefix would have to be used and the corresponding name would be something like the following:
Whereas the ARK could retain the object name as follows:
ARK: Name Assigning Authority is separate from the Name Mapping Authority, which is the access point, though RUL will be both the NAA and NMA initially. Transferring content from one NMA to another is thus simpler, as the NAA is fixed and needn't change.
ARK: Failover techniques would look at a mapping of NAA to NMAs, and determine how to locate another access point. NAAs are separate from NMAs in the ARK scheme, so one could investigate partnerships with other departments and institutions whereby they will host a failover NMA for one's NAA. The failover functionality will eventually be built into a browser plug-in for ARK resolution.
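The failover idea can be sketched as a lookup from NAA number to an ordered list of NMA hosts. This is only an illustration under stated assumptions: the table, the hostnames, and the function names are all hypothetical, and the reachability check is passed in as a function so that the sketch makes no network calls.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping of NAA number to NMA hosts, primary first.
NMA_TABLE: Dict[str, List[str]] = {
    "15230": ["nma-primary.example.edu", "nma-partner.example.org"],
}


def candidate_urls(ark_name: str, table: Dict[str, List[str]] = NMA_TABLE) -> List[str]:
    """Build one URL per known NMA for a name such as 'ark:/15230/xyz'."""
    naan = ark_name.split("/")[1]  # ['ark:', '15230', 'xyz'] -> '15230'
    return [f"http://{host}/{ark_name}" for host in table.get(naan, [])]


def resolve_with_failover(
    ark_name: str,
    is_reachable: Callable[[str], bool],
    table: Dict[str, List[str]] = NMA_TABLE,
) -> Optional[str]:
    """Return the first candidate URL that the supplied check accepts."""
    for url in candidate_urls(ark_name, table):
        if is_reachable(url):
            return url
    return None


# Simulate the primary NMA being down.
primary_down = lambda url: "nma-primary" not in url
print(resolve_with_failover("ark:/15230/njdhPhotograph492", primary_down))
```

The ARK name itself never changes in this sketch; only the host in front of it does, which is the property that separating the NAA from the NMA is meant to provide.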
Handle: If content has been erroneously deleted or moved, PIDs result in 404s.
ARK: In the same situation, PIDs would still result in errors. But with additional services, e.g. simple metadata & institutional commitment statement, users could be redirected to more information as to what exactly the link was associated with, and who is responsible for the lost content.
If Shibboleth rules are applied at the web service level, whether we use Handle or ARK might not matter, since both would be redirecting to the same content, making identical requests to the web server software.
With respect to MPEG-21, the workplan states the following: "Digital Items and their parts within the MPEG-21 Framework are identified by encapsulating Uniform Resource Identifiers into the Identification DS. A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource, where a resource is defined as 'anything that has identity'. The requirement that an MPEG-21 Digital Item Identifier be a URI is also consistent with the statement that the MPEG-21 identifier may be a Uniform Resource Locator (URL)." See the online document at: http://www.chiariglione.org/mpeg/standards/mpeg-21/mpeg-21.htm#_Toc23297969. We believe that both Handles and ARKs follow the URI conventions and are therefore compatible with MPEG-21.
We believe at this juncture that the ARK has some definite conceptual and technical advantages, especially related to items 3, 4, 5, and 6 above. However, the ARK also suffers from being a relatively immature technology and process. Our recommendation therefore at this point is to go forward with the CNRI Handle with the possibility of switching to the ARK after it matures a bit. Note, switching from Handles to ARKs would require redirection for any Handles that have been made publicly available.
To summarize, our CNRI Handle approach includes a unique global registry for each major collection. At this point these registries are: a) Rutgers University (e.g. for institutional repositories): 1782, b) Rutgers University Libraries (e.g. for digital materials not related to a specific grant): 1782.1, c) Institute of Jazz Studies (for the oral histories grant): 1782.2, and d) the NJ Digital Highway Project: 1782.3. We will likely add other registries as we get other large grants. Projects like electronic journals published by RUL would come under the RUL registry of 1782.1. To date, we have also been placing items from Special Collections under the RUL registry; however, we may want to add a registry for Special Collections.
Now we must address the three questions posed above for Handles: 1) naming conventions, 2) what does the PID point to, and 3) multiple display representations.
We continue to favor inserting some intelligence into the Handle name, although this practice should be governed by the application. For example, in the journal platform the recommended syntax is: [journal abbreviation].[vol].[issue].[article number] (e.g. for PCSP, the handle might be 1782.1/pcsp.4.2.1M). For an object in NJDH, the recommendation is: [collection abbreviation].[format].[unique identifier] (e.g. 1782.3/njdh.photograph.192). Note that these Handles can be generated automatically from the bibliographic data.
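As an illustration of generating Handles automatically, the sketch below assembles names in the two recommended syntaxes. The prefixes are the registries listed in this document; the dictionary keys and function names are hypothetical, and the inputs are assumed to have already been pulled from the bibliographic record.

```python
# Registry prefixes as listed in this document; the keys are our own labels.
PREFIXES = {
    "rutgers": "1782",   # Rutgers University
    "rul": "1782.1",     # Rutgers University Libraries
    "ijs": "1782.2",     # Institute of Jazz Studies
    "njdh": "1782.3",    # NJ Digital Highway Project
}


def journal_handle(abbrev: str, vol: int, issue: int, article: str) -> str:
    """[journal abbreviation].[vol].[issue].[article number] under the RUL prefix."""
    return f"{PREFIXES['rul']}/{abbrev}.{vol}.{issue}.{article}"


def njdh_handle(fmt: str, unique_id: int) -> str:
    """[collection abbreviation].[format].[unique identifier] under the NJDH prefix."""
    return f"{PREFIXES['njdh']}/njdh.{fmt}.{unique_id}"


print(journal_handle("pcsp", 4, 2, "1M"))  # 1782.1/pcsp.4.2.1M
print(njdh_handle("photograph", 192))      # 1782.3/njdh.photograph.192
```

A production version would also need to guard against characters in the bibliographic data, such as slashes or spaces, that would be awkward inside a URL.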
With Handles, this question becomes a little more difficult. The Handle does not allow one to point to multiple objects (unless you use two or more Handles). Also, the Handle does not carry any metadata with it (as does the ARK). Again, the best practice is for the application to govern what the PID points to. There are basically two approaches: 1) The Handle points to the bibliographic record within the application, which allows a user to select which presentation format they prefer and to review the metadata. Being positioned within the application also allows the user to search and view other items. You can try this out by using one of the Eagleton handles at http://hdl.rutgers.edu/1782.1/eagleton.poll.031. This approach also works well for objects that might have lots of datastreams. The obvious issue with this approach is that we have inserted another layer of application structure between the Handle and the object. 2) The Handle points directly to the object. Many times the object will be self-identifying and this approach can work quite well. However, the Handle can, in some cases, point directly to an object such as a photograph which has little identifying information with it. This may cause some user confusion. For multiple presentation formats, a separate Handle would have to be used for each format.
Since the Handle approach can point to the bibliographic record, it will inherently provide the capability to access the available display options. If the Handle is to point directly to an object, then multiple Handles will have to be used.
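The difference between the two approaches can be sketched as two small resolution tables. The resolver address is the one quoted in this document; the target URLs, the function names, and the convention of appending the format to the Handle name are hypothetical and shown only to make the count of Handles visible.

```python
RESOLVER = "http://hdl.rutgers.edu/"  # resolver address quoted in this document


def resolver_url(handle: str) -> str:
    """Embed a Handle in a resolvable URL."""
    return RESOLVER + handle


def bib_record_table(handle: str, bib_record_url: str) -> dict:
    """Approach 1: a single Handle that points at the bibliographic record."""
    return {handle: bib_record_url}


def per_format_table(handle: str, format_urls: dict) -> dict:
    """Approach 2: one Handle per representation.

    Appending '.<format>' to the name is a made-up convention.
    """
    return {f"{handle}.{fmt}": url for fmt, url in format_urls.items()}


base = "1782.3/njdh.photograph.192"
one = bib_record_table(base, "http://example.org/bib/192")
many = per_format_table(base, {
    "jpeg": "http://example.org/objects/192.jpg",
    "tiff": "http://example.org/objects/192.tif",
})
print(len(one), len(many))
print(resolver_url("1782.1/eagleton.poll.031"))
```

With the first table, adding a new display format changes nothing at the Handle level; with the second, every new format means registering another Handle.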
From our discussions in the May 11 meeting, we have decided to proceed with using the CNRI Handle system for our projects. For the Eagleton project, the journal platform, and NJDH, we will generally use the concept of pointing to a bibliographic record to provide context. It was noted that this approach is very similar to how OpenURL is used. There will undoubtedly be opportunities in other applications to point directly to the object.