Minutes of November 19, 2003 Meeting
Present: Kalaivani Ananthan (KA), Isaiah Beard (IB), Anne Butman (AB), Tom Frusciano (TF), Judy Gardner (JG), Michael Giarlo (MG) (Recorder), Dave Hoover (DH), Sam McDonald (SM), Ann Montanaro (AM) (Chair), Lynn Mullins (LM), Bob Nahory (BN), Yang Yu (YY).
Excused: Patrick Huey, Ron Jantz, Linda Langschied, Karen Wenk.
AM: The Request For Purchase (RFP) for the RUL "Mass Storage System" (MSS) has gone out to bid. There is an RFP pre-bid conference scheduled for Monday, December 1st, from 11am-12pm at ASB III, during which we will have a chance to meet with vendors, elucidate any requirements not perfectly clear to them, and ask questions about their relevant products.
The remainder of the timeline is as follows: we expect there to be a decision on which vendor will provide the MSS by the first or second week of January; we should be in possession of the entire MSS by April 1st; and the MSS should be installed and running by June 1st.
TF: The New Jersey Digital Highway (NJDH) Steering Committee has chosen the New Jersey Homesteads website, which is part of the Electronic New Jersey project hosted at the SCC, as a curriculum unit for the NJDH project. The Homesteads website will be redone, images will be rescanned, new metadata will be applied, and it will serve as an example for NJDH.
IB: Under the recommended scanning standards (600 DPI TIFFs, scanned alongside a ruler, label, and color calibration bar), the Roosevelt Oral History collection, one of many collections to be entered into NJDH, is consuming disk space at an alarming rate. Specifically, scanning a single 4”x6” photograph consumes approximately 177MB. Multiplied by the 140 or so scans we can produce in a week, this comes to nearly 25GB of disk space per week. At 1TB per month, we would theoretically fill up the MSS in just five months, and this would only accommodate part of one of NJDH’s many potential collections. While lowering the DPI from 600 to 300 reduces the image size to one-quarter, this is probably not an acceptable deviation from image standards. Alternatively, scanning each photo independently of the label, ruler, and calibration bar results in an 18MB image, consuming roughly one-tenth of the disk space.
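The per-scan figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are made up for this example, 1GB is taken as 1000MB, and the DPI comparison assumes uncompressed image size scales with pixel count.

```python
# Figures quoted in the minutes; everything else is derived from them.
FULL_SCAN_MB = 177     # 4"x6" photo at 600 DPI with ruler, label, and color bar
PHOTO_ONLY_MB = 18     # the same photo scanned without the extras
SCANS_PER_WEEK = 140   # approximate weekly scanning throughput


def weekly_volume_gb(scan_mb, scans_per_week=SCANS_PER_WEEK):
    """Disk space consumed per week in GB (assuming 1 GB = 1000 MB)."""
    return scan_mb * scans_per_week / 1000


def dpi_size_ratio(new_dpi, old_dpi):
    """Uncompressed size scales with pixel count, i.e. with DPI squared."""
    return (new_dpi / old_dpi) ** 2


if __name__ == "__main__":
    print(weekly_volume_gb(FULL_SCAN_MB))             # weekly volume at the full standard
    print(dpi_size_ratio(300, 600))                   # size at 300 DPI relative to 600 DPI
    print(round(PHOTO_ONLY_MB / FULL_SCAN_MB, 2))     # photo-only scan relative to full scan
```

Halving the DPI halves both dimensions, which is why the ratio comes out to a quarter of the original size rather than a quarter off it.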
AM: The RUL Systems department, in addition to the SCC, will run the FEDORA server. Though the details still have to be ironed out, it is expected that the two instances of FEDORA will synchronize with one another. DH will work with MG on this.
The instance of FEDORA running at the SCC will be considered the developmental server, and as such will be used for testing upgrades, new disseminators, and other architectures to be integrated with the core FEDORA product (e.g. search engines, integrity-checking daemons, etc.). Content hosted at the SCC will not typically be "for public consumption." The FEDORA running at Systems will be RUL’s production-level digital preservation platform, hosting all the publicly-accessible content. Updates to the Systems FEDORA, new disseminators, and other underlying architectures will be tested first at the SCC before being introduced to the production server. As previously mentioned, the servers will implement the necessary utilities to synchronize with each other, should synchronization be needed at any point. (MG has already prototyped some synchronization logic, which is being tested at the SCC, in addition to an object- and datastream-integrity daemon.)
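As a rough illustration of what a checksum-based integrity and synchronization check between the two servers might look like, here is a minimal sketch. It is not MG's prototype and does not use any FEDORA API: the datastream identifiers, the in-memory dictionaries standing in for each server's content, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def checksum(data):
    """Return a hex digest for a datastream's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(datastreams):
    """Map each datastream ID to the checksum of its content."""
    return {ds_id: checksum(content) for ds_id, content in datastreams.items()}


def diff_manifests(dev, prod):
    """Compare two manifests and report what would need synchronizing."""
    dev_ids, prod_ids = set(dev), set(prod)
    return {
        "only_on_dev": sorted(dev_ids - prod_ids),
        "only_on_prod": sorted(prod_ids - dev_ids),
        "differing": sorted(i for i in dev_ids & prod_ids if dev[i] != prod[i]),
    }


if __name__ == "__main__":
    # Hypothetical datastream IDs and contents standing in for each server.
    scc = build_manifest({"obj:1/DC": b"<dc/>", "obj:1/IMG": b"tiff-bytes"})
    systems = build_manifest({"obj:1/DC": b"<dc/>", "obj:1/IMG": b"corrupted"})
    print(diff_manifests(scc, systems))
```

Running the same manifest comparison against a stored copy of a server's own earlier manifest would flag datastreams whose content has changed unexpectedly, which is the integrity-checking half of the idea.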
AM: KA will resume work on the Workflow Management System, picking up where Patrick Huey left off before going on disability, with the assistance of Vince Colonna, a "student" programmer employed by the SCC. The outstanding items include repeatable fields for the metadata forms and logic for inputting and editing the structure of complex objects. What needs the most attention is the metadata, which will be worked out once a definitive decision is made on which schema(s) will be supported.
YY: The MIC interface comprises three main modules: 1) a directory (http://depts.washington.edu/ast/mic/mic_directory_version_one.html ?), containing general information about organizations and collections, which is being developed by the University of Washington; 2) a union catalog (http://mic.rutgers.edu/), populated with metadata for collections from various organizations, which YY is developing here at RU; and 3) a web UI (http://mic.imtc.gatech.edu/mic/index.htm), providing an interface for searches against both the directory and the catalog, which is being developed at Georgia Tech. GT is using the Zebra engine to index and search exported XML records.
The MIC schema, so far, has forty-seven core data elements which map and may be exported to a variety of other metadata standards / schemas, e.g. MARC and MPEG7. It should be noted that once an organization has submitted a collection of metadata to the MIC union catalog, the original metadata is not stored; the appropriate fields are mapped onto the forty-seven core data elements, and the rest are stripped out.
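The mapping step described above, in which submitted fields are carried onto core elements and everything else is dropped, can be sketched as a simple crosswalk. This is an illustration only: the field names and core element names below are hypothetical and are not taken from the actual MIC schema.

```python
# Hypothetical crosswalk from a contributor's field names to core element names.
# The real MIC schema has forty-seven core elements; these three are invented.
CROSSWALK = {
    "title_proper": "Title",
    "creator_name": "Creator",
    "date_created": "Date",
}


def to_core(record, crosswalk=CROSSWALK):
    """Keep only fields that map onto a core element; strip out the rest."""
    core = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is not None:
            core[target] = value
    return core


if __name__ == "__main__":
    submitted = {
        "title_proper": "Newsreel footage",
        "creator_name": "Unknown",
        "local_shelf_code": "B-12",   # no core equivalent, so it is dropped
    }
    print(to_core(submitted))
```

Because the unmapped fields are not retained, a contributing organization would need to keep its own copy of the original record if it wants that information later.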
The other two modules, which are not yet web-accessible, are a dynamic portal and the MPEG7 cataloging utility which YY developed. The latter can be used by organizations that either lack a cataloging solution or are not satisfied with the one they have. The utility is implemented as a Microsoft Access database, and at least 90% of the fields represented within it are mappable onto the forty-seven core MIC data elements.
JT: JT has developed prototypes for FEDORA web interfaces, one for public searching (http://fedora.rutgers.edu/~triggs/FEDORA/) and the other, an administrative interface, for management of FEDORA objects (http://fedora.rutgers.edu/~triggs/FEDORA/EDIT/). Both interfaces include search options such as full-text, full record, and metadata searches, and results may also be filtered by collection, type, and format. JT is using a robust XML-aware search engine and indexer called amberfish, which enables him to create different collections and merge search results dynamically. Because of this, every collection can have its own index, so creating a portal that searches only a single collection is trivial. Amberfish is indexing the underlying XML of objects stored in FEDORA as well as any indexable datastreams attached to the objects.
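The idea of one index per collection, with results merged on demand, can be illustrated with a toy inverted index. This sketch does not use amberfish or reflect its interface; the collection names, document IDs, and functions are all hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(indexes, term, collections=None):
    """Search the named collections (all by default) and merge the hits."""
    names = collections if collections is not None else list(indexes)
    hits = []
    for name in names:
        for doc_id in sorted(indexes[name].get(term.lower(), ())):
            hits.append((name, doc_id))
    return hits


if __name__ == "__main__":
    # Hypothetical collections, each with its own separate index.
    indexes = {
        "homesteads": build_index({"h1": "Jersey Homesteads mural", "h2": "farm cooperative"}),
        "oral_history": build_index({"o1": "interview about the mural"}),
    }
    print(search(indexes, "mural"))                   # merged across all collections
    print(search(indexes, "mural", ["homesteads"]))   # single-collection portal
```

A portal limited to one collection is then just a search call with a single collection name, which is the property the minutes describe as trivial to provide.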
Some expressed concern about the appearance of the search prototypes, wondering about their current configuration (and/or their configurability) and how they would be integrated into various collection portals and other sites. It should be noted that JT’s prototypes are just that: prototypes, developed as platforms with extensibility and configurability already taken into account.
The next DAWG meeting will be held on Wednesday, December 17th from 9:30am until 12pm, at the SCC Heyer Conference Room (404A).