Minutes of January 16, 2015 Meeting

Present: Stephanie Bartz, Natalie Borisovets, Janet Croft, Robert Cupryk, Francesca Giannetti (recorder), Janice Pilch, Triveni Kuchi, Christopher Sterback (for Tibor Purger), Jane Sloan, Gracemary Smulewitz, Mary Beth Weber, Tao Yang (chair)
Absent: Tibor Purger

1. Updates on the action items from our kickoff meeting - database A-Z list, content access and discovery, HT API, HTRC, etc.

Database A-Z list - Yang and Bartz

HathiTrust was added to the Indexes and Databases A-Z list by Bartz and Yang. Bartz included it on the list of core databases for government documents. Suggestion was made to ask all selectors to evaluate whether or not to add HT to the database lists for their disciplines. We proposed adding HT to the Find… Books and Find… Journals areas of the RUL website. Yang will bring the recommendations to USC for endorsement.

Kuchi suggested the addition of a subtitle ("Search millions of books") to explain briefly what HT is to users.

Content access and discovery - Smulewitz and Pilch

Smulewitz said HT documentation on bibliographic data and discovery via ILS and discovery services is lengthy and complex. The options include linking to the HT full text viewer or HT catalog records, including records for restricted content.

HT uses a custom system called Zephyr to store, manage and export bibliographic metadata.

The idea was raised of including a statement somewhere within the record indicating the nature of access, i.e., whether the work is public domain (PD) or restricted.

There are a total of 15 categories of use, including PD but restricted because of privacy concerns, PD in USA, in copyright in USA, etc. Some geoblocking is used: works created after 1875 cannot be served to an EU IP address.

It is unclear what the Rutgers partnership provides access to and what those operatives are. There are three categories of access: works that partners can download; works that the general public can download; and works that may only be searched.

No one, not even partners, can view protected works. The only exemption is for preservation: the member library needs to demonstrate ownership of the work in question and document an attempt to replace it at fair market value. It is unclear how this process of securing a replacement copy from HT works.

HT API - Sterback

Sterback and Purger have conducted VuFind testing. VuFind has out-of-the-box options that allow HT links to be turned on automatically. By default, this function enables links only to works in PD and to works carrying the "ic-world" attribute code, which stands for "in-copyright and permitted as world viewable by the copyright holder."
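
As an illustration only, the default linking rule described above could be sketched as a simple filter on a rights code. The record layout, the field name rights_code, and the use of "pd" as the code for PD works are assumptions made for this sketch; they are not taken from VuFind or the HT API, and the sample volumes are invented.

```python
# Rights codes treated as linkable under the default described in the minutes.
# "pd" as the code for public-domain works is an assumption for this sketch.
LINKABLE_RIGHTS = {"pd", "ic-world"}


def linkable_items(items):
    """Return only the items whose rights code would get an HT link."""
    return [item for item in items if item.get("rights_code") in LINKABLE_RIGHTS]


# Invented sample records; field names are illustrative, not an actual schema.
sample = [
    {"id": "vol-1", "rights_code": "pd"},
    {"id": "vol-2", "rights_code": "ic"},
    {"id": "vol-3", "rights_code": "ic-world"},
]

print([item["id"] for item in linkable_items(sample)])
```

Keeping the permitted codes in one set would make it easy to widen or narrow the rule later if the catalog committee decides on a different policy.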

HT works appear in search results when links are turned on, although performance needs to be evaluated more systematically. There is a potential concern about users receiving a "bad SSL certificate" message when approaching HT through a secure (https) site like VuFind. The SSL issue appears only in VuFind, not in the current RUL website and catalog.

Fengzhi Fan experimented with building a customized holding screen in the VuFind test catalog, which gives links and previews, using the HT API. It is unclear what kind of view is desired; evaluation by the catalog committee and some user testing are required.

HathiTrust Research Center - Giannetti

The systems and tools available through the HTRC are completely open, although they do require the creation of user account(s). From the HTRC portal, it is possible to run analytic algorithms against the HathiTrust volumes that are in the public domain (~2.5 million).

The Workset Builder is a search interface for the HathiTrust PD corpus; search results can be saved as a 'workset': a collection of volumes on which the text mining algorithms are performed.

In addition to the main HTRC services (Portal and Workset Builder), there is a sandbox stack with the same tools. The sandbox runs against non-Google scanned content (about 260,000 volumes). The advantage of the sandbox is that the index and Data API can be accessed directly, permitting the use of the researcher's algorithms.

Bookworm - Of the HTRC tools, Bookworm is the most user-friendly. It visualizes language usage trends as a simple time plot.

Data Capsule - The HTRC Data Capsule permits "non-consumptive" research on HT's use-protected texts (works in copyright). What is non-consumptive research? As defined by the Google Books settlement: "research in which computational analysis is performed on one or more books, but not research in which a researcher reads or displays."

Single Interface for Music Score Searching and Analysis project (SIMSSA) - SIMSSA researchers use Optical Music Recognition (OMR) software to transform digital images of music into searchable representations of music notation.

NovelTM - This partnership seeks to produce the first large-scale cross-cultural study of the novel using quantitative methods.

Workset Creation for Scholarly Analysis - An immediate objective for HTRC is to allow scholars to collect items together for computational analysis. But the often rudimentary, print-based metadata of the HT does not adequately support the creation of scholarly worksets. A goal of the WCSA is to enrich the metadata in HT and augment it with URIs to leverage discovery and sharing through external services.

HTRC UnCamp - The HTRC will host its third annual HTRC UnCamp on March 30-31, 2015 at the University of Michigan. "The UnCamp is part hands-on coding and demonstration, part inspirational use-cases, part community building, and a part informational, all structured in the dynamic setting of an un-conference programming format."

The HTRC Wiki is very useful for wrapping one's head around their activity.

2. Plan for the visit of Jeremy York, Assistant Director of HathiTrust - Yang

Yang reported that Jeremy York has agreed to give a talk at RUL and the preliminary date is February 24. Melissa Just, Tom Izbicki and Grace Agnew might be interested in attending this talk and subsequent discussion. Yang asked the group whether to put off the talk to March out of a concern about inclement weather, but the group agreed it was best to have him come sooner. Yang will confirm the date with York. Note was made that 2/24 is the date of one of the IJS candidate interviews.

It was decided to begin researching a list of questions before York's arrival. Ideally, questions will be submitted to him in advance of his trip so he can prepare responses in addition to delivering his talk. We might invite somebody from an HT partner institution that is geographically closer, either in conjunction with his visit or as a backup plan. It was suggested to sort questions from practical to more complex to facilitate our implementation process.

3. Discussion of the comprehensive list of questions for investigation - Yang

Yang made suggestions on the shape of the task force's final report: a preamble outlining the task force timeline and tasks accomplished; essential information collected about HT; recommendations; and lastly, a notes section. The bulk of the report will be on recommendations. The recommendations need to have rationales: for example, if we recommend not putting title-level records into the ILS, then we need to provide a rationale for that recommendation.

The group then examined the list of questions and made assignments. Yang added assignments directly into the Google Doc.

Discovery: What should we do about bibliographic records for both restricted-access and PD works? Question was raised about frustrating users with records for protected works, although suggestion was made that the work can subsequently be requested through ILL. Suggestion was made to form a team to investigate access through the library catalog and EBSCO Discovery Service (EDS). Currently any e-book offered through Springer, Elsevier or Wiley in the library catalog is also discoverable in Articles+.

RUL currently has 2.9 million records in Sirsi; only 100,000 records away from being bumped up to another pricing tier. Suggestion made to ask colleagues at the University of Chicago, OSU, Michigan, etc. how they are handling HT bibliographic records. If we opt not to load them into Sirsi, then there is incentive to load into EDS. There is the question of whether or not it is possible to load records selectively (PD only) into EDS. Note was made that HT works are already discoverable in EDS.

The API works, but there remains the question of what to provide access to. VuFind is still in flux. Purger and Sterback will work on a list of additional tasks before making recommendations.

Content access: Pilch will work on paragraph exploring what Rutgers as a partner has access to: PD, preservation copy, disability access. Jeremy York can confirm or amend.

Relationship with Internet Archive (IA) and Digital Public Library of America (DPLA): Per statement received by Smulewitz from Internet Archive, "The items from IA that are in HT were not placed there by IA but by partners... If Rutgers content is there, we did not have a hand in placing it there." 220 Rutgers titles were submitted to IA; some of them are currently in HT. Many were part of special collections. Were they contributed by Michigan, or another partner? Or digitized by Google? Rutgers paid $9,000 to contribute 220 titles to IA. Pilch will clarify the relationship with DPLA.

There is some confusion about whether PD works can be downloaded without partner login. Everyone agreed to try to download a PD work both through the RU proxy and without logging in, to see what happens. What is it that triggers a "partners only" message?

Visiting scholars have barcode and PIN access to RUL, which is the same as emeritus status (no NetID). Sloan and Croft will look into HT access for this user group.

What happens when legitimate RUL users who happen to be overseas are told that RUL cannot make an ILL request because the work is in HT? Will their access to HT be blocked because they are overseas? If they prefer print, will we still place the ILL request? Not everything available in HT is in full text. RUL staff will have to verify that the work is fully available in HT. What about external-to-RU ILL requests? Can we forward HT versions to other libraries? Pilch will investigate licensing questions pertaining to external ILL requests; Croft will investigate other ILL questions, plus Reserves questions.

Next Steps

Subgroups will meet however they see fit, find answers, write brief reports, and e-mail the task force when their investigation is complete, by February 16. The goal is to have a more complete draft of questions (and answers) ready for York's arrival.

