Outbound Harvesting with Encore as a Library Space-Saving Strategy : The Case of HathiTrust Docs

Outbound Harvesting with Encore as a Library Space-Saving Strategy: The Case of HathiTrust Docs Christopher C. Brown University of Denver, Penrose Library (303) 871-3404 [email protected] Friday, April 15, 2011


Brown, Christopher C. “Outbound Harvesting with Encore as a Library Space-Saving Strategy : The Case of HathiTrust Docs.” Presentation given at the Innovative Users Group at ALA Midwinter, 7 January 2011, San Diego, CA.

Transcript of Outbound Harvesting with Encore as a Library Space-Saving Strategy : The Case of HathiTrust Docs

Page 1: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Outbound Harvesting with Encore as a Library Space-Saving Strategy: The Case of HathiTrust Docs

Christopher C. Brown

University of Denver, Penrose Library

(303) 871-3404 [email protected]

Friday, April 15, 2011

Page 2: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

This presentation will show how Encore harvesting can be used to mitigate a space problem in a library, substituting online access for the need for physical access to the collection. The government documents collection will be the primary focus.

DR, IR, Digital Texts

Inbound Harvesting

Outbound Harvesting

Page 3: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

About University of Denver

Depository since 1909

Historically a 70-75% selective

Now a 4.8% selective, but receive 100% of online cataloging

Adding URLs to historic documents

Page 4: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

The Problem

Currently 80% of our paper documents are in storage

We will be remodelling our library – totally displaced for at least 18 months; 100% of documents will be in storage

Government documents will remain in storage after renovation

Page 5: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Partial Solution: Using Encore for Outbound Harvesting

Our users are accustomed to using electronic documents

Need to divert attention away from physical collection holdings

Encore harvesting of Hathi Trust can do this

OCLC report: 15% of HathiTrust public domain materials are government docs*

Malpas, Constance. 2011. Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2011/2011-01.pdf.

Page 6: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

OAI-PMH Harvesting

http://www.openarchives.org/

Promotes interoperability standards for dissemination of content

Hathi Trust allows harvesting of its records

Innovative Interfaces' Encore catalog allows for records to be harvested (with the purchase of a harvester connection)
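For readers unfamiliar with OAI-PMH, here is a minimal Python sketch of what a harvester does: it requests records with the ListRecords verb and follows the resumptionToken until the repository stops returning one. The base URL, the sample response, and the function names are placeholders for illustration only; they are not HathiTrust's actual endpoint or Encore's internal harvester.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In the protocol, resumptionToken is sent without other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter(DC_NS + "title")]
    token_el = next(root.iter(OAI_NS + "resumptionToken"), None)
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token


# Hypothetical response, trimmed to one record, for offline illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample government document</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    titles, token = parse_list_records(SAMPLE_RESPONSE)
    print(titles, token)
    print(list_records_url("https://example.org/oai", resumption_token=token))
```

A scheduled harvester would repeat the request with each returned token until none comes back, then run again on its next cycle.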

Page 7: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Encore Model

[Diagram. Components shown: Traditional III Millennium ILS; Classic OPAC; Encore (III), the next-gen catalog outside the ILS box; Harvester; Local Site with Digital Content; Remote Sites with Digital Content (two)]

•Harvested records appear only in Encore, not in “classic” catalog

•Harvested records update on a periodic schedule – in our case daily

Page 8: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

PD = where docs generally live

Hathi Trust Attributes

From: http://www.hathitrust.org/rights_database

Page 9: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

PD vs. PDUS

Mass identification of copyright status based on bibliographically-derived information:

a) As texts are loaded, a set query in Mirlyn identifies those texts that are:

• US federal government documents, or

• published in the US prior to 1923, or

• published outside of the US before 1870

These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically-derived information (REASON name=bib). We do not restrict access to these materials.

b) Those texts that do not meet these criteria (e.g., US post-1923 and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib).

c) An additional attribute is used to represent works published outside the United States between 1870 and 1923 because copyright status for these works depends on the location of the user. Works published outside the US prior to 1923 are in the public domain; however, due to the variations in copyright law in countries outside the US, it is estimated that 1870 is the earliest date works published in these countries may still be under copyright. Therefore, users accessing the volume from US IP addresses will have access to the works published outside the US between 1870 through 1923; however, users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).
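The rules quoted above can be read as a small decision procedure. The Python sketch below is one reading of them, not HathiTrust's actual code: the function and parameter names are invented for illustration, the treatment of the year 1923 itself is an assumption (the quoted text says both "prior to 1923" and "through 1923"), and the access check ignores every attribute other than pd, pdus, and ic.

```python
def rights_attribute(is_us_federal_doc, published_in_us, year):
    """Assign pd, pdus, or ic from bibliographic data (REASON name=bib)."""
    if is_us_federal_doc:
        return "pd"  # US federal documents are public domain regardless of date
    if published_in_us:
        return "pd" if year < 1923 else "ic"
    if year < 1870:
        return "pd"  # published outside the US before 1870
    if year < 1923:  # assumption: 1923 itself is treated as in-copyright
        return "pdus"  # public domain only for users in the US
    return "ic"


def can_view(attribute, user_in_us):
    """Simplified access check: pd for everyone, pdus for US IP addresses only."""
    if attribute == "pd":
        return True
    if attribute == "pdus":
        return user_in_us
    return False


if __name__ == "__main__":
    for args in [(True, True, 2005), (False, True, 1950),
                 (False, False, 1900), (False, False, 1860)]:
        attr = rights_attribute(*args)
        print(args, attr, can_view(attr, user_in_us=True),
              can_view(attr, user_in_us=False))
```

The first rule explains the slide heading: because federal documents are pd whatever their date, they are the main public domain material for recent decades.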

Page 10: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Public Domain Distribution

Page 11: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Sampling Method

I wanted to see how many government documents were in our HathiTrust harvest

Limit to HathiTrust for a given year

Examine first result on each page of 25 results (4% of results) [limitation: Encore only displays first 1,000 results]
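As a rough illustration of this sampling approach, the Python sketch below picks the first result on each page of 25 within the 1,000-result display cap and scales the observed share of documents up to a harvest total. The slides do not say how the sample was scaled or combined across years, so the estimation step and all function names here are assumptions.

```python
def sample_positions(total_results, page_size=25, display_cap=1000):
    """Zero-based positions of the first result on each visible page."""
    visible = min(total_results, display_cap)
    return list(range(0, visible, page_size))


def estimate_docs(sampled_is_doc, harvest_total):
    """Scale the share of sampled records that are docs to the full harvest."""
    if not sampled_is_doc:
        raise ValueError("no sampled records")
    share = sum(sampled_is_doc) / len(sampled_is_doc)
    return share, round(share * harvest_total)


if __name__ == "__main__":
    # A year with more than 1,000 results yields at most 40 sampled records.
    print(len(sample_positions(13_369)))
    # Hypothetical outcome: 3 of 4 sampled records were government documents.
    print(estimate_docs([True, True, True, False], 1_000))
```

With the cap in place, any search returning more than 1,000 results is sampled from its first 1,000 only, so the estimate depends on how Encore orders those results.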

Page 12: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Harvesting Hathi Docs: The Stats

Statistics as of mid-March, 2011

The Docs Sampling columns show the estimated numbers of docs per year and the estimated percentage of docs per year from the Harvest

Date Range | Hathi Totals | Hathi All Pub Domain (pdus + pd) | Hathi pdus | DU pd Harvest | Docs Sampling (est. number) | Docs Sampling (est. %)
2000-2009 | 505,682 | 14,140 | 726 | 13,369 | 13,340 | 99.78%
1990-1999 | 709,214 | 29,163 | 880 | 28,164 | 26,662 | 94.67%
1980-1989 | 723,657 | 33,753 | 1,204 | 32,321 | 31,370 | 97.06%
1970-1979 | 631,110 | 28,633 | 2,046 | 26,189 | 25,607 | 97.78%
1960-1969 | 546,914 | 21,244 | 1,987 | 18,991 | 7,668 | 40.38%
1950-1959 | 281,615 | 20,861 | 863 | 19,893 | 3,888 | 19.54%
1940-1949 | 184,755 | 17,096 | 600 | 16,253 | 3,771 | 23.21%
1930-1939 | 175,103 | 16,237 | 654 | 15,317 | 2,600 | 16.97%
1920-1929 | 175,226 | 66,563 | 27,108 | 28,854 | 1,529 | 5.30%
1910-1919 | 175,148 | 169,923 | 75,955 | 61,230 | 4,124 | 6.73%
1900-1909 | 179,018 | 153,284 | 70,900 | 47,999 | 2,265 | 4.72%
1890-1899 | 112,295 | 110,605 | 50,502 | 34,742 | 596 | 1.72%
1880-1889 | 83,950 | 82,809 | 38,928 | 23,855 | 699 | 2.93%
1870-1879 | 58,624 | 57,826 | 27,202 | 17,751 | 319 | 1.80%
1860-1869 | 50,907 | 50,337 | 2,273 | 45,790 | 248 | 0.54%
Total | 4,593,218 | 872,474 | 301,828 | 430,718 | 124,686 | 28.95%
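The percentage figures appear to be the estimated number of docs divided by the DU pd Harvest figure for the same row. The slide does not state this; it is inferred from the numbers. A short Python check on a few rows taken from the statistics above, with a hypothetical function name:

```python
# (DU pd Harvest, estimated docs) for a few rows of the statistics above.
ROWS = {
    "2000-2009": (13_369, 13_340),
    "1960-1969": (18_991, 7_668),
    "1860-1869": (45_790, 248),
    "Total": (430_718, 124_686),
}


def docs_share(du_pd_harvest, estimated_docs):
    """Estimated docs as a percentage of harvested pd records, to 2 places."""
    return round(100 * estimated_docs / du_pd_harvest, 2)


if __name__ == "__main__":
    for label, (harvest, docs) in ROWS.items():
        print(label, docs_share(harvest, docs))
```

Read this way, government documents are almost all of the harvested public domain records from 1970 onward and a small fraction of those before 1930.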

Page 13: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Hathi Docs Usage in Proportion to Docs Distribution

[Chart: Total Docs and Hathi Docs by year; horizontal axis 1899 to 2009, vertical axis 0 to 30,000]

Sources: 1895-1976 data: Monthly Catalog, 1895-1976 (ProQuest); 1976 onward data: CGP

Page 14: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Hathi Harvest in Perspective

Tracking of daily harvesting since harvesting began, April 16, 2010 through January 1, 2011

[Chart: Harvested Records and Harvested Docs over time; horizontal axis 4/1/2010 to 12/27/2010, vertical axis 0 to 600,000]

Page 15: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Inclusion of Serials

Although serial holdings do not sort properly, users can figure out what they need.

Page 16: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Access to Older Serials

[Screenshots: Harvested Record; Hathi Trust Record; Hathi Trust Full Text]

Page 17: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

And Very Old Serials

[Screenshots: Harvested Record; Hathi Trust Record; Hathi Trust Full Text]

Page 18: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Multivolume Works

Page 19: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Duplicate Holdings

U. of Michigan and U. of California holdings both show in this record

Page 20: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Now, the Bad News: Records are Stripped Down

“Lumber, Lumber, Lumber”

Page 21: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Harvested Record from our Catalog

Notice the multiple duplications of subject headings

Page 22: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Original Record in Hathi Trust

Same record, but subject heading subfields are present

Page 23: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Stripped-Out Fields

008 fixed field data

650 subfields other than “a”

500 notes

5xx shipping list info

300 subfields after “a”

086 SuDocs number
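To illustrate how this stripping produces repeated headings such as “Lumber, Lumber, Lumber”, the Python sketch below models a record as a list of (tag, subfields) pairs and drops the fields and subfields listed above. It is not Encore's harvesting code. The sample record is entirely hypothetical, dropping every 5xx field is an assumption, and keeping only subfield “a” of the 300 field approximates “subfields after a”.

```python
def strip_record(fields):
    """Simulate the field loss described above.

    fields: list of (tag, [(subfield_code, value), ...]) pairs.
    """
    stripped = []
    for tag, subfields in fields:
        # 008 fixed field, 086 SuDocs number, and 5xx notes are dropped.
        if tag in {"008", "086"} or tag.startswith("5"):
            continue
        # Only subfield "a" survives in 300 and 650.
        if tag in {"300", "650"}:
            subfields = [(code, value) for code, value in subfields if code == "a"]
        stripped.append((tag, subfields))
    return stripped


# Hypothetical record; every value is invented for illustration.
SAMPLE_RECORD = [
    ("008", [("", "fixed field data")]),
    ("086", [("a", "X 1.2:EXAMPLE")]),
    ("245", [("a", "Example lumber report")]),
    ("300", [("a", "12 p. :"), ("b", "ill. ;"), ("c", "28 cm.")]),
    ("500", [("a", "Shipping list info")]),
    ("650", [("a", "Lumber"), ("x", "Prices"), ("z", "Oregon")]),
    ("650", [("a", "Lumber"), ("z", "Washington (State)")]),
    ("650", [("a", "Lumber"), ("x", "Statistics")]),
]

if __name__ == "__main__":
    result = strip_record(SAMPLE_RECORD)
    subjects = [value for tag, subs in result if tag == "650" for _, value in subs]
    print(", ".join(subjects))
```

Three distinct subject headings collapse to the same word once their subdivisions are gone, which is why the harvested record shows duplicated subjects.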

Page 24: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Use Stats for Regular Online Docs

Represents clickthroughs from the catalog record to individual government documents over 7+ years.

Page 25: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Use Stats for Hathi Trust?

•Statistics for all Hathi Trust records accessed, not just documents

•Spikes in usage are the docs librarian's (my) testing, not real users

Statistics from Google Analytics

Page 26: Outbound Harvesting with Encore as a Library Space-Saving  Strategy : The Case of HathiTrust Docs

Conclusions

Encore provides an easy way to add external content to a library catalog experience

HathiTrust records are freely available and are easy to harvest

The Encore-harvested records are stripped-down and inadequate, providing too few access points and inadequate descriptions

The content is superb, containing monographic and serial document holdings over a span of about 150 years

Overall the project is worth having in our Encore catalog, especially since our legacy documents are all in storage and will remain there

We are considering adding other external collections using Encore, such as Center for Research Libraries digital holdings.