The Hiberlink Project is supported by the Andrew W. Mellon Foundation

Hiberlink – Towards Time Travel for the Scholarly Web
July 25th 2013, Indianapolis, IN, USA

Martin Klein [email protected] @mart1nkle1n
Robert Sanderson [email protected] @azaroth42
Herbert Van de Sompel [email protected] @hvdsomp

http://www.hiberlink.org/
http://www.mementoweb.org/

Page 2: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Hiberlink Project and Partners

LANL
• Herbert Van de Sompel
• Rob Sanderson
• Martin Klein

U. Edinburgh
• Claire Grover
• Beatrix Alex
• Richard Tobin
• Adam Zhou

EDINA
• Peter Burnhill
• Christine Rees
• Muriel Mewissen
• Tim Strickland
• Neil Mayo

Two year project funded by Andrew W. Mellon Foundation

Page 3: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Problem Statement

Preservation of formal scholarly output is (relatively) well understood.

Preservation of the resources that make up the context for that research is not:

• Datasets
• Software
• Workflows
• Videos, Slides
• Project and Demonstration web sites
• AJAX
• …

Page 4: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Pilot Study

To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?

Participants: LANL, UNT, arXiv

Paper: http://arxiv.org/abs/1105.3459

Contributions:
• Much larger scale than any previous study, 162,052 unique URLs
• Automatically searched multiple archives for all URLs, rather than manually for a small subset
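
The pilot study's question amounts to two checks per referenced URL: does it still resolve at its original location, and does any web archive hold a copy? The sketch below is a minimal illustration and not the study's actual tooling. The names UrlStatus and classify, the category labels, and the two injected check functions are assumptions made for this example. In practice is_live would issue an HTTP request and is_archived would query one or more web archives.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class UrlStatus:
    """Outcome of the two checks for one referenced URL (hypothetical type)."""
    url: str
    live: bool      # still available at its original URL
    archived: bool  # at least one archived copy was found

    @property
    def category(self) -> str:
        if self.live and self.archived:
            return "live+archived"
        if self.live:
            return "live only"
        if self.archived:
            return "archived only"
        return "lost"


def classify(
    urls: Iterable[str],
    is_live: Callable[[str], bool],
    is_archived: Callable[[str], bool],
) -> List[UrlStatus]:
    """Check each unique URL once, keeping first-seen order."""
    unique_urls = dict.fromkeys(urls)  # de-duplicates while preserving order
    return [UrlStatus(u, is_live(u), is_archived(u)) for u in unique_urls]


if __name__ == "__main__":
    # Stand-in lookups so the sketch runs without network access.
    live = {"http://example.org/a", "http://example.org/b"}
    archived = {"http://example.org/b", "http://example.org/c"}
    referenced = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.org/c",
        "http://example.org/d",
    ]
    for status in classify(referenced, live.__contains__, archived.__contains__):
        print(status.url, status.category)
```

Passing the checks in as functions keeps the classification logic separate from any particular archive or HTTP client, which matters when multiple archives are searched automatically.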

Page 5: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Pilot Study: Results

UNT
• 72% in archives and/or still exist
• High proportion of archived URLs, possibly due to academic level and general disciplines

arXiv
• 78% in archives and/or still exist
• 45% still exist, but not archived! Possibly due to high value, but very discipline specific references
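
The two kinds of figures reported here group per-URL outcomes differently: "in archives and/or still exist" covers every URL that is not lost, while "still exist, but not archived" covers only live URLs without an archived copy. The sketch below shows how such percentages could be derived. The function name summarise, the category labels, and the toy input are assumptions for illustration and are not the study's data.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise(categories: Iterable[str]) -> Dict[str, float]:
    """Turn per-URL category labels into the two headline percentages."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no URLs to summarise")
    return {
        # Everything except URLs that are neither live nor archived.
        "in_archives_and_or_still_exist_pct": 100.0 * (total - counts["lost"]) / total,
        # Live at the original URL, but no archived copy: at risk of loss.
        "still_exist_not_archived_pct": 100.0 * counts["live only"] / total,
    }


if __name__ == "__main__":
    # Toy labels only; these are NOT the pilot study's results.
    toy = ["live+archived", "live+archived", "live only", "lost"]
    print(summarise(toy))
```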

Page 6: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Hiberlink: Quantify Full Extent of the Problem

To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?

Redo the same experiment with…
• Even larger dataset with millions of papers and URLs
• Text mining processes for URL extraction
• Track location of URL (citations, footnote, text, etc)
• Evaluation of extraction via gold standard dataset
• Determine type of resource referenced
• Track type of publication (journal, thesis, report, etc)
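
As a rough illustration of URL extraction with location tracking and gold-standard evaluation, the sketch below pulls URLs out of already-segmented paper text and scores the result against a hand-labelled set. It is a deliberately simple regular-expression approach, not the project's text mining pipeline. The pattern, the section names, and the helpers extract_urls and precision_recall are all assumptions made for this example.

```python
import re
from typing import Dict, List, NamedTuple, Set, Tuple

# Simplified pattern: stops at whitespace, quotes, angle brackets and closing brackets.
URL_PATTERN = re.compile(r"""https?://[^\s<>"'\]\)]+""")
TRAILING_PUNCTUATION = ".,;:!?"


class UrlMention(NamedTuple):
    url: str
    location: str  # e.g. "citations", "footnote", "text"


def extract_urls(sections: Dict[str, str]) -> List[UrlMention]:
    """Find URLs in each section and remember where each one appeared."""
    mentions = []
    for location, text in sections.items():
        for match in URL_PATTERN.finditer(text):
            # Sentence punctuation often sticks to the end of a URL in running text.
            url = match.group().rstrip(TRAILING_PUNCTUATION)
            mentions.append(UrlMention(url, location))
    return mentions


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare extracted URLs with a hand-labelled gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    paper = {
        "text": "See the demo at http://example.org/demo.",
        "footnote": "Data: https://example.org/data.csv; code (http://example.org/code)",
        "citations": "[3] Project site, http://example.org/.",
    }
    found = extract_urls(paper)
    for mention in found:
        print(mention.location, mention.url)
    gold = {"http://example.org/demo", "https://example.org/data.csv"}
    print(precision_recall({m.url for m in found}, gold))
```

Stripping trailing punctuation is a heuristic: it also trims URLs that legitimately end in one of those characters, which is exactly the kind of error a gold standard dataset helps measure.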

Page 7: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Hiberlink: Propose Solutions (1)

We propose two active archiving solutions of resources referenced from scholarly papers to ensure that the scholarly record remains unbroken

1. Active Crawling:
• Run extraction routines at repositories, publishers, or third parties via text mining agreements or open access publications
• Feed the URL seed list to existing web crawlers, such as the Internet Archive
• IA (and others) already Memento compliant
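
The step between extraction and crawling can be pictured as turning raw extracted URLs into a clean seed list. The sketch below is one possible way to do that, assuming that seeds should be limited to http and https, that fragments can be dropped, and that the list is handed over as one URL per line. These choices and the function names normalise and build_seed_list are assumptions for illustration, not a format required by any particular crawler.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Lower-case scheme and host, default the path to "/", drop the fragment.

    Assumes the URL carries no user:password part, since the whole
    network location is lower-cased.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def build_seed_list(urls: Iterable[str]) -> List[str]:
    """Return a sorted, de-duplicated list of crawlable seed URLs."""
    seeds = set()
    for url in urls:
        if urlsplit(url.strip()).scheme.lower() in ("http", "https"):
            seeds.add(normalise(url))
    return sorted(seeds)


if __name__ == "__main__":
    extracted = [
        "HTTP://Example.org",
        "http://example.org/#top",
        "ftp://example.org/file",
        "https://example.org/a?b=1",
    ]
    # One URL per line is a common, simple hand-off format for a seed list.
    print("\n".join(build_seed_list(extracted)))
```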

Page 8: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Hiberlink: Propose Solutions (2)

2. Transactional Archiving:
• Willing server forks responses for resources and sends to both browser and to archive for preservation
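
To make the idea of forking a response concrete, the sketch below wraps a Python WSGI application so that each response is returned to the client and also handed to an archive callback. It is an illustrative assumption about how a willing server might do this, not a description of any existing transactional archiving product. The class ForkingMiddleware and the archive_sink callback are hypothetical names.

```python
from typing import Callable, Dict, List


class ForkingMiddleware:
    """WSGI middleware that copies every response to an archive sink."""

    def __init__(self, app: Callable, archive_sink: Callable) -> None:
        self.app = app
        self.archive_sink = archive_sink

    def __call__(self, environ: Dict, start_response: Callable) -> List[bytes]:
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            # Remember status and headers, then pass them on unchanged.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        result = self.app(environ, capturing_start_response)
        try:
            # Buffers the whole body in memory: simple, but unsuitable for
            # very large or streamed responses.
            body = b"".join(result)
        finally:
            close = getattr(result, "close", None)
            if close is not None:
                close()

        # Fork: one copy goes to the archive, the same bytes go to the client.
        self.archive_sink(
            environ.get("PATH_INFO", ""),
            captured.get("status"),
            captured.get("headers", []),
            body,
        )
        return [body]


if __name__ == "__main__":
    def demo_app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello ", b"world"]

    archive = []
    wrapped = ForkingMiddleware(demo_app, lambda *record: archive.append(record))
    body = b"".join(wrapped({"PATH_INFO": "/paper"}, lambda status, headers, exc_info=None: None))
    print(body, archive)
```

A production version would more likely send the copy to the archive asynchronously, so that a slow or unavailable archive never delays the response to the browser.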

Page 9: The  Hiberlink  Project is supported by  the Andrew  W. Mellon Foundation

Summary

2011 pilot study showed:
• Significant problem!
• Random archiving by web crawlers is not enough

Hiberlink project will:
• Fully quantify the extent to which web resources that form the context of scholarly output are available and archived
• Propose active solutions to prevent the loss of further resources
• Use Memento for both research and access
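
For readers unfamiliar with Memento, access to a past version of a resource works by sending a request with an Accept-Datetime header to a TimeGate, which then points to an archived copy close to that date. The sketch below only builds such a request and does not send it. The function name timegate_request, the placeholder TimeGate address, and the convention of appending the original URL to the TimeGate base are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple


def timegate_request(
    timegate_base: str, original_url: str, when: datetime
) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a Memento datetime negotiation request."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    when_utc = when.astimezone(timezone.utc)
    headers = {
        # Accept-Datetime uses the HTTP date format, always expressed in GMT.
        "Accept-Datetime": format_datetime(when_utc, usegmt=True),
    }
    # Assumed convention: the original URL is appended to the TimeGate base.
    return timegate_base + original_url, headers


if __name__ == "__main__":
    url, headers = timegate_request(
        "https://timegate.example.org/",  # hypothetical TimeGate address
        "http://www.hiberlink.org/",
        datetime(2013, 7, 25, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```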
