dd 2012 02 29 - Archive
![Page 1: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/1.jpg)
Digital Humanities At Scale: HathiTrust Research Center
Beth Plale, Co-Director, HathiTrust Research Center
Professor, School of Informatics and Computing, Indiana University
![Page 2: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/2.jpg)
New questions require computational access to large corpus
Investigate the way in which concepts of philosophy are used in physics through
– Extracting argumentative structure from a large dataset using a mixture of automated and social computing techniques
– Capture evidence for conjecture that availability of such analyses will enable innovative interdisciplinary research
– Digging into Data 2012 award, Colin Allen, IU
![Page 3: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/3.jpg)
New questions, cont.
Document, through software text analysis techniques, the appearance, frequency and context of terms, concepts and usages related to human rights in a selection of English-language novels.
– Ronnie Lipschutz of UCSC is currently doing this analysis on one of Jane Austen’s books. He’d like to extend the work to encompass a far larger corpus.
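The kind of analysis described on this slide (appearance, frequency and context of terms) can be sketched in a few lines. This is a minimal illustration only, not the method used in the project named above; the function names, the context window size and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_report(text, terms, window=3):
    """Return {term: (frequency, contexts)} for each term of interest.

    A context is the term plus up to `window` tokens on either side.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    report = {}
    for term in terms:
        contexts = [
            " ".join(tokens[max(0, i - window): i + window + 1])
            for i, tok in enumerate(tokens)
            if tok == term
        ]
        report[term] = (counts[term], contexts)
    return report

# Made-up sample text, used only to exercise the functions.
sample = ("Every person has a right to liberty. "
          "No right is secure where liberty is denied.")
report = term_report(sample, ["right", "liberty"], window=2)
for term, (freq, contexts) in report.items():
    print(term, freq, contexts)
```

Scaling this from one novel to a large corpus is mostly a matter of where the loop runs, which is the point the later slides make about moving computation to the data.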
![Page 4: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/4.jpg)
New questions, cont.
Identify all 18th century published books in the HathiTrust corpus, and apply topic modeling to create consistent overall subject metadata
![Page 5: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/5.jpg)
GOOGLE DIGITAL HUMANITIES AWARDS RECIPIENT
INTERVIEWS REPORT PREPARED FOR THE HATHITRUST RESEARCH CENTER
VIRGIL E. VARVEL JR. ANDREA THOMER
CENTER FOR INFORMATICS RESEARCH IN SCIENCE AND SCHOLARSHIP
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Fall 2011
![Page 6: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/6.jpg)
The study
• Dr. John Unsworth, a representative of HTRC, distributed invitations to participate in this study via email to 22 researchers given Google Digital Humanities Research Awards.
• Interviews were conducted via telephone, Skype®, or face-to-face, and all were audio recorded. All participants agreed to the IRB permission statement via email.
• A semi-structured interview protocol was developed with input from HTRC to elicit responses from participants on the primary goals of the project.
![Page 7: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/7.jpg)
Select findings
• Optical Character Recognition
– Steps should be taken to improve OCR quality if and when possible
– Scalability of scanned image viewing is necessary for OCR reference and correction
– Metadata should expose the quality of OCR
![Page 8: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/8.jpg)
Select Findings
• “Would like better metadata about text languages, particularly in multi-text documents and on language by sections within text. Automatic language identification functions would be helpful, but human-created metadata is preferred, particularly for documents with low OCR quality.”
• “primary issue was retrieving the bibliographic records in usable form, unparsed by Google. […] process took 10 months to design the queries and get the data.”
![Page 9: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/9.jpg)
HathiTrust Research Center: dedicated to provision of computational access to comprehensive body of published works for scholarship and education
![Page 10: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/10.jpg)
• HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
• The bigger the data, the less able we are to move it to a researcher’s desktop machine.
• Future research on large collections will require that computation moves to the data, not vice versa.
![Page 11: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/11.jpg)
Goal of HTRC
• HTRC will provide a persistent and sustainable structure to enable original and cutting edge research.
• Stimulate the development in community of new functionality and tools to enable new discoveries that would not be possible without the HTRC.
![Page 12: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/12.jpg)
Goal, cont.
• Leverage data storage and computational infrastructure at Indiana U and UIUC.
• Provision secure computational and data environment for scholars to perform research using HathiTrust Digital Library.
• Center will break new ground, allowing scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within confines of current U.S. copyright law.
![Page 13: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/13.jpg)
02/29/12
Description of Application Space in HTRC
Prepared by Jiaan
[Diagram: 1. The Whole Diagram. Layer labels: User; Applications; Basic Application Units; Basic Operations. Component labels: Tag Cloud; Entities Timeline; Text Summarizer; Readability Test; Term Usage; Concept; NLP PoS; Concatenate Text; Text Extractor; NLP Tokenizer; NLP Sentence Detector; Token Filter; NLP Name Entity; NLP Sentence Tokenizer; Sentiment Tracking; Naive Bayes; Decision Tree; Author, document, keyword relationship; Topic Modeling; Advanced Search; Track a certain topic (e.g. human rights); Simple Statistic; Classification; Tracking Trend; Open, Read, Seek, Close; File System API; Network Graph; Search; Semantic Relation; Metadata; Metadata Access; Latent Semantic Analysis]
HathiTrust Research Center
• Analysis on 10,000,000+ volumes of HathiTrust digital repository
• Founded 2011
• Working with OCR
• Large-scale data storage and access
• HPC and Cloud
Type of data: public domain and copyrighted works
Estimated initial size: 300-500 TB
– Solr indexes (3 indexes): 36 TB
– File system rsync: 12 TB
– Fast volume access store: 30 TB
– Versions of collection (5): 120 TB
– Volume store indexes: 100 TB
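As a quick check on the storage figures on this slide, the itemized components can be summed and compared with the estimated initial size. The numbers are copied from the slide; the variable names are illustrative only.

```python
# Component sizes in TB, as listed on the slide.
components_tb = {
    "Solr indexes (3 indexes)": 36,
    "File system rsync": 12,
    "Fast volume access store": 30,
    "Versions of collection (5)": 120,
    "Volume store indexes": 100,
}

total_tb = sum(components_tb.values())
estimate_low_tb, estimate_high_tb = 300, 500  # "Estimated initial size" on the slide

print(f"Itemized total: {total_tb} TB")
print(f"Estimated range: {estimate_low_tb}-{estimate_high_tb} TB")
```

The itemized components add up to roughly the lower end of the estimated range, which suggests the upper end leaves headroom for growth; the slide itself does not say what the remainder covers.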
![Page 14: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/14.jpg)
Corpus Usage Patterns
[Diagram: stacks of volumes illustrating three access patterns]
• Access by chapter (e.g. Chapter 1)
• Access by page (e.g. Page IV)
• Access by special contents (table of contents, index, glossary)
![Page 15: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/15.jpg)
HTRC Timeline
• Phase I: an 18-month development cycle
– Began 01 July 2011
– Demo of capability June 2012 (12-month mark)
• Phase II: broad availability of resource, begins 01 January 2013
![Page 16: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/16.jpg)
Governance
• HTRC Executive Management Team: Beth Plale (IU), chair; Robert McDonald (IU), Marshall Scott Poole (UIUC), J. Stephen Downie (IU), John Unsworth (Brandeis Univ)
• Advisory board
• MOUs guide IU-UIUC interaction and HTRC-HT interaction
• Laine Farley, California Digital Library and HT Executive Committee, is liaison to HTRC
• Google Public Domain agreement – in process of signing (IU and UIUC individually executing)
![Page 17: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/17.jpg)
What is it architecturally?
• Web services architecture and protocols
• Registry of services and algorithms
• Solr full text indexes
• NoSQL store as volume store
• Large scale computing
• OpenID authentication
• Portal front-end, programmatic access
• SEASR mining algorithms
![Page 18: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/18.jpg)
[Diagram: HTRC architecture. Component labels: Agent framework (agent instances); Page/volume tree (file system); Volume store (Cassandra); SEASR analytics service; Web portal; Desktop SEASR client; Blacklight; Task deployment; WSO2 registry (services, collections, data capsule images); Solr index; HathiTrust corpus; rsync; University of Michigan; HTRC Data API v0.1; FutureGrid; NCSA local resources; Penguin on Demand; Programmatic access; CI logon (NCSA); Access control (e.g. Grouper); Meandre Orchestration; Non-consumptive data capsules; NCSA HPC resources]
![Page 19: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/19.jpg)
One access point: through SEASR
![Page 20: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/20.jpg)
SEASR: workflow used to generate tag cloud
Query = title:Abraham Lincoln AND publishDate:1920
![Page 21: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/21.jpg)
Query = author:Withers, Hartley
![Page 22: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/22.jpg)
Workflow invokes HTRC Solr index and HTRC data API.
![Page 23: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/23.jpg)
HTRC Solr index
• A test version of the Solr Data API 0.1 is available. It:
– Preserves all query syntax of the original Solr,
– Prevents users from modifying the index,
– Hides the host machine and port number HTRC Solr is actually running on,
– Creates an audit log of requests, and
– Provides a filtered term vector for words starting with a user-specified letter
• Test version service available soon
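Three of the behaviours listed on this slide (blocking modification, audit logging, and filtering a term vector by first letter) can be sketched as follows. This is a hypothetical illustration, not the HTRC implementation: the handler whitelist, the function names and the term-vector shape (a plain dict of term to count) are all assumptions.

```python
import logging
from urllib.parse import urlsplit

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("solr_proxy_audit")

# Assumption: only read-style handlers are exposed; everything else is refused.
ALLOWED_HANDLERS = {"select"}

def is_allowed(request_url):
    """Record the request in the audit log, then allow it only if it
    targets a whitelisted (read-only) handler."""
    audit_log.info("request: %s", request_url)
    path = urlsplit(request_url).path.rstrip("/")
    handler = path.rsplit("/", 1)[-1]
    return handler in ALLOWED_HANDLERS

def filter_term_vector(term_vector, letter):
    """Keep only the terms that start with the given letter (case-insensitive)."""
    prefix = letter.lower()
    return {term: value for term, value in term_vector.items()
            if term.lower().startswith(prefix)}

print(is_allowed("/solr/select/?q=ocr:war"))
print(is_allowed("/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true"))
print(filter_term_vector({"war": 3, "peace": 1, "Washington": 2}, "w"))
```

A whitelist is used rather than a blacklist so that any handler not explicitly approved is refused by default.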
![Page 24: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/24.jpg)
HTRC Solr index: most used API patterns
• Query: http://coffeetree.cs.indiana.edu:9994/solr/select/?q=ocr:war
• Faceted search: http://coffeetree.cs.indiana.edu:9994/solr/select/?q=*:*&facet=on&facet.field=genre
• Get frequency and offset of words starting with a letter: http://coffeetree.cs.indiana.edu:9994/solr/getfreqoffset/inu.32000011575976/w
• Banned modification request: http://localhost:8983/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true
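The query and faceted-search patterns on this slide can be assembled programmatically. The sketch below only builds the URL strings and does not contact any server; the base URL is copied from the slide and refers to a 2012 test host that may no longer exist, and the helper name `select_url` is an assumption for illustration.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide (a 2012 test host; may no longer be reachable).
BASE = "http://coffeetree.cs.indiana.edu:9994/solr"

def select_url(query, extra=None):
    """Build a /select URL from a Solr query string and optional extra parameters."""
    params = {"q": query}
    params.update(extra or {})
    # Keep ':' and '*' unescaped so the URLs read like the ones on the slide.
    return f"{BASE}/select/?{urlencode(params, safe=':*')}"

query_url = select_url("ocr:war")
facet_url = select_url("*:*", {"facet": "on", "facet.field": "genre"})
print(query_url)
print(facet_url)
```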
![Page 25: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/25.jpg)
Requirement: run big jobs on large scale (free or nearly free) compute resources
![Page 26: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/26.jpg)
[Diagram: Data Capsules VM Cluster. Labels: Provide secure VM; Scholars; Remote Desktop or VNC; Submit secure capsule map/reduce Data Capsule images to FutureGrid, receive and review results; FutureGrid Computation Cloud; HTRC Volume Store and Index; Secure Data Capsule]
![Page 27: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/27.jpg)
HTRC block store
• Take an experiment with 5M volumes (4 TB uncompressed data). Block abstraction at volume size or larger.
• User buffer: secondary data the user submits for use in computation.
• MapReduce head node: partitions 4 TB evenly amongst 1000 nodes. Trusted because run as user HTRC. User buffer needs to be copied to each map node.
• Map nodes (0) to (999): 4 GB chunk + user buffer each.
• Reduce: single source of write from map node, < 1 GB per map node.
• Provenance capture (through Karma provenance tool)
[Diagram labels: HTRC; FutureGrid; Data Capsule nodes; Read only]
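The chunk size on this slide follows directly from the partitioning arithmetic. The sketch below reproduces it using the slide's figures; it assumes decimal units (1 TB = 10^12 bytes), and the upper bound on reduce input is a derived figure, not one stated on the slide.

```python
TB = 10**12  # assumption: decimal units
GB = 10**9

corpus_bytes = 4 * TB       # 5M volumes, uncompressed (from the slide)
volumes = 5_000_000
map_nodes = 1000
max_map_output = 1 * GB     # each map node writes less than this (from the slide)

chunk_bytes = corpus_bytes // map_nodes          # data handed to each map node
avg_volume_bytes = corpus_bytes // volumes       # average size of one volume
reduce_input_bound = map_nodes * max_map_output  # upper bound on data reaching reduce

print(f"Chunk per map node: {chunk_bytes / GB:.0f} GB")
print(f"Average volume size: {avg_volume_bytes / 10**6:.1f} MB")
print(f"Reduce input is below: {reduce_input_bound / TB:.0f} TB")
```

Since the block abstraction is at volume size or larger, a 4 GB chunk holds about 5,000 average-sized volumes.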
![Page 28: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/28.jpg)
![Page 29: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/29.jpg)
![Page 30: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/30.jpg)
Creating Functionality around Non-consumptive Research
Key notions
![Page 31: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/31.jpg)
Notion of a Proxy
• “researcher” in Google Book Settlement definition means a human
• An avatar is a virtual life form acting on behalf of a person
• Computer program acts on behalf of a person
• Computer program (i.e., proxy) must be able to read Google texts, otherwise computational analysis is impossible to carry out.
• So non-consumption applied to books applies to human consumption.
![Page 32: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/32.jpg)
Non-consumptive research – implementation definition
• No action or set of actions on the part of users, either acting alone or in cooperation with other users, over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.
• Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy, which is not a user. Users are human beings.
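One way to read this definition operationally is as a single ledger of what has been released per page, shared across all users and sessions, so that neither collusion nor accumulation over time can push the released material past a cap. The sketch below is purely hypothetical and is not how HTRC enforces the definition; the class name, the token-position bookkeeping and the 10% cap are all assumptions chosen for illustration.

```python
class ExposureLedger:
    """Tracks which token positions of each page have been released.

    The ledger is deliberately global rather than per user or per session,
    so splitting requests across users or over time does not help.
    """

    def __init__(self, max_fraction=0.1):
        # Assumption: releasing at most this fraction of a page's tokens
        # is treated as too little to reassemble the page.
        self.max_fraction = max_fraction
        self.released = {}  # page_id -> set of token positions already released

    def request(self, page_id, positions, page_length):
        """Release the positions only if the cumulative total stays within the cap."""
        seen = self.released.setdefault(page_id, set())
        combined = seen | set(positions)
        if len(combined) / page_length > self.max_fraction:
            return False  # refuse; nothing new is recorded as released
        seen.update(positions)
        return True

ledger = ExposureLedger(max_fraction=0.1)
first = ledger.request("page-1", range(0, 5), page_length=100)     # e.g. user A
second = ledger.request("page-1", range(5, 10), page_length=100)   # e.g. user B, later session
third = ledger.request("page-1", range(10, 12), page_length=100)   # would exceed the cap
print(first, second, third)
```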
![Page 33: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/33.jpg)
Notion of Algorithm
• Computational analysis is accomplished through algorithms
– An algorithm carries out one coherent analysis task: sort list of words, compute word frequency for text
• Researcher’s computational analysis often requires running a sequence of algorithms. Important distinction for implementing non-consumptive research is “who owns the algorithm?”
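The two example tasks named on this slide (sorting a list of words, computing word frequency) and the idea of running them in sequence can be illustrated with a short sketch. The function names and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequency(tokens):
    """One coherent task: count how often each word occurs."""
    return Counter(tokens)

def sort_words(words):
    """Another coherent task: sort a collection of words alphabetically."""
    return sorted(words)

def run_sequence(text):
    """A researcher's analysis as a sequence of algorithms:
    tokenize, count, then list the distinct words in sorted order."""
    frequencies = word_frequency(tokenize(text))
    return [(word, frequencies[word]) for word in sort_words(frequencies)]

result = run_sequence("the cat and the hat")
print(result)
```

Each function here is small enough to be categorized on its own, which matters for the ownership question the slide raises: a center-supplied algorithm has a known category, while a user-supplied one does not.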
![Page 34: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/34.jpg)
Infrastructure for computational analysis
• When needing to support computation over a 10+M volume corpus, algorithms must be co-located with data.
• That is, algorithms must be located where the repository is located, and not on the user’s desktop.
• When computational analysis is to be non-consumptive, likely one location for the data.
![Page 35: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/35.jpg)
Who owns algorithm?
• HTRC owns the algorithms,
– use Software Environment for the Advancement of Scholarly Research (SEASR) suite of algorithms
– we are examining security requirements of users, algorithms, and data
![Page 36: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/36.jpg)
User owns and submits their algorithms
• HTRC recently received funding from the Alfred P. Sloan Foundation to prototype a “data capsule framework” that provisions for non-consumptive research.
• Founded on the principle of “trust but verify”. An informatics-savvy humanities scholar is given freedom to experiment with new algorithms on protected information, but technological mechanisms are in place to prevent undesirable behavior (leakage).
![Page 37: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/37.jpg)
Non-consumptive, user-owned algorithms infrastructure; requirements:
• Implements non-consumptive research
• Openness – users not limited to using a known set of algorithms
• Efficiency – not possible to analyze algorithms for conformance prior to running
• Low cost and scale – run at large scale and low cost to scholarly community of users
• Long-term value – adoption for other purposes
![Page 38: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/38.jpg)
[Diagram repeated from page 13: Description of Application Space in HTRC]
Categories of algorithms. Can fair use be determined based on categorization of algorithm? Or is all computational use fair use?
![Page 39: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/39.jpg)
Algo results fair use?
• Center supplied
– Easier because we know category of algorithm
• User supplied
– HTRC is not examining code, so open question
![Page 40: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/40.jpg)
Parting philosophy
• Finally, results of computational research that conforms to restrictions of non-consumptive research must belong to the researcher
![Page 41: dd 2012 02 29 - Archive](https://reader036.fdocuments.net/reader036/viewer/2022081410/6299300a4d4f5e146479a8ba/html5/thumbnails/41.jpg)
How to Engage
• Building partnerships with researchers and research communities is a key goal of the HathiTrust Research Center
• HTRC can give technical advice to researchers as they look for funding opportunities involving access to research data
• Upcoming “Fix the OCR and Metadata Shortage Community Challenge”: help us address a couple of key weaknesses of the HT corpus