Internet content as research data
![Page 1: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/1.jpg)
Internet Content as Research Data
Digital Humanities Australia March 2012, Canberra
Monica Omodei & Gordon Mohr
![Page 2: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/2.jpg)
Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
![Page 3: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/3.jpg)
Common Collection Strategies
• Crawl Scope & Focus
  1) Thematic/Topical (elections, events, global warming…)
  2) Resource-specific (video, pdf, etc.)
  3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end of life, closure crawls, national domains)
  5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
![Page 4: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/4.jpg)
Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research, University Library archives
![Page 5: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/5.jpg)
Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
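
The time-based URL search mentioned above can be illustrated with a small sketch. It assumes the commonly seen Wayback Machine replay address pattern (a prefix, a 14-digit timestamp, then the original URL); the prefix constant and the function name are illustrative choices, not something stated on the slide.

```python
from datetime import datetime

# Assumption: replay URLs follow the pattern <prefix>/<YYYYMMDDhhmmss>/<original URL>.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL asking for the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Ask for the National Library of Australia home page as of 1 March 2005.
    print(wayback_url("http://www.nla.gov.au/", datetime(2005, 3, 1)))
```

The archive decides which capture is nearest to the requested time, so the page that comes back may carry a different timestamp from the one requested.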
![Page 6: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/6.jpg)
Internet Archive’s Web Archive
Negatives
– Because of size, can’t search by keyword
– Because of size, fully automated – QA not possible
![Page 7: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/7.jpg)
Common Use Cases for IA’s web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
![Page 8: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/8.jpg)
![Page 9: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/9.jpg)
![Page 10: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/10.jpg)
![Page 11: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/11.jpg)
Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Wholesale extraction, transformation, and analysis of web data is cheap and easy
• commoncrawl.org/data/accessing-the-data/
![Page 12: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/12.jpg)
Common Crawl
Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API
![Page 13: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/13.jpg)
Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html
![Page 14: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/14.jpg)
![Page 15: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/15.jpg)
Pandora Archive
• Negatives
– Labour intensive so small
– Significant content missed because permission to copy refused
• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
![Page 16: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/16.jpg)
Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K
![Page 17: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/17.jpg)
![Page 18: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/18.jpg)
![Page 19: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/19.jpg)
![Page 20: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/20.jpg)
![Page 21: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/21.jpg)
.au Domain Annual Snapshots
• Annual crawls since 2005 commissioned from Internet Archive
• Includes sites on servers located in Australia as well as .au domain
• Robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full text search – tailored to archive search
• Separate .gov crawl publicly accessible soon
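
The robots.txt policy described on this slide can be sketched with Python's standard `urllib.robotparser`. This is a simplified illustration, not the crawler's real logic: the sample rules, the `examplebot` agent name, and the idea of spotting inline images and stylesheets by file suffix are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Assumption: inline resources are recognised by suffix; a real crawler
# would know which resources were embedded in a page it had fetched.
INLINE_SUFFIXES = (".css", ".png", ".jpg", ".jpeg", ".gif")


def may_fetch(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Respect robots.txt, except for inline images and stylesheets."""
    if url.lower().endswith(INLINE_SUFFIXES):
        return True
    return parser.can_fetch(agent, url)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/report.html",
        "http://example.org/private/style.css",
    ):
        print(url, may_fetch(parser, "examplebot", url))
```

Fetching the inline resources even when they are disallowed lets an archived page render as it originally appeared.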
![Page 22: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/22.jpg)
Australian web domain crawls
| Year | 2005 | 2006 | 2007 | 2008 | 2009 | 2011 |
|---|---|---|---|---|---|---|
| Files | 185 million | 596 million | 516 million | 1 billion | 765 million | 660 million |
| Hosts crawled | 811,523 | 1,046,038 | 1,247,614 | 3,038,658 | 1,074,645 | 1,346,549 |
| Size (TBs) | 6.69 | 19.04 | 18.47 | 34.55 | 24.29 | 30.71 |
![Page 23: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/23.jpg)
Internet Memory Foundation Archive
• internetmemory.org/en/
• No keyword search yet – only URL
• Number of European partners
![Page 24: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/24.jpg)
![Page 25: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/25.jpg)
Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
![Page 26: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/26.jpg)
Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas …. and many more
• WebCite - webcitation.org (citation service archive)
![Page 27: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/27.jpg)
Bringing Archives Together
• Common standard and APIs
• Memento project
![Page 28: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/28.jpg)
Create your own Archive
• Use a subscription service
• Build your own archive using open-source crawler Heritrix and standard file format .warc
• Use web citation services that create archive copies as you bookmark pages
![Page 29: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/29.jpg)
Subscription Services
• archive-it.org (service operated by non-profit Internet Archive since 2006)
• archivethe.net (service operated by non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
• OCLC Harvester Service - oclc.org/webharvester/overview/default.htm
![Page 30: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/30.jpg)
![Page 31: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/31.jpg)
Install web archiving system locally
• Easy-to-deploy web archiving toolkit not yet available (that meets web archive standards)
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – needs IT systems engineers to set up though
• Archives can be deposited with the NLA for long-term preservation
![Page 32: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/32.jpg)
'Memento': adding time to the web
• Protocol and browser add-on (MementoFox)
• Aids discovery, aggregation of page histories
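
As a rough sketch of the protocol side of Memento: a client asks for a page "as of" some moment by sending an `Accept-Datetime` request header with an HTTP-style date. The helper below only builds that header. Its name is hypothetical, and the address of the service the header would be sent to is left out because it depends on the archive or aggregator being queried.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build request headers asking for a page as it was at `when`.

    `when` must be timezone-aware; it is converted to UTC first because
    HTTP dates are expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    print(memento_headers(datetime(2012, 3, 1, tzinfo=timezone.utc)))
    # {'Accept-Datetime': 'Thu, 01 Mar 2012 00:00:00 GMT'}
```

These headers could then be attached to an ordinary HTTP request, and the service answering it would redirect to the archived copy closest to the requested time.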
![Page 33: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/33.jpg)
Web Data Mining & Analysis – What is it? Why Do It?
Innovation is increasingly driven by large-scale data analysis
• Need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…
![Page 34: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/34.jpg)
Platform & Toolkit: Overview
• Software
– Apache Hadoop
– Apache Pig
• Data/File format
– WARC
– CDX
– WAT (new!)
![Page 35: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/35.jpg)
Apache Hadoop
• HDFS
– Distributed storage
– Durable, default 3x replication
– Scalable: Yahoo! 60+PB HDFS
• MapReduce
– Distributed computation
– You write Java functions
– Hadoop distributes work across cluster
– Tolerates failures
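
To make the MapReduce idea concrete, here is a single-machine sketch of the two functions a job author supplies. It is written in Python for brevity, whereas Hadoop itself expects Java functions as the slide says; the function names and the host-counting task are illustrative assumptions, and the `run_job` loop merely stands in for the grouping and distribution that Hadoop performs across a cluster.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_host(url):
    """Map step: emit a (host, 1) pair for one captured URL."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_count(host, counts):
    """Reduce step: sum all the counts emitted for one host."""
    return host, sum(counts)


def run_job(urls):
    """Local stand-in for the framework: group map output by key, then reduce."""
    groups = defaultdict(list)
    for url in urls:
        for key, value in map_host(url):
            groups[key].append(value)
    return dict(reduce_count(key, values) for key, values in groups.items())


if __name__ == "__main__":
    captured = [
        "http://a.example/page1",
        "http://a.example/page2",
        "http://b.example/",
    ]
    print(run_job(captured))  # {'a.example': 2, 'b.example': 1}
```

Because each map call looks at one record and each reduce call at one key, the framework is free to run them on different machines and to rerun any that fail.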
![Page 36: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/36.jpg)
File formats and data: WARC
![Page 37: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/37.jpg)
File formats and data: CDX
• Index for Wayback Machine: used to browse WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
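
Since a CDX file is space-delimited text, reading one index line can be sketched in a few lines. The column order below is an assumption chosen to match the fields listed on the slide, and the sample line is made up; a real CDX file should be checked for its own column layout before this is relied on.

```python
from typing import NamedTuple, Optional


class CdxEntry(NamedTuple):
    url_key: str           # normalised URL used for sorting and lookup
    timestamp: str         # capture time as YYYYMMDDhhmmss
    original_url: str
    content_type: str
    status: Optional[int]  # HTTP response code, None if not recorded
    digest: str            # content digest


def parse_cdx_line(line: str) -> CdxEntry:
    """Parse one space-delimited index line (assumed six-column layout)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    key, timestamp, url, content_type, status, digest = fields[:6]
    code = int(status) if status.isdigit() else None
    return CdxEntry(key, timestamp, url, content_type, code, digest)


if __name__ == "__main__":
    # Made-up sample line for illustration only.
    sample = "au,gov,nla)/ 20110301120000 http://www.nla.gov.au/ text/html 200 EXAMPLEDIGEST"
    print(parse_cdx_line(sample))
```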
![Page 38: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/38.jpg)
File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not preservation format
• Data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work-in-progress: we want your feedback
![Page 39: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/39.jpg)
File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
File formats & data:
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB
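
The structure described on this slide (a WARC metadata record whose payload is JSON and whose WARC-Refers-To header points back at the original record) can be sketched as follows. The parser handles only a single uncompressed record, and the sample record, its UUID, and its JSON keys are invented for illustration; real WAT payloads have their own schema.

```python
import json


def parse_wat_record(raw: bytes):
    """Split one uncompressed WARC record into version, headers, JSON payload."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    payload = json.loads(rest[:length].decode("utf-8"))
    return version, headers, payload


def build_sample_record() -> bytes:
    """Build a made-up record; the JSON keys here are hypothetical."""
    body = json.dumps({"url": "http://example.org/", "links": ["/about"]}).encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: metadata\r\n"
        "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}"
    ).encode("utf-8")
    return head + b"\r\n\r\n" + body + b"\r\n\r\n"


if __name__ == "__main__":
    version, headers, payload = parse_wat_record(build_sample_record())
    print(version, headers["WARC-Type"], headers["WARC-Refers-To"])
    print(payload["links"])
```

Because the payload is plain JSON, an analysis job can read link and metadata fields without touching the much larger original WARC content.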
![Page 40: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/40.jpg)
Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
![Page 41: Internet content as research data](https://reader033.fdocuments.net/reader033/viewer/2022052904/557d82f0d8b42a75548b52fa/html5/thumbnails/41.jpg)
Contacts
• Webarchive @ nla.gov.au
• Secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about Archive-It service: http://www.archive-it.org/contact-us
• momodei @ nla.gov.au
• gojomo @ xavvy.com