
Page 1: Internet content as research data

Internet Content as Research Data

Digital Humanities Australia March 2012, Canberra

Monica Omodei & Gordon Mohr


Research Examples

• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history


Common Collection Strategies

• Crawl Scope & Focus
  1) Thematic/Topical (elections, events, global warming…)
  2) Resource-specific (video, pdf, etc.)
  3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end of life, closure crawls, national domains)
  5) Frequency-Based

• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia


Existing web archives

• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research, University Library archives


Internet Archive’s Web Archive

Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
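The time-based URL search can be illustrated with a short sketch. It assumes the Wayback Machine's public address pattern `https://web.archive.org/web/<timestamp>/<url>`, with the timestamp written as YYYYMMDDhhmmss. The function name and the example site are illustrative and not taken from the talk.

```python
from datetime import datetime


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine address for the capture closest to `when`.

    Assumes the public pattern web.archive.org/web/<YYYYMMDDhhmmss>/<url>.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{url}"


# Illustrative example: a site as captured around 1 March 2012.
print(wayback_url("http://example.com.au/", datetime(2012, 3, 1)))
```

The archive redirects such an address to the nearest capture it holds, so the timestamp does not need to match a capture exactly.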


Internet Archive’s Web Archive

Negatives
– Because of size, can’t search by keyword
– Because of size, fully automated - QA not



Common Use Cases for IA’s web archive

• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis


Common Crawl

• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy



Common Crawl

Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing, not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API


Pandora Archive

• Positives
  – Quality checked
  – Targeted Australian content with selection policy
  – Historical – started 1996
  – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
  – Keyword search
  – Publicly accessible
  – You can nominate Australian web sites for inclusion


Pandora Archive

• Negatives
  – Labour intensive, so small
  – Significant content missed because permission to copy refused
• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls


Pandora Archive Stats

• Size – 6.32 TB
• Number of Files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K


.au Domain Annual Snapshots

• Annual crawls since 2005 commissioned from Internet Archive
• Includes sites on servers located in Australia as well as .au domain
• Robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full text search – tailored to archive search
• Separate .gov crawl publicly accessible soon
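The robots.txt policy in this list can be sketched as follows. This is a minimal illustration using Python's standard `urllib.robotparser`, not the crawler configuration actually used for the snapshots. The function name, the user agent, the sample rules, and the extension-based test for "inline" resources are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumption: inline images and stylesheets are recognised by extension only.
INLINE_EXTENSIONS = (".css", ".png", ".jpg", ".jpeg", ".gif")


def may_crawl(robots_txt: str, user_agent: str, url: str, inline: bool = False) -> bool:
    """Return True if this (hypothetical) crawl policy allows fetching `url`."""
    # Inline images and stylesheets are fetched regardless of robots.txt.
    if inline and url.lower().endswith(INLINE_EXTENSIONS):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt rules for illustration.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(may_crawl(ROBOTS, "example-bot", "http://example.com.au/private/page.html"))
print(may_crawl(ROBOTS, "example-bot", "http://example.com.au/private/logo.png", inline=True))
```

Fetching embedded resources despite exclusion rules lets an archived page render as it originally appeared, which is presumably why the exception exists.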


Australian web domain crawls

Year            2005         2006         2007         2008         2009         2011
Files           185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled   811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TBs)      6.69         19.04        18.47        34.55        24.29        30.71


Internet Memory Foundation Archive

• No keyword search yet – only URL
• Number of European partners


Other National Archives

• List of International Internet Preservation Consortium member archives
• Some are whole domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)


Research Archives

• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
• … and many more
• WebCITE (citation service archive)


Bringing Archives Together

• Common standard and APIs
• Memento project


Create your own Archive

• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard file format .warc
• Use web citation services that create archive copies as you bookmark pages


Subscription Services

• Archive-It (service operated by non-profit Internet Archive since 2006)
• (service operated by non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service
• OCLC Harvester Service


Install web archiving system locally

• Easy-to-deploy web archiving toolkit (that meets web archive standards) not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation


'Memento': adding time to the web

Protocol and browser add-on (MementoFox)
• Aids discovery, aggregation of page histories
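As a rough sketch of how the protocol adds time to a request: a Memento client sends an `Accept-Datetime` header carrying an HTTP-date in GMT, and a Memento-aware server answers with the archived version closest to that moment. The helper name below is illustrative, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build request headers asking for the version of a page near `when`."""
    # Accept-Datetime expects an HTTP-date expressed in GMT.
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


print(memento_headers(datetime(2012, 3, 1, tzinfo=timezone.utc)))
```

These headers would be attached to an ordinary HTTP GET for the page's original URL, sent to a server or aggregator that supports the protocol.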


Web Data Mining & Analysis – What is it? Why Do It?

Innovation is increasingly driven by large-scale data analysis

• Need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…


Platform & Toolkit: Overview

• Software
  – Apache Hadoop
  – Apache Pig
• Data/File format
  – WARC
  – CDX
  – WAT (new!)


Apache Hadoop

• HDFS
  – Distributed storage
  – Durable, default 3x replication
  – Scalable: Yahoo! 60+PB HDFS
• MapReduce
  – Distributed computation
  – You write Java functions
  – Hadoop distributes work across cluster
  – Tolerates failures
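The MapReduce pattern can be sketched in a few lines. This is plain single-process Python, not the Hadoop Java API the slide refers to. It only shows the shape of the map, shuffle (group by key) and reduce steps that Hadoop would distribute across a cluster. The function names and input URLs are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_fn(url):
    """Map step: emit a (host, 1) pair for one captured URL."""
    yield urlparse(url).netloc, 1


def reduce_fn(host, counts):
    """Reduce step: sum the counts collected for one host."""
    return host, sum(counts)


def run_job(urls):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for url in urls:
        for key, value in map_fn(url):
            groups[key].append(value)
    return dict(reduce_fn(host, counts) for host, counts in groups.items())


# Made-up input for illustration.
urls = [
    "http://example.com.au/a",
    "http://example.com.au/b",
    "http://example.org/",
]
print(run_job(urls))
```

In Hadoop only the map and reduce functions are written by the researcher. Grouping, distribution and recovery from failed nodes are handled by the framework.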


File formats and data: WARC


File formats and data: CDX

• Index for Wayback Machine: used to browse WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
  – URL
  – Content Digest
  – Capture Timestamp
  – Content-Type
  – HTTP response code
  – etc.
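Because CDX is a space-delimited text file, a line can be read with a simple split. The sketch below assumes the field order of the commonly used 11-field CDX layout (URL key, timestamp, original URL, content type, response code, digest, then further fields). Actual files declare their own field order in a header line. The sample line and its digest are placeholders, not real data.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    url_key: str
    timestamp: str      # capture time, YYYYMMDDhhmmss
    original_url: str
    content_type: str
    status: str         # HTTP response code
    digest: str         # content digest


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line and keep the first six fields."""
    fields = line.split()
    return CdxRecord(*fields[:6])


# Made-up line; the digest value is a placeholder, not a real hash.
sample = ("au,com,example)/ 20120301000000 http://example.com.au/ "
          "text/html 200 EXAMPLEDIGEST - - 1234 5678 example.warc.gz")
record = parse_cdx_line(sample)
print(record.original_url, record.timestamp, record.status)
```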


File formats and data: WAT

• Yet Another Metadata Format! ☺ ☹
• Not preservation format
• Data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work-in-progress: we want your feedback


File formats and data: WAT

• WAT is WARC ☺
  – WAT records are WARC metadata records
  – WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
  – Compact
  – Hierarchical
  – Supported by every programming environment
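A simplified sketch of reading such a record follows. It assumes the usual WARC layout of a version line, header fields, a blank line, then the payload. The record is hypothetical: real records also carry headers such as WARC-Record-ID, WARC-Date and Content-Length, and the JSON keys shown here are made up rather than taken from the WAT specification.

```python
import json

# Hypothetical record for illustration; the JSON keys below are made up.
RAW = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"url": "http://example.com.au/", "links": ["http://example.org/"]}'
)


def parse_record(raw: str):
    """Split a WARC-style record into its header fields and JSON payload."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Skip the version line, then read "Name: value" header fields.
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return headers, json.loads(body)


headers, payload = parse_record(RAW)
print(headers["WARC-Type"], headers["WARC-Refers-To"])
print(payload["links"])
```

Because the payload is ordinary JSON, extracted metadata such as outlinks can be analysed without touching the full archived content.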

File formats & data:
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB




Contacts

• Webarchive @
• Secretariat @
• Queries about the Internet Archive web archive: http://
• Queries about Archive-It service: http://www.archive--us
• momodei @
• gojomo @