Post on 05-Dec-2014
Luis Faria lfaria@keep.pt KEEP SOLUTIONS www.keep-solutions.com
SCAPE webinar July 26, 2014
Tools for uncovering preservation risks in your large repositories
Why do we need monitoring?
Repository
• Format obsolescence
• Emerging technology
• Consumer trends
• New standards
• Organisation mission
• Bit rot
• Resource capability
• System availability
• Security breach
• Economic limitations
• Social and political factors
• Producer trends
• Organisation policies
Risks and opportunities
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Preservation monitoring survey
181 valid participants
What descriptions fit your organization? (bar chart)
• Other: 5.41%
• Data intensive industry: 0.77%
• Non affiliated: 1.54%
• Big data science: 1.93%
• Digital preservation vendor: 2.32%
• Research funder: 2.70%
• Large enterprise: 2.70%
• Publisher or content producer: 5.02%
• Small or medium enterprise: 7.34%
• Local government institution: 9.27%
• National government institution: 15.83%
• Memory institution or content holder: 26.64%
• University: 28.57%
Preservation monitoring survey
Risk                                Importance (normalized mean)  Monitoring  Not monitoring  Uncertain or no answer
File corruption                     92%                           41%         18%             41%
Backup failure                      89%                           51%         9%              40%
Staff not enough or adequate        78%                           41%         18%             41%
Software platform obsolescence      77%                           40%         13%             46%
Hardware platform obsolescence      76%                           44%         12%             44%
Lack of context information         76%                           23%         24%             53%
Incorrect action results            75%                           27%         22%             51%
Consumers misalignment              74%                           17%         25%             58%
Outdated preservation plans         69%                           28%         25%             47%
Producers misalignment              68%                           25%         19%             55%
Content not aligned with policies   64%                           30%         23%             46%
Tools for uncovering preservation risks
Content -> FITS -> C3PO -> Scout
• FITS: FITS output (XML)
• C3PO: File characteristics distribution (graphs and drill-down analysis)
• Scout: File and world properties throughout time and notifications
FITS - File Information Tool Set
• http://fitstool.org
• Characterization
  • Identification
  • Feature extraction
  • Validation
• Support for:
  • DROID
  • JHove
  • Apache Tika
  • ADL Tool
  • Exiftool
  • FFIdent
  • File Utility (windows port)
  • NLNZ Metadata Extractor
  • OIS Audio, File and XML Information
• https://github.com/keeps/fits/tree/keeps
• Developed by KEEPS
• Added support for:
  • FIDO
  • Microsoft Office
  • Adobe Illustrator
  • Corel Draw
  • Email (EML)
  • Autocad (DWG)
  • Shapefile
  • RTF, TXT
  • Databases (DBML)
FITS - File Information Tool Set
• Demonstration
  • Download from http://fitstool.org
  • Execute for a file
    ./fits.sh -i file.png
  • Execute for a directory
    ./fits.sh -r -i source_directory/ -o output_directory/
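The two invocations above can also be assembled from a script, which helps when FITS is run over many directories. The following is a minimal sketch, assuming only the -i, -o and -r flags shown on the slide; the function name build_fits_command and the launcher parameter are hypothetical and not part of FITS.

```python
from typing import List, Optional


def build_fits_command(
    input_path: str,
    output_path: Optional[str] = None,
    recursive: bool = False,
    launcher: str = "./fits.sh",  # assumption: script is run from the FITS folder
) -> List[str]:
    """Build the argument list for one FITS run (hypothetical helper)."""
    command = [launcher]
    if recursive:
        command.append("-r")  # recurse into sub-directories, as on the slide
    command += ["-i", input_path]
    if output_path is not None:
        command += ["-o", output_path]
    return command


# The two invocations shown on the slide, as argument lists.
print(build_fits_command("file.png"))
print(build_fits_command("source_directory/", "output_directory/", recursive=True))
```

The returned list can be handed to subprocess.run(command, check=True) on a machine where FITS is installed.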
FITS performance
• https://github.com/keeps/fits-testing
• 3 to 6 seconds per file
• 12 TB - a year
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
• Other options for scalability:
  • Fido
  • Apache Tika
  • Nanite
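To see why 3 to 6 seconds per file becomes a problem at scale, the per-file cost can be turned into a rough wall-clock estimate. This is a back-of-the-envelope sketch, not a benchmark: the file count of one million is a hypothetical figure, and the workers parameter assumes perfect parallel scaling, which real deployments will not reach.

```python
def estimate_hours(file_count: int, seconds_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock hours to characterise file_count files.

    Assumes every file costs the same and that work divides evenly
    across workers (an optimistic simplification).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return file_count * seconds_per_file / workers / 3600.0


# Hypothetical collection of one million files, using the 3 s and 6 s
# per-file bounds quoted on the slide.
for seconds in (3.0, 6.0):
    print(f"{seconds:.0f} s/file: {estimate_hours(1_000_000, seconds):.1f} hours")
```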
C3PO - Clever, Crafty Content Profiling of Objects
• http://ifs.tuwien.ac.at/imp/c3po
• Web application
• Content characteristics aggregation
• Drill-down analysis
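The idea of aggregating content characteristics can be illustrated with a few lines that count formats across FITS output documents. This is a minimal sketch of the concept and not C3PO's implementation; it assumes the FITS XML layout with an identification/identity element carrying a format attribute, and the sample documents are hand-written stand-ins.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable

# Namespace used by FITS output documents (assumed layout).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}


def format_distribution(fits_documents: Iterable[str]) -> Counter:
    """Count how many FITS documents report each format name."""
    counts: Counter = Counter()
    for xml_text in fits_documents:
        root = ET.fromstring(xml_text)
        identity = root.find("fits:identification/fits:identity", FITS_NS)
        if identity is None:
            counts["unknown"] += 1  # no identification section found
        else:
            counts[identity.get("format", "unknown")] += 1
    return counts


def sample_document(format_name: str, mimetype: str) -> str:
    """Build a tiny stand-in for a FITS output file (illustration only)."""
    return (
        '<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">'
        f'<identification><identity format="{format_name}" mimetype="{mimetype}"/>'
        "</identification></fits>"
    )


documents = [
    sample_document("Portable Network Graphics", "image/png"),
    sample_document("Portable Network Graphics", "image/png"),
    sample_document("Portable Document Format", "application/pdf"),
]
for name, count in format_distribution(documents).most_common():
    print(name, count)
```

A drill-down would repeat the same count on a subset, for example only the documents whose format is "Portable Document Format", grouped by another property.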
C3PO install
• Download binaries at:
  • http://dl.bintray.com/peshkira/c3po/
• Install MongoDB:
  • http://www.mongodb.org/
• Install Apache Tomcat:
  • http://tomcat.apache.org/
• Put the C3PO web app in Apache Tomcat
  • Remove the ROOT dir from webapps and rename the C3PO web app to ROOT.war
• Start Apache Tomcat and connect to:
  • http://localhost:8080/
• Usage guide:
  • https://github.com/peshkira/c3po/wiki/Usage-Guide
C3PO performance
• Dataset: Statsbiblioteket (Denmark)
• Size: 440M files (12 TB)
• Processing time: 388h (16 days) / 50h for XML report
• Average time: 2.5s per 1000 files
• Web application has a limit of 2.5 million FITS files
Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
Diagram: Content, Policies, Web, Human knowledge and Registries feed Scout, which produces risk notifications.
Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
Current information sources
• Repository content and events
• SCAPE Policy model
• PRONOM
• Web semantic extraction
• Web page renderability experiments
Define triggers
• Notify me when there are tools that can render the format X.
• Simple query with templates
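A trigger built from a query template can be pictured as a parameterised question that is checked against collected facts. The sketch below illustrates that idea only; it is not Scout's data model or query language. The class names, the (tool, format) fact tuples and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical fact store: (tool name, format it can render).
Fact = Tuple[str, str]


@dataclass(frozen=True)
class RenderToolTrigger:
    """Template 'tools that can render the format X', with X filled in."""

    format_name: str

    def query(self, facts: Set[Fact]) -> List[str]:
        """Return the tools known to render this trigger's format."""
        return sorted(tool for tool, fmt in facts if fmt == self.format_name)

    def check(self, facts: Set[Fact]) -> Optional[str]:
        """Return a notification text when the query has results, else None."""
        tools = self.query(facts)
        if not tools:
            return None
        return (
            f"There are tools that can render format {self.format_name}: "
            + ", ".join(tools)
        )


facts: Set[Fact] = {("viewer-a", "X"), ("viewer-b", "X"), ("viewer-c", "Y")}
print(RenderToolTrigger("X").check(facts))
print(RenderToolTrigger("Z").check(facts))  # no matching tool, so no notification
```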
Receive notifications
HTTP Push API
There are tools that can render format X.
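Delivering a notification over HTTP push amounts to sending the message to an endpoint the subscriber controls. The sketch below only builds such a request with Python's standard library and does not send it. The endpoint URL and the JSON field name "message" are hypothetical; Scout's actual push format is not described on the slide.

```python
import json
import urllib.request


def build_push_request(endpoint_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a notification.

    The payload shape {"message": ...} is an assumption for illustration.
    """
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_push_request(
    "http://localhost:9000/notify",  # hypothetical subscriber endpoint
    "There are tools that can render format X.",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would be a call to urllib.request.urlopen(request) against a listening endpoint.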
Interfaces
Web page
REST API
How to be a part of Scout
• Check out
  • Site: http://openplanets.github.io/scout/
  • Report: http://www.scape-project.eu/deliverable/d12-2-final-version-of-the-preservation-watch-component
  • Demo: http://scout.scape.keep.pt
• Integrate your content
• Contribute with information (soon)
  • Use the Scout form for manual input of knowledge
Roadmap
• User support
• More trigger templates
• More adaptors
  • KrakeN / Propminer
  • Software catalogues
  • Other format registries
  • Other experiments information sources
  • Manual input (human knowledge)
  • Simulation