Understanding what you have
Content profiling with C3PO

Artur Kulmukhametov, Carl Wilson, Petar Petrov
Vienna University of Technology, Open Planets Foundation

Advanced Practitioner Course
Glasgow, 17 July 2013



Agenda

• Collection scale and heterogeneity

• An approach to getting control

• Characterisation tools

• C3PO, a tool for content profiling

This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Heterogeneity of Data

• Personal

• Cultural Heritage

• Scientific Data

• Government Documents

• … a huge variety of formats and information

What is It?

Large Synoptic Survey Telescope

30 Terabytes of data nightly


Conclusions?

… that’s a lot of data …

Do we know what it is?

Do we need to preserve it?

All of it?

Place for Characterization

Characterization


Heterogeneity

One size does not fit all

Scalability

Many Collections, One View

• Global View of Content

  • distribution of file formats

  • distribution of characteristics

• Three Stages

  • collect metadata

  • combine and filter

  • analyse and reason
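
The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how C3PO is implemented: the record fields (`path`, `format`, `size`), the sample values, and the helper names `combine_and_filter` and `format_distribution` are all hypothetical stand-ins for metadata that a characterisation tool would produce.

```python
from collections import Counter

# Stage 1: collect metadata. These records are hypothetical; real ones
# would come from a characterisation tool.
records = [
    {"path": "a.pdf", "format": "Portable Document Format", "size": 39586},
    {"path": "b.pdf", "format": "Portable Document Format", "size": 1200},
    {"path": "c.rtf", "format": "Rich Text Format", "size": 800},
    {"path": "d.txt", "format": "Plain text", "size": 50},
]


def combine_and_filter(items, min_size=0):
    """Stage 2: keep only the records that satisfy the filter."""
    return [r for r in items if r["size"] >= min_size]


def format_distribution(items):
    """Stage 3: analyse, here as a count of objects per format."""
    return Counter(r["format"] for r in items)


selected = combine_and_filter(records, min_size=100)
distribution = format_distribution(selected)
for fmt, count in distribution.most_common():
    print(f"{fmt}: {count}")
```

The same pattern applies to any other characteristic: swap the key used in `format_distribution` to obtain a distribution over, say, MIME type or creating application.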

Representative Samples

• Based upon metadata

• As few as possible, as many as necessary

• Stratification across file type, size, time or any other characteristic relevant to the use case

• Outlier identification

Tools for Characterization

• fido

• jpylyzer

• ffident

• Exiftool

• Exif

• Droid

A Few Problems…

• Many tools to manage and invoke

• Different output schemas

• Different configurations and environments

• Little or no higher-level management

• Differences are difficult to spot

File Information Tool Set

• FITS identifies, validates, and extracts technical metadata for various file formats

• Developed by Harvard University Library in 2009

• v0.6.2, LGPL

• Wraps other tools

• New version every 6-12 months

http://code.google.com/p/fits/

File Information Tool Set

Main features:

• Consolidates output

• Can include raw output

• Configurable and extensible

FITS includes:

• Droid

• NLNZ Metadata Extractor

• Jhove

• Exiftool

• FFident

• File Utility

FITS Output

<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.0" timestamp="12/27/11 10:49 AM">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.0">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="file utility" toolversion="5.03" />
      <tool toolname="Exiftool" toolversion="7.74" />
      <tool toolname="Droid" toolversion="3.0" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
      <tool toolname="ffident" toolversion="0.2" />
      <version toolname="Jhove" toolversion="1.5">1.4</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.5">39586</size>
    <creatingApplicationName toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="SINGLE_RESULT">/XPP</creatingApplicationName>
    <lastmodified toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2011:12:27 10:44:28+01:00</lastmodified>
    <created toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2002:04:25 13:02:24Z</created>
    <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/home/petrov/taverna/tmp/000/000009.pdf</filepath>
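
Consolidated output of this kind can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes an abridged copy of the record shown on the slide, with closing tags added so that the sample parses; the `summarise` helper is hypothetical and is not part of FITS.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the FITS output shown on the slide.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Abridged from the slide; closing tags were added (an assumption) so
# that the sample is well-formed.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.0">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.0">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="Droid" toolversion="3.0" />
      <version toolname="Jhove" toolversion="1.5">1.4</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.5">39586</size>
  </fileinfo>
</fits>"""


def summarise(fits_xml):
    """Pull a few identification and file-info fields out of one FITS record."""
    root = ET.fromstring(fits_xml)
    identity = root.find("fits:identification/fits:identity", NS)
    puid = identity.find("fits:externalIdentifier[@type='puid']", NS)
    size = root.find("fits:fileinfo/fits:size", NS)
    return {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "puid": puid.text if puid is not None else None,
        "size": int(size.text) if size is not None else None,
        "tools": [t.get("toolname") for t in identity.findall("fits:tool", NS)],
    }


summary = summarise(SAMPLE)
print(summary)
```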

FITS Output Conflict

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.1" timestamp="7/21/12 3:51 PM">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.5</version>
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.6</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/50</externalIdentifier>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/51</externalIdentifier>
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
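
A conflicting record like this one can be detected programmatically by checking the `status` attribute and grouping the reporting tools by the format they claim. The sketch below assumes an abridged, well-formed copy of the record from the slide; the `format_votes` helper is hypothetical and does not reflect how FITS or C3PO resolve conflicts.

```python
import xml.etree.ElementTree as ET

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Abridged from the slide; the closing </fits> tag was added (an
# assumption) so that the sample is well-formed.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
</fits>"""


def format_votes(fits_xml):
    """Return (has_conflict, {format name: [tools that reported it]})."""
    root = ET.fromstring(fits_xml)
    identification = root.find("fits:identification", NS)
    votes = {}
    for identity in identification.findall("fits:identity", NS):
        tools = [t.get("toolname") for t in identity.findall("fits:tool", NS)]
        votes.setdefault(identity.get("format"), []).extend(tools)
    return identification.get("status") == "CONFLICT", votes


has_conflict, votes = format_votes(SAMPLE)
print(has_conflict)
for fmt, tools in votes.items():
    print(fmt, "<-", ", ".join(tools))
```

Counting tools per format is only one possible way to look at such a record; whether a majority of tools should win is a policy decision that the slides do not prescribe.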

Yet Another?

Advantages

• Only one output schema

• Basic QA hints

• Better type coverage (although...)

Disadvantages

• Consolidation is hard

• Performance and Scalability

• Conflicts

Clever, Crafty Content Profiling of Objects

C3PO is a tool for content profile generation.

Features:

• Uses characterization results

• Deeper content analysis with visualisations through the web-app

• Generates content profiles (map/reduce)

https://github.com/openplanets/c3po

"Sometimes, I don’t understand human behavior?!"
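
The map/reduce idea behind profile generation can be illustrated in plain Python. This is a sketch of the general technique only, not C3PO's actual jobs: the field names (`mimetype`, `size`), the sample objects, and the functions `map_object`, `merge` and `reduce_step` are assumptions made for the example.

```python
from functools import reduce

# Hypothetical per-object metadata; the field names are illustrative.
objects = [
    {"mimetype": "application/pdf", "size": 39586},
    {"mimetype": "application/pdf", "size": 1200},
    {"mimetype": "text/rtf", "size": 800},
]


def map_object(obj):
    """Map step: emit (key, partial statistics) for a single object."""
    size = obj["size"]
    return obj["mimetype"], {"count": 1, "sum": size, "min": size, "max": size}


def merge(a, b):
    """Combine two partial statistics that belong to the same key."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def reduce_step(profile, emitted):
    """Reduce step: fold one emitted pair into the running profile."""
    key, stats = emitted
    profile[key] = merge(profile[key], stats) if key in profile else stats
    return profile


profile = reduce(reduce_step, map(map_object, objects), {})
for mimetype, stats in profile.items():
    print(mimetype, stats, "avg:", stats["sum"] / stats["count"])
```

Because `merge` is associative, partial profiles computed on separate chunks of a collection can be combined in any order, which is what makes this style of aggregation attractive for large collections.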

Clever, Crafty Content Profiling of Objects

• CLI-app

  • Java

  • Parses and processes FITS files

  • Stores them in MongoDB

  • XML Profile + CSV

• Web-app

  • Play Framework

  • Overview and Browsing

  • Filtering

  • Representative Sample Set Generation

C3PO: Representative Samples

• Size'o'Matic 3000

  • The Size'o'Matic 3000 selects the smallest and the largest objects and fills the rest of the representative set with random objects close to the average object size.

• SysSampler

  • This algorithm implements a common statistical approach called systematic sampling. It divides the collection into bins and selects one element per bin at random. All elements have an equal probability of being chosen.

• DistSampler

  • The distribution sampling algorithm takes a small number of properties as input and selects sample objects that together have (nearly) the same distribution as the whole collection or filter. Note that if you select too many properties, or a special combination of properties, it may happen that no representatives can be found.
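
The first two strategies can be sketched as follows. Both functions are illustrative readings of the descriptions on the slide, not C3PO's code: the names `systematic_sample` and `size_sample` are hypothetical, and the size of the candidate pool around the mean is an assumption.

```python
import random


def systematic_sample(items, n, rng):
    """Split items into n contiguous bins and pick one element per bin at random."""
    if n <= 0 or not items:
        return []
    n = min(n, len(items))
    sample = []
    for i in range(n):
        start = i * len(items) // n
        end = (i + 1) * len(items) // n
        sample.append(items[rng.randrange(start, end)])
    return sample


def size_sample(sizes, n, rng):
    """Smallest and largest size first, then random picks close to the mean."""
    if n <= 0 or not sizes:
        return []
    ordered = sorted(sizes)
    if n == 1 or len(ordered) == 1:
        return [ordered[0]]
    picked = [ordered[0], ordered[-1]]
    rest = ordered[1:-1]
    mean = sum(sizes) / len(sizes)
    # Candidate pool: the sizes nearest the mean. Using twice the number
    # of remaining slots is an assumption made for this sketch.
    pool = sorted(rest, key=lambda s: abs(s - mean))[: 2 * (n - 2)]
    picked += rng.sample(pool, min(n - 2, len(pool)))
    return picked


rng = random.Random(0)  # fixed seed so that runs are repeatable
items = list(range(100))
sys_sample = systematic_sample(items, 10, rng)
sizes = [5, 1, 100, 48, 50, 52, 49, 51]
size_picks = size_sample(sizes, 4, rng)
print(sys_sample)
print(size_picks)
```

In `systematic_sample`, every element has the same selection probability only when the bins are of equal size, which holds when the collection size is a multiple of `n`.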

C3PO: Performance

• Govdocs1

  • 945699 objects - 1h 48m

  • 112 different object properties

  • profile - 12 minutes

• Web Archive Data

  • 958638 objects - 2h 58m

  • 105 different object properties

  • profile - 13.5 minutes

• A single PC with 4GB RAM and 2.3GHz 2-core CPU.

C3PO: Performance

• SB (Denmark) dataset - 12 TB, 440M FITS files

• Test case 1 – Import

  • Linear ingest time of 0.65 ms per FITS file

• Test case 2 – GUI

  • 2.5 million FITS files limit

• Generate profile on the command line

  • 15 hours for 12M files
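
The figures on the two performance slides can be turned into rough rates with a little arithmetic. The sketch below only restates the reported numbers; the extrapolations to the full SB dataset assume that time keeps scaling linearly, which the slides state for import but not for profile generation.

```python
def rate(objects, hours, minutes=0):
    """Objects processed per second for a reported wall-clock time."""
    seconds = hours * 3600 + minutes * 60
    return objects / seconds


# Reported times for the two smaller collections.
govdocs_rate = rate(945_699, 1, 48)
webarchive_rate = rate(958_638, 2, 58)

# Import of 440M FITS files at 0.65 ms per file, assuming it stays linear.
import_hours = 440_000_000 * 0.65e-3 / 3600

# Profile generation took 15 hours for 12M files. Extrapolating linearly
# to 440M files is an assumption, not a measured result.
profile_hours = 15 * 440_000_000 / 12_000_000

print(f"Govdocs1:    {govdocs_rate:.0f} objects/s")
print(f"Web archive: {webarchive_rate:.0f} objects/s")
print(f"SB import:   {import_hours:.1f} hours")
print(f"SB profile:  {profile_hours:.0f} hours if linear")
```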

C3PO: Challenges

• Conflict reduction

• Use the ontology from Planning and Watch for alignment with other tools

• Generate a profile from the SB dataset (0.4 billion FITS files)

Summary

• Characterization is time consuming

• It can be faulty

• Know your tools

• A tool for content profiling? C3PO!