Dr Chris Maynard
Application Consultant, EPCC
c.maynard@ed.ac.uk
+44 131 650 5077

Muttering about metadata

Report from the Metadata work group

Review of QCDml


Metadata

• Meta: Greek for among, with, after, from
• Data: Latin for information
• Literally, data about data
• Data
  – Gauge configuration
  – Ensemble of gauge configurations
• Metadata (MD)
  – How was the data created?
  – Format, machine, code, algorithm, physics

10-12/03/2009 XML at light quarks @css


Why do we need metadata?

• Extreme example: no metadata
  – Cfgs have random string names with no directory structure for different ensembles
  – Impossible to use
• Organise files
  – Into directories for ensembles
  – Give cfgs names with Markov chain position
• Construct a scheme for the metadata
  – Rules for describing the data


A basic scheme

• Use “meaningful” filenames
  – Metadata is encoded into the names of the files and directories
  – Can have some structure or hierarchy
  – But not completely flexible
  – Example with three fields. What ordering?
    – gauge-action/volume/quark-mass
    – gauge-action/quark-mass/volume
  – What happens for 2+1 flavours?
  – Not extensible
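The ordering problem can be made concrete with a small Python sketch. The field values below are illustrative assumptions, not taken from any real directory layout: the same three fields give two incompatible paths, and a reader written for one ordering silently mislabels the other.

```python
# Hypothetical metadata for one ensemble; values are illustrative only.
FIELDS = {"gauge-action": "wilson", "volume": "16x32", "quark-mass": "0.01"}

ORDER_A = ["gauge-action", "volume", "quark-mass"]
ORDER_B = ["gauge-action", "quark-mass", "volume"]


def encode(fields, order):
    """Encode metadata into a directory path using a fixed field order."""
    return "/".join(fields[name] for name in order)


def decode(path, order):
    """Recover metadata from a path; only correct if `order` matches the writer's."""
    return dict(zip(order, path.split("/")))


path_a = encode(FIELDS, ORDER_A)
path_b = encode(FIELDS, ORDER_B)

# Reading a path with the wrong ordering swaps volume and quark mass
# without raising any error.
misread = decode(path_b, ORDER_A)

print(path_a)
print(path_b)
print(misread)
```

Adding a second quark mass for 2+1 flavours would need yet another path level, which every existing reader would have to be taught about.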


A broken scheme

D52C202K3500U010010_LL3450X_FL3400X_CMesonT00T31

• Old UKQCD meson correlator filename
  – What does X stand for?
  – Wilson, Rotated, Clover
  – Many different clover actions. Scheme broken
  – X means none of the above!
• D52C202K3500U010010_LL3450X_FL3400X_CMesonT00T31
  – Dynamical, cSW = 2.02 ≈ 2.0171
  – NP determined: no information


A better scheme

• Recreate data from MD
  – This is a very important requirement
  – Know what the data is
  – Data provenance
• Combination of IO parameters and code
  – Implicit assumptions are recorded! :^)
  – Version n cannot read version m :^(
  – Code X cannot process MD from Code Y :^(
• Can we construct a general scheme?
  – Recreate data from MD?


Extensible schemes

• LQCD metadata is hierarchical
  – Rich structure
  – Metadata scheme has to reflect this
  – Extensible
• New metadata requires a new scheme
  – In an extensible scheme
    – Old scheme is included in new one
    – Old metadata fits in new scheme
    – No need to refactor existing documents
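A toy Python sketch of what "old metadata fits in new scheme" means. The "schemes" here are plain dictionaries and the element names are made up for illustration; a real scheme would be an XML schema. The extended scheme keeps every old rule and only adds optional elements, so old documents need no refactoring.

```python
import xml.etree.ElementTree as ET

# Toy "schemes": element name -> whether it is required. Purely illustrative.
SCHEME_V1 = {"action": True, "volume": True}
# The extended scheme keeps every v1 rule and only adds optional elements.
SCHEME_V2 = {**SCHEME_V1, "strangeQuarkMass": False}


def conforms(xml_text, scheme):
    """Return True if the document has all required elements and no unknown ones."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    required = {name for name, is_required in scheme.items() if is_required}
    return required <= present and present <= set(scheme)


OLD_DOC = "<ensemble><action>wilson</action><volume>16x32</volume></ensemble>"
NEW_DOC = (
    "<ensemble><action>wilson</action><volume>16x32</volume>"
    "<strangeQuarkMass>0.04</strangeQuarkMass></ensemble>"
)

print(conforms(OLD_DOC, SCHEME_V2))  # old metadata fits in the new scheme
print(conforms(NEW_DOC, SCHEME_V1))  # new metadata does not fit the old one
```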


Markup language

• Combines text and information about text
  – Presentational
    – Text format (e.g. this slide), WYSIWYG
  – Procedural
    – Presentation of text, not WYSIWYG
    – TeX, PostScript
  – Descriptive or semantic
    – Labels fragments of text
    – No presentation or other interpretation mandated
    – SGML, XML, VML
• HTML has both procedural and descriptive elements


XML Web standard

• XML: eXtensible Markup Language
• www.w3.org/XML
  – XML is for structuring data
  – XML looks a bit like HTML
  – XML is text, but isn't meant to be read
  – XML is verbose by design
  – XML is a family of technologies
  – XML is license-free, platform-independent and well-supported
  – <beer>happiness</beer>


XML II

• Semantic, eXtensible Markup Language
• XML was designed to carry data, not to display data
  – Cf. HTML, designed for displaying data
  – Incompatible applications can exchange data wrapped in XML
• XML is just plain text
• User-defined tags allow structure to be developed
  – Lattice QCD metadata is structured
• XML does not DO anything
  – You need an application for this
• XML schema
  – Defines a set of rules for the XML document


Well formed XML
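The figure for this slide did not survive extraction. As a rough stand-in, the Python sketch below uses the standard library parser to show what well-formedness means in practice: a single root element, every tag closed, and tags properly nested. The sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, i.e. it is well formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


GOOD = "<beer>happiness</beer>"
UNCLOSED = "<beer>happiness"
BAD_NESTING = "<a><b>text</a></b>"
TWO_ROOTS = "<a/><b/>"

for sample in (GOOD, UNCLOSED, BAD_NESTING, TWO_ROOTS):
    print(repr(sample), is_well_formed(sample))
```

Well-formedness says nothing about whether the tags mean anything; that is the job of a schema.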


XML schema

• What is XML schema?
  – Collection of rules for XML documents
  – Other schema languages: DTD, Relax NG, Schematron
  – An XML schema is itself an XML document
• Why do we need an XML schema?
  – Computers can read and understand XML instance documents (IDs)
  – <length>16</length>
  – Meaning of length is context dependent
  – Applications can know types, parse and process XML data
  – Could just be an XSLT style sheet to transform XML into HTML and render a web page, e.g. LDG web-client
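A small Python sketch of why the meaning of <length> is context dependent. The document and the `TYPES` table are hypothetical: the table stands in for the type information that a real XML schema would declare, since the Python standard library does not validate against XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <length> appears in two different contexts.
DOC = """
<config>
  <lattice><length>16</length></lattice>
  <trajectory><length>0.5</length></trajectory>
</config>
"""

# Stand-in for schema type information: path -> Python type.
# A real XML schema would declare these types itself.
TYPES = {"lattice/length": int, "trajectory/length": float}

root = ET.fromstring(DOC)
values = {path: cast(root.findtext(path)) for path, cast in TYPES.items()}
print(values)
```

Without the type table, both values are just strings and an application cannot tell a lattice extent from a trajectory length.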


A word of caution

• XML is not magic
  – XML is not a solution
  – It is a useful tool
  – Not the only tool
  – We still have to use the tool
• Ideally produce metadata from code
  – What metadata?
  – What is standard/useful/implicit?
  – Application has to do something with it


Metacrap

• Not all aspects/connotations of metadata are good
• Metacrap: Putting the torch to seven straw-men of the meta-utopia
• http://www.well.com/~doctorow/metacrap.htm
• Amusing and valid critique of some MD ideas
  – But not all are relevant to this project


International Lattice Data Grid (ILDG)

• Lattice data is very expensive to generate
• Many groups now make data available
  – MILC, UKQCD, RBC, JLQCD, Adelaide group
  – Many others share data
• ILDG is representative of whole lattice community
  – MD working group (CMM is UK rep)
  – Middleware working group (MGB, RO [epcc])


Introducing QCDml

• XML schemata for gauge configuration MD
• Developed and maintained by MDWG
  – Design by committee: always a good idea
• Basic concept
  – MD describing an ensemble
  – MD describing a configuration belonging to an ensemble


QCDml Ensemble


Physics


Fermion action inheritance


Example QCDml Ensemble

• 1 Name of schema (URI)
• 2 Using W3.org XML schema (URI)
• 3 Schema location for
  – a) named schema (URI)
  – b) Location of schema (URL)
• 4 Name of Ensemble (URI)
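The annotated XML for this slide is missing from the transcript. The Python sketch below shows where the four numbered items would typically sit in an instance document. Every `example.org` URI, the ensemble name and the root element name `markovChain` are placeholders assumed for illustration; only the `XMLSchema-instance` namespace is the real W3C one.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"  # real W3C namespace

# All example.org URIs, the root element name and the ensemble name are placeholders.
DOC = """
<markovChain
    xmlns="http://example.org/QCDml/ensemble"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/QCDml/ensemble http://example.org/schemata/ensemble.xsd">
  <markovChainURI>mc://example/ensemble-name</markovChainURI>
</markovChain>
"""

root = ET.fromstring(DOC)

# 1: the default namespace is the name of the schema (a URI, not a location).
schema_name = root.tag[1:].split("}")[0]

# 2: the xsi prefix declares use of the W3C schema-instance vocabulary.
# 3: schemaLocation pairs (a) the named schema with (b) a URL where it can be found.
named_schema, schema_url = root.get(f"{{{XSI}}}schemaLocation").split()

# 4: the name of the ensemble.
ensemble_uri = root.findtext(f"{{{schema_name}}}markovChainURI")

print(schema_name, named_schema, schema_url, ensemble_uri, sep="\n")
```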


Example quark action


QCDml Config


Markov Step


Name hierarchy

• Ensemble: <markovChainURI/>
  – Unique name for ensemble in ILDG namespace
• Config: <markovChainURI/>, <dataLFN/>
• Replica catalogue
  – Actual file instances
  – Multiple copies
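The chain of names can be sketched in Python as follows. This is only an illustration of the lookup, assuming that a configuration refers to its ensemble through the shared markovChainURI and that the replica catalogue maps a dataLFN to its file instances; all URIs, LFNs and file paths are invented.

```python
from dataclasses import dataclass


@dataclass
class Ensemble:
    markov_chain_uri: str  # unique name of the ensemble


@dataclass
class Config:
    markov_chain_uri: str  # points back to the ensemble
    data_lfn: str          # logical file name of the configuration


# Hypothetical names throughout.
ensemble = Ensemble("mc://example/ens-A")
configs = [
    Config("mc://example/ens-A", "lfn://example/ens-A/cfg-0010"),
    Config("mc://example/ens-A", "lfn://example/ens-A/cfg-0020"),
    Config("mc://example/ens-B", "lfn://example/ens-B/cfg-0010"),
]

# Replica catalogue: one logical file name -> several physical copies.
replica_catalogue = {
    "lfn://example/ens-A/cfg-0010": ["site1:/data/a10.lime", "site2:/store/a10.lime"],
    "lfn://example/ens-A/cfg-0020": ["site1:/data/a20.lime"],
}


def replicas_for(ensemble, configs, catalogue):
    """Map each LFN in the ensemble to its known file instances."""
    return {
        c.data_lfn: catalogue.get(c.data_lfn, [])
        for c in configs
        if c.markov_chain_uri == ensemble.markov_chain_uri
    }


found = replicas_for(ensemble, configs, replica_catalogue)
print(found)
```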


Algorithm

• General scheme too complex
• Algorithmic MD can belong to ensemble or config
• Either name-value pairs
• Or import another schema
  – Lives in separate namespace


Namespaces

Allow another namespace to be imported here

Application processing QCDml can ignore this namespace

Can include all metadata into XML IDs

Local applications can be alg schema aware, but ignore non-local ones


Alg Namespace example

• Imported namespace has its own schema
• Imported schema is not in default namespace, so has a prefix
• All elements belonging to this namespace use this prefix
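The XML shown on this slide is not in the transcript, so here is a hedged Python sketch of the same idea. Both namespace URIs and all element names are placeholders assumed for illustration: algorithm elements carry an `alg:` prefix, a generic application can skip them, and an algorithm-aware local application can pick them out.

```python
import xml.etree.ElementTree as ET

QCDML = "http://example.org/QCDml/config"   # placeholder namespace
ALG = "http://example.org/local/algorithm"  # placeholder namespace

# Element names below are hypothetical, chosen only to show the prefix mechanism.
DOC = f"""
<gaugeConfiguration xmlns="{QCDML}" xmlns:alg="{ALG}">
  <dataLFN>lfn://example/cfg-0010</dataLFN>
  <algorithm>
    <alg:stepSize>0.02</alg:stepSize>
    <alg:integrator>leapfrog</alg:integrator>
  </algorithm>
</gaugeConfiguration>
"""

root = ET.fromstring(DOC)


def local_names(root, namespace):
    """Return the local names of all elements that belong to `namespace`."""
    prefix = "{" + namespace + "}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]


# A generic QCDml application looks only at its own namespace...
qcdml_elements = local_names(root, QCDML)
# ...while an algorithm-aware local application can also read the imported one.
alg_elements = local_names(root, ALG)

print(qcdml_elements)
print(alg_elements)
```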


Data format

• ILDG specified format
  – All gauge configurations in the same format
• Based on NERSC data layout
• SciDAC LIME records
• UKQCD perspective
  – CPS cannot do this :^(
  – Chroma can :^)
  – qdp++ tools exist for conversion etc.


How to generate QCDml

• Ensemble MD requires a human
  – Use schema-aware tool (CMM uses XMLSpy)
  – Take existing XML ID and hack
  – Not that hard, only once per ensemble
• Config XML
  – Post-processing on QCDOC
  – Example of DWF data


DWF data

• CPS on QCDOC writes
  – Data in NERSC format
  – VML files containing MD
    – Parameters of objects
    – Effective check-pointing
  – Data stored with meaningful path- and filenames
  – Includes binary and source code version
• Satisfies important constraint
  – Recreate data from Metadata :^)


From one scheme to another

• Series of scripts and utilities can do conversion
  – qdp++
  – cmm
  – /host/cmaynard/tools/scripts
• Utilities compiled for qcdocx
  – Conversion and submit to grid


Scripts

• Data conversion and XML chunks are built by scripts
• makeQCDmlConfig.sh glues XML together
• dataSet.sh contains dataset specifications
  – Plus sundries


Chris Maynard ILDG 13 December 5 2008 36

Problems with XML

• Lattice QCD (meta)data is really mathematics
  – XML is not really ideal for storing this data
• QCDml has defined common names for <action/> etc.
  – Even WilsonAction has more than one common usage
  – Kappa versus mass
• Algorithm metadata is too complex for common names
  – Not really defined in the metadata
  – Unstructured parameter values included
• This is OK because
  – an ensemble is defined by the action
  – not the algorithm used to generate it
• Extending to propagators and correlators is hard for the same reason as defining the algorithm


We need an application

• XML does not DO anything
• For it to be useful we need to do something with it!
  – What do we want to do with it?
  – Is QCDml good for this purpose?
  – QCDml design focused on searching the metadata catalogue
  – This was probably a good idea!
• XPath used to query XML databases
  – Basic tools/APIs exist for constructing queries
  – Cf. UKQCD DiGS GUI browser, LDG web-client and JLDG faceted navigation application
• Metadata capture
  – How do we create XML IDs?
  – Does any application actually write QCDml?
  – UKQCD does post-processing
• Data provenance
  – Does QCDml provide this?

Hard work
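To give a flavour of an XPath-style catalogue search, here is a minimal Python sketch using the limited XPath subset in the standard library. The catalogue layout, element names and values are invented for illustration and are not QCDml; a real metadata catalogue would be queried through its own service.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; element names and values are placeholders, not QCDml.
CATALOGUE = """
<catalogue>
  <ensemble><name>ens-A</name><action>wilson</action><beta>5.2</beta></ensemble>
  <ensemble><name>ens-B</name><action>dwf</action><beta>2.13</beta></ensemble>
  <ensemble><name>ens-C</name><action>wilson</action><beta>5.29</beta></ensemble>
</catalogue>
"""

root = ET.fromstring(CATALOGUE)

# XPath predicate: ensembles whose <action> child has the text "wilson".
wilson = [e.findtext("name") for e in root.findall("./ensemble[action='wilson']")]
print(wilson)
```

The query only works because the documents share common element names, which is the point of agreeing on a schema first.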


What next?

• QCDml seems to work OK
  – How much is it being used?
• We don’t have many applications that DO something with it

CMM’s Questions for ILDG

  – What do we want to do with metadata?
  – Do we have the right sort of metadata for this?
  – What tools or applications do we need?

• Someone then has to build them
• If we don’t ask, we don’t get!

Can we review QCDml usage to define what tools we need?


Work flow tools

• Graphical tools for linking work together
  – Components could be …
  – Machine job submission tasks
  – The actual MC code
  – Data logistics
• Now used by many areas of science
  – Particle physics experiments
  – Chemistry
• Examples
  – Unicore
  – “my experiment”


Topics for Discussion

• Technical
  – Tools
  – Metadata capture
  – Data conversion
  – Use of QCDml
  – Data curation
  – Data provenance
• Sociological
  – Encouraging other groups to join
  – ILDG paper
  – Funding


Finally
