Dataset Discovery Status Report - indico.cern.ch

DATASET DISCOVERY STATUS REPORT Maria Grigorieva DKB Team


Page 2

DKB MEETINGS & DISCUSSIONS

• Data Knowledge Base Technical Discussion 18 Apr 2017 https://indico.cern.ch/event/628962/

• Data Knowledge Base / DCC Technical Discussion 20 Apr 2017 https://indico.cern.ch/event/632634/

• DCC Meeting 18 May 2017 https://indico.cern.ch/event/640192/

• The DKB team was motivated to investigate methods for campaign/dataset discovery and categorization.


Page 3

EXAMPLES OF TWIKI SUMMARIES

Metainformation in Twiki pages provides the most suitable metadata categorization and representation for end users. Twiki Monte-Carlo Campaign pages contain:

• Aggregated event reports, categorized by physics category
• Data sample lists for each physics category
• Data sample lists broken down by a set of parameters, such as:
  • Monte-Carlo generators
    • Powheg+Pythia8
    • ...
  • Physics channel
    • W+ in mumu
    • W- in taumu
    • Z/gamma* in tau tau
    • ...
  • Filtration methods
    • Without lepton filter
    • Two lepton filter
    • ...

However, the metainformation in Twiki is not sufficiently structured, and Twiki provides no mechanism for synchronization with database back-ends.

The challenge is to provide fully automatic search and aggregation by an arbitrary set of parameters.

Page 4

DKB NEW SCOPES

• To provide fast and flexible data categorization, search and aggregation:

1. Reproduce the Event Summary report in ProdTask (https://prodtask-dev.cern.ch/prodtask/request_hashtags_main/#/hashtags/&MC16c_CP), but with a physics category breakdown (as in the Twiki Event Summary).

2. Implement Google-like search of tasks and data samples by an arbitrary set of attributes, such as campaign, project, ATLAS geometry, conditions tags, hashtags, physics category, and others.


Page 5

METADATA INDEXING AND SEARCH FACILITIES

Beginning with the 2016 Monte-Carlo simulation campaign, ProdSys2 metadata was enhanced with task ‘hashtags’, enabling more detailed search by physics category, physics channel, Monte-Carlo generator list, etc.

ATLAS_DEFT.t_production_task

ATLAS_DEFT.t_prodmanager_request

ATLAS_DEFT.t_hashtag

ATLAS_DEFT.t_production_step

ATLAS_DEFT.t_step_template

ATLAS_DEFT.t_ht_to_task

ATLAS_DEFT.t_task

ATLAS_PANDA.jedi_datasets

Page 6

Unchangeable data

Changeable data

Numbers of events that can be used for aggregation

Metadata categories:

- Task Parameters
  - Taskid
  - Taskname
  - Status
  - Timestamp
  - Start time
  - End time
  - Request ID
  - Ticket ID
  - User Name
- Experiment parameters
  - Energy
  - Campaign/Subcampaign
  - Project
  - Physics group
  - Physics category
  - Hashtags
  - Run number
- Configuration
  - ATLAS geometry
  - Conditions tags
  - SW Release
  - Trigger Config
- Events
  - Requested
  - Processed
- Data Samples
  - Input
  - Output

Page 7

18.09.17 Maria Grigorieva

Page 8

SUMMARIES IN KIBANA

{
  "query": {
    "bool": {
      "must": [
        { "term": { "subcampaign.keyword": "MC16a" } },
        { "term": { "status": "done" } }
      ],
      "should": [
        { "term": { "hashtag_list": "MC16a" } },
        { "term": { "hashtag_list": "MC16a_CP" } }
      ]
    }
  },
  "aggs": {
    "category": {
      "terms": { "field": "phys_category" },
      "aggs": {
        "step": {
          "terms": { "field": "step_name.keyword" },
          "aggs": {
            "requested": { "sum": { "field": "requested_events" } },
            "processed": { "sum": { "field": "processed_events" } }
          }
        }
      }
    }
  }
}
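To make the aggregation in this query concrete, the following pure-Python sketch reproduces its logic on a handful of invented task records: keep tasks with the given subcampaign and status, then sum requested and processed events per physics category and step. The record values and the function name `summarize` are illustrative assumptions. Field names are taken from the query, and each record is assumed to carry a single `phys_category` value. The `should` clauses are left out because, in a `bool` query that also has a `must` clause, they only influence scoring unless `minimum_should_match` is set.

```python
from collections import defaultdict


def summarize(tasks, subcampaign="MC16a", status="done"):
    """Sum requested/processed events per (phys_category, step_name)."""
    summary = defaultdict(
        lambda: defaultdict(lambda: {"requested": 0, "processed": 0})
    )
    for task in tasks:
        # Equivalent of the two "must" term clauses in the query.
        if task["subcampaign"] != subcampaign or task["status"] != status:
            continue
        # Outer bucket: physics category; inner bucket: production step.
        bucket = summary[task["phys_category"]][task["step_name"]]
        bucket["requested"] += task["requested_events"]
        bucket["processed"] += task["processed_events"]
    # Convert the nested defaultdicts to plain dicts for easy inspection.
    return {category: dict(steps) for category, steps in summary.items()}


# Invented sample records, for illustration only.
tasks = [
    {"subcampaign": "MC16a", "status": "done", "phys_category": "Higgs",
     "step_name": "Reco", "requested_events": 1000, "processed_events": 990},
    {"subcampaign": "MC16a", "status": "done", "phys_category": "Higgs",
     "step_name": "Reco", "requested_events": 500, "processed_events": 500},
    {"subcampaign": "MC16a", "status": "done", "phys_category": "Higgs",
     "step_name": "Simul", "requested_events": 2000, "processed_events": 1800},
    {"subcampaign": "MC16a", "status": "running", "phys_category": "Top",
     "step_name": "Reco", "requested_events": 300, "processed_events": 0},
    {"subcampaign": "MC16c", "status": "done", "phys_category": "Top",
     "step_name": "Reco", "requested_events": 400, "processed_events": 400},
]

summary = summarize(tasks)
print(summary)
```

Only the three MC16a tasks in status "done" contribute to the summary; the running task and the MC16c task are filtered out before aggregation.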

Page 9

WEB-INTERFACE PROTOTYPE

GET prodsys/MC16/_search
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "\"MC16a\" AND \"Higgs\" AND \"ATLAS-R2-2016-01-00-01\" AND \"Reco\" AND \"OFLCOND-MC16-SDR-16\"",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "range": {
          "task_timestamp": {
            "gte": "01-05-2017 00:00:00",
            "lt": "10-08-2017 00:00:00"
          }
        }
      }
    }
  }
}
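A request body like this one could be assembled from user-supplied keywords roughly as follows. This is a minimal sketch: the helper name `build_search_body` and its parameters are hypothetical, and the date strings are passed through unchanged, so they must match whatever date format the index mapping for `task_timestamp` expects.

```python
import json


def build_search_body(terms, time_from, time_to, timestamp_field="task_timestamp"):
    """Build a search body: every term must match, within a time window."""
    # Quote each term as a phrase so that values containing hyphens
    # (geometry or conditions tags) are not split into separate tokens.
    phrases = ['"{}"'.format(term.replace('"', '\\"')) for term in terms]
    return {
        "query": {
            "bool": {
                "must": {
                    "query_string": {
                        "query": " AND ".join(phrases),
                        "analyze_wildcard": True,
                    }
                },
                "filter": {
                    "range": {
                        timestamp_field: {"gte": time_from, "lt": time_to}
                    }
                },
            }
        }
    }


body = build_search_body(
    ["MC16a", "Higgs", "ATLAS-R2-2016-01-00-01", "Reco", "OFLCOND-MC16-SDR-16"],
    "01-05-2017 00:00:00",
    "10-08-2017 00:00:00",
)
# json.dumps takes care of escaping the inner double quotes.
print(json.dumps(body, indent=2))
```

Building the body as a Python dictionary and serializing it with `json.dumps` avoids hand-escaping the nested quotes in the `query_string` expression.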

Page 10

Page 11

PRODSYS2 & ELASTIC SYNCHRONIZATION

• The candidate synchronization parameter is:

• t_production_task: timestamp

• Documents in ElasticSearch are immutable; we cannot change them. Instead, if we need to update an existing document, we reindex or replace it.

• Internally, ElasticSearch has marked the old document as deleted and added an entirely new document. The old version of the document doesn’t disappear immediately, although you won’t be able to access it. ElasticSearch cleans up deleted documents in the background as you continue to index more data.

• For synchronization, we can fetch the data from ProdSys2 for the previous time period and simply rewrite (re-index) it in ElasticSearch.


https://www.elastic.co/guide/en/elasticsearch/guide/current/update-doc.html
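The replace-on-reindex behaviour described on this page suggests a simple periodic job: select the tasks whose timestamp falls within the last period and index them again under the same document ID. The sketch below only builds the index actions and does not talk to a cluster. Using `taskid` as the document `_id`, the index name `prodsys`, the row layout, and the function name `rows_to_actions` are assumptions; the action dictionaries follow the shape accepted by the `bulk` helper of the official Python client.

```python
from datetime import datetime, timedelta


def rows_to_actions(rows, since, index="prodsys"):
    """Yield one index action per task row modified at or after `since`.

    Indexing under an existing _id replaces the stored document, which is
    how an update is expressed for immutable documents.
    """
    for row in rows:
        if row["timestamp"] < since:
            continue  # unchanged since the last synchronization
        source = dict(row)
        # Serialize the timestamp so the action is JSON-compatible.
        source["timestamp"] = row["timestamp"].isoformat()
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": row["taskid"],  # assumption: taskid identifies the document
            "_source": source,
        }


# Invented rows standing in for a ProdSys2 query result.
rows = [
    {"taskid": 1, "status": "done", "timestamp": datetime(2017, 9, 18, 11, 30)},
    {"taskid": 2, "status": "running", "timestamp": datetime(2017, 9, 18, 9, 0)},
    {"taskid": 3, "status": "finished", "timestamp": datetime(2017, 9, 18, 11, 0)},
]

# Synchronize everything modified during the previous hour.
now = datetime(2017, 9, 18, 12, 0)
actions = list(rows_to_actions(rows, since=now - timedelta(hours=1)))
for action in actions:
    print(action["_id"], action["_source"]["status"])
```

Because each action reuses the task's ID, running the job twice over an overlapping period simply replaces the same documents again rather than creating duplicates.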