DATASET DISCOVERY STATUS REPORT
Maria Grigorieva
DKB Team
DKB MEETINGS & DISCUSSIONS
• Data Knowledge Base Technical Discussion 18 Apr 2017
https://indico.cern.ch/event/628962/
• Data Knowledge Base / DCC Technical Discussion 20 Apr 2017
https://indico.cern.ch/event/632634/
• DCC Meeting 18 May 2017 https://indico.cern.ch/event/640192/
• The DKB team was motivated to investigate campaign/dataset
discovery and categorization methods.
EXAMPLES OF TWIKI SUMMARIES
Metainformation in Twiki pages provides the most suitable
metadata categorization and representation for end users.
Twiki Monte-Carlo Campaign Pages contain:
• Aggregated event reports, categorized by physics category
• Data sample lists for each physics category
• Data sample lists broken down by a set of parameters, such as:
  • Monte-Carlo generators: Powheg+Pythia8 ...
  • Physics channel: W+ in mumu, W- in taumu, Z/gamma* in tau tau ...
  • Filtration methods: without lepton filter, two lepton filter ...
However, the metainformation in Twiki is not sufficiently structured,
and it provides no mechanism for synchronization with
database back-ends.
The challenge is to provide fully automatic search
and aggregation by an arbitrary set of
parameters.
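As a rough illustration of aggregation by an arbitrary set of parameters, the sketch below sums event counts over sample records grouped by whatever keys the caller names. The record layout and the `generator` and `channel` field names are assumptions made for this example, and the event numbers are made up; only the parameter values echo the Twiki examples above.

```python
from collections import defaultdict


def aggregate_events(samples, group_by):
    """Sum event counts over samples, grouped by an arbitrary list of keys."""
    totals = defaultdict(lambda: {"requested": 0, "processed": 0})
    for sample in samples:
        # Samples that lack a grouping key fall into a bucket keyed by None.
        key = tuple(sample.get(name) for name in group_by)
        totals[key]["requested"] += sample.get("requested_events", 0)
        totals[key]["processed"] += sample.get("processed_events", 0)
    return dict(totals)


# Made-up records; field names "generator" and "channel" are hypothetical.
samples = [
    {"generator": "Powheg+Pythia8", "channel": "W+ in mumu",
     "requested_events": 1000, "processed_events": 900},
    {"generator": "Powheg+Pythia8", "channel": "W- in taumu",
     "requested_events": 500, "processed_events": 500},
    {"generator": "Powheg+Pythia8", "channel": "W+ in mumu",
     "requested_events": 200, "processed_events": 100},
]

by_channel = aggregate_events(samples, ["generator", "channel"])
by_generator = aggregate_events(samples, ["generator"])
```

The same function serves both breakdowns because the grouping keys are passed in as data, which is the property a Twiki page with hand-maintained tables cannot offer.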
DKB NEW SCOPES
• To provide fast and flexible data categorization,
search and aggregation
1. Reproduce the Event Summary report in ProdTask
(https://prodtask-dev.cern.ch/prodtask/request_hashtags_main/#/hashtags/&MC16c_CP),
but with a physics category breakdown (as in the Twiki Event
Summary).
2. Implement Google-like search of tasks and data samples by
an arbitrary set of attributes, such as campaign, project,
ATLAS geometry, conditions tags, hashtags, physics
category, and others.
METADATA INDEXING AND SEARCH FACILITIES
Beginning with the 2016 Monte-Carlo
simulation campaign, ProdSys2 metadata
were enhanced with ‘hashtags’ for tasks,
enabling more detailed search by physics
category, physics channel, Monte-Carlo
generator list, etc.
ATLAS_DEFT.t_production_task
ATLAS_DEFT.t_prodmanager_request
ATLAS_DEFT.t_hashtag
ATLAS_DEFT.t_production_step
ATLAS_DEFT.t_step_template
ATLAS_DEFT.t_ht_to_task
ATLAS_DEFT.t_task
ATLAS_PANDA.jedi_datasets
Legend: unchangeable data; changeable data; numbers of events that can be used for aggregation
Metadata categories:
- Task parameters
  - Taskid
  - Taskname
  - Status
  - Timestamp
  - Start time
  - End time
  - Request ID
  - Ticket ID
  - User Name
- Experiment parameters
  - Energy
  - Campaign/Subcampaign
  - Project
  - Physics group
  - Physics category
  - Hashtags
  - Run number
- Configuration
  - ATLAS geometry
  - Conditions tags
  - SW Release
  - Trigger Config
- Events
  - Requested
  - Processed
- Data Samples
  - Input
  - Output
18.09.17 Maria Grigorieva
SUMMARIES IN KIBANA
{
  "query": {
    "bool": {
      "must": [
        { "term": { "subcampaign.keyword": "MC16a" } },
        { "term": { "status": "done" } }
      ],
      "should": [
        { "term": { "hashtag_list": "MC16a" } },
        { "term": { "hashtag_list": "MC16a_CP" } }
      ]
    }
  },
  "aggs": {
    "category": {
      "terms": { "field": "phys_category" },
      "aggs": {
        "step": {
          "terms": { "field": "step_name.keyword" },
          "aggs": {
            "requested": { "sum": { "field": "requested_events" } },
            "processed": { "sum": { "field": "processed_events" } }
          }
        }
      }
    }
  }
}
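The nested aggregation above returns one bucket per physics category, each holding one bucket per step with the two event sums. A minimal sketch of turning such a response into flat summary rows follows. The function name is hypothetical, and the mock response is hand-written with made-up numbers; it mirrors the standard layout of a terms aggregation result (`buckets`, `key`, `value`), and the step name "Evgen" is an assumption.

```python
def flatten_summary(response):
    """Turn the nested category/step aggregation into flat rows."""
    rows = []
    for category in response["aggregations"]["category"]["buckets"]:
        for step in category["step"]["buckets"]:
            rows.append({
                "phys_category": category["key"],
                "step_name": step["key"],
                # Sum aggregations report floats; event counts are integers.
                "requested": int(step["requested"]["value"]),
                "processed": int(step["processed"]["value"]),
            })
    return rows


# Hand-written response fragment with made-up numbers (not real output).
mock_response = {
    "aggregations": {
        "category": {
            "buckets": [
                {
                    "key": "Higgs",
                    "doc_count": 3,
                    "step": {
                        "buckets": [
                            {"key": "Evgen", "doc_count": 2,
                             "requested": {"value": 3000.0},
                             "processed": {"value": 2500.0}},
                            {"key": "Reco", "doc_count": 1,
                             "requested": {"value": 1000.0},
                             "processed": {"value": 1000.0}},
                        ]
                    },
                }
            ]
        }
    }
}

rows = flatten_summary(mock_response)
```

Each row then corresponds to one cell group of an event summary table broken down by physics category and production step.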
WEB-INTERFACE PROTOTYPE
GET prodsys/MC16/_search
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "\"MC16a\" AND \"Higgs\" AND \"ATLAS-R2-2016-01-00-01\" AND \"Reco\" AND \"OFLCOND-MC16-SDR-16\"",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "range": {
          "task_timestamp": {
            "gte": "01-05-2017 00:00:00",
            "lt": "10-08-2017 00:00:00"
          }
        }
      }
    }
  }
}
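A web interface would need to assemble this request body from whatever the user types. The sketch below shows one possible way to do that; the function name and its parameters are hypothetical, and it assumes that every search term should be matched as an exact phrase and that all terms are combined with AND, as in the query above.

```python
import json


def build_search_body(terms, time_from, time_to,
                      timestamp_field="task_timestamp"):
    """Build a query_string search restricted to a time window."""
    quoted = []
    for term in terms:
        # Escape backslashes and double quotes so each term stays one phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"{}"'.format(escaped))
    return {
        "query": {
            "bool": {
                "must": {
                    "query_string": {
                        "query": " AND ".join(quoted),
                        "analyze_wildcard": True,
                    }
                },
                "filter": {
                    "range": {
                        timestamp_field: {"gte": time_from, "lt": time_to}
                    }
                },
            }
        }
    }


body = build_search_body(
    ["MC16a", "Higgs", "ATLAS-R2-2016-01-00-01", "Reco", "OFLCOND-MC16-SDR-16"],
    "01-05-2017 00:00:00",
    "10-08-2017 00:00:00",
)
request_json = json.dumps(body, indent=2)
```

Serializing with `json.dumps` takes care of quoting inside the JSON string, so the typographic-quote and trailing-comma problems of a hand-edited request cannot occur.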
PRODSYS2 & ELASTIC SYNCHRONIZATION
• The candidate synchronization parameter is:
  • t_production_task: timestamp
• Documents in Elasticsearch are immutable; we cannot change them. Instead, if we need to update an existing document, we reindex or replace it.
• Internally, Elasticsearch marks the old document as deleted and adds an entirely new document. The old version of the document doesn’t disappear immediately, although you won’t be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.
• For synchronization, we can fetch the data from ProdSys2 for the previous time period and simply rewrite it in Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/guide/current/update-doc.html
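The rewrite strategy described above can be sketched as building a request body for the Elasticsearch bulk endpoint, in which each task is sent as an `index` action keyed by its task id, so that a document with the same id is replaced. This is only an illustration under several assumptions: the function name is hypothetical, the id field is assumed to be called `taskid`, the index and type names are taken from the search request shown earlier, and `_type` applies to older Elasticsearch versions with mapping types and would be left out on newer ones. Fetching the changed tasks from ProdSys2 is outside the scope of the sketch.

```python
import json


def bulk_index_payload(tasks, index="prodsys", doc_type="MC16",
                       id_field="taskid"):
    """Build a newline-delimited bulk body that re-indexes the given tasks."""
    lines = []
    for task in tasks:
        # An "index" action with an explicit _id overwrites any existing
        # document stored under that id.
        action = {"index": {"_index": index, "_type": doc_type,
                            "_id": task[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(task))
    # The bulk format requires a trailing newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""


# Made-up tasks standing in for rows changed since the last synchronization.
changed_tasks = [
    {"taskid": 1, "status": "done", "processed_events": 1000},
    {"taskid": 2, "status": "running", "processed_events": 250},
]

payload = bulk_index_payload(changed_tasks)
```

Using the task id as the document id is what makes repeated synchronization runs idempotent: re-sending a task replaces its document instead of creating a duplicate.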