LHC Physics Analysis and Databases


Page 1: LHC Physics Analysis and Databases

LHC Physics Analysis and Databases

Maaike Limper

Page 2: LHC Physics Analysis and Databases

Introduction

New Oracle-sponsored CERN openlab fellow: Maaike Limper, started January 2012

Project outline: Investigate the possibility of doing LHC-scale data reconstruction and/or physics analysis within an Oracle database

MINI-CV: Maaike Limper has a master's degree in Physics from the University of Amsterdam and a PhD in Particle Physics completed with Nikhef (the Dutch National Institute for Particle Physics). As part of her PhD and subsequent post-doc she worked on the ATLAS experiment, one of the experiments that measure events produced by the Large Hadron Collider at CERN. During her work as an ATLAS physicist, Maaike worked on the construction of the silicon tracker, developed track and vertex reconstruction algorithms and participated in the analysis of the first LHC data recorded with the ATLAS detector. Maaike was also involved in developing some of the ATLAS data-taking conditions databases for the pixel detector. As Prompt Reconstruction Coordinator for ATLAS, she was responsible for the reconstruction of all data recorded by the ATLAS detector. As of January 2012 she works full-time as an Oracle-funded openlab fellow for the CERN IT department.

LHC physics analysis and databases - M. Limper

Page 3: LHC Physics Analysis and Databases

[Figure labels: Lake Geneva, CERN, LHC, ATLAS]

Large Hadron Collider at CERN

Four main experiments recording events produced by the Large Hadron Collider: ATLAS, CMS, LHCb and ALICE

Examples of LHC-scale data-processing from my experience with the ATLAS experiment


Page 4: LHC Physics Analysis and Databases

LHC-scale data processing

[Diagram: LHC-scale data processing chain]
• Real data: data acquisition (event data taking) -> raw data -> event reconstruction -> event summary data -> event analysis -> analysis objects (extracted per physics topic): ntuple1, ntuple2, ..., ntupleN -> interactive physics analysis (thousands of users!)
• Simulation: event simulation (generation, simulation, digitization) -> simulated raw data -> event reconstruction

Page 5: LHC Physics Analysis and Databases

Event data taking

In 2011 LHC delivered 5.61 fb^-1 of p-p collision data:
• ~300 billion inelastic proton-proton interactions
• ~20 thousand events that produced a Standard Model Higgs with a mass of 150 GeV

ATLAS uses a flexible trigger menu to determine which events are interesting enough to record…

ATLAS recorded 1.6 billion events in 2011

[Diagram: data acquisition (event data taking) -> raw data]

Page 6: LHC Physics Analysis and Databases

ATLAS data reconstruction

Raw data (detector hits, energy depositions etc.) reconstructed by the ATLAS (C++) software framework

Reconstruction task examples:
• Fit particle trajectories from hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point


Page 11: LHC Physics Analysis and Databases

ATLAS data analysis

Reconstruction focuses on creating physics objects from the information measured in the detector

Analysis focuses on interpreting information from the reconstructed objects to determine what type of event took place

[Figure: ATLAS real physics event example: Z->μμ candidate, m_μμ = 93.4 GeV]

Example: apply quality criteria on muon candidates and calculate the invariant mass from the sum of the muon 4-momenta to find a Z-boson candidate
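The invariant-mass calculation in this example can be sketched in a few lines. This is a minimal Python illustration, not the ATLAS code or the PL/SQL implementation used later in the talk. The function names are hypothetical, and all inputs are assumed to be in the same energy unit (for example GeV).

```python
import math


def cartesian_from_pt_eta_phi(pt, eta, phi):
    """Convert (pT, eta, phi) to Cartesian momentum components (px, py, pz)."""
    return pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)


def invariant_mass(e1, px1, py1, pz1, e2, px2, py2, pz2):
    """Invariant mass of a two-particle system from the summed 4-momentum."""
    e = e1 + e2
    px, py, pz = px1 + px2, py1 + py2, pz1 + pz2
    m2 = e * e - (px * px + py * py + pz * pz)
    # Rounding can push m2 slightly below zero for nearly massless systems.
    return math.sqrt(max(m2, 0.0))


# Two back-to-back, effectively massless muons of 45 GeV each.
m_mumu = invariant_mass(45.0, 45.0, 0.0, 0.0, 45.0, -45.0, 0.0, 0.0)
```

The conversion helper is included because jets are typically described by (pT, eta, phi) rather than (px, py, pz), so the same mass formula can be reused after converting.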

Page 12: LHC Physics Analysis and Databases

ATLAS data reconstruction

ATLAS uses flexible trigger menus to reduce the data-taking rate to ~300 recorded events/second (2011 rate)

Tier-0 computing center at CERN has ~3000 CPUs available to reconstruct ATLAS data while it is recorded

Initial data gets reconstructed twice:
• “Express reconstruction” during data-taking
• “Bulk reconstruction” 36 hours after end of run, using beamspot and calibration constants derived from express reconstruction

Reprocessing campaigns every 2-3 months to re-reconstruct all data with the latest version of the reconstruction software

[Plot annotations: number of CPUs increased during busy periods; express reco, bulk reco]
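As a rough cross-check of these numbers, the recording rate and the CPU count imply an average per-event time budget. The sketch below is a back-of-envelope estimate only. It assumes that all CPUs work on a single reconstruction pass and that data-taking is continuous, neither of which is stated on the slide, and the function name is hypothetical.

```python
def cpu_seconds_per_event(n_cpus, recorded_rate_hz):
    """Average CPU time available per event if reconstruction keeps pace with recording."""
    return n_cpus / recorded_rate_hz


# ~3000 Tier-0 CPUs and ~300 recorded events/second (2011 figures from the slide).
budget = cpu_seconds_per_event(n_cpus=3000, recorded_rate_hz=300.0)
```

Under these assumptions the budget is of order ten CPU-seconds per event. Running both express and bulk reconstruction would reduce the time available per pass.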

Page 13: LHC Physics Analysis and Databases

Event Simulation

In addition to the real physics events, physicists require simulated (MC) events to compare/test/understand their results

Each physics group requests sets of signal and background samples: ~100 million simulated events requested

Generation -> Simulation -> Digitization -> Reconstruction

During each ATLAS reprocessing campaign, new versions of all simulation samples are provided

[Diagram: event simulation (generation, simulation, digitization) -> simulated raw data]

Page 14: LHC Physics Analysis and Databases

Data analysis

ROOT-ntuples are centrally produced by physics groups from previously reconstructed event summary data

Each physics group determines the specific content of its ntuple:
• Physics objects to include
• Level of detail to be stored per physics object
• Event filter and/or pre-analysis steps

Variables stored for each event in the form of:
• scalars (example: missing energy, number of reconstructed muons)
• vectors (example: energy, direction, momentum of reconstructed muons)
• vectors-of-vectors (example: position of hits on reconstructed muons)

Physics analysis at LHC is mainly done with ROOT:
• C++
• analysis tools (plotting/fitting/statistical analysis)

[Diagram: event summary data -> ntuple1, ntuple2, ..., ntupleN]

Page 15: LHC Physics Analysis and Databases

Physics Analysis from DB

Benchmark Physics Analysis in an Oracle DB:
• Simplified version of HZbbll analysis (search for Standard Model Higgs boson produced in association with a Z-boson)
• Select muon-candidates to reconstruct Z-peak
• Select b-jet-candidates to reconstruct Higgs-peak

• Signal sample: 29887 events (3 ntuples)
• Background sample (Z->mumu+jets): 289916 events (30 ntuples)
• Use ntuple defined by ATLAS Top Physics Group: “NTUP_TOP”
• 4212 physics attributes per event

Initial challenges:
• Large number of attributes, many of which are vector-type, difficult to implement in a single table, so I divided the data over multiple tables
• Need to select data with SQL-queries instead of C++ code containing a loop over all events in the file

Page 16: LHC Physics Analysis and Databases

Physics DB

Initial DB implementation holds 695 out of 4212 variables (16.5%):
• “EventData”-table: 184 columns, 184 event-related variables (scalar), primaryKey=(RunNumber,EventNumber)
• “muon”-table: 271 columns, 268 muon-related variables (muon-vector content), primaryKey=(muonId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “jet”-table: 193 columns, 190 jet-related variables (jet-vector content), primaryKey=(jetId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “MET”-table: 55 columns, 53 MET (Missing Transverse Energy)-related variables (scalar), primaryKey=(RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)

ROOT-ntuple size is 880 MB
Current DB-size per stored ntuple (16.5% of contents) is ~200 MB
Full DB-size would be ~2.6 GB per ntuple (revised from ~1.2 GB per ntuple)
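A cut-down version of this table layout can be sketched as follows. The example uses SQLite through Python's sqlite3 module as a stand-in for Oracle, keeps only a handful of columns, and invents one scalar column (n_muons) for illustration. The key structure follows the slide: events keyed by (RunNumber, EventNumber), one row per muon, and a foreign key from each muon back to its event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE EventData (
    RunNumber   INTEGER NOT NULL,
    EventNumber INTEGER NOT NULL,
    n_muons     INTEGER,              -- hypothetical scalar column
    PRIMARY KEY (RunNumber, EventNumber)
);
CREATE TABLE muon (
    muonId      INTEGER NOT NULL,     -- index of the muon within its event
    RunNumber   INTEGER NOT NULL,
    EventNumber INTEGER NOT NULL,
    pt REAL, eta REAL, phi REAL, E REAL,
    PRIMARY KEY (muonId, RunNumber, EventNumber),
    FOREIGN KEY (RunNumber, EventNumber)
        REFERENCES EventData (RunNumber, EventNumber)
);
""")

# One event with two muons: a vector-type ntuple variable becomes one row per muon.
conn.execute("INSERT INTO EventData VALUES (1, 100, 2)")
conn.executemany(
    "INSERT INTO muon VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(0, 1, 100, 25000.0, 0.5, 1.0, 30000.0),
     (1, 1, 100, 31000.0, -0.7, -2.0, 40000.0)],
)
n_muon_rows = conn.execute(
    "SELECT COUNT(*) FROM muon WHERE RunNumber = 1 AND EventNumber = 100"
).fetchone()[0]
```

With the foreign key in place, a muon row that refers to an event missing from EventData is rejected, which keeps the per-object tables consistent with the event table.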

Page 17: LHC Physics Analysis and Databases

Physics Analysis

Simplified version of HZbbll analysis:
• muon selection: “IsMuon”-function must return TRUE; it includes the requirements pT>20 GeV and |η|<2.4 plus several requirements on hits and holes on tracks
• Require exactly 2 selected muons per event
• b-jet selection: transverse momentum pT>25 GeV, |η|<2.5 and “flavour_weight_Comb”>1.55 (to select b-jets)
• Require exactly 2 selected b-jets per event
• Require 1 of the 2 b-jets to have pT>45 GeV
• Plot “invariant mass” of muons (Z-peak) and of b-jets (Higgs-peak)

Two versions of this analysis:
• Standard ntuple-analysis in ROOT (C++) using locally stored ntuples
• Analysis from Oracle Physics DB running on same machine as DB and using functions implemented in “PHYSANALYSIS” PL/SQL-package: “IsMuon”, “InvariantMassLeptons”, “InvariantMassJets”
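The event selection listed above, written as a loop over events in the style of an ntuple analysis, could look like the sketch below. It is written in Python rather than ROOT C++ for brevity. The event layout (dictionaries with "muons" and "jets" lists) is hypothetical, pT is assumed to be in GeV, and is_muon is a simplified stand-in that applies only the kinematic cuts, not the hit and hole requirements of the real “IsMuon” function.

```python
def is_muon(muon, pt_min=20.0, eta_max=2.4):
    """Simplified stand-in for "IsMuon": kinematic cuts only."""
    return muon["pt"] > pt_min and abs(muon["eta"]) < eta_max


def is_bjet(jet, pt_min=25.0, eta_max=2.5, weight_min=1.55):
    """b-jet selection on pT, |eta| and the combined flavour weight."""
    return (jet["pt"] > pt_min and abs(jet["eta"]) < eta_max
            and jet["flavour_weight_Comb"] > weight_min)


def passes_selection(event):
    """Exactly 2 selected muons, exactly 2 selected b-jets, one b-jet above 45 GeV."""
    muons = [m for m in event["muons"] if is_muon(m)]
    if len(muons) != 2:
        return False
    bjets = [j for j in event["jets"] if is_bjet(j)]
    if len(bjets) != 2:
        return False
    return max(j["pt"] for j in bjets) > 45.0


good_muons = [{"pt": 25.0, "eta": 0.1}, {"pt": 30.0, "eta": -1.0}]
events = [
    {"EventNumber": 1, "muons": good_muons,
     "jets": [{"pt": 50.0, "eta": 0.3, "flavour_weight_Comb": 2.0},
              {"pt": 30.0, "eta": -0.5, "flavour_weight_Comb": 3.0}]},
    {"EventNumber": 2, "muons": good_muons,  # leading b-jet below 45 GeV
     "jets": [{"pt": 40.0, "eta": 0.3, "flavour_weight_Comb": 2.0},
              {"pt": 30.0, "eta": -0.5, "flavour_weight_Comb": 3.0}]},
    {"EventNumber": 3,  # only one muon passes the pT cut
     "muons": [{"pt": 25.0, "eta": 0.1}, {"pt": 10.0, "eta": 0.2}],
     "jets": [{"pt": 50.0, "eta": 0.3, "flavour_weight_Comb": 2.0},
              {"pt": 30.0, "eta": -0.5, "flavour_weight_Comb": 3.0}]},
]
selected = [ev["EventNumber"] for ev in events if passes_selection(ev)]
```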

Page 18: LHC Physics Analysis and Databases

Physics Analysis in SQL

Done using SQL-query making temporary tables for different selections and joining data from different tables via “EventNumber”:

with selectedmuon as (
  select "muon_i","EventNumber","RunNumber","E","px","py","pz"
  from "muon"
  where MLIMPER.PHYSANALYSIS.IS_MUON(
          "muon_i", "pt", "eta", "phi", "E",
          "me_qoverp_exPV", "id_qoverp_exPV", "me_theta_exPV", "id_theta_exPV", "id_theta",
          "isCombinedMuon", "isLowPtReconstructedMuon", "tight", "expectBLayerHit",
          "nBLHits", "nPixHits", "nPixelDeadSensors", "nPixHoles",
          "nSCTHits", "nSCTDeadSensors", "nSCTHoles", "nTRTHits", "nTRTOutliers",
          0, 20000., 2.4) = 1
),
selectedeventsmuon as (
  select "EventNumber", COUNT(*) as "mu_sel_n"
  from selectedmuon
  group by "EventNumber"
  having COUNT(*)=2
),
selectedbjet as (
  select "jet_i","EventNumber","RunNumber","E","pt","phi","eta"
  from "jet" INNER JOIN selectedeventsmuon USING("EventNumber")
  where "pt"/1000>25 and abs("eta")<2.5 and "fl_w_Comb">1.55
),
selectedevents as (
  select "EventNumber", COUNT(*) as "jet_sel_n"
  from selectedbjet
  group by "EventNumber"
  having COUNT(*)=2
)
select
  MLIMPER.PHYSANALYSIS.INV_MASS_LEPTONS(mu1."E",mu2."E",mu1."px",mu2."px",mu1."py",mu2."py",mu1."pz",mu2."pz")/1000. as "DiMuonMass",
  MLIMPER.PHYSANALYSIS.INV_MASS_JETS(jet1."E",jet2."E",jet1."pt",jet2."pt",jet1."phi",jet2."phi",jet1."eta",jet2."eta")/1000. as "DiJetMass"
from selectedmuon mu1, selectedmuon mu2, selectedbjet jet1, selectedbjet jet2, selectedevents evSel
where mu1."muon_i"<mu2."muon_i"
  and mu1."EventNumber"=evSel."EventNumber"
  and mu2."EventNumber"=evSel."EventNumber"
  and jet1."jet_i"<jet2."jet_i"
  and jet1."EventNumber"=evSel."EventNumber"
  and jet2."EventNumber"=evSel."EventNumber"
  and jet1."pt"/1000.>45.
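The structure of this query (a chain of named sub-selections, each narrowing the previous one) can be tried out on a toy dataset. The sketch below uses SQLite through Python's sqlite3 module instead of Oracle. It replaces the IS_MUON package function with plain pT and |η| cuts, expresses the 45 GeV requirement as a cut on the highest-pT selected b-jet, and uses invented sample rows, so it illustrates the pattern rather than reproducing the original query. Momenta are in MeV, as in the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE muon (EventNumber INTEGER, muon_i INTEGER, pt REAL, eta REAL);
CREATE TABLE jet  (EventNumber INTEGER, jet_i INTEGER, pt REAL, eta REAL,
                   fl_w_Comb REAL);
""")
conn.executemany("INSERT INTO muon VALUES (?, ?, ?, ?)", [
    (1, 0, 25000.0, 0.1), (1, 1, 30000.0, -1.0),
    (2, 0, 25000.0, 0.1), (2, 1, 30000.0, -1.0),
    (3, 0, 25000.0, 0.1), (3, 1, 10000.0, 0.2),   # second muon fails pT cut
])
conn.executemany("INSERT INTO jet VALUES (?, ?, ?, ?, ?)", [
    (1, 0, 50000.0, 0.3, 2.0), (1, 1, 30000.0, -0.5, 3.0),
    (2, 0, 40000.0, 0.3, 2.0), (2, 1, 30000.0, -0.5, 3.0),  # no jet above 45 GeV
    (3, 0, 50000.0, 0.3, 2.0), (3, 1, 30000.0, -0.5, 3.0),
])

QUERY = """
WITH selectedmuon AS (
    SELECT EventNumber, muon_i FROM muon
    WHERE pt / 1000.0 > 20 AND abs(eta) < 2.4      -- stand-in for IS_MUON
),
selectedeventsmuon AS (
    SELECT EventNumber FROM selectedmuon
    GROUP BY EventNumber HAVING COUNT(*) = 2
),
selectedbjet AS (
    SELECT EventNumber, jet_i, pt FROM jet
    INNER JOIN selectedeventsmuon USING (EventNumber)
    WHERE pt / 1000.0 > 25 AND abs(eta) < 2.5 AND fl_w_Comb > 1.55
),
selectedevents AS (
    SELECT EventNumber FROM selectedbjet
    GROUP BY EventNumber HAVING COUNT(*) = 2
)
SELECT EventNumber FROM selectedevents
INNER JOIN selectedbjet USING (EventNumber)
GROUP BY EventNumber HAVING MAX(pt) / 1000.0 > 45
ORDER BY EventNumber
"""
selected = [row[0] for row in conn.execute(QUERY)]
```

Joining the jet table against the events that already passed the muon selection mirrors the original query: the larger jet table is only filtered for events that are still candidates.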

Page 19: LHC Physics Analysis and Databases

Physics Analysis benchmark

Output of SQL-query sent to ROOT to produce standard ROOT-histograms:

ROOT-macro using original ntuples as input produces identical histograms:

Page 20: LHC Physics Analysis and Databases

Physics Analysis in DB

Both tests done on a CERN virtual machine: 2 GB RAM, 2 CPUs, SLC5 64-bit

Average time of analysis measured after reboot of virtual machine

Speed from ntuple depends on:
• number of files
• number of used ntuple-branches (=physics-attributes)

DB-speed depends on:
• clever implementation of SQL-query: I’m not (yet) an SQL-guru…
• table-size: select from jet-table much slower than from muon-table, as more jet-objects are stored per event

Time to produce plots from physics DB vs from ntuple-files

Sample    #events  #sel.events  time from DB  time from ntuple
HZbbll       9993          421  5 s ??        7 s
HZbbll      29987         1326  95 s          14 s
Z+2jets    289916          170  85 s          110 s

Page 21: LHC Physics Analysis and Databases

Physics Analysis in DB

Possible space-gain for DB version of analysis data:
• Each physics group optimized their own ntuple-size based on physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
• Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use

First implementation of Physics Analysis in Oracle DB:
• Data in multiple tables
• SQL-query can reproduce selection in loop over events
• Analysis from DB speed similar to original ntuple-analysis but many complexities still not implemented…

[Diagram: event summary data -> analysis objects (extracted per physics topic): ntuple1, ntuple2, ..., ntupleN]

Page 22: LHC Physics Analysis and Databases

Physics Analysis in DB

Possible space-gain for DB version of analysis data:
• Each physics group optimized their own ntuple-size based on physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
• Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use

First implementation of Physics Analysis in Oracle DB:
• Data in multiple tables
• SQL-query can reproduce selection in loop over events
• Analysis from DB slower than original ntuple-analysis and many complexities still not implemented…

[Diagram: event summary data -> analysis objects stored in database: physicsDB]

Page 23: LHC Physics Analysis and Databases

Physics Analysis in DB

Space requirement for realistic physics DB (order of magnitude):
• ~2.6 GB per ntuple (10k events), revised from ~1.2 GB
• ~2 billion events data+simulation in 2011
  -> ~520 TB of analysis data (revised from ~240 TB)
• ~10 revisions (reconstruction software versions) actively analyzed at a given time (more revisions may need to be archived)
  -> ~2400 TB of analysis data (based on the ~240 TB estimate; ~5200 TB with the revised ~520 TB)
• ~factor 4 more data from LHC expected in 2012

Analysis DB would need to be accessible by thousands of users!
-> duplicates of DB at different analysis sites needed

[Diagram: event summary data -> analysis objects stored in database: physicsDB]
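The order-of-magnitude estimate on this slide is plain arithmetic and can be reproduced as below. The function names are hypothetical, 1 TB is taken as 1000 GB, and the inputs are the figures quoted in the talk (695 of 4212 variables stored in ~200 MB, 10k events per ntuple, ~2 billion events).

```python
def extrapolated_size_mb(stored_mb, n_stored, n_total):
    """Scale the size of a partially filled DB to the full variable count."""
    return stored_mb * n_total / n_stored


def analysis_data_tb(gb_per_ntuple, events_total=2e9, events_per_ntuple=1e4,
                     revisions=1):
    """Total analysis-data volume in TB for a given per-ntuple DB size."""
    n_ntuples = events_total / events_per_ntuple  # 200,000 ntuples of 10k events
    return gb_per_ntuple * n_ntuples * revisions / 1000.0  # GB -> TB


full_mb = extrapolated_size_mb(200.0, 695, 4212)        # naive scaling, about 1.2 GB
one_revision_tb = analysis_data_tb(2.6)                 # about 520 TB
ten_revisions_tb = analysis_data_tb(2.6, revisions=10)  # about 5200 TB
```

The naive scaling by variable count gives the ~1.2 GB figure. The ~2.6 GB figure is taken from the slide as given, and the sketch does not attempt to explain the difference between the two.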

Page 24: LHC Physics Analysis and Databases

To be continued…

Reconstruction versus Analysis of LHC data

• Analysis data is relatively easily organized in tables and columns
  Reconstruction of data starts from raw-event data, less easily organized in tables and columns and likely to require data in blobs
• Analysis uses many select-type arguments and functions that can be implemented in PL/SQL
  Reconstruction of data in DB will require use of an external agent to run the experiment’s C++ reconstruction software at the database
• Analysis data will need to be accessible by many users doing many different analyses at once
  Reconstruction tasks are centrally organized; data is reconstructed during data-taking and re-reconstructed during re-processing campaigns

How to optimize the use of Oracle services for LHC-scale data processing?