Dataflow/workflow with real data through Tiers


Page 1: Dataflow/workflow with real data through Tiers

Dataflow/workflow with real data through Tiers

N. De Filippis
Department of Physics and INFN Bari

Tutorial, CMSItalia, 12-14 Feb 2007

Page 2: Dataflow/workflow with real data through Tiers


Outline

• Computing facilities in the control room, at Tier-0 and at Central Analysis Facilities (CAF): Ex: the Tracker Analysis Centre (TAC)

• Local storage and automatic processing at TAC
  – how to register files in DBS/DLS

• Automatic data shipping and remote processing at Tier-1/Tier-2
  – injection in PhEDEx for the transfer
  – re-reconstruction and skimming with ProdAgent

• Data analysis in a distributed environment via CRAB

• Simulation of cosmics in a Tier-2 site

Page 3: Dataflow/workflow with real data through Tiers


What is expected in the CMS Computing Model

Dataflow/workflow from Point 5 to the Tiers:

DAQ + Filter Farm
→ Disk storage (temporary, before transfer to CASTOR)
→ CASTOR / publishing in DBS/DLS
→ Local storage and reconstruction
→ DQM / Visualization
→ Shipping to Tier-1 / Tier-2
→ Re-reconstruction, skimming
→ End-user analysis

The CAF will support:

• diagnostics of detector problems, trigger performance services
• derivation of calibration and alignment data
• reconstruction services, interactive and batch analysis facilities

Most of these tasks have to be performed at remote Tier sites in a distributed environment.

Page 4: Dataflow/workflow with real data through Tiers


Computing facilities in the control room, at Tier-0 and at Central Analysis Facilities

Page 5: Dataflow/workflow with real data through Tiers


Example of a facility for Tracker

• The TAC is a dedicated Tracker control room
  – it serves the needs of collecting and analysing the data from the 25% Tracker test at the Tracker Integration Facility (TIF)
  – in use since Oct. 1st 2006 by DAQ and detector people

• Computing elements at the TAC:
  – 1 disk server: CMSTKSTORAGE
  – 1 DB server: CMSTIBDB
  – 1 wireless/wired router
  – 12 PCs:
    • 2 DAQ (CMSTAC02 and CMSTAC02)
    • 3 DQM, 1 Visualization (CMSTKMON, CMSTAC04 and CMSTAC05)
    • 2 TIB/TID (CMSTAC00 and CMSTAC01)
    • 3 DCS (PCCMSTRDCS10, PCCMSTRDCS11 and PCCMSTRDCS12)
    • 2 TEC+ (CMSTAC06 and CMSTAC07) + 1 private PC

The TAC is like a control room + Tier-0 + CAF “in miniature”.

Page 6: Dataflow/workflow with real data through Tiers


Local storage and processing at TAC

• A dedicated PC (CMSTKSTORAGE) is devoted to storing the data temporarily:
  – it currently has 2.8 TB of local fast disk (no redundancy)
  – it allows local caching of about 10 days of data taking (300 GB/day expected for the 25% test)

• CMSTKSTORAGE is also used to perform the following tasks:
  a) perform o2o for connection and pedestal runs to fill the Offline DB
  b) convert RU files into EDM-compliant formats
  c) write files to CASTOR when ready; the area in CASTOR is created under …/store/…:
     • /castor/cern.ch/cms/store/TAC/PIXEL
     • /castor/cern.ch/cms/store/TAC/TIB
     • /castor/cern.ch/cms/store/TAC/TOB
     • /castor/cern.ch/cms/store/TAC/TEC
  d) register the files in the Data Bookkeeping Service (DBS) and Data Location Service (DLS)
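A minimal sketch of step c), assuming rfcp is used to copy a converted EDM file into the TIB area (the file name, directory and exact commands here are illustrative, not the actual TAC automation):

  # hypothetical sketch: copy one converted EDM file into the CASTOR TAC area
  RUN=0000505
  SRC=/data/edm/EDM${RUN}_000.root
  DEST=/castor/cern.ch/cms/store/TAC/TIB/edm_2007_01_29

  nsmkdir -p $DEST        # create the CASTOR directory if it does not exist yet
  rfcp $SRC $DEST/        # copy the file from local disk to CASTOR
  nsls -l $DEST           # verify that the file arrived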

Page 7: Dataflow/workflow with real data through Tiers


How to register files in DBS/DLS (1)

A grid certificate with CMS Role=Production is needed:

voms-proxy-init -voms cms:/cms/Role=production

DBS and DLS API:

cvs co -r DBS_0_0_3a DBS
cvs co -r DLS_0_1_2 DLS

One DBS and one DLS instance are used; please use:

  MCLocal_4/Writer for DBS
  prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 for DLS

The following information about your EDM-compliant file is needed:
  --PrimaryDataset=TAC-TIB-120-DAQ-EDM
  --ProcessedDataset=CMSSW_1_2_0-RAW-Run-0000505
  --DataTier=RAW
  --LFN=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root
  --Size=205347982
  --TotalEvents=3707

One processed dataset per run.

Page 8: Dataflow/workflow with real data through Tiers


How to register files in DBS/DLS (2)

  --GUID=38ACFC35-06B0-DB11-B463 (extracted with EdmFileUtil -u file:file.root)
  --CheckSum=4264158233 (extracted with the cksum command)
  --CMSSWVersion=CMSSW_1_2_0
  --ApplicationName=FUEventProcess
  --ApplicationFamily=Online
  --PSetHash=4cff1ae0-1565-43f8-b1e9-82ee0793cc8c (extracted with uuidgen)

Run the script for the registration in DBS:

python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer \
  --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" \
  --PrimaryDataset=$primdataset --ProcessedDataset=$procdataset --DataTier=RAW \
  --LFN=$lfn --Size=$size --TotalEvents=$nevts --GUID=$guid --CheckSum=$cksum \
  --CMSSWVersion=CMSSW_1_2_0 --ApplicationName=FUEventProcess \
  --ApplicationFamily=Online --PSetHash=$psethash

Closure of blocks in DBS:

python closeDBSFileBlock.py --DBSAddress=MCLocal_4/Writer --datasetPath=$dataset

The two scripts dbsCgiCHWriter.py and closeDBSFileBlock.py can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/
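As a rough end-to-end illustration, these steps can be glued together in a small shell script run once per file; a hedged sketch (the local file path, the use of stat/cksum/uuidgen on a local copy, and the variable plumbing are assumptions, not the actual TAC registration scripts):

  #!/bin/sh
  # hypothetical sketch: gather the metadata for one converted EDM file and register it in DBS
  localfile=/data/edm/EDM0000505_000.root
  lfn=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root

  size=$(stat -c %s $localfile)                 # file size in bytes
  cksum=$(cksum $localfile | awk '{print $1}')  # checksum, as on the previous slide
  guid=$(EdmFileUtil -u file:$localfile)        # GUID; the EdmFileUtil output may need parsing
  psethash=$(uuidgen)                           # PSet hash generated with uuidgen, as on the slide
  nevts=3707                                    # number of events, known from the conversion step

  python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer \
    --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" \
    --PrimaryDataset=TAC-TIB-120-DAQ-EDM --ProcessedDataset=CMSSW_1_2_0-RAW-Run-0000505 \
    --DataTier=RAW --LFN=$lfn --Size=$size --TotalEvents=$nevts \
    --GUID=$guid --CheckSum=$cksum --CMSSWVersion=CMSSW_1_2_0 \
    --ApplicationName=FUEventProcess --ApplicationFamily=Online --PSetHash=$psethash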

Page 9: Dataflow/workflow with real data through Tiers


How to register files in DBS/DLS (3)

Run the script for the registration of blocks of files in DLS:

python dbsread.py --datasetPath=$dataset

or for each block of files:

dls-add -i DLS_TYPE_LFC \
  -e prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4/TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7 \
  srm.cern.ch

where srm.cern.ch is the name of the SE.
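To register many blocks in one go, the same command can be looped over a list of blocks; a minimal sketch mirroring the form of the command above (the blocks.txt list file and the way the block name is appended to the endpoint are illustrative assumptions):

  # hypothetical sketch: register every file block listed in blocks.txt at the CERN SE
  DLS_ENDPOINT=prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4
  SE=srm.cern.ch

  while read block; do
    # each line of blocks.txt holds one block name, e.g.
    # TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7
    dls-add -i DLS_TYPE_LFC -e $DLS_ENDPOINT/$block $SE
  done < blocks.txt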

Data registered in DBS

Page 10: Dataflow/workflow with real data through Tiers


Results in the Data Discovery page: http://cmsdbs.cern.ch/discovery/expert
(screenshots showing the Tracker data and MTCC data entries)

Page 11: Dataflow/workflow with real data through Tiers


Automatic data shipping and remote processing at Tier-1/Tier-2

Page 12: Dataflow/workflow with real data through Tiers


PhEDEx injection (1)

• Data published in DBS and DLS are ready to be transferred via the CMS official data movement tool, PhEDEx.

• The injection, i.e. the procedure that writes into the PhEDEx transfer database, has in principle to be run from CERN, where the data are collected, but it can also be run at a remote Tier-1/Tier-2 site hosting PhEDEx.

• At Bari it runs via an official PhEDEx agent and a component of ProdAgent modified to “close” blocks at the end of the transfer, in order to enable automatic publishing in DLS (the same procedure used for Monte Carlo data).

• Complete automation is reached with a script that watches for new Tracker-related entries in DBS/DLS (a sketch is given below).

• Once data are injected in PhEDEx, any Tier-1 or Tier-2 can subscribe to them.
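A minimal sketch of such a watcher, assuming a cron-driven shell script that asks DBS for Tracker datasets and injects any that have not been seen before (listTrackerDatasets.py is a made-up helper standing in for whatever DBS query is used; the state file and loop are illustrative, not the actual Bari implementation):

  #!/bin/sh
  # hypothetical sketch of the DBS/DLS watcher (run periodically from cron)
  SEEN=$HOME/injected_datasets.txt
  touch $SEEN

  # list the currently known TAC datasetpaths, one per line (hypothetical helper)
  python listTrackerDatasets.py > /tmp/current_datasets.txt

  for ds in $(cat /tmp/current_datasets.txt); do
    if ! grep -q "^$ds$" $SEEN; then
      # new dataset: inject it into the PhEDEx dropbox and remember it
      python dbsinjectTMDB.py --datasetPath=$ds --injectdir=logs/
      echo $ds >> $SEEN
    fi
  done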

Page 13: Dataflow/workflow with real data through Tiers


PhEDEx injection (2)

ProdAgent_v0XX is needed: configure PA to use the PhEDEx dropbox /dir/state/inject-tmdb/inbox

prodAgent-edit-config --component=PhEDExInterface --parameter=PhEDExDropBox --value=/dropboxdir/

start the PhEDExInterface component of PA:

prodAgentd --start --component=PhEDExInterface

PhEDEx_2.4 is needed: configure the inject-tmdb agent in your Config file

### AGENT LABEL=inject-tmdb PROGRAM=Toolkit/DropBox/DropTMDBPublisher
 -db    ${PHEDEX_DBPARAM}
 -node  TX_NON_EXISTENT_NODE

start the inject-tmdb agent of PhEDEx:

./Master -config Config start inject-tmdb

Page 14: Dataflow/workflow with real data through Tiers


PhEDEx injection (3)

For each datasetpath of a run:

python dbsinjectTMDB.py --datasetPath=$dataset --injectdir=logs/
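When several runs are to be injected, the same command can simply be looped over their dataset paths; a minimal sketch (the datasets.txt list file is an illustrative assumption):

  # hypothetical sketch: inject every datasetpath listed in datasets.txt, one per line
  while read dataset; do
    python dbsinjectTMDB.py --datasetPath=$dataset --injectdir=logs/
  done < datasets.txt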

The script dbsinjectTMDB.py can be found in /afs/cern.ch/user/n/ndefilip/public/Registration.

In the PhEDEx log you will find messages like the following:

2007-01-31 07:55:05: TMDBInject[18582]: (re)connecting to database
Connecting to database
Reading file information from /home1/prodagent/state/inject-tmdb/work/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869.xml
Processing dbs http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery?instance=MCLocal_4/Writer (204)
Processing dataset /TAC-TIB-120-DAQ-EDM/RAW (1364)
Processing block /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000520#353a3ae2-30a0-4f30-86df-e08ba9ac6869 (7634) :+/ 1 new files, 1 new replicas PTB R C
2007-01-31 07:55:08: DropTMDBPublisher[5828]: stats: _TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09 3.04r 0.18u 0.08s success

Page 15: Dataflow/workflow with real data through Tiers


Results in the PhEDEx page:

http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Data::Replicas?filter=TAC-T;view=global;dexp=1364;rows=;node=6;node=19;node=44;nvalue=Node%20files#d1364

http://cmsdoc.cern.ch/cms/aprom/phedex

Page 16: Dataflow/workflow with real data through Tiers


“Official” reconstruction/skimming (1)

• Goal: to run the reconstruction of raw data in a standard and official way, typically using the code of a CMSSW release (no pre-release, no user patch).

• The ProdAgent tool was evaluated to perform the reconstruction with the same procedures as for Monte Carlo samples.

• ProdAgent can be run anywhere, but preferably at a Tier-1/Tier-2.

• Running with ProdAgent ensures that RECO data are automatically registered in DBS and DLS, ready to be shipped to Tier-1 and Tier-2 sites and analysed via the computing tools.

• In the near future the standard reconstruction, calibration and alignment tasks will run on Central Analysis Facility (CAF) machines at CERN, as foreseen in the Computing Model.

Page 17: Dataflow/workflow with real data through Tiers


“Official” reconstruction/skimming (2)

• Input data are processed run by run and new processed datasets are created as output, one for each run.

• ProdAgent uses the DatasetInjector component to be aware of the input files to be processed.

• The workflow file has to be created from the cfg used for the reconstruction; the following example (next slide) is for DIGI-RECO processing starting from GEN-SIM input files.

• No pileup, StartUp or LowLumi pileup can be chosen for the digitization.

• Splitting of the input files can be done either by event or by file (see the sketch below).
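As an illustration of the last point, the workflow-creation command of the next slide would change only in its splitting options; a hedged sketch of the by-file variant (whether "file" is the exact value accepted by --split-type in this ProdAgent version is an assumption to be checked against the ProdAgent documentation):

  # hypothetical sketch: same workflow creation as on the next slide, but with one input file per job
  # (DBS/DLS and pileup options omitted here for brevity; see the full command on the next slide)
  python $PRODAGENT_ROOT/util/createProcessingWorkflow.py \
    --dataset=/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530 \
    --cfg=DIGI-RECO-NoPU-OnSel.cfg --version=CMSSW_1_2_0 --category=mc \
    --split-type=file --split-size=1 \
    --name=TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU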

Page 18: Dataflow/workflow with real data through Tiers


“Official” reconstruction/skimming (3)

Creating the workflow file for the no-pileup case:

python $PRODAGENT_ROOT/util/createProcessingWorkflow.py \
  --dataset=/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530 \
  --cfg=DIGI-RECO-NoPU-OnSel.cfg --version=CMSSW_1_2_0 --category=mc \
  --dbs-address=MCLocal_4/Writer \
  --dbs-url=http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery \
  --dls-type=DLS_TYPE_DLI --dls-address=lfc-cms-test.cern.ch/grid/cms/DLS/MCLocal_4 \
  --same-primary-dataset --only-closed-blocks --fake-hash \
  --split-type=event --split-size=1000 \
  --pileup-files-per-job=1 --pileup-dataset=/mc-csa06-111-minbias/GEN/CMSSW_1_1_1-GEN-SIM-1164410273 \
  --name=TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU

Submitting jobs:

python PRODAGENT/test/python/IntTests/InjectTestSkimLCG.py \
  --workflow=/yourpath/TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU-Workflow.xml \
  --njobs=300

Page 19: Dataflow/workflow with real data through Tiers


Data analysis via CRAB at Tiers (1)

• Data published in DBS/DLS can be processed remotely via CRAB, using the distributed-environment tools.
• Users have to edit crab.cfg and insert the dataset path of the run to be analyzed, as obtained from the DBS info; a minimal sketch is given below.
• Users have to provide their CMSSW cfg, set up the environment and compile their code via scramv1.
• The offline DB accessed via Frontier at Tier-1/2 was already tested during CSA06 with alignment data.
• An example cfg to perform the reconstruction chain starting from raw data can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/TACAnalysis_Run2048.cfg
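A minimal sketch of the relevant crab.cfg fields for one run, written as a shell here-document; the dataset path, cfg name, scheduler and the exact key names are assumptions to be checked against F. Fanzago's CRAB tutorial, not a verified configuration:

  # hypothetical sketch: minimal crab.cfg for analysing one TAC run, then create and submit the jobs
  cat > crab.cfg <<'EOF'
  [CRAB]
  jobtype   = cmssw
  scheduler = edg

  [CMSSW]
  datasetpath            = /TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000505
  pset                   = TACAnalysis_Run2048.cfg
  total_number_of_events = -1
  events_per_job         = 1000

  [USER]
  return_data = 1
  EOF

  crab -create
  crab -submit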

• Thanks to D. Giordano for the support

Page 20: Dataflow/workflow with real data through Tiers


Data analysis via CRAB at Tiers (2)

• The relevant piece of the cfg used to access the offline DB via Frontier (cfg snippet shown on the original slide)

• The output files produced with CRAB are not registered in DBS/DLS (but the implementation of this feature is under development…)

• Further details about CRAB in the tutorial of F. Fanzago.

Page 21: Dataflow/workflow with real data through Tiers


“Official” Cosmics simulation (1)

Goal: to make a standard simulation of cosmics with official code in a CMSSW release (no patch, no pre-releases). CMSSW_1_2_2 is needed:

Download AnalysisExamples/SiStripDetectorPerformance

cvs co -r CMSSW_1_2_2 AnalysisExamples/SiStripDetectorPerformance

Complete geometry of CMS, no magnetic field; a cosmic filter is implemented to select muons triggered by the scintillators:

AnalysisExamples/SiStripDetectorPerformance/src/CosmicTIFFilter.cc

The configuration file is: AnalysisExamples/SiStripDetectorPerformance/test/cosmic_tif.cfg

The simulation can be run interactively:

cmsRun cosmic_tif.cfg

or by using ProdAgent to make large-scale and fully automatized productions (next slide).

Thanks to L. Fanò

Page 22: Dataflow/workflow with real data through Tiers


“Official” Cosmics simulation (2)

ProdAgent_v012:

create the workflow from the cfg file for GEN-SIM-DIGI:

python $PRODAGENT_ROOT/util/createProductionWorkflow.py --cfg /your/path/cosmic_tif.cfg --version CMSSW_1_2_0 --fake-hash

Warning: when using createPreProdWorkflow.py, the PoolOutputModule name in the cfg should be compliant with the conventions, so as to reflect the data tier the output file contains (i.e. GEN-SIM, GEN-SIM-DIGI, FEVT).

so download the modified cfg from /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF.cfg

the workflow can be found in:

/afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF-Workflow.xml

Submit jobs via standard prodagent scripts:

python $PRODAGENT_ROOT/test/python/IntTests/InjectTestLCG.py \
  --workflow=/your/path/COSMIC_TIF-Workflow.xml --run=30000001 --nevts=10000 --njobs=100

Page 23: Dataflow/workflow with real data through Tiers


Pros and cons

Advantages of the CMS computing approach:

• data officially published and processed with official tools, so results are reproducible
• access to a large number of distributed resources
• profit from the experience of the computing teams

Cons:

• initial effort to learn the official computing tools
• possible problems at remote sites, storage issues, instability of grid components (RB, CE), etc.
• contention between analysis jobs and production jobs
• policy/prioritization to be set at remote sites

Page 24: Dataflow/workflow with real data through Tiers


Conclusions

First real data registered in DBS/DLS are officially available to the CMS community

Data are moved between sites and published by using official tools

Reconstruction, re-reconstruction and skimming could be “standardized” using ProdAgent

Data analysis is performed by using CRAB

Cosmic simulation for detector communities can be officially addressed

Many thanks to the people of the TAC team (Fabrizio, Giuseppe, Domenico, Livio, Tommaso, Subir, …)