1

ATLAS Grid Computing and Data Challenges

Nurcan Ozturk

University of Texas at Arlington

Recent Progresses in High Energy Physics, Bolu, Turkey, June 23-25, 2004

2

Outline

• Introduction
• ATLAS Experiment
• ATLAS Computing System
• ATLAS Computing Timeline
• ATLAS Data Challenges
• DC2 Event Samples
• Data Production Scenario
• ATLAS New Production System
• Grid Flavors in Production System
• Windmill-Supervisor
• An Example of XML Messages
• Windmill-Capone Screenshots
• Grid Tools
• Conclusions

3

Introduction

• Why Grid Computing?
  • Scientific research becomes more and more complex, and international teams of scientists grow larger and larger
  • Grid technologies enable scientists to use remote computers and data storage systems to retrieve and analyze data from around the world
  • Grid computing power will be a key to the success of the LHC experiments
  • Grid computing is a challenge not only for particle physics experiments but also for biologists, astrophysicists, and gravitational-wave researchers

4

ATLAS Experiment

• The ATLAS (A Toroidal LHC ApparatuS) experiment at the Large Hadron Collider at CERN will start taking data in 2007
• proton-proton collisions at a 14 TeV center-of-mass energy
• ATLAS will study:
  • SM Higgs boson
  • SUSY states
  • SM QCD, EW, HQ physics
  • New physics?
• Total amount of "raw" data: 1 PB/year
• ATLAS needs the GRID to reconstruct and analyze this data: a complex "Worldwide Computing Model" and "Event Data Model"
  • Raw data at CERN
  • Reconstructed data "distributed"
  • All members of the collaboration must have access to "ALL" public copies of the data
• ~2000 collaborators, ~150 institutes, 34 countries

5

ATLAS Computing System (R. Jones)

[Diagram of the tiered computing model: detector data (~PB/s) passes through the Event Builder (10 GB/s) to the Event Filter (~159 kSI2k) and on to the Tier 0 centre at CERN (~5 MSI2k) at 100-1000 MB/s. Tier 0 exports ~300 MB/s per Tier 1 per experiment over 622 Mb/s links to the regional centres (RAL for the UK, plus US, French and Italian centres); each Tier 1 has ~7.7 MSI2k, ~2 PB/year of storage (the diagram also quotes ~9 PB/year per Tier 1) and does no simulation. Tier 2 centres (~200 kSI2k, ~200 TB/year each, e.g. a Northern Tier or a Lancaster/Liverpool/Manchester/Sheffield cluster at ~0.25 TIPS) do the bulk of simulation, hold the full AOD, TAG and relevant Physics Group summary data, and each have ~25 physicists working on one or more channels; some data flows to institutes for calibration and monitoring, and calibrations flow back. Desktops and workstations sit at the end of the chain; a 2004 PC corresponds to ~1 kSpecInt2k. A 450 Mb/s link also appears on the diagram.]
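To make the kSI2k figures more concrete, the slide's own benchmark of one 2004-era PC at ~1 kSpecInt2k allows a rough conversion; the sketch below is a back-of-the-envelope illustration, not an official sizing:

# Rough sizing sketch based on the capacities quoted in the diagram above.
# Assumes the slide's benchmark of ~1 kSI2k per 2004-era PC; numbers are indicative only.
KSI2K_PER_PC = 1.0   # "PC (2004) = ~1 kSpecInt2k"

capacities_ksi2k = {
    "Event Filter": 159,     # ~159 kSI2k
    "Tier 0 (CERN)": 5000,   # ~5 MSI2k
    "Tier 1 (each)": 7700,   # ~7.7 MSI2k
    "Tier 2 (each)": 200,    # ~200 kSI2k
}

for site, ksi2k in capacities_ksi2k.items():
    print(f"{site}: ~{ksi2k} kSI2k, roughly {ksi2k / KSI2K_PER_PC:,.0f} PCs (2004)")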

6

ATLAS Computing Timeline (D. Barberis)

• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
• ATLAS complete Geant4 validation (done)
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production
• DC2 Phase 2: intensive reconstruction (the real challenge!)
• Combined test beams (barrel wedge)
• Computing Model paper
• Computing Memorandum of Understanding
• ATLAS Computing TDR and LCG TDR
• DC3: produce data for PRR and test LCG-n
• Physics Readiness Report
• Start commissioning run
• GO!

[The slide places these milestones along a 2003-2007 timeline; the "NOW" marker sits at mid-2004, around DC2 Phase 1.]

7

ATLAS Data Challenges

• Data Challenges: generate and analyze simulated data of increasing scale and complexity, using the Grid as much as possible
• Goals:
  • Validate the Computing Model, the software and the data model, and ensure the correctness of the technical choices to be made
  • Provide simulated data to design and optimize the detector
  • Experience gained in these Data Challenges will be used to formulate the ATLAS Computing Technical Design Report
• Status:
  • DC0 (December 2001 - June 2002) and DC1 (July 2002 - March 2003) completed
  • DC2 ongoing
  • DC3, DC4 planned (one per year)

8

DC2 Event Samples (G. Poulard)

Channel  Process             Decay          Cuts        Events (10^6)
A0       Top                                             1
A0a      Top (mis-aligned)
A1       Z                   e-e            no Pt cut    1
A2                           mu-mu                       1
A3                           tau-tau                     1
A4       W                   leptons                     1
A5       Z + jet                                         0.5
A6       dijets                             Pt > 600     0.25
A7       W + 4 jets          W -> leptons                0.25
A8       QCD                                             0.5
A9       SUSY                                            0.1
A10      Higgs               tau-tau                     0.1
A11      DC1 SUSY                                        0.05
B1       Jets                               Pt > 180     1
B2       Gamma + jet                        Pt > 20      0.2
B3       bb -> B             mu6-mu6                     0.25
B4       Jets                               Pt > 17      1
B5       Gamma_jet                                       0.05
H1       Higgs (130)         4 leptons                   0.04
H2       Higgs (180)         4 leptons                   0.04
H3       Higgs (120)         gamma-gamma                 0.015
H4       Higgs (170)         W-W                         0.015
H5       Higgs (170)                                     0.015
H6       Higgs (115)         tau-tau                     0.015
H7       Higgs (115)         tau-tau                     0.015
H8       MSSM Higgs          b-b-A(300)                  0.015
H9       MSSM Higgs          b-b-A(115)                  0.015
M1       Minimum bias

Total: 9.435 (the original table also lists events before filter)
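As a quick cross-check, the per-sample sizes in the table do add up to the quoted total of 9.435 million events; this throwaway script simply sums them (samples without a listed size, A0a and M1, are omitted):

# Cross-check of the "Total 9.435" figure from the DC2 event-sample table above.
samples_millions = {
    "A0": 1, "A1": 1, "A2": 1, "A3": 1, "A4": 1,
    "A5": 0.5, "A6": 0.25, "A7": 0.25, "A8": 0.5,
    "A9": 0.1, "A10": 0.1, "A11": 0.05,
    "B1": 1, "B2": 0.2, "B3": 0.25, "B4": 1, "B5": 0.05,
    "H1": 0.04, "H2": 0.04, "H3": 0.015, "H4": 0.015,
    "H5": 0.015, "H6": 0.015, "H7": 0.015, "H8": 0.015, "H9": 0.015,
}
print(sum(samples_millions.values()))   # -> 9.435 million events (up to float rounding)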

9

Data Production Scenario (G. Poulard)

Step                Input                                    Output
Event generation    none                                     Generated events
G4 simulation       Generated events                         Hits + MCTruth
Detector response   Hits + MCTruth (generated events)        Digits + MCTruth
Pile-up             Hits "signal" + MCTruth, Hits "min.b"    Digits + MCTruth
Byte-stream         "pile-up" data (RDO)                     BS
Events mixing       RDO or BS                                RDO (or BS)
Reconstruction      RDO or BS                                ESD
AOD production      ESD                                      AOD

Figures quoted across the table: output ranges from one file to several tens of files per job, kept below 2 GB each; streaming is still an open question ("still some work"); no MCTruth is kept when the output is byte-stream (BS); rates of ~2000 jobs/day, input of ~10 GB/job (~10 TB/day, ~150 MB/s), output of ~500 GB/day (~5 MB/s), with job duration limited to 24 h (see the conversion sketch below).
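The per-day volumes translate directly into sustained transfer rates; a rough conversion assuming 86400 seconds per day (the ~150 MB/s quoted on the slide presumably includes headroom over the raw average):

# Convert the daily data volumes quoted on the slide into average sustained rates.
SECONDS_PER_DAY = 24 * 60 * 60   # 86400

def mb_per_s(gb_per_day):
    """Average sustained rate in MB/s for a given daily volume in GB/day."""
    return gb_per_day * 1000.0 / SECONDS_PER_DAY

print(f"~10 TB/day  -> ~{mb_per_s(10000):.0f} MB/s average (slide quotes ~150 MB/s)")
print(f"~500 GB/day -> ~{mb_per_s(500):.1f} MB/s average (slide quotes ~5 MB/s)")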


10

ATLAS New Production System

[Diagram: a common supervisor (Windmill) sits on top of the production database (prodDB) and the data management system (dms, Don Quijote) with its RLS replica catalogs and the AMI database. One executor per facility flavor runs beneath the supervisor: Lexor for LCG, Dulcinea for NorduGrid (NG), Capone for Grid3, plus an executor for LSF batch. Supervisor-executor messages are carried over Jabber or SOAP.]

http://www.nordugrid.org/applications/prodsys/
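A minimal sketch of this wiring, just to illustrate how one supervisor can feed several Grid flavors through a common interface; the executor names come from the diagram, but the dispatch function and its calls are hypothetical, not the actual Windmill code:

# One executor per facility flavor, per the diagram above.
EXECUTORS = {
    "LCG": "Lexor",
    "NorduGrid": "Dulcinea",
    "Grid3": "Capone",
    "LSF": "LSF executor",
}

def dispatch(flavor, prod_db, messenger):
    """Hypothetical supervisor pass: hand jobs from prodDB to the executor for one flavor."""
    executor = EXECUTORS[flavor]
    jobs = prod_db.fetch_pending()      # job definitions live in prodDB (illustrative call)
    messenger.send(executor, jobs)      # carried as XML messages over Jabber or SOAP
    # File movement and registration go through Don Quijote (dms) and the RLS catalogs;
    # dataset metadata is kept in AMI.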

11

Grid Flavors in Production System

• LCG: LHC Computing Grid, > 40 sites
• Grid3: USA Grid, 27 sites
• NorduGrid: Denmark, Sweden, Norway, Finland, Germany, Estonia, Slovenia, Slovakia, Australia, Switzerland, 35 sites

Regional Centres Connected to the LCG Grid (as of 07-May-04):

Austria: UIBK
Canada: TRIUMF (Vancouver), Univ. Montreal, Univ. Alberta
Czech Republic: CESNET (Prague), University of Prague
France: IN2P3 (Lyon)**
Germany: FZK (Karlsruhe), DESY, University of Aachen, University of Wuppertal
Greece: GRNET (Athens)
Holland: NIKHEF (Amsterdam)
Hungary: KFKI (Budapest)
Israel: Tel Aviv University**, Weizmann Institute
Italy: CNAF (Bologna), INFN Torino, INFN Milano, INFN Roma, INFN Legnaro
Japan: ICEPP (Tokyo)**
Poland: Cyfronet (Krakow)
Portugal: LIP (Lisbon)
Russia: SINP (Moscow)
Spain: PIC (Barcelona), IFIC (Valencia), IFCA (Santander), University of Barcelona, Uni. Santiago de Compostela, CIEMAT (Madrid), UAM (Madrid)
Switzerland: CERN, CSCS (Manno)**
Taiwan: Academia Sinica (Taipei), NCU (Taipei)
UK: RAL, Cavendish (Cambridge), Imperial (London), Lancaster University, Manchester University, Sheffield University, QMUL (London)
USA: FNAL, BNL**

** not yet in LCG-2

Centres in the process of being connected: IHEP (Beijing, China), TIFR (Mumbai, India), NCP (Islamabad, Pakistan). Hewlett-Packard will provide "Tier 2-like" services for LCG, initially in Puerto Rico.

L. Perini

12

Windmill-Supervisor

• Supervisor development team at UTA: Kaushik De, Nurcan Ozturk, Mark Sosebee
• Supervisor-executor communication is via the Jabber protocol, developed for instant messaging
• XML (Extensible Markup Language) messages are passed between supervisor and executor
• Supervisor-executor interaction (see the sketch below):
  • numJobsWanted
  • executeJobs
  • getExecutorData
  • getStatus
  • fixJob
  • killJob
• Final verification of jobs is done by the supervisor

[Diagram: the supervisor connects the production manager and prodDB with the executors, the data management system, and its replica catalog.]

Windmill webpage: http://www-hep.uta.edu
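The interaction list above can be read as an interface that every executor implements; the sketch below is a hypothetical Python rendering of it (the real interface is XML over Jabber, as shown on the next slide), included only to make the division of labour explicit:

# Illustrative sketch of the supervisor-executor interaction; class and method
# bodies are placeholders, not the actual Windmill or executor code.
class ExecutorInterface:
    """The six calls a Windmill supervisor makes to any executor."""
    def numJobsWanted(self): ...        # negotiate how many jobs the executor can take
    def executeJobs(self, jobs): ...    # hand over job definitions to run on its Grid
    def getExecutorData(self): ...      # executor-specific information
    def getStatus(self, jobs): ...      # poll the state of submitted jobs
    def fixJob(self, job): ...          # ask the executor to repair a problematic job
    def killJob(self, job): ...         # abort a job

def finalize(supervisor, executor, jobs):
    """Final verification of finished jobs is done by the supervisor (hypothetical helper)."""
    for job in executor.getStatus(jobs):    # assumes a concrete executor implementation
        supervisor.verify(job)              # check outputs before marking the job done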

13

An Example of XML Messages

<?xml version="1.0" ?><windmill type="request” user="supervisor" version="0.6"> <numJobsWanted> <minimumResources> <transUses>JobTransforms-8.0.1.2 Atlas-8.0.1 – software version</transUses> <cpuConsumption> <count>100000 - minimum CPU required for a production job</count> <unit>specint2000seconds - unit of CPU usage</unit> </cpuConsumption> <diskConsumption> <count>500 - maximum output file size</count> <unit>MB</unit> </diskConsumption> <ipConnectivity>no - IP connection required from CE </ipConnectivity> <minimumRAM> <count>256 - minimum physical memory requirement</count> <unit>MB</unit> </minimumRAM> </minimumResources> </numJobsWanted></windmilll>

<?xml version="1.0" ?><windmill type="respond” user=“executor" version="0.8"> <numJobsWanted> <availableResources> <jobCount>5</jobCount> <cpuMax> <count>100000</count> <unit>specint2000</unit> </cpuMax> </availableResources> </numJobsWanted></windmill>

numJobWanted : supervisor-executor negotiation of number of jobs to process

supervisor’s request

executor’s respond
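An executor can pull the resource constraints out of such a request with a few lines of standard XML parsing; the sketch below uses Python's xml.etree on a trimmed copy of the request above (the annotations inside the count fields are dropped so the values parse as numbers), and is not the actual Windmill code:

# Parse a numJobsWanted request like the one above and build a minimal reply.
import xml.etree.ElementTree as ET

request_xml = """<?xml version="1.0" ?>
<windmill type="request" user="supervisor" version="0.6">
  <numJobsWanted>
    <minimumResources>
      <cpuConsumption><count>100000</count><unit>specint2000seconds</unit></cpuConsumption>
      <minimumRAM><count>256</count><unit>MB</unit></minimumRAM>
    </minimumResources>
  </numJobsWanted>
</windmill>"""

root = ET.fromstring(request_xml)
min_cpu = int(root.findtext(".//cpuConsumption/count"))
min_ram = int(root.findtext(".//minimumRAM/count"))
print(f"supervisor asks for >= {min_cpu} SI2k-seconds of CPU and >= {min_ram} MB RAM per job")

# Build a reply advertising that this executor can take 5 jobs, as in the response above.
reply = ET.Element("windmill", {"type": "respond", "user": "executor", "version": "0.8"})
wanted = ET.SubElement(reply, "numJobsWanted")
avail = ET.SubElement(wanted, "availableResources")
ET.SubElement(avail, "jobCount").text = "5"
print(ET.tostring(reply, encoding="unicode"))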

14

Windmill-Capone Screenshots

15

Grid Tools

An example: Grid3, the USA Grid

• Joint project with USATLAS, USCMS, iVDGL, PPDG, GriPhyN
• Components:
  • VDT based
  • Classic SE (gridftp)
  • Monitoring: Grid site Catalog, Ganglia, MonALISA
  • Two RLS servers and a VOMS server for ATLAS
• Installation:
  • pacman -get iVDGL:Grid3
  • Takes ~4 hours to bring up a site from scratch

What tools are needed for a Grid site? VDT (Virtual Data Toolkit) version 1.1.14 gives:
• Virtual Data System 1.2.3
• Class Ads 0.9.5
• Condor 6.6.1
• EDG CRL Update 1.2.5
• EDG Make Gridmap 2.1.0
• Fault Tolerant Shell (ftsh) 2.0.0
• Globus 2.4.3 plus patches
• GLUE Information Providers
• GLUE Schema 1.1, extended version 1
• GPT 3.1
• GSI-Enabled OpenSSH 3.0
• Java SDK 1.4.1
• KX509 2031111
• MonALISA 0.95
• MyProxy 1.11
• Netlogger 2.2
• PyGlobus 1.0
• PyGlobus URL Copy 1.1.2.11
• RLS 2.1.4
• UberFTP 1.3

16

Conclusions

• The Grid paradigm works: opportunistic use of existing resources; run anywhere, from anywhere, by anyone...
• Grid computing is a challenge and needs worldwide collaboration
• Data production using the Grid is possible and has been successful so far
• Data Challenges are the way to test the ATLAS computing model before the real experiment starts
• Data Challenges also provide data for the physics groups
• The Data Challenges are a learning and improving experience