ATLAS DC2 Production …on Grid3


Page 1: ATLAS DC2 Production  …on Grid3


ATLAS DC2 Production …on Grid3

M. Mambelli, University of Chicago

for the US ATLAS DC2 team

September 28, 2004, CHEP04

Page 2: ATLAS DC2 Production  …on Grid3

ATLAS Data Challenges

Purpose
- Validate the LHC computing model
- Develop distributed production & analysis tools
- Provide large datasets for physics working groups

Schedule
- DC1 (2002-2003): full software chain
- DC2 (2004): automatic grid production system
- DC3 (2006): drive final deployments for startup

Page 3: ATLAS DC2 Production  …on Grid3

ATLAS DC2 Production

Phase I: Simulation (Jul-Sep 04)
- generation, simulation & pileup
- produced datasets stored on Tier1 centers, then CERN (Tier0)
- scale: ~10M events, 30 TB

Phase II: "Tier0 Test" @CERN (1/10 scale)
- produce ESD, AOD (reconstruction)
- stream to Tier1 centers

Phase III: Distributed analysis (Oct-Dec 04)
- access to event and non-event data from anywhere in the world, both in organized and chaotic ways (cf. D. Adams, #115)

Page 4: ATLAS DC2 Production  …on Grid3

ATLAS Production System Components
- Production database: ATLAS job definition and status
- Supervisor (all grids): Windmill (L. Goossens, #501), the job distribution and verification system
- Data management: Don Quijote (M. Branco, #142), an ATLAS layer above the grid replica systems
- Grid executors:
  - LCG: Lexor (D. Rebatto, #364)
  - NorduGrid: Dulcinea (O. Smirnova, #499)
  - Grid3: Capone (this talk)

Page 5: ATLAS DC2 Production  …on Grid3

ATLAS Global Architecture

[Diagram: the production database, prodDB (CERN), feeds the Windmill supervisors, one per grid flavor, which drive the executors: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3, this talk) and a legacy executor (LSF). Supervisor-executor messaging runs over Jabber or SOAP. Don Quijote "DQ" provides data management across the per-grid RLS catalogs, and AMI holds the metadata.]

Page 6: ATLAS DC2 Production  …on Grid3

Capone and Grid3 Requirements
- Interface to Grid3 (GriPhyN VDT based)
- Manage all steps in the job life cycle: prepare, submit, monitor, output & register (pictured in the sketch below)
- Manage workload and data placement
- Process messages from the Windmill supervisor
- Provide useful logging information to the user
- Communicate executor and job state information to Windmill (ProdDB)
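A minimal sketch of that life cycle as a state machine, in Python; the state names and the transition rule are illustrative, not Capone's actual internals:

    # Hypothetical sketch of the job life cycle Capone manages on Grid3.
    LIFECYCLE = ["received", "translated", "submitted", "running",
                 "checked", "staged_out", "registered", "finished"]

    class Job:
        def __init__(self, job_id):
            self.job_id = job_id
            self.state = LIFECYCLE[0]

        def advance(self):
            """Move the job to the next life-cycle state."""
            i = LIFECYCLE.index(self.state)
            if i + 1 < len(LIFECYCLE):
                self.state = LIFECYCLE[i + 1]
            return self.state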

Page 7: ATLAS DC2 Production  …on Grid3

Capone Execution Environment

GCE Server side
- ATLAS releases and transformations, installed with Pacman dynamically by grid-based jobs
- Execution sandbox: Chimera kickstart executable, transformation wrapper scripts
- MDS info providers (required site-specific attributes)

GCE Client side (web service)
- Capone
- Chimera/Pegasus, Condor-G (from VDT)
- Globus RLS and DQ clients

"GCE" = Grid Component Environment

Page 8: ATLAS DC2 Production  …on Grid3

Capone Architecture
- Message interface: Web Service, Jabber
- Translation layer: Windmill schema
- CPE (Process Engine)
- Processes: Grid3 (GCE interface), Stub (local shell testing), DonQuijote (future)

[Diagram: Capone's layered architecture. Messages from Windmill and ADA arrive over the Web Service or Jabber message protocols, pass through the translation layer, and reach the CPE, which dispatches to the Grid, Stub, or DonQuijote processes.]
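As a rough illustration of that message path (all class and method names here are invented, not Capone's API), a translated supervisor request is dispatched to one of the CPE's backend processes:

    # Hypothetical sketch: a translated Windmill request is dispatched by
    # the CPE to a backend process (Grid3, local Stub, DonQuijote later).
    class ProcessEngine:
        def __init__(self, backends):
            self.backends = backends            # name -> process object

        def handle(self, verb, payload, backend="grid"):
            # verb/payload come from the translation layer after it has
            # un-marshalled the supervisor message
            return getattr(self.backends[backend], verb)(payload)

    class StubBackend:
        """Local shell-testing process, per the slide."""
        def executejob(self, payload):
            print("would run locally:", payload)
            return "ok"

    cpe = ProcessEngine({"stub": StubBackend()})
    cpe.handle("executejob", {"jobspec": "..."}, backend="stub")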

Page 9: ATLAS DC2 Production  …on Grid3

Capone System Elements

GriPhyN Virtual Data System (VDS)
- Transformation: a workflow accepting input data (datasets) and parameters and producing output data (datasets); either simple (an executable) or complex (a DAG)
- Derivation: a transformation whose formal parameters have been bound to actual values
- Directed Acyclic Graph (DAG):
  - Abstract DAG (DAX), created by Chimera, with no reference to concrete elements in the Grid
  - Concrete DAG (cDAG), created by Pegasus, where the CE, SE and PFNs have been assigned
- Built on Globus, RLS, Condor
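A toy Python illustration of the abstract-to-concrete planning step (plain Python rather than actual Chimera/Pegasus VDL; all file names, sites and URLs are made up):

    # Toy version of the DAX -> cDAG step: an abstract node names only
    # logical files (LFNs); planning binds a CE/SE and physical files (PFNs).
    abstract_node = {
        "transformation": "atlas-simulation",
        "inputs": ["evgen.0042.pool.root"],          # logical file names
        "outputs": ["simul.0042.pool.root"],
    }

    def plan(node, ce, se, rls):
        """Bind the node to a concrete site, looking inputs up in RLS."""
        return {
            "site": ce,
            "executable": node["transformation"],
            "inputs": [rls[lfn] for lfn in node["inputs"]],    # LFN -> PFN
            "outputs": ["gsiftp://" + se + "/dc2/" + lfn
                        for lfn in node["outputs"]],
        }

    rls = {"evgen.0042.pool.root":
           "gsiftp://tier2.example.edu/dc2/evgen.0042.pool.root"}
    cdag_node = plan(abstract_node, "tier2.example.edu", "se.example.edu", rls)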

Page 10: ATLAS DC2 Production  …on Grid3

Capone Grid Interactions

[Diagram: Windmill hands jobs to Capone, which plans them with Chimera/Pegasus (backed by the VDC) and submits through Condor-G (schedd, GridManager) to a site's CE gatekeeper and worker nodes; outputs move via gsiftp to the SE. Capone also talks to RLS and DonQuijote for data, ProdDB for job records, and the MDS, GridCat and MonALISA monitoring services.]

Page 11: ATLAS DC2 Production  …on Grid3


A job in Capone (1, submission)
- Reception: job received from Windmill
- Translation: un-marshalling, ATLAS transformation
- DAX generation: Chimera generates the abstract DAG
- Input file retrieval from the RLS catalog: check RLS for the input LFNs (retrieving GUID and PFN)
- Scheduling: CE and SE are chosen
- Concrete DAG generation and submission: Pegasus creates the Condor submit files, and DAGMan is invoked to manage the remote steps
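Sequenced as code, the submission path might look like this sketch; every helper is a stub standing in for a real component (Chimera, RLS, Pegasus, Condor-G/DAGMan), and nothing here is Capone's actual code:

    # Hypothetical end-to-end submission chain, with stubbed components.
    def translate(msg):                   # un-marshal the Windmill message
        return {"trf": msg["trf"], "input_lfns": msg["inputs"]}

    def chimera_generate_dax(job):        # abstract DAG, no grid specifics
        return {"nodes": [job["trf"]]}

    def rls_lookup(lfn):                  # would query Globus RLS
        return ("guid-0000", "gsiftp://site.example.edu/dc2/" + lfn)

    def schedule(job):                    # choose compute and storage sites
        return ("ce.example.edu", "se.example.edu")

    def pegasus_plan(dax, ce, se):        # concrete DAG: Condor submit files
        return {"dag": dax, "ce": ce, "se": se}

    def dagman_submit(cdag):              # DAGMan manages the remote steps
        print("submitting cDAG to", cdag["ce"])

    def submit(windmill_msg):
        job = translate(windmill_msg)
        dax = chimera_generate_dax(job)
        for lfn in job["input_lfns"]:
            guid, pfn = rls_lookup(lfn)   # every input must resolve in RLS
        ce, se = schedule(job)
        dagman_submit(pegasus_plan(dax, ce, se))

    submit({"trf": "atlas-simulation", "inputs": ["evgen.0042.pool.root"]})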


Page 12: ATLAS DC2 Production  …on Grid3


A job in Capone (2, execution)
- Remote job running / status checking
- Stage-in of the input files, creation of the POOL FileCatalog
- Athena (ATLAS code) execution
- Remote execution check: verification of output files and exit codes; recovery of metadata (GUID, MD5sum, exe attributes)
- Stage out: transfer from the CE site to the destination SE
- Output registration: the output LFN/PFN and metadata are registered in RLS
- Finish: the job has completed successfully; Capone tells Windmill the job is ready for validation

Job status is sent to Windmill throughout execution; Windmill/DQ validate & register the output in ProdDB.
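The post-job check and registration steps, sketched in Python (function and host names are hypothetical; the real transfers go through gsiftp and the registrations through RLS/DQ):

    # Hypothetical sketch of Capone's post-job phase: verify exit code and
    # outputs, recover metadata (MD5), stage out, register LFN -> PFN.
    import hashlib, os

    def stage_out(path):                   # stands in for the gsiftp copy
        return "gsiftp://se.example.edu/dc2/" + os.path.basename(path)

    def check_and_register(exit_code, outputs, rls):
        if exit_code != 0:
            raise RuntimeError("Athena step failed; job is not validated")
        for lfn, local_path in outputs.items():
            if not os.path.exists(local_path):
                raise RuntimeError("missing output file: " + lfn)
            with open(local_path, "rb") as f:
                md5 = hashlib.md5(f.read()).hexdigest()
            rls[lfn] = {"pfn": stage_out(local_path), "md5": md5}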

Page 13: ATLAS DC2 Production  …on Grid3

Performance Summary (9/20/04)
- Several physics and calibration samples produced
- 56K job attempts at the Windmill level; 9K of these aborted before grid submission, mostly because RLS or the selected CE was down
- "Full" success rate: 66%
- Average success rate after submission: 70%; includes subsequent problems at the submit host and errors from development
- 60 CPU-years consumed since July
- 8 TB produced

Job status    Capone total
failed               18812
finished             37371
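The quoted "full" success rate follows directly from the totals above:

    # Reproducing the quoted ~66% from the Capone job-status totals.
    finished, failed = 37371, 18812
    total = finished + failed                  # 56183, the ~56K attempts
    print(round(100.0 * finished / total, 1))  # -> 66.5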

Page 14: ATLAS DC2 Production  …on Grid3

ATLAS DC2 CPU usage (G. Poulard, 9/21/04)
- LCG: 41%
- Grid3: 30%
- NorduGrid: 29%

Total ATLAS DC2: ~1470 kSI2k.months, ~100,000 jobs, ~7.94 million events, ~30 TB

Page 15: ATLAS DC2 Production  …on Grid3

Ramp up ATLAS DC2

[Chart: CPU-days delivered per day, ramping up from mid-July to Sep 10.]

Page 16: ATLAS DC2 Production  …on Grid3

Job Distribution on Grid3 (J. Shank, 9/21/04)
- UTA_dpcc: 17%
- BNL_ATLAS: 17%
- UC_ATLAS_Tier2: 14%
- BU_ATLAS_Tier2: 13%
- IU_ATLAS_Tier2: 10%
- UCSanDiego_PG: 5%
- UM_ATLAS: 4%
- UBuffalo_CCR: 4%
- PDSF: 4%
- FNAL_CMS: 4%
- CalTech_PG: 4%
- Others: 4%

Page 17: ATLAS DC2 Production  …on Grid3

Site Statistics (9/20/04)

 #  CE Gatekeeper         Total Jobs  Finished  Failed  Success Rate (%)
 1  UTA_dpcc                    8817      6703    2114             76.02
 2  UC_ATLAS_Tier2              6132      4980    1152             81.21
 3  BU_ATLAS_Tier2              6336      4890    1446             77.18
 4  IU_ATLAS_Tier2              4836      3625    1211             74.96
 5  BNL_ATLAS_BAK               4579      3591     988             78.42
 6  BNL_ATLAS                   3116      2548     568             81.77
 7  UM_ATLAS                    3583      1998    1585             55.76
 8  UCSanDiego_PG               2097      1712     385             81.64
 9  UBuffalo_CCR                1925      1594     331             82.81
10  FNAL_CMS                    2649      1456    1193             54.96
11  PDSF                        2328      1430     898             61.43
12  CalTech_PG                  1834      1350     484             73.61
13  SMU_Physics_Cluster          660       438     222             66.36
14  Rice_Grid3                   493       363     130             73.63
15  UWMadison                    516       258     258             50.00
16  FNAL_CMS2                    343       228     115             66.47
17  UFlorida_PG                  394       182     212             46.19

Average success rate by site: 70%
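The 70% figure is the unweighted mean of the per-site rates in the table:

    # Unweighted average of the 17 per-site success rates above.
    rates = [76.02, 81.21, 77.18, 74.96, 78.42, 81.77, 55.76, 81.64, 82.81,
             54.96, 61.43, 73.61, 66.36, 73.63, 50.00, 66.47, 46.19]
    print(round(sum(rates) / len(rates), 1))  # -> 69.6, i.e. ~70%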

Page 18: ATLAS DC2 Production  …on Grid3

Capone & Grid3 Failure Statistics (9/20/04)

Total jobs (validated): 37713
Jobs failed: 19303
- Submission: 472
- Execution: 392
- Post-job check: 1147
- Stage out: 8037
- RLS registration: 989
- Capone host interruptions: 2725
- Capone succeeded, Windmill failed: 57
- Other: 5139

Page 19: ATLAS DC2 Production  …on Grid3

Production Lessons
- Single points of failure:
  - the production database
  - the RLS, DQ, VDC and Jabber servers
  - one local network domain (motivates a distributed RLS)
  - system expertise (people)
- Fragmented production software
- Fragmented operations (defining and fixing jobs in the production database)
- Client (Capone submit) hosts:
  - load and memory requirements for job management
  - load caused by job state checking (interaction with Condor-G)
  - many processes
  - no client-host persistency; a local database for job recovery is the next phase of development (sketched below)
- DOEGrids certificate or certificate revocation list expiration
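The planned local database for job recovery could be as small as this sqlite sketch; it is purely illustrative, not the Capone design:

    # Illustrative only: persist job state locally so a submit-host restart
    # can recover in-flight jobs (the "next phase of development" above).
    import sqlite3

    con = sqlite3.connect("capone_jobs.db")
    con.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, state TEXT)")

    def save_state(job_id, state):
        con.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?)", (job_id, state))
        con.commit()

    def recover():
        """Reload every job that was in flight before the interruption."""
        return dict(con.execute("SELECT id, state FROM jobs"))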

Page 20: ATLAS DC2 Production  …on Grid3

Production Lessons (II)
- Site infrastructure problems: hardware problems; software distribution and transformation upgrades; file systems (NFS the major culprit), with various solutions by site administrators
- Errors in stage-out caused by poor network connections and gatekeeper load; fixed by adding I/O throttling and checking the number of TCP connections (see the sketch below)
- Lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we've not had much competition
- Load on gatekeepers: improved by moving md5sum off the gatekeeper
- Post-job processing: remote execution (mostly in the pre/post job) is error prone, and the reason for a failure is difficult to understand
- No automated tools for validation
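The I/O throttling fix amounts to bounding concurrent transfers; a generic Python sketch (not the actual DC2 tooling):

    # Generic illustration: cap simultaneous stage-out transfers so the
    # gatekeeper is not flooded with TCP connections.
    from concurrent.futures import ThreadPoolExecutor

    MAX_TRANSFERS = 4                      # illustrative connection cap

    def stage_out_all(files, copy_fn):
        """Copy files with at most MAX_TRANSFERS transfers in flight."""
        with ThreadPoolExecutor(max_workers=MAX_TRANSFERS) as pool:
            return list(pool.map(copy_fn, files))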

Page 21: ATLAS DC2 Production  …on Grid3

Operations Lessons
- Grid3 iGOC and the US Tier1 developed an operations response model
- Tier1 center: core services; an "on-call" person always available; a response protocol developed
- iGOC: coordinates problem resolution for the Tier1 "off hours"; handles trouble for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
- Shift schedule (8-midnight since July 23): 7 trained DC2 submitters keep the queues saturated, report site and system problems, and clean working directories
- Extensive use of email lists; partial use of alternatives like Web portals and IM

Page 22: ATLAS DC2 Production  …on Grid3

Conclusions
- A completely new system
- Grid3's simplicity requires more functionality and state management on the executor submit host: all functions of job planning, job state tracking and data management (stage-in, stage-out) are handled by Capone rather than by grid services, so the client is exposed to all manner of grid failures; good for experience, but a client-heavy system
- Major areas for upgrading the Capone system: job state management and controls, state persistency; generic transformation handling for user-level production

Page 23: ATLAS DC2 Production  …on Grid3

Authors

GIERALTOWSKI, Gerald (Argonne National Laboratory)
MAY, Edward (Argonne National Laboratory)
VANIACHINE, Alexandre (Argonne National Laboratory)
SHANK, Jim (Boston University)
YOUSSEF, Saul (Boston University)
BAKER, Richard (Brookhaven National Laboratory)
DENG, Wensheng (Brookhaven National Laboratory)
NEVSKI, Pavel (Brookhaven National Laboratory)
MAMBELLI, Marco (University of Chicago)
GARDNER, Robert (University of Chicago)
SMIRNOV, Yuri (University of Chicago)
ZHAO, Xin (University of Chicago)
LUEHRING, Frederick (Indiana University)
SEVERINI, Horst (Oklahoma University)
DE, Kaushik (University of Texas at Arlington)
MCGUIGAN, Patrick (University of Texas at Arlington)
OZTURK, Nurcan (University of Texas at Arlington)
SOSEBEE, Mark (University of Texas at Arlington)

Page 24: ATLAS DC2 Production  …on Grid3

Acknowledgements
- Windmill team (Kaushik De)
- Don Quijote team (Miguel Branco)
- ATLAS production group, Luc Goossens, CERN IT (prodDB)
- ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
- US ATLAS testbed sites and Grid3 site administrators
- iGOC operations group
- ATLAS Database group (ProdDB Capone-view displays)
- Physics Validation group: UC Berkeley, Brookhaven Lab

More info
- US ATLAS Grid: http://www.usatlas.bnl.gov/computing/grid/
- DC2 shift procedures: http://grid.uchicago.edu/dc2shift
- US ATLAS Grid Tools & Services: http://grid.uchicago.edu/gts/