Production Tools in ATLAS. RWL Jones, GridPP EB, 24th June 2003.


RWL Jones, Lancaster University

Grid in ATLAS

– ATLAS is a global collaboration, so the various Grid flavours are important
– Both US ATLAS and NorduGrid provide their own production tools
– Grid flavours shown: US-ATLAS, EDG Testbed Prod, NorduGrid


– All the services are either taken from Globus, or written using Globus libraries and API
  • Should be fairly compatible with Globus-based solutions
– Information system knows everything
  • Substantially re-worked and patched Globus MDS
  • Distributed and multi-rooted
  • Allows for a mesh topology
– The server (“Grid manager”) on each gatekeeper does most of the job
  • No need for a centralized broker
  • Pre- and post-stages files
  • Interacts with PBS
  • Keeps track of job status
  • Cleans up the mess
  • Sends mails to users
– The client (“User Interface”) does the Grid job submission, monitoring, termination, retrieval, cleaning etc.
  • Interprets user’s job task
  • Gets the testbed status from the information system
  • Forwards the task to the best Grid Manager
  • Does some file uploading, if requested
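The client-side steps above (read the testbed status, then forward the task to the best Grid Manager) can be sketched roughly as follows. This is only an illustration: the `ClusterStatus` fields, the function name and the ranking rule (fewest queued jobs, then most free CPUs) are assumptions made for the example, not the actual middleware logic.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ClusterStatus:
    """Hypothetical snapshot of one cluster, as read from the information system."""
    name: str
    free_cpus: int
    queued_jobs: int
    runtime_envs: FrozenSet[str]


def pick_grid_manager(clusters: Iterable[ClusterStatus],
                      required_env: str,
                      cpus_needed: int = 1) -> Optional[ClusterStatus]:
    """Return the cluster the job would be forwarded to, or None if none fits."""
    # Keep only clusters that advertise the runtime environment and have room.
    candidates = [c for c in clusters
                  if required_env in c.runtime_envs and c.free_cpus >= cpus_needed]
    if not candidates:
        return None
    # Assumed ranking: shortest queue first, ties broken by more free CPUs.
    return min(candidates, key=lambda c: (c.queued_jobs, -c.free_cpus))


testbed = [
    ClusterStatus("site-a", free_cpus=4, queued_jobs=10, runtime_envs=frozenset({"ATLAS-RTE"})),
    ClusterStatus("site-b", free_cpus=2, queued_jobs=1, runtime_envs=frozenset({"ATLAS-RTE"})),
    ClusterStatus("site-c", free_cpus=8, queued_jobs=0, runtime_envs=frozenset({"OTHER-RTE"})),
]
best = pick_grid_manager(testbed, required_env="ATLAS-RTE")
```

Because the decision is taken in the client from published status, no central broker is involved, which matches the design point made on the slide.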


Features and problems

– Features:
  • Relatively simple to join, expands rapidly
  • Installation is done on a single machine
  • Hides complexity of the distributed resources
  • Very convenient Replica Catalog implementation
  • Highly stable and reliable
  • Non-intrusive middleware
  • Accepts EDG certificates
  • Almost any runtime environment can be set up
– Problems:
  • Standard (a la Globus2) authentication and authorization mechanisms
  • Simplified (not more than in Globus2) data management system
  • No persistent book-keeping service
  • Simplified recovery mechanisms (as much as LRMS provides)
  • Lacks big storage facilities
  • Only command-line interface
  • No standardized procedure for runtime environment installation and validation


US GRAT Software

– GRid Applications Toolkit
– Used for U.S. Data Challenge production
– Based on Globus, Magda, AMI & MySQL
– Shell & Python scripts, modular design
– Rapid development platform
  • Essentially scripts
  • Quickly develop packages as needed by DC: single particle production; Higgs & SUSY production; pileup production & data management; reconstruction
– Test grid middleware, test grid performance
– Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA… (in progress)


GRAT Execution Model

1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring

[Diagram: the numbered steps are drawn as arrows between the following boxes: DC1; Prod. (UTA); Remote Gatekeeper; Replica (local); MAGDA (BNL); Param (CERN); Batch Execution; scratch.]
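Since GRAT is described as a modular set of scripts, the ten steps above can be pictured as a list of replaceable handlers run in order. The sketch below is a simplification under stated assumptions: the step identifiers, the `run_steps` function and the strictly sequential order are illustrative only (the diagram suggests monitoring accompanies several steps), and none of this is taken from the GRAT code.

```python
from typing import Callable, Dict, List

# Step names follow the slide; the identifiers themselves are hypothetical.
GRAT_STEPS: List[str] = [
    "resource_discovery", "partition_selection", "job_creation", "pre_stage",
    "batch_submission", "job_parameterization", "simulation", "post_stage",
    "cataloging", "monitoring",
]

Handler = Callable[[dict], None]


def run_steps(handlers: Dict[str, Handler], context: dict) -> List[str]:
    """Run every step in order, passing a shared context; return the steps completed."""
    completed: List[str] = []
    for step in GRAT_STEPS:
        handler = handlers.get(step)
        if handler is None:
            raise KeyError(f"no handler registered for step {step!r}")
        handler(context)  # an exception here stops the chain at the failing step
        completed.append(step)
    return completed


def make_logging_handler(step: str) -> Handler:
    """Build a stand-in handler that only records that its step ran."""
    def handler(context: dict) -> None:
        context.setdefault("log", []).append(step)
    return handler


handlers = {step: make_logging_handler(step) for step in GRAT_STEPS}
context: dict = {}
completed = run_steps(handlers, context)
```

Swapping one module for another (for example a different submission back end) then amounts to replacing a single entry in `handlers`.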


US Middleware Evolution

[Diagram legend; the middleware components each label applies to are not recoverable from the transcript]
– Used in current production software (GRAT & Grappa)
– Tested successfully (not yet used for large scale production)
– Under development and testing
– Tested for simulation (may be used for large scale reconstruction)


What is the Atlas Commander?

– graphical interactive tool to support production manager
  • define jobs in large quantities
  • submit and monitor progress
  • scan log files for (un)known errors
  • update bookkeeping databases (AMI, Magda)
  • clean up in case of failures
– test bed for GANGA MC production components

AtCom has its own web site

– http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/
– contains user guide, developer’s guide, documentation, downloads, relevant contact e-mails, etc.


Architecture: application + plug-ins

[Diagram labels: AtCom core; AMIMgt and MagdaMgt (bookkeeping DBs: AMI, Magda); plug-ins LSFComputingSystem, EDGComputingSystem, NGComputingSystem, PBSComputingSystem, ... (clusters)]

Two main functions of AtCom

– definition of jobs
– job submission/monitoring


Architecture (continued)

– plug-in implements abstract ‘cluster’ interface for specific clusters
  • e.g. LSF
– a plug-in is a Java class + configuration parameters
  • e.g. LSF@TIMBUKTU
– the AtCom configuration file defines all existing plug-ins and allows each to have its own configuration section
  • they are loaded at run-time
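The plug-in pattern described above (an abstract cluster interface, one class per batch system, one configuration section per plug-in instance, loading at run-time) can be sketched as below. AtCom plug-ins are Java classes; Python is used here only to show the pattern. The configuration layout, the `register` decorator, the `queue` parameter and the `submit` method are all assumptions for illustration and do not reflect AtCom's real configuration file or interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class ComputingSystem(ABC):
    """Abstract 'cluster' interface that every plug-in implements (hypothetical)."""

    def __init__(self, **params: str) -> None:
        self.params = params  # values from this plug-in's configuration section

    @abstractmethod
    def submit(self, job_script: str) -> str:
        """Return the command line that would submit the job."""


PLUGIN_CLASSES: Dict[str, Type[ComputingSystem]] = {}


def register(cls: Type[ComputingSystem]) -> Type[ComputingSystem]:
    """Make a plug-in class discoverable by name at run-time."""
    PLUGIN_CLASSES[cls.__name__] = cls
    return cls


@register
class LSFComputingSystem(ComputingSystem):
    def submit(self, job_script: str) -> str:
        queue = self.params.get("queue", "default")
        return f"bsub -q {queue} {job_script}"


# Hypothetical configuration: one section per plug-in instance.
CONFIG = {"LSF@TIMBUKTU": {"class": "LSFComputingSystem", "queue": "atlas"}}


def load_plugins(config: Dict[str, Dict[str, str]]) -> Dict[str, ComputingSystem]:
    """Instantiate each configured plug-in from its own section."""
    plugins: Dict[str, ComputingSystem] = {}
    for name, section in config.items():
        params = dict(section)
        cls = PLUGIN_CLASSES[params.pop("class")]
        plugins[name] = cls(**params)
    return plugins


plugins = load_plugins(CONFIG)
command = plugins["LSF@TIMBUKTU"].submit("job.sh")
```

The core only ever sees `ComputingSystem`, so supporting another batch system means adding one class and one configuration section.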


Available plug-ins

– LSF
  • well understood and supported
– NorduGrid
  • development suspended
– PBS
  • developed by Alvin Tan
– EDG
  • working, but no EDG based clusters used in production
– BQS
  • developed by Jerome Fulachier


Bookkeeping databases

– 5 logical database domains, two physical databases

[Diagram labels: physics meta-data; permanent production log; recipe catalog; transient production log; replica catalog]

– AMI (Atlas Meta-data Interface)
  • MySQL DB hosted at Grenoble
– Magda (Manager for grid-based data)
  • MySQL DB hosted at BNL


When a job moves from RUNNING to DONE, post processing commences

– resolve validation script logical name into physical name and apply it to stdout/stderr in temp locations
  • returns 1=OK, 2=Undecided or 3=Failed
– if OK
  • register output files with Magda replica catalog
  • resolve extract script and apply it to stdout
  • copy/move logfiles to final destination
  • set status of partition to Validated
– if Failed
  • delete output files
– if Undecided
  • mark job as such
  • production manager can look at output of validation script or at the logfiles themselves and then force a decision as OK or Failed
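The post-processing branches above map naturally onto a small dispatch on the validation script's return code. The sketch below is an assumption-laden illustration: the action names, the `post_process` and `force_decision` functions, and the "Failed" and "Undecided" status labels are invented for the example; only the return codes 1, 2, 3 and the "Validated" status come from the slide.

```python
from enum import IntEnum
from typing import List, Tuple


class Validation(IntEnum):
    """Return codes of the validation script, as listed on the slide."""
    OK = 1
    UNDECIDED = 2
    FAILED = 3


def post_process(code: int) -> Tuple[str, List[str]]:
    """Map a validation code to (partition status, actions to perform in order)."""
    result = Validation(code)  # raises ValueError for any other code
    if result is Validation.OK:
        return "Validated", [
            "register_output_files_with_magda",
            "apply_extract_script_to_stdout",
            "move_logfiles_to_final_destination",
        ]
    if result is Validation.FAILED:
        return "Failed", ["delete_output_files"]
    # Undecided: nothing is done until the production manager decides.
    return "Undecided", ["await_production_manager_decision"]


def force_decision(decision: Validation) -> Tuple[str, List[str]]:
    """Production manager resolves an undecided job as OK or Failed."""
    if decision is Validation.UNDECIDED:
        raise ValueError("a forced decision must be OK or FAILED")
    return post_process(decision)


status, actions = post_process(2)
final_status, final_actions = force_decision(Validation.OK)
```

Keeping the undecided case as a separate state, rather than retrying automatically, leaves the final call with the production manager as the slide describes.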


The Future

– GANGA is starting to provide the required functionality
– For DC2, a new tool is being built, and the GANGA core should be its basis.
  • DCs require immediate solutions
  • Robust tools require slow development