Production Tools in ATLAS. RWL Jones, GridPP EB, 24th June 2003.
RWL Jones, Lancaster University
Grid in ATLAS
– ATLAS is a global collaboration, so the various Grid flavours are important
– Both US ATLAS and NorduGrid provide their own production tools
[Figure labels: US-ATLAS, EDG Testbed Prod, NorduGrid]
– All the services are either taken from Globus, or written using Globus libraries and API
  • Should be fairly compatible with Globus-based solutions
– Information system knows everything
  • Substantially re-worked and patched Globus MDS
  • Distributed and multi-rooted
  • Allows for a mesh topology
– The server (“Grid manager”) on each gatekeeper does most of the job
  • No need for a centralized broker
  • Pre- and post-stages files
  • Interacts with PBS
  • Keeps track of job status
  • Cleans up the mess
  • Sends mails to users
– The client (“User Interface”) does the Grid job submission, monitoring, termination, retrieval, cleaning, etc.
  • Interprets user’s job task
  • Gets the testbed status from the information system
  • Forwards the task to the best Grid Manager
  • Does some file uploading, if requested
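The client-side brokering described above (read the testbed status, then forward the task to the best Grid Manager) can be illustrated with a minimal Python sketch. The `ClusterStatus` record, its fields, the cluster and runtime environment names, and the ranking rule (most free CPUs, then shortest queue) are all assumptions made for this example; the slides do not say how the User Interface actually ranks Grid Managers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClusterStatus:
    """Hypothetical status record for one cluster, as an information system might report it."""
    name: str
    free_cpus: int
    queued_jobs: int
    runtime_envs: frozenset


def pick_grid_manager(clusters: Iterable[ClusterStatus],
                      required_env: str) -> Optional[ClusterStatus]:
    """Return the cluster judged best for a job, or None if no cluster qualifies.

    Assumed ranking: most free CPUs first, then the shortest queue.
    """
    candidates = [c for c in clusters if required_env in c.runtime_envs]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.free_cpus, -c.queued_jobs))


# Hypothetical testbed snapshot.
testbed = [
    ClusterStatus("cluster-a", free_cpus=4, queued_jobs=10, runtime_envs=frozenset({"atlas-sim"})),
    ClusterStatus("cluster-b", free_cpus=16, queued_jobs=2, runtime_envs=frozenset({"atlas-sim"})),
    ClusterStatus("cluster-c", free_cpus=64, queued_jobs=0, runtime_envs=frozenset()),
]

best = pick_grid_manager(testbed, "atlas-sim")
print(best.name if best else "no suitable cluster")
```

Because the choice is made by the client from published status, no central broker is needed, which matches the design point on the slide.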
Features and problems
– Features:
  • Relatively simple to join, expands rapidly
  • Installation is done on a single machine
  • Hides complexity of the distributed resources
  • Very convenient Replica Catalog implementation
  • Highly stable and reliable
  • Non-intrusive middleware
  • Accepts EDG certificates
  • Almost any runtime environment can be set up
– Problems:
  • Standard (a la Globus2) authentication and authorization mechanisms
  • Simplified (not more than in Globus2) data management system
  • No persistent book-keeping service
  • Simplified recovery mechanisms (as much as LRMS provides)
  • Lacks big storage facilities
  • Only command-line interface
  • No standardized procedure for runtime environment installation and validation
US GRAT Software
– GRid Applications Toolkit
– Used for U.S. Data Challenge production
– Based on Globus, Magda, AMI & MySQL
– Shell & Python scripts, modular design
– Rapid development platform
  • Essentially scripts
  • Quickly develop packages as needed by DC: single particle production; Higgs & SUSY production; pileup production & data management; reconstruction
– Test grid middleware, test grid performance
– Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA… (in progress)
GRAT Execution Model
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring
[Diagram: DC1 Prod. (UTA), Remote Gatekeeper, Replica (local), MAGDA (BNL), Param (CERN), Batch Execution and scratch, joined by arrows labelled with the step numbers above]
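The ten steps above can be read as an ordered pipeline. The following Python sketch shows one way to drive such a sequence; only the step names come from the slide. The handler signature, the shared `context` dictionary, and the stop-at-first-failure policy are assumptions for illustration and are not taken from GRAT itself.

```python
from typing import Callable, Dict, List

# Step names copied from the slide, in execution order.
GRAT_STEPS: List[str] = [
    "Resource Discovery",
    "Partition Selection",
    "Job Creation",
    "Pre-stage",
    "Batch Submission",
    "Job Parameterization",
    "Simulation",
    "Post-stage",
    "Cataloging",
    "Monitoring",
]

# A handler receives the shared job context and returns True on success.
StepHandler = Callable[[dict], bool]


def run_steps(context: dict, handlers: Dict[str, StepHandler]) -> List[str]:
    """Run the steps in order and return the names of the steps that completed.

    Steps without a handler are treated as no-ops in this sketch.
    Execution stops at the first handler that reports failure.
    """
    completed: List[str] = []
    for step in GRAT_STEPS:
        handler = handlers.get(step)
        if handler is not None and not handler(context):
            break
        completed.append(step)
    return completed


def failing_prestage(context: dict) -> bool:
    """Hypothetical handler that simulates a failed input transfer."""
    context["error"] = "pre-stage failed"
    return False


done = run_steps({"partition": 1}, {"Pre-stage": failing_prestage})
print(done)
```

Keeping each step behind a small handler mirrors the modular design claimed for GRAT: a single step can be swapped for another implementation without touching the driver.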
US Middleware Evolution
[Figure: middleware components grouped by status]
– Used in current production software (GRAT & Grappa)
– Tested successfully (not yet used for large scale production)
– Under development and testing
– Tested for simulation (may be used for large scale reconstruction)
What is the Atlas Commander?
– graphical interactive tool to support production manager
  • define jobs in large quantities
  • submit and monitor progress
  • scan log files for (un)known errors
  • update bookkeeping databases (AMI, Magda)
  • clean up in case of failures
– test bed for GANGA MC production components
AtCom has its own web site
– http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/
– contains user guide, developer’s guide, documentation, downloads, relevant contact e-mails, etc.
Architecture: application + plug-ins
[Diagram: AtCom core; AMIMgt and MagdaMgt modules for the bookkeeping DBs (AMI, Magda); plug-ins LSFComputingSystem, EDGComputingSystem, NGComputingSystem, PBSComputingSystem, ... for the clusters]
Two main functions of AtCom
– definition of jobs
– job submission/monitoring
Architecture (continued)
– plug-in implements abstract ‘cluster’ interface for specific clusters
  • e.g. LSF
– a plug-in is a Java class + configuration parameters
  • e.g. LSF@TIMBUKTU
– the AtCom configuration file defines all existing plug-ins and allows each to have its own configuration section
  • they are loaded at run-time
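The plug-in pattern described above (an abstract cluster interface, a concrete class per batch system, and one configuration section per plug-in instance loaded at run time) can be sketched as follows. AtCom's plug-ins are Java classes; this Python version only illustrates the pattern. The `submit` method, the `class` and `queue` configuration keys, and the job name are assumptions invented for the example, not AtCom's real interface or file format.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class ComputingSystem(ABC):
    """Abstract 'cluster' interface; the method name is an assumption."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.config = config

    @abstractmethod
    def submit(self, job_name: str) -> str:
        """Submit a job and return an identifier for it."""


class LSFComputingSystem(ComputingSystem):
    """Toy LSF plug-in: formats an identifier instead of calling a batch system."""

    def submit(self, job_name: str) -> str:
        return f"{self.config['queue']}:{job_name}"


# Maps class names used in the configuration to plug-in classes.
PLUGIN_CLASSES: Dict[str, Type[ComputingSystem]] = {
    "LSFComputingSystem": LSFComputingSystem,
}

# Hypothetical configuration: one section per plug-in instance.
CONFIG: Dict[str, Dict[str, str]] = {
    "LSF@TIMBUKTU": {"class": "LSFComputingSystem", "queue": "atlas"},
}


def load_plugins(config: Dict[str, Dict[str, str]]) -> Dict[str, ComputingSystem]:
    """Instantiate every configured plug-in at run time, keyed by its section name."""
    plugins: Dict[str, ComputingSystem] = {}
    for name, section in config.items():
        plugin_class = PLUGIN_CLASSES[section["class"]]
        plugins[name] = plugin_class(section)
    return plugins


plugins = load_plugins(CONFIG)
print(plugins["LSF@TIMBUKTU"].submit("dc1.partition0001"))
```

Separating the class from its configuration is what lets the same plug-in code serve several sites: a second section with a different queue would yield a second, independently configured instance.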
Available plug-ins
– LSF: well understood and supported
– NorduGrid: development suspended
– PBS: developed by Alvin Tan
– EDG: working, but no EDG based clusters used in production
– BQS: developed by Jerome Fulachier
Bookkeeping databases
– 5 logical database domains, two physical databases
  • physics meta-data
  • permanent production log
  • recipe catalog
  • transient production log
  • replica catalog
– AMI (Atlas Meta-data Interface): MySQL DB hosted at Grenoble
– Magda (Manager for grid-based data): MySQL DB hosted at BNL
When a job moves from RUNNING to DONE, post processing commences
– resolve validation script logical name into physical name and apply it to stdout/stderr in temp locations
  • returns 1=OK, 2=Undecided or 3=Failed
– if OK
  • register output files with Magda replica catalog
  • resolve extract script and apply it to stdout
  • copy/move logfiles to final destination
  • set status of partition to Validated
– if Failed
  • delete output files
– if Undecided
  • mark job as such
  • production manager can look at output of validation script or at the logfiles themselves and then force a decision as OK or Failed
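The post-processing branches above amount to a dispatch on the validation script's return code. The Python sketch below uses the return codes and action wording from the slide; the `post_process` function, the validator signature, and the log markers it looks for are hypothetical, and the actions are returned as strings rather than carried out.

```python
from enum import IntEnum
from typing import Callable, List


class Verdict(IntEnum):
    """Return codes of the validation script, as listed on the slide."""
    OK = 1
    UNDECIDED = 2
    FAILED = 3


def post_process(validate: Callable[[str, str], int],
                 stdout: str, stderr: str) -> List[str]:
    """Apply the validator to the job's logs and return the follow-up actions."""
    verdict = Verdict(validate(stdout, stderr))
    if verdict is Verdict.OK:
        return [
            "register output files with Magda replica catalog",
            "apply extract script to stdout",
            "copy/move logfiles to final destination",
            "set status of partition to Validated",
        ]
    if verdict is Verdict.FAILED:
        return ["delete output files"]
    # Undecided: left for the production manager to force OK or Failed.
    return ["mark job as Undecided"]


def example_validator(stdout: str, stderr: str) -> int:
    """Hypothetical validator; the log markers are invented for this sketch."""
    if "ERROR" in stderr:
        return Verdict.FAILED
    if "finished" in stdout:
        return Verdict.OK
    return Verdict.UNDECIDED


print(post_process(example_validator, "job finished", ""))
print(post_process(example_validator, "", "ERROR: disk full"))
print(post_process(example_validator, "", ""))
```

Keeping an explicit Undecided outcome, rather than forcing every job to pass or fail, is what leaves room for the production manager's manual decision.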
The Future
– GANGA is starting to provide the required functionality
– For DC2, a new tool is being built, and the GANGA core should be its basis.
  • DCs require immediate solutions
  • Robust tools require slow development