ATLAS Analysis Use
Dietrich Liko
Credits
GANGA Team, PANDA DA Team, ProdSys DA Team, ATLAS EGEE/LCG Taskforce, EGEE Job Priorities WG, DDM Team, DDM Operations Team
Overview
ATLAS Computing Model: AOD & ESD analysis, TAG based analysis
Data Management: DDM and Operation
Workload Management: EGEE, OSG & Nordugrid, Job Priorities
Distributed Analysis Tools: GANGA, pathena, other tools (Prodsys, LJSF)
Distributed Analysis activities during SC4
Tier-2 Site requirements
Tier-2 in the Computing Model
Tier-2 centers have an important role: calibration, simulation, analysis
Tier-2 centers provide analysis capacity for the physics and detector groups; in general, access patterns are chaotic
Typically a Tier-2 center will host the full TAG samples, one third of the full AOD sample, and selected RAW and ESD data
Data will be distributed according to guidelines given by the physics groups
Analysis models
For efficient processing it is necessary to analyze data locally; remote access to data is discouraged
To analyze the full AOD and the ESD data it is necessary to locate the data and send the jobs to the relevant sites
TAG data at Tier-2 will be file based; analysis uses the same pattern as AOD analysis
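The "locate the data, then send the job there" pattern can be sketched as follows. This is only an illustration: the replica catalog, the dataset names, the site names and the least-loaded tie-break are all assumptions made for the example, not part of DQ2 or of any ATLAS tool.

```python
from collections import defaultdict

# Hypothetical replica catalog: dataset name -> sites holding a complete copy.
REPLICAS = {
    "aod.sample_a": ["SITE_1", "SITE_2"],
    "aod.sample_b": ["SITE_2"],
    "esd.sample_c": ["SITE_3"],
}


def broker_jobs(datasets, replicas):
    """Assign one job per dataset to a site that already hosts that dataset."""
    load = defaultdict(int)  # number of jobs assigned to each site so far
    assignment = {}
    for name in datasets:
        sites = replicas.get(name, [])
        if not sites:
            # No remote access: a job cannot run where the data is missing.
            raise LookupError(f"no site hosts dataset {name!r}")
        # Assumed policy: pick the hosting site with the fewest assigned jobs.
        site = min(sites, key=lambda s: load[s])
        load[site] += 1
        assignment[name] = site
    return assignment


if __name__ == "__main__":
    for dataset, site in broker_jobs(list(REPLICAS), REPLICAS).items():
        print(f"{dataset} -> {site}")
```

Jobs are never sent to a site without the data, which mirrors the statement that remote access is discouraged.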
AOD Analysis
The assumption is that the users will perform some data reduction and generate Athena-aware ntuples (AANT); there are also other possibilities
This development is steered by the Physics Analysis Tools group (PAT)
AANT are then analyzed locally
Aims for SC4 & CSC
The main data to be analyzed are the AOD and the ESD data from the CSC production
The total amount of data is still small: a few 100 GB
We aim first at a full distribution of the data to all interested sites
Perform tests of the computing model by analyzing these data and measuring relevant indicators: reliability, scalability, throughput
Simply answer the questions: How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? And what happens if several of you try to do it at the same time?
In the following I will discuss the technical aspects necessary to achieve these goals
ATLAS has three grids with different middleware: LCG/EGEE, OSG, Nordugrid
Data Management is shared between the grids, but there are grid-specific aspects
Workload Management is grid specific
Data Management
I will review only some aspects related to Distributed Analysis
The ATLAS DDM is based on Don Quijote 2 (DQ2); see the tutorial session on Thursday for more details
Two major tasks
Data registration: datasets are used to increase the scalability; the Tier-1s are providing the necessary services
Data replication: based on the gLite File Transfer Service (FTS); fallback to SRM or gridftp is possible; subscriptions are used to manage the actual file transfer
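The two replication ideas above, subscriptions and transfer fallback, can be sketched in a few lines. Everything here is a toy model under stated assumptions: the function names, the stub transfer functions and the file names are hypothetical and do not reproduce the DQ2, FTS, SRM or gridftp interfaces.

```python
def resolve_subscription(dataset_files, files_at_site):
    """Return the files a subscribed site still has to receive."""
    return sorted(set(dataset_files) - set(files_at_site))


def transfer_with_fallback(filename, channels):
    """Try each transfer channel in order; return the name of the first that works."""
    errors = []
    for name, copy in channels:
        try:
            copy(filename)
            return name
        except OSError as exc:
            errors.append(f"{name}: {exc}")
    raise OSError(f"all channels failed for {filename}: {errors}")


# Stub transfer functions (assumptions for the example, not real clients).
def fts_copy(filename):
    raise OSError("FTS channel unavailable")


def srm_copy(filename):
    return None  # pretend the copy succeeded


def gridftp_copy(filename):
    return None  # pretend the copy succeeded


CHANNELS = [("FTS", fts_copy), ("SRM", srm_copy), ("gridftp", gridftp_copy)]

if __name__ == "__main__":
    missing = resolve_subscription(["f1", "f2", "f3"], ["f2"])
    for f in missing:
        print(f, "transferred via", transfer_with_fallback(f, CHANNELS))
```

The ordering of `CHANNELS` encodes the preference stated on the slide: FTS first, then SRM or gridftp as fallback.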
How to access data on a Tier-2
[Diagram: Tier 0, Tier 1 and Tier 2 with the components Dataset catalog, FTS, VOBOX, LRC, SE and CE; protocols shown: http, lrc protocol, rfio, dcap, gridftp, nfs]
To distribute the data for analysis …
Real data: data recorded and processed at CERN (Tier-0); data distribution via Tier-1
Reprocessing: reprocessing at Tier-1; distribution via other Tier-1
Simulation: simulation at Tier-1 and associated Tier-2; collection of data at Tier-1; distribution via other Tier-1
For analysis …
The challenge is not the amount of data, but the management of the overlapping flow patterns
For SC4 we have a simpler aim: obtain an equal distribution of the currently available simulated data
Data from the Tier-0 exercise is not useful for analysis; we will distribute only useful data
Grid specific aspects
OSG: DDM fully in production since January; site services also at Tier-2 centers
LCG/EGEE: only dataset registration in production; new deployment model addresses this issue; migration to new version 0.2.10 on the way
Nordugrid: up to now only dataset registration; final choice of file catalog still open
New version 0.2.10
Many essential features for Distributed Analysis: support for the ATLAS Tier structure; fallback from FTS to SRM and gridftp; support for disk areas; parallel operation of production and SC4; and many more
Deployment is on the way: we hope to see it in production very soon; OSG/Panda has to move to it asap; we should stay with this version until autumn
The success of Distributed Analysis during SC4 depends crucially on the success of this version
Local throughput
An Athena job has in the ideal case about 2 MB/sec data throughput
The limit is given by the persistency technology: Storegate-POOL-ROOT
Up to 50% of the capacity of a site is dedicated to analysis
We plan to access data locally via the native protocol (rfio, dcap, root etc.)
Local network configuration should take that into account
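A back-of-the-envelope estimate shows what the 2 MB/sec figure implies for the 150 TB of AOD expected per year of LHC running. The numbers 150 TB and 2 MB/sec come from the talk; the decimal unit convention (1 TB = 10^6 MB), the job counts and the assumption of perfectly parallel jobs with no overhead are assumptions of this sketch.

```python
AOD_VOLUME_TB = 150.0  # from the talk: AOD for one year of LHC running
JOB_RATE_MB_S = 2.0    # from the talk: ideal throughput of one Athena job


def analysis_time_hours(volume_tb, rate_mb_s, parallel_jobs):
    """Hours needed to read volume_tb with parallel_jobs ideal jobs."""
    volume_mb = volume_tb * 1_000_000  # assumed decimal units: 1 TB = 10^6 MB
    seconds = volume_mb / (rate_mb_s * parallel_jobs)
    return seconds / 3600.0


if __name__ == "__main__":
    for jobs in (1, 100, 1000):  # job counts chosen only for illustration
        hours = analysis_time_hours(AOD_VOLUME_TB, JOB_RATE_MB_S, jobs)
        aggregate = JOB_RATE_MB_S * jobs
        print(f"{jobs:5d} jobs: {hours:10.1f} h, aggregate read rate {aggregate:.0f} MB/s")
```

Under these assumptions a single job would need roughly 20 800 hours, while 1000 parallel jobs would need about 21 hours and an aggregate read rate of 2000 MB/sec across the sites, which is why the local network configuration matters.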
Data Management Summary
Data Distribution is essential for Distributed Analysis
DQ2 0.2.10 has the required features
There is a lot of work in front of us to control and validate data distribution
Local configuration determined by Athena I/O
Workload management
Different middleware, different teams, different submission tools
Different submission tools are confusing to our users
We aim to obtain a common ATLAS UI following the ideas of the pathena tool (see later)
But the priority for Distributed Analysis in the context of SC4 is to solve the technical problems within each grid infrastructure
LCG/EGEE
[Diagram: gLite UI, gLite Resource Broker, Dataset Location Catalog, BDII, Sites]
Advantages of the gLite RB
Bulk submission; increased performance and scalability; improved sandbox handling; shallow resubmission
If you want to use a Resource Broker for Distributed Analysis, you will ultimately want to use the gLite RB
Status: being pushed into deployment with gLite 3.0; does not yet have the same maturity as the LCG RB; turning the gLite RB into production quality evidently has a high priority
ATLAS EGEE/LCG Taskforce
               gLite              LCG
               Bulk submission    Multiple threads
Submission:    0.3 sec/job        2 sec/job
Matchmaking:   0.6 sec/job
Job submission is not the limit any more (there are other limits …)
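To put the per-job figures in perspective, the sketch below converts them into the time needed to hand a batch of jobs to the system. The per-job values are those quoted by the taskforce; attributing the 0.6 sec/job matchmaking figure to the gLite RB, adding it to the submission time, and the batch size of 1000 jobs are assumptions of this example.

```python
GLITE_SUBMIT_S = 0.3  # gLite RB, bulk submission (from the talk)
GLITE_MATCH_S = 0.6   # matchmaking (from the talk; assumed to apply to gLite)
LCG_SUBMIT_S = 2.0    # LCG RB, multiple threads (from the talk)


def submission_minutes(n_jobs, seconds_per_job):
    """Minutes needed to process n_jobs at a fixed cost per job."""
    return n_jobs * seconds_per_job / 60.0


if __name__ == "__main__":
    n_jobs = 1000  # illustrative batch size
    glite = submission_minutes(n_jobs, GLITE_SUBMIT_S + GLITE_MATCH_S)
    lcg = submission_minutes(n_jobs, LCG_SUBMIT_S)
    print(f"gLite RB: {glite:.1f} min for {n_jobs} jobs")
    print(f"LCG RB:   {lcg:.1f} min for {n_jobs} jobs")
```

With these assumptions 1000 jobs take about 15 minutes through the gLite RB and about 33 minutes through the LCG RB, small compared with the run time of the jobs themselves.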
Plan B: Prodsys + CondorG
[Diagram: ProdDB, CondorG Executor, CondorG Negotiator, Dataset Location Catalog, BDII, Sites]
Advantages
Profits from the experience of the production; proven record from the ongoing production; better control of the implementation
CondorG by itself is also part of the Resource Broker
Performance: ~ 1 sec/job
Status: coupled to the evolution of the production system; work on GANGA integration has started
OSG
On OSG ATLAS uses PANDA for workload management; the concept is similar to DIRAC and Alien
Fully integrated with the DDM
Status: in production since January; work to optimize the support for analysis users is ongoing
PANDA Architecture
Advantages for DA
Integration with DDM: all data already available; jobs start only when the data is available
Late job binding due to pilot jobs: addresses grid inefficiencies; fast response for user jobs
Integration of production and analysis activities
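Late binding with pilot jobs can be illustrated with a toy model: the pilot is already running on a worker node and only then asks a central queue for work whose input data is available at its site. The class and function names below are hypothetical and do not reflect the actual PANDA implementation.

```python
class TaskQueue:
    """Toy central queue: jobs wait here until a pilot asks for work."""

    def __init__(self):
        self._jobs = []  # list of (job name, input dataset) in arrival order

    def submit(self, job, dataset):
        self._jobs.append((job, dataset))

    def get_job(self, site_datasets):
        """Hand out the oldest job whose input dataset is at the pilot's site."""
        for index, (job, dataset) in enumerate(self._jobs):
            if dataset in site_datasets:
                self._jobs.pop(index)
                return job
        return None


def run_pilot(queue, site, site_datasets):
    """A pilot already running at a site binds to a job only now (late binding)."""
    job = queue.get_job(site_datasets)
    return f"{site} runs {job}" if job else f"{site} idle"


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit("user_job_1", "ds_a")
    queue.submit("user_job_2", "ds_b")
    print(run_pilot(queue, "SITE_X", {"ds_b"}))
    print(run_pilot(queue, "SITE_Y", {"ds_c"}))
    print(run_pilot(queue, "SITE_Z", {"ds_a"}))
```

Because the job is chosen after the pilot has started, a broken worker node or a missing dataset costs a pilot rather than a user job.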
Nordugrid
ARC middleware
Compact User Interface: 14 MB vs …
New version has very good job submission performance: 0.5 to 1 sec/job
Status
Open questions: support for all ATLAS users, DDM integration
Analysis capability seems to be coupled with the planned NG Tier-1
ARC Job submission
[Diagram: ARC UI, RLS, Nordugrid IS, Sites]
WMS Summary
EGEE/LCG: gLite Resource Broker; Prodsys & CondorG
OSG: PANDA
Nordugrid: ARC
Different systems, different problems: all job submission systems need work to optimize user analysis
Job Priorities
Different infrastructures, different problems
EGEE: Job Priority WG
OSG/Panda: separate cloud of analysis pilots
Nordugrid: typically a site has several queues
EGEE Job Priority WG
TCG working group
ATLAS & CMS; LHCb & Diligent (observers); JRA1 (developers); SA3 (deployment); several sites (NIKHEF, CNAF)
ATLAS requirements
Split site resources in several shares: production, long jobs, short jobs, other jobs
Objectives: production should not be pushed from a site; analysis jobs should bypass production jobs; local fairshare
Proposed solution
[Diagram: two CEs feeding the shares Production, Long, Short and Software; share sizes shown: 70%, 20%, 1%, 9%; VOMS roles shown: Role=Production, Role=Software]
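How percentage shares translate into batch slots at a site can be sketched as follows. The four percentages are those shown in the proposed solution, but the transcript does not make clear which percentage belongs to which share; the mapping used below, the share names and the slot counts are assumptions made only for illustration.

```python
# Assumed mapping of the four percentages to shares (not confirmed by the talk).
ASSUMED_SHARES = {"production": 70, "long": 20, "short": 9, "software": 1}


def allocate_slots(total_slots, shares_percent):
    """Split total_slots by percentage shares using the largest-remainder rule."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must add up to 100%")
    exact = {name: total_slots * pct / 100 for name, pct in shares_percent.items()}
    slots = {name: int(value) for name, value in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Give the slots lost to rounding to the shares with the largest remainders.
    by_remainder = sorted(exact, key=lambda name: exact[name] - slots[name], reverse=True)
    for name in by_remainder[:leftover]:
        slots[name] += 1
    return slots


if __name__ == "__main__":
    for total in (200, 50):  # illustrative site sizes
        print(total, "slots:", allocate_slots(total, ASSUMED_SHARES))
```

A real batch system enforces fair shares over time rather than by a fixed partition of slots; the fixed split is only meant to show the relative sizes involved.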
Status
Based on VOMS Roles: Role=Production, Role=Software
No new middleware; a patch to the WMS has to be back ported
Test installations at NIKHEF and CNAF; TCG & WLCG MB have agreed to the proposed solution; we are planning the move to the preproduction service; move to the production sites in the near future
In the future: support for physics groups; dynamic settings; requires new middleware (such as GPBOX)
PANDA
Increase the number of analysis pilots: fast pickup of user jobs; first job can start in a few seconds
Several techniques are being studied: multitasking pilots, analysis queues
Job Priorities Summary
EGEE/LCG: new site configuration
OSG/Panda: addressed by PANDA internal developments
DA User Tools
pathena: PANDA tool for Distributed Analysis; close to the Physics Analysis Tools group (PAT)
GANGA: common project between LHCb & ATLAS; used for Distributed Analysis on LCG
pathena
Developed in close collaboration between PAT and PANDA
Local job: athena ttbar_jobOptions.py
Grid job: pathena ttbar_jobOptions.py --inDS csc11.005100.ttbar.recon.AOD… --split 10
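Assuming the split option asks for the input dataset to be processed by that many sub-jobs, the bookkeeping amounts to dividing the dataset's file list into nearly equal chunks. The sketch below shows one way to do that; the file names are made up, and this is not a description of how pathena itself distributes files.

```python
def split_files(files, n_subjobs):
    """Divide files into at most n_subjobs contiguous, nearly equal chunks."""
    n = min(n_subjobs, len(files))
    if n == 0:
        return []
    base, extra = divmod(len(files), n)
    chunks = []
    start = 0
    for index in range(n):
        # The first `extra` chunks take one additional file each.
        size = base + (1 if index < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    files = [f"AOD_file_{i:03d}.root" for i in range(25)]  # hypothetical names
    for number, chunk in enumerate(split_files(files, 10)):
        print(f"sub-job {number}: {len(chunk)} files")
```

Every input file ends up in exactly one sub-job, and sub-job sizes differ by at most one file.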
GANGA
Framework for job submission
Based on plugins for
Backends: LCG, gLite, CondorG, LSF, PBS
Applications: Athena, Executable
GPI abstraction layer: Python Command Line Interface (CLIP), GUI
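The plugin idea, one job object combined with interchangeable application and backend plugins, can be sketched with toy classes. The classes below are hypothetical stand-ins written for this example; they are not the GANGA GPI, and a real backend would hand the payload to grid or batch middleware instead of returning a string.

```python
class ToyAthena:
    """Toy application plugin: knows how to build the payload command."""

    def __init__(self, job_options):
        self.job_options = job_options

    def prepare(self):
        return f"athena {self.job_options}"


class ToyBackend:
    """Toy backend plugin: knows where the payload is sent."""

    def __init__(self, name):
        self.name = name

    def submit(self, payload):
        # A real backend would call the middleware here.
        return f"[{self.name}] {payload}"


class ToyJob:
    """A job is an application plus a backend; either can be swapped freely."""

    def __init__(self, application, backend):
        self.application = application
        self.backend = backend

    def submit(self):
        return self.backend.submit(self.application.prepare())


if __name__ == "__main__":
    app = ToyAthena("ttbar_jobOptions.py")
    for backend_name in ("LSF", "LCG"):
        print(ToyJob(app, ToyBackend(backend_name)).submit())
```

The same application object is submitted to two backends without change, which is the property that makes a single user interface across grids plausible.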
GANGA GUI
Other Tools
LJSF (Light Job Submission Framework): used for ATLAS software installations; runs ATLAS transformations on the grid; no integration with DDM yet
DA with Prodsys: special analysis transformations; work to interface with GANGA has started
DA Tools Summary
Main tools: pathena with PANDA/OSG; GANGA with Resource Broker/LCG
Integration with Athena as demonstrated by pathena is a clear advantage
The GANGA plug-in mechanism allows in principle a single interface to all grids; the priority for the GANGA team is to deliver a robust solution on LCG first
Distributed Analysis in SC4
Data distribution: ongoing activity with the DDM operations team
Site configuration: we will move soon to the preproduction service; in a few weeks we will then move to the production sites
Exercising the job priority model: analysis in parallel to production
Scaling tests of the computing infrastructure: measurement of the turnaround time for analysis of large datasets
SC4 Timescale
We plan to perform DA tests in August and then later in autumn
The aim is to quantify the current characteristics of the Distributed Analysis systems: scalability, reliability, throughput
Simply answer the questions: How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? And what happens if several of you try to do it at the same time?
Tier-2 Site Requirements
Configuration of the batch system to support the job priority model: gLite 3.0; analysis and production in parallel
Data availability: connect to DDM; disk area; sufficient local throughput