ATLAS Analysis Use

Dietrich Liko

Credits

GANGA Team, PANDA DA Team, ProdSys DA Team, ATLAS EGEE/LCG Taskforce, EGEE Job Priorities WG, DDM Team, DDM Operations Team

Overview ATLAS Computing Model

AOD & ESD analysis TAG based analysis

Data Management DDM and Operation

Workload Management EGEE, OSG & Nordugrid Job Priorities

Distributed Analysis Tools GANGA pathena Other tools: Prodsys, LJSF

Distributed Analysis activities during SC4

Tier-2 Site requirements

Tier-2 in the Computing Model Tier-2 centers have an important role

Calibration Simulation Analysis

Tier-2 provide analysis capacity for the physics and detector groups In general chaotic access patterns

Typically a Tier-2 center will host … Full TAG samples One third of the full AOD sample Selected RAW and ESD data

Data will be distributed according to guidelines given by the physics groups

Analysis models

For efficient processing it is necessary to analyze data locally Remote access to data is discouraged

To analyze the full AOD and ESD data, it is necessary to locate the data and send the jobs to the relevant sites

TAG data at Tier-2 will be file based Analysis uses same pattern as AOD analysis
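The "send the jobs to the data" model above can be sketched as a toy brokering step. This is only an illustration: the function name, the catalog layout, and the site and dataset names are hypothetical, and the least-loaded-site rule is an assumption, not a description of any ATLAS tool.

```python
from collections import defaultdict


def broker_jobs(job_datasets, replica_catalog):
    """Assign each job to a site that holds a replica of its input dataset.

    job_datasets:    dict job_id -> dataset name
    replica_catalog: dict dataset name -> list of sites holding a replica
    Returns a dict site -> list of job_ids. Jobs whose dataset has no
    replica anywhere are collected under the key None (no remote access).
    """
    load = defaultdict(int)   # number of jobs already assigned per site
    plan = defaultdict(list)
    for job_id, dataset in sorted(job_datasets.items()):
        sites = replica_catalog.get(dataset, [])
        if not sites:
            plan[None].append(job_id)
            continue
        # Assumed policy: least-loaded site first, ties broken by site name.
        site = min(sites, key=lambda s: (load[s], s))
        load[site] += 1
        plan[site].append(job_id)
    return dict(plan)


# Hypothetical catalog and jobs, for illustration only.
catalog = {"aod.ttbar": ["SITE_A", "SITE_B"], "esd.zee": ["SITE_B"]}
jobs = {1: "aod.ttbar", 2: "aod.ttbar", 3: "esd.zee", 4: "tag.missing"}
plan = broker_jobs(jobs, catalog)
```

Jobs never run at a site without the data, which is the point of discouraging remote access.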

AOD Analysis

The assumption is that the users will perform some data reduction and generate Athena aware ntuples (AANT) There are also other possibilities

This development is steered by the Physics Analysis Tools group (PAT)

AANT are then analyzed locally

Aims for SC4 & CSC

The main data to be analyzed are the AOD and the ESD data from the CSC production

The total amount of data is still small … a few 100 GB

We aim first at a full distribution of the data to all interested sites

Perform tests of the computing model by analyzing these data and measuring relevant indicators: Reliability, Scalability, Throughput

Simply answer the questions … How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? And what happens if several of you try to do it at the same time?

In the following … I will discuss the technical aspects necessary to achieve these goals

ATLAS has three grids with different middleware LCG/EGEE OSG Nordugrid

Data Management is shared between the grids But there are grid specific aspects

Workload Management is grid specific

Data Management I will review only some aspects related to Distributed Analysis

The ATLAS DDM is based on Don Quijote 2 (DQ2) See the tutorial session on Thursday for more details

Two major tasks Data registration

Datasets are used to increase the scalability Tier-1’s are providing the necessary services

Data replication Based on the gLite File Transfer Service (FTS) Fallback to SRM or gridftp possible Subscriptions are used to manage the actual file transfer

How to access data on a Tier-2

[Diagram. Tiers shown: Tier 0, Tier 1, Tier 2. Components shown: dataset catalog, FTS, VOBOX, LRC, SE, CE. Protocols shown: http, lrc protocol, rfio, dcap, gridftp, nfs]

To distribute the data for analysis … Real data

Data recorded and processed at CERN (Tier-0) Data distribution via Tier-1

Reprocessing Reprocessing at Tier-1 Distribution via other Tier-1

Simulation Simulation at Tier-1 and associated Tier-2 Collection of data at Tier-1 Distribution via other Tier-1

For analysis …

The challenge is not the amount of data, but the management of the overlapping flow patterns

For SC4 we have a simpler aim … Obtain an equal distribution of the currently available simulated data

Data from the Tier-0 exercise is not useful for analysis We will distribute only useful data

Grid specific aspects OSG

DDM fully in production since January Site services also at Tier-2 centers

LCG/EGEE Only dataset registration in production New deployment model addresses this issue Migration to new version 0.2.10 on the way

Nordugrid Up to now only dataset registration Final choice of file catalog still open

New version 0.2.10 Many essential features for Distributed Analysis

Support for the ATLAS Tier structure Fallback from FTS to SRM and gridftp Support for disk areas Parallel operation of production and SC4 And many more ….

Deployment is on the way We hope to see it in production very soon OSG/Panda has to move to it asap We should stay with this version until autumn

The success of Distributed Analysis during SC4 depends crucially on the success of this version

Local throughput An Athena job has in the ideal case about 2 MB/sec data throughput The limit is given by the persistency technology

Storegate-POOL-ROOT

Up to 50% of the capacity of a site is dedicated to analysis

We plan to access data locally via the native protocol (rfio, dcap, root etc)

Local network configuration should take that into account
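A back-of-the-envelope calculation shows what the 2 MB/sec figure implies for the 150 TB AOD question and for the local network. It is a sketch under assumptions: decimal units (1 TB = 10^12 bytes), every job running at the ideal rate, and hypothetical job and slot counts chosen only for illustration.

```python
TB = 1e12  # bytes, decimal units (assumption)
MB = 1e6   # bytes


def analysis_time_hours(data_bytes, parallel_jobs, rate_per_job=2 * MB):
    """Wall-clock hours to read data_bytes once with ideal parallel jobs."""
    return data_bytes / (parallel_jobs * rate_per_job) / 3600.0


def site_bandwidth_mb_per_s(analysis_slots, rate_per_job=2 * MB):
    """Aggregate read rate the local storage must sustain, in MB/sec."""
    return analysis_slots * rate_per_job / MB


single = analysis_time_hours(150 * TB, parallel_jobs=1)
thousand = analysis_time_hours(150 * TB, parallel_jobs=1000)  # hypothetical job count
needed = site_bandwidth_mb_per_s(analysis_slots=200)          # hypothetical site size
```

Under these assumptions a single job would need roughly 20 800 hours (well over two years), while 1000 concurrent jobs would need about 21 hours. A site running 200 analysis jobs at once would have to deliver about 400 MB/sec from local storage.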

Data Management Summary

Data Distribution is essential for Distributed Analysis

DQ2 0.2.10 has the required features

There is a lot of work in front of us to control and validate data distribution

Local configuration determined by Athena I/O

Workload management Different middleware Different teams Different submission tools

Different submission tools are confusing to our users ..

We aim to obtain some common ATLAS UI following the ideas of pathena tool (see later)

But …. the priority for Distributed Analysis in the context of SC4 is to solve the technical problems within each grid infrastructure

LCG/EGEE

[Diagram: gLite UI, gLite Resource Broker, Dataset Location Catalog, BDII, Sites]

Advantages of the gLite RB Bulk submission

Increased performance and scalability Improved Sandbox handling Shallow resubmission

If you want to use a Resource Broker for Distributed Analysis, you will ultimately want to use the gLite RB

Status Being pushed into deployment with gLite 3.0 Does not yet have the same maturity as the LCG RB Turning the gLite RB into a production-quality service evidently has a high priority

ATLAS EGEE/LCG Taskforce

                 gLite              LCG
Method           Bulk submission    Multiple threads
Submission       0.3 sec/job        2 sec/job

Matchmaking: 0.6 sec/job

Job submission is not the limit any more (there are other limits ….)
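To put the per-job submission figures in perspective, the following sketch scales them to a large task. The job count is a hypothetical example, and the calculation assumes the quoted rates (0.3 sec/job for gLite bulk submission, 2 sec/job for the LCG RB with multiple threads) stay constant at that scale.

```python
def submission_time_s(n_jobs, sec_per_job):
    """Total submission time in seconds at a constant per-job cost."""
    return n_jobs * sec_per_job


# Per-job submission cost quoted on the slide.
RATES = {"gLite bulk submission": 0.3, "LCG multiple threads": 2.0}

N_JOBS = 10000  # hypothetical task size
totals = {name: submission_time_s(N_JOBS, rate) for name, rate in RATES.items()}
```

For 10 000 jobs this gives about 50 minutes with gLite bulk submission against about 5.5 hours with the LCG RB.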

Plan B: Prodsys + CondorG

[Diagram: ProdDB, CondorG Executor, CondorG Negotiator, Dataset Location Catalog, BDII, Sites]

Advantages

Profits from the experience gained in production Proven record from the ongoing production Better control over the implementation

CondorG by itself is also part of the Resource Broker

Performance ~ 1 sec/job

Status Coupled to the evolution of the production system Work on GANGA integration has started

OSG

On OSG, ATLAS uses PANDA for Workload management Concept is similar to DIRAC and AliEn

Fully integrated with the DDM

Status In production since January Work to optimize the support for analysis users is ongoing

PANDA Architecture

Advantages for DA

Integration with DDM All data already available Jobs start only when the data is available

Late job binding due to pilot jobs Addresses grid inefficiencies Fast response for user jobs

Integration of production and analysis activities
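Late binding with pilot jobs can be sketched as a pull model: a pilot already running at a site asks a central queue for work, and a job is released only if its input data is at that site. The class and method names are hypothetical and do not reflect the PANDA server interface.

```python
from collections import deque


class TaskQueue:
    """Toy central queue of user jobs awaiting a pilot."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job_id, dataset):
        self._jobs.append((job_id, dataset))

    def get_job(self, site_datasets):
        """Called by a pilot: hand out the oldest job whose data is at its site.

        site_datasets: set of dataset names available at the pilot's site.
        Returns the job_id, or None if nothing runnable is queued.
        """
        for i, (job_id, dataset) in enumerate(self._jobs):
            if dataset in site_datasets:
                del self._jobs[i]  # bind the job to this pilot only now
                return job_id
        return None


queue = TaskQueue()
queue.submit(1, "ds.A")
queue.submit(2, "ds.B")
first = queue.get_job({"ds.B"})   # pilot at a site holding only ds.B
second = queue.get_job({"ds.B"})  # nothing else runnable at that site
third = queue.get_job({"ds.A"})   # pilot at a site holding ds.A
```

Because the job is bound to a worker node that is already up, a failed or slow batch slot costs a pilot rather than a user job.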

Nordugrid ARC middleware Compact User Interface 14 MB vs

New version has very good job submission performance 0.5 to 1 sec/job

Status Open questions: Support for all ATLAS users, DDM integration Analysis capability seems to be coupled with the planned NG Tier-1

ARC Job submission

[Diagram: ARC UI, RLS, Nordugrid IS, Sites]

WMS Summary EGEE/LCG

gLite Resource Broker Prodsys & CondorG

OSG PANDA

Nordugrid ARC

Different systems – different problems All job submission systems need work to optimize user analysis

Job Priorities

Different infrastructures – different problems

EGEE Job Priority WG

OSG/Panda Separate cloud of analysis pilots

Nordugrid Typically a site has several queues

EGEE Job Priority WG

TCG working group

ATLAS & CMS LHCb & Diligent observer JRA1 (developers) SA3 (deployment) Several sites (NIKHEF, CNAF)

ATLAS requirements

Split site resources in several shares Production Long jobs Short jobs Other jobs

Objectives Production should not be pushed from a site Analysis jobs should bypass production jobs Local fairshare
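The share-based objectives can be illustrated with a small scheduling rule: give the next free slot to the share that is furthest below its target fraction. This is a sketch, not the configuration proposed by the working group. The function name and the target fractions are hypothetical values chosen for illustration.

```python
def next_share(targets, running):
    """Return the share that is furthest below its target fraction.

    targets: dict share name -> target fraction of the site (sums to 1.0)
    running: dict share name -> number of jobs currently running
    Ties are resolved alphabetically by share name.
    """
    total = sum(running.values())

    def deficit(name):
        used = running.get(name, 0) / total if total else 0.0
        return targets[name] - used

    return max(sorted(targets), key=deficit)


# Hypothetical targets, NOT the values of the proposed solution.
targets = {"production": 0.5, "long": 0.3, "short": 0.2}

# Production currently occupies 80% of the busy slots.
choice = next_share(targets, {"production": 8, "long": 2, "short": 0})
idle = next_share(targets, {})
```

In the first case the free slot goes to the short share, so a short analysis job bypasses queued production jobs, while production still keeps its own share of the site.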

Proposed solution

[Diagram: two CE boxes; shares labelled Production, Long, Short and Software; values shown: 70 %, 20 %, 1 %, 9 %; VOMS roles Role=Production and Role=Software]

Status Based on VOMS Roles

Role=Production Role=Software

No new middleware A patch to the WMS has to be back ported

Test Installation NIKHEF, CNAF TCG & WLCG MB have agreed to the proposed solution We are planning the move to the preproduction service Move to the production sites in the not too distant future

In the future Support for physics groups Dynamic settings Requires new middleware (such as GPBOX)

PANDA

Increase the number of analysis pilots Fast pickup of user jobs First job can start in a few seconds

Several techniques are being studied Multitasking pilots Analysis queues

Job Priorities Summary

EGEE/LCG New site configuration

OSG/Panda Addressed by PANDA internal developments

DA User Tools

pathena PANDA tool for Distributed Analysis Close to the Physics Analysis Tools group (PAT)

GANGA Common project between LHCb & ATLAS Used for Distributed Analysis on LCG

pathena

Developed in close collaboration between PAT and PANDA

Local job: athena ttbar_jobOptions.py

Grid job: pathena ttbar_jobOptions.py --inDS csc11.005100.ttbar.recon.AOD… --split 10
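What a split into 10 sub-jobs means for the input dataset can be sketched as below. The slide does not say how pathena distributes files among sub-jobs, so the contiguous near-equal chunking is an assumption, and the function name and file names are hypothetical.

```python
def split_files(files, n_subjobs):
    """Divide files into at most n_subjobs contiguous, near-equal chunks."""
    if n_subjobs < 1:
        raise ValueError("n_subjobs must be at least 1")
    n = min(n_subjobs, len(files))  # never create empty sub-jobs
    if n == 0:
        return []
    base, extra = divmod(len(files), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # first 'extra' chunks get one more
        chunks.append(files[start:start + size])
        start += size
    return chunks


# Hypothetical file names standing in for the contents of the input dataset.
files = ["AOD._%05d.pool.root" % i for i in range(25)]
chunks = split_files(files, 10)
sizes = [len(c) for c in chunks]
```

Each chunk would become the input list of one sub-job running the same job options.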

GANGA

Framework for job submission

Based on plugins for

Backends: LCG, gLite, CondorG, LSF, PBS

Applications: Athena, Executable

GPI abstraction layer Python Command Line Interface (CLIP) GUI
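The plug-in idea, where the same job description can be sent to different backends, can be sketched with a small registry. This is not the GANGA GPI: every class, function and backend name below is hypothetical, and the backends only return a string instead of submitting anything.

```python
BACKENDS = {}  # registry: backend name -> backend class


def backend(name):
    """Class decorator that registers a backend plug-in under a name."""
    def register(cls):
        BACKENDS[name] = cls
        return cls
    return register


@backend("Local")
class LocalBackend:
    def submit(self, application, args):
        return "local: %s %s" % (application, " ".join(args))


@backend("Grid")
class GridBackend:
    def submit(self, application, args):
        return "grid: %s %s" % (application, " ".join(args))


class Job:
    """A job is an application plus a backend chosen by name."""

    def __init__(self, application, args, backend_name):
        self.application = application
        self.args = list(args)
        self.backend = BACKENDS[backend_name]()

    def submit(self):
        return self.backend.submit(self.application, self.args)


local_result = Job("athena", ["ttbar_jobOptions.py"], "Local").submit()
grid_result = Job("athena", ["ttbar_jobOptions.py"], "Grid").submit()
```

Switching from a local test to a grid run changes only the backend name, which is what makes a single user interface across grids conceivable.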

GANGA GUI

Other Tools

LJSF (Light Job submission framework) Used for ATLAS software installations Runs ATLAS transformations on the grid No integration with DDM yet

DA with Prodsys Special analysis transformations Work to interface with GANGA has started

DA Tools Summary

Main tools pathena with PANDA/OSG GANGA with Resource Broker/LCG

Integration with Athena as demonstrated by pathena is a clear advantage

GANGA plug-in mechanism allows in principle to obtain a unique interface Priority for the GANGA team is to deliver a robust solution on LCG first

Distributed Analysis in SC4 Data distribution

Ongoing activity with the DDM operations team

Site configuration We will move soon to the preproduction service In a few weeks we will then move to the production sites

Exercising the job priority model Analysis in parallel to production

Scaling tests of the computing infrastructure Measurement of the turnaround time for analysis of large datasets

SC4 Timescale We plan to perform DA tests in August and then later in autumn

The aim is to quantify the current characteristics of the Distributed Analysis systems Scalability Reliability Throughput

Simply answer the questions … How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? And what happens if several of you try to do it at the same time?

Tier-2 Site Requirements

Configuration of the batch system to support the job priority model gLite 3.0 Analysis and production in parallel

Data availability Connect to DDM Disk area Sufficient local throughput
