ATLAS Distributed Analysis

Page 1: ATLAS Distributed Analysis

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

ATLAS Distributed Analysis

A. Zalite / PNPI

Page 2: ATLAS Distributed Analysis

Overview

• Why?
• Goal
• ADA Model
• First steps
• Demo example
• More examples
• Conclusion

Page 3: ATLAS Distributed Analysis

Why?

• Huge amount of data

– The ATLAS experiment is expected to record several petabytes of data per year

– The ATLAS offline system will produce a similar amount of data (ESD, AOD, …)

• Globally distributed members of the ATLAS collaboration

– Over 1000 physicists from all over the world will take part in data analysis

• The data have to be available to all members of the collaboration

Page 4: ATLAS Distributed Analysis

Goal

• Provide to globally distributed users
– Access to globally distributed data
– Tools to perform globally distributed processing on this data
• Easy to use and access from the analysis environment
– Flexible enough to adapt to the environment
• Enable effective use of all ATLAS computing resources
• Trace information about the processing of any data
– Where did this data (event or analysis) come from?

Page 5: ATLAS Distributed Analysis

ADA Model

Components:
• Data are described by a Dataset (a collection of data)
– Location of the data (e.g. files)
– Content (e.g. list of event IDs and the type of data for each event)
• A Transformation describes an operation that can act on a dataset to produce a new dataset
– Application: scripts used to run a job, either to build a task or to process data
– Task: carries user parameters or code (e.g. ATLAS release, job options, and/or algorithm code)
• A Job is an instance of a transformation acting on a dataset
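
The component model on this slide can be sketched as plain C++ data structures. This is only an illustration under stated assumptions: the struct names, their fields and the helper make_job are hypothetical and are not the DIAL classes.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the ADA concepts; these are NOT the DIAL classes.
struct Dataset {
    std::string name;
    std::vector<std::string> files;   // location of the data
    std::vector<long> event_ids;      // content: which events it holds
};

struct Application {
    std::string name;                 // e.g. "atlasopt"
};

struct Task {
    std::string name;
    std::vector<std::string> files;   // user parameters or code
};

// A transformation pairs an application with a task.
struct Transformation {
    Application app;
    Task task;
};

// A job is one instance of a transformation acting on a dataset.
struct Job {
    Transformation xform;
    Dataset input;
};

// Build a job from its three parts: application, task and dataset.
Job make_job(const Application& app, const Task& task, const Dataset& ds) {
    return Job{Transformation{app, task}, ds};
}
```

Keeping the application and task separate mirrors the slides: the application is usually reused unchanged, while the task carries whatever the user varies between jobs.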

Page 6: ATLAS Distributed Analysis

ADA Model

• Many ATLAS-specific transformations have been defined
– atlasopt: user provides ATLAS release and job options
– aodhisto: atlasopt plus code to build in the UserAnalysis package
– atlasdev: atlasopt plus local development directory
– atlasdev-src: same as atlasdev except the development area is tarred up and will be rebuilt if the platform changes
• All these transformations run Athena

Page 7: ATLAS Distributed Analysis

ADA Model

[Figure: DIAL components, September 20, 2004. Boxes: User analysis framework (ROOT, GANGA, ...); Dataset (event data, summary data, tuples, ...); Application (Athena, dialpaw, ROOT, ...); Task (code, params); Transformation; Analysis Service; Dataset 1, Dataset 2; Job 1, Job 2; Result 1, Result 2. Arrow labels: 1. create or locate; 2. select; 3. create or select; 4. select; 5. submit(app,tsk,ds); 6. split; 7. create; 8. run(app,tsk,ds1), run(app,tsk,ds2); 9. fill; 10. gather.]

Page 8: ATLAS Distributed Analysis

ADA Model

This view enables distributed processing:
• Split the input dataset
– Along event, file, or sub-dataset boundaries
• Create a separate sub-job for each sub-dataset
• Implies a post-processing stage to merge results (output datasets)

Users carry out processing by
• Defining a job
– Application, task and dataset
• Submitting this definition to a scheduler
– Typically an analysis service
• Polling for status
– Job state (and sub-job states)
– Result dataset

Page 9: ATLAS Distributed Analysis

ADA Model

On receiving a job request, the scheduler
• Builds the task (or locates an existing build)
• Splits the dataset into sub-datasets
• Creates and submits a sub-job for each sub-dataset
• Merges the results (output datasets) from each sub-job into an overall result
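
The split, run and merge sequence can be sketched in a few lines of C++. This is a toy model, not the DIAL scheduler: a dataset is reduced to a list of file names, a sub-job "result" is just a count of files processed, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using FileList = std::vector<std::string>;

// Split a dataset (here: a list of files) into sub-datasets along file boundaries.
std::vector<FileList> split_dataset(const FileList& files, std::size_t files_per_subjob) {
    std::vector<FileList> parts;
    if (files_per_subjob == 0) return parts;  // nothing sensible to do
    for (std::size_t i = 0; i < files.size(); i += files_per_subjob) {
        std::size_t end = std::min(i + files_per_subjob, files.size());
        parts.emplace_back(files.begin() + i, files.begin() + end);
    }
    return parts;
}

// Stand-in for a sub-job: the "result" is the number of files processed.
int run_subjob(const FileList& sub) { return static_cast<int>(sub.size()); }

// Stand-in for the merge stage: sum the partial results.
int merge_results(const std::vector<int>& partial) {
    int total = 0;
    for (int r : partial) total += r;
    return total;
}

// Scheduler loop: split, run one sub-job per sub-dataset, merge.
int process(const FileList& files, std::size_t files_per_subjob) {
    std::vector<int> partial;
    for (const FileList& sub : split_dataset(files, files_per_subjob))
        partial.push_back(run_subjob(sub));
    return merge_results(partial);
}
```

In a real system the sub-jobs would run concurrently on different resources and the merge would combine output datasets such as histograms or ntuples. The sequential loop here only shows the order of the stages.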

Page 10: ATLAS Distributed Analysis

ADA Architecture

[Figure: ADA architecture in three layers. GUI and command line clients: ROOT, PYTHON. High-level services for cataloging, job submission and monitoring: AMI DBS, DIAL AS, ATPROD AS, ARDA AS. Workload management systems: LSF, CONDOR; gLite WMS; ATPROD. Interface labels on the connections: AJDL, sh, SQL, gLite, AMI ws.]

Page 11: ATLAS Distributed Analysis

ADA

ADA uses the DIAL framework. Release 1.20 of DIAL is the basis for the current ADA system.

To use ADA it is necessary

• To have a Grid certificate

– A certificate from the Russian CA is OK

• To be a member of the ATLAS VO

– Takes some time

Page 12: ATLAS Distributed Analysis

First Steps

• Working node: LXPLUS at CERN

• Set up the grid environment: . /afs/cern.ch/project/gd/LCG-share/sl3/etc/profile.d/grid_env.sh

• Initialize the certificate proxy: grid-proxy-init

• DIAL setup (a setup script that defines a few environment variables and aliases)

at CERN: DIALSETDIR=/afs/cern.ch/user/d/dial/apps/dial/setup

• Verify the user certificate and check the status of the unique ID service by issuing the command "uidtest" after setting up DIAL.

Page 14: ATLAS Distributed Analysis

First Steps

• The best way to start with DIAL is to run the demos inside ROOT

• These demos define a job
– application (papp)
– task (ptsk)
– dataset (pdst)

• and submit it to the current scheduler (msch)

• Start:

dialroot -i

The flag -i means that any missing DIAL configuration, example or demo files will be copied into the local directory (necessary only the first time)

Page 17: ATLAS Distributed Analysis

Demo Example

• Distributed analysis is an iterative process where a physicist defines a job, submits it to a processing system, examines the result and then repeats the sequence.

• The demo selects an application, task and dataset, which are then submitted to a scheduler to define a job.

root [0] .x demos/demo4.C    // defines papp, ptsk and pdst

root [1] submit()            // submit a job based on papp, ptsk and pdst

root [2] get_results()       // get job status and partial result

root [3] TBrowser br         // check output ntuples and histograms

Page 18: ATLAS Distributed Analysis

Demo Example

• A job is specified by defining a transformation and selecting a dataset to process with this transformation. The transformation is specified by an application and a task. The application carries the scripts that do the processing and the task carries user configuration data.

• Demo4 uses aodhisto to create histograms and ntuples from user source code

• The demo identifies objects by name, extracts the corresponding ID from a selection catalog and uses this ID to extract the object from a repository.

Page 19: ATLAS Distributed Analysis

Demo Example

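void demo4() {
  // Names of the application, task and dataset to use.
  string aname = "aodhisto";
  string tname = "aodhisto_zll_aod";
  string dname = "hma.dc2.003007.digit.A1_z_ee.aod-1000.10files";

  // Look up each object's ID by name in its selection catalog.
  aid = asc.id(aname);
  tid = tsc.id(tname);
  did = dsc.id(dname);

  // Extract the objects from their repositories.
  papp = ar.extract(aid);
  ptsk = tr.extract(tid);
  pdst = dr.extract(did);
}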
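
The two-step lookup used by the demo (name to ID through a selection catalog, then ID to object through a repository) can be modeled outside ROOT as follows. This is a sketch under assumptions: SelectionCatalog, Repository and lookup are hypothetical stand-ins built on std::map, integer IDs are an arbitrary choice, and the real DIAL catalog and repository classes will differ.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; the real DIAL catalog and repository classes differ.
struct SelectionCatalog {
    std::map<std::string, int> ids;      // object name -> ID
    int id(const std::string& name) const {
        auto it = ids.find(name);
        if (it == ids.end()) throw std::runtime_error("unknown name: " + name);
        return it->second;
    }
};

struct Repository {
    std::map<int, std::string> objects;  // ID -> stored object description
    const std::string& extract(int id) const {
        auto it = objects.find(id);
        if (it == objects.end()) throw std::runtime_error("unknown ID");
        return it->second;
    }
};

// Two-step lookup: name -> ID (catalog), then ID -> object (repository).
const std::string& lookup(const SelectionCatalog& sc, const Repository& repo,
                          const std::string& name) {
    return repo.extract(sc.id(name));
}
```

Separating the catalog from the repository means a human-readable name can be re-pointed to a different object ID without touching the stored objects themselves.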

Page 29: ATLAS Distributed Analysis

Demo Example

• Objects:

papp - pointer to the current application
ptsk - pointer to the current task
pdst - pointer to the current dataset

• Can be displayed

root [4] pprint(papp)    // display the application

root [5] pprint(ptsk)    // display the task

root [6] pprint(pdst)    // display the dataset

Page 32: ATLAS Distributed Analysis

More Examples

There are more examples:

• Demo5 uses esd2aod to create AOD from ESD using the prodsys transformation

• Demo6 uses atlasopt to run a job with provided job options

• Demo7 uses atlasdev to run a job based on a user's ATLAS development area

• Demo8 uses atlasdev-src to run a job based on a tarball of a user development area

Page 33: ATLAS Distributed Analysis

More Examples

• Displaying the status of all catalogs to verify connection and see the size of each:

root [4] show_catalogs()

Page 34: ATLAS Distributed Analysis

More Examples

• A list of available datasets may be obtained by querying the DSC (dataset selection catalog, object dsc). The DSC is the primary user interface to datasets and plays the role of what is often called a metadata catalog.

• Limit the query to 100 results (received 12).

• The query restricts the selection to TOP-level datasets, i.e. complete samples intended for user access, and then uses the name to select Rome samples with v10 reconstruction, SUSY data, using all AOD data available at BNL.

• AOD-bnl replaced with AOD to get samples available at both CERN and BNL.

• Counting datasets matching a query with the query_count method

Page 36: ATLAS Distributed Analysis

More Examples

• The DSC supports a list of parameters that can be used in the selection of datasets

Page 37: ATLAS Distributed Analysis

More Examples

• List the attributes for a given dataset
• Record the ID and fetch the dataset from the repository

Page 39: ATLAS Distributed Analysis

More Examples

• Select an application in a similar way

Page 40: ATLAS Distributed Analysis

More Examples

• Select a task in a similar way

Page 41: ATLAS Distributed Analysis

More Examples

• The application is usually not modified, but the task will very likely need to be modified

• Extract the files from the task

Page 42: ATLAS Distributed Analysis

More Examples

• The list of jobOptions can be found in the CVS repository at atlas/PhysicsAnalysis/AnalysisCommon/AnalysisExamples/share/

Page 43: ATLAS Distributed Analysis

More Examples

• Now it is possible to build a new task from the modified files:

ptsk = new dial::Task("atlas_release jo.py output_content", "mytask");

The list of files used to construct the task may be replaced with "*" if you want all the files from the directory

• Now papp, ptsk and pdst are defined, and the job can be submitted

Page 46: ATLAS Distributed Analysis

More on More Examples

• It is not necessary to do a lot of typing (as we did before) to repeat the previous analysis

• There is a simple way to avoid this: a job definition script that defines the application, task and dataset (variables papp, ptsk and pdst).

• A sample script can be found here: http://www.usatlas.bnl.gov/~dladams/dial/releases/1.20/jobdef.C

• The sample script is copied into the local directory when the dialroot files are installed (dialroot -i).

• Edit the top part of this script to specify the application, task and dataset of interest.

• Run:

root [0] .x jobdef.C
root [1] submit()
...

Page 47: ATLAS Distributed Analysis

More on More Examples

void jobdef() {
  // Specify names for the application, task and dataset.
  // Typical job definition is created by changing these values.
  // Depending on the following code, a name may be interpreted as
  // one or more of the following.
  // 1. ID: Object identifier.
  // 2. name: Object name in the default selection catalog.
  // 3. directory: Name of a directory holding files to be used
  //    to construct the object.
  // 4. xml: Name of a file holding the XML description of the object.

  // Application: directory, name, or ID.
  string aname = "atlasopt";
  // Task: directory, xml, name, or ID.
  string tname = "atlasopt_example_zll-10.0.1";
  // Dataset: ID or name.
  string dname = "hma.dc2.003007.digit.A1_z_ee.aod-1000.10files";
  …….

Page 48: ATLAS Distributed Analysis

More on More Examples

• There is a web interface, the “DIAL CATALOG QUERY PAGE”: http://www.atlasgrid.bnl.gov/dialds/dlShowMain-new.pl

Page 49: ATLAS Distributed Analysis

More on More Examples

• This interface allows switching between:
– Dataset Selection Catalog (DSC)
– Task Selection Catalog (TSC)
– Application Selection Catalog (ASC)

• A list of available datasets may be obtained from the DSC query page

• Some useful applications and example tasks are cataloged as well. The application and task catalogs may also be examined using the ASC query page and the TSC query page

Page 56: ATLAS Distributed Analysis

More on Transformations

• Transformations are applied to datasets to produce new datasets.

• A transformation includes:
– an application which carries out the processing
– a task used to configure the application

• An application provides two entry points: one to build (e.g. compile) a task and one to process a dataset

• A task is a collection of named text files

• It is not sensible to arbitrarily combine any task with any application.

Page 57: ATLAS Distributed Analysis

More on Transformations

• There is a task interface that specifies which files must or may be present in a task and how these files are to be used.

• Tasks are labeled with the interface they provide and applications with the task interface they expect.

• Task interfaces:
– atlas_release
– atlas_job_options
– atlas_simple_analysis
– atlas_user_analysis
– atlas_developer_directory
– atlas_developer
– atlas_xform

Page 58: ATLAS Distributed Analysis

More on Transformations

• The task interface atlas_release specifies an ATLAS release

• List of files for atlas_release:
– atlas_release - ATLAS release version, e.g. 10.0.1.

• The task interface atlas_job_options specifies an ATLAS release, job options and output content

• List of files for atlas_job_options:
– atlas_release - ATLAS release version, e.g. 10.0.1.
– jo.py - User job options.
– output_content - describes the output to be saved (content label and name, e.g. HIST hist.root)
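
Because a task is a collection of named files and a task interface says which files belong in it, a task can be checked against an interface before submission. The sketch below uses the file lists from this slide. Treating every listed file as required is an assumption, and interface_files and missing_files are hypothetical helpers, not part of DIAL.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// File lists taken from the slide; treating every listed file as required
// is an assumption made for this sketch.
const std::map<std::string, std::vector<std::string>>& interface_files() {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"atlas_release", {"atlas_release"}},
        {"atlas_job_options", {"atlas_release", "jo.py", "output_content"}},
    };
    return table;
}

// Return the files an interface lists that are absent from the task.
std::vector<std::string> missing_files(const std::string& interface_name,
                                       const std::set<std::string>& task_files) {
    std::vector<std::string> missing;
    auto it = interface_files().find(interface_name);
    if (it == interface_files().end()) return missing;  // unknown interface: nothing to check
    for (const std::string& f : it->second)
        if (task_files.count(f) == 0) missing.push_back(f);
    return missing;
}
```

An empty return value means the task provides every file the interface lists. A fuller check would also need to know which files are optional and handle wildcard entries such as *.h.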

Page 59: ATLAS Distributed Analysis

More on Transformations

• The ADA task interface atlas_user_analysis specifies an ATLAS release and files to replace those in the UserAnalysis package

• List of files for atlas_user_analysis:
– atlas_release - ATLAS release version, e.g. 10.0.1.
– *.h - header files.
– *.cxx - C++ source files
– requirements - CMT requirements file
– AnalysisSkeleton_jobOptions.py - job options file
– output_content - describes the output to be saved

Page 60: ATLAS Distributed Analysis

Documentation

• The ADA system is described on the ADA home page
– http://www.usatlas.bnl.gov/ADA

• This page also has a link to the DIAL 1.20 release page

Page 61: ATLAS Distributed Analysis

Conclusion

• ADA now makes it possible to perform distributed analysis for the ATLAS experiment

• The available documentation allows newcomers to start using ADA

• Further development (especially user-oriented) will allow wider adoption of ADA among physicists