The Computing Infrastructure for the LHC: The ATLAS Point of View
Transcript of the presentation "L'infrastructure de calcul pour le LHC, le point de vue d'ATLAS"
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
The Computing Infrastructure for the LHC: The ATLAS Point of View
Simone Campana
CERN IT/GS
ATLAS Event Data Model
• RAW (1.6 MB/ev), Raw Data: output of the Event Filter Farm (HLT) in byte-stream format
• ESD (1 MB/ev), Event Summary Data: output of the event reconstruction (tracks, hits, calorimeter cells and clusters, combined reconstruction objects, etc.). Used for calibration, alignment, refitting ...
• AOD (150 KB/ev), Analysis Object Data: reduced representation of the events, suitable for analysis. Reconstructed "physics objects" (electrons, muons, jets ...)
• DPD (20 KB/ev), Derived Physics Data: reduced information for ROOT-specific analysis
ATLAS Tiers Organization
"Tier Cloud Model". Unit: 1 T1 + n T2/T3.
All Tier-1s have a predefined (software) channel with CERN and with each other Tier-1. Tier-2s are associated with one Tier-1 and form a "cloud". Tier-2s have a predefined channel with their parent Tier-1 only.
[World map of ATLAS sites: the Tier-1s (NG, LYON, BNL, FZK, TRIUMF, ASGC, PIC, SARA, RAL, CNAF, plus CERN) and their associated Tier-2/Tier-3 sites (Tokyo, GRIF, Beijing, Clermont, LAPP, CPPM, Romania, NET2, NW, SW, GL, SLAC, TWT2, Melbourne, ...), with the FR cloud and the BNL cloud highlighted as examples.]
Detector Data Distribution
Original processing: T0 -> T1
• Raw data: mass storage at CERN
• Raw data: Tier-1 centers; the complete dataset is distributed among the T1s
• ESD: Tier-1 centers; 2 copies of the ESD distributed worldwide
• AOD: each Tier-1 center; 1 full set per T1
• T2: 100% of the AOD, a small fraction of ESD and RAW
[Map of the transfer paths from CERN (T0) to the Tier-1s: NG, LYON, BNL, FZK, TRIUMF, ASGC, PIC, SARA, RAL, CNAF.]
Reprocessed Data Distribution
Reprocessing: T1 -> T1
• Each T1 reconstructs its own RAW
• Produces new ESD, AOD
• Ships:
  – ESD to the associated T1
  – AOD to all other T1s
[Map of the transfer paths between the Tier-1s: NG, LYON, BNL, FZK, TRIUMF, ASGC, PIC, SARA, RAL, CNAF.]
ATLAS Tier-2 Activities
• Monte Carlo production (ESD, AOD)
  – Ships RAW, ESD, AOD to the associated T1
• Physics analysis
  – Gets (ESD) AOD from the associated T1
[Map of example Tier-2 sites (Tokyo, GRIF, Beijing, Clermont, Romania) and their associated Tier-1s.]
ATLAS and Grid Middleware
• ATLAS resources are distributed across different Grid infrastructures: EGEE, OSG, NorduGrid
• Most of the Grid services are shared across the different Grids:
  – SRM interface for Storage Elements, with different backend storage implementations
  – LCG File Catalog (LFC): at all ATLAS T1s, holds the information on file replicas in the cloud
  – File Transfer Service (FTS) at every T1: baseline transfer service to import data at any site of the cloud
  – VOMS: to administer VO membership
  – CondorG: for job dispatching
• The ATLAS computing framework guarantees Grid interoperability
The DDM in a nutshell
The Distributed Data Management system ...
• ... enforces the concept of dataset
  – Logical collection of files
  – Dataset contents and locations are stored in central catalogs
  – File information is stored in local File Catalogs (LFC) at the T1s
• ... is based on a subscription model (sketched below)
  – Datasets are subscribed to sites
  – A series of services enforce the subscription:
    • look up the data location in the LFC
    • trigger the data movement via FTS
    • validate the data transfer
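A minimal, self-contained sketch of that subscription flow, assuming hypothetical stand-ins (lfc_replicas, fts_transfer) for the LFC lookup and the FTS call; it only illustrates the look-up / move / validate sequence and is not the actual ATLAS DDM (DQ2) code.

```python
# Minimal, illustrative sketch of the DDM subscription flow described above.
# lfc_replicas() and fts_transfer() are hypothetical stand-ins for the LFC lookup
# and the FTS call; this is NOT the actual ATLAS DDM (DQ2) code.
from typing import Dict, List

def lfc_replicas(lfn: str) -> List[str]:
    """Hypothetical stand-in: return the sites currently holding a replica of this file."""
    return ["CERN"]

def fts_transfer(lfn: str, source: str, destination: str) -> bool:
    """Hypothetical stand-in: trigger an FTS transfer and report whether it validated."""
    return True

def enforce_subscription(dataset_files: List[str], destination: str) -> Dict[str, bool]:
    """Apply one dataset subscription: look up, move and validate every file."""
    status: Dict[str, bool] = {}
    for lfn in dataset_files:
        sources = lfc_replicas(lfn)          # 1. look up the data location in the LFC
        if destination in sources:
            status[lfn] = True               # replica already at the destination
            continue
        # 2. trigger the data movement via FTS; 3. validation is folded into the call here
        status[lfn] = fts_transfer(lfn, sources[0], destination)
    return status

# Example: subscribe a two-file dataset to the LYON Tier-1
print(enforce_subscription(["RAW.file.0001", "RAW.file.0002"], "LYON"))
```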
Testing Data Distribution: CCRC08
• Week 1: Data Distribution Functional Test
  – make sure all files get where we want them to go
  – between Tier-0 and the Tier-1s, for disk and tape
• Week 2: Tier-1 to Tier-1 tests
  – similar rates as between Tier-0 and Tier-1
  – more difficult to control and monitor centrally
• Week 3: Throughput test
  – try to maximize throughput while still following the model
  – Tier-0 to Tier-1 and Tier-1 to Tier-2
• Week 4: Final test, all of the above together
  – plus artificial extra load from simulation production
Week-4: Full Exercise
Transfer ramp-up
Test of backlog recovery: data were first generated over 12 hours and then subscribed in bulk.
The 12-hour backlog was recovered in 90 minutes!
[Plot: T0->T1s throughput in MB/s.]
Week-4: T0->T1s data distribution
• Suspect datasets: the dataset is complete (OK) but suffered from the double-registration problem
• Incomplete datasets: effect of the power cut at CERN on Friday morning
Week-4: T1-T1 transfer matrix
• YELLOW boxes: effect of the power cut
• DARK GREEN boxes: double-registration problem
• Compared with week-2 (3 problematic sites), a very good improvement
Week-4: T1->T2s transfers
• SIGNET: ATLAS DDM configuration issue (LFC vs RLS)
• CSTCDIE: joined very late; still a prototype
• Many T2s were oversubscribed (they should get 1/3 of the AOD)
Throughputs
• T1->T2 transfers show a time structure: datasets are subscribed upon completion at the T1 and every 4 hours
• T0->T1 transfers: problem at the load generator on the 27th; power cut on the 30th
[Plots: throughput in MB/s versus time, with the expected rate indicated.]
Week-4: Concurrent Production
[Plots: number of running jobs and number of jobs per day.]
Week-4: metrics
• We said:
  – T0->T1: sites should demonstrate that they can import 90% of the subscribed datasets (complete datasets) within 6 hours from the end of the exercise
  – T1->T2: a complete copy of the AODs at the T1 should be replicated among the T2s within 6 hours from the end of the exercise
  – T1-T1 functional challenge: sites should demonstrate that they can import 90% of the subscribed datasets (complete datasets) within 6 hours from the end of the exercise
  – T1-T1 throughput challenge: sites should demonstrate that they can sustain the rate of reprocessing at nominal rate, i.e. F * 200 Hz, where F is the MoU share of the T1 (worked example below)
• Every site (cloud) met the metrics
  – despite the power cut
  – despite the "double registration" problem
  – despite the competition from production activities
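As a rough illustration of the throughput metric (the 10% share is an assumed example, not a figure from the talk): a Tier-1 with MoU share F = 0.10 has to reprocess 0.10 * 200 Hz = 20 Hz of events; with ESD at 1 MB/ev that already corresponds to about 20 MB/s of ESD shipped to the associated T1, before adding the AOD copies (150 KB/ev each) sent to all the other T1s.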
Disk Space (month)
• ATLAS "moved" 1.4 PB of data in May 2008
• 1 PB deleted in EGEE+NDGF in << 1 day; possibly another 250 TB deleted in OSG
• The deletion agent at work: it uses SRM and LFC bulk methods (sketched below). The deletion rate is more than good (but those were big files)
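A minimal sketch of what a bulk-deletion agent of this kind might look like; srm_bulk_delete and lfc_bulk_unregister are hypothetical stand-ins for the SRM and LFC bulk methods, so this is illustrative only and not the actual ATLAS deletion agent.

```python
# Illustrative bulk-deletion sketch (NOT the real ATLAS deletion agent).
# srm_bulk_delete() and lfc_bulk_unregister() are hypothetical stand-ins for the
# SRM and LFC bulk methods mentioned above.
from typing import List

BATCH_SIZE = 1000   # one bulk call per batch is what keeps the deletion rate high

def srm_bulk_delete(replicas: List[str]) -> None:
    """Hypothetical stand-in: remove a batch of physical replicas in a single SRM call."""

def lfc_bulk_unregister(replicas: List[str]) -> None:
    """Hypothetical stand-in: drop the corresponding catalogue entries in one LFC call."""

def delete_replicas(replicas: List[str]) -> int:
    """Delete all given replicas in bulk batches, keeping storage and catalogue consistent."""
    for start in range(0, len(replicas), BATCH_SIZE):
        batch = replicas[start:start + BATCH_SIZE]
        srm_bulk_delete(batch)        # physical deletion at the storage element
        lfc_bulk_unregister(batch)    # then remove the entries from the LFC
    return len(replicas)

print(delete_replicas([f"srm://some-se/atlas/file{i}" for i in range(2500)]), "replicas deleted")
```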
Lessons learned from CCRC08
• The Data Distribution framework seems in good shape and ready for data taking
• A few things need attention:
  – FTS servers at the T1s need a global tuning of their parameters
  – Some bugs were found in the ATLAS DDM services (now fixed)
  – In at least 3 cases, a network problem or inefficiency was discovered
• Monitoring ...
A few words about the FDR
• FDR = Full Dress Rehearsal
  – Tests the full chain, from the HLT to the analysis at the T2s
  – The same set of Monte Carlo data (approx. 8 TB) in byte-stream format is injected every day into the T0 machinery
  – Data (RAW and reprocessed) are distributed and handled as real data
• FDR2 data exports (June 2008)
  – Much less challenging than CCRC08 in terms of distributed computing: 6 hours of data per day to be distributed within 24 h
  – Three days of RAW data were distributed in less than 4 hours
  – All datasets (RAW and derived) complete at every T1 and T2 (one exception for a T2)
Data Export after CCRC08 and FDR
• Data Distribution functional test
  – To test data transfers:
    • Tier-0 to all Tier-1s, tape and disk (RAW, ESD, AOD)
    • all Tier-1s to all other Tier-1s (AOD, DPD)
    • each Tier-1 to all Tier-2s in the same cloud (AOD, DPD)
    • muon calibration streams from Tier-0 to some special Tier-2s
  – Completely automated:
    • at 5% of the nominal rate, with fake data generated from the T0
    • starts every Monday at midday, stops the next Sunday at midnight
    • central deletion of the test data everywhere
    • weekly statistics reports
• Data taking
  – Mostly cosmics ...
  – RAW data exported to the T1s (for custodial storage)
  – ESD exported to 2 T1s, following the Computing Model
  – Some data kept permanently on disk at CERN
Activity after CCRC08
Most inefficiencies were due to scheduled downtimes.
Detector Data Replication
Simulation Production
• Bursty activity, mainly depending on software readiness
• Main samples: fdr2, 10 TeV, 900 GeV and validations
• Runs in Tier-2s but also in Tier-1s
  – no competition yet with analysis (T2) and reprocessing (T1)
• Average of 10k simultaneous jobs, peaks of 25k jobs
• All production is now submitted through the Panda system
Monte Carlo Production
[Architecture diagram: job definitions sit in ProdDB; Bamboo pulls them and feeds the Panda server; a condor-g / gLite scheduler places pilots on the worker nodes at each site (site A, site B); the pilots pull jobs from the server over https and run them.]
Panda in a nutshell
• Job definitions are hosted in the Production Database (ProdDB)
• The "Bamboo" agent polls jobs from ProdDB and feeds the Panda server
• The Panda server manages all job information centrally
  – priority control
  – resource allocation
  – job scheduling
• A job scheduler dispatches pilot jobs to the sites
  – using various mechanisms: local batch system commands, gLite WMS, CondorG
  – pilot jobs are pre-scheduled to the Grid sites
  – pilots pull "real jobs" from the Panda server as soon as suitable CPUs become available (sketched below)
• Output data are aggregated at the T1s using DDM
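A minimal, self-contained sketch of that pull model; the Panda server is modelled by a simple in-memory queue and the payload is a trivial command, so this illustrates the pattern only and is not the real Panda or pilot code.

```python
# Illustrative sketch of the pilot pull model; the Panda server is modelled by a
# simple in-memory queue, so this is NOT the real Panda or pilot code.
import queue
import subprocess

panda_server = queue.Queue()                 # stand-in for the central Panda server
panda_server.put(["echo", "simulating event batch 1"])
panda_server.put(["echo", "simulating event batch 2"])

def pilot(worker_name: str) -> None:
    """A pilot lands on a worker node, then pulls and runs jobs while any remain."""
    while True:
        try:
            job = panda_server.get_nowait()  # pull a "real job" only when a CPU is free
        except queue.Empty:
            print(f"{worker_name}: no more jobs, pilot exits")
            return
        print(f"{worker_name}: running {job}")
        subprocess.run(job, check=True)      # execute the payload on the worker node

pilot("worker-node-01")
```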
Simulation Production
[Plots: running jobs (monthly statistics), number of jobs per day, and errors.]
Simulation Production Functional Test
• Submits one real MC task as a test to each cloud every Monday
  – 5000 events, 25 events/job: 200 jobs of ~6 hours each
  – jobs should run in each of the Tier-2s (and the Tier-1) of the cloud
  – low priority, so as not to interfere with real production
• The task is aborted on Thursday
  – remaining jobs are killed and all output is removed
  – statistics are generated: efficiency, brokering, problem sites
Reprocessing
• Reprocessing is "just" a special case of a production system job
  – handled by Panda
  – runs at the T1s only (to first approximation)
• However ...
  – it needs to prestage the files (RAW data) from tape at the T1s (see the sketch after this list)
  – it needs to access the detector conditions data on the Oracle racks at the T1s
• Current issues:
  – pre-staging is still not quite working
    • the software exists and is being tested
    • every T1 has a different storage setup, performance, etc.
  – conditions database access is not quite working yet
    • each job opens several connections to the database at the beginning of the job
    • too many concurrent and simultaneous jobs overload the database; being investigated
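A minimal sketch of that prestage-then-process pattern, with the tape recall and the conditions database replaced by trivial stand-ins and an explicit cap on concurrent DB connections (the overload issue above); this is illustrative only, not the actual ATLAS reprocessing machinery.

```python
# Illustrative prestage-then-process pattern for reprocessing RAW data at a T1.
# The tape system and the conditions database are trivial stand-ins; this is NOT
# the actual ATLAS reprocessing machinery.
from typing import Dict, List

MAX_DB_CONNECTIONS = 5        # cap on concurrent conditions-DB connections (the overload issue)
open_db_connections = 0

def bring_online(raw_files: List[str]) -> List[str]:
    """Stand-in for bulk SRM bring-online requests: recall all RAW files from tape first."""
    return list(raw_files)    # pretend everything staged successfully

def read_conditions() -> Dict[str, str]:
    """Stand-in for the Oracle conditions-DB access, guarded by a connection budget."""
    global open_db_connections
    if open_db_connections >= MAX_DB_CONNECTIONS:
        raise RuntimeError("conditions DB overloaded")   # what happens without throttling
    open_db_connections += 1
    try:
        return {"alignment": "v2", "calibration": "v7"}
    finally:
        open_db_connections -= 1

def reprocess(raw_file: str) -> str:
    """Reconstruct one staged RAW file into a new ESD, using the conditions data."""
    conditions = read_conditions()
    return raw_file.replace("RAW", "ESD") + f" (alignment {conditions['alignment']})"

staged = bring_online([f"run90272.RAW._{i:04d}" for i in range(3)])  # 1. prestage from tape
for f in staged:                                                     # 2. then reconstruct
    print(reprocess(f))
```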
Analysis
• The ATLAS analysis model is "jobs go to data"
  – Analysis mostly runs on DPDs and AODs
  – Initially, large access to ESD and possibly RAW
• Currently, 2 frameworks for analysis: Ganga and pAthena
  – Both fully integrated with ATLAS DDM for data co-location
  – They will possibly be merged into a single tool
  – There is now a single support team
Ganga
• Client-based analysis framework
  – Central core component
  – Multiple plug-ins to benefit from various job submission systems (illustrated below):
    • gLite WMS
    • CondorG
    • Local batch systems (LSF, PBS)
• Multi-VO project
• Analysis functional tests
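The core-plus-plug-ins idea can be sketched generically as below; the class names are invented for illustration and do not reflect the actual Ganga API.

```python
# Generic illustration of the "core + submission plug-ins" design; NOT the Ganga API.
from abc import ABC, abstractmethod

class Backend(ABC):
    """A job submission plug-in (gLite WMS, CondorG, a local batch system, ...)."""
    @abstractmethod
    def submit(self, executable: str) -> str: ...

class LocalBatch(Backend):
    def submit(self, executable: str) -> str:
        return f"submitted {executable} to the local batch system"

class GliteWMS(Backend):
    def submit(self, executable: str) -> str:
        return f"submitted {executable} via the gLite WMS"

class Job:
    """The core component: identical user interface whatever backend is plugged in."""
    def __init__(self, executable: str, backend: Backend):
        self.executable = executable
        self.backend = backend

    def submit(self) -> str:
        return self.backend.submit(self.executable)

# The same analysis job can be sent to different infrastructures by swapping the plug-in.
print(Job("analysis.py", LocalBatch()).submit())
print(Job("analysis.py", GliteWMS()).submit())
```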
pAthena
• Server-based analysis framework
  – Full usage of the Panda infrastructure
  – Very advanced monitoring
  – Offers job prioritization and user shares
[Plots: monitoring per user; worldwide pAthena activity over the last month.]
User Storage Space
• ATLAS now uses the SRM v2 interface everywhere
  – It offers the possibility to partition the space (space tokens) depending on the use case
• For central activities
  – DATADISK and DATATAPE for real data
  – MCDISK, MCTAPE and PRODDISK for simulation production
• For group analysis (GROUPDISK)
  – Ideally, quota management per group
  – In reality, only a global quota and little possibility to configure group-based ACLs; needs policing
• For user analysis
  – USERDISK: scratch space for job output, lifetime cannot be guaranteed
  – LOCALGROUPDISK: not ATLAS-pledged resources, "home" space for users; same limitations as GROUPDISK
Experience from one week of beam data
Day 1: we were ready
Data arrived …
We started exporting … and we saw issues.
Effect of concurrent data access from centralized transfers and user activity (overload of a disk server).
[Plots: data export throughput in MB/s; number of errors.]
Conclusions
• Computing for an LHC experiment is extremely challenging
  – Very demanding use cases
  – The system is complex and relies on many external components
• Centralized data distribution works reliably
  – Tested in many challenges and in real life
• The Monte Carlo production framework is also reliable
  – But this is not yet true for data reprocessing
  – Database access and data prestaging need attention
• Data analysis by users is the real challenge now
  – It does not follow a particular pattern (non-organized by definition)
  – It is not always possible to protect production from users, or users from other users
  – It has never been "tested" at the real scale
• The EGEE Grid offers the necessary baseline services and infrastructure for ATLAS data taking
  – Improvements in the storage area are foreseen in the near future, based on the experiments' input and lessons learned