ADC Weekly Meeting, May 8 2012: Annecy 2012 Technical Interchange Meeting Highlights


Page 1

ADC Weekly Meeting, May 8 2012

Annecy 2012 Technical Interchange Meeting Highlights

Simone Campana – CERN IT/ES

Page 2

Introduction

• 2-day meeting (Wed PM to Fri AM)
• 4 sessions
  – Data Management
  – Production System
  – Analysis
  – Networking
• Plus one session with invited speakers
  – Intel, EPFL (Miguel Branco)
• Many thanks to session conveners for the material

Page 3

TIM April 2012
Data Management Session Highlights and Action Items

for ADC weekly, CERN, 8 May 2012

S. Campana, V. Garonne, I. Ueda

Page 4

Data Management

• Storage Federations
  – xrootd is the only realistic solution for the medium term
  – Use case focuses on failover for data access
    • More advanced use cases can be explored in future (“repairing” data, file-level caching)
  – CMS experience in pre-production (failed access recovery)
    • CMS spent a lot of time (and will spend more) on CMSSW I/O tuning (reducing #reads and increasing #hits in read-ahead). Key for success in WAN access.
  – ATLAS experience in USATLAS R&D
    • Automated tools for WAN tests on top of HC
    • Integration of xroot federation with Panda is in progress
• Many open questions
  – Security, monitoring, content publication
  – MB recommended creating topical working groups
• ATLAS will try to expand the experience with xroot federations outside the US.
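The failover use case above can be pictured with a minimal sketch. It assumes two hypothetical callables, `open_local` and `open_via_federation`, standing in for the site storage and a federation redirector; none of these names come from xrootd or Panda.

```python
from typing import Callable, Iterable, Optional


class AccessError(Exception):
    """Raised by an opener when a replica cannot be read."""


def open_with_failover(lfn: str, openers: Iterable[Callable[[str], bytes]]) -> bytes:
    """Try each opener in order and return the first successful read.

    `openers` is an ordered sequence of hypothetical callables, e.g. local
    storage first and a federation redirector last.
    """
    last_error: Optional[Exception] = None
    for opener in openers:
        try:
            return opener(lfn)
        except AccessError as err:
            last_error = err  # remember the failure and try the next source
    raise AccessError(f"all sources failed for {lfn}") from last_error


# Stand-ins for real storage back ends (assumptions for illustration only).
def open_local(lfn: str) -> bytes:
    raise AccessError(f"local replica of {lfn} is unavailable")


def open_via_federation(lfn: str) -> bytes:
    return b"payload of " + lfn.encode()


data = open_with_failover("data12/file.root", [open_local, open_via_federation])
```

The job only sees an error when every source in the list has failed, which is the behaviour the failover use case asks for.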

Page 5

Data Management

• Transfer Services
  – FTS will remain the baseline Transfer Service
  – FTS3 will cure known architectural issues
    • Channel concept, plugin support for protocols
  – FTS3 prototype in June, multi-VO testing
• Point2Point protocols
  – gridFTP as baseline; new version and session reuse will help reduce overheads
  – Xrootd is an alternative. Needs to be supported on all systems (see also discussion on federations)
  – HTTP is a serious option. Needs more integration and testing
• SRM
  – Functionalities will be slowly replaced
  – Core set of functionalities will remain (access to MSS)
• Positive experience with BestMAN+gridftp+Lustre at OU SWT2
• Interesting analysis from DDM Tracer data. Further studies suggested.

Page 6

Data Management

• Rucio
  – Architecture and prototype API now available
  – Rucio demo in June, prototype in October
• Case sensitivity
  – Would like to move to case-sensitive dataset and file names in DDM (UNIX-like)
  – No strong objections from online or offline; will try to agree at June SW week
• Rucio scope
  – Proposal presented, but possible issues with the usage of “Campaigns”
  – Being rethought; DDM team will present a new proposal soon
• Naming convention for files at sites in Rucio
  – Controversial discussion (less intuitive organization of files at sites for local access)
  – Being re-iterated within ADC and with Data Prep and Phys Coord (ICB?)
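To illustrate what moving to case-sensitive names changes, the sketch below compares a case-insensitive catalogue with a case-sensitive one. The dataset names and the dictionary-based catalogue are made up for illustration and are not the DDM or Rucio API.

```python
def register(catalogue: dict, name: str, case_sensitive: bool) -> bool:
    """Register `name`; return False if it collides with an existing entry."""
    key = name if case_sensitive else name.lower()
    if key in catalogue:
        return False
    catalogue[key] = name
    return True


names = ["user.jdoe.Test.v1", "user.jdoe.test.v1"]  # hypothetical dataset names

insensitive: dict = {}
sensitive: dict = {}
insensitive_results = [register(insensitive, n, case_sensitive=False) for n in names]
sensitive_results = [register(sensitive, n, case_sensitive=True) for n in names]

print(insensitive_results)  # [True, False]
print(sensitive_results)    # [True, True]
```

Under the UNIX-like rule the two names are distinct datasets, whereas a case-insensitive catalogue rejects the second one as a duplicate.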

Page 7

TIM April 2012
ProdSys Session Highlights and Action Items

for ADC weekly, CERN, 8 May 2012

K. De, A. Filipcic, A. Klimentov, R. Walker and A. Vaniachine

Page 8

Production System and Grid Data Processing

• Progress since TIM in Dubna
  – APF status
    • PY factory to be replaced
    • Still manual config files
    • Pending integration with AGIS
  – Fair share policy implementation
  – HLT Task Request
    • Real time definition of tasks and jobs
  – Multi-cloud production widely used
  – Tier-2s usage
• Short term plans
  – Job submission vs resource heterogeneity
  – AKTR et al. overload
    • Processing 10+k task requests with 90+k output datasets
  – Previous overload happened about a year ago, at the time of TIM in Dubna
    • Not clear why these rare events (overload and TIM) are correlated in time
• Monitoring and better integration with SSB

Alexei Klimentov – TIM Highlights

Page 9

Dynamic Job Definition (JEDI)

• JEDI core foundations
  – No predefined (and pre-assigned) jobs
  – Task Request: database templates
  – “Late” dataset registration
  – Reassessment of PandaDB and ProdDB
    • Understand benefits of redundancy
  – Separation of concerns
  – Task post-processing
• If you do not like the name “JEDI”, the alternative is “PDJD” …
  – Panda Dynamic Job Definition
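One way to read “no predefined jobs” is that a task only records its inputs, and jobs are cut at dispatch time to fit whatever resource asks for work. The sketch below shows that idea; the `Task` class, the `files_per_job` parameter, and the splitting rule are assumptions for illustration, not the JEDI design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A task holds only its unprocessed input files, not a fixed job list."""
    name: str
    pending_files: List[str] = field(default_factory=list)

    def next_job(self, files_per_job: int) -> List[str]:
        """Define a job on request, sized for the resource that asked."""
        job_files = self.pending_files[:files_per_job]
        self.pending_files = self.pending_files[files_per_job:]
        return job_files


task = Task("reco_example", [f"file_{i}.root" for i in range(5)])
small_site_job = task.next_job(files_per_job=1)  # a small resource takes one file
large_site_job = task.next_job(files_per_job=3)  # a larger one takes three
remaining = task.pending_files
```

Because nothing is pre-assigned, the leftover file stays with the task and can go to whichever resource asks next.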

Page 10

Dynamic Evolution for Tasks (DEfT)

• Rate of task requests grows exponentially
  – Linear growth in users and support requests
• Growing list of requirements and use cases
  – New use cases: HLT, FTK, user analysis tasks
• First ideas about new architecture and how JEDI and DEfT will be developed
  – ProdSys technical meeting in Ljubljana (June 2012) to discuss JEDI and DEfT development

Page 11

ProdSys session II

• Rucio/DDM and ProdSys/PanDA overlaps
  – What we want to keep and what we want to drop
• Multi-core jobs
  – Ready for full Grid Production in simple scenario
• glideinWMS studies
  – Work in progress to find limits in various components

5/3/2012

Page 12

TIM April 2012
Distributed Analysis Session Highlights and Action Items

for ADC weekly, CERN, 8 May 2012

F. Barreiro, D. Benjamin, D. Van Der Ster

Page 13

ATLAS&CMS Common Analysis Framework

• Initiative from CERN IT-ES, ATLAS and CMS
  – Assess potential of using a common analysis solution based on PanDA and glideinWMS
• Currently at the end of the Feasibility Study: http://cern.ch/go/9mNC
  1. Compare and analyze experiments’ workflows and architectures
     • Identify dependencies, what can be reused and potential show-stoppers
  2. Study and compare sub-components: server sides, PanDA pilot and pilot factories, glideinWMS
  3. Evaluate integration scenarios for PanDA and glideinWMS ensuring no loss of functionality
  4. Prepare final document with conclusions and proposal for Proof-of-Concept
     • To be validated by the experiments
     • In case of green light, used as input for coming Functionality and Operations Studies

Page 14

Improving Job Efficiency

• Server Side Retries
  – Only 20% of failures are “retriable”
    • Normally OK at 2nd attempt, 3rd attempt useless
  – Non-retriable failures are mostly “athena”
    • Well… something else, but masked by athena.
  – Work will be done to account for those properly
• proot
  – Main goal is to catch failures and categorize them properly (besides correctly setting the root env)
  – This is difficult if you do not “own” the event loop
  – So an EventLoop package and its grid driver have now been developed
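The retry numbers above suggest a simple server-side policy: retry only failures classified as retriable, and stop after the second attempt. A minimal sketch follows; the error category names and the `run_job` callable are hypothetical, not Panda error codes.

```python
from typing import Callable, Tuple

# Hypothetical categories; real classification would come from job error codes.
RETRIABLE = {"stage_in_timeout", "lost_heartbeat"}
MAX_ATTEMPTS = 2  # the slide reports that a 3rd attempt is useless


def run_with_retry(run_job: Callable[[int], str]) -> Tuple[str, int]:
    """Run a job, retrying at most once and only for retriable failures.

    `run_job(attempt)` returns "ok" or an error category string.
    Returns the final status and the number of attempts used.
    """
    status = "not_run"
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = run_job(attempt)
        if status == "ok" or status not in RETRIABLE:
            return status, attempt
    return status, MAX_ATTEMPTS


def flaky(attempt: int) -> str:
    """Fails once with a transient error, then succeeds."""
    return "stage_in_timeout" if attempt == 1 else "ok"


def broken(attempt: int) -> str:
    """Fails in a way that is not worth retrying."""
    return "athena_crash"


flaky_result = run_with_retry(flaky)
broken_result = run_with_retry(broken)
```

Failures outside the retriable set return after the first attempt, so the policy spends no extra slots on the majority of failures that a retry cannot fix.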

Page 15

Server Side Tasks

• Current issues
  – Many actions today happen client side => slowness
    • Data discovery, job splitting, DS registration, retry
  – No task concept in Panda => complicated bookkeeping
    • User interest is in task rather than subjobs
• Start moving client functionalities to server side
  – Simplify client tools, centralize functionalities, improve bookkeeping
  – Introduce Task concept in Panda (Task/Jobset table)
  – Modify clients to submit tasks/Jobsets (instead of subjobs)
  – Implement subjob definition server side
  – Evolve Panda server to handle subjobs and task/jobdef synchronization in DB
• Change bookkeeping tools
  – Interact with task/jobdef table directly
  – Send retry commands to be executed by the server
• Move toward server-side task management
  – Straightforward once job submission is moved server-side
  – Missing piece is task chaining
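The task concept can be sketched as a server-side record that owns its subjobs, so the client submits one task and bookkeeping queries one object. The class and field names below are illustrative assumptions and do not reflect the actual Panda Task/Jobset tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServerTask:
    """Hypothetical server-side task that splits inputs into subjobs itself."""
    task_id: int
    input_files: List[str]
    files_per_subjob: int = 2
    subjob_status: Dict[int, str] = field(default_factory=dict)
    subjob_files: Dict[int, List[str]] = field(default_factory=dict)

    def define_subjobs(self) -> None:
        """Job splitting happens on the server, not in the client."""
        step = self.files_per_subjob
        for index, start in enumerate(range(0, len(self.input_files), step)):
            self.subjob_files[index] = self.input_files[start:start + step]
            self.subjob_status[index] = "defined"

    def retry_failed(self) -> List[int]:
        """Handle a client's retry command by resetting failed subjobs."""
        failed = [i for i, s in self.subjob_status.items() if s == "failed"]
        for i in failed:
            self.subjob_status[i] = "defined"
        return failed

    def summary(self) -> Dict[str, int]:
        """Task-level bookkeeping: count subjobs per status."""
        counts: Dict[str, int] = {}
        for status in self.subjob_status.values():
            counts[status] = counts.get(status, 0) + 1
        return counts


task = ServerTask(task_id=1, input_files=["a", "b", "c", "d", "e"])
task.define_subjobs()
task.subjob_status[0] = "finished"
task.subjob_status[1] = "failed"
summary_before = task.summary()
retried = task.retry_failed()
```

The client only ever sends the task and a retry command; splitting, status tracking, and resubmission all stay with the server-side object.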

Page 16

Pilot Plans/Ideas

• Moving to “experiments” plugins
  – Refactor/clean pilot code
  – Provides a better platform for many contributors
• Job recovery simplified
  – Could be used outside US (UK interest)
  – Could be used for analysis (to be evaluated)
• StageIN/OUT
  – StageOUT retry to the T1 (instead of local): under development
  – StageIN retry from another source: leverage xrootd federation
• ErrorDiagnostic class in development and DEBUG mode for pilots
  – Avoid “grepping” logfiles, modularize etc. …
  – Peeking capability
• Many others … help needed.
  – Common solution initiative should bring in more contributors

Page 17

Conclusions

• A very productive workshop
  – Some subjects probably deserved a bit more time for discussion
• ADC software is by no means “frozen”
  – Needs to keep up with the demand
• Strong focus on commonalities for long-term sustainability
• Several ideas/plans will be followed up in the next months in ADCDev and ADCOps
  – Plus dedicated workshops (e.g. ProdSys in Ljubljana)