EGEE – A Large-Scale Production Grid Infrastructure


Page 1: EGEE – A Large-Scale Production Grid Infrastructure

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE – A Large-Scale Production Grid Infrastructure

Erwin Laure

EGEE Technical Director

CERN, Switzerland

Page 2: EGEE – A Large-Scale Production Grid Infrastructure

GGF 16 - February 2006

Contents

• The EGEE Project

• Operational Aspects

• Application Usage

• Interoperability, Standardization

• Outlook

Page 3: EGEE – A Large-Scale Production Grid Infrastructure


The EGEE project

• Objectives:
  – Large-scale, production-quality infrastructure for e-Science
  – Leveraging national and regional grid activities worldwide; consistent, robust and secure
  – Improving and maintaining the middleware
  – Attracting new resources and users from industry as well as science
• EGEE: 1 April 2004 – 31 March 2006
  – 71 leading institutions in 27 countries, federated in regional Grids
• EGEE-II:
  – Proposed start 1 April 2006 (for 2 years)
  – Expanded consortium: > 90 partners in 32 countries (also non-European partners)
  – Related projects (such as BalticGrid, SEE-GRID, EUMedGrid, EUChinaGrid, EELA, …)

Page 4: EGEE – A Large-Scale Production Grid Infrastructure


EGEE Infrastructure

Scale:
• > 170 sites in 39 countries
• > 17 000 CPUs
• > 5 PB storage
• > 10 000 concurrent jobs per day
• > 60 Virtual Organisations

Page 5: EGEE – A Large-Scale Production Grid Infrastructure


Applications

• > 20 supported applications from 7 domains:
  – High Energy Physics
  – Biomedicine
  – Earth Sciences
  – Computational Chemistry
  – Astronomy
  – Geophysics
  – Financial Simulation
• Another 8 applications from 4 domains are in the evaluation stage

Page 6: EGEE – A Large-Scale Production Grid Infrastructure


10,000 jobs/day

From accounting data: ~3 million jobs in 2005 so far. Sustained daily rates, per month from January to November 2005:

[2185, 2796, 7617, 10312, 11151, 9247, 9218, 11445, 10079, 11124, 9491]

~8.2 million kSI2K CPU-hours, i.e. more than 1000 CPU-years.

Real usage is higher, as accounting data was not published from all sites until recently.
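The totals quoted above can be cross-checked from the monthly rates. A minimal back-of-the-envelope sketch (an illustrative check, not part of the original slides; variable names are ours):

```python
# Cross-check of the accounting figures quoted on this slide.
# Sustained daily job rates for Jan-Nov 2005, as listed above.
daily_rates = [2185, 2796, 7617, 10312, 11151, 9247,
               9218, 11445, 10079, 11124, 9491]

avg_days_per_month = 365 / 12          # ~30.4 days per month
total_jobs = sum(r * avg_days_per_month for r in daily_rates)
print(f"estimated jobs Jan-Nov 2005: {total_jobs/1e6:.1f} million")  # ~2.9 million

# 8.2 million kSI2K CPU-hours expressed as years of one (1 kSI2K) CPU.
cpu_hours = 8.2e6
cpu_years = cpu_hours / (24 * 365)
print(f"CPU-years: {cpu_years:.0f}")   # ~936, consistent with ">1000" once
                                       # the unreported sites are added
```

Both numbers reproduce the slide's claims to within the stated precision.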

Page 7: EGEE – A Large-Scale Production Grid Infrastructure


Some example uses

[Pie chart: domain distribution of FlexX run jobs, by country – uk 8106; fr 7580; es 5122; it 3687; nl 3356; gr 2004; pl 1877; com 1072; tw 827; de 715; bg 597; cy 383; ro 337; il 263; ru 218]

LCG sustained data transfers using FTS: avg. 500 MB/s, peak 1000 MB/s

Zeus collaboration at DESY

WISDOM data challenge

[Bar chart: BIOMED jobs distribution, 2005-01 through 2005-08 – registered, successful, cancelled, and aborted jobs (0–120,000 jobs) against a run-duration estimate (0–60 years)]

Page 8: EGEE – A Large-Scale Production Grid Infrastructure


EGEE Operations Structure

• Operations Management Centre (OMC):
  – At CERN – coordination etc.
• Core Infrastructure Centres (CIC):
  – Manage daily grid operations – oversight, troubleshooting ("Operator on Duty")
  – Run infrastructure services
  – Provide 2nd-level support to ROCs
  – UK/I, France, Italy, CERN, Russia, Taipei
• Regional Operations Centres (ROC):
  – Front-line support for user and operations issues
  – Provide local knowledge and adaptations
  – One in each region – many distributed
• User Support Centre (GGUS):
  – At FZK: provides a single point of contact (service desk) and portal

Page 9: EGEE – A Large-Scale Production Grid Infrastructure


Operations Process

• CIC-on-duty (grid operator on duty):
  – 6 teams working in weekly rotation: CERN, IN2P3, INFN, UK/I, Russia, Taipei
  – Crucial in improving site stability and management
• Operations coordination:
  – Weekly operations meetings
  – Regular ROC and CIC managers' meetings
  – Series of EGEE Operations Workshops; the last one was a joint workshop with Open Science Grid
  – Brings in related infrastructure projects – a coordination point
• Geographically distributed responsibility for operations:
  – There is no "central" operation
  – Tools are developed/hosted at different sites: GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
• Procedures described in the Operations Manual
• Infrastructure planning guide (cookbook):
  – Contains information on operational procedures, middleware, certification, user support, etc.

The improvement in site stability and reliability is due to CIC-on-duty oversight and strong follow-up, Site Functional Tests, and the Information System monitor.

Page 10: EGEE – A Large-Scale Production Grid Infrastructure


Applications Example: LHC

• Fundamental activity in preparation for LHC start-up:
  – Physics
  – Computing systems
• Examples:
  – LHCb: ~700 CPU-years in 2005 on the EGEE infrastructure (CPU used: 6,389,638 hours; data output: 77 TB)
  – ATLAS: over 10,000 jobs per day

[Pie chart: LHCb CPU usage by site – top contributors: LCG.CNAF.it 13.2%, LCG.CERN.ch 11.0%, LCG.RAL.uk 9.5%, LCG.Lancashire.uk 6.8%, LCG.QMUL.uk 6.4%, LCG.NIKHEF.nl 5.1%, LCG.IN2P3.fr 4.1%, plus ~70 further DIRAC and LCG sites each below 4%]
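The LHCb figures quoted above are internally consistent. A quick conversion (an illustrative sketch, not from the slides) shows that 6,389,638 CPU-hours is indeed roughly 700 CPU-years:

```python
# Convert the quoted LHCb CPU usage into CPU-years.
cpu_hours = 6_389_638            # CPU time used by LHCb in 2005, from the slide
hours_per_year = 24 * 365        # one CPU running continuously for a year
cpu_years = cpu_hours / hours_per_year
print(f"{cpu_years:.0f} CPU-years")  # ~729, matching the "~700 CPU-years" quoted
```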

20-6-2005, P. Jenni, LCG POB: ATLAS on the LCG/EGEE Grid

"This is the first successful use of the grid by a large user community, which has, however, also revealed several shortcomings which need now to be fixed, as LHC turn-on is only two years ahead!

Very instructive comments from the user feedback have been presented at the Workshop (obviously this was one of the main themes and purposes of the meeting). All this is available on the Web."

[Charts: ATLAS and LHCb job rates – note the usage of 3 Grid infrastructures]

Page 11: EGEE – A Large-Scale Production Grid Infrastructure


Applications Example: WISDOM

• Grid-enabled drug discovery process for neglected diseases:
  – In silico docking: compute the probability that potential drugs dock with a target protein
  – Goal: speed up, and reduce the cost of, developing new drugs
• WISDOM (World-wide In Silico Docking On Malaria):
  – First biomedical data challenge
  – 46 million ligands docked in 6 weeks
  – Target proteins from the malaria parasite
  – Molecular docking applications: AutoDock and FlexX
  – ~1 million virtual ligands selected
  – 1 TB of data produced
  – 1000 computers in 15 countries – equivalent to 80 CPU-years
  – Never done for a neglected disease; never done on a large-scale production infrastructure
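From the totals above one can estimate the effective docking throughput (an illustrative calculation assuming the quoted figures, not stated on the slides):

```python
# Estimate per-docking cost from the WISDOM data-challenge totals.
ligands = 46e6                    # ligand dockings completed, from the slide
cpu_years = 80                    # equivalent CPU time quoted on the slide
cpu_seconds = cpu_years * 365 * 24 * 3600

seconds_per_docking = cpu_seconds / ligands
print(f"~{seconds_per_docking:.0f} s of CPU time per docking")  # ~55 s

# Wall-clock cross-check: 80 CPU-years delivered in 6 weeks implies the
# average number of CPUs that were busy throughout the challenge.
weeks = 6
avg_busy_cpus = cpu_years * 365 / (weeks * 7)
print(f"~{avg_busy_cpus:.0f} CPUs busy on average")  # ~695 of the 1000 used
```

The second figure suggests the 1000 machines ran at roughly 70% average utilisation over the 6 weeks, which is plausible for the opportunistic scheduling described.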

[Pie chart: domain distribution of FlexX run jobs, by country – same data as shown on Page 7]

Page 12: EGEE – A Large-Scale Production Grid Infrastructure


Grid Interoperability

• We currently see different flavours of Grids deployed worldwide:
  – Because of application needs, legacy constraints, funding, etc.
  – Diversity is positive – competition to find the best solutions
• Many applications need to operate on more than one Grid infrastructure:
  – A pragmatic approach to interoperability is key
  – Applications need interoperable Grid infrastructures now
  – A production infrastructure cannot be an early adopter of quickly changing standards
• EGEE is an active contributor to interoperability and standardization efforts:
  – Works with OSG, NAREGI, ARC, and the multi-grids interoperability effort
  – Provides valuable input on practical experiences to the standardization process
  – Contributes to over 15 GGF WG/RG
  – Currently supplies two GGF area directors

Page 13: EGEE – A Large-Scale Production Grid Infrastructure


Grid Interoperation

• Leaders from TeraGrid, OSG, EGEE, APAC, NAREGI, DEISA, PRAGMA, UK NGS, and KISTI will lead an interoperation initiative in 2006.
• Six international teams will meet for the first time at GGF-16 in February 2006:
  – Application Use Cases (Bair/TeraGrid, Alessandrini/DEISA)
  – Authentication/Identity Management (Skow/TeraGrid)
  – Job Description Language (Newhouse/UK NGS)
  – Data Location/Movement (Pordes/OSG)
  – Information Schemas (Matsuoka/NAREGI)
  – Testbeds (Arzberger/PRAGMA)

Leaders from nine Grid initiatives met at SC05 to plan an application-driven "Interop Challenge" in 2006.

Interoperability workshop: Monday and Tuesday afternoon. "AuthZ interoperability here and now": Thursday morning.

Page 14: EGEE – A Large-Scale Production Grid Infrastructure


Future

• EGEE-II is expected to start on 1 April:
  – Smooth transition from the current phase
  – Additional support for more application domains
  – Increased number of partners, also from the US and Asia-Pacific
  – Unified EGEE middleware distribution: gLite 3.0
• Continue and reinforce interoperability work
• Standardization in GGF, other bodies, and industry is important for the long-term sustainability of Grids:
  – Need for commonly accepted standards
• Towards a long-term sustainable Grid infrastructure

Page 15: EGEE – A Large-Scale Production Grid Infrastructure


Upcoming EGEE Events

• Series of regional long-term sustainability workshops in Europe during March
  – For more information, contact [email protected]
• 1st EGEE User Forum, CERN, 1-3 March 2006
  – Bring together the EGEE community
  – Exchange experience and needs on EGEE
  – http://cern.ch/egee-intranet/User-Forum
• EGEE-II conference, Geneva, 25-29 September 2006
  – Includes sessions with related projects
  – Theme will be long-term sustainability