GridPP Status “How Ready are We for LHC Data-Taking?”


Page 1: GridPP Status “How Ready are We for LHC Data-Taking?”

GridPP Status

“How Ready are We for LHC Data-Taking?”

Tony Doyle

Page 2: GridPP Status “How Ready are We for LHC Data-Taking?”

GridPP18 Collaboration Meeting, 20 March 2007. Tony Doyle, University of Glasgow

Outline
• Exec2 Summary
• 2006 Outturn
• The Year Ahead..
• 2007 Status
• Some problems to solve..
• “All 6’s and 7’s”?

2007

Page 3: GridPP Status “How Ready are We for LHC Data-Taking?”


Exec2 Summary
• 2006 was the second full year for the UK Production Grid
• More than 5,000 CPUs and more than 1/2 Petabyte of disk storage
• The UK is the largest CPU provider on the EGEE Grid, with total CPU used of 15 GSI2k-hours in 2006
• The GridPP2 project has met 69% of its original targets, with 92% of the metrics within specification
• The initial LCG Grid Service is now starting and will run for the first 6 months of 2007
• The aim is to continue to improve reliability and performance, ready for startup of the full Grid service on 1st July 2007
• The GridPP2 project has been extended by 7 months, to April 2008
• The outcome of the GridPP3 proposal is now known
• We anticipate a challenging period in the year ahead

Page 4: GridPP Status “How Ready are We for LHC Data-Taking?”


Grid Overview
Aim: by 2008 (a full year’s data taking):
• CPU ~100 MSI2k (100,000 CPUs)
• Storage ~80 PB
• Involving >100 institutes worldwide
• Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
1. Prototype went live in September 2003 in 12 countries
2. Extensively tested by the LHC experiments in September 2004
3. February 2006: 25,547 CPUs, 4,398 TB storage
Status in 2007: 177 sites, 32,412 CPUs, 13,282 TB storage
Monitoring via the Grid Operations Centre

Page 5: GridPP Status “How Ready are We for LHC Data-Taking?”


http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php

Resources: 2006 CPU Usage by Region

Via APEL accounting

Page 6: GridPP Status “How Ready are We for LHC Data-Taking?”


2006 Outturn
Definitions:
• "Promised" is the total that was planned at the Tier-1/A (in the March 2005 planning) and Tier-2s (in the October 2004 Tier-2 MoU) for CPU and storage

• "Delivered" is the total that was physically installed for use by GridPP, including LCG and SAMGrid at Tier-2 and LCG and BaBar at Tier-1/A

• "Available" is available for LCG Grid use, i.e. declared via the EGEE mechanisms with storage via an SRM interface

• "Used" is as accounted for by the Grid Operations Centre

Page 7: GridPP Status “How Ready are We for LHC Data-Taking?”


Resources Delivered

Site | CPU (KSI2K): Promised, Delivered, Ratio | Storage (TB): Promised, Delivered, Ratio

Brunel 155 480 310% 21 6.3 30%

Imperial 1165 807 69% 93.3 60.3 65%

QMUL 917 1209 132% 58.5 18 31%

RHUL 204 163 80% 23.2 8.8 38%

UCL 60 121 202% 0.7 1.1 149%

Lancaster 510 484 95% 86.7 72 83%

Liverpool 605 592 98% 80.3 2.8 3%

Manchester 1305 1840 141% 372.6 145 39%

Sheffield 183 183 100% 3 2 67%

Durham 86 99 115% 5 4 79%

Edinburgh 7 11 152% 70.5 20 28%

Glasgow 246 800 325% 14.8 40 270%

Birmingham 196 223 114% 9.3 9.3 100%

Bristol 39 12 31% 1.9 3.8 200%

Cambridge 33 40 123% 4.4 4.4 101%

Oxford 414 150 36% 24.5 27 110%

RAL PPD 199 320 161% 17.4 66.1 381%

London 2501 2780 111% 196.7 94.4 48%

NorthGrid 2602 3099 119% 542.6 221.8 41%

ScotGrid 340 910 268% 90.3 64 71%

SouthGrid 880 745 85% 57.5 110.6 192%

Total 6322 7534 119% 887.1 490.8 55%

Tier-1 1604 1034 64% 1495 712 48%

2006 Outturn

Tier-1 and Tier-2 total delivery is impressive and usage is improved.

Available: CPU 8.5 MSI2k, Storage 1.7 PB, Disk 0.54 PB (to watch: delivery of Tier-1 disk)

Used: CPU 15 GSI2k-hours, Disk 0.26 PB (to watch: usage of Tier-2 CPU, disk)

Request: PPARC acceptance of the 2006 outturn (next week)

Page 8: GridPP Status “How Ready are We for LHC Data-Taking?”


Site | Available (KSI2K): 1Q06 2Q06 3Q06 4Q06 | Used (KSI2K hours): 1Q06 2Q06 3Q06 4Q06 | Ratio: 1Q06 2Q06 3Q06 4Q06

Brunel 116 116 116 480 12,811 105,014 159,082 643,806 0.70% 41.30% 62.60% 61.20%

Imperial 203 203 203 642 16,828 83,627 82,593 557,943 2.20% 18.80% 18.60% 39.70%

QMUL 281 1209 1209 1209 214,335 612,564 459,427 1,259,446 13.70% 23.10% 17.40% 47.60%

RHUL 163 163 163 163 25,085 21,940 176,046 147,360 17.20% 6.10% 49.30% 41.30%

UCL 121 121 121 121 42,217 51,106 73,763 156,576 16.00% 19.30% 27.80% 59.10%

Lancaster 485 476 473 473 248,463 402,774 210,432 297,550 14.90% 38.60% 20.30% 28.70%

Liverpool 572 592 592 592 56,218 455,727 40,551 164,222 11.80% 35.20% 3.10% 12.70%

Manchester 480 720 1152 1840 380,857 1,042,154 248,704 370,567 0.30% 66.10% 9.90% 9.20%

Sheffield 240 183 183 183 38,411 59,860 78,795 127,039 2.00% 15.00% 19.70% 31.80%

Durham 72 80 80 80 36,699 58,185 33,671 59,123 4.30% 33.20% 19.20% 33.70%

Edinburgh 6 6 6 6 14,829 4,637 3,641 4,918 34.40% 35.30% 27.70% 37.40%

Glasgow 104 104 47 800 75,774 50,462 72,105 155,986 6.30% 22.20% 70.10% 8.90%

Birmingham 62 23 23 23 38,473 31,795 28,299 53,930 13.20% 62.00% 55.20% 105.20%

Bristol 1 7 7 7 842 7,208 8,982 6,593 15.10% 45.70% 57.00% 41.80%

Cambridge 40 38 38 38 654 2,228 2,442 1,811 0.00% 2.70% 2.90% 2.20%

Oxford 67 65 65 65 94,093 92,841 82,284 63,959 34.30% 65.40% 58.00% 45.10%

RAL PPD 73 73 320 320 106,919 132,046 143,648 235,172 59.90% 82.10% 20.50% 33.60%

London 884 1812 1812 2615 311,276 874,251 950,911 2,765,131 10.30% 22.00% 24.00% 48.30%

NorthGrid 1777 1971 2399 3087 723,949 1,960,515 578,482 959,378 10.10% 45.40% 11.00% 14.20%

ScotGrid 182 190 133 886 127,302 113,284 109,417 220,027 6.40% 27.20% 37.60% 11.30%

SouthGrid 243 207 453 453 240,981 266,118 265,655 361,465 31.00% 58.80% 26.80% 36.40%

Total Tier-2 3086 4179 4797 7041 1,403,508 3,214,168 1,904,465 4,306,001 12.20% 35.10% 18.10% 27.90%

Tier-1 415 620 651 848 624,636 1,089,917 1,393,022 992,106 68.70% 80.20% 97.60% 53.40%

LCG CPU Usage
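The Ratio columns appear to be utilisation: used KSI2K-hours divided by the KSI2K capacity available over the wall-clock hours in each quarter. A minimal Python sketch of that reading (the quarter lengths and the formula are our assumptions, not taken from the APEL pages):

# Assumed definition: utilisation = used KSI2K-hours /
#                     (available KSI2K * wall-clock hours in the quarter)
QUARTER_HOURS = {"1Q06": 90 * 24, "2Q06": 91 * 24, "3Q06": 92 * 24, "4Q06": 92 * 24}

def utilisation(used_ksi2k_hours, available_ksi2k, quarter):
    """Fraction of the theoretically available CPU-hours actually used."""
    return used_ksi2k_hours / (available_ksi2k * QUARTER_HOURS[quarter])

# Tier-1, 2Q06, from the table above: ~80%, close to the 80.20% shown
print(f"{utilisation(1_089_917, 620, '2Q06'):.1%}")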

Page 9: GridPP Status “How Ready are We for LHC Data-Taking?”


Efficiency (measured by UK Tier-1 for all VOs)

~90% CPU efficiency, given i/o bottlenecks, is OK; the concern is that this fell to ~75% (the target is marked on the plot).

Each experiment needs to work to improve its system/deployment practice, anticipating e.g. hanging gridftp connections during batch work.
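CPU efficiency here is CPU time over wall-clock time, so a job stalled on i/o (e.g. a hung gridftp transfer) burns wall time without CPU and drags the site average down. A small illustrative calculation (the job numbers are invented):

def cpu_efficiency(cpu_seconds, wall_seconds):
    """CPU time / wall-clock time for a batch job."""
    return cpu_seconds / wall_seconds

jobs = [            # (cpu_s, wall_s); illustrative, not real accounting data
    (3400, 3600),   # healthy job: ~94% efficient
    (3400, 3600),
    (900, 3600),    # job stalled on a hanging transfer: 25% efficient
]
aggregate = sum(c for c, w in jobs) / sum(w for c, w in jobs)
print(f"aggregate efficiency: {aggregate:.0%}")  # one bad job pulls ~94% down to ~71%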

Page 10: GridPP Status “How Ready are We for LHC Data-Taking?”


CPU by experiment

Experiment | Used at Tier-2 (KSI2K hours): 1Q06 2Q06 3Q06 4Q06 | Used at Tier-1 (KSI2K hours): 1Q06 2Q06 3Q06 4Q06

ALICE   0 432 187 9031 17122 30139 36757

ATLAS 852014 1131060 569879 800194 156114 323195 256979 253869

CMS 125794 236409 489122 368427 77025 176198 407072 170784

LHCb 164339 1237858 718268 1072838 21210 404707 634341 396417

BaBar 41854 72932 9159 31454 254775 61636 15853 501

CDF   1 5 17        

D0 93373 102602 40541 221069 95963 53091 22433 27515

H1 3851 44018 23013 8460 3058 17083 18459 80

ZEUS 4965 23170 4140 60736 6906 20953 1353 19815

Other 115407 232299 4867 11341 548 15932 6384 478

LHC 1142147 2605327 1777701 2241646 263380 921222 1328531 857827

Total 1401597 3080349 1859426 2574723 624630 1089917 1393013 906216

Page 11: GridPP Status “How Ready are We for LHC Data-Taking?”


2006 CPU Usage by experiment (UK Resources)

[Pie chart: LHCb 35%, ATLAS 34%, CMS 16%, D0 5%, BaBar 4%, Other 3%, ALICE 1%, H1 1%, ZEUS 1%, CDF 0%]

Page 12: GridPP Status “How Ready are We for LHC Data-Taking?”


LCG Disk Usage

Site | Available (TB): 1Q06 2Q06 3Q06 4Q06 | Used (TB): 1Q06 2Q06 3Q06 4Q06 | Ratio: 1Q06 2Q06 3Q06 4Q06

Brunel   1.5 1.1 4.7   0.1 0.2 4.3   6.70% 18.10% 91.10%

Imperial 0.3 3.2 5.6 35.4 0.3 2.2 2.9 25.5 88.80% 69.40% 51.70% 72.00%

QMUL 18.2 15.9 18.2 18.2 14.3 3.6 3.4 4.8 78.40% 22.60% 18.40% 26.40%

RHUL 2.7 2.7 2.7 5.5 2.5 0.3 0.2 1.5 90.50% 10.60% 7.70% 27.30%

UCL 1.1 0 1 2 0.9 0 0.3 1.4 82.60% 54.30% 32.60% 70.00%

Lancaster 63.4 53.1 47.7 60 29.9 13.1 26.9 12.8 47.10% 24.70% 56.30% 21.30%

Liverpool   2.8 0.6 2.8 0 0 0.1 1.4   0.80% 16.30% 50.00%

Manchester   66.9 67.6 176.8 0 1.9 3.9 5.4   2.80% 5.80% 3.10%

Sheffield 4.5 3.9 2.3 2.2 4.4 1.2 0.3 0.1 95.80% 32.10% 12.40% 4.50%

Durham 1.9 1.9 3.5 3.5 0.6 1.3 0.9 1.2 30.90% 68.10% 25.40% 34.30%

Edinburgh 31 30 29 20 16.6 13.5 2.8 3.9 53.60% 45.10% 9.50% 19.50%

Glasgow 4.3 4.3 1.6 34 3.8 0.6 1.1 4.1 89.90% 15.00% 70.80% 12.10%

Birmingham 1.8 1.8 1.9 1.8 1.3 0.6 0.8 1.3 73.30% 31.80% 41.60% 72.20%

Bristol 0.2 0.2 2.1 1.8 0.2 0 0.3 0.4 89.60% 12.00% 16.00% 22.20%

Cambridge 3.2 3.2 3 3.1 3.1 0 0.8 2.1 94.70% 0.60% 26.30% 67.70%

Oxford 3.2 1.6 3.2 3.2 2.5 0 0 0.5 80.10% 1.10% 0.00% 15.60%

RAL PPD 6.8 6.8 6.4 16.6 6.4 0.6 0.3 13.5 93.50% 9.40% 4.20% 81.30%

London 22.4 23.4 28.7 65.8 17.9 6.2 7 37.5 80.30% 26.60% 24.40% 57.00%

NorthGrid 67.9 126.7 118.2 241.8 34.2 16.2 31.2 19.7 50.40% 12.80% 26.40% 8.10%

ScotGrid 37.1 36.2 34.1 57.5 21 15.5 4.8 9.2 56.60% 42.80% 14.00% 16.00%

SouthGrid 15.2 13.6 16.6 26.5 13.4 1.3 2.2 17.8 88.60% 9.30% 13.20% 67.20%

Total Tier-2 142.5 199.8 197.5 391.6 86.6 39.2 45.1 84.2 60.70% 19.60% 22.80% 21.50%

Tier-1 121.1 114.4 123.1 145.3 56.4 107.2 149.4 177.7 46.60% 93.70% 121.40% 122.30%

Page 13: GridPP Status “How Ready are We for LHC Data-Taking?”


File Transfers (individual rates)

[Bar chart: transfer rate in Mb/s (0 to 900) for Lancaster, RALPP, Birmingham, Glasgow, Edinburgh, Oxford, Manchester, Sheffield, Cambridge, Bristol, UCL-CENTRAL, Durham, QMUL, IC-HEP, IC-LeSC, UCL-HEP, RHUL, Brunel and Liverpool; series: Inbound Q1, Inbound Q3, Outbound Q1, Outbound Q3]

Aim: to maintain data transfers at a sustainable level as part of experiment service challenges

http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary

Current goals:
• >250 Mb/s inbound-only
• >300-500 Mb/s outbound-only
• >200 Mb/s inbound and outbound
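To put the goals in perspective, a sustained rate in Mb/s converts directly into daily volume (decimal units, 8 bits per byte); this is simple arithmetic, not a GridPP figure:

def mbps_to_tb_per_day(rate_mbps):
    """Sustained megabits/second -> terabytes moved per day."""
    return rate_mbps * 1e6 / 8 * 86400 / 1e12

for rate in (200, 250, 300, 500):
    print(f"{rate} Mb/s sustained = {mbps_to_tb_per_day(rate):.1f} TB/day")
# 250 Mb/s held for a day is ~2.7 TB, which is why sustainable
# rates matter more than short bursts.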

Page 14: GridPP Status “How Ready are We for LHC Data-Taking?”


The Year Ahead..

[Timeline 2001-2008: GridPP1, GridPP2, GridPP2+, GridPP3; EDG, EGEE-I, EGEE-II, EGI?; LHC data taking: first collisions at 900 GeV in 2007, 14 TeV collisions in 2008. "Don't Panic"]

Page 15: GridPP Status “How Ready are We for LHC Data-Taking?”


CMS
• Magnet & Cosmics Test (August 06)
• Detector Lowering (January 07)

Page 16: GridPP Status “How Ready are We for LHC Data-Taking?”


DIRAC WMS

[Workflow diagram: the Job Receiver accepts the user job (JDL, input sandbox) into the Job DB; the Data Optimizer checks data against the LFC and fills the Task Queue; the Agent Director submits Pilot Jobs via the LCG Resource Brokers to a CE; on the worker node the Pilot Agent calls the Matcher with the CE's JDL to pull a matching job; the Job Wrapper gets the sandbox and forks the user application, which gets replicas, uploads data to an SE and puts requests via a VO-box; the Agent Monitor checks pilots and the Job Monitor checks jobs. DIRAC services and LCG services are shown as separate layers, with the workload running on the WN.]
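The point of the diagram is the pilot "pull" model: the pilot carries no payload, and only once it is running on a worker node does it ask the Matcher for a compatible job from the Task Queue. A minimal Python sketch of that pattern; the class and method names are invented for illustration and are not the DIRAC API:

import subprocess

class Matcher:
    """DIRAC-side service: hands out a waiting job whose requirements
    fit the resource description the pilot presents (its 'CE JDL')."""
    def __init__(self, task_queue):
        self.task_queue = task_queue   # list of {'jdl': ..., 'requirements': set}

    def request_job(self, ce_description):
        for job in list(self.task_queue):
            if job["requirements"].issubset(ce_description["capabilities"]):
                self.task_queue.remove(job)
                return job
        return None                    # nothing compatible: the pilot exits quietly

def pilot_agent(matcher, ce_description):
    """Runs on the worker node after the Agent Director submits it via an RB.
    Instead of carrying a payload, it pulls one from the Matcher."""
    job = matcher.request_job(ce_description)
    if job is not None:
        # JobWrapper stage: fetch the sandbox, then fork the user application
        subprocess.run(job["jdl"]["executable"], check=False)

matcher = Matcher([{"jdl": {"executable": ["echo", "user analysis"]},
                    "requirements": {"atlas-sw-12.0.6"}}])
pilot_agent(matcher, {"capabilities": {"atlas-sw-12.0.6", "slc3"}})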

Page 17: GridPP Status “How Ready are We for LHC Data-Taking?”


ALICE PDC'06
• The longest-running Data Challenge in ALICE
– A comprehensive test of the ALICE computing model
– Running already for 9 months non-stop: approaching the data-taking regime of operation
– Participating: 55 computing centres on 4 continents: 6 Tier-1s, 49 Tier-2s (43% T1s, 57% T2s)
– 7 MSI2k-hours; 1,500 CPUs running continuously
• 685K Grid jobs in total: 530K production, 53K DAQ, 102K user (!)
• 40M events, 0.5 PB generated, reconstructed and stored
• User analysis ongoing
• T1 sites: CNAF, CCIN2P3, GridKa, RAL, SARA
• FTS tests T0 to T1, Sep-Dec: design goal of 300 MB/s reached but not maintained; 0.7 PB of DAQ data registered

Page 18: GridPP Status “How Ready are We for LHC Data-Taking?”


WLCG Commissioning Schedule (2006-2008)
• 2006: SC4 becomes the initial service when reliability and performance goals are met
• Continued testing of computing models and basic services
• Testing DAQ to Tier-0, and integrating into the DAQ to Tier-0 to Tier-1 data flow
• Building up end-user analysis support
• Exercising the computing systems, ramping up job rates, data management performance, ...
• 2007: initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation, ...
• Introduce residual services: full FTS services; 3D; gLite 3.x; SRM v2.2; VOMS roles; SL(C)4
• 01 Jul 07: service commissioned, at full 2007 capacity and performance
• 2008: first collisions in the LHC; full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1)
• Continue in Data Challenge mode, as per WLCG commissioning
• ALICE milestones: AliRoot & conditions frameworks; SEs & job priorities; combined T0 test; DA for calibration ready; finalisation of CAF & Grid; the real thing

The Year Ahead..

Page 19: GridPP Status “How Ready are We for LHC Data-Taking?”


http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php

Resources: 2007 CPU Usage by Region

Via APEL accounting

Page 20: GridPP Status “How Ready are We for LHC Data-Taking?”


2007 CPU Usage by experiment (UK Resources)

Page 21: GridPP Status “How Ready are We for LHC Data-Taking?”


2007 CPU Usage by experiment (UK Resources)

Page 22: GridPP Status “How Ready are We for LHC Data-Taking?”


Hardware Outlook: Planning for 2007..
• A profiled ramp-up of resources is planned throughout 2007 to meet the UK requirements of the LHC and other experiments
• The results are available for the Tier-1 and Tier-2s
• The Tier-1/A Board reviewed UK input to the International MoU negotiations for the LHC experiments, as well as providing input to the International Finance Committee for BaBar
• For LCG, the 2007 commitment for disk and CPU capacity can be met from existing hardware already delivered

The Year Ahead..

Page 23: GridPP Status “How Ready are We for LHC Data-Taking?”


T2 Resources

e.g. Glasgow (UKI-SCOTGRID-GLASGOW): 800 kSI2k, 100 TB DPM. Needed for LHC start-up.
[Photographs dated August 28, September 1, October 13, October 23]
IC-HEP: 440 KSI2K, 52 TB dCache
Brunel: 260 KSI2K, 5 TB DPM

Page 24: GridPP Status “How Ready are We for LHC Data-Taking?”


Efficiency (measured by UK Tier-1 for all VOs)

~90% CPU efficiency, given i/o bottlenecks, is OK; the concern is that this is falling further. The current transition from dCache to CASTOR at the Tier-1 contributes to the problem [see Andrew's talk]. NB: March is a mid-month figure.

Each experiment needs to work to improve its system.

Page 25: GridPP Status “How Ready are We for LHC Data-Taking?”


ATLAS User Tests
• Many problems identified and fixed at individual sites (GridPP DTeam)
• Other 'generic' system failures need to be addressed before the Grid is fit for widespread use by inexperienced users
• Production teams mostly 'work around' these; users can't or won't

[Chart: Overall Success Rate (%) by week, 11/01/2007 to 01/03/2007; annotated "More sites and tests introduced" and "System failures"]

Page 26: GridPP Status “How Ready are We for LHC Data-Taking?”


CMS User Jobs

Status (Sunday): production jobs are now outnumbered by analysis and unknown jobs.

Analysis (CRAB) efficiency OK? e.g. RAL 93.3%

http://lxarda09.cern.ch/dashboard/request.py/jobsummary

Page 27: GridPP Status “How Ready are We for LHC Data-Taking?”


• Data recorded in the experiment dashboards
– Initially only data from CMS (dashboard)
– Now more and more data from ATLAS as well
– CMS: mostly analysis; ATLAS: dominated by production
– We expect to have "all" types of jobs soon

“Job RetryCount” family 48757

Job proxy is expired 17465

Cannot plan: BrokerHelper: no compatible resources 16646

Job got an error while in the CondorG queue 5694

Cannot retrieve previous matches for … 2410

Job successfully submitted to Globus 948

Unable to receive data 291

Popular(?) Messages

Page 28: GridPP Status “How Ready are We for LHC Data-Taking?”


Resource Broker
• Use RBs at RAL (2) and Imperial
• They broke about once a week, with all jobs lost or left in limbo
– Never clear to the user why
• Switch to a different RB (a workaround sketch follows this list)
– Users don't know how to do this
• Barely usable for bulk submission: too much latency
• Can barely submit and query ~20 jobs in 20 minutes before the next submission
– Users will want to do more than this
• Cancelling jobs doesn't work properly: it often fails, and repeated attempts cause the RB to fall over
– Users will not cancel jobs
• (We know the EDG RB is deprecated, but the gLite RB isn't currently deployed)
• Work is ongoing to improve RB availability and the BDII (failover system) at the Tier-1..
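A hedged sketch of the failover workaround the slide says users cannot manage by hand: try each known RB in turn when submission fails. The edg-job-submit flags and per-RB config files below are assumptions about a typical LCG UI setup, not a verified recipe:

import subprocess

RB_CONFIGS = ["rb-ral-1.conf", "rb-ral-2.conf", "rb-imperial.conf"]  # hypothetical files

def submit_with_failover(jdl_file):
    """Try each Resource Broker config until one accepts the job."""
    for conf in RB_CONFIGS:
        result = subprocess.run(
            ["edg-job-submit", "--config-vo", conf, "-o", "jobids.txt", jdl_file],
            capture_output=True, text=True)
        if result.returncode == 0:
            return True            # job ID appended to jobids.txt
        print(f"RB behind {conf} refused the job, trying the next one...")
    return False

if not submit_with_failover("analysis.jdl"):
    print("all brokers failed: the situation users cannot recover from today")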

Page 29: GridPP Status “How Ready are We for LHC Data-Taking?”


Information System
• lcg-info is used to find out what version of the ATLAS software is available before submitting a job to a site, but it is too unreliable, and the previous answer has to be kept track of
• An ldap query typically gives a quick, reliable answer, but lcg-info doesn't (see the sketch after this list)
• The lcg-info command is very slow (querying *.ac.uk or xxx.ac.uk) and often fails completely
• Different BDIIs seem to give different results, and it is not clear to users which one to use (if the default fails)
• Many problems with UK SEs have made the creation of replicas painful; it is not helped by frequent BDII timeouts
• The FDR "freedom of choice" tool causes some problems, because sites fail SAM tests when their job queues are full
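The "quick, reliable ldap answer" can be scripted directly against a top-level BDII. A sketch, assuming a Glue 1.x schema (the software-tag attribute is the standard one; the BDII endpoint and tag below are examples to be replaced with your own):

import subprocess

BDII = "lcg-bdii.example.ac.uk:2170"   # example top-level BDII, assumption
TAG = "VO-atlas-release-12.0.6"        # illustrative ATLAS software tag

result = subprocess.run(
    ["ldapsearch", "-x", "-LLL", "-H", f"ldap://{BDII}", "-b", "o=grid",
     f"(GlueHostApplicationSoftwareRunTimeEnvironment={TAG})",
     "GlueSubClusterUniqueID"],
    capture_output=True, text=True, timeout=30)

for line in result.stdout.splitlines():
    if line.startswith("GlueSubClusterUniqueID:"):
        print(line.split(":", 1)[1].strip())   # subclusters publishing the tag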

Page 30: GridPP Status “How Ready are We for LHC Data-Taking?”


UI and Proxies

User Interface
• Users need local UIs (where their files are)
• These can be set up by local system managers, but generally these are not Grid experts
• The local UI setup controls which RB, BDII, LFC etc. all the users of that UI get, and these appear to be pretty random
– There needs to be clear guidance on which of these to use, and how to change them if things go wrong

Proxy Certificates
• These cause a lot of grief, as the default 12 hours is not long enough
• If the certificate expires while jobs are running, the failure is not always clear from the error messages
• They can be created with longer lifetimes, but this starts to violate security policies
– Users will violate these policies
• Maybe MyProxy solves this, but do users know? (a sketch follows this list)
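One way MyProxy helps, sketched with the standard command-line tools (the server name is an example; whether your VO's WMS is configured to renew from it is site-dependent):

import subprocess

MYPROXY_SERVER = "myproxy.example.ac.uk"   # example host, assumption

# Short-lived VOMS proxy for interactive use (the default ~12h pain point)
subprocess.run(["voms-proxy-init", "--voms", "atlas"], check=True)

# Longer-lived credential delegated to a MyProxy server, so services can
# renew job proxies instead of users creating week-long proxies locally.
# -d: use the certificate DN as the username; -n: allow renewal without passphrase
subprocess.run(["myproxy-init", "-s", MYPROXY_SERVER, "-d", "-n"], check=True)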

Page 31: GridPP Status “How Ready are We for LHC Data-Taking?”


GGUS
• GGUS is used to report some of these problems, but it is not very satisfactory
• The initial response is usually quite quick, saying the ticket has been passed to X, but the response after that is very patchy
• Usually there is some sort of acknowledgement, but rarely a solution, and often the ticket is never closed, even if the problem was transitory and is now irrelevant
• There are two particular cases which GGUS does not handle well:
a) Something breaks and probably just needs to be rebooted: the system is just too slow, and it's better to email someone (if you know whom)
b) Something breaks and is rebooted/repaired etc., but the underlying cause is a bug in the middleware: this doesn't seem to be fed back to the developers
• There are also, of course, some known problems that take ages to be fixed (e.g. the globus port range bug, rfio libraries, ...)
• More generally, the GGUS system is working at the level of tens (up to 100) of tickets/week, but may not scale as new users start using the system

Page 32: GridPP Status “How Ready are We for LHC Data-Taking?”


Usability
• The Grid is a great success for Monte Carlo production
• However, it is not in a fit state for basic user analysis
• The tools are not suitable for bulk operations by normal users
– Current users therefore set up ad-hoc scripts that can be mis-configured
• 'System failures' are too frequent (largely independent of the VO, and probably location-independent)
• The user experience is poor
– Improved overall system stability is needed
– Template UI configuration (being worked on) will help
– Wider adoption of VO-specific user interfaces may help
– Users need more (directed) guidance
• There is not long to go
– Usability task force required?

Page 33: GridPP Status “How Ready are We for LHC Data-Taking?”


The Year Ahead..
• 3D tested by ATLAS and LHCb (Tier-1)
• 3D used for the conditions DB (Tier-1)
• SRM 2.2 implementations (Tier-1 & Tier-2)
• SRM 2.2 tested by experiments (Tier-1 & Tier-2)
• SLC4 migration (Tier-1 & Tier-2)
• gLite CE (Tier-1 & Tier-2)
• New RB (Tier-1)
• FTS v2 (Tier-1 & Tier-2)
• VOMS scheduling priorities (Experiments)
• 24x7 definition (Tier-1)
• 24x7 test scenario (Tier-1)
• VO boxes SLA (Tier-1 & Tier-2)
• VO boxes implementation (Tier-1 & Tier-2)
• Accounting data into the APEL repository (GOC)
• Automated accounting reports (GOC)

Page 34: GridPP Status “How Ready are We for LHC Data-Taking?”


The Year Ahead.. Example: FTS 2.0 schedule
• FTS 2.0 is currently deployed on the pilot service at CERN
– In testing since December: running dteam tests to stress-test it
– This is the 'uncertified' code
• Next step: open the pilot to the experiments to verify full backwards compatibility with experiment code
– Arranging this now
• Deployment at the CERN T0 is scheduled for April 2007
– The goal is April 1, but this is tight
– Subject to successful verification
• Roll-out to T1 sites a month after that
– We expect at least one update will be needed to the April version

Page 35: GridPP Status “How Ready are We for LHC Data-Taking?”


The Year Ahead.. Site Reliability

Tier-0 and Tier-1: 2007 SAM targets (monthly average)
• Target for each site: 91% by Jun 07; 93% by Dec 07
• Taking the 8 best sites: 93% by Jun 07; 95% by Dec 07 (a sketch of the two averages follows)

Tier-2s: "Begin reporting the monthly averages, but do not set targets yet"
• ~80% by Jun 07; ~90% by Dec 07

SAM tests (the critical set is a subset): BDII (top-level BDIIs), BDII (site BDII), FTS (File Transfer Service), gCE (gLite Computing Element), LFC (global LFC), VOMS, CE (Computing Element), SRM, gRB (gLite Resource Broker), MyProxy, RB (Resource Broker), VOBOX (VO box), SE (Storage Element), RGMA (R-GMA Registry)
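The two SAM figures are simple averages. A minimal sketch of how they might be computed from per-site monthly availabilities (the numbers are invented for illustration):

site_monthly = {   # site -> monthly average of critical SAM tests passed (invented)
    "RAL": 0.95, "Glasgow": 0.94, "Oxford": 0.92, "Lancaster": 0.91,
    "Imperial": 0.90, "QMUL": 0.89, "Manchester": 0.85, "Durham": 0.84,
    "Edinburgh": 0.78, "Brunel": 0.72,
}

# Per-site target (Jun 07): each site's own monthly average >= 91%
meets_target = {site: avail >= 0.91 for site, avail in site_monthly.items()}

# "Taking 8 best sites": average the top 8 monthly figures
best8 = sorted(site_monthly.values(), reverse=True)[:8]
print(f"8-best-sites average: {sum(best8) / len(best8):.0%}")   # 90% here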

Page 36: GridPP Status “How Ready are We for LHC Data-Taking?”


Summary
• Exec2 Summary: status OK
• 2006 Outturn: some issues
• The Year Ahead..
• Some problems to solve.. (AKA challenges)
• The weather is fine
• We need to set some targets

2007

Page 37: GridPP Status “How Ready are We for LHC Data-Taking?”


Devoncove Hotel, 931 Sauchiehall Street, Glasgow, G3 7TQ. Tel: 0141 334 4000

Sandyford Hotel, 904 Sauchiehall Street, Glasgow, G3 7TF. Tel: 0141 334 0000

You can see ~home if you look up