GridPP Deployment Status, User Status and Future Outlook
Tony Doyle
INFNGrid Meeting, 20 December 2006. Tony Doyle, University of Glasgow
Introduction
A. What is the deployment status?
B. Is the system usable?
C. What is the future of GridPP?
Wot no middleware?
GridPP Middleware is..
Security
Network Monitoring
Information Services
Grid Data Management
Storage Interfaces
Workload Management
Middleware
e.g. LCG monitoring applet
• Monitor:
  – resource brokers
  – virtual organisations
    • ATLAS
    • CMS
    • LHCb
    • DTeam
    • Other
• SQL queries to logging and book-keeping database
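The sort of aggregate such an applet computes can be sketched with an in-memory SQLite stand-in for the logging and book-keeping database. The `jobs` table and its columns here are assumptions for illustration, not the real L&B schema.

```python
import sqlite3

# Illustrative stand-in for the logging and book-keeping (L&B) database.
# Table and column names are assumptions for this sketch, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT, vo TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("j1", "atlas", "Running"), ("j2", "cms", "Done"),
     ("j3", "atlas", "Running"), ("j4", "dteam", "Scheduled")],
)

# The kind of aggregate the monitoring applet would display:
# running jobs per virtual organisation.
rows = conn.execute(
    "SELECT vo, COUNT(*) FROM jobs WHERE state = 'Running' GROUP BY vo"
).fetchall()
print(rows)  # [('atlas', 2)]
```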
Middleware
e.g. APEL and R-GMA
R-GMA structure
• Used in the accounting system (GOCDB)
• For gLite the sensors are provided by DGAS via DGAS2APEL
• The EGEE portal for accounting data is provided by CESGA
Middleware
Resources
[Chart: "UK total job slots" published job slots (0 to 7000) vs date, mid-2004 to late 2006]
17/12/06: EGEE total slots 34141, of which UKI is 6949 (~20% of the total).
17/12/06: EGEE jobs running 21291, of which UKI is 2912 (~14% of jobs).
Max EGEE = 42517, Max UKI = 8176.
(N.B. hyperthreading distorts the 1:1 job-slot:CPU-core relation, reducing UKI numbers by ~500.)
http://goc.grid.sinica.edu.tw/gstat/UKI.html
Sunday's STATUS:
        totalCPU  freeCPU  runJob  waitJob  seAvail TB  seUsed TB  maxCPU  avgCPU
Total   6949      3210     2912    77321    246         313        8176    6716
Steady climb since 2004 towards target of ~10,000 CPU (cores) (~job slots)
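Plugging in the 17/12/06 snapshot figures quoted above reproduces the stated shares; a quick check:

```python
# Sanity-check of the quoted UKI fractions (figures from the 17/12/06 snapshot).
egee_slots, uki_slots = 34141, 6949
egee_jobs, uki_jobs = 21291, 2912

slot_share = 100 * uki_slots / egee_slots
job_share = 100 * uki_jobs / egee_jobs
print(f"UKI slot share: {slot_share:.1f}%")  # ~20.4%
print(f"UKI job share:  {job_share:.1f}%")   # ~13.7%
```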
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Resources
2006 CPU Usage by Region
Via APEL accounting
(not all records are being accounted)
Resources
2006 CPU Usage by experiment
Resources
Total CPU used 52,876,788 kSI2k-hours!
[Chart: % job slots used vs date, mid-2004 to late 2006, y-axis 0% to 120%; series: % EGEE slots used, % UK slots used]
(Estimated utilisation based on gstat job slots/usage)
UKI mirrors overall EGEE utilisation. Average utilisation for Q306: 66%, compared to a target of ~70%. CPU utilisation was a T2 issue, but is now improving.
Utilisation
(measured by UK Tier-1 for all VOs)
A ~90% CPU efficiency, allowing for i/o bottlenecks, would be OK; the concern is that this is currently ~75%.
Efficiency
Each experiment needs to work to improve its system/deployment practice, anticipating e.g. hanging gridftp connections during batch work.
(is still an issue for Tier-1 and Tier-2s)
http://www.gridpp.ac.uk/storage/status/gridppDiscStatus.html
• Utilisation is low (~30%) at T2s and accounting [by VO] is not (yet) there
Storage
GOCDB Accounting Display - under development
• Looking at data for RAL-LCG2
• Storage units are 1 TB = 10^6 MB
• Tape Used + Disk Used = Total
Sensor Drop Outs have been fixed
[Chart: Total Used Storage (TB) over time, split into Tape Used and Disk Used]
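The bookkeeping rule above (decimal units, tape plus disk equals total) is trivial but easy to get wrong when mixing binary and decimal units; a minimal sketch with illustrative figures, not real RAL-LCG2 data:

```python
# Storage bookkeeping in the decimal units used by the GOCDB display:
# 1 TB = 10^6 MB, and Tape Used + Disk Used = Total Used.
MB_PER_TB = 10**6

def total_used_tb(tape_used_mb, disk_used_mb):
    """Combine tape and disk usage (both in MB) into total used TB."""
    return (tape_used_mb + disk_used_mb) / MB_PER_TB

# Illustrative figures only.
print(total_used_tb(150 * MB_PER_TB, 90 * MB_PER_TB))  # 240.0
```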
Storage
• SRM at T1: ~200 TB of disk (deployment problem in 2006)
  – ~100% usage (problem for 2006 service challenges)
  – Castor 2.1
• SRM at all T2s: ~200 TB of disk in total
  – ~30% usage: difficult to calculate
  – dCache 1.7.0 and DPM v1.5.10
  – Dedicated disk servers advised (storage should be robust)
• Need to make sure sites are running the latest GIP plugins (https://twiki.cern.ch/twiki/bin/view/EGEE/GIP-Plugins)
• New GOC storage accounting system being put in place
  – being deployed at Tier-2s
• SRM v2.2 is being implemented: need to test interoperability
Storage
(individual rates)
[Chart: transfer rate in Mb/s (0 to 900) per site, series Inbound Q1, Inbound Q3, Outbound Q1, Outbound Q3, for Lancaster, RALPP, Birmingham, Glasgow, Edinburgh, Oxford, Manchester, Sheffield, Cambridge, Bristol, UCL-CENTRAL, Durham, QMUL, IC-HEP, IC-LeSC, UCL-HEP, RHUL, Brunel and Liverpool]
Aim: to maintain data transfers at a sustainable level as part of experiment service challenges
http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary
File Transfers
Current goals:
• >250 Mb/s inbound-only
• >250 Mb/s outbound-only
• >200 Mb/s inbound and outbound
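To put the sustained-rate goals in perspective, a rate in Mb/s converts directly into data moved per day; a small sketch (decimal units assumed: 1 Mb = 10^6 bits, 1 TB = 10^12 bytes):

```python
# Convert a sustained transfer rate in Mb/s into TB moved per day.
# Decimal units assumed: 1 Mb = 1e6 bits, 1 TB = 1e12 bytes.
def mbps_to_tb_per_day(rate_mbps):
    bits_per_day = rate_mbps * 1e6 * 86400  # seconds per day
    return bits_per_day / 8 / 1e12          # bits -> bytes -> TB

print(round(mbps_to_tb_per_day(250), 2))  # 2.7  (the >250 Mb/s goal)
print(round(mbps_to_tb_per_day(200), 2))  # 2.16 (the bidirectional goal)
```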
• Approval for new (shared) machine room – ETA Summer 2008. Space for 300 racks.
• Procurement
  – March 06: 52 AMD 270 units, 21 disk servers (168 TB data capacity)
  – FY 06/07: 47 disk servers (282 TB disk capacity), 64 twin dual-core Intel Woodcrest 5130 units (550 kSI2k)
  – FY 06/07 upcoming: further 210 TB disk capacity plus high-availability systems (redundant PSUs, hot-swappable paired HDDs)
• Storage commissioning saga
  – Ongoing problems with March kit. Firmware updates have now solved the problem. (Disks on Areca 1170 in RAID 6 experienced multiple dropouts during testing of WD drives.)
• Move to CASTOR
  – Very support-heavy but made available for CSA06 and performing well
General:
• Air-con problems, with high temperatures triggering high-pressure cut-outs in refrigerator gas circuits (summers are warmer even in the UK...)
• July security incident
• 10Gb CERN line in place. Second 10Gb line scheduled in 07Q1
Tier-1 Resource
e.g. Glasgow: UKI-SCOTGRID-GLASGOW
• 800 kSI2k
• 100 TB DPM
Needed for LHC start-up
(August 28 / September 1 / October 13 / October 23)
T2 Resources
IC-HEP
• 440 kSI2k
• 52 TB dCache
Brunel
• 260 kSI2k
• 5 TB DPM
Could also be 2006
T2 Resources
As overheard at one T2 site..
A. “Usability” (Prequel)
• GridPP runs a major part of the EGEE/LCG Grid, which supports ~3000 users
• The Grid is not (yet) as transparent as end-users want it to be
• The underlying overall failure rate is ~10%
• User (interface)s, middleware and operational procedures (need to) adapt
• Procedures to manage the underlying problems such that the system is usable are highlighted
Virtual Organisations
• Users are grouped into VOs
  – Users/VO varies from 1 to 806 members (and growing..)
• Broadly four classes of VO
  – LHC experiments
  – EGEE supported
  – Worldwide (mainly non-LHC particle physics)
  – Local/regional, e.g. UK PhenoGrid
• Sites can choose which VOs to support, subject to MOU/funding commitments
  – Most GridPP sites support ~20 VOs
  – GridPP nominally allocates 1% of resources to EGEE non-HEP VOs
  – GridPP currently contributes 30% of the EGEE CPU resources
User evolution
Number of users of the UK Grid (exc. Deployment Team):
  Quarter: 05Q4  06Q2  06Q3
  Value:   1342  1831  2777
Many EGEE VOs supported, c.f. the 3000-user EGEE target.
Number of active users (> 10 jobs per month):
  Quarter:  05Q4  06Q1  06Q2
  Value:    83    166   201
  Fraction: 6.2%        11.0%
Viewpoint: growing fairly rapidly, but not as active as they could be? Depends on the “active” definition.
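The quoted fractions follow directly from the counts above; only the two quarters for which both a total and an active count are given can be checked:

```python
# Reproducing the quoted "active user" fractions from the user-evolution slide.
total_users = {"05Q4": 1342, "06Q2": 1831, "06Q3": 2777}
active_users = {"05Q4": 83, "06Q1": 166, "06Q2": 201}

frac_05q4 = 100 * active_users["05Q4"] / total_users["05Q4"]
frac_06q2 = 100 * active_users["06Q2"] / total_users["06Q2"]
print(round(frac_05q4, 1), round(frac_06q2, 1))  # 6.2 11.0
```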
atlas 806, dzero 763, cms 577, dteam 566, lhcb 150, alice 131, bio 75, dteamsgm 65, esr 41, ilc 31, atlassgm 27, alicesgm 27, cmsprg 21, atlasprg 18, fusn 17, zeus 15, dteamprg 13, cmssgm 13, hone 11, pheno 9, geant 9, babar 7, aliceprg 6, lhcbsgm 5, biosgm 5, babarsgm 3, zeussgm 2, t2k 2, geantsgm 2, cedar 2, phenosgm 1, minossgm 1, lhcbprg 1, ilcsgm 1, honesgm 1, cdf 1
Know your users? UK-enabled VOs
Resource allocation
• Assign quotas and priorities to VOs and measure delivery, but further work required on VOMS roles/groups within each VO
• VOMS provides group/role information in the proxy
• Tools to control quotas and priorities in site services being developed
  – So far only at whole-VO level
  – Maui batch scheduler is flexible, easy to map to groups/roles
  – Sites set the target shares
  – Can publish VO/group-specific values in the GLUE schema, hence the RB can use them for scheduling
• Accounting tool (APEL) measures CPU use at global level (UK task)
  – Storage accounting currently being added
  – GridPP monitors storage across the UK
  – Privacy issues around user-level accounting, being solved by encryption
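For the Maui fairshare mapping mentioned above, a site's target shares might be expressed along these lines. This is an illustrative maui.cfg fragment: the group names, percentages and fairshare window are assumptions for the sketch, not any GridPP site's actual configuration.

```
# maui.cfg (illustrative fragment)
# Fairshare bookkeeping: 7 windows of 24 hours, usage measured in
# dedicated processor-seconds.
FSPOLICY        DEDICATEDPS
FSDEPTH         7
FSINTERVAL      24:00:00

# Target shares for the local groups that VOs are mapped to (percent of farm).
GROUPCFG[atlas] FSTARGET=40
GROUPCFG[cms]   FSTARGET=25
GROUPCFG[lhcb]  FSTARGET=15
GROUPCFG[other] FSTARGET=20
```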
User Support
• Becoming vital as the number of users grows
  – But modest effort available in the various projects
• Global Grid User Support (GGUS) portal at Karlsruhe provides a central ticket interface
  – Problems are categorised
• Tickets are classified by an on-duty Ticket Process Manager, and assigned to an appropriate support unit
  – UK (GridPP) contributes support effort
• GGUS has a web-service interface to ticketing systems at each ROC
  – Other support units are local mailing lists
  – Mostly best-effort support, working hours only
• Currently ~tens of tickets/week
  – Manageable, but may not scale much further
  – Some tickets slip through the net
Documentation & Training
• Need documentation and training for both system managers and users
  – Mostly expert users up to now, but the user community is expanding
  – Induction of new VOs is a particular problem – no peer support
  – EGEE is running User Fora for users to share experience
    • Next in Manchester in May ’07 (with OGF)
  – EGEE has a dedicated training activity run by NeSC/Edinburgh
• Documentation is often a low priority, little dedicated effort
  – The rapid pace of change means that material requires constant review
• Effort on documentation is now increasing
  – GridPP has appointed a documentation officer
• GridPP web site, wiki
  – Installation manual for admins is good
  – There is also a wiki for admins to share experience
  – Focus is now on user documentation
• New EGEE web site – coming soon
Alternative view?
• The number of users in the Grid School for the Gifted is ~manageable now
• The system may be too complex, requiring too much work by the “average user”?
• Or the (virtual) help desk may not be enough?
• Or the documentation may be misleading?
• Or..
• Having smart users helps (the current ones are)
Timeline – 1
Proposal Writing Proposal Defence
Apr May Jun Jul Aug Sep Oct
31st March – PPARC Call
16th June – GridPP16 at QMUL
13th July – Bid Submitted
6th September – 1st PPRP review
1st November – GridPP17
8th November – PPRP “visiting panel”
CB OC CB
Future?
Year-long process to define future LHC exploitation
http://www.gridpp.ac.uk/docs/gridpp3/
Scenario Planning – Resource Requirements [TB, kSI2k]
GridPP requested a fair share of global requirements, according to experiment requirements
Changes in the LHC schedule prompted a(nother) round of resource planning - presented to CRRB on Oct 24th
New UK resource requirements have been derived and incorporated in the scenario planning e.g. Tier-1
Tier-1 CPU [kSI2k]   2008    2009    2010
ALICE               10230   18430   22930
ATLAS               18123   28423   49573
CMS                 12400   16900   36900
LHCb                 1770    4870    6740
TOTAL               42523   68623  116143

Tier-1 Disk [TB]     2008    2009    2010
ALICE                5220    7940    9870
ATLAS                9939   19686   39488
CMS                  5600    8500   13700
LHCb                 1025    2759    3250
TOTAL               21784   38885   66308

Tier-1 Tape [TB]     2008    2009    2010
ALICE                7030   13980   20930
ATLAS                7694   14950   28698
CMS                 13100   23500   36600
LHCb                  860    3070    5864
TOTAL               28684   55500   92092
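The TOTAL rows in these tables can be cross-checked mechanically; a minimal sketch using the Tier-1 CPU figures (the same pattern applies to the disk and tape tables):

```python
# Cross-check of the Tier-1 CPU requirement table [kSI2k]: the per-experiment
# rows should sum to the quoted TOTAL for each of 2008, 2009, 2010.
cpu = {
    "ALICE": (10230, 18430, 22930),
    "ATLAS": (18123, 28423, 49573),
    "CMS":   (12400, 16900, 36900),
    "LHCb":  (1770, 4870, 6740),
}
# Sum column-wise across experiments, one total per year.
totals = tuple(sum(vals) for vals in zip(*cpu.values()))
print(totals)  # (42523, 68623, 116143)
```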
Input to Scenario Planning – Hardware Costing
• Empirical extrapolations with extrapolated (large) uncertainties
• Hardware prices have been re-examined following recent Tier-1 purchase
• CPU (Woodcrest) was cheaper than expected based on extrapolation of the previous 4 years of data
[Chart: CPU Costs, Ln(K£/KSI2K) vs date (01-Jan-02 to 20-Mar-10): past CPU purchases, best fit to past purchases, 29-month and 20-month extrapolations, future price estimates (Max/Min), GridPP3 submission, CERN]
[Chart: Disk Costs, Ln(K£/TB) vs date (01-Jan-02 to 20-Mar-10): past purchases, best fit (21.7 months), 24-month and 19-month extrapolations, future price estimates (upper/lower limit), GridPP3 proposal, CERN]
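The cost plots fit a straight line in ln(price) against time, which is equivalent to assuming an exponential price decline with a fixed halving time. A minimal sketch of that extrapolation; the halving time here is an illustrative assumption, not GridPP's fitted value:

```python
import math

# A straight-line fit in ln(price) vs time corresponds to
# price(t) = price_now * exp(-ln(2) * t / halving_time).
# halving_months = 18 is an illustrative assumption, not the fitted value.
def extrapolate_price(price_now, months_ahead, halving_months=18.0):
    """Project a unit price forward, assuming it halves every halving_months."""
    return price_now * math.exp(-math.log(2) * months_ahead / halving_months)

# A unit costing 1.0 (arbitrary K-pounds per kSI2k) today, 18 months out:
print(round(extrapolate_price(1.0, 18.0), 3))  # 0.5
```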
Scenario Planning: An example 70% scenario based on Experiment Inputs [£m]
GridPP3 Proposal vs 70% Scenario [£m]

WG  Area        Item        Cost    Frac   70% Cost
A   Tier-1      Staff        4.99    93%     4.62
A   Tier-1      Hardware    11.72    85%     7.20
B   Tier-2      Staff        3.29    89%     2.94
B   Tier-2      Hardware     5.12    85%     4.35
C   Support     Staff        4.50    69%     3.10
D   Operations  Staff        1.89    88%     1.66
E   Management  Staff        1.17    90%     1.06
F   Outreach    Staff        0.37    74%     0.28
G   Travel etc  Other        0.84    75%     0.63
    Total                   33.89           25.84
    Working Allowance        1.25     0%     0.00
    Project Cost            35.14           25.84
    Contingency              4.15            5.17
    Running Costs            2.50            2.50
    Full Approval Cost      41.79           33.51
Timeline – 2
Nov Dec Jan Feb Mar Apr May
8th Nov – PPRP Visiting Panel
6th Dec – PPRP recommend to SC
PPARC Council
Science Committee
Grants etc.
GridPP2+ outcome (1/9/07-31/3/08) now known: emphasis on operations (with modest middleware support).
Anticipates GridPP3 outcome (1/4/08-31/3/11), known in the New Year.
Back to the Future?
Conclusion
A. What is the deployment status? (snapshot)
See e.g. “Performance of the UK Grid for Particle Physics” http://www.gridpp.ac.uk/papers/GridPP_IEEE06.pdf for more info.
B. Is the system usable?
Yes, but more work is required from the end-user perspective.
C. What is the future of GridPP?
An operations-led activity, working with EGEE/EGI (EU) and NGS (UK).