Transcript of Building the Grid, Grid Middleware 8, David Groep, lecture series 2005-2006 (45 slides).

Page 1

Building the Grid

Grid Middleware 8

David Groep, lecture series 2005-2006

Page 2

Scale

Grid for handling large collaborations, with significant amounts of data:
  LHC physics -> much data, quite a few users
  Bioinformatics -> reasonable amount of data, very many users
  Biomedicine & pharma -> highly confidential data, much computation, quite a few users
  …

The example, again, is LCG.

Page 3

ATLAS Tier-1 data flows

[Diagram: ATLAS Tier-1 data flows between the Tier-0, the Tier-1 CPU farm, disk buffer, disk storage and tape, the other Tier-1s, and the Tier-2s. Rates shown per stream:]

  RAW    1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day
  ESD2   0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day
  AOD2   10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day
  AODm2  500 MB/file, 0.004 Hz, 0.34K files/day, 2 MB/s, 0.16 TB/day

  RAW + ESD2 + AODm2 from the Tier-0 combined: 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day

  Comparable ESD1/ESD2 and AODm1/AODm2 streams (10-20 MB/s, 0.8-1.6 TB/day each) are exchanged with the other Tier-1s and the Tier-2s.

Plus simulation & analysis data flow.

Real data storage, reprocessing and distribution.

ATLAS data flows (draft). Source: Kors Bos, NIKHEF

Page 4

Example Grid Resource Centre

NDPF and the Amsterdam Tier-1

Page 5

Grid Site Logical Layout

Page 6

NDPF Logical Composition

Page 7

Physical resources

Service machines (the ‘grid tax’), ~10 systems:
  CE, RB, SE classic, SRM/DPM, MON, LFC, BDII, UI, installhost

Compute clusters, on private IP space for convenience (I’m lazy), a mix of systems (in GLUE parlance: subClusters):
  66 dual-AMD Athlon MP200+ (home-built)
  27 dual-Intel XEON 2.8 GHz (Supermicro)
  35 dual-Intel XEON EM64T 3.2 GHz (Dell)
  ~80 dual dual-core Intel Woodcrest, 700 kSI2k capacity (Dell, Aug 2006)
  in total ~560 cores or 1000 kSI2k capacity

Disk storage: 25 TByte in a DPM-managed pool

how to configure this to be an effective grid resource?

Page 8

NDPF Network Topology

Page 9

Batch Systems and Schedulers

The batch system keeps the list of nodes and jobs; the scheduler matches jobs to nodes based on policies.
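As a toy illustration of this division of labour (all names and the fair-share policy below are invented for the example, not the PBS/Torque plus MAUI setup used on the NDPF), the sketch keeps the batch system as pure bookkeeping and puts all policy in the scheduler:

```python
# Toy batch system + scheduler split: the batch system only keeps state,
# the scheduler applies policy (here: a per-VO target share of the slots).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_slots: int

@dataclass
class Job:
    job_id: int
    vo: str

@dataclass
class BatchSystem:          # keeps the list of nodes and the job queue, nothing more
    nodes: list = field(default_factory=list)
    queue: list = field(default_factory=list)

def schedule(batch, share):
    """Match queued jobs to free slots, respecting each VO's target share."""
    total_slots = sum(n.free_slots for n in batch.nodes)
    running = {}
    for job in list(batch.queue):
        if running.get(job.vo, 0) >= share.get(job.vo, 0) * total_slots:
            continue                                    # VO is at its share
        node = max(batch.nodes, key=lambda n: n.free_slots)
        if node.free_slots == 0:
            break                                       # cluster is full
        node.free_slots -= 1
        running[job.vo] = running.get(job.vo, 0) + 1
        batch.queue.remove(job)
        print(f"job {job.job_id} ({job.vo}) -> {node.name}")

batch = BatchSystem(nodes=[Node("wn001", 2), Node("wn002", 2)],
                    queue=[Job(1, "atlas"), Job(2, "atlas"),
                           Job(3, "biomed"), Job(4, "dteam")])
schedule(batch, share={"atlas": 0.5, "biomed": 0.25, "dteam": 0.25})
```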

Page 10

SC3 storage network (SARA)

Disk-to-disk: 583 MByte/s, i.e. 4.6 Gbps, over the world

Graphic: Mark van de Sanden, SARA

Page 11

Tier-1 Architecture SARA (storage)

Graphic: Mark van de Sanden, SARA

Page 12

Matching Storage to Computing

Doing the math: a simple job

  Read a 1 MByte piece of a file (typically 1 “event”)
  Calculate on it for 30 seconds
  Do this for 2000 events per file (i.e. 2 GByte files)
  On 1000 files (1 day of running) this takes 700 days
  Need a total of 2 TByte, i.e. 4 IDE disks of 500 GB

On the Grid: spread out over 1000 CPUs

  All jobs start at the same time, retrieving a 2 GByte input
  The machine with this 2 TByte disk is on a 100 Mbps link
  Effective 10 MByte/s throughput
  Thus, 10 kByte/s per machine
  It takes 55 hours before the file transfers finish!
  And after that, only 17 hours of calculation

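The numbers on this slide follow directly from the stated assumptions; a small sanity-check script (all figures taken from the slide):

```python
# Recompute the "doing the math" numbers from the slide.
event_size_mb = 1         # MByte read per event
cpu_per_event_s = 30      # seconds of computation per event
events_per_file = 2000    # so each file is ~2 GByte
n_files = 1000            # one day of running

total_cpu_s = cpu_per_event_s * events_per_file * n_files
print("single CPU:", round(total_cpu_s / 86400), "days")                 # ~694, i.e. ~700 days
print("total input:", events_per_file * event_size_mb * n_files / 1e6, "TByte")  # 2 TByte

# On the grid: 1000 CPUs all fetch their 2 GByte input file at once
n_cpus = 1000
server_rate_kb_s = 10_000                       # effective 10 MByte/s on a 100 Mbps link
per_job_rate_kb_s = server_rate_kb_s / n_cpus   # 10 kByte/s per machine
file_size_kb = events_per_file * event_size_mb * 1000
print("transfer per job:", file_size_kb / per_job_rate_kb_s / 3600, "hours")   # ~55-56 hours
print("computation per job:", cpu_per_event_s * events_per_file / 3600, "hours")  # ~17 hours
```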

Page 13

Storage

Just for ATLAS, one of the experiments: RAW & ESD data flow ~4 TByte/day (1.4 PB/y) to tape
  Expected to be a permanent “museum” copy
  Largely scheduled access (intelligent staging possible), read & write
  Disk buffers before the tape store can be smallish (~10%)

‘Chaotic’ access by real users: ~2-4 TByte/day throughput
  Lifetime of data is finite but long (typically 2+ years)
  Access needed from worker nodes, i.e. from O(1 000) CPUs
  Random “skimming” access pattern
  Need for disk server farms of typically 500 TByte – 1 PByte

Management of disk resources: split the ‘file system view’ (file metadata) from the object store
  dCache & dcap, DPNS & DPM, GPFS & ObjectStore, …

Page 14

Grid Resources Amsterdam

• 2x 1.2 PByte in 2 robots

• 36+1024 CPUs IA32

• disk caches 10 + 50 TByte

• multiple 10 Gbit/s links

560 cores IA32/x86_64

25 TByte disk cache

10 Gbit link SURFnet

2 Gbit/s to SARA

only resources with either GridFTP or Grid job management

BIG GRID Approved January 2006!

Investment of € 29M in next 4 years

For: LCG, LOFAR, Life Sciences,

Medical, DANS, Philips Research, …

See http://www.biggrid.nl/

Page 15

Configuring systems

Grid is what Murphy had in mind as he formulated his law …

Page 16

How do you see the Grid?

The broker matches the user’s request with the site ‘information supermarket’: matchmaking (using Condor Matchmaking) uses the information published by the site.

Grid Information System: ‘the only information a user ever gets about a site’. So it should be reliable, consistent and complete.

Standard schema (GLUE) to describe sites, queues, storage (complex schema semantics).

Currently presented as an LDAP directory.

LDAP Browser by Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
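To see exactly what a user (or the broker) sees, you can query this LDAP view directly. Below is a minimal sketch using the Python ldap3 library; the BDII host name is a placeholder, and port 2170 with base "o=grid" is the usual LCG information-system convention:

```python
# Minimal sketch: ask the information system (a BDII, i.e. an LDAP server)
# for a few GlueCE attributes.  "bdii.example.org" is a placeholder host.
from ldap3 import Server, Connection, ALL

server = Server("ldap://bdii.example.org:2170", get_info=ALL)
conn = Connection(server, auto_bind=True)      # anonymous bind, read-only

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs, entry.GlueCEStateWaitingJobs)
```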

Page 17

Glue Attributes Set by the Site

Site information:
  SiteSysAdminContact: mailto: [email protected]
  SiteSecurityContact: mailto: [email protected]

Cluster info:
  GlueSubClusterUniqueID=gridgate.cs.tcd.ie
  HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
  HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
  HostBenchmarkSI00: 1300
  GlueHostNetworkAdapterInboundIP: FALSE
  GlueHostNetworkAdapterOutboundIP: TRUE
  GlueHostOperatingSystemName: RHEL
  GlueHostOperatingSystemRelease: 3.5
  GlueHostOperatingSystemVersion: 3

  GlueCEStateEstimatedResponseTime: 519
  GlueCEStateRunningJobs: 175
  GlueCEStateTotalJobs: 248

Storage: similar info (paths, max number of files, quota, retention, …)

Page 18

Information system and brokering issues

The size of the information system scales with #sites and #details: already 12 MByte of LDIF; matching a job takes ~15 sec.

Scheduling policies are infinitely complex; no static schema can likely express this information.

Much information (still) needs to be set up manually … the next slides show the situation as of Feb 3, 2006.

The info system is the single most important grid service.

The current broker tries to make the optimal decision … instead of a ‘reasonable’ one.

Page 19

Example: GlueServiceAccessControlRule

For your viewing pleasure: GlueServiceAccessControlRule
261 distinct values seen for GlueServiceAccessControlRule

(one of) least frequently occurring value(s): 1 instance(s) of GlueServiceAccessControlRule:

/C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt

(one of) most frequently occurring value(s): 310 instance(s) of GlueServiceAccessControlRule: dteam

(one of) shortest value(s) seen: GlueServiceAccessControlRule: d0

(one of) longest value(s) seen: GlueServiceAccessControlRule: anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm
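Surveys like the one above are easy to produce from a dump of the information system. A minimal sketch (the dump file name is an example; it ignores LDIF line folding and base64-encoded values):

```python
# Count how often each value of a given Glue attribute occurs in an LDIF dump,
# e.g. the output of: ldapsearch -x -H ldap://<bdii>:2170 -b o=grid
from collections import Counter

attribute = "GlueServiceAccessControlRule"
counts = Counter()

with open("bdii-dump.ldif") as dump:
    for line in dump:
        if line.startswith(attribute + ":"):
            counts[line.split(":", 1)[1].strip()] += 1

print(len(counts), "distinct values seen for", attribute)
for value, freq in counts.most_common():
    print(f"{freq:6d}  {attribute}: {value}")
```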

Page 20

Example: GlueSEControlProtocolType

For your viewing pleasure: GlueSEControlProtocolType

freq  value
   1  GlueSEControlProtocolType: srm
   1  GlueSEControlProtocolType: srm_v1
   1  GlueSEControlProtocolType: srmv1
   3  GlueSEControlProtocolType: SRM
   7  GlueSEControlProtocolType: classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!

Page 21

Example: GlueHostOperatingSystemRelease

Today's attribute: GlueHostOperatingSystemRelease
   1  GlueHostOperatingSystemRelease: 3.02
   1  GlueHostOperatingSystemRelease: 3.03
   1  GlueHostOperatingSystemRelease: 3.2
   1  GlueHostOperatingSystemRelease: 3.5
   1  GlueHostOperatingSystemRelease: 303
   1  GlueHostOperatingSystemRelease: 304
   1  GlueHostOperatingSystemRelease: 3_0_4
   1  GlueHostOperatingSystemRelease: SL
   1  GlueHostOperatingSystemRelease: Sarge
   1  GlueHostOperatingSystemRelease: sl3
   2  GlueHostOperatingSystemRelease: 3.0
   2  GlueHostOperatingSystemRelease: 305
   4  GlueHostOperatingSystemRelease: 3.05
   4  GlueHostOperatingSystemRelease: SLC3
   5  GlueHostOperatingSystemRelease: 3.04
   5  GlueHostOperatingSystemRelease: SL3
  18  GlueHostOperatingSystemRelease: 3.0.3
  19  GlueHostOperatingSystemRelease: 7.3
  24  GlueHostOperatingSystemRelease: 3
  37  GlueHostOperatingSystemRelease: 3.0.5
  47  GlueHostOperatingSystemRelease: 3.0.4

Page 22

Example: GlueSAPolicyMaxNumFiles

136 separate Glue attributes seen

For your viewing pleasure: GlueSAPolicyMaxNumFiles

freq value

6 GlueSAPolicyMaxNumFiles: 99999999999999

26 GlueSAPolicyMaxNumFiles: 999999

52 GlueSAPolicyMaxNumFiles: 0

78 GlueSAPolicyMaxNumFiles: 00

1381 GlueSAPolicyMaxNumFiles: 10


For your viewing pleasure: GlueServiceStatusInfo

freq value

2 GlueServiceStatusInfo: No Known Problems.

55 GlueServiceStatusInfo: No problems

206 GlueServiceStatusInfo: No Problems

Page 23

LCG’s Most Popular Resource Centre

Page 24

Example: SiteLatitude

Today's attribute: GlueSiteLatitude
   1  GlueSiteLatitude: 1.376059
   1  GlueSiteLatitude: 33.063924198120645
   1  GlueSiteLatitude: 37.0
   1  GlueSiteLatitude: 38.739925290125484
   1  GlueSiteLatitude: 39.21
   …
   1  GlueSiteLatitude: 45.4567
   1  GlueSiteLatitude: 55.9214118
   1  GlueSiteLatitude: 56.44
   1  GlueSiteLatitude: 59.56
   1  GlueSiteLatitude: 67
   1  GlueSiteLatitude: GlueSiteWeb: http://rsgrid3.its.uiowa.edu
   2  GlueSiteLatitude: 40.8527
   2  GlueSiteLatitude: 48.7
   2  GlueSiteLatitude: 49.16
   2  GlueSiteLatitude: 50
   3  GlueSiteLatitude: 41.7827
   3  GlueSiteLatitude: 46.12
   8  GlueSiteLatitude: 0.0
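Sanity checks of this kind can be automated in the same way; a sketch (again against an example LDIF dump) that flags latitudes which are not numbers or fall outside the valid range:

```python
# Flag suspicious GlueSiteLatitude values in an LDIF dump (example file name).
attribute = "GlueSiteLatitude"

with open("bdii-dump.ldif") as dump:
    for line in dump:
        if not line.startswith(attribute + ":"):
            continue
        value = line.split(":", 1)[1].strip()
        try:
            lat = float(value)
        except ValueError:
            print("not a number:", value)
            continue
        if not -90.0 <= lat <= 90.0:
            print("out of range:", lat)
        elif lat == 0.0:
            print("suspicious default value:", lat)
```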

Page 25

Operational Monitoring

Detecting faults and errors: experiences in the NDPF

Page 26

User directory and automount maps

A large number of alternatives exists (nsswitch.conf / pam.d):
  files-based (/etc/passwd, /etc/auto.home, …)
  YP/NIS, NIS+
  database (MySQL/Oracle)
  LDAP

We went with LDAP:
  information is in a central location (like NIS)
  can scale by adding slave servers (like NIS)
  is secured by LDAP over TLS (unlike NIS)
  can be managed by external programs (also unlike NIS)

(in due course we will do real-time grid credential mapping to uid’s)

But you will need nscd, or a large number of slave servers

Page 27

Logging and Auditing

Auditing and logging:
  syslog (also for the grid gatekeeper, gsiftp, credential mapping)
  process accounting (psacct)

For the paranoid, use the tools included for CAPP/EAL3+: LAuS system call auditing. Highly detailed:
  useful both for debugging and incident response
  default auditing is critical: the system will halt on audit errors

If your worker nodes are on private IP space, you need to preserve a log of the NAT box as well.

Page 28

Grid Cluster Logging

Grid statistics and accounting: rrdtool views from the batch system, load per VO
  combine qstat and pbsnodes output via a script, cron and RRD
  cricket network traffic grapher

Extract PBS accounting data into a dedicated database. Grid users get a ‘generic’ uid from a dynamic pool, so this uid needs to be linked in the database to the grid DN and VO (a sketch of this extraction step follows below).

From the accounting db, upload anonymized records to APEL; APEL is the grid accounting system for VOs and funding agencies. The accounting db is also useful to charge costs to projects locally.
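A minimal sketch of that extraction step is shown below. The record layout follows the PBS/Torque accounting-log format ('date;type;jobid;key=value ...'); the uid-to-DN mapping table is a stand-in for whatever the site reconstructs from its credential-mapping logs:

```python
# Sketch: pull completed-job records out of a PBS/Torque accounting log and
# attach the grid identity of the pool account that ran the job.
import csv

# Example mapping; in reality this comes from the gridmap / LCAS-LCMAPS logs.
uid_to_grid = {"atlas001": ("/O=dutchgrid/O=users/CN=Some User", "atlas")}

with open("accounting.log") as log, open("usage.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["date", "jobid", "local_user", "dn", "vo", "walltime"])
    for line in log:
        parts = line.rstrip("\n").split(";", 3)   # 'date;type;jobid;key=value ...'
        if len(parts) != 4 or parts[1] != "E":    # keep only job-end records
            continue
        date, _, jobid, message = parts
        fields = dict(kv.split("=", 1) for kv in message.split() if "=" in kv)
        dn, vo = uid_to_grid.get(fields.get("user"), ("unknown", "unknown"))
        writer.writerow([date, jobid, fields.get("user"), dn, vo,
                         fields.get("resources_used.walltime")])
```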

Page 29

NDPF Occupancy

Usage of the NIKHEF NDPF Compute farm

Average occupancy in 2005: ~ 78%

Each colour represents a grid VO; the black line is the number of CPUs available.

Page 30

But at times, in more detail

Auditing incident: a disk with less than 15% free makes the syscall-audit system panic. New processes cannot write audit entries, which is fatal, so they wait, and wait, and … a head node has the most activity and fails first!

An unresponsive node causes the scheduler MAUI to wait for 15 minutes, then give up and start scheduling again, hitting the rotten node, and …

The PBS server keeps trying desperately to contact a dead node whose CPU has turned into Norit … and is unable to serve any more requests.

Page 31

Black Holes

A mis-configured worker node accepting jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole…
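One way to catch such a node early is to watch the accounting stream for worker nodes where an implausible number of jobs end within seconds. A hedged sketch, reusing the PBS/Torque accounting-log layout from the earlier example (thresholds are arbitrary example values):

```python
# Flag possible "black hole" worker nodes: nodes collecting many very short jobs.
from collections import defaultdict

SHORT_JOB_S = 60       # a job ending this quickly is suspicious
MIN_SHORT_JOBS = 20    # ...if one node collects at least this many of them

def hms_to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

short_jobs = defaultdict(int)
with open("accounting.log") as log:
    for line in log:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue
        fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)
        node = fields.get("exec_host", "unknown").split("/")[0]
        if hms_to_seconds(fields.get("resources_used.walltime", "99:00:00")) < SHORT_JOB_S:
            short_jobs[node] += 1

for node, count in sorted(short_jobs.items(), key=lambda kv: -kv[1]):
    if count >= MIN_SHORT_JOBS:
        print(f"possible black hole: {node} ({count} very short jobs)")
```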

Page 32

Clusters: what did we see? The Grid (and your cluster) are error amplifiers.

“Black holes” may eat your jobs piecemeal; dangerous “default” values can spoil the day (“GlueERT: 0”).

Monitor! (and allow for (some) failures, and design for rapid recovery)

Users don’t have a clue about your system beforehand (that’s the downside of those ‘autonomous organizations’).

If you want users to have a clue, you must publish your clues correctly (the information system is all they can see).

Grid middleware may effectively do a DoS on your system: doing qstat for every job every minute, to feed the logging & bookkeeping …

Power consumption is the greatest single limitation on CPU density.

And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night…

Page 33

Grid-wide monitoring

Page 34

Success Rate

What’s the chance the whole grid is working correctly?

If a single site has 98.5% reliability (i.e. it is down ~5 days/year), then with 200 sites you have only about a 4% chance that the whole grid is working correctly. And the 98.5% is quite optimistic to begin with …

So build the grid, both middleware and user jobs, for failure:
  Monitor sites with both system and functional tests
  Dynamically exclude sites with a current malfunction
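The estimate is easy to reproduce; with the 98.5% per-site availability quoted above the exact figure comes out near 5%, the same order of magnitude as the slide's 4%:

```python
# Probability that all N independent sites are up at the same time.
site_availability = 0.985        # i.e. down roughly 5 days per year
n_sites = 200
print(f"P(entire grid working) = {site_availability ** n_sites:.3f}")   # ~0.049
```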

Page 35

Monitoring Tools

1. GIIS Monitor
2. GIIS Monitor graphs
3. Sites Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce – VO view
8. GridIce – fabric view
9. Certificate Lifetime Monitor

Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005

Page 36

Google Grid Map

Page 37

Freedom of Choice

Tool for VOs to make a site selection based on a set of standard tests

Page 38

Success Rate: WISDOM

Average success rate for jobs: 70-80% (single submit)

[Chart: “Success rate (August)”: number of jobs per day (0-5000) and success rate (0-1) over days 1-27, showing registered, success (final status), aborted (final status) and cancelled (final status) jobs; success rate = success / (registered - cancelled)]

Source: N. Jacq, LPC and IN2P3/CNRS, “Biomedical DC Preliminary Report, WISDOM Application”, 5 Sept 2005

Page 39

Failure reasons vary: Biomed data challenge

Abort reasons distribution (10/07/2005 - 27/08/2005):
  63%  mismatching resources
  28%  wrong configuration
   4%  network/connection failures
   4%  proxy problems
   1%  JDL problems

(The dominant ‘mismatching resources’ category covers a failing middleware component or a wrong request in the job JDL.)

[Second chart: abort reasons distribution for all VOs, 01/2005 - 06/2005]

Source: N. Jacq, LPC and IN2P3/CNRS, “Biomedical DC Preliminary Report, WISDOM Application”, 5 Sept 2005

Page 40

Is the Grid middleware current?

Common causes of failure:
  Specified impossible combination of resources
  Wrong middleware version at the site
  Not enough space in the proper place ($TMPDIR)
  Environment configuration ($VO_vo_SW_DIR, $LFC_HOST, …)

[Chart: number of sites publishing each middleware release (LCG-2_4_0, LCG-2_3_1, LCG-2_3_0) per week, 12/02/2005 - 25/06/2005; y-axis 0-140 sites with release]

Page 41

Assorted issues at the fabric layer

Does it workHow can we make it better

Page 42

Going from here

Many nice things to do:

Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?

Scheduling user jobs: both the VO and the site want to set part of the priorities …

Auditing and user tracing in this highly dynamic system: can we know for sure who is running what, where? Or whether a user is DDoS-ing the White House right now? Out of 221 sites, we know for certain there is a compromise!

Page 43

More things to do …

Sparse file access: access data efficiently over the wide area

Can we do something useful with the large disks in all worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)

There are new grid software releases every month, and the configuration comes from different sources … How can we combine and validate all these configurations quickly and easily?

Page 44

Job submission live monitor

Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre

Page 45

Outlook

Towards a global persistent grid infrastructure: interoperability and persistency that are project independent

  Europe: EGEE-2, ‘European Grid Organisation’
  US: Open Science Grid
  Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …

GIN aim: cross-submission and file access by end 2006

Extension to industry; first: industrial engineering, financial scenario simulations

New ‘middleware’: we are just starting to learn how it should work

Extend more in sharing of structured data