Petabyte scale data challenge

Petabyte Scale Data Challenge - Worldwide LHC Computing Grid. ASGC/Jason Shih, Computex, Jun 2nd, 2010


Transcript of Petabyte scale data challenge

Page 1: Petabyte scale data challenge

Petabyte Scale Data Challenge - Worldwide LHC Computing Grid

ASGC/Jason Shih, Computex, Jun 2nd, 2010

Page 2: Petabyte scale data challenge

Outline

- Objectives & Milestones
- WLCG experiments and the ASGC Tier-1 Center
- Petabyte Scale Challenge
- Storage Management System
- System Architecture, Configuration and Performance

Page 3: Petabyte scale data challenge

Objectives

- Building sustainable research and collaboration infrastructure
- Supporting research by e-Science in data-intensive sciences and applications that require cross-disciplinary distributed collaboration

Page 4: Petabyte scale data challenge

ASGC Milestone

- Operational since the deployment of LCG0 in 2002
- ASGC CA established in 2005 (joined IGTF the same year)
- Tier-1 Center responsibility started in 2005
- The federated Taiwan Tier-2 center (Taiwan Analysis Facility, TAF) is also collocated at ASGC
- Representative of the EGEE e-Science Asia Federation since joining EGEE in 2004
- Providing Asia Pacific Regional Operation Center (APROC) services to the region-wide WLCG/EGEE production infrastructure since 2005
- Initiated the Avian Flu Drug Discovery Project in collaboration with EGEE in 2006
- Start of the EUAsiaGrid Project in April 2008

Page 5: Petabyte scale data challenge

LHC First Beam – Computing at the Petascale

- ATLAS: general purpose, pp, heavy ions
- CMS: general purpose, pp, heavy ions
- ALICE: heavy ions, pp
- LHCb: B-physics, CP violation

Page 6: Petabyte scale data challenge

Size of LHC Detector

(Figure: the ATLAS and CMS detectors shown at scale alongside CERN's Bld. 40)

Page 7: Petabyte scale data challenge

UNESCO Information Preservation debate, April 2007 - Jamie Shiers @ cern.ch

http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html

Standard Cosmology
- Good model from 0.01 sec after the Big Bang
- Supported by considerable observational evidence

Elementary Particle Physics
- From the Standard Model into the unknown: towards energies of 1 TeV and beyond: the Terascale

Towards Quantum Gravity
- From the unknown into the unknown...

(Figure: timeline of the universe; axes: time vs. energy, density, temperature)

Page 8: Petabyte scale data challenge

WLCG Timeline

- First beam in the LHC: Sep. 10, 2008
- Severe incident after ~3 weeks of operation (3.5 TeV)

Page 9: Petabyte scale data challenge


ASGC - Introduction

- Large Hadron Collider (LHC)
- Avian Flu Drug Discovery
- Grid Application Platform
- A Worldwide Grid Infrastructure: >250 sites, 48 countries; >68,000 CPUs, >25 PetaBytes; >10,000 users, >200 VOs; >150,000 jobs/day
- Asia Pacific Regional Operation Center
- Best Demo Award of EGEE'07
- Lightweight Problem Solving Framework
- 1. Most reliable T1: 98.83%; 2. Very highly performing and most stable site in CCRC08
- Max CERN/T1-ASGC point-to-point inbound: 9.3 Gbps

Page 10: Petabyte scale data challenge

Collaborating e-Infrastructures

“Production” = Reliable, sustainable, with commitments to quality of service

TWGRID

EUAsiaGrid

Potential for linking ~80 countries

Page 11: Petabyte scale data challenge

WLCG Computing Model - The Tier Structure

- Tier-0 (CERN): data recording, initial data reconstruction, data distribution
- Tier-1 (11 countries): permanent storage, re-processing, analysis
- Tier-2 (~130 centres): simulation, end-user analysis

Page 12: Petabyte scale data challenge

Enabling Grids for E-sciencE (slide from EGEE'07, Budapest, 1-5 October 2007)

Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, ...

Page 13: Petabyte scale data challenge

Why Petabyte? Challenges

Why Petabyte?
- Experiment computing model
- Comparison with conventional data management

Challenges
- Performance: LAN and WAN activities
- Sufficient bandwidth between the CPU farm and storage; eliminate uplink bottlenecks (switch tiers) - see the sketch after this list
- Fast response to critical events
- Fabric infrastructure & service level agreements
- Scalability and manageability: robust DB engine (Oracle RAC), knowledge base and adequate administration (training)
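To make the uplink-bottleneck point concrete, here is a minimal back-of-the-envelope sketch in Python; the port counts and speeds are illustrative assumptions, not ASGC's actual topology:

```python
# Estimate edge-switch oversubscription: aggregate downlink capacity
# divided by aggregate uplink capacity (illustrative figures only).

def oversubscription(ports: int, port_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    return (ports * port_gbps) / (uplinks * uplink_gbps)

# 48 x 1 GbE server ports behind 4 x 10 GbE uplinks: close to wire speed.
print(f"{oversubscription(48, 1, 4, 10):.2f}:1")  # -> 1.20:1

# The same switch with a single 10 GbE uplink: the uplink is the bottleneck.
print(f"{oversubscription(48, 1, 1, 10):.2f}:1")  # -> 4.80:1
```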

Page 14: Petabyte scale data challenge

Tier Model and Data Management Components

Page 15: Petabyte scale data challenge

WLCG Experiment Computing Model

Page 16: Petabyte scale data challenge

ATLAS T1 Data Flow

(Data-flow diagram: Tier-0 -> Tier-1 tape and disk buffer, re-processing through the Tier-1 CPU farm, exchange with other Tier-1s, and distribution to each Tier-2)

Tier-0 -> Tier-1 tape: RAW at 1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day
Tier-0 -> Tier-1 disk buffer (RAW + ESD2 + AODm2 combined): 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day
Tier-0 total export (RAW, ESD (2x), AODm (10x)): 1 Hz, 85K files/day, 720 MB/s

Per-stream rates:
RAW:   1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day
ESD1:  0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day
ESD2:  0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day
AOD2:  10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day
AODm1: 500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day
AODm2: 500 MB/file, 0.004-0.04 Hz, 0.34K-3.4K files/day, 2-20 MB/s, 0.16-1.6 TB/day (rate depends on the link)

Plus simulation and analysis data flows.
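Each line above follows from file size x file rate; a small Python sketch, using the RAW figures from the table, shows the arithmetic (the only inputs are the quoted file size and rate):

```python
# Derive bandwidth, files/day and volume/day from file size and rate,
# using the RAW stream quoted above: 1.6 GB/file at 0.02 Hz.

SECONDS_PER_DAY = 86_400

def stream_rates(file_size_mb: float, rate_hz: float):
    bandwidth_mb_s = file_size_mb * rate_hz                  # MB/s
    files_per_day = rate_hz * SECONDS_PER_DAY                # files/day
    volume_tb_day = bandwidth_mb_s * SECONDS_PER_DAY / 1e6   # TB/day
    return bandwidth_mb_s, files_per_day, volume_tb_day

bw, files, vol = stream_rates(file_size_mb=1600, rate_hz=0.02)
print(f"RAW: {bw:.0f} MB/s, {files / 1000:.1f}K files/day, {vol:.2f} TB/day")
# -> RAW: 32 MB/s, 1.7K files/day, 2.76 TB/day (the slide rounds to 2.7)
```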

Page 17: Petabyte scale data challenge

WLCG Tier-1 - Defined Minimum Levels of Service

- Response time here refers to the maximum delay before action is taken.
- Mean time to repair the service is also crucial, but it is covered indirectly through the required availability target (see the sketch below).
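One way to see why the availability target covers repair time indirectly: availability = MTBF / (MTBF + MTTR), so for a fixed failure rate the target puts a ceiling on MTTR. A minimal sketch with made-up numbers:

```python
# Availability as a function of MTBF and MTTR (illustrative values only).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A service failing roughly once a month (MTBF ~ 730 h):
for mttr in (4, 12, 72):
    print(f"MTTR {mttr:>2} h -> availability {availability(730, mttr):.3%}")
# MTTR  4 h -> availability 99.455%
# MTTR 12 h -> availability 98.383%
# MTTR 72 h -> availability 91.022%
```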

Page 18: Petabyte scale data challenge

WLCG MoU & ASGC Resource Level - Pledged Resources and Projection

(Chart: pledged (MoU) vs. installed CPU (kSI2k) and disk/tape capacity (TB), 2005-2010; series: CPU MoU, CPU, Disk, Tape, Disk MoU, Tape MoU)

Year       CPU (HEP2k6)   Disk (PB)   Tape (PB)
End 2009   29.5K          2.6         2.4
MoU 2009   20K            3.0         3.0
MoU 2010   28K            3.5         3.5

Page 19: Petabyte scale data challenge

Data Management System - CASTOR V1 vs. V2

CASTOR V1 (CERN Advanced STORage)
- Satisfactorily serving tens of thousands of requests/day per TB of disk cache
- Limitation: 1M files in cache
- Tape movement API not flexible

CASTOR V2
- DB-centric architecture
- Scheduling feature
- GSI and Kerberos
- Resource management
- Resource handling

Page 20: Petabyte scale data challenge

CASTOR Configurations - Current Infrastructure

- Shared core services
  Serving: ATLAS and CMS
  Services: Stager, NS, DLF, Repack, and LSF
- DB clusters
  Two DB clusters (SRM and NS); 5 DB services split into the two clusters, 5 Oracle instances
- Total capacity: 0.63 PB and 0.7 PB for CMS and ATLAS respectively
- Current usage: 63% and 44% for CMS and ATLAS

Page 21: Petabyte scale data challenge

CASTOR Configurations (cont’) - Disk Cache

Disk pools & servers
- Performance (IOPS): with 0.5 kB IO size, 76.4k read and 54k write; both decrease by about 9% when the IO size increases to 4 kB (see the sketch after this list)
- 80 disk servers (+6 coming online by the end of the 3rd week of Oct)
- Total capacity: 1.67 PB (0.3 PB allocated dynamically)
- Current usage: 0.79 PB (~58% usage)
- 14 disk pools (8 for ATLAS, 3 for CMS, and another three for bio, SAM, and dynamic)
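For scale, IOPS translate into bandwidth as IOPS x I/O size; a quick sketch using the figures above (the 9% reduction is applied as quoted):

```python
# Convert the measured IOPS figures above into aggregate bandwidth.

def iops_to_mb_s(iops: float, io_size_kb: float) -> float:
    return iops * io_size_kb / 1024  # MB/s

print(f"read  @ 0.5 kB: {iops_to_mb_s(76_400, 0.5):6.1f} MB/s")  # ~37.3
print(f"write @ 0.5 kB: {iops_to_mb_s(54_000, 0.5):6.1f} MB/s")  # ~26.4

# ~9% fewer IOPS at 4 kB, but each operation moves 8x more data:
print(f"read  @ 4 kB:   {iops_to_mb_s(76_400 * 0.91, 4):6.1f} MB/s")  # ~271.6
```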

Page 22: Petabyte scale data challenge

Disk Pool Configuration - T1 MSS (CASTOR)

(Chart: installed vs. free capacity (TB, 0-450) and number of disk servers (0-16) per disk pool: atlasGROUPDISK, biomedD1T0, atlasHotDisk, cmsWANOUT, atlasPrdD0T1, atlasStage, dteamD0T0, atlasMCTAPE, atlasScratchDisk, atlasPrdD1T0, cmsLTD0T1, atlasMCDISK, cmsPrdD1T0, Standby)

Page 23: Petabyte scale data challenge

Distribution of Free Capacity - Per Disk Server vs. per Pool

(Chart: free capacity (TB, 0-250) per disk pool: atlasGROUPDISK, atlasHotDisk, atlasMCDISK, atlasMCTAPE, atlasPrdD0T1, atlasPrdD1T0, atlasScratchDisk, atlasStage, biomedD1T0, cmsLTD0T1, cmsPrdD1T0, cmsWANOUT, dteamD0T0, Standby)

Page 24: Petabyte scale data challenge

Storage Server Generation - Drive vs. Total Capacity

(Chart: number of RAID subsystems (0-40) vs. total capacity per storage generation (TB, 0-800))

Page 25: Petabyte scale data challenge

CASTOR Configurations (cont’) - Core Service Overview

Service Type   OS Level          Release    Remark
Core           SLC 4.7/x86-64    2.1.7-19   Stager/NS/DLF
SRM            SLC 4.7/x86-64    2.7-18     3 head nodes
Disk Svr.      SLC 4.7/x86-64    2.1.7-19   80 in Q3 2k9 (20+ in Q4)
Tape Svr.      SLC 4.7/32+64     2.1.8-8    x86-64 OS deployed

Page 26: Petabyte scale data challenge

CASTOR Configurations (cont’) - CMS Disk Cache: Current Resource Level

Disk Pool     Capacity/Job Limit   Disk Servers   Tape Pool/Capacity
cmsLTD0T1     278TB/488            9              *
cmsPrdD1T0    284TB/1560           13             -
cmsWanOut     72TB/220             4              -

* Depends on tape family.

Page 27: Petabyte scale data challenge

CASTOR Configurations (cont’) - Atlas Disk Cache: Current Resource Level

Space Token        Cap/Job Limit   Disk Servers   Tape Pool/Cap.
atlasMCDISK        163TB/790       8              -
atlasMCTAPE        38TB/80         2              atlasMCtp/39TB
atlasPrdD1T0       278TB/810       15             -
atlasPrdD0T1       61TB/210        3              atlasPrdtp/105TB
atlasScratchDisk   28TB/80         1              -
atlasHotDisk       2TB/40          2              -
atlasGROUPDISK     19TB/40         1              -
Total              950TB/1835      46             -

Page 28: Petabyte scale data challenge

IDC Collocation
- Facility installation completed on Mar 27th
- Tape system delayed until after Apr 9th
- Realignment
- RMA for faulty parts

Page 29: Petabyte scale data challenge

Storage Farm
- ~110 RAID subsystems deployed since 2003
- Supporting both Tier-1 and Tier-2 storage fabric
- DAS connection to frontend blade servers
- Flexible switching of frontend servers according to performance requirements
- 4-8 Gb Fibre Channel connectivity

Page 30: Petabyte scale data challenge

CASTOR Configurations (cont’) - Tape Pool

Tape Pool        Capacity (TB)/Usage   Drive Dedication   LTO3/4 Mixed
atlasMCtp        8.98/40%              N                  Y
atlasPrdtp       101/65%               N                  Y
cmsCSA08cruzet   15.6/46%              N                  N
cmsCSA08reco     5/0%                  N                  N
cmsCSAtp         639/99%               N                  Y
cmsLTtp          34.4/44%              N                  N
dteamTest        3.5/1%                N                  N

Page 31: Petabyte scale data challenge

MSS Monitoring Services
- Standard Nagios probes: NRPE + customized plugins (a plugin sketch follows below)
- SMS to OSE/SM for all types of critical alarms
- Availability metrics
- Tape metrics (SLS): throughput, capacity & scheduler per VO and disk pool
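The customized NRPE plugins mentioned above follow the standard Nagios plugin contract: exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN plus a one-line status with optional perfdata. A minimal sketch of what such a disk-pool check could look like; the pool name, thresholds and the query stub are hypothetical:

```python
#!/usr/bin/env python
# Minimal Nagios/NRPE-style check: warn/alarm on disk-pool usage.
# The thresholds and the get_pool_usage() stub are illustrative only.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def get_pool_usage(pool: str) -> float:
    """Stub: a real plugin would query the stager/SRM for pool usage."""
    return 0.58  # pretend the pool is 58% full

def main() -> int:
    pool = sys.argv[1] if len(sys.argv) > 1 else "atlasPrdD1T0"
    warn, crit = 0.80, 0.90
    try:
        usage = get_pool_usage(pool)
    except Exception as exc:
        print(f"UNKNOWN - cannot query pool {pool}: {exc}")
        return UNKNOWN
    status, code = "OK", OK
    if usage >= crit:
        status, code = "CRITICAL", CRITICAL
    elif usage >= warn:
        status, code = "WARNING", WARNING
    # Standard plugin output format: STATUS - message | perfdata
    print(f"{status} - pool {pool} is {usage:.0%} full "
          f"| usage={usage:.2f};{warn};{crit}")
    return code

if __name__ == "__main__":
    sys.exit(main())
```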

Page 32: Petabyte scale data challenge

MSS Tape System - Expansion/Upgrade Planning

Before the incident:
- 8 LTO3 + 4 LTO4 drives
- 720 TB with LTO3, 530 TB with LTO4

May 2009:
- Two LTO3 drives
- MES: 6 LTO4 drives at the end of May
- Capacity: 1.3 PB (old, LTO3/4 mixed) + 0.8 PB (LTO4)

New S54 model introduced in mid-2009:
- 2K slots with tier model
- Required: ALMS upgrade, enhanced gripper

MES Q3 2009:
- 18 LTO4 drives
- HA implementation resumes in Q4

Page 33: Petabyte scale data challenge

Expansion Planning

2008:
- 0.5 PB expansion of the tape system in Q2
- Met MoU target in mid-Nov
- 1.3 MSI2k per rack based on the recent E5450 processor

2009 Q1:
- 150 SMP/QC blade servers
- RAID subsystems considering 2 TB per drive
- 42 TB net capacity per chassis and 0.75 PB in total (rough cross-check below)

2009 Q3-4:
- 18 LTO4 drives - mid-Oct
- 330 Xeon QC (SMP, Intel 5450) blade servers
- 2nd phase tape MES: 5 LTO4 drives + HA
- 3rd phase tape MES: 6 LTO4 drives
- ETA for the 0.8 PB expansion delivery: mid-Nov
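A rough cross-check of the "42 TB net per chassis, 0.75 PB in total" figure; the chassis geometry and RAID overhead here are assumptions for illustration, not the actual ASGC configuration:

```python
# Back-of-the-envelope RAID capacity check
# (assuming a 24-bay chassis with RAID-6 plus one hot spare).

drive_tb = 2.0          # 2 TB per drive, as planned for 2009 Q1
bays = 24               # assumed chassis size
parity, spares = 2, 1   # assumed RAID-6 with one hot spare

net_tb = (bays - parity - spares) * drive_tb
print(f"net per chassis: {net_tb:.0f} TB")          # -> 42 TB

print(f"chassis for 0.75 PB: {750 / net_tb:.0f}")   # -> ~18
```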

Page 34: Petabyte scale data challenge

Computing/Storage System Infrastructure

(Diagram: ASGC computing/storage system infrastructure, Data Center C3 Archive Room)
- ASGC CASTOR2 disk farm: CASTOR2 disk servers
- Core services: CE, RB, DPM, PX, BDII, etc.
- CASTOR2 tape servers
- Worker nodes in BladeCenter chassis: 64 x IBM HS20 blades, 142 x IBM HS21 blades, 20 x Quanta blades
- Uplinks: 4 x GE (SX) to the ASGC distribution switch in Rack#49 (links to Tier-1 servers); 2 x GE (LX) to the 4F M160 (links to HK and JP Tier-2s); 2 x GE (LX) to the 4F TaipeiGigaPoP-7609 (links to TW Tier-2s)
- DC power: SMR 48V/100A, battery banks #1+#2 and #3+#4

Page 35: Petabyte scale data challenge

Throughput of WLCG Experiments
- Throughput is defined as job efficiency x number of running jobs (see the sketch below)
- The characteristics of the 4 LHC experiments show that inefficiency is due to poor coding
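A minimal sketch of that throughput metric; the per-experiment efficiencies and job counts below are placeholders, not measured values:

```python
# Throughput = job efficiency x number of running jobs, per experiment.
# All numbers below are placeholders, not measured WLCG values.

experiments = {
    "ATLAS": (0.90, 1200),  # (CPU/wall-clock efficiency, running jobs)
    "CMS":   (0.85, 900),
    "ALICE": (0.70, 400),
    "LHCb":  (0.95, 300),
}

for name, (eff, jobs) in experiments.items():
    print(f"{name:5s}: {eff:.2f} x {jobs:4d} = {eff * jobs:6.0f} effective jobs")
```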

Page 36: Petabyte scale data challenge

Reliability from Different Perspectives

Page 37: Petabyte scale data challenge

Summary

- Deploy a highly scalable DM system and a performance-driven storage infrastructure
  - Eliminate possible complexity of the SRM abstraction layer
  - Resource utilization, provisioning and optimization
- From POC to production, the challenges remain: Data Challenge, Service Challenge, CCRC08, STEP09, etc.
- The motivation appears clear for other domains: medical, climate, cosmology
- Operation-wide:
  - Robust database setup
  - Knowledge base for fabric infrastructure operation
  - Fast enough event processing and documentation
- Consider use cases beyond WLCG data management: commonality with many other disciplines in the EGEE infrastructure
- Actively participate in e-Science collaboration within the region