NCAR’s Data-Centric Supercomputing Environment Yellowstone

December 21, 2011 – Anke Kamrath, OSD Director/CISL – [email protected]


Transcript of NCAR’s Data-Centric Supercomputing Environment Yellowstone

Page 1: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NCAR’s Data-Centric Supercomputing Environment: Yellowstone

December 21, 2011
Anke Kamrath, OSD Director/CISL
[email protected]

Page 2: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Overview
• Strategy
  – Moving from Process to Data-Centric Computing
• HPC/Data Architecture
  – What we have today at ML
  – What’s planned for NWSC
• Storage/Data/Networking – Data in Flight
  – WAN and High-Performance LAN
  – Central Filesystem Plans
  – Archival Plans

NWSC Now Open!

Page 3: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Evolving the Scientific Workflow

• Common data movement issues
  – Time-consuming to move data between systems
  – Bandwidth to the archive system is insufficient
  – Lack of sufficient disk space
• Need to evolve data management techniques
  – Workflow management systems
  – Standardize metadata information
  – Reduce/eliminate duplication of datasets (e.g., CMIP5)
• User education
  – Effective methods for understanding workflow
  – Effective methods for streamlining workflow

Page 4: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Traditional Workflow

Process-Centric Data Model

Page 5: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Evolving Scientific Workflow: Information-Centric Data Model

Page 6: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Current Resources @ Mesa Lab

BLUEFIRE, LYNX, GLADE

Page 7: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NWSC: Yellowstone Environment

[Architecture diagram of the NWSC Yellowstone environment, showing:]
• Yellowstone – HPC resource, 1.55 PFLOPS peak
• GLADE – central disk resource, 11 PB (2012), 16.4 PB (2014)
• Geyser & Caldera – DAV clusters
• NCAR HPSS Archive – 100 PB capacity, ~15 PB/yr growth
• High-bandwidth, low-latency HPC and I/O networks – FDR InfiniBand and 10Gb Ethernet
• 1Gb/10Gb Ethernet (40Gb+ future)
• Science gateways (RDA, ESG), data transfer services, remote vis
• Partner sites, XSEDE sites

Page 8: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NWSC-1 Resources in a Nutshell
• Centralized Filesystems and Data Storage Resource (GLADE)
  – >90 GB/sec aggregate I/O bandwidth, GPFS filesystems
  – 10.9 PetaBytes initially, growing to 16.4 PetaBytes in 1Q2014
• High Performance Computing Resource (Yellowstone)
  – IBM iDataPlex cluster with Intel Sandy Bridge EP† processors with Advanced Vector Extensions (AVX)
  – 1.552 PetaFLOPs peak; 29.8 bluefire-equivalents
  – 4,662 nodes; 74,592 cores
  – 149.2 TeraBytes total memory
  – Mellanox FDR InfiniBand full fat-tree interconnect
• Data Analysis and Visualization Resource (Geyser & Caldera)
  – Large Memory System with Intel Westmere EX processors
    • 16 nodes, 640 cores, 16 TeraBytes memory, 16 NVIDIA Kepler GPUs
  – GPU-Computation/Vis System with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 1 TeraByte memory, 32 NVIDIA Kepler GPUs
  – Knights Corner System with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 992 KC cores, 1 TeraByte memory – Nov ’12 delivery

† “Sandy Bridge EP” is the Intel® Xeon® E5-2670 processor

Page 9: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

GLADE
• 10.94 PB usable capacity; 16.42 PB usable in 1Q2014 (see the capacity check below)
• Estimated initial file system sizes
  – collections ≈ 2 PB (RDA, CMIP5 data)
  – scratch ≈ 5 PB (shared, temporary space)
  – projects ≈ 3 PB (long-term, allocated space)
  – users ≈ 1 PB (medium-term work space)
• Disk Storage Subsystem
  – 76 IBM DCS3700 controllers & expansion drawers
  – 90 2-TB NL-SAS drives/controller; add 30 3-TB NL-SAS drives/controller (1Q2014)
• GPFS NSD Servers
  – 91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
• I/O Aggregator Servers (GPFS, GLADE-HPSS connectivity)
  – 10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
• High-performance I/O interconnect to HPC & DAV
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth/node
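As a cross-check of the capacity figures, here is a minimal sketch relating the quoted drive counts to the usable numbers; the roughly 80% usable-to-raw ratio is an inference (consistent with RAID and filesystem overhead), not a figure quoted on the slide.

```python
# Back-of-the-envelope check of the GLADE capacity figures above.
controllers = 76
initial_raw_pb = controllers * 90 * 2 / 1000   # 90 x 2-TB NL-SAS drives per controller
added_raw_pb = controllers * 30 * 3 / 1000     # +30 x 3-TB drives per controller (1Q2014)

print(f"initial raw ~ {initial_raw_pb:.2f} PB vs 10.94 PB usable "
      f"({10.94 / initial_raw_pb:.0%} of raw)")
print(f"expanded raw ~ {initial_raw_pb + added_raw_pb:.2f} PB vs 16.42 PB usable "
      f"({16.42 / (initial_raw_pb + added_raw_pb):.0%} of raw)")
```

Both configurations come out at about 80% usable, so the 10.94 PB and 16.42 PB figures are consistent with the stated drive counts.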

Page 10: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NCAR Disk Storage Capacity Profile

[Chart: total usable centralized filesystem storage (PB), Jan-2010 through Jan-2016, broken out by bluefire, GLADE (Mesa Lab), and GLADE (NWSC); vertical axis runs 0 to 18 PB.]

Page 11: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Yellowstone: NWSC High-Performance Computing Resource
• Batch Computation
  – 4,662 IBM dx360 M4 nodes; 16 cores, 32 GB memory per node
  – Intel Sandy Bridge EP processors with AVX; 2.6 GHz clock
  – 74,592 cores total; 1.552 PFLOPs peak (see the check below)
  – 149.2 TB total DDR3-1600 memory
  – 29.8 Bluefire equivalents
• High-Performance Interconnect
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth/node
  – <2.5 µs latency (worst case)
  – 31.7 TB/s bisection bandwidth
• Login/Interactive
  – 6 IBM x3650 M4 nodes; Intel Sandy Bridge EP processors with AVX
  – 16 cores & 128 GB memory per node
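The quoted peak follows from the node count, clock, and AVX vector width. A minimal sketch, assuming 8 double-precision FLOPs per clock per core (the usual figure for Sandy Bridge EP with AVX):

```python
# Sanity check of the quoted 74,592 cores and 1.552 PFLOPs peak.
nodes = 4662
cores_per_node = 16
clock_hz = 2.6e9
dp_flops_per_cycle = 8          # assumption: 4-wide AVX add + 4-wide AVX multiply per core

cores = nodes * cores_per_node
peak_pflops = cores * clock_hz * dp_flops_per_cycle / 1e15
print(f"{cores:,} cores, ~{peak_pflops:.3f} PFLOPs peak")   # 74,592 cores, ~1.552 PFLOPs
```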

Page 12: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NCAR HPC Profile

[Chart: peak PFLOPs at NCAR, Jan-2004 through Jan-2016, by system:
• IBM POWER4/Colony (bluesky)
• IBM BlueGene/L (frost, plus frost upgrade)
• IBM p575/8 (78) POWER5/HPS (bluevista)
• IBM p575/16 (112) POWER5+/HPS (blueice)
• IBM Power 575/32 (128) POWER6/DDR-IB (bluefire)
• Cray XT5m (lynx)
• IBM iDataPlex/FDR-IB (yellowstone)
Yellowstone represents roughly 30x Bluefire performance.]

Page 13: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Geyser and Caldera: NWSC Data Analysis & Visualization Resource
• Geyser: large-memory system
  – 16 IBM x3850 nodes; Intel Westmere-EX processors
  – 40 cores, 1 TB memory, 1 NVIDIA Kepler Q13H-3 GPU per node
  – Mellanox FDR full fat-tree interconnect
• Caldera: GPU computation/visualization system
  – 16 IBM x360 M4 nodes; Intel Sandy Bridge EP/AVX
  – 16 cores, 64 GB memory, 2 NVIDIA Kepler Q13H-3 GPUs per node
  – Mellanox FDR full fat-tree interconnect
• Knights Corner system (November 2012 delivery)
  – Intel Many Integrated Core (MIC) architecture
  – 16 IBM Knights Corner nodes; 16 Sandy Bridge EP/AVX cores, 64 GB memory, 1 Knights Corner adapter per node
  – Mellanox FDR full fat-tree interconnect

Page 14: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Erebus: Antarctic Mesoscale Prediction System (AMPS)
• IBM iDataPlex Compute Cluster
  – 84 IBM dx360 M4 nodes; 16 cores, 32 GB memory per node
  – Intel Sandy Bridge EP; 2.6 GHz clock
  – 1,344 cores total; 28 TFLOPs peak
  – Mellanox FDR InfiniBand full fat-tree
  – 0.54 Bluefire equivalents
• Login Nodes
  – 2 IBM x3650 M4 nodes
  – 16 cores & 128 GB memory per node
• Dedicated GPFS filesystem
  – 57.6 TB usable disk storage
  – 9.6 GB/sec aggregate I/O bandwidth

Erebus, on Ross Island, is Antarctica’s most famous volcanic peak and one of the largest volcanoes in the world, within the top 20 in total size and reaching a height of 12,450 feet.


Page 15: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Yellowstone Software
• Compilers, Libraries, Debugger & Performance Tools
  – Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer): 50 concurrent users
  – Intel VTune Amplifier XE performance optimizer: 2 concurrent users
  – PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof): 50 concurrent users
  – PGI CDK GPU Version (Fortran, C, C++, pgdbg debugger, pgprof): for DAV systems only, 2 concurrent users
  – PathScale EkoPath (Fortran, C, C++, PathDB debugger): 20 concurrent users
  – Rogue Wave TotalView debugger: 8,192 floating tokens
  – IBM Parallel Environment (POE), including IBM HPC Toolkit
• System Software
  – LSF-HPC batch subsystem / resource manager (IBM has purchased Platform Computing, Inc., developers of LSF-HPC)
  – Red Hat Enterprise Linux (RHEL) Version 6
  – IBM General Parallel Filesystem (GPFS)
  – Mellanox Universal Fabric Manager
  – IBM xCAT cluster administration toolkit

Page 16: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

NCAR HPSS Archive Resource
• NWSC
  – Two SL8500 robotic libraries (20k cartridge capacity)
  – 26 T10000C tape drives (240 MB/sec I/O rate each) and T10000C media (5 TB/cartridge, uncompressed) initially; +20 T10000C drives ~Nov 2012
  – >100 PB capacity (see the check below)
  – Current growth rate ~3.8 PB/year
  – Anticipated NWSC growth rate ~15 PB/year
• Mesa Lab
  – Two SL8500 robotic libraries (15k cartridge capacity)
  – Existing data (14.5 PB): 1st & 2nd copies will be ‘oozed’ to new media @ NWSC, beginning in 2012
  – New data @ Mesa: disaster-recovery data only
  – T10000B drives & media to be retired
  – No plans to move the Mesa Lab SL8500 libraries (more costly to move than to buy new under the AMSTAR subcontract)

Plan to release an “AMSTAR-2” RFP in 1Q2013, with a target of first equipment delivery during 1Q2014 to further augment the NCAR HPSS Archive.
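A minimal sketch relating the cartridge count, media capacity, and growth rate; it assumes the 20k-cartridge figure is the combined slot count of the two NWSC libraries.

```python
# Rough check of the NWSC archive capacity and drive-bandwidth headroom.
cartridges = 20_000                 # assumption: combined slots of the two SL8500 libraries
tb_per_cartridge = 5                # T10000C media, uncompressed
capacity_pb = cartridges * tb_per_cartridge / 1000
print(f"tape capacity ~ {capacity_pb:.0f} PB")                  # ~100 PB, matching ">100 PB"

drives, drive_mb_s = 26, 240
aggregate_gb_s = drives * drive_mb_s / 1000                     # ~6.2 GB/s of tape bandwidth
growth_gb_s = 15e15 / (365 * 24 * 3600) / 1e9                   # ~15 PB/yr as a steady rate
print(f"drive bandwidth ~ {aggregate_gb_s:.1f} GB/s vs ~{growth_gb_s:.2f} GB/s average ingest")
```

On these assumptions the initial drive set already provides an order of magnitude more bandwidth than the anticipated average ingest rate.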

Page 17: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Yellowstone Physical Infrastructure

Total power required: ~2.13 MW
• Yellowstone: ~1.9 MW
• GLADE: 0.134 MW
• Geyser & Caldera: 0.056 MW
• Erebus (AMPS): 0.087 MW

Racks by resource:
• Yellowstone: 65 iDataPlex racks (72 nodes per rack); 9 19” racks (9 Mellanox FDR core switches); 1 19” rack (login, service, management nodes)
• GLADE: 20 NSD server, controller and storage racks; 1 19” rack (I/O aggregator nodes, management, IB & Ethernet switches)
• Geyser & Caldera: 1 iDataPlex rack (GPU-comp & Knights Corner); 2 19” racks (large memory, management, IB switch)
• Erebus (AMPS): 1 iDataPlex rack; 1 19” rack (login, IB, NSD, disk & management nodes)

Page 18: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Yellowstone allocations (% of resource)

NCAR’s 29% share represents 170 million core-hours per year on Yellowstone alone (compared with less than 10 million per year on Bluefire), plus a similar fraction of the DAV and GLADE resources. (A rough core-hour check follows below.)
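As a rough check on the 170 million core-hour figure, a sketch under the assumption that core-hours are simply cores times wall-clock hours, before any availability factor:

```python
# Rough check of NCAR's 29% share of Yellowstone in core-hours per year.
cores = 74_592
hours_per_year = 8_760
ncar_share = 0.29

theoretical = cores * hours_per_year * ncar_share / 1e6
print(f"theoretical NCAR share ~ {theoretical:.0f}M core-hours/yr")   # ~190M
# The quoted 170M/yr is about 90% of this, i.e. roughly consistent once
# downtime and scheduling overhead are allowed for.
```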

Page 19: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

Yellowstone Schedule

[Gantt chart of the current schedule, with a timeline from 31 Oct 2011 through 15 Oct 2012, showing the phases in sequence: infrastructure preparation; storage & InfiniBand delivery & installation; test systems delivery & installation; production systems delivery; integration & checkout; acceptance testing; ASD & early users; production science.]

Page 20: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Data in Flight to NWSC
• NWSC Networking
• Central Filesystem
  – Migrating from GLADE-ML to GLADE-NWSC
• Archive
  – Migrating HPSS data from ML to NWSC

Page 21: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Page 22: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


BiSON to NWSC
• Initially three 10G circuits active
  – Two 10G connections back to Mesa Lab for internal traffic
  – One 10G direct to FRGP for general Internet2 / NLR / Research and Education traffic
• Options for dedicated 10G connections for high-performance computing to other BiSON members
• System is engineered for 40 individual lambdas
  – Each lambda can be a 10G, 40G, or 100G connection
  – Independent lambdas can be sent each direction around the ring (two ADVA shelves at NWSC, one for each direction)
• With a major upgrade the system could support 80 lambdas: 100 Gbps × 80 channels × 2 paths = 16 Tbps

Page 23: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


High-Performance LAN: Data Center Networking (DCN)
• High-speed data center network with 1G and 10G client-facing ports for the supercomputer, mass storage, and other data center components
• Redundant design (e.g., multiple chassis and separate module connections)
• Future option for 100G interfaces
• Juniper awarded the RFP after NSF approval
  – Access switches: QFX3500 series
  – Network core and WAN edge: EX8200 series switch/routers
  – Includes spares for lab/testing purposes
• Juniper training for NETS staff early Jan 2012
• Deploy Juniper equipment late Jan 2012

Page 24: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Moving Data… Ugh

Page 25: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Migrating GLADE Data
• Temporary work spaces (/glade/scratch, /glade/user)
  – No data will automatically be moved to NWSC
• Allocated project spaces (/glade/projxx)
  – New allocations will be made for the NWSC
  – No data will automatically be moved to NWSC
  – A data transfer option will let users move the data they need
• Data collections (/glade/data01, /glade/data02)
  – CISL will move data from ML to NWSC
  – Full production capability will need to be maintained during the transition

Network impact (rough estimate below):
  – Current storage maximum performance is 5 GB/s
  – Can sustain ~2 GB/s for reads while under a production load
  – Moving 400 TB will take a couple of days and will saturate a 20 Gb/s network link
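A rough estimate behind the transfer-time and link-saturation statements, assuming the ~2 GB/s sustained read rate is also achievable end to end over the WAN:

```python
# Time to move 400 TB of collection data and the load it places on a 20 Gb/s link.
data_tb = 400
rate_gb_s = 2.0                                   # sustained read rate under production load

days = data_tb * 1e12 / (rate_gb_s * 1e9) / 86_400
link_gbit_s = rate_gb_s * 8                       # GB/s -> Gb/s
print(f"~{days:.1f} days at {rate_gb_s} GB/s, using ~{link_gbit_s:.0f} of 20 Gb/s")
# ~2.3 days, effectively saturating the 20 Gb/s link
```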

Page 26: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone


Migrating the Archive
• What migrates?
  – Data: 15 PB and counting
  – Format: MSS to HPSS
  – Tape: Tape B (1 TB cartridges) to Tape C (5 TB cartridges)
  – Location: ML to NWSC
• HPSS at NWSC to become the primary site in Spring 2012
  – 1-day outage when the metadata servers are moved
• ML HPSS will remain as the disaster-recovery site
• Data migration will take until early 2014 (implied rate estimated below)
  – Migration will be throttled to avoid overloading the network
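A rough feel for the throttled migration rate, assuming ~15 PB is moved over roughly two years (spring 2012 to early 2014) at a steady pace:

```python
# Average rate implied by migrating ~15 PB of archive data by early 2014.
data_pb = 15
years = 2.0                                        # assumption: spring 2012 to early 2014

rate_gb_s = data_pb * 1e15 / (years * 365 * 24 * 3600) / 1e9
print(f"~{rate_gb_s:.2f} GB/s sustained (~{rate_gb_s * 8:.1f} Gb/s)")
# ~0.24 GB/s, i.e. under 2 Gb/s: a throttle-friendly fraction of the available 10G circuits
```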

Page 27: NCAR’s  Data-Centric Supercomputing  Environment Yellowstone

WHEW… A LOT OF WORK AHEAD!

QUESTIONS?