Dell Research in HPC
GridPP7, 1st July 2003
Steve Smith, HPC Business Manager, Dell EMEA
The Changing Models of High Performance Computing
[Diagram: the traditional proprietary HPC architecture (RISC, vector, custom) gives way to the current architecture of standards-based clusters and SMP, and then to a future architecture of grid-shared resources, rich clients, clusters/SMP/blades and distributed applications, each drawn as a stack of applications, middleware, OS and hardware.]
© Copyright 2002-2003 Intel Corporation
HPCC Building Blocks
- Platform: PowerEdge & Precision (IA-32 & IA-64)
- Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics
- Protocol: TCP, VIA, GM, Elan
- OS: Linux, Windows
- Middleware: MPI/Pro, MPICH, MVICH, PVM
- Benchmark: parallel benchmarks (NAS, HINT, Linpack…) and parallel applications
HPCC Components and Research Topics
Vertical solutions: application prototyping / sizing
- Energy/Petroleum, Life Science
- Automotive: manufacturing and design
Benchmarks
- Custom application benchmarks
- Standard benchmarks
- Performance studies
Resource monitoring / management
- Dynamic resource allocation
- Checkpoint/restart and job redistribution
- Job scheduler
Development tools
- Compilers and math libraries
- Performance tools: MPI analyzer/profiler, debugger, performance analyzer and optimizer
Middleware / API
- MPI 2.0 / fault-tolerant MPI
- MPICH, MPICH-GM, LAM/MPI, PVM
Interconnect hardware and protocols
- FE, GbE, 10GbE… (RDMA)
- Myrinet, Quadrics, Scali
- InfiniBand
Cluster file systems
- Reliable PVFS
- GFS, GPFS…
- Storage cluster solutions
Platform hardware
- IA-32 vs IA-64 (processor/platform) comparison
- Standard rack-mounted, blade and brick servers/workstations
Cluster monitoring & management
- Distributed system performance monitoring
- Workload analysis and balancing, remote access, web-based GUI
Cluster installation
- Remote installation/configuration
- PXE support
- SystemImager
- LinuxBIOS
128-node Configuration with Myrinet
HPCC Technology Roadmap
(Quarterly roadmap from Q3 FY03 to Q3 FY05; TOP500 submissions in Nov 2002, June 2003, Nov 2003 and June 2004.)
- Platform baselining: Big Bend 2P 1U, Yukon 2P 2U, Everglades 2P 1U
- Interconnects: Myrinet 2000, Quadrics, Scali, Myrinet hybrid switch, InfiniBand prototyping, 10GbE, iSCSI
- Cluster monitoring: Ganglia, Clumon (NCSA)
- Middleware: cycle stealing, MPICH-G2, Condor-G, Grid Engine (GE), Qluster
- File systems: NFS, PVFS2, Lustre File System 1.0 and 2.0, Global File System, ADIC
- Grid computing: Globus Toolkit, Platform Computing Data Grid
- Vertical solutions: Manufacturing (Fluent, LS-DYNA, Nastran), Life Science (BLAST), Energy (Eclipse, Landmark VIP), Financial (MATLAB)
In-the-Box Scalability of Xeon Servers
Single-processor vs dual-processor comparison (533 MHz FSB), without the Goto library.
[Chart: HPL Gflop/s vs problem size (2000 to 10000) for a PE1750 with one (1x1) and with two (1x2) 3.06 GHz processors.]
71% scalability in the box.
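For context when reading the Gflop/s figures (not stated on the slide): a Netburst-generation Xeon completes two double-precision floating-point operations per cycle via SSE2, so the theoretical peaks for these runs are

    $R_{peak}(1\times1) = 3.06\ \mathrm{GHz} \times 2\ \mathrm{flop/cycle} = 6.12\ \mathrm{Gflop/s}$
    $R_{peak}(1\times2) = 2 \times 6.12\ \mathrm{Gflop/s} = 12.24\ \mathrm{Gflop/s}$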
In-the-Box Xeon (533 MHz FSB) Scaling with the Goto Library
HPL comparison with and without the Goto library on a PE1750 (1x2, 3.06 GHz).
[Chart: Gflop/s vs problem size (2000 to 10000), with and without Goto.]
32% performance improvement.
http://www.cs.utexas.edu/users/flame/goto/
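As a hedged sketch of how such a comparison is typically produced (not described on the slides): the netlib HPL build picks up its BLAS from the LAdir/LAlib variables in Make.<arch>, so switching between libraries (for example ATLAS and the Goto library) is just a relink. The architecture name Linux_Xeon below is hypothetical.

    # point LAdir/LAlib in hpl/Make.Linux_Xeon at the desired BLAS (e.g. libgoto.a),
    # then rebuild the xhpl binary from scratch
    make arch=Linux_Xeon clean_arch_all
    make arch=Linux_Xeon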
Goto Comparison on Myrinet
64-node (128-processor) HPL comparison using different BLAS libraries.
[Chart: Gflop/s of the Linpack run with the Goto library vs with ATLAS.]
37% improvement with Goto's library.
Goto Comparison on Gigabit Ethernet
HPL on 64 nodes (128 processors) over Gigabit Ethernet, with and without Goto.
[Chart: Gflop/s vs problem size (8000 to 112000), with and without Goto.]
25% improvement with Goto's library.
Process-to-Processor Mapping
[Diagram: four processes placed onto two dual-CPU nodes connected by a switch, comparing the default round-robin placement with an explicit process mapping.]

Process     Round Robin (default)   Mapped
Process 1   Node 1                  Node 1
Process 2   Node 2                  Node 1
Process 3   Node 1                  Node 2
Process 4   Node 2                  Node 2

A machine-file sketch of the two placements follows below.
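As a minimal sketch (hostnames and file names hypothetical, assuming an MPICH-style machine file): round robin alternates nodes, so neighbouring ranks land on different nodes, while the mapped placement fills both CPUs of a node before moving on.

    # machines.roundrobin: alternate nodes, so neighbouring ranks sit on different nodes
    printf 'node1\nnode2\nnode1\nnode2\n' > machines.roundrobin

    # machines.mapped: fill both CPUs of node1, then node2 (host:cpus syntax)
    printf 'node1:2\nnode2:2\n' > machines.mapped

    # hypothetical 4-process HPL launch with each placement
    mpirun -np 4 -machinefile machines.roundrobin ./xhpl
    mpirun -np 4 -machinefile machines.mapped ./xhpl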
Message Count for an HPL 16-process Run
Message Lengths for an HPL 16-process Run
HPL Results on the Xeon Cluster
[Charts: Linpack Gflop/s vs problem size (2000 to 36000) on an 8-node Xeon cluster over Fast Ethernet, Gigabit Ethernet and Myrinet, comparing three process placements: round robin, frequency major and size major.]
- Fast Ethernet: size-major is 35% better than round robin.
- Gigabit Ethernet: a balanced system designed for HPL-type applications.
- Myrinet: size-major is 7% better than round robin.
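A quick sizing check (not from the slide): HPL factorises a dense N x N double-precision matrix, which needs $8N^2$ bytes, so the largest problem size shown here fits easily in an 8-node cluster's memory:

    $8 \times 36000^2\ \mathrm{bytes} \approx 10.4\ \mathrm{GB}$ in total, i.e. roughly $1.3\ \mathrm{GB}$ per node across 8 nodes.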
Reservoir Simulation – Message Statistics
Reservoir Simulation – Process Mapping – Gigabit Ethernet
[Chart: speedup vs number of processors (up to 64) on Dell PE2650 nodes, for SINGLE, DUAL and DUAL-PM runs.]
11% improvement with GigE.
How Hyper-Threading Technology Works
[Diagram: execution resource utilization over time for a first and a second thread/task, run without and with Hyper-Threading Technology; with Hyper-Threading the two threads share the execution resources and save up to 30% of the time.]
Greater resource utilization equals greater performance.
© Copyright 2002-2003 Intel Corporation
HPL Performance Comparison
Linpack performance results on a 16-node dual-Xeon 2.4 GHz cluster.
[Chart: Gflop/s vs problem size (2000 to 56000) for 16x1 and 16x2 processes without HT, and 16x2 and 16x4 processes with HT.]
Hyper-Threading provides ~6% improvement on a 16-node, 32-processor cluster.
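As a hedged illustration (hostnames hypothetical, MPICH-style machine files as above), the 16x2 and 16x4 runs differ only in how many ranks are started per node: with Hyper-Threading enabled, each dual-CPU node exposes four logical CPUs, so four ranks per node can be scheduled.

    # machines.16x2: two ranks per dual-CPU node; machines.16x4: four ranks per node
    for n in $(seq -w 1 16); do
      echo "node$n:2" >> machines.16x2
      echo "node$n:4" >> machines.16x4
    done

    # 16x2 run (32 ranks) vs 16x4 run with Hyper-Threading (64 ranks)
    mpirun -np 32 -machinefile machines.16x2 ./xhpl
    mpirun -np 64 -machinefile machines.16x4 ./xhpl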
NPB-FT (Fast Fourier Transform)
[Chart: Mop/s without and with HT for configurations from 1x2 (1x4 with HT) up to 32x2 (32x4 with HT); configuration = number of nodes x number of processors.]
L2 cache misses increased: 68% without HT, 76% with HT.
NPB-EP (Embarrassingly Parallel)
[Chart: Mop/s without and with HT for configurations from 1x2 (1x4 with HT) up to 32x2 (32x4 with HT); configuration = number of nodes x number of processors.]
EP requires almost no communication. SSE and x87 utilization increased: 94% without HT, 99% with HT.
Observations
- Compute-intensive applications with finely tuned floating-point code are less likely to gain from Hyper-Threading, because the CPU's execution resources are already highly utilized.
- Cache-friendly applications may suffer when Hyper-Threading is enabled, because processes running on the logical processors compete for the shared cache, which can degrade performance.
- Communication-bound or I/O-bound parallel applications may benefit from Hyper-Threading, if communication and computation can be interleaved between processes.
- Current Linux support for Hyper-Threading is limited, which can cause significant performance degradation if Hyper-Threading is not applied properly.
- To the OS, the logical CPUs are almost indistinguishable from physical CPUs (see the snippet below).
- The current Linux scheduler treats each logical CPU as a separate physical CPU, which does not maximize multiprocessing performance.
- A patch for better HT support is available ("fully HT-aware scheduler" support, 2.5.31-BK-curr, by Ingo Molnar).
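As an illustration (not from the slides), on HT-aware Linux kernels of that era each logical CPU appears as its own entry in /proc/cpuinfo, and the physical id and siblings fields are the main hint that two entries share one package:

    # count the logical CPUs the kernel sees (4 on a dual-Xeon node with HT enabled)
    grep -c ^processor /proc/cpuinfo

    # show which logical CPUs share a physical package
    grep -E 'processor|physical id|siblings' /proc/cpuinfo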
Thank You
Steve Smith, HPC Business Manager, Dell EMEA
[email protected]