Dell Research in HPC
GridPP7, 1st July 2003
Steve Smith, HPC Business Manager, Dell EMEA
The Changing Models of High Performance Computing
[Diagram: the traditional proprietary HPC architecture (RISC, vector, custom) gives way to the current architecture of standards-based clusters and SMP, and then to a future architecture of grid-shared resources, rich clients, clusters/SMP/blades and distributed applications, each drawn as a stack of applications, middleware, OS and hardware.]
© Copyright 2002-2003 Intel Corporation
HPCC Building Blocks
- Platform: PowerEdge & Precision (IA-32 & IA-64)
- Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics
- Protocol: TCP, VIA, GM, Elan
- OS: Linux, Windows
- Middleware: MPI/Pro, MPICH, MVICH, PVM
- Benchmark: parallel benchmarks (NAS, HINT, Linpack…) and parallel applications
HPCC Components and Research Topics
Vertical solutions: application prototyping / sizing
- Energy/Petroleum, Life Science
- Automotive: manufacturing and design
Benchmarks
- Custom application benchmarks
- Standard benchmarks
- Performance studies
Resource monitoring / management
- Dynamic resource allocation
- Checkpoint/restart and job redistribution
- Job scheduler
Development tools
- Compilers and math libraries
- Performance tools: MPI analyzer/profiler, debugger, performance analyzer and optimizer
Middleware / API
- MPI 2.0 / fault-tolerant MPI
- MPICH, MPICH-GM, LAM/MPI, PVM
Interconnect hardware and protocols
- FE, GbE, 10GbE… (RDMA)
- Myrinet, Quadrics, Scali
- InfiniBand
Cluster file systems
- Reliable PVFS
- GFS, GPFS…
- Storage cluster solutions
Platform hardware
- IA-32 vs IA-64 (processor/platform) comparison
- Standard rack-mounted, blade and brick servers/workstations
Cluster monitoring & management
- Distributed system performance monitoring
- Workload analysis and balancing, remote access, web-based GUI
Cluster installation
- Remote installation/configuration
- PXE support
- SystemImager
- LinuxBIOS
128-node Configuration with Myrinet
HPCC Technology Roadmap
(Quarterly roadmap from Q3 FY03 to Q3 FY05; TOP500 submissions in Nov 2002, June 2003, Nov 2003 and June 2004.)
- Platform baselining: Big Bend 2P 1U, Yukon 2P 2U, Everglades 2P 1U
- Interconnects: Myrinet 2000, Quadrics, Scali, Myrinet hybrid switch, InfiniBand prototyping, 10GbE, iSCSI
- Cluster monitoring: Ganglia, Clumon (NCSA)
- Middleware: cycle stealing, MPICH-G2, Condor-G, Grid Engine (GE), Qluster
- File systems: NFS, PVFS2, Lustre File System 1.0 and 2.0, Global File System, ADIC
- Grid computing: Globus Toolkit, Platform Computing Data Grid
- Vertical solutions: Manufacturing (Fluent, LS-DYNA, Nastran), Life Science (BLAST), Energy (Eclipse, Landmark VIP), Financial (MATLAB)
In-the-Box Scalability of Xeon Servers
Single-processor vs dual-processor comparison (533 MHz FSB), without the Goto library.
[Chart: HPL Gflop/s vs problem size (2000 to 10000) for a PE1750 with one (1x1) and with two (1x2) 3.06 GHz processors.]
71% scalability in the box.
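For context when reading the Gflop/s figures (not stated on the slide): a Netburst-generation Xeon completes two double-precision floating-point operations per cycle via SSE2, so the theoretical peaks for these runs are

    $R_{peak}(1\times1) = 3.06\ \mathrm{GHz} \times 2\ \mathrm{flop/cycle} = 6.12\ \mathrm{Gflop/s}$
    $R_{peak}(1\times2) = 2 \times 6.12\ \mathrm{Gflop/s} = 12.24\ \mathrm{Gflop/s}$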
In-the-Box Xeon (533 MHz FSB) Scaling with the Goto Library
HPL comparison with and without the Goto library on a PE1750 (1x2, 3.06 GHz).
[Chart: Gflop/s vs problem size (2000 to 10000), with and without Goto.]
32% performance improvement.
http://www.cs.utexas.edu/users/flame/goto/
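As a hedged sketch of how such a comparison is typically produced (not described on the slides): the netlib HPL build picks up its BLAS from the LAdir/LAlib variables in Make.<arch>, so switching between libraries (for example ATLAS and the Goto library) is just a relink. The architecture name Linux_Xeon below is hypothetical.

    # point LAdir/LAlib in hpl/Make.Linux_Xeon at the desired BLAS (e.g. libgoto.a),
    # then rebuild the xhpl binary from scratch
    make arch=Linux_Xeon clean_arch_all
    make arch=Linux_Xeon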
Goto Comparison on Myrinet
64-node (128-processor) HPL comparison using different BLAS libraries.
[Chart: Gflop/s of the Linpack run with the Goto library vs with ATLAS.]
37% improvement with Goto's library.
Goto Comparison on Gigabit Ethernet
HPL on 64 nodes (128 processors) over Gigabit Ethernet, with and without Goto.
[Chart: Gflop/s vs problem size (8000 to 112000), with and without Goto.]
25% improvement with Goto's library.
Process-to-Processor Mapping
[Diagram: four processes placed onto two dual-CPU nodes connected by a switch, comparing the default round-robin placement with an explicit process mapping.]

Process     Round Robin (default)   Mapped
Process 1   Node 1                  Node 1
Process 2   Node 2                  Node 1
Process 3   Node 1                  Node 2
Process 4   Node 2                  Node 2

A machine-file sketch of the two placements follows below.
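As a minimal sketch (hostnames and file names hypothetical, assuming an MPICH-style machine file): round robin alternates nodes, so neighbouring ranks land on different nodes, while the mapped placement fills both CPUs of a node before moving on.

    # machines.roundrobin: alternate nodes, so neighbouring ranks sit on different nodes
    printf 'node1\nnode2\nnode1\nnode2\n' > machines.roundrobin

    # machines.mapped: fill both CPUs of node1, then node2 (host:cpus syntax)
    printf 'node1:2\nnode2:2\n' > machines.mapped

    # hypothetical 4-process HPL launch with each placement
    mpirun -np 4 -machinefile machines.roundrobin ./xhpl
    mpirun -np 4 -machinefile machines.mapped ./xhpl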
Message Count for an HPL 16-process Run
Message Lengths for an HPL 16-process Run
HPL Results on the Xeon Cluster
[Charts: Linpack Gflop/s vs problem size (2000 to 36000) on an 8-node Xeon cluster over Fast Ethernet, Gigabit Ethernet and Myrinet, comparing three process placements: round robin, frequency major and size major.]
- Fast Ethernet: size-major is 35% better than round robin.
- Gigabit Ethernet: a balanced system designed for HPL-type applications.
- Myrinet: size-major is 7% better than round robin.
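A quick sizing check (not from the slide): HPL factorises a dense N x N double-precision matrix, which needs $8N^2$ bytes, so the largest problem size shown here fits easily in an 8-node cluster's memory:

    $8 \times 36000^2\ \mathrm{bytes} \approx 10.4\ \mathrm{GB}$ in total, i.e. roughly $1.3\ \mathrm{GB}$ per node across 8 nodes.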
Reservoir Simulation – Message Statistics
Reservoir Simulation – Process Mapping – Gigabit Ethernet
[Chart: speedup vs number of processors (up to 64) on Dell PE2650 nodes, for SINGLE, DUAL and DUAL-PM runs.]
11% improvement with GigE.
How Hyper-Threading Technology Works
[Diagram: execution resource utilization over time for a first and a second thread/task, run without and with Hyper-Threading Technology; with Hyper-Threading the two threads share the execution resources and save up to 30% of the time.]
Greater resource utilization equals greater performance.
© Copyright 2002-2003 Intel Corporation
HPL Performance Comparison
Linpack performance results on a 16-node dual-Xeon 2.4 GHz cluster.
[Chart: Gflop/s vs problem size (2000 to 56000) for 16x1 and 16x2 processes without HT, and 16x2 and 16x4 processes with HT.]
Hyper-Threading provides ~6% improvement on a 16-node, 32-processor cluster.
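As a hedged illustration (hostnames hypothetical, MPICH-style machine files as above), the 16x2 and 16x4 runs differ only in how many ranks are started per node: with Hyper-Threading enabled, each dual-CPU node exposes four logical CPUs, so four ranks per node can be scheduled.

    # machines.16x2: two ranks per dual-CPU node; machines.16x4: four ranks per node
    for n in $(seq -w 1 16); do
      echo "node$n:2" >> machines.16x2
      echo "node$n:4" >> machines.16x4
    done

    # 16x2 run (32 ranks) vs 16x4 run with Hyper-Threading (64 ranks)
    mpirun -np 32 -machinefile machines.16x2 ./xhpl
    mpirun -np 64 -machinefile machines.16x4 ./xhpl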
NPB-FT (Fast Fourier Transform)
[Chart: Mop/s without and with HT for configurations from 1x2 (1x4 with HT) up to 32x2 (32x4 with HT); configuration = number of nodes x number of processors.]
L2 cache misses increased: 68% without HT, 76% with HT.
NPB-EP (Embarrassingly Parallel)
[Chart: Mop/s without and with HT for configurations from 1x2 (1x4 with HT) up to 32x2 (32x4 with HT); configuration = number of nodes x number of processors.]
EP requires almost no communication. SSE and x87 utilization increased: 94% without HT, 99% with HT.
Observations
- Compute-intensive applications with finely tuned floating-point code are less likely to gain from Hyper-Threading, because the CPU's execution resources are already highly utilized.
- Cache-friendly applications may suffer when Hyper-Threading is enabled, because processes running on the logical processors compete for the shared cache, which can degrade performance.
- Communication-bound or I/O-bound parallel applications may benefit from Hyper-Threading, if communication and computation can be interleaved between processes.
- Current Linux support for Hyper-Threading is limited, which can cause significant performance degradation if Hyper-Threading is not applied properly.
- To the OS, the logical CPUs are almost indistinguishable from physical CPUs (see the snippet below).
- The current Linux scheduler treats each logical CPU as a separate physical CPU, which does not maximize multiprocessing performance.
- A patch for better HT support is available ("fully HT-aware scheduler" support, 2.5.31-BK-curr, by Ingo Molnar).
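As an illustration (not from the slides), on HT-aware Linux kernels of that era each logical CPU appears as its own entry in /proc/cpuinfo, and the physical id and siblings fields are the main hint that two entries share one package:

    # count the logical CPUs the kernel sees (4 on a dual-Xeon node with HT enabled)
    grep -c ^processor /proc/cpuinfo

    # show which logical CPUs share a physical package
    grep -E 'processor|physical id|siblings' /proc/cpuinfo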
Thank You
Steve Smith, HPC Business Manager, Dell EMEA
[email protected]