Hotfoot HPC Cluster March 31, 2011. Topics Overview Execute Nodes Manager/Submit Nodes NFS Server...

Hotfoot HPC Cluster March 31, 2011

Hotfoot HPC Cluster
March 31, 2011

Topics

• Overview
• Execute Nodes
• Manager/Submit Nodes
• NFS Server
• Storage
• Networking
• Performance

Overview - Hotfoot Pilot

• Launched May 2009

• Original Partnership
– Astronomy
– Statistics
– CUIT
– Office of the Executive Vice President for Research

Overview - Hotfoot Expansion

• Expanded March 2011
– More Nodes
– More Storage
– Changed Scheduler

• New Participant
– Social Science Computing Committee (SSCC)

Overview - Cluster Components

• 52 Execute Nodes

• 520 Total Cores

• 2 Manager Nodes

• 1 NFS Server (1 Cold Spare)

• 52 TB Storage (72 TB Raw)

Overview

Overview - Architecture

Hotfoot Components

• Manager/Submit Nodes
– Manager/Submit Node 1 (Haddock)
– Manager/Submit Node 2 (Mahimahi)
– One Manager/Submit node is active. Failover is manual.

• NFS Servers
– NFS Server (Herring)
– NFS Server (Sardine)
– The NFS server provides working storage for all other systems in the cluster.
– A second server is available to provide NFS services. It is currently not connected.

• RAID
– 72 TB raw storage. Approximately 52 TB usable under RAID 5.

• Blade Chassis
– Original blade chassis containing 32 Execute nodes.
– New blade chassis containing 24 Execute nodes.

Execute Nodes

Model         Quantity   CPU Cores     Total Cores   Memory
BL2x220c G5   32         Dual 4 core   256           16 GB
BL2x220c G6   14         Dual 6 core   168           24 GB
BL2x220c G6   8          Dual 6 core   96            96 GB
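
The Total Cores column in the table above follows from quantity × sockets × cores per socket. The short Python sketch below reproduces that arithmetic from the three rows; the variable and function names are illustrative only and not part of the original slides. The row totals add up to the 520 cores quoted in the overview. The row quantities add up to 54, while the overview slide lists 52 Execute Nodes; the slides do not explain the difference.

```python
# Rows transcribed from the Execute Nodes table:
# (model, quantity, sockets per node, cores per socket)
EXECUTE_NODES = [
    ("BL2x220c G5", 32, 2, 4),
    ("BL2x220c G6", 14, 2, 6),
    ("BL2x220c G6", 8, 2, 6),
]


def cores_per_row(rows):
    """Return the total core count for each table row."""
    return [qty * sockets * cores for _, qty, sockets, cores in rows]


def total_cores(rows):
    """Return the core count summed over all rows."""
    return sum(cores_per_row(rows))


def total_nodes(rows):
    """Return the node count summed over all rows."""
    return sum(qty for _, qty, _, _ in rows)


row_totals = cores_per_row(EXECUTE_NODES)
cluster_cores = total_cores(EXECUTE_NODES)
cluster_nodes = total_nodes(EXECUTE_NODES)

print(row_totals)     # per-row totals, matching the Total Cores column
print(cluster_cores)  # cluster-wide core count
print(cluster_nodes)  # node count implied by the Quantity column
```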

Manager/Submit Nodes

• HP DL360 G5, 4 GB RAM

• Torque Resource Manager (OpenPBS descendant)

• Maui Cluster Scheduler

• User Access via virtual interface (vif)

• Failover via Torque High Availability (HA)
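
To illustrate what submitting work through Torque typically looks like, here is a minimal Python sketch that builds the text of a batch script. The `#PBS -N`, `#PBS -l nodes=...:ppn=...`, and `#PBS -l walltime=...` directives and the `PBS_O_WORKDIR` variable are standard Torque features. The job name, the command, the walltime, and the helper function itself are hypothetical and do not come from the slides. The default of 8 processors per node is an assumption based on the dual 4-core nodes in the Execute Nodes table.

```python
def build_job_script(job_name, command, nodes=1, ppn=8, walltime="01:00:00"):
    """Return the text of a simple Torque/PBS batch script.

    All argument values are illustrative; real limits depend on how
    the cluster's queues are configured.
    """
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",               # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        f"#PBS -l walltime={walltime}",      # requested wall-clock limit
        "cd $PBS_O_WORKDIR",                 # directory the job was submitted from
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: one node, all eight cores of a dual 4-core blade.
script = build_job_script("example", "./my_program")
print(script)
```

A script like this would normally be saved to a file and handed to Torque's `qsub` command on the submit node.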

NFS Servers

• Primary
– HP DL360 G7
– 2 x 4 cores
– 16 GB RAM

• Backup
– HP DL360 G5
– 1 x 2 cores
– 8 GB RAM

Storage

• HP P2000 Storage Array

• 32 x 2 TB Drives

• RAID 5

• ~52 TB Usable

Networking

• Execute Nodes

– Channel-bonding mode 2 (load-balancing and fault tolerance)

– 1 Gb connection to chassis switches

– Usage records suggested this was sufficient
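
Assuming the slide refers to the Linux bonding driver, mode 2 is `balance-xor`, which picks an outgoing link by hashing packet addresses. The Python sketch below is a deliberately simplified illustration of that idea: it XORs only the last octet of the source and destination MAC addresses and takes the result modulo the number of bonded links. The driver's real hash covers more fields and depends on its `xmit_hash_policy` setting. The MAC addresses and function names here are hypothetical.

```python
def mac_to_bytes(mac):
    """Convert a colon-separated MAC address string to bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def select_link(src_mac, dst_mac, link_count):
    """Pick a bonded link index with a simplified XOR hash.

    Simplification: only the final octet of each address is used.
    """
    src = mac_to_bytes(src_mac)
    dst = mac_to_bytes(dst_mac)
    return (src[-1] ^ dst[-1]) % link_count


# Hypothetical addresses: one execute node talking to two different peers
# over a bond of two links.
node = "00:11:22:33:44:55"
peer_a = "00:11:22:33:44:66"
peer_b = "00:11:22:33:44:67"

link_a = select_link(node, peer_a, 2)
link_b = select_link(node, peer_b, 2)
print(link_a, link_b)
```

With a hash of this kind, traffic between one pair of hosts always maps to the same link, so balancing comes from talking to several peers rather than from splitting a single conversation.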

Networking

Sample Traffic for an Execute Node

Networking

• Chassis

– Each chassis has four Cisco 3020 switches

– 1 Gb connection to Edge switches

– Usage records suggested this was sufficient

Networking

Sample Traffic for a Chassis Switch

Networking

Original Chassis, Showing Network Connections for Two Servers

Performance

• Concern about the ability of NFS to handle I/O demands.

• Reviewed performance of pilot system.

• Ran tests on expanded system.

Performance

Memory Usage on Old NFS Server

Performance

Load Average on Old NFS Server

Performance

Performance

Questions?

• Questions?

• Comments?

• Contact: [email protected]