Page 1

Tutorial: Optimizing Applications for Performance on POWER
Hardware and Software Overview
ScicomP13, July 2007
© 2007 IBM Corporation

Page 2

Agenda

• Hardware
• Software

• Documentation

Page 3

Hardware Overview

The hardware hierarchy, from smallest to largest:
• Core (also called "CPU" or "processor")
• Chip (also called "socket" or "processor")
• Node
• Cluster

Page 4

IBM Product Naming

Processor                                      Market             Old Names           New Name
zSeries                                        Mainframe          zSeries, ES9000     System z
Intel, AMD (IA-32)                             Server, technical  xSeries             System x
PowerPC, POWER3/POWER4/POWER5/POWER5+/POWER6   Server, technical  RS6000 SP, pSeries  System p
RS64, POWER5                                   Commercial         iSeries, AS400      System i

Page 5

POWER Server Roadmap

POWER4 (2001): 180 nm; two 1.3 GHz cores, shared L2, distributed switch
• Chip multiprocessing
• Dynamic LPARs (16)

POWER4+ (2002-03): 130 nm; two 1.7 GHz cores, shared L2, distributed switch
• Reduced size
• Lower power
• Larger L2
• More LPARs (32)

POWER5 (2004): 130 nm; two 1.9 GHz cores, shared L2, distributed switch
• Simultaneous multi-threading
• Sub-processor partitioning
• Dynamic firmware updates
• Enhanced scalability, parallelism
• High throughput performance
• Enhanced memory subsystem

POWER5+ (2005-06): 90 nm; two 2.2 GHz cores, shared L2, distributed switch

POWER6 (2007): 65 nm; 4.7 GHz cores, L2 caches, advanced system features and switch
• Ultra high frequency
• Very large L2
• Robust error recovery
• High ST and HPC performance
• High throughput performance
• More LPARs (1024)
• Enhanced memory subsystem

Autonomic computing enhancements grow across the roadmap.

Page 6

POWER5 Modules

MCM (multi-chip module):
• 95 mm x 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal

DCM (dual-chip module):
• One POWER5 chip
• Single or dual core
• One L3 cache chip

Page 7

POWER5 Processor Systems

[Diagram: packaging hierarchy from core, to chip, to DCM (e.g. p5-575) or MCM (e.g. p5-595), to cluster.]

Page 8

Cluster 1600

[Diagram: physical and logical views of a Cluster 1600: multi-processor nodes connected by a network and disk system.]

Page 9

System p5 "Nodes" – partial list

Model     Processors   Clock Rate (GHz)   Max Memory (x 2^30 byte)
p5 505    1, 2         1.5, 1.65*         32
p5 520    1, 2         1.65, 1.9*         32
p5 560Q   4-16         1.5*               128
p5 570    2-16         1.9, 2.2*          512
p5 575    8-16         1.9, 2.2*          256
p5 590    8-32         1.65, 2.1*         1000
p5 595    16-64        1.65 - 2.3*        2000

* - POWER5+

Page 10

POWER5 Features

• Multi-processor
• Cache
  • Private L1 cache
  • Shared L2 cache
  • Shared L3 cache
• Interleaved memory
• Hardware prefetch
• Multiple page size support
• Superscalar
• Speculative out-of-order instructions
• Up to 8 outstanding cache line misses
• Large number of instructions in flight
• Branch prediction

Page 11

Instruction-level Parallelism

• Speculative superscalar organization
• Out-of-order execution
  • Large rename pools
  • 8-instruction issue, 5-instruction completion
  • Large instruction window for scheduling
• 8 execution pipelines
  • 2 load/store units
  • 2 fixed point units
  • 2 DP multiply-add execution units
  • 1 branch resolution unit
  • 1 CR execution unit
• Aggressive branch prediction
  • Target address and outcome prediction
  • Static prediction / branch hints used
  • Fast, selective flush on branch mispredict

[Processor core diagram: the IFAR and I-cache feed branch scan/predict logic and an instruction queue; decode, crack and group formation dispatch groups (tracked by the GCT) into issue queues (BR/CR, FX/LD 1, FX/LD 2, FP) that feed the execution units (FX1, FX2, FP1, FP2, CR, BR, LD1, LD2), with a store queue and D-cache.]

Page 12

Multiple Functional Units

• Symmetric functional units
• Two Floating Point Units (FPU)
• Three Fixed Point Units (FXU)
  • Two integer
  • One control (CR)
• Two Load/Store Units (LSU)
• One Branch Processing Unit (BPU)

[Diagram: two load/store units, a branch unit, two floating point multiply-add pipelines, two fixed point units, and the CR unit.]

Page 13

Register Renaming

• The architecture has 32 registers
  • Legacy
• Cases which require additional registers:
  • Tight loops (computationally intensive)
  • "Broad" loops (many variables involved)
  • Deep pipelines
• Renaming registers are increasingly important with Simultaneous Multi-Threading

Page 14

Register Renaming

Read after write:
  R13 = R14 + R15
  R16 = R13 + R12

Write after write:
  R13 = R14 + R15
  R13 = R16 + R17
  R19 = R13 + R18

Write after read:
  R14 = R13 + R15
  R42 = R16 + R17
  R19 = R42 + R18

Read after write:
  A(i) = B(i) + C(i)
  ....
  D(i) = A(i) + E(i)

Page 15

Effect of Registers

              POWER4         POWER5
GP Registers  80             120
FP Registers  72             120
DGEMM speed   60% of burst   90% of burst

Enhances performance of computationally intensive kernels.

Page 16

Multi-threading Evolution

[Diagram: per-cycle occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) under four schemes: single thread, coarse grain threading, fine grain threading, and simultaneous multi-threading; slots are shaded by whether thread 0, thread 1, or no thread is executing.]

Page 17

Simultaneous Multi-Threading in POWER5

• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software-controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode

[Diagram: execution-unit slots (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) filled by thread 0 and thread 1 in the same cycle.]
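On AIX 5.3 the dynamic switching between single-threaded and SMT mode is administered with the smtctl command. As a sketch (the -m and -w flags follow the AIX documentation; the commands are echoed rather than executed so this runs anywhere):

```shell
# Build the smtctl invocations an administrator would use on AIX 5.3.
# smtctl -m on|off selects the mode; -w now|boot selects when it takes effect.
enable_now="smtctl -m on -w now"      # turn SMT on immediately
disable_boot="smtctl -m off -w boot"  # turn SMT off at the next boot
echo "$enable_now"
echo "$disable_boot"
```

Running with no arguments, smtctl reports the current SMT capability and mode.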

Page 18

Simultaneous Multi-threading

• Utilizes unused execution unit cycles
• Symmetric multiprocessing (SMP) programming model
• Natural fit with superscalar out-of-order execution core
• Dispatches two threads per processor. Net result: better processor utilization
• Appears as 4 CPUs per chip to the operating system (AIX 5L V5.3 and Linux)

[Diagram: system throughput of POWER4 (single threaded, ST) versus POWER5 simultaneous multi-threading (SMT); unit slots (CRL, FX0, FX1, LS0, LS1, FP0, FP1, BRX) shaded by thread 0 active, thread 1 active, or no thread active.]

Page 19

Resource Sizes

• Analysis done to optimize every micro-architectural resource size
  • GPR/FPR rename pool size
  • I-fetch buffers
  • Reservation station
  • SLB/TLB/ERAT
  • I-cache/D-cache
• Many workloads examined
• Associativity also examined

[Chart: IPC versus number of GPR renames (80 and 120 marked) for ST and SMT modes; results based on simulation of an online transaction processing application; the vertical axis does not originate at 0.]

Page 20

POWER6 Objectives

• Processor core
  • High single-thread performance with ultra high frequency (13 FO4) and optimized pipelines
  • Higher instruction throughput: improved SMT
• Cache and memory subsystem
  • Increased cache sizes and associativity
  • Low memory latency and increased bandwidth
• System architecture
  • Fully integrated SMP fabric switch
  • Predictive subspace snooping for significant reduction of snoop traffic
  • Higher coherence bandwidth
  • Excellent scalability
  • Ultra-high frequency buses
  • High bandwidth per pin
    • Enables lower cost packaging
• Power
  • Minimized latch count
  • Dynamic power management

Page 21

POWER6 Chip

• Ultra-high frequency dual-core chip
• 7-way superscalar, 2-way SMT core
  • Up to 5 instructions for one thread, up to 2 for the other
• 8 execution units
  • 2 LS, 2 FP, 2 FX, 1 BR, 1 VMX
• 790M transistors, 341 mm² die
• Up to 64-core SMP systems
• 2 x 4 MB on-chip L2 – point of coherency
• On-chip L3 directory and controller
• Two memory controllers on-chip
• Technology
  • CMOS 65 nm lithography, SOI Cu
• High-speed elastic bus interface at 2:1 frequency
• I/Os: 1953 signal, 5399 power/ground
• Full error checking and recovery

[Die floorplan: two cores (each with IFU, LSU, SDU, FXU, RU, VMX/FPU), four L2 quads with two L2 controllers, two memory controllers (MC), two L3 controllers, and the GX/CFBC interfaces.]

Page 22

Processor Design

Style:
• POWER5+: general out-of-order execution
• POWER6: mostly in-order with special case out-of-order execution

Units:
• POWER5+: 2 FX, 2 LS, 2 FP, 1 BR, 1 CR
• POWER6: 2 FX, 2 LS, 2 FP, 1 BR/CR, 1 VMX

Threading:
• POWER5+: 2 SMT threads; alternate ifetch; alternate dispatch (up to 5 instructions)
• POWER6: 2 SMT threads; priority-based dispatch; simultaneous dispatch from two threads (up to 7 instructions)

Page 23

POWER5+ and POWER6 Storage Hierarchy

L1 Cache:
• ICache capacity, associativity: POWER5+ 64 KB, 2-way; POWER6 64 KB, 4-way
• DCache capacity, associativity: POWER5+ 32 KB, 4-way; POWER6 64 KB, 8-way

L2 Cache:
• Capacity, line size: POWER5+ 1.9 MB, 128 B line; POWER6 2 x 4 MB, 128 B line
• Associativity, replacement: POWER5+ 10-way, LRU; POWER6 8-way, LRU

Off-chip L3 Cache:
• Capacity, line size: POWER5+ 36 MB, 256 B line; POWER6 32 MB, 128 B line
• Associativity, replacement: POWER5+ 12-way, LRU; POWER6 16-way, LRU

Memory:
• Maximum: POWER5+ 4 TB; POWER6 8 TB
• Memory bus: POWER5+ 2x DRAM frequency; POWER6 4x DRAM frequency

Page 24

Network

Page 25

Switch Technology

• Internal network
  • In lieu of Gigabit Ethernet, Myrinet, Quadrics, etc.
• Fourth generation
  • HPS Switch (POWER2 generation)
  • SP Switch (POWER2 -> POWER3)
  • SP Switch 2 (POWER3 -> POWER4)
  • HPS (POWER4 -> POWER5)
• Multiple links per node
  • Typically 2 links per node
  • Recently available: 1 link per node

Page 26

High Performance Switch (HPS)

• Also known as "Federation"
• Follow-on to SP Switch2
  • Also known as "Colony"
• Specifications:
  • 2 Gbyte/s (bidirectional)
  • < 5 microsecond latency
• Configuration:
  • Up to four adapters per node
  • 1 or 2 links per adapter
  • Up to 16 Gbyte/s per node
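The per-node figure follows from the configuration above; a quick arithmetic check:

```shell
# Peak HPS bandwidth per node, from the slide's figures:
# up to 4 adapters, 2 links per adapter, 2 Gbyte/s (bidirectional) per link
adapters=4
links_per_adapter=2
gbyte_per_link=2
node_bw=$((adapters * links_per_adapter * gbyte_per_link))
echo "${node_bw} Gbyte/s"   # 16 Gbyte/s
```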

Page 27

HPS Specifications

                               TPMX    SP Switch 2   HPS
Year                           1998    2001          2003
Latency [microsec.]            25      15            4.7
Bandwidth, single [Mbyte/s]    125     350           1800
Bandwidth, multiple [Mbyte/s]  125     550           3300

Page 28

Software Overview

• Operating system
  • AIX
• Compilers
  • C
  • C++
  • Fortran
• Batch queue scheduler
  • LoadLeveler (IBM)
  • LSF (Platform)
  • PBS
  • Gridware

Page 29

AIX

• Current version: AIX 5.3
• Processors:
  • POWER3
  • POWER4
  • POWER5
• Linux affinity
• Logical PARtition (LPAR) nodes
  • Operating system
  • Memory
  • Network connections
• Kernel address size:
  • 64-bit
  • 32-bit

Page 30

Linux on POWER

• Native Linux: SLES9, RHA4
• RPMs and package managers
• Cluster Systems Manager
• 64-bit kernel

Compiler   User Name
C          xlc, xlc_r
C++        xlC, xlC_r
Fortran    xlf, xlf90_r

Page 31

Compilers

C and C++
• XL C and C++ Professional for AIX
• Versions 6, 7, 8
• ANSI C options
• C++
• Compiler names:
  • xlc
  • xlC

Fortran
• XL Fortran for AIX
• Versions 8, 9, 10
• Fortran 77
• Fortran 90/95/2003
• Compiler names:
  • xlf77
  • xlf90
  • xlf95

Page 32

Compiler Names

Compiler      User Name
C             xlc
C++           xlC
Fortran 77    xlf
Fortran 90    xlf90
Reentrant     xlf_r, xlc_r
MPI compile   mpxlf, mpcc

AIX uses different compiler names to perform some tasks which are handled by compiler flags on many other systems.

Page 33

Compiler Usage

Language      Command          Feature             Extension
ANSI C        xlc, xlc_r       ANSI, thread safe   .c
Extended C    cc               Pre-ANSI            .c
MPI, C        mpcc             MPI                 .c
C++           xlC, xlC_r       Thread safe         .C .cc .cpp
Fortran 77    xlf, xlf_r       Thread safe         .f
Fortran 90    xlf90, xlf90_r   Thread safe         .f
MPI fortran   mpxlf            MPI                 .f
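Putting the naming scheme to work, a minimal build sketch. The source file names are made up for illustration, and the commands are only echoed here since the xlc_r, xlf90_r and mpcc drivers exist only on AIX:

```shell
# Build command lines following the compiler-usage table above
# (hello.c, solver.f and pingpong.c are hypothetical file names).
c_cmd="xlc_r -o hello hello.c"         # thread-safe ANSI C
f_cmd="xlf90_r -o solver solver.f"     # thread-safe Fortran 90
mpi_cmd="mpcc -o pingpong pingpong.c"  # MPI C
echo "$c_cmd"
echo "$f_cmd"
echo "$mpi_cmd"
```

The _r drivers link the reentrant runtime, which is what makes them safe for threaded code.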

Page 34

User Limits

• Set by the system administrator
• ulimit:
  • C or Korn shell built-in
  • Sets or reports resource limits
  • Limits are defined in /etc/security/limits
    • Sizes are in 512-byte blocks
    • Times are in seconds
  • /usr/bin/ulimit is available from any shell

Page 35

ulimit Defaults

Limit       Default          Typical           Definition
fsize       2097151          Unlimited (-1)    File Size
core        2097151          Unlimited (-1)    Core File Size
cpu         -1 (unlimited)   Unlimited (-1)    Per-process CPU time limit
data        262144           Unlimited (-1)    Data Segment Size
stack       65536            Unlimited (-1)*   Stack Segment Size
No. files   2000             2000              File Descriptor Limit

* 64-bit address mode
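Because /etc/security/limits records sizes in 512-byte blocks, the stack and data defaults above (65536 and 262144 blocks) translate as follows:

```shell
# Convert 512-byte-block limits from /etc/security/limits to bytes and MiB
stack_blocks=65536
data_blocks=262144
stack_bytes=$((stack_blocks * 512))   # 33554432 bytes
data_bytes=$((data_blocks * 512))     # 134217728 bytes
echo "stack: $((stack_bytes / 1048576)) MiB"   # 32 MiB
echo "data:  $((data_bytes / 1048576)) MiB"    # 128 MiB
```

A 32 MiB default stack is easily exhausted by large automatic arrays, which is why the typical HPC setting raises it.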

Page 36

Other Environment Variables

• Thread control

• May be set for default in /etc/environment

• AIXTHREAD_SCOPE=S

• AIXTHREAD_MNRATIO=1:1

• AIXTHREAD_COND_DEBUG=OFF

• AIXTHREAD_GUARDPAGES=4

• AIXTHREAD_MUTEX_DEBUG=OFF

• AIXTHREAD_RWLOCK_DEBUG=OFF
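As a sketch, the settings above can be tried per session before committing them to /etc/environment (the values are those listed on the slide; they only influence the AIX pthreads library, so exporting them elsewhere is harmless):

```shell
# Export the AIX thread-control settings for the current session
export AIXTHREAD_SCOPE=S            # system-scope (1:1) threads
export AIXTHREAD_MNRATIO=1:1        # one kernel thread per user thread
export AIXTHREAD_GUARDPAGES=4       # guard pages below each thread stack
export AIXTHREAD_COND_DEBUG=OFF     # disable debug bookkeeping
export AIXTHREAD_MUTEX_DEBUG=OFF
export AIXTHREAD_RWLOCK_DEBUG=OFF
echo "$AIXTHREAD_SCOPE $AIXTHREAD_MNRATIO $AIXTHREAD_GUARDPAGES"
```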

Page 37

Batch Queuing

• Compile on any AIX node
  • Use -qarch=pwr5 -qtune=pwr5
• Submit job with available batch utility
• Use appropriate queue name
• Available queuing systems:
  • LoadLeveler
  • PBS
  • Gridware
  • LSF

Page 38

Cluster Layout

[Diagram: a compile-and-submit node and compute nodes 0, 1, 2 connected by a network.]

Page 39

Batch Queue: LoadLeveler

• Job management
  • Build, submit, schedule, monitor
• Workload balancing
  • Resource scheduling
• Control
  • Centralized system administration

Page 40

LoadLeveler Cluster

• Central Manager
  • Central resource manager and workload balancer, but not a central point of failure
• Execute Node
  • Runs work (serial job steps or parallel job tasks) dispatched by the Central Manager
• Scheduler Node (public or local)
  • Manages jobs from submission through completion
• Submit-only Node
  • Submits jobs to LoadLeveler from outside the cluster. Runs no daemons.

Page 41

LoadLeveler Commands

• llsubmit - submits a job to LoadLeveler
• llcancel - cancels a submitted job step
• llq - queries the status of jobs in the queue
• llstatus - queries the status of machines in the cluster
• llclass - returns information about available classes
• llprio - changes the user priority of a job step

Page 42

LoadLeveler: Sequential Job

#!/bin/ksh
#
# @ error = Error
# @ output = Output
# @ notification = complete
# @ notify_user = [email protected]
# @ wall_clock_limit = 01:30:00
# @ job_type = serial
# @ class = small
# @ queue

for i in a b c
do
  a.out < input.$i > Output.$i
done

Page 43

LoadLeveler: Parallel MPI Job

#!/bin/ksh
#
# @ error = Error
# @ output = Output
# @ notification = complete
# @ notify_user = [email protected]
# @ wall_clock_limit = 01:30:00
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 16
# @ class = small
# @ queue

poe a.out

Page 44

LoadLeveler: Parallel OpenMP Job

#!/bin/ksh
#
# @ error = Error
# @ output = Output
# @ notification = complete
# @ notify_user = [email protected]
# @ wall_clock_limit = 01:30:00
# @ job_type = serial
# @ resources = ConsumableCpus(4)
# @ class = small
# @ queue

export OMP_NUM_THREADS=4
a.out

Page 45

Submit, Monitor, Cancel

$ llsubmit LL
llsubmit: The job "v60n129.pbm.ihost.com.4173" has been submitted.
....
$ llq
Id                       Owner      Class      Running On
------------------------ ---------- ---------- -----------
v60n129.4172.0           myuid      small      v60n129
$ llcancel v60n129.4172.0
llcancel: Cancel command has been sent to the central manager.

Page 46

Documentation

• Software:
  • http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp
• Compilers:
  • /usr/vac/doc/en_US/pdf
  • /usr/vacpp/doc/en_US/pdf
  • /usr/lpp/xlf/doc/en_US/pdf
• Redbooks:
  • www.redbooks.ibm.com/
  • IBM eServer p5 590 and 595 System Handbook