
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

ENABLING TECHNOLOGIES

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

February 17, 2011


Topics

• Enabling Technologies - Introduction

• Processing Logic - CPU

• Memory Technology

• Communication - Concepts

• Intranode Communication (PCI, HT)

• Networking – Concepts

• Internode Communication (Ethernet, InfiniBand)

• Summary Materials for Test



Enabling Technologies for Computing

Definition of Technology

• Merriam-Webster: the practical application of knowledge, especially in a particular area

• Greek etymology: techne (craft) and logia (saying)

• Technology breakthroughs enable leaps in computational performance and efficiency

• The necessity to overcome particular system bottlenecks drives technological development

• Performance limitations occur in certain classes of system components

– Resolving them frequently shifts the bottleneck to a different part of the system

– Applying a solution to only one technology class does not work in the long term (synergy is needed)

• Bottlenecks arise because one or more of the "four horsemen" become the dominant factor(s) at a given technology point:

– Starvation

– Latency

– Overhead

– Waiting (Contention)


Evolution of HPC

[Chart: growth in delivered performance from one OPS to PetaOPS (10^0, 10^3, 10^6, 10^9, 10^12, 10^15 operations per second), illustrated by representative machines: 1823 Babbage Difference Engine, 1943 Harvard Mark 1, 1949 EDSAC, 1951 UNIVAC 1, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray X-MP, 1988 Cray Y-MP, 1991 Intel Delta, 1996 Cray T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1, 2006 Blue Gene/L, 2009 Cray XT5.]


System Node Diagram


[Node block diagram: two sockets, each holding two MPUs with private L1/L2 caches and a shared L3; memory banks M1, M2, ..., Mn behind a memory controller; storage devices (S); NICs connected to Ethernet; USB peripherals and JTAG; I/O devices attached over PCI-e.]

Legend – MPU: MicroProcessor Unit; L1, L2, L3: caches; M1, M2, ...: memory banks; S: storage; NIC: Network Interface Card


Computing Technology Classes

• Processing logic

– Responsible for execution of programs

– Control logic, data paths & functional logic

– TTL, ECL, MOS, CMOS, RSFQ

• Memory

– Providing storage for computational data

– Caches, main memory, secondary storage, tertiary storage

– Flash memory

• Communication

– Takes care of data movement in the computing system

– Intranode communication using buses (PCI, HT)

– Internode communication using networks (Ethernet, InfiniBand)


Technology Classes and Related Metrics

• Microprocessors

– Clock rate

– Instructions per cycle (IPC)

– Cores per socket

– Power

• Memory

– Access Times

– Bandwidth

– Capacity, Size

• Networking

– Bandwidth

– Latency

– Protocol (e.g., TCP/IP)



Digital Logic

• Processing logic is the circuitry that governs the transformation of computational state through execution of instructions encoded in stored computer programs

• Various aspects of digital logic undergo different technological transformations; it is helpful to distinguish:

– Functional logic

• Defines the capabilities of a processor

• Includes functional units and blocks, such as the floating-point unit or ALU

– Control logic

• Relates to instruction decoding and interpretation

• Includes instruction decoders and schedulers, execution logic, and various circuits that help optimize execution

– Data paths

• Determine the throughput of temporary processing data as well as the rates of computed information output and raw information input

• Define the bit-widths of internal data and control buses as well as the topology of the interconnects


Modern CPU Core Comparison

[Microarchitecture block diagrams: AMD K8 and Intel Core 2]


Logic Technology Metrics*

• Clock speed (impacts latency of individual operations)

– Processors used in supercomputers run at several hundred MHz to a few GHz

– Related metrics include flops (floating-point operations per second) and ips (integer operations per second)

• Number of cores or concurrent threads (prevents starvation for applications with a sufficient level of parallelism)

– Dual-core desktop processors are in common use; research designs have reached tens of cores per die

• Data bus width (impacts the rate of data transfer, and hence the total latency of processing)

– Current high-performance processors are 32- and 64-bit

• Cache sizes (improve the average latency of memory accesses)

– While they belong more to the memory section, they are closely associated with processing logic

– Range from 16 KB for the L1 data cache in the Xeon 5000 series to 32 MB of L3 in POWER6

• Instructions per cycle (addresses the overhead of instruction stream decoding; a peak-throughput sketch follows this list)

– Most CPUs are superscalar, with issue width ranging from 2 to 8

– Very Long Instruction Word (VLIW) architectures shift some of this overhead to the compiler

• Power consumption

– Ranges from 4.5 W for the PowerPC 440GX to 150 W for the Xeon 7140M

• Price

– Impacted by product yields, which depend on manufacturing process type (e.g., 45 nm, 65 nm), process maturity, etc.
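To tie these metrics together, here is a minimal back-of-envelope sketch (in Python; not from the slides, and the example numbers are hypothetical) of how clock rate, core count, and per-core floating-point issue width combine into a peak rate:

# Peak throughput sketch: clock rate x cores x FP operations per cycle.
# All inputs are illustrative placeholders, not vendor specifications.
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    return clock_ghz * cores * flops_per_cycle

# e.g. a hypothetical 3.0 GHz dual-core part issuing 2 FP ops/cycle per core
print(peak_gflops(clock_ghz=3.0, cores=2, flops_per_cycle=2))   # -> 12.0 GFLOPS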


Evolution of Electrically Powered Processing Logic

Relay computer Z3 (Berlin, 1941): clock 5.3 Hz (up to ~10 Hz), 4000 W power

Valve computer Colossus Mark II (Bletchley Park, 1944): effectively 1 Boolean operation per 0.4 µs

Discrete-transistor computer TRADIC (Bell Labs, 1955): 1 million logical operations per second, 100 W power usage

PDP-11/20, based on SSI integrated circuits (Digital Equipment Corp., 1970): 800 ns cycle, ~400 W (more in expanded configurations)

Single LSI microprocessor Intel 4004 (1971): clock 740 kHz, 0.07 MIPS, power consumption < 500 mW


Clock Scaling of Intel CPUs


Current State of Multicore

[Die photos: Sun UltraSPARC T2 (8 cores), AMD Barcelona (4 cores), STI Cell (8+1 cores), IBM POWER6 (2 cores), ½ Intel Clovertown (2 cores per die, 2 dies)]


Processor Parameter Comparison

Processor | Clock speed (# cores) | Cache sizes (per core) | IPC (per core) | Power
AMD Opteron (64-bit) | 3.0 GHz (2) | L1: 64+64 KB, L2: 1 MB | 2 FP, 3 integer | 119 W
IBM PowerPC 450 (32-bit) | 0.85 GHz (4) | L1: 32+32 KB, L2: ?, L3: 8 MB | 4 FP, 2 integer | 16 W
Intel Itanium 2 9150M (64-bit) | 1.66 GHz (2) | L1: 16+16 KB, L2: 1024+256 KB, L3: 12 MB | 4 FP, 4 integer | 104 W
Intel Xeon 7140M (64-bit) | 3.4 GHz (2) | L1: 32+32 KB, L2: 2 MB | 4 FP, 3 integer | 150 W
MIPS64 (64-bit) | 0.5 GHz (6) | L1: 32+32 KB, L2: 256 KB | 2 FP, 2 integer | <6 W


Power Problem: Current Leakage

[Figure: gate oxide leakage and subthreshold leakage paths in a transistor]

V. Salapura, Next Generation Supercomputers


Power Scaling Problem

[Charts: static vs. dynamic power, and contribution of leakage factors vs. decreasing gate length. Courtesy: ITER]


Tackling Power Issues

• Switched capacitance losses

– Reduction of core voltage (energy losses proportional to V²)

– Low-k interconnects (SiO2 doped with F or carbon, porous SiO2)

• Parasitic capacitance losses

– Silicon on insulator (SOI), created through high-temperature annealing of a substrate containing an embedded oxygen-rich layer

• Charge mobility impediments

– Strained silicon (e.g., grown on a Ge substrate)

• Subthreshold leakage

– Source biasing

– Multi-threshold CMOS

– Sleep transistor sizing

• Gate oxide leakage (tunneling)

– High-k materials to increase gate capacitance (hafnium and zirconium silicates and oxides)


Component Price

Cost of die = (Cost per wafer) / (Dies per wafer × Die yield)

where die yield is the percentage of good dies on the wafer.

[Figures: relationship between the number of dies per wafer and yield (at constant fault density); a 200 mm wafer with Athlon 64 chips in a 90 nm process]
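A short sketch of the cost relation above; the dies-per-wafer estimate is a common textbook approximation (gross wafer area over die area, minus an edge-loss term), and every numeric input below is made up for illustration:

import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # gross dies minus dies lost along the wafer edge (approximation)
    r = wafer_diameter_mm / 2.0
    return math.pi * r**2 / die_area_mm2 - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)

def cost_of_die(cost_per_wafer, wafer_diameter_mm, die_area_mm2, die_yield):
    # cost of die = cost per wafer / (dies per wafer * die yield)
    return cost_per_wafer / (dies_per_wafer(wafer_diameter_mm, die_area_mm2) * die_yield)

# e.g. a 200 mm wafer, 100 mm^2 dies, 80% yield, $5000 per wafer (made-up numbers)
print(round(cost_of_die(5000, 200, 100, 0.80), 2))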



Memory

• Serves as storage of computational state and the instructions that transform that state

• Organized in a hierarchy, ordered by increasing access latency and decreasing bandwidth:

– Primary storage

• Directly accessible by the CPU

• Volatile (loses contents after power-down) in most cases today

• Includes: CPU registers, CPU caches, main memory

– Secondary storage

• Not directly accessible by the CPU

• Data transfers accomplished through an intermediate area in primary storage and dedicated I/O channels

• Non-volatile

• Includes: hard disks, removable optical disks (CD and DVD drives), USB sticks, tape drives, etc.

– Tertiary storage

• Involves automated (robotic) mechanisms mounting removable media on demand

• Includes: tape libraries, optical jukeboxes

– Off-line storage

• Involves human interaction to handle the storage media

• Used for physical information transfer, e.g., to ensure information security


Memory Technology Metrics*

• Memory access latencies impact how quickly the information is available to the processor

– 100s of ps (CPU registers) to 10s…100s of ns (main memory) for primary storage

– Around 100µs for FLASH memory, several ms for hard drives, hundreds of ms for optical disks

– Single to tens of seconds for robotic storage libraries

• Memory bandwidth defines the maximum rate at which information is available for processing, preventing processor starvation

– 10s of GB/s for registers and L1 cache, single GB/s for main memory

– Ranges from 10s MB/s (optical media and FLASH) to 100s MB/s (high performance hard disks)

– Low 10s to over 100 MB/s for a single tape drive; aggregate throughput may be much higher depending on the number of devices

• Capacity: determines maximum problem size possible to compute (otherwise memory starvation and potential contention between competing processes/threads may result)

– Capacities range from a few bytes per register, through a few KB to tens of MB for CPU caches, 100s of GB to 1 TB for hard disks, and up to tens of PB for large tertiary storage systems

• Capacity density: determines additional costs associated with assembling storage of required capacity, such as volume, or device footprint
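As a rough illustration of how the latency and bandwidth figures above interact, the sketch below (an assumption, not part of the slides) models the time to fetch a block from a given level as access latency plus size divided by bandwidth:

def transfer_time_s(latency_s, bytes_moved, bandwidth_bytes_per_s):
    # time to move a block = access latency + streaming time
    return latency_s + bytes_moved / bandwidth_bytes_per_s

# e.g. a 64 B cache line from main memory: ~60 ns latency, ~5 GB/s sustained
print(transfer_time_s(60e-9, 64, 5e9))            # latency-dominated
# e.g. 64 KB from a hard disk: ~5 ms seek, ~100 MB/s streaming
print(transfer_time_s(5e-3, 64 * 1024, 100e6))    # seek-dominated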


Evolution of Main Memory Technology

Mercury delay line memory (UNIVAC I, 1951): 1000 words of 12 characters, 400 µs latency

Williams tube (Manchester Mark I, 1949): 500 to 1000 bits, 300–400 µs latency

Drum memory (IBM 650, 1954): 10,000–20,000 characters, 2.4 ms average latency

Core memory: example capacity 32x32 array; cycle time from 6 µs in 1951 down to 600 ns in the late 1970s

Semiconductor memory (shown: Intel 1103, 1970): up to 4 Gb per DRAM device; cycle time sub-10 ns (SRAM)


Main Memory Implementations*

• Static Random Access Memory (SRAM)

– Fastest access time

– Modest power consumption per bit

– Relatively expensive to manufacture

– Discrete SRAM is slowly being displaced in favor of PSRAM, which is based on a DRAM equipped with an SRAM-like interface

• Dynamic Random Access Memory (DRAM)

– Cheapest production cost per bit

– Highest capacity per die area

• FLASH Memory

– Non-volatile

– Capacities approach those of DRAM


Static RAM*

• Uses a six-MOSFET (6T) cell to store each bit

– M5 and M6 are control transistors, biased through the word line (WL), which is selected by the address decoder

– M1..M4 form a bi-stable latch with two possible stable states:

• M2, M3 on and M1, M4 off corresponds to bit value 0

• M1, M4 on and M2, M3 off represents bit value 1

– Each cell is connected to a bit line (BL) to communicate stored bit values to and from the I/O logic

• Complementary bit lines (as in the figure) are frequently used to improve noise immunity and the performance of some operations

• A memory with m address lines and n data lines operates on a 2^m × n bit matrix

• Multiported memories are possible through the addition of multiple BL pairs, WLs, and control transistors (M5, M6) to each cell

• Operational and interface variants:

– Asynchronous: no clock signal used

– Synchronous: data transfers controlled by an external clock signal

– ZBT (Zero Bus Turnaround): no additional latency when changing from read to write operation, and vice versa

– Pipeline Burst: may stream data effectively every clock cycle, but with latency exceeding a single clock tick

[Figures: SRAM cell schematic; dual-ported SRAM from Cypress]


Sample SRAM Block Diagrams

[Block diagrams: a simple asynchronous SRAM and a modern dual-ported 300 MHz QDR-II SRAM]

http://download.cypress.com.edgesuite.net/design_resources/datasheets/contents/cy7c1009bn_8.pdf

http://download.cypress.com.edgesuite.net/design_resources/datasheets/contents/cy7c1311cv18_8.pdf


Dynamic RAM*

• Stores each bit of data using only one transistor and one capacitor

• Lines connecting each row are word lines, the column lines are called bit lines

• Due to capacitor leakage, requires periodic refresh of stored charge (hence dynamic)

• The bit cells are organized in roughly square matrices, frequently multiple per die

– Banking permits parallel access to multiple matrices (bandwidth increase/latency reduction)

• DRAMs are typically designed with redundancy built-in

– Providing substitute rows in case of permanent faults (during production and use)

– Supplying storage for corrective codes to detect and correct soft errors (ECC)

• Several interface variants developed over the years

– Asynchronous (clock-less) vs. synchronous (SDRAM)

– Now practically obsolete: FPM, EDO, BEDO, “plain” SDRAM, Rambus DRAM (RDRAM)

– DDR (Double Data Rate), with DDR2, DDR3 variants

– XDR and XDR2 (next incarnation of Rambus interface)

– GDDR-n (n = 2, 3, 4, 5): specialized graphic memory


Schematic drawing based on the original DRAM patent (1968)


Currently Used SDRAM Variants*

• DDR2

– Operates at twice the bus speed of DDR

– DDR2-1066 achieves 1066 million 64-bit transfers per second using a 533 MHz bus clock (or 266 MHz memory clock)

• The corresponding module, PC2-8500, has a peak bandwidth of over 8.5 GB/s and a cycle time of 3.75 ns

– Module capacity: up to 4 GB

– Features 4-bit deep prefetch buffers

• DDR3

– Current successor of DDR2, with 30% lower power consumption (1.5 V supply)

– Doubles the effective I/O clock range to 800 MHz (with a 200 MHz memory clock)

– 8-bit deep prefetch buffers

– DDR3-1600 (PC3-12800) achieves peak 12.8 GB/s with cycle time of 5 ns

• DDR4

– Will be the successor of DDR3 and is expected to be available in 2012

– Expected to run at a supply voltage of 1.2 V or less

– Expected to support in excess of 2 billion data transfers per second
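The peak-bandwidth figures quoted for these modules follow from simple arithmetic: a 64-bit module performing N million transfers per second moves N x 8 MB/s. A small sketch:

def ddr_peak_bandwidth_gb_s(million_transfers_per_s, bus_width_bits=64):
    # peak rate = transfers/s x bytes per transfer
    return million_transfers_per_s * 1e6 * (bus_width_bits // 8) / 1e9

print(ddr_peak_bandwidth_gb_s(1066))   # DDR2-1066 / PC2-8500 -> ~8.5 GB/s
print(ddr_peak_bandwidth_gb_s(1600))   # DDR3-1600 / PC3-12800 -> 12.8 GB/s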


Currently Used SDRAM Variants*

• GDDR5

– Provides twice the I/O throughput of its immediate predecessor, GDDR4

– Uses an 8n prefetch architecture

– Clock rate up to 1.125 GHz, supporting a data rate of 4.5 Gbit/s per pin

• XDR

– Next incarnation of Rambus RDRAM (differential open-collector driver)

– Clock rate up to 800 MHz, with planned 1066 MHz

– 8 or 16 lanes per chip, providing 3.2 or 6.4 GB/s at 400 MHz

– Eight memory banks, up to 4 bank-interleaved transactions at full bandwidth

– Used in small outline, high-bandwidth consumer applications (game consoles)


Block Diagram of Asynchronous DRAM


http://www.issi.com/pdf/41LV16105B.pdf


Block Diagram of DDR3 DRAM


http://download.micron.com/pdf/datasheets/dram/ddr3/1Gb%20DDR3%20SDRAM.pdf


Common DRAM Packaging

DIMM (Dual In-line Memory Module) formats:

– 184-pin DIMM with DDR SDRAM

– 240-pin DIMM with DDR2 SDRAM

– 240-pin DIMM with DDR3 SDRAM (different notch location from the DDR2 DIMM)

SO-DIMM (Small Outline DIMM) formats for laptops:

– 200-pin DDR (bottom) and DDR2 (top) modules (note the different notch locations)

– 204-pin DDR3 module

XDIMM: 232-pin module with XDR memory


SRAM and DRAM Control Waveform Comparison

[Timing diagrams: read operation in an asynchronous SRAM vs. read operation in an asynchronous DRAM (Fast Page Mode); note the addition of the RAS and CAS signals in the DRAM]

http://download.cypress.com.edgesuite.net/design_resources/datasheets/contents/cy7c1009bn_8.pdf

http://www.issi.com/pdf/41LV16105B.pdf


FLASH Memory*

• Invented by Dr. Fujio Masuoka at Toshiba in 1984

• Technically a kind of EEPROM (Electrically Erasable Programmable Read-Only Memory)

• Bridges the gap between mechanical and solid state storage

– More resilient to shock than mechanical devices

– Slower access (esp. write) than other solid state devices

• Non-volatile

• Two major implementations:

– NOR

• Long write and erase times

• Provides random read access to any location (but sector-sized erase, write and unlock operations)

• Relies on asymmetric programming and erase processes (hot electron injection/quantum tunneling)

• Endures 10,000 to 100,000 erase cycles

• Debuted in CF (Compact Flash) cards

• Frequently used as reprogrammable instruction memories in microcontrollers

– NAND

• Requires smaller area per cell (hence is cheaper)

• Faster erase times and higher endurance than NOR

• Relies on symmetric programming/erase process (Fowler-Nordheim tunneling)

• Provides only block-granularity access

• First available in SmartMedia cards; now practically all FLASH media have moved to this technology (SD, MMC, xD, MemoryStick, and derivatives)

• Control logic performs wear leveling to uniformly distribute writes over all available blocks


NOR and NAND FLASH Structures

[Figures: NOR FLASH layout and NAND FLASH layout]


FLASH Based Devices

• Storage cards (below: xD, SD card family, MemoryStick Pro Duo)

• USB sticks

• Solid state drives


Secondary Storage: Hard Drives*

• Storage capacity: up to 2 TB per drive

• Areal density: 400 Gbit/in² (perpendicular recording)

• Rotational speed: up to 15,000 RPM

• Seek time

– Track-to-track: 0.2 ms

– Average: 3.5 ms

– Full stroke: 6.7 ms

• Sustained transfer rate: up to 125 MB/s

• Non-recoverable error rate: 1 in 10^17

• Interface bandwidth (a rough access-time sketch follows this list):

– Fibre Channel: 400 MB/s

– Serial Attached SCSI (SAS): 300 MB/s

– Ultra320 SCSI: 320 MB/s

– Serial ATA (SATA): 300 MB/s
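A rough access-time sketch using the figures above; the model (average seek plus half a rotation plus streaming time) is a standard approximation and is an assumption here, not taken from the slide:

def disk_access_time_ms(avg_seek_ms, rpm, request_bytes, sustained_mb_s):
    half_rotation_ms = 0.5 * 60000.0 / rpm               # average rotational latency
    transfer_ms = request_bytes / (sustained_mb_s * 1e6) * 1e3
    return avg_seek_ms + half_rotation_ms + transfer_ms

# 3.5 ms average seek, 15,000 RPM, 125 MB/s sustained, 4 KiB request
print(disk_access_time_ms(3.5, 15000, 4096, 125))        # ~5.5 ms, dominated by seek + rotation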


Secondary Storage: Tape Drives

• Data are stored on magnetic tape

– Only sequential access

– Long average seek times (average rewind time is around a minute)

• Streaming bandwidth rivals that of a single hard drive

– SUN's T10000 tape drive achieves a native transfer rate of 120 MB/s

• Formats

– The older proprietary DLT (Digital Linear Tape) format featured up to 800 GB (DLT-S4) stored on a half-inch metal particle tape in a color-coded enclosure

– Linear Tape-Open (LTO) is an open alternative to DLT, initiated by Seagate, HP, and IBM

• The LTO-4 generation stores 800 GB per tape

• Two form factors: Accelis (8 mm, dual reel, oriented toward fast access) and Ultrium (1/2", single reel, favors capacity)

• Supports data transfers up to 120 MB/s

• WORM and drive-level encryption capable

[Photos: IBM TS1120 Tape Drive; HP SB920c Tape Blade; SUN T10000 Tape Drive]


Tertiary Storage

• Tape library example: StorageTek SL8500

– Modular system (extensible)

– Scales from 64 to 448 tape drives and from 1,448 to over 70,000 tape slots

– 50 tape slots per ft²

– In the maximum configuration (32 libraries):

• Throughput: 193.2 TB/h (with T10000 drives)

• Capacity (uncompressed): 35 PB

• Avg. cell-to-drive time per HandBot: <11 s

• Mean exchanges between failures (per bot): 2,000,000

• Footprint: 296.5" x 2,338"

• Weight: 247,830 lb

– Robotics run an oval path layout, achieving up to 1000 tape loads per hour; multiple robots are possible

– Supports mixed tape (LTO, DLT) and drive types

– Hot swappable and redundant



Communication Concepts*

• Latency is the delay of a transmission between the sender and the receiver

• Throughput can be thought of as the maximum channel capacity between a sender and a receiver (a simple cost-model sketch combining the two follows this list)

• Intranode communication

– Buses

– Northbridge / Southbridge

– PCI

– HT

• Internode communication

– Network controller cards

– Network switch

– Network cables
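A minimal sketch combining the two definitions above into the usual latency/bandwidth ("alpha-beta") cost model; the parameter values are illustrative and do not describe any particular interconnect:

def message_time_s(latency_s, message_bytes, bandwidth_bytes_per_s):
    # transfer time = fixed latency + size / bandwidth
    return latency_s + message_bytes / bandwidth_bytes_per_s

# small messages are latency-bound, large messages are bandwidth-bound
print(message_time_s(5e-6, 8, 1e9))            # 8 B at 5 us latency, 1 GB/s
print(message_time_s(5e-6, 8 * 2**20, 1e9))    # 8 MiB at the same parameters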


Bus : History

• First generation

– Microcomputer bus systems: essentially a passive backplane connected directly, or through buffer amplifiers, to the pins of the CPU

– Memory and other devices connected in parallel to the bus, using the same address & data pins as the CPU

– 8-bit parallel

– All the equipment using the bus had to talk at the same speed, since it shared a single clock; increasing the CPU speed was hard because the devices had to work at the higher speed as well

– Adding expansion cards involved jumpers to set the memory address, I/O addresses, interrupt priorities, and interrupt numbers

[Diagram: S-100 bus in the Altair, connecting the 8080 CPU, 4K RAM, and a tape controller]


Bus : History

• Second generation

– E.g.: NuBus

– CPU & memory on one side and devices on the other side

– A bus controller controlled access to the CPU and memory by these devices

– Allowed the CPU to change speed without affecting the devices

– 16- to 32-bit parallel

– As the number of devices increased, bus speed became a limiting factor

– Data-intensive activities such as graphics processing overwhelmed the bus controllers

– AGP was introduced to address the data needs of graphics processing

• Third generation

– Current-day buses, including PCIe, USB, HyperTransport, etc.


Buses

• A bus is a subsystem that transfers data between various components within a node

• Modern bus subsystems can be used as both parallel and bit-serial connections

– Can be wired in a daisy-chain topology or connected by switched hubs, e.g., USB

• Buses carry 3 main types of signal: data, address, control

• Most buses are backplane components

– A circuit board with parallel connectors into which other boards are plugged perpendicularly

• Others are cable-based components

– Devices in separate chassis connected together by cables that carry the signals

[Photos: PCI & ISA slots; PCI riser card]


Bus Control Mechanisms

• 2 main bus control mechanisms

– Synchronous

• Includes a clock in the control lines that validates each signal; a fixed protocol for communication relative to the clock

– Asynchronous

• Uses a handshake protocol instead

• In a synchronous bus design there is an assumption of a basic clock rate, and increasing that rate can cause older devices to fail to operate properly. In an asynchronous design, older devices may simply reduce performance.

• An asynchronous protocol is usually more complex, requiring both more hardware and more overhead for each transaction. Synchronous buses can operate with lower latency and higher bandwidth for a given number of signals.


http://www.cs.umass.edu/~weems/CmpSci635/635lecture12.html


Ways to Transfer Data Between Memory and Device*

http://courses.cs.tamu.edu/cpsc321/

Overhead of Polling

http://courses.cs.tamu.edu/cpsc321/



Northbridge & Southbridge*

• Northbridge

– The Northbridge handles the communication between the CPU, RAM, AGP or PCI Express, and the Southbridge

– The Northbridge regulates the number, speed, and type of CPU(s) and the amount, speed, and type of RAM that can be used

– In AMD64 systems the memory controller (integrated on the processor) handles the communication between the CPU and RAM

• Southbridge

– Also known as the I/O controller hub, the Southbridge traditionally handles devices with slower capabilities

– Manages the PCI bus, ISA bus, DMA controller, IDE (SATA / PATA) controller, onboard network controller, and more

[Diagram: Northbridge and Southbridge chipset layout]


PCI*

• PCI stands for Peripheral Component Interconnect

• The PCI standard specifies a subsystem for attaching peripheral devices to a motherboard

• Mainly two types of devices are connected via PCI

– Integrated circuits incorporated onto the motherboard

– Expansion cards that fit into the PCI slots

• PCI is being superseded by PCI Express, which provides higher bandwidth

• In most systems, at boot time the operating system / firmware queries the PCI buses to determine which devices are present and the system resources each device needs



PCI enhanced systems

http://arstechnica.com/articles/paedia/hardware/pcie.ars/1


PCI-Express

Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1  | 2.5 GHz | 5 Gbps  | 400 MBps | Slots, Gigabit Ethernet
x2  | 2.5 GHz | 10 Gbps | 800 MBps |
x4  | 2.5 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8  | 2.5 GHz | 40 Gbps | 3.2 GBps |
x16 | 2.5 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

http://www.redbooks.ibm.com/abstracts/tips0456.html

PCIe 2.0 doubles the bus standard's per-lane bandwidth from 0.25 GB/s to 0.5 GB/s
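The per-lane figures follow from the signaling rate and the line coding: PCIe 1.x signals at 2.5 GT/s per lane with 8b/10b encoding, so 8 of every 10 bits carry data, i.e., 250 MB/s (0.25 GB/s) per lane per direction, matching the bandwidth comparison below. A sketch:

def pcie1_throughput_mb_s(lanes):
    signaling_per_lane = 2.5e9               # transfers per second per lane
    data_bits = signaling_per_lane * 8 / 10  # 8b/10b line coding
    return lanes * data_bits / 8 / 1e6       # bytes/s per direction -> MB/s

for lanes in (1, 4, 8, 16):
    print(lanes, pcie1_throughput_mb_s(lanes))   # 250, 1000, 2000, 4000 MB/s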


PCI-X

Variant | Bus width | Clock speed | Features | Bandwidth
PCI-X 66  | 64 bits | 66 MHz  | Hot plugging, 3.3 V | 533 MB/s
PCI-X 133 | 64 bits | 133 MHz | Hot plugging, 3.3 V | 1.06 GB/s
PCI-X 266 | 64 bits (optional 16 bits only) | 133 MHz, double data rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 2.13 GB/s
PCI-X 533 | 64 bits (optional 16 bits only) | 133 MHz, quad data rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 4.26 GB/s


Bandwidth Comparisons

CONNECTION | BITS | BYTES
PCI 32-bit/33 MHz | 1.06666 Gbit/s | 133.33 MB/s
PCI 64-bit/33 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 32-bit/66 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 64-bit/66 MHz | 4.26666 Gbit/s | 533.33 MB/s
PCI 64-bit/100 MHz | 6.39999 Gbit/s | 799.99 MB/s
PCI Express (x1 link) | 2.5 Gbit/s | 250 MB/s
PCI Express (x4 link) | 10 Gbit/s | 1 GB/s
PCI Express (x8 link) | 20 Gbit/s | 2 GB/s
PCI Express (x16 link) | 40 Gbit/s | 4 GB/s
PCI Express 2.0 (x32 link) | 80 Gbit/s | 8 GB/s
PCI-X DDR 16-bit | 4.26666 Gbit/s | 533.33 MB/s
PCI-X 133 | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X QDR 16-bit | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X DDR | 17.066 Gbit/s | 2.133 GB/s
PCI-X QDR | 34.133 Gbit/s | 4.266 GB/s
AGP 8x | 17.066 Gbit/s | 2.133 GB/s


HyperTransport : Context

• The Northbridge-Southbridge arrangement facilitates communication between system memory, the graphics adapter, and the CPU over a fast processor bus

• The Southbridge operates several I/O interfaces, working through the Northbridge over another proprietary connection

• This approach is potentially limited by emerging bandwidth demands placed on inadequate I/O buses

• HyperTransport is one of many technologies aimed at improving I/O

• High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires, consequently reducing cost and power requirements

• HyperTransport is also used for communication between multiple AMD Opteron CPUs

http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html


HyperTransport (continued)

• A point-to-point parallel topology uses 2 unidirectional links (one each for upstream and downstream)

• HyperTransport technology chunks data into packets to reduce overhead and improve the efficiency of transfers

• Each HyperTransport link also contains an 8-bit data path that allows the insertion of a control packet in the middle of a long data packet, thus reducing latency

• In summary: "HyperTransport™ technology delivers the raw throughput and low latency necessary for chip-to-chip communication. It increases I/O bandwidth, cuts down the number of different system buses, reduces power consumption, provides a flexible, modular bridge architecture, and ensures compatibility with PCI." http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html



Networking : A Historical Perspective

• Early machines were:

– Collections of microprocessors

– Communication was performed using bi-directional queues between nearest neighbors

• Messages were forwarded by processors on the path

– "Store and forward" networking

• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops and hence the time

www.mat.unimi.it/users/pavarino/par_comp/


Network Analogy

• To have a large number of transfers occurring at once, you need a large number of distinct wires

• Networks are like streets:

– Link = street

– Switch = intersection

– Distance (hops) = number of blocks traveled

– Routing algorithm = travel plan

• Properties:

– Latency: how long it takes to get between nodes in the network

– Bandwidth: how much data can be moved per unit time

• Bandwidth is limited by the number of wires and the rate at which each wire can accept data

www.mat.unimi.it/users/pavarino/par_comp/


InterNode Communication (OSI Model)*

• Open Systems Interconnection reference model (OSI Model)

– Application Layer

• Interfaces directly to, and performs application services for, the application processes

– Presentation Layer

• Establishes a context between application layer entities

– Session Layer

• Establishes, manages, and terminates connections

– Transport Layer

• Provides transparent transfer of data between end users; e.g., TCP and UDP

– Network Layer

• Provides the functional and procedural means of transferring data from source to destination via one or more networks, in support of the transport layer; e.g., the IP layer

– Data Link Layer

• Provides the functional and procedural means to transfer data between network entities and to detect and correct errors occurring in the physical layer

– Physical Layer

• Hubs, switches, cables, etc.


Network Cabling : Fiber Optic / Copper*

• Connections use two main physical implementation media:

– Fiber optic channels

– Copper wires

• Fibre Channel infrastructure consists of 2 main classes:

– Directors: high port count, modular, no single point of failure, high availability

– Switches: smaller fixed-configuration, semi-modular, less redundant devices


Design Characteristics of a Network*

• Topology (how things are connected):

– Crossbar, ring, 2-D and 3-D meshes or tori, hypercube, tree, butterfly, perfect shuffle, ...

• Routing algorithm (path used):

– Example in 2D torus: all east-west then all north-south (avoids deadlock).

• Switching strategy:

– Circuit switching: full path reserved for entire message, like the telephone.

– Packet switching: message broken into separately-routed packets, like the post office.

• Flow control (what if there is congestion):

– Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.

www.mat.unimi.it/users/pavarino/par_comp/


Performance Properties of a Network: Latency*

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.

• Latency: delay between send and receive times

– Latency tends to vary widely across architectures

– Vendors often report hardware latencies (wire time)

– Application programmers care about software latencies (user program to user program)

• Observations:

– Hardware/software latencies often differ by 1-2 orders of magnitude

– Maximum hardware latency varies with diameter, but the variation in software latency is usually negligible

• Latency is important for programs with many small messages

www.mat.unimi.it/users/pavarino/par_comp/


Performance Properties of a Network: Bandwidth

• The bandwidth of a link = w * 1/t

– w is the number of wires

– t is the time per bit

• Bandwidth is typically measured in gigabytes (GB), i.e., 8 × 2^30 bits

• Effective bandwidth is usually lower than the physical link bandwidth due to packet overhead (see the sketch below)

[Packet diagram: routing and control header, data payload, error code, trailer]

• Bandwidth is important for applications with mostly large messages

www.mat.unimi.it/users/pavarino/par_comp/
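A small sketch of the effective-bandwidth point above: the usable rate scales with payload / (payload + per-packet overhead). The header/trailer sizes below are illustrative assumptions:

def effective_bandwidth(link_bw_bytes_s, payload_bytes, overhead_bytes):
    # usable rate = link rate scaled by the payload fraction of each packet
    return link_bw_bytes_s * payload_bytes / (payload_bytes + overhead_bytes)

# e.g. a 1 GB/s link, 1500 B payloads, 58 B of header/trailer per packet
print(effective_bandwidth(1e9, 1500, 58))    # ~0.96 GB/s usable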


Performance Properties of a Network: Bisection Bandwidth*

• Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves

• The bandwidth across the "narrowest" part of the network

[Figure: a bisection cut vs. a cut that is not a bisection; in the examples shown, bisection bw = link bw and bisection bw = sqrt(n) * link bw]

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

www.mat.unimi.it/users/pavarino/par_comp/


Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology.

– Key cost to be minimized: number of “hops” between nodes (e.g. “store and forward”)

– Modern networks hide hop cost (i.e., “wormhole routing”), so topology is no longer a major factor in algorithm performance.

• Example: On IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.

• Need some background in network topology

– Algorithms may have a communication topology

– Topology affects bisection bandwidth.

www.mat.unimi.it/users/pavarino/par_comp/


Linear and Ring Topologies

• Linear array

– Diameter = n-1; average distance ~n/3.

– Bisection bandwidth = 1 (in units of link bandwidth).

• Torus or Ring

– Diameter = n/2; average distance ~ n/4.

– Bisection bandwidth = 2.

– Natural for algorithms that work with 1D arrays.

www.mat.unimi.it/users/pavarino/par_comp/


Meshes and Toroids*

• Two-dimensional mesh

– Diameter = 2 * (sqrt(n) – 1)

– Bisection bandwidth = sqrt(n)

• Generalizes to higher dimensions (the Cray T3D used a 3D torus)

• Natural for algorithms that work with 2D and/or 3D arrays

• Two-dimensional torus

– Diameter = sqrt(n)

– Bisection bandwidth = 2 * sqrt(n) (a sketch collecting these formulas follows this slide)

www.mat.unimi.it/users/pavarino/par_comp/
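A sketch collecting the diameter and bisection-bandwidth formulas quoted for these topologies, in hops and multiples of the link bandwidth; n is the node count and is assumed to be a perfect square for the 2-D cases:

import math

def topology_metrics(kind, n):
    # returns (diameter, bisection bandwidth) for n nodes
    s = int(math.isqrt(n))
    return {
        "linear":  (n - 1,        1),
        "ring":    (n // 2,       2),
        "mesh2d":  (2 * (s - 1),  s),
        "torus2d": (s,            2 * s),
    }[kind]

for kind in ("linear", "ring", "mesh2d", "torus2d"):
    print(kind, topology_metrics(kind, 64))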


Hypercubes

• Number of nodes n = 2^d for dimension d

– Diameter = d

– Bisection bandwidth = n/2

[Figure: hypercubes of dimension 0d, 1d, 2d, 3d, 4d; the 3d cube's nodes are labeled 000, 001, 010, 011, 100, 101, 110, 111]

• Popular in early machines (Intel iPSC, nCUBE)

– Lots of clever algorithms

• Gray-code addressing:

– Each node is connected to d others whose addresses differ in exactly 1 bit (see the neighbor sketch below)

www.mat.unimi.it/users/pavarino/par_comp/
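A tiny sketch of the addressing idea above: in a d-dimensional hypercube, a node's neighbors are exactly the addresses that differ from it in one bit:

def hypercube_neighbors(node, d):
    # flip each of the d address bits in turn
    return [node ^ (1 << bit) for bit in range(d)]

print(hypercube_neighbors(0b000, 3))   # [1, 2, 4] -> 001, 010, 100
print(hypercube_neighbors(0b101, 3))   # [4, 7, 1] -> 100, 111, 001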


Trees*

• Diameter = log n.

• Bisection bandwidth = 1.

• Easy layout as planar graph.

• Many tree algorithms (e.g., summation).

• Fat trees avoid bisection bandwidth problem:

– More (or wider) links near top.

– Example: Thinking Machines CM-5.

www.mat.unimi.it/users/pavarino/par_comp/


Butterflies with n = (k+1)2^k nodes

• Diameter = 2k.

• Bisection bandwidth = 2^k.

• Cost: lots of wires.

• Used in BBN Butterfly.

• Natural for FFT.

[Figures: a 2x2 butterfly switch (inputs 0, 1; outputs 0, 1) and a multistage butterfly network]

www.mat.unimi.it/users/pavarino/par_comp/


Topologies in Real Machines

System – topology (listed roughly from newer to older):

Roadrunner – fat tree
Red Storm (Opteron + Cray network, future) – 3D mesh
Blue Gene/L & Blue Gene/P – 3D torus
SGI Altix – fat tree
Cray X1 – 4D hypercube*
Myricom (Millennium) – arbitrary
Quadrics (in HP Alpha server clusters) – fat tree
IBM SP – fat tree (approx.)
SGI Origin – hypercube
Intel Paragon (old) – 2D mesh
BBN Butterfly (really old) – butterfly

* Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube," and some of the fat trees are not as fat as they should be at the top.

www.mat.unimi.it/users/pavarino/par_comp/


Top 500 : Interconnect Family

[Charts: interconnect family share of the November 2010 Top 500 list, by number of systems and by performance]


Top 500 : Interconnect




Ethernet

[Figure: this diagram was hand-drawn by Robert M. Metcalfe and photographed by Dave R. Boggs in 1976 to produce a 35 mm slide used to present Ethernet to the National Computer Conference in June 1976]


Ethernet – Cards / Controllers*

• Gigabit Ethernet

– Approximate speed of 1000 Mbit/s

– Transmission range up to 5 km

– IEEE 802.3 standard

– Sometimes integrated with the motherboard

– PCI-based expansion cards available

– Common chipset manufacturers: Broadcom, Intel, Cisco, D-Link, 3Com, etc.

• 10 Gigabit Ethernet

– Approximate speed of 10 Gbit/s

– Transmission range up to 40 km

– Works with the IEEE 802.3 standard

– PCIe / PCI-X based cards available

– Common chipset manufacturers include Intel

[Photos: Gigabit Ethernet PCI card; 10 Gigabit Ethernet PCI-X card; dual-port multithreaded 10 Gigabit Ethernet PCIe card]


Ethernet Switches & Cables

• A networking device that operates at one of the OSI layers: physical, data link, network, or transport (a multilayer switch is one that operates at several OSI layers)

• Switch ports almost always default to full-duplex operation, unless there is a requirement for interoperability with devices that are strictly half duplex

• Half-duplex means that the device can only send or receive at any given time, whereas full-duplex devices can send and receive at the same time

• Switches come in several form factors: rack-mounted, non-rack-mounted, chassis

• Managed or unmanaged switching

– Unmanaged switches do not have a configuration interface or options

– Managed switches allow access to one or more interfaces for maintenance and management

[Photos: 10 GigE switch; 1 GigE switch; chassis switch]


InfiniBand Switches and HCAs*

• InfiniBand is a point-to-point bidirectional serial link intended for the connection of processors with high speed peripherals such as disks.

• It supports several signaling rates and, as with PCIe, links can be channel bonded together for additional bandwidth.

• The serial connection's signaling rate is 2.5 Gbit/s in each direction per connection (SDR)

• InfiniBand supports double (DDR, 5 Gbit/s) and quad (QDR, 10 Gbit/s) data rates at the same data-clock rate (an effective-throughput sketch follows this list)

• Single data rate switch chips have a latency of 200 ns, and DDR switch chips have a latency of 140 ns

• InfiniBand uses a switched fabric topology, as opposed to a hierarchical switched network like Ethernet

• Data is transmitted in packets of up to 4 kB that are taken together to form a message. A message can be:

– a direct memory access read from, or write to, a remote node (RDMA)

– a channel send or receive

– a transaction-based operation (that can be reversed)

– a multicast transmission

– an atomic operation
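The effective-throughput table under "InfiniBand Cabling" below follows from the signaling rates just listed combined with 8b/10b encoding (8 data bits per 10 signaled bits). A sketch:

def ib_data_rate_gbit_s(lanes, rate="SDR"):
    per_lane_signal_gbit_s = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}[rate]
    return lanes * per_lane_signal_gbit_s * 8 / 10   # 8b/10b line coding

for lanes in (1, 4, 12):
    print(lanes, [ib_data_rate_gbit_s(lanes, r) for r in ("SDR", "DDR", "QDR")])
# 1X: 2/4/8, 4X: 8/16/32, 12X: 24/48/96 Gbit/s, matching the table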


InfiniBand Cabling

• CX4 Copper (15m)

• Flexible 30-Gauge Copper (3m)

• Fiber Optics up to 150m

Effective theoretical throughput in different configurations:

Link | Single (SDR) | Double (DDR) | Quad (QDR)
1X  | 2 Gbit/s  | 4 Gbit/s  | 8 Gbit/s
4X  | 8 Gbit/s  | 16 Gbit/s | 32 Gbit/s
12X | 24 Gbit/s | 48 Gbit/s | 96 Gbit/s


NCSA (National Center for Supercomputing Applications) Tungsten 2: 520-node supercomputer

[Diagram: 520 dual-CPU nodes (1,040 CPUs); core fabric of 6 72-port TS270 switches; edge fabric of 29 24-port TS120 switches; 174 uplink cables and 512 1 m cables to groups of 18 compute nodes]

• Parallel MPI codes for commercial clients

• Point-to-point 5.2 µs MPI latency

• Deployed: November 2004


IBM Roadrunner



InfiniBand Price / Performance

Metric | InfiniBand PCI-Express | 10GigE | GigE | Myrinet D | Myrinet E
Data bandwidth (large messages) | 950 MB/s | 900 MB/s | 100 MB/s | 245 MB/s | 495 MB/s
MPI latency (small messages) | 5 µs | 50 µs | 50 µs | 6.5 µs | 5.7 µs
HCA cost (street price) | $550 | $1.5K | Free | $535 | $880
Switch port | $250 | $1K-$2K | $100-$300 | $400 | $400
Cable cost (3 m street price) | $100 | $100 | $25 | $175 | $175

* Myrinet pricing data from Myricom web site (Dec 2004)
** InfiniBand pricing data based on Topspin avg. sales price (Dec 2004)
*** Myrinet, GigE, and IB performance data from June 2004 OSU study

• Note: MPI latency is processor to processor; switch latency is less


The InfiniBand Driver Architecture

[Diagram: InfiniBand driver stack. User- and kernel-level layers include BSD Sockets, file system API, TCP/IP, SDP, IPoIB, SRP, FCP, uDAPL/DAT, NFS-RDMA, SCSI, and verbs-level drivers over the InfiniBand HCA; the fabric connects to LAN/WAN, servers, and SAN through InfiniBand switches plus Ethernet and Fibre Channel gateways]



Summary – Material for the Test

• Logic technology & metrics – 12
• Memory technology & metrics – 23
• Main memory implementation – 25
• Static and dynamic RAM – 26, 28
• Current SDRAM variants – 29, 30
• Flash memory – 35
• Secondary storage: hard drives – 38
• Communication concepts – 42
• Ways to transfer data between memory & devices – 47
• Northbridge & Southbridge – 51
• PCI technology – 52
• OSI model – 62
• Cabling technologies: fiber vs. Cu – 63
• Design characteristics – 64
• Performance properties of a network: latency – 65
• Bisection bandwidth – 67
• Network topologies: meshes & toroids – 71
• Network topologies: trees – 73
• Ethernet cards/controllers – 80
• IB switches & HCAs – 82
