The BlueGene Supercomputer and LOFAR/LOIS
Bruce Elmegreen
IBM Watson Research Center
914 945 2448
[email protected]
LOFAR = Low Frequency Array
LOIS = LOFAR Outrigger in Sweden
BlueGene/L at LLNL
Current Radio Telescope Design
Parabolic receivers: Onsala Space Observatory
Connected element interferometers: Westerbork, Netherlands
Very Long Baseline Interferometers: the EVN (European VLBI Network)
LOFAR and LOIS: A Radical New Design (Radio Telescopes as Sensor Networks)
Tens of thousands of simple, fixed antennae
Sample antenna for LOIS
Sample design for LOFAR
The LOFAR/LOIS way to focus: electronically add signals with time delays chosen to reinforce the desired direction.
Points to the sky electronically (sketched below).
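A minimal delay-and-sum sketch of this focusing idea (function names, sample layout, and padding convention are illustrative assumptions, not the LOFAR pipeline):

```c
/* Delay-and-sum beamforming sketch (illustrative, not the LOFAR code).
   Each antenna's digitized stream is shifted by the geometric delay for
   the desired sky direction, then all streams are summed so that signals
   from that direction add coherently while others average out. */
#include <stddef.h>

void beamform(const float *const *samples, /* samples[a][t]: antenna a's stream */
              const int *delay,            /* per-antenna delay in samples      */
              size_t nant, size_t nsamp,
              float *beam)                 /* output: one "pointed" beam        */
{
    for (size_t t = 0; t < nsamp; t++) {
        float sum = 0.0f;
        for (size_t a = 0; a < nant; a++)
            sum += samples[a][t + delay[a]]; /* caller pads each stream */
        beam[t] = sum;
    }
}
```

Observing several directions at once (a later slide) simply means keeping several such sums, each with its own delay table.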
Current way to focus
76 m Lovell telescope at Jodrell Bank, near Manchester, UK
Advantages of the Sensor Network Approach
Antennae are cheap (no precision surfaces needed)
No moving parts (no expensive mounts, no pointing problems)
Low profile (no radome needed, no wind/weather problems)
Can change pointing directions quickly
Can observe multiple directions at the same time (i.e., with several sums)
Can extend for large distances by placing sensors anywhere
Provides a country-wide optical fiber grid for general use
Disadvantages of the Sensor Network Approach
Enormous data flows (parabolic focus replaced by digital sum)
Enormous central processors needed to handle data
Enormous Data Flows from Antenna Stations
LOFAR will have 46 remote stations and 64 stations in the central core
Each remote station transmits:
- 32,000 channels/ms in one beam, or 8 beams with 4,000 channels each
- 8+8 bit (or 16+16 bit) complex data
- 2 polarizations
- 1-2 Gbps from each station (checked in the sketch below)
Each central core station transmits the same data rate in several independent sky directions (for the Epoch of Reionization experiment)
110 - 320 Gbps input rates to central processor
Sample LOFAR station array, and antenna array for each station
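A quick check of the quoted per-station rate from the numbers above (my arithmetic, not from the talk):

```c
/* Per-station output rate: 32,000 complex channels per millisecond,
   2 polarizations, samples of 2 bytes (8+8 bit) or 4 bytes (16+16 bit). */
#include <stdio.h>

int main(void)
{
    double samples_per_s = 32000.0 * 1000 * 2;          /* channels/ms * pols */
    double gbps_8_8   = samples_per_s * 2 * 8 / 1e9;    /* 2-byte samples */
    double gbps_16_16 = samples_per_s * 4 * 8 / 1e9;    /* 4-byte samples */
    printf("%.1f - %.1f Gbps per station\n", gbps_8_8, gbps_16_16); /* 1.0 - 2.0 */
    return 0;
}
```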
Enormous Central Processors Needed
To make a single, highly-sensitive sky beam, the central processor does a weighted sum of all the station data (faint source detection, pulsars, ...)
To make a high-resolution map with a field of view equal to that of a single station, the central processor cross-correlates all station data.
Cross correlation of 110 LOFAR station inputs requires:
- 110*109/2 station pairs and 4 polarization product pairs
- for each of the 32,000 frequency channels, every millisecond
- = 0.8 Tera (10^12) complex products/second
For Epoch of Reionization maps using the LOFAR central core:
- 64*63/2 pairs × 4 pol. products × 32,000 ch./ms × 5 beams = 1.3 Tera complex products/second
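The same kind of check for the two central-processor rates above (my arithmetic):

```c
/* Complex-product rates for the full 110-station correlation and for the
   64-station central-core experiment with 5 beams. */
#include <stdio.h>

int main(void)
{
    double ch_per_s = 32000.0 * 1000;                 /* channels every ms  */
    double full = 110.0 * 109 / 2 * 4 * ch_per_s;     /* pairs * pol. prods */
    double core = 64.0 * 63 / 2 * 4 * ch_per_s * 5;   /* 5 beams            */
    printf("110 stations: %.1f Tera c.p./s\n", full / 1e12); /* ~0.8 */
    printf("central core: %.1f Tera c.p./s\n", core / 1e12); /* ~1.3 */
    return 0;
}
```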
LOFAR and LOIS: A Radical New Frequency
Will observe at 10-250 MHz, which is good because:
- neutral hydrogen is cosmologically redshifted (z~10)
- pulsars are bright
- galactic cosmic rays radiate
- radar echoes off the solar wind
- ...
but is bad because of:
- terrestrial emission from FM radio, TV, aircraft, satellites, ...
- ionospheric variations from solar irradiation and wind irregularities
Need streaming noise mitigation for each antenna, ionospheric phase corrections for each antenna station, high bit counts for data sampling, and a flexible programming environment (LINUX). One simple form of noise mitigation is sketched below.
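As one illustration of what streaming noise mitigation can mean, a simple threshold-based RFI flagger (a sketch under assumed names; the actual LOFAR mitigation is more elaborate):

```c
/* Illustrative threshold-based RFI flagging (not the LOFAR algorithm):
   flag any frequency channel whose power exceeds k times the mean power
   across the band, and zero it before summation. */
#include <stddef.h>

void flag_rfi(float *power, size_t nchan, float k)
{
    double mean = 0.0;                     /* crude band-average baseline */
    for (size_t c = 0; c < nchan; c++)
        mean += power[c];
    mean /= (double)nchan;

    for (size_t c = 0; c < nchan; c++)
        if (power[c] > k * mean)
            power[c] = 0.0f;               /* excise narrow-band interference */
}
```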
Black hole+jet of 3C236, the largest known radio galaxy (Schilizzi et al. 2001)
Quasar (z=3.2) lensed by foreground galaxy (Koopmans et al. 2002)
BlueGene/L for LOFAR
These requirements of LOFAR can be matched by BlueGene/L, a massively parallel computer designed by IBM in collaboration with the Lawrence Livermore National Laboratory.
BlueGene/L will also provide a top-ranked supercomputer facility in the Netherlands, useful for a wide range of problems.
BlueGene/L: History
- November 2001: partnership with LLNL
- November 2002: acquisition of BG/L by LLNL announced
- November 2003: 1/2-rack achieves 1.4 TFlop/s on Linpack (#73 on TOP500)
- February 2004: IBM and ASTRON announce joint research into high-data-volume supercomputing using BG/L
- May 2004: Argonne National Lab announces plan to get BG/L
- June 2004: 4-rack prototype achieves 11.68 TFlop/s on Linpack
- July 2004: AIST (Japan) announces use of BG/L for drug design
- September 2004: 8 racks achieve 36 TFlop/s on Linpack, surpassing NEC's Earth Simulator to become the world's fastest computer
NEC Earth Simulator vs 64-rack BG/L at LLNL
Earth Simulator: 40 Tf/s, 10 TB memory, 640 nodes, 500 MHz, 5 MW power, 34,000 sq ft floor space, $350M cost
64-rack BG/L: 360 Tf/s, 32 TB memory, 65,536 nodes, 700 MHz, 1.5 MW power, 2,500 sq ft floor space, $100M cost
64 racks of BG/L will have much more for much less.
BlueGene/L: "System on a Chip"
- 2 CPUs (4 FPUs) + memory + connection controls on each chip
- IBM CU-11, 0.13 µm process
- 11 x 11 mm die size, 25 x 32 mm CBGA package
- 474 pins, 328 signal
- 1.5/2.5 Volt
- 0.5 GB RAM/node with ECC
Dual Node Compute Card
- 2 chips per card, with heatsinks
- 9 x 256 Mb DRAM; 16-byte interface
- heatsinks designed for 15 W (measuring ~13 W @ 1.6 V)
- 54 mm (2.125") x 206 mm (8.125"), 14 layers
- Metral 4000 connector
16 compute cards per tray + 4 IO nodes
Four Gbps Ethernet connectors serve the IO nodes; the rest are processor nodes
512 Way BG/L Prototype
midplane = half rack = full torus
16 trays per half-rack = 512 nodes + 64 IO nodes
Cables for the torus in 3 dimensions
6 ASTRON racks have ~1 km of cables, each 1/2" thick
2 half-racks per rack, plus cables, power, cooling, etc.
8 racks, 36 Tf: #1 in the world
Rochester, MN, September 2004
Hardware Summary
- Two CPU cores per chip (used independently or as communication/computation)
- Each CPU can do 4 floating point ops (2 multiply/adds) per clock cycle
  - "double hummer" with quad-word loads
  - a double-precision complex product takes 2 clock cycles/CPU, 1 cycle/chip
- Frequency is 700 MHz
  - peak performance is 5.6 GFlops per chip, or 700 M complex products/s (using both CPUs)
- On-chip memory: L1 (32 kB/CPU), L2 (prefetch), L3 (4 MB/chip)
- 2 chips/card, 16 cards/tray, 16 trays/midplane, 2 midplanes/rack
  - 1024 chips/rack = 720 Giga complex products/second (peak, using both CPUs/chip)
  - tests with streaming data achieve 50%-90% efficiency for complex products
  - peak rate for LOFAR (6 racks) = 4.3 Tera complex products/s
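For reference, the "complex product" counted in all of these rates, written as plain portable C (a generic sketch, not BG/L double-hummer intrinsics):

```c
/* The complex multiply-accumulate behind the "complex products/second"
   figures. It is 4 fused multiply-adds: 2 cycles per CPU at 2
   multiply-adds/cycle, 1 cycle per 2-CPU chip, hence 700 M c.p./s
   per chip at 700 MHz. */
typedef struct { double re, im; } cplx;

static inline void cmac(cplx *acc, cplx a, cplx b)
{
    acc->re += a.re * b.re - a.im * b.im;
    acc->im += a.re * b.im + a.im * b.re;
}
```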
Chips are connected by a 3-dimensional torus network
- a torus has independent connections in 3 dimensions
- the torus bandwidth is 1.4 Gbps each way in all 3 dimensions simultaneously
- wiring actually jumps over physical neighbors to prevent large timing mismatches between distant edges
IO Network
- 4 IO connections/tray, 128/rack, each 1 Gbps bi-directional
- peak IO for LOFAR (6 racks) = 768 Gbps
- each IO feed goes to an IO node, which is connected to 8 compute nodes by a hierarchical tree that pipes data at 2.8 Gbps bi-directional
- the tree also does global communication and combine/broadcast for all nodes
Two other internal networks
Barrier network to all nodes: allows programs to stay in synch
Control network to all nodes: boot, monitor, partition
- partitions are set up in software using "link chips"
- the smallest torus partition is a midplane (512 nodes, half rack)
- may partition the system into tori as midplanes, racks, or multiple racks
- or may partition into smaller units (32-node trays) which connect nodes as a mesh
- each partition runs a different job
LOFAR EXAMPLE: Cross Correlation of Station Beams
64 virtual core stations and 46 remote stations (110 total) transmit data at ~220 Gbps
32,000 frequency channels are divided into 8 parts, each a separate sky beam
Each beam of data goes to a separate midplane:
- 220/8 = 27.5 Gbps input per midplane, with 64 Gbps of Ethernet IO available
Midplane processors must do:
- 110*109/2 station products × 4000 channels/ms × 4 polarization products
- = 96 Giga complex products/second, compared to a peak of 360 Giga complex products/s in virtual node mode (the correlation step is sketched below)
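A minimal sketch of the per-channel correlation step these numbers count (array names, data layout, and the use of C99 complex types are assumptions):

```c
/* For one frequency channel: accumulate all station-pair visibilities,
   4 polarization products per pair (XX*, XY*, YX*, YY*). */
#include <complex.h>

#define NSTATIONS 110

void correlate_channel(const double complex v[NSTATIONS][2],      /* [station][pol] */
                       double complex acc[NSTATIONS][NSTATIONS][4])
{
    for (int i = 0; i < NSTATIONS; i++)
        for (int j = i + 1; j < NSTATIONS; j++)   /* 110*109/2 pairs */
            for (int p = 0; p < 2; p++)
                for (int q = 0; q < 2; q++)       /* 4 pol. products */
                    acc[i][j][2 * p + q] += v[i][p] * conj(v[j][q]);
}
```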
LOFAR EXAMPLE: Internal Data Flow
One beam of data per midplane means:
- 4000 channels, 2 polarizations, 1000 ms time integration, 4-byte data (16+16)
- = 62.5 kBytes of data into each of the 512 nodes in a midplane each second
- (NOTE: 4 MBy of storage is available in the L3 cache per node)
MPI_Alltoall redistributes the data so each node has some of the channels in both polarizations from all telescopes (see the sketch below):
- a midplane has 8x8x8 nodes; the longest hop on the torus is 4 nodes on each axis, the average is 2 nodes per axis (x, y, z)
- hopping speed = 1.4 Gbps = 175 MBps each direction, 350 MBps each axis
- all-to-all time is 62.5 kB × 2 hops / 350 MBps = 3.6E-4 seconds
- (NOTE: plenty of time for computations; virtual node mode is OK)
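A sketch of that redistribution with MPI_Alltoall (buffer layout and chunk size are illustrative):

```c
/* Each of the 512 midplane nodes sends an equal chunk to every other
   node, so that afterwards every node holds its own subset of channels
   from all stations. */
#include <mpi.h>

#define NNODES 512
#define CHUNK  128   /* bytes per destination node; illustrative */

void redistribute(const char *sendbuf, char *recvbuf, MPI_Comm comm)
{
    /* sendbuf: NNODES * CHUNK bytes, chunk k destined for rank k;
       recvbuf: NNODES * CHUNK bytes, chunk k received from rank k.
       (cast keeps pre-MPI-3 prototypes happy) */
    MPI_Alltoall((void *)sendbuf, CHUNK, MPI_BYTE,
                 recvbuf, CHUNK, MPI_BYTE, comm);
}
```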
LOIS Example: 15 antennae with IBM JS20
The prototype LOIS will have 10-15 antennae and 50,000 frequency channels
Will perform calculations on an IBM JS20 Blade Center:
- 28 blades, 56 CPUs @ 1.6 GHz, 2 FPUs/CPU, 2 load/stores per CPU
- each FPU does a complex product (2 loads, 4 multiplies, 2 adds, 2 stores) in 4 clock cycles
- each CPU does a complex product in 2 clock cycles, each blade in 1 cycle
- 28 blades do complex products at a peak rate of 44.8 Giga c.p./s
- the required rate is 15*14/2 × 50,000/ms = 5.3 G c.p./s per EM component, × 9 for 3x3 matrix products of electric vectors
- the IO rate from the antennae is 15 Gbps; the total capacity of the JS20 is 16 Gbps
- with inefficiencies in computation and IO, the initial configuration is ~1/2 the full requirement (see the check below)
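Checking the JS20 budget above (my arithmetic):

```c
/* Peak vs. required complex-product rate for the 15-antenna LOIS prototype. */
#include <stdio.h>

int main(void)
{
    double peak = 56 * 1.6e9 / 2;   /* 56 CPUs, 2 cycles per c.p.   */
    double need = 15.0 * 14 / 2     /* antenna pairs                */
                * 50000.0 * 1000    /* channels every ms            */
                * 9;                /* 3x3 electric-vector products */
    printf("peak: %.1f G c.p./s\n", peak / 1e9);  /* 44.8 */
    printf("need: %.1f G c.p./s\n", need / 1e9);  /* ~47.3 */
    return 0;
}
```

At the 50%-90% streaming efficiencies quoted earlier, the achievable rate is roughly half the ~47 G c.p./s requirement, as the slide states.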
Some External Applications on BG/L
Enzo (San Diego), Flash (ANL), CTH (Sandia), MM5, Amber7, GAMESS, QMC (Caltech), LJ (Caltech), PolyCrystal (Caltech), PMEMD (LBL), Miranda (LLNL), SEAM (NCAR), QCD (IBM), SAGE (LANL), SPPM (LLNL), UMT2K (LLNL), Sweep3d (LANL), MDCASK (LLNL), GP (LLNL), CPMD (IBM), TLBE (LBL), HPCMW (RIST), DD3d (LLNL), SPHOT (LLNL), LSMS (ORNL), NIWS (NISSEI)
June 2004 - Robert Walkup: [email protected]; Gyan Bhanot: [email protected]
Limitations to Applications
0.5 GB (card)/node, 4 MB DRAM (chip)/node, 32 kB/CPU L1 cache
- cannot pass all data to each node; no shared memory
- quad-word loads required for streaming float operations; data must be 16-byte aligned
- MPI is the most effective way to program
UNIX only on the IO nodes, not on the compute nodes:
- no UNIX IPC, mmap, forks, sockets, OpenMP, shell creation, thread creation, UNIX system calls, etc. on a compute node
Existing MPI codes port easily, but may need changes to use double hummer
Off-the-shelf MPI code runs ~5x slower per processor on BG/L than on a 1.7 GHz Power4:
- 1.7 GHz × 4 Flops/cycle (2 FPUs/CPU PowerPC) = 6.8 GFlops peak/CPU
- vs. 0.7 GHz × 2 Flops/cycle (co-processor mode, without double hummer) = 1.4 GFlops/CPU
But low power: 64 W/CPU for JS20 Power 4 blades, 12 W/CPU for BG/L
Main BlueGene advantage is packaging: high CPU density and connectivity
Summary
BlueGene/L can handle LOFAR station data rates and c.p. rates
- configuration can be changed with software commands
- handles data with high bit counts
- also usable as a general-purpose computer (~20 Tf sustained)
- good for Dutch infrastructure, an attraction for industry partners, science, ...
BlueGene/L innovations:
- system-on-a-chip design (2 processors, all networks, memory)
- 4 independent networks: tree, torus, barrier, control
- variable torus size (controlled by software using link chips)
- moderate clock speed (700 MHz): good for RAM reads and for low power consumption (25 kW/rack)
- LINUX kernel on IO nodes