The BlueGene Supercomputer and LOFAR/LOIS
Bruce Elmegreen
IBM Watson Research Center
914 945 2448
[email protected]
LOFAR = Low Frequency Array
LOIS = LOFAR Outrigger in Sweden
BlueGene/L at LLNL
Current Radio Telescope Design
Parabolic receivers: Onsala Space Observatory
Connected element interferometers: Westerbork, Netherlands
Very Long Baseline Interferometers: the EVN (European VLBI Network)
LOFAR and LOIS: A Radical New Design (Radio Telescopes as Sensor Networks)
Tens of thousands of simple, fixed antennae
Sample antenna for LOIS
Sample design for LOFAR
The LOFAR/LOIS way to focus: electronically add signals with time delays chosen to reinforce the desired direction.
Points to the sky electronically (sketched below).
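A minimal delay-and-sum sketch of this focusing idea (function names, sample layout, and padding convention are illustrative assumptions, not the LOFAR pipeline):

```c
/* Delay-and-sum beamforming sketch (illustrative, not the LOFAR code).
   Each antenna's digitized stream is shifted by the geometric delay for
   the desired sky direction, then all streams are summed so that signals
   from that direction add coherently while others average out. */
#include <stddef.h>

void beamform(const float *const *samples, /* samples[a][t]: antenna a's stream */
              const int *delay,            /* per-antenna delay in samples      */
              size_t nant, size_t nsamp,
              float *beam)                 /* output: one "pointed" beam        */
{
    for (size_t t = 0; t < nsamp; t++) {
        float sum = 0.0f;
        for (size_t a = 0; a < nant; a++)
            sum += samples[a][t + delay[a]]; /* caller pads each stream */
        beam[t] = sum;
    }
}
```

Observing several directions at once (a later slide) simply means keeping several such sums, each with its own delay table.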
Current way to focus
76 m Lovell telescope at Jodrell Bank, near Manchester, UK
Advantages of the Sensor Network Approach
Antennae are cheap (no precision surfaces needed)
No moving parts (no expensive mounts, no pointing problems)
Low profile (no radome needed, no wind/weather problems)
Can change pointing directions quickly
Can observe multiple directions at the same time (i.e., with several sums)
Can extend for large distances by placing sensors anywhere
Provides a country-wide optical fiber grid for general use
Disadvantages of the Sensor Network Approach
Enormous data flows (parabolic focus replaced by digital sum)
Enormous central processors needed to handle data
Enormous Data Flows from Antenna Stations
LOFAR will have 46 remote stations and 64 stations in the central core
Each remote station transmits:
- 32,000 channels/ms in one beam, or 8 beams with 4,000 channels each
- 8+8 bit (or 16+16 bit) complex data
- 2 polarizations
- 1-2 Gbps from each station (checked in the sketch below)
Each central core station transmits the same data rate in several independent sky directions (for the Epoch of Reionization experiment)
110 - 320 Gbps input rates to central processor
Sample LOFAR station array, and antenna array for each station
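A quick check of the quoted per-station rate from the numbers above (my arithmetic, not from the talk):

```c
/* Per-station output rate: 32,000 complex channels per millisecond,
   2 polarizations, samples of 2 bytes (8+8 bit) or 4 bytes (16+16 bit). */
#include <stdio.h>

int main(void)
{
    double samples_per_s = 32000.0 * 1000 * 2;          /* channels/ms * pols */
    double gbps_8_8   = samples_per_s * 2 * 8 / 1e9;    /* 2-byte samples */
    double gbps_16_16 = samples_per_s * 4 * 8 / 1e9;    /* 4-byte samples */
    printf("%.1f - %.1f Gbps per station\n", gbps_8_8, gbps_16_16); /* 1.0 - 2.0 */
    return 0;
}
```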
Enormous Central Processors Needed
To make a single, highly-sensitive sky beam, the central processor does a weighted sum of all the station data (faint source detection, pulsars, ...)
To make a high-resolution map with a field of view equal to that of a single station, the central processor cross-correlates all station data.
Cross correlation of 110 LOFAR station inputs requires:
- 110*109/2 station pairs and 4 polarization product pairs
- for each of the 32,000 frequency channels, every millisecond
- = 0.8 Tera (10^12) complex products/second
For Epoch of Reionization maps using the LOFAR central core:
- 64*63/2 pairs × 4 pol. products × 32,000 ch./ms × 5 beams = 1.3 Tera complex products/second
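The same kind of check for the two central-processor rates above (my arithmetic):

```c
/* Complex-product rates for the full 110-station correlation and for the
   64-station central-core experiment with 5 beams. */
#include <stdio.h>

int main(void)
{
    double ch_per_s = 32000.0 * 1000;                 /* channels every ms  */
    double full = 110.0 * 109 / 2 * 4 * ch_per_s;     /* pairs * pol. prods */
    double core = 64.0 * 63 / 2 * 4 * ch_per_s * 5;   /* 5 beams            */
    printf("110 stations: %.1f Tera c.p./s\n", full / 1e12); /* ~0.8 */
    printf("central core: %.1f Tera c.p./s\n", core / 1e12); /* ~1.3 */
    return 0;
}
```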
LOFAR and LOIS: A Radical New Frequency
Will observe at 10-250 MHz, which is good because:
- neutral hydrogen is cosmologically redshifted (z~10)
- pulsars are bright
- galactic cosmic rays radiate
- radar echoes off the solar wind
- ...
but is bad because of:
- terrestrial emission from FM radio, TV, aircraft, satellites, ...
- ionospheric variations from solar irradiation and wind irregularities
Need streaming noise mitigation for each antenna, ionospheric phase corrections for each antenna station, high bit counts for data sampling, and a flexible programming environment (LINUX). One simple form of noise mitigation is sketched below.
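As one illustration of what streaming noise mitigation can mean, a simple threshold-based RFI flagger (a sketch under assumed names; the actual LOFAR mitigation is more elaborate):

```c
/* Illustrative threshold-based RFI flagging (not the LOFAR algorithm):
   flag any frequency channel whose power exceeds k times the mean power
   across the band, and zero it before summation. */
#include <stddef.h>

void flag_rfi(float *power, size_t nchan, float k)
{
    double mean = 0.0;                     /* crude band-average baseline */
    for (size_t c = 0; c < nchan; c++)
        mean += power[c];
    mean /= (double)nchan;

    for (size_t c = 0; c < nchan; c++)
        if (power[c] > k * mean)
            power[c] = 0.0f;               /* excise narrow-band interference */
}
```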
Black hole+jet of 3C236, the largest known radio galaxy (Schilizzi et al. 2001)
Quasar (z=3.2) lensed by foreground galaxy (Koopmans et al. 2002)
BlueGene/L for LOFAR
These requirements of LOFAR can be matched by BlueGene/L, a massively parallel computer designed by IBM in collaboration with the Lawrence Livermore National Laboratory.
BlueGene/L will also provide a top-ranked supercomputer facility in the Netherlands, useful for a wide range of problems.
BlueGene/L: History
- November 2001: partnership with LLNL
- November 2002: acquisition of BG/L by LLNL announced
- November 2003: 1/2-rack achieves 1.4 TFlop/s on Linpack (#73 on TOP500)
- February 2004: IBM and ASTRON announce joint research into high-data-volume supercomputing using BG/L
- May 2004: Argonne National Lab announces plan to get BG/L
- June 2004: 4-rack prototype achieves 11.68 TFlop/s on Linpack
- July 2004: AIST (Japan) announces use of BG/L for drug design
- September 2004: 8 racks achieve 36 TFlop/s on Linpack, surpassing NEC's Earth Simulator to become the world's fastest computer
NEC Earth Simulator vs 64-rack BG/L at LLNL
Earth Simulator: 40 Tf/s, 10 TB memory, 640 nodes, 500 MHz, 5 MW power, 34,000 sq ft floor space, $350M cost
64-rack BG/L: 360 Tf/s, 32 TB memory, 65,536 nodes, 700 MHz, 1.5 MW power, 2,500 sq ft floor space, $100M cost
64 racks of BG/L will have much more for much less.
BlueGene/L: "System on a Chip"
- 2 CPUs (4 FPUs) + memory + connection controls on each chip
- IBM CU-11, 0.13 µm process
- 11 x 11 mm die size, 25 x 32 mm CBGA package
- 474 pins, 328 signal
- 1.5/2.5 Volt
- 0.5 GB RAM/node with ECC
Dual Node Compute Card
- 2 chips per card, with heatsinks
- 9 x 256 Mb DRAM; 16-byte interface
- heatsinks designed for 15 W (measuring ~13 W @ 1.6 V)
- 54 mm (2.125") x 206 mm (8.125"), 14 layers
- Metral 4000 connector
16 compute cards per tray + 4 IO nodes
Four Gbps Ethernet connectors serve the IO nodes; the rest are processor nodes
512 Way BG/L Prototype
midplane = half rack = full torus
16 trays per half-rack = 512 nodes + 64 IO nodes
Cables for the torus in 3 dimensions
6 ASTRON racks have ~1 km of cables, each 1/2" thick
2 half-racks per rack, plus cables, power, cooling, etc.
8 racks, 36 Tf: #1 in the world
Rochester, MN, September 2004
Hardware Summary
- Two CPU cores per chip (used independently or as communication/computation)
- Each CPU can do 4 floating point ops (2 multiply/adds) per clock cycle
  - "double hummer" with quad-word loads
  - a double-precision complex product takes 2 clock cycles/CPU, 1 cycle/chip
- Frequency is 700 MHz
  - peak performance is 5.6 GFlops per chip, or 700 M complex products/s (using both CPUs)
- On-chip memory: L1 (32 kB/CPU), L2 (prefetch), L3 (4 MB/chip)
- 2 chips/card, 16 cards/tray, 16 trays/midplane, 2 midplanes/rack
  - 1024 chips/rack = 720 Giga complex products/second (peak, using both CPUs/chip)
  - tests with streaming data achieve 50%-90% efficiency for complex products
  - peak rate for LOFAR (6 racks) = 4.3 Tera complex products/s
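For reference, the "complex product" counted in all of these rates, written as plain portable C (a generic sketch, not BG/L double-hummer intrinsics):

```c
/* The complex multiply-accumulate behind the "complex products/second"
   figures. It is 4 fused multiply-adds: 2 cycles per CPU at 2
   multiply-adds/cycle, 1 cycle per 2-CPU chip, hence 700 M c.p./s
   per chip at 700 MHz. */
typedef struct { double re, im; } cplx;

static inline void cmac(cplx *acc, cplx a, cplx b)
{
    acc->re += a.re * b.re - a.im * b.im;
    acc->im += a.re * b.im + a.im * b.re;
}
```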
Chips are connected by a 3-dimensional torus network
- a torus has independent connections in 3 dimensions
- the torus bandwidth is 1.4 Gbps each way in all 3 dimensions simultaneously
- wiring actually jumps over physical neighbors to prevent large timing mismatches between distant edges
IO Network
- 4 IO connections/tray, 128/rack, each 1 Gbps bi-directional
- peak IO for LOFAR (6 racks) = 768 Gbps
- each IO feed goes to an IO node, which is connected to 8 compute nodes by a hierarchical tree that pipes data at 2.8 Gbps bi-directional
- the tree also does global communication and combine/broadcast for all nodes
Two other internal networks
Barrier network to all nodes: allows programs to stay in synch
Control network to all nodes: boot, monitor, partition
- partitions are set up in software using "link chips"
- the smallest torus partition is a midplane (512 nodes, half rack)
- may partition the system into tori as midplanes, racks, or multiple racks
- or may partition into smaller units (32-node trays) which connect nodes as a mesh
- each partition runs a different job
LOFAR EXAMPLE: Cross Correlation of Station Beams
64 virtual core stations and 46 remote stations (110 total) transmit data at ~220 Gbps
32,000 frequency channels are divided into 8 parts, each a separate sky beam
Each beam of data goes to a separate midplane:
- 220/8 = 27.5 Gbps input per midplane, with 64 Gbps of Ethernet IO available
Midplane processors must do:
- 110*109/2 station products × 4000 channels/ms × 4 polarization products
- = 96 Giga complex products/second, compared to a peak of 360 Giga complex products/s in virtual node mode (the correlation step is sketched below)
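A minimal sketch of the per-channel correlation step these numbers count (array names, data layout, and the use of C99 complex types are assumptions):

```c
/* For one frequency channel: accumulate all station-pair visibilities,
   4 polarization products per pair (XX*, XY*, YX*, YY*). */
#include <complex.h>

#define NSTATIONS 110

void correlate_channel(const double complex v[NSTATIONS][2],      /* [station][pol] */
                       double complex acc[NSTATIONS][NSTATIONS][4])
{
    for (int i = 0; i < NSTATIONS; i++)
        for (int j = i + 1; j < NSTATIONS; j++)   /* 110*109/2 pairs */
            for (int p = 0; p < 2; p++)
                for (int q = 0; q < 2; q++)       /* 4 pol. products */
                    acc[i][j][2 * p + q] += v[i][p] * conj(v[j][q]);
}
```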
LOFAR EXAMPLE: Internal Data Flow
One beam of data per midplane means:
- 4000 channels, 2 polarizations, 1000 ms time integration, 4-byte data (16+16)
- = 62.5 kBytes of data into each of the 512 nodes in a midplane each second
- (NOTE: 4 MBy of storage is available in the L3 cache per node)
MPI_Alltoall redistributes the data so each node has some of the channels in both polarizations from all telescopes (see the sketch below):
- a midplane has 8x8x8 nodes; the longest hop on the torus is 4 nodes on each axis, the average is 2 nodes per axis (x, y, z)
- hopping speed = 1.4 Gbps = 175 MBps each direction, 350 MBps each axis
- all-to-all time is 62.5 kB × 2 hops / 350 MBps = 3.6E-4 seconds
- (NOTE: plenty of time for computations; virtual node mode is OK)
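A sketch of that redistribution with MPI_Alltoall (buffer layout and chunk size are illustrative):

```c
/* Each of the 512 midplane nodes sends an equal chunk to every other
   node, so that afterwards every node holds its own subset of channels
   from all stations. */
#include <mpi.h>

#define NNODES 512
#define CHUNK  128   /* bytes per destination node; illustrative */

void redistribute(const char *sendbuf, char *recvbuf, MPI_Comm comm)
{
    /* sendbuf: NNODES * CHUNK bytes, chunk k destined for rank k;
       recvbuf: NNODES * CHUNK bytes, chunk k received from rank k.
       (cast keeps pre-MPI-3 prototypes happy) */
    MPI_Alltoall((void *)sendbuf, CHUNK, MPI_BYTE,
                 recvbuf, CHUNK, MPI_BYTE, comm);
}
```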
LOIS Example: 15 antennae with IBM JS20
The prototype LOIS will have 10-15 antennae and 50,000 frequency channels
Will perform calculations on an IBM JS20 Blade Center:
- 28 blades, 56 CPUs @ 1.6 GHz, 2 FPUs/CPU, 2 load/stores per CPU
- each FPU does a complex product (2 loads, 4 multiplies, 2 adds, 2 stores) in 4 clock cycles
- each CPU does a complex product in 2 clock cycles, each blade in 1 cycle
- 28 blades do complex products at a peak rate of 44.8 Giga c.p./s
- the required rate is 15*14/2 × 50,000/ms = 5.3 G c.p./s per EM component, × 9 for 3x3 matrix products of electric vectors
- the IO rate from the antennae is 15 Gbps; the total capacity of the JS20 is 16 Gbps
- with inefficiencies in computation and IO, the initial configuration is ~1/2 the full requirement (see the check below)
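Checking the JS20 budget above (my arithmetic):

```c
/* Peak vs. required complex-product rate for the 15-antenna LOIS prototype. */
#include <stdio.h>

int main(void)
{
    double peak = 56 * 1.6e9 / 2;   /* 56 CPUs, 2 cycles per c.p.   */
    double need = 15.0 * 14 / 2     /* antenna pairs                */
                * 50000.0 * 1000    /* channels every ms            */
                * 9;                /* 3x3 electric-vector products */
    printf("peak: %.1f G c.p./s\n", peak / 1e9);  /* 44.8 */
    printf("need: %.1f G c.p./s\n", need / 1e9);  /* ~47.3 */
    return 0;
}
```

At the 50%-90% streaming efficiencies quoted earlier, the achievable rate is roughly half the ~47 G c.p./s requirement, as the slide states.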
Some External Applications on BG/L
Enzo (San Diego), Flash (ANL), CTH (Sandia), MM5, Amber7, GAMESS, QMC (Caltech), LJ (Caltech), PolyCrystal (Caltech), PMEMD (LBL), Miranda (LLNL), SEAM (NCAR), QCD (IBM), SAGE (LANL), SPPM (LLNL), UMT2K (LLNL), Sweep3d (LANL), MDCASK (LLNL), GP (LLNL), CPMD (IBM), TLBE (LBL), HPCMW (RIST), DD3d (LLNL), SPHOT (LLNL), LSMS (ORNL), NIWS (NISSEI)
June 2004 - Robert Walkup: [email protected]; Gyan Bhanot: [email protected]
Limitations to Applications
0.5 GB (card)/node, 4 MB DRAM (chip)/node, 32 kB/CPU L1 cache
- cannot pass all data to each node; no shared memory
- quad-word loads required for streaming float operations; data must be 16-byte aligned
- MPI is the most effective way to program
UNIX only on the IO nodes, not on the compute nodes:
- no UNIX IPC, mmap, forks, sockets, OpenMP, shell creation, thread creation, UNIX system calls, etc. on a compute node
Existing MPI codes port easily, but may need changes to use double hummer
Off-the-shelf MPI code runs ~5x slower per processor on BG/L than on a 1.7 GHz Power4:
- 1.7 GHz × 4 Flops/cycle (2 FPUs/CPU PowerPC) = 6.8 GFlops peak/CPU
- vs. 0.7 GHz × 2 Flops/cycle (co-processor mode, without double hummer) = 1.4 GFlops/CPU
But low power: 64 W/CPU for JS20 Power 4 blades, 12 W/CPU for BG/L
Main BlueGene advantage is packaging: high CPU density and connectivity
Summary
BlueGene/L can handle LOFAR station data rates and c.p. rates
- configuration can be changed with software commands
- handles data with high bit counts
- also usable as a general-purpose computer (~20 Tf sustained)
- good for Dutch infrastructure, an attraction for industry partners, science, ...
BlueGene/L innovations:
- system-on-a-chip design (2 processors, all networks, memory)
- 4 independent networks: tree, torus, barrier, control
- variable torus size (controlled by software using link chips)
- moderate clock speed (700 MHz): good for RAM reads and for low power consumption (25 kW/rack)
- LINUX kernel on IO nodes