Software Correlator Concept Description

Dominic Ford, University of Cambridge
Andrew Faulkner, Jongsoo Kim, Paul Alexander

14th April 2011


Interferometer Block Diagram

[Figure: interferometer signal chain. Antennas → analogue gain/filter → digitisation → coarse channelisation → TX → time-domain FFT → cross correlation → time integration → spatial FFT and imaging. The cross correlation and time integration stages are annotated "High data rate. Low complexity."; the spatial FFT and imaging stage is annotated "Lower sample rate. High complexity."]

Interferometer Block Diagram

[Figure: FX correlator data flow. Antenna backend: analogue gain/filter, digitisation, coarse channelisation, TX. F-step: the input data from each antenna (Antenna 1, Antenna 2, Antenna 3, ..., Antenna i) pass through an FFT/PPF. X-step: each frequency channel (Channel 1, Channel 2, Channel 3, ..., Channel j) passes through a CMAC to give the output data.]

Na Np Nbeam parallel FFT/PPFs

Nf Np Nbeam parallel correlations
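The F-step/X-step split in the diagram can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: a naive DFT stands in for the FFT/PPF, polarisations and beams are not modelled, and the names `f_step`, `x_step` and `correlate`, as well as the test tone, are assumptions made for the example.

```python
import cmath

def f_step(samples, n_chan):
    """F-step: channelise one antenna's time series (naive DFT as a stand-in for FFT/PPF).

    Returns one spectrum (list of n_chan complex values) per block of n_chan samples.
    """
    spectra = []
    for start in range(0, len(samples) - n_chan + 1, n_chan):
        block = samples[start:start + n_chan]
        spectra.append([
            sum(block[t] * cmath.exp(-2j * cmath.pi * k * t / n_chan)
                for t in range(n_chan))
            for k in range(n_chan)
        ])
    return spectra

def x_step(spectra_by_antenna, n_chan):
    """X-step: complex multiply-accumulate (CMAC) per channel for every antenna pair.

    Returns {(i, j): [integrated visibility per channel]} for all pairs with i <= j.
    """
    n_ant = len(spectra_by_antenna)
    vis = {}
    for i in range(n_ant):
        for j in range(i, n_ant):
            acc = [0j] * n_chan
            for spec_i, spec_j in zip(spectra_by_antenna[i], spectra_by_antenna[j]):
                for k in range(n_chan):
                    acc[k] += spec_i[k] * spec_j[k].conjugate()
            vis[(i, j)] = acc
    return vis

def correlate(antenna_samples, n_chan):
    """Run the F-step on every antenna, then the X-step on the channelised data."""
    return x_step([f_step(s, n_chan) for s in antenna_samples], n_chan)

# Toy input: two antennas see the same tone (channel 1 of 8), the second
# with a 0.5 rad phase offset.
tone = [cmath.exp(2j * cmath.pi * t / 8) for t in range(16)]
offset = [s * cmath.exp(0.5j) for s in tone]
vis = correlate([tone, offset], n_chan=8)
print(round(abs(vis[(0, 1)][1]), 6), round(cmath.phase(vis[(0, 1)][1]), 6))
```

In this toy case the cross-correlation of the two antennas is concentrated in channel 1, and its phase recovers the offset between the inputs with opposite sign. The inner loop of `x_step` is the part that a GPU would run in parallel across channels and baselines.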

Correlator architectures

• Hardware correlators
    FPGA-based WIDAR correlators used by the eVLA and eMERLIN.
    Also used until recently by the GMRT.
    Will be used by the full 512-antenna MWA.
    High energy efficiency. Longer development time.

• Software correlators
    Beowulf clusters used by the Australian LBA and VLBA.
    IBM BlueGene used by LOFAR.
    Custom-designed cluster used by GMRT.
    GPGPU system trialled by the 32-antenna prototype MWA.
    Reduced energy efficiency. Huge gain in flexibility.

Advantages of Software Correlators

• Rapid development cycles
    Software correlators use conventional programming languages for which mature development tools and debugging suites already exist.
    Can track advances in technology very closely.

• Reduced NRE
    Pre-existing hardware is used, which is already mass-produced.

• Easy reconfigurability
    A software correlator can be reconfigured post-deployment to use new algorithms, or in response to hardware failure.
    Can use same hardware for other tasks, e.g. beamforming.

Potential processing architectures

• Intel massively-parallel x86 chips
    32-core processor demonstrated in 2010 (Knights Ferry); 50-core processor expected to be demonstrated in June 2011 (Knights Corner).

• Massively-parallel embedded processors
    e.g. Picochip or Tilera.

• Graphics processing units (GPUs)
    Produced by ATI (part of AMD); latest generation of Fusion processors.
    Also produced by NVIDIA; Tesla cards specifically designed for HPC.

NVIDIA Tesla Cards

An NVIDIA Tesla card (Fermi series, 2009) can deliver 1030.4 GFLOP/s peak performance.

Power dissipation 250 W.

Data transfer via PCI-Express bus at 64 Gbit/s.

But the architecture requires highly-parallel code, and real applications typically achieve only around 30% processor utilisation.

NVIDIA Tesla Card Roadmap

[Figure: NVIDIA Tesla card roadmap chart, annotated with performance factors of ×2.7 and ×7.6.]

NVIDIA Tesla Card Roadmap

Projecting these numbers forward to the timeframe of SKA1:

GPGPU card    Expected release year    Performance
Fermi         2009                     1.0×
Kepler        2011                     2.7×
Maxwell       2013                     7.6×
???           2015                     15.2×
???           2017                     30.4×

i.e. GPU cards may be expected to deliver 15-30 TFLOP/s by 2016-2018.

Power dissipation is expected to remain constant at 250 W per card.
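The 15-30 TFLOP/s figure follows from scaling the Fermi peak by the tabulated factors. A small sketch of that arithmetic, assuming the 1030.4 GFLOP/s Fermi peak quoted earlier in the talk as the 1.0× baseline (the names below are illustrative only):

```python
FERMI_PEAK_TFLOPS = 1.0304  # Fermi peak quoted in the talk (1030.4 GFLOP/s)

# Performance factors relative to Fermi, by expected release year, from the table.
ROADMAP_FACTORS = {2009: 1.0, 2011: 2.7, 2013: 7.6, 2015: 15.2, 2017: 30.4}

def projected_peak_tflops(year):
    """Projected peak throughput of the card generation released in `year`."""
    return FERMI_PEAK_TFLOPS * ROADMAP_FACTORS[year]

for year in sorted(ROADMAP_FACTORS):
    print(year, f"{projected_peak_tflops(year):.1f} TFLOP/s")
```

Note that the 2015 and 2017 entries each double the previous generation (15.2 = 2 × 7.6, 30.4 = 2 × 15.2), so the projection beyond Maxwell amounts to a doubling every two years.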

Making Efficient Use of GPUs

GPUs achieve their outstanding performance by reducing the amount of silicon dedicated to flow control.

[Figure: schematic of a flow control block alongside four FPU blocks.]

GPUs need careful programming to ensure that parallel threads follow common execution paths.

In many applications, this means that very low efficiencies (20-30%) are achieved. For radio astronomy imaging, efficiencies as low as a few percent have been reported.

But cross-correlation is simple and high efficiencies can be achieved. Greenhill et al. report achieving 79% efficiency with Tesla Fermi cards. I assume 75% efficiency here.

FLOP per bit ratio

Tesla Fermi cards can deliver 1030.4 GFLOP/s.
But the maximum rate of transfer onto the card is 64 Gbit/s (PCI-Express 2).

To make efficient use of a GPU, we need to perform > 16 floating point operations per bit.

c.f. 0.1 FLOP/bit for a typical CPU in 2011.

GPUs are FLOP-heavy processors.
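The threshold of 16 FLOP/bit is simply the ratio of peak compute rate to input bandwidth. As a quick check, using only the two figures quoted on this slide (the function name is illustrative):

```python
PEAK_GFLOPS = 1030.4  # Tesla Fermi peak, GFLOP/s
PCIE_GBITS = 64.0     # PCI-Express 2 transfer rate onto the card, Gbit/s

def breakeven_flop_per_bit(peak_gflops, link_gbits):
    """Operations that must be done per input bit to keep the card compute-bound."""
    return peak_gflops / link_gbits

threshold = breakeven_flop_per_bit(PEAK_GFLOPS, PCIE_GBITS)
print(f"{threshold:.1f} FLOP/bit")  # prints "16.1 FLOP/bit"
```

Any algorithm that performs fewer operations per input bit than this is limited by the PCI-Express link rather than by the GPU's arithmetic units.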

FLOP per bit ratio

F-step

$g_{\text{F-FLOP}} / g_{\text{F-in}} = \tfrac{3}{2} \log_2 N_f / (2 N_b)$

FFT size    Ratio
1,024       1.875
32,768      2.812
380,000     3.475

In addition, implementations of FFTs on GPUs have not yet achieved high efficiencies.

X-step

$g_{\text{X-FLOP}} / g_{\text{X-in}} = 4 N_p N_B / (N_a N_b)$

This is > 16 FLOP/bit if there are > 16 antennas.
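The two ratios on this slide can be evaluated numerically. The sketch below rests on assumptions that are not stated on the slide: N_b = 4 bits per sample is inferred because it reproduces the tabulated F-step ratios; N_p = 2 polarisations and a baseline count of N_B ≈ N_a²/2 are chosen because they make the X-step ratio approximately equal to the number of antennas, in line with the "> 16 antennas" statement.

```python
import math

def f_step_ratio(n_f, n_b):
    """F-step FLOP per input bit: (3/2) log2(N_f) / (2 N_b)."""
    return 1.5 * math.log2(n_f) / (2 * n_b)

def x_step_ratio(n_a, n_p, n_b):
    """X-step FLOP per input bit: 4 N_p N_B / (N_a N_b).

    N_B is taken as N_a^2 / 2 baselines, which is an assumption of this sketch.
    """
    n_baselines = n_a * n_a / 2
    return 4 * n_p * n_baselines / (n_a * n_b)

N_B_BITS = 4  # inferred from the tabulated ratios, not stated on the slide
N_POL = 2     # assumed dual polarisation

for fft_size in (1024, 32768, 380000):
    print(fft_size, f"{f_step_ratio(fft_size, N_B_BITS):.3f}")

for n_antennas in (16, 50, 250):
    print(n_antennas, x_step_ratio(n_antennas, N_POL, N_B_BITS))
```

Under these assumptions the F-step stays at a few FLOP/bit for any realistic FFT size, well below the 16 FLOP/bit needed to keep a Fermi card busy, while the X-step ratio grows linearly with the number of antennas.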

Software Correlator for SKA1 Mid

[Figure: system block diagram.
Antenna backend: analogue gain/filter, digitisation, coarse channelisation, TX. Na = 250 inputs.
Fibre link to central processing bunker.
F-step (central processing bunker): bulk delay and fine channelisation. NFPGA = 125 subsystems.
Switch: Nswitch = 16 switches.
X-step (central processing bunker): NGPGPU = 272 NVIDIA GPGPU cards (Maxwell series, available 2013).
Data readout. Control and monitoring.]

Software Correlator for SKA1 Low

[Figure: system block diagram.
Antenna backend (station): analogue gain/filter, digitisation, coarse channelisation, station beamforming, TX.
Fibre link to central processing bunker.
F-step (central processing bunker): bulk delay and fine channelisation. Na = 50 inputs; six into each F-step subsystem per beam. NFPGA = 9 subsystems.
Switch: a single switch collates data from the nine F-step subsystems.
X-step: NGPGPU = 3 NVIDIA GPGPU cards (Maxwell series, available 2013); shown for Beam 1, Beam 2, ..., Beam i.
Data readout.]

A total of 408 GPGPU cards are required to form 160 beams.

Cost estimate (160 AA beams)

F-step: 0.17 PFLOP/s (AA), 0.002 PFLOP/s (Dishes)

Data flow: 49 Tbit/s (AA), 6.2 Tbit/s (Dishes)

X-step: 2.5 PFLOP/s (AA), 1.6 PFLOP/s (Dishes)

Cost estimate (160 AA beams)

F-step: 0.17 PFLOP/s (AA), 0.002 PFLOP/s (Dishes)
    Using ROACH boards in 2011: 1,565 boards, €7.8m

Data flow: 49 Tbit/s (AA), 6.2 Tbit/s (Dishes)
    To buy using InfiniBand in 2011: €0.304m (Dishes), €0.104m (AA)

X-step: 2.5 PFLOP/s (AA), 1.6 PFLOP/s (Dishes)
    240 Tesla cards (2017)
    Cost: €0.6m
    Power: 60 kW

Total < €9m
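The total and the power figure can be checked from the line items. The sketch below takes the Tesla cards as the X-step hardware and the ROACH boards as the F-step hardware, following the FPGA/GPGPU split of the SKA1 Mid design, and assumes the 250 W per-card dissipation quoted earlier; the variable names are illustrative.

```python
CARD_POWER_W = 250  # per-card dissipation quoted earlier in the talk

def cards_power_kw(n_cards, watts_per_card=CARD_POWER_W):
    """Total power drawn by n_cards GPU cards, in kW."""
    return n_cards * watts_per_card / 1000

# Line items as quoted on the slide, in millions of euros.
line_items_meur = [
    ("X-step: 240 Tesla cards (2017)", 0.6),
    ("F-step: 1,565 ROACH boards (2011)", 7.8),
    ("Data flow: InfiniBand (2011), first entry", 0.304),
    ("Data flow: InfiniBand (2011), second entry", 0.104),
]

total_meur = sum(cost for _, cost in line_items_meur)
cost_per_card_eur = 0.6e6 / 240     # implied unit cost of a Tesla card
cost_per_board_eur = 7.8e6 / 1565   # implied unit cost of a ROACH board

print(f"X-step power: {cards_power_kw(240):.0f} kW")
print(f"Total hardware cost: EUR {total_meur:.3f}m")
print(f"Implied unit costs: EUR {cost_per_card_eur:.0f} per card, "
      f"EUR {cost_per_board_eur:.0f} per board")
```

The F-step boards dominate the total, which is why the later slides consider moving the F-step onto GPU cards as well.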

Cost estimate (480 AA beams)

F-step: 0.51 PFLOP/s (AA), 0.002 PFLOP/s (Dishes)
    Using ROACH boards in 2011: 4,445 boards, €22m

Data flow: 146 Tbit/s (AA), 6.2 Tbit/s (Dishes)
    To buy using InfiniBand in 2011: €0.912m (Dishes), €0.104m (AA)

X-step: 7.4 PFLOP/s (AA), 1.6 PFLOP/s (Dishes)
    560 Tesla cards (2017)
    Cost: €1.4m
    Power: 140 kW

Alternative Correlator for SKA1 Low

[Figure: system block diagram.
Antenna backend (station): analogue gain/filter, digitisation, coarse channelisation, station beamforming, TX.
Fibre link to central processing bunker.
F-step (central processing bunker): bulk delay and fine channelisation. Na = 50 inputs; six into each F-step subsystem per beam. NGPGPU = 480 GPGPU cards.
Switch: a single switch collates data from the nine F-step subsystems.
X-step: NGPGPU = 3 NVIDIA GPGPU cards (Maxwell series, available 2013); shown for Beam 1, Beam 2, ..., Beam i.
Data readout.]

A total of 408 GPGPU cards are required to form 160 beams.

Cost estimate (480 AA beams)

F-step: 0.51 PFLOP/s (AA), 0.002 PFLOP/s (Dishes)
    560 Tesla cards (2017)
    Cost: €1.4m
    Power: 140 kW
    (assuming 5% efficiency)

Data flow: 146 Tbit/s (AA), 6.2 Tbit/s (Dishes)
    To buy using InfiniBand in 2011: €0.912m (Dishes), €0.104m (AA)

X-step: 7.4 PFLOP/s (AA), 1.6 PFLOP/s (Dishes)
    560 Tesla cards (2017)
    Cost: €1.4m
    Power: 140 kW
    (assuming 75% efficiency)

Estimated NRE: 5-10 man-years.
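As a consistency check on these figures, the sketch below reads the slide as costing both steps with 560 Tesla cards each, the X-step at 75% efficiency and the F-step at 5% efficiency. That reading is an interpretation of the slide layout, and the function and variable names are illustrative.

```python
def implied_peak_tflops(load_pflops, n_cards, efficiency):
    """Peak TFLOP/s each card must offer to carry the load at the given efficiency."""
    return load_pflops * 1000 / n_cards / efficiency

x_load_pflops = 7.4 + 1.6     # X-step, AA plus dishes
f_load_pflops = 0.51 + 0.002  # F-step, AA plus dishes

x_peak = implied_peak_tflops(x_load_pflops, 560, 0.75)
f_peak = implied_peak_tflops(f_load_pflops, 560, 0.05)

# Hardware cost in millions of euros: two sets of cards plus both data-flow entries.
total_cost_meur = 1.4 + 1.4 + 0.912 + 0.104
# Power of both sets of cards at 250 W per card, in kW.
total_power_kw = 2 * 560 * 250 / 1000

print(f"Implied peak per card: X-step {x_peak:.1f} TFLOP/s, F-step {f_peak:.1f} TFLOP/s")
print(f"Total hardware cost: EUR {total_cost_meur:.3f}m, power {total_power_kw:.0f} kW")
```

Under this reading the implied peak per card (roughly 18-21 TFLOP/s) falls inside the 15-30 TFLOP/s range projected for 2016-2018, and the totals are in line with the "< €4m" and "around 300 kW" figures given in the conclusions.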

Conclusions and Future Work

• Software running on commodity processors provides a highly flexible way of implementing a correlator.

• The development cost is low as commodity components are used.

• In 2017, we envisage the hardware for a correlator for SKA1 would cost < €4m, and dissipate around 300 kW.

• Since each GPU card correlates a fraction of the total bandwidth of SKA1, the correlator can be deployed in phases.

• Hardware can potentially be shared with other applications, e.g. the beamformer or UV processor.

• There is significant existing expertise within the SKA community, including at ASTRON (Romein et al.), Harvard/MWA (Greenhill et al.) and Cambridge (Ford et al.).

The Way Forward

• We plan to design and build a demonstrator system, which will use a small number of Tesla cards to correlate a fraction of the bandwidth of SKA1.

• By evaluating its performance, we will verify (and hope to better) the efficiencies assumed here. We will also demonstrate scalability.

• We envisage that this is achievable within a year.

We need a work package post-CoDR dedicated to investigating software correlation.

We need a decision timescale for deployment. Should this work package be part of DSP or S&C?