Pulsar Acceleration Searches on the GPU for the Square Kilometre Array

Sofia Dimoudi, Wes Armour and Karel Adamek

07/04/2016

Page 2:

Overview

§ Introduction
– The Square Kilometre Array (SKA)
– The Astro-Accelerate library
– Binary Pulsars

§ The Problem
– Effects of orbital motion
– Acceleration Search methods
– Fourier Domain Acceleration Search (FDAS)
– Application parameters

§ FDAS on GPU
– Using cuFFT
– Using a custom FFT
– Performance comparisons

§ Conclusions

Page 3:

Introduction

Page 4:

World’s largest radio-telescope

§ collecting area > 1 km²

§ up to 2000 dishes at the final stage (~ 200 at the 1st phase)

§ fast surveys and high sensitivity - 2000 ‘beams’ in the field of view

The Square Kilometre Array

Image credit: SKA 2015

Page 5:

Science Objectives

§ galaxy evolution and Dark Energy
§ cosmic magnetism
§ first objects in the universe
§ exo-planet formation and life signatures

§ tests of general relativity and gravitational waves: pulsar searches

The Square Kilometre Array

Image credit: SKA 2015

Page 6:

§ A library of many-core accelerated real-time data processing modules

§ Supporting multiple hardware architectures
– CPU, GPU, FPGA, Xeon Phi
– Hardware agnostic

§ Providing automatic optimisation for a given system configuration

The Astro-Accelerate library

Page 7:

What are pulsars?

Binary pulsars

Page 10:

What are pulsars?

§ Highly magnetised rotating neutron stars

§ Emit radio beam along magnetic axis

§ “cosmic lighthouse” effect

Binary pulsars

Image copyright ©Addison Wesley

Image Credit: Manchester, R.N. and Taylor, J.H. , Pulsars, Freeman, 1977.

very precise period - cosmic “clocks”

Observed radiation is a pulse

Page 11:

§ Pulsars that orbit a common centre of mass with another object

§ Important tools enabling:

– general relativity investigations and alternative theories of gravity

– study of physics of dense matter

– gravitational wave detection

Image credit: M. Kramer, MPIfR

Binary Pulsars

Page 12:

The Problem

Page 13:

Effect of orbital motion in binaries

§ Doppler effect: observed signal frequency drifts as the pulsar accelerates in orbit

Effects of orbital motion

Image credit: Ralph Eatough

[Figure: cartoon of a pulsar in a circular orbit, with the normalised amplitude of the line-of-sight (l.o.s.) velocity, acceleration and jerk plotted against orbital phase (0° to 360°); regions are marked "pulse period increasing" and "pulse period decreasing".]

Fig. 2.6: Cartoon of the orbital motion of a pulsar (solid circle) in a circular orbit around the common centre of mass of a binary system. A stationary observer viewing the system edge on will see a changing line of sight velocity. The magnitude of the orbital velocity, acceleration, and jerk are mapped onto the phase diagram to the right. The red line indicates the motion of the pulsar over the duration of an observation from orbital phase 0° to 180°.

... moving directly along the line of sight from the observer with a maximum velocity but with no acceleration, i.e. the velocity is not changing. From Equation 2.8 we would expect a stable pulse frequency at this phase (neglecting higher order terms). The opposite is true at phases 90° and 270°. Here the pulsar experiences a maximal radial acceleration, and from Equation 2.8 a maximum in the amount of pulse frequency smearing. The contribution to pulse smearing from jerk effects follows the opposite path. Around 0° and 180° the acceleration is small but is changing most rapidly, hence maximum jerk. At phases 90° and 270° the acceleration has reached a peak and is approximately constant, i.e. the effects of jerk are small.

In reality the exact form of v(t) depends on the orbital parameters of the system and can be calculated from Kepler’s laws, see for example, Lorimer & Kramer (2005). In this case line of sight velocity is given by

$v(A_T) = \Omega_b \frac{a_p \sin i}{\sqrt{1 - e^2}} \left[\cos(\omega + A_T) + e \cos\omega\right]$, (2.14)

Observer sees changing line of sight velocity

Page 15:

Effect of orbital motion in binaries

§ Power “smears” into neighbouring frequency bins

Effects of orbital motion

§ Periodicity search may then be insensitive to signal

Page 16:

§ Binary pulsar orbits are nearly circular, so the line-of-sight acceleration is sinusoidal

§ If observation time < 1/10 orbital period, we can approximate acceleration as constant

Effects of orbital motion

Page 17:

Two main methods to correct:

§ Time domain acceleration search (e.g. Johnston & Kulkarni 1991)
– Resample the time series by changing the time intervals according to the Doppler formula, for a range of constant acceleration values

$\tau(t) = \tau_0 \left(1 + v(t)/c\right) \simeq \tau_0 \left(1 + \alpha t / c\right)$

– Apply a Fourier Transform to each resampled series
• Requires many long real FTs

Acceleration search methods
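The resampling step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `resample_for_acceleration`, the first-order time mapping, the sign convention (positive acceleration means the pulsar is receding) and the deliberately exaggerated value of `a_over_c` are all choices made here, not taken from the slides.

```python
import numpy as np

def resample_for_acceleration(x, dt, a_over_c):
    """Undo a constant line-of-sight acceleration by resampling (first order in v/c).

    x        : time series sampled uniformly every dt seconds
    a_over_c : trial acceleration divided by the speed of light, in 1/s
    """
    n = len(x)
    t = np.arange(n) * dt                 # observer time of each sample
    # Pulsar-frame time for each observed sample: integrating
    # d(tau) = (1 - a t / c) dt gives tau = t - a t^2 / (2 c).
    tau = t - 0.5 * a_over_c * t**2
    # Re-grid onto evenly spaced pulsar-frame times by linear interpolation.
    tau_uniform = np.linspace(tau[0], tau[-1], n)
    return np.interp(tau_uniform, tau, x)

# Demo with an exaggerated acceleration so the drift spans many Fourier bins.
n, dt, f0, a_over_c = 4096, 1e-3, 100.0, 0.03
t = np.arange(n) * dt
x = np.cos(2 * np.pi * f0 * (t - 0.5 * a_over_c * t**2))   # drifting tone

peak_raw = np.abs(np.fft.rfft(x)).max()
peak_fixed = np.abs(np.fft.rfft(resample_for_acceleration(x, dt, a_over_c))).max()
print(f"peak before: {peak_raw:.0f}, after resampling: {peak_fixed:.0f}")
```

Each trial acceleration needs its own resampled series and its own full-length FFT, which is the cost the slide refers to as "many long real FTs".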

Page 18:

Two main methods to correct:

§ Fourier domain acceleration search - correlation technique (Ransom et al. 2002)
– Fourier response found by correlating with a frequency-reversed, complex-conjugate template
– Apply matched filters in the Fourier domain using convolutions
• Many short filters applied to one long series
• Requires only one long FT and many short FTs

Acceleration search methods

$A_{r_0} \simeq \sum_{k = r_0 - m/2}^{r_0 + m/2} A_k \, A^{*}_{r_0 - k}$
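As a rough numerical illustration of the correlation sum, the sketch below builds the template by Fourier transforming a reference chirp rather than from the analytic response used in Ransom et al. (2002). The helper names `chirp`, `template` and `correlate_at`, and the values of n, r0, z and m, are hypothetical choices for this example.

```python
import numpy as np

def chirp(n, r0, z):
    """Complex tone whose frequency drifts linearly by z bins, with mean bin r0."""
    u = np.arange(n) / n                       # time as a fraction of the observation
    # Instantaneous frequency in bins: r0 + z * (u - 1/2)
    return np.exp(2j * np.pi * (r0 * u + 0.5 * z * (u**2 - u)))

def template(n, z, m):
    """Response of a unit chirp at offsets q = -m/2 .. m/2 from its mean bin."""
    ref = np.fft.fft(chirp(n, 0, z)) / n       # response of a chirp centred on bin 0
    q = np.arange(-(m // 2), m // 2 + 1)
    return q, ref[(-q) % n]                    # R(q), where q = r0 - k

def correlate_at(spectrum, r0, z, m):
    """Evaluate sum_k A_k * conj(R(r0 - k)) over the m bins around r0."""
    n = len(spectrum)
    q, resp = template(n, z, m)
    k = (r0 - q) % n
    return np.sum(spectrum[k] * np.conj(resp))

n, r0, z, m = 4096, 1000, 20, 64
spectrum = np.fft.fft(chirp(n, r0, z))

raw_peak = np.abs(spectrum).max() / n          # smeared: well below 1
recovered = np.abs(correlate_at(spectrum, r0, z, m)) / n
print(f"largest raw bin: {raw_peak:.2f}, matched-filter response: {recovered:.2f}")
```

With the correct drift z, the short correlation gathers the power that the acceleration spread over roughly z bins, without re-transforming the long series.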

Page 19:

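§ The number of frequency bins drifted, $z$, relates to the frequency derivative $\dot{f}$ and the acceleration $\alpha$ (with $T$ the observation time):

$\alpha = \frac{\dot{f}}{f} c = \frac{z c}{f T^2}, \qquad z = \dot{f} T^2$

§ the resulting power spectrum forms a uniformly tiled $f - \dot{f}$ plane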

FDAS: Fourier Domain Acceleration search

Frequency derivative
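A small worked example of the relation between drift in bins and acceleration. The function names and the 100 Hz spin frequency are assumptions made for illustration; the 540 s observation time is the figure quoted later in the talk.

```python
C = 299_792_458.0  # speed of light in m/s

def accel_from_z(z, f, t_obs):
    """Acceleration (m/s^2) that drifts a signal at f Hz by z bins in t_obs seconds."""
    return z * C / (f * t_obs**2)

def z_from_accel(alpha, f, t_obs):
    """Inverse relation: bins drifted for a given acceleration."""
    return alpha * f * t_obs**2 / C

# Acceleration step corresponding to one bin of drift for an assumed
# 100 Hz pulsar observed for 540 s.
one_bin_accel = accel_from_z(1.0, 100.0, 540.0)
print(f"{one_bin_accel:.2f} m/s^2 per bin of drift")
```

Because z scales with f T², the same physical acceleration smears a faster pulsar, or a longer observation, over more bins.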

Page 20:

$y(n) = \mathcal{F}^{-1}\left\{ \mathcal{F}\{x(n)\} \cdot \mathcal{F}\{h(n)\} \right\}$

§ The DFT assumes periodicity, so the convolution is cyclic (’wrap-around’ effect)
§ Edge effects are treated by padding

Convolution in the Fourier domain

• Not efficient for very large N with multiple filters

FDAS: Fourier Domain Acceleration search
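The wrap-around effect and its cure by padding can be checked directly. The following is a minimal NumPy sketch; the function names and the array sizes are illustrative only.

```python
import numpy as np

def circular_convolve(x, h):
    """FFT product without padding: the result wraps around (cyclic convolution)."""
    n = len(x)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, n))

def fft_convolve(x, h):
    """Linear convolution via FFTs, zero-padded so that nothing wraps around."""
    n_out = len(x) + len(h) - 1
    nfft = 1 << (n_out - 1).bit_length()       # next power of two >= n_out
    y = np.fft.ifft(np.fft.fft(x, nfft) * np.fft.fft(h, nfft))
    return y[:n_out]

rng = np.random.default_rng(0)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
h = rng.standard_normal(8) + 1j * rng.standard_normal(8)

direct = np.convolve(x, h)
padded_ok = np.allclose(fft_convolve(x, h), direct)
# Without padding only the first len(h) - 1 outputs are contaminated.
cyc = circular_convolve(x, h)
edge_wrong = not np.allclose(cyc[:7], direct[:7])
rest_ok = np.allclose(cyc[7:], direct[7:64])
print(padded_ok, edge_wrong, rest_ok)
```

Padding one very long series to a single huge FFT for every filter is what the slide flags as inefficient, which motivates processing it in blocks.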

Page 21:

Convolution in the Fourier domain

§ Overlap-save method:

1. Break the input signal into blocks of L points and pad the beginning with M zeros

2. Pick overlapping regions of L+M-1

3. Apply convolution to each block with each template via FFT

4. Discard segment edges and concatenate result

FDAS: Fourier Domain Acceleration search
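The four steps can be sketched as follows. This uses the textbook overlap-save convention for a filter of M taps (M−1 leading zeros, blocks of L+M−1 points overlapping by M−1), which may differ slightly from the slide's wording; the function name and the default `nfft` are assumptions.

```python
import numpy as np

def overlap_save(x, h, nfft=64):
    """Linear convolution of a long series x with a short filter h, block by block."""
    m = len(h)
    step = nfft - (m - 1)                  # L: new output points per block
    n_out = len(x) + m - 1
    n_blocks = -(-n_out // step)           # ceiling division
    # Step 1: pad the beginning with M-1 zeros (and the end up to a whole block).
    padded = np.zeros(n_blocks * step + m - 1, dtype=complex)
    padded[m - 1 : m - 1 + len(x)] = x
    h_fft = np.fft.fft(h, nfft)
    out = np.empty(n_blocks * step, dtype=complex)
    for b in range(n_blocks):
        # Step 2: overlapping block of L + M - 1 = nfft points.
        block = padded[b * step : b * step + nfft]
        # Step 3: convolve via FFT (cyclic within the block).
        y = np.fft.ifft(np.fft.fft(block) * h_fft)
        # Step 4: discard the M-1 wrapped points and concatenate the rest.
        out[b * step : (b + 1) * step] = y[m - 1 :]
    return out[:n_out]

rng = np.random.default_rng(1)
x = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
h = rng.standard_normal(9) + 1j * rng.standard_normal(9)
matches_direct = np.allclose(overlap_save(x, h), np.convolve(x, h))
print(matches_direct)
```

Only short FFTs of length `nfft` are needed, and the blocks are independent of one another, so they can be processed in parallel.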

Page 22:

– ~8-10 min observation intervals.
– Pre-processing also applies correction trials for interstellar medium effects
– ~4500 8 Mpts data series to analyse per beam in real-time, for ~2000 beams
– Each data series may require up to 300 orbital acceleration trials.

Application parameters

[Diagram: input beam data → pre-processing → 4500 data series → pulsar search → pulsar candidates]

Data requirements for pulsar searches: ~80 Petabytes of data in total

Page 23:

SKA science requirements drive design

FDAS Parameters

Observation period (‘real-time’ restriction) ~ 540 sec

Number of data series to process ~ 4500

Time series size 2^23 points (float)

Minimum number of templates 96 (under investigation)

Application parameters
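A back-of-the-envelope reading of these parameters, as a short script. The per-series time budget follows directly from the table; the size of the complex plane additionally assumes that a real-to-complex transform keeps about N/2 frequency bins and that each bin is stored as a single-precision complex number (8 bytes). Both are assumptions made here, although the result is consistent with the ~3 GB quoted for the cuFFT version.

```python
obs_seconds = 540        # observation period ("real-time" restriction)
n_series = 4500          # data series to process per beam
n_points = 2**23         # samples per time series
n_templates = 96         # minimum number of templates

# Time available per data series if one beam must keep up with real time.
budget_per_series = obs_seconds / n_series          # seconds

# Size of the complex f-fdot plane for one series.
n_bins = n_points // 2   # assumption: ~N/2 bins from a real-to-complex FFT
bytes_per_bin = 8        # assumption: single-precision complex values
plane_gib = n_bins * n_templates * bytes_per_bin / 2**30

print(f"{budget_per_series * 1e3:.0f} ms per series, {plane_gib:.1f} GiB per plane")
```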

Page 24:

FDAS on the GPU

Page 25:

Using cuFFT

§ Adapted for the GPU from the PRESTO code (S. Ransom, NRAO)

§ cuFFT dominates the execution time

§ limited by global memory bandwidth

§ Transfers complex plane (~3 GB) 4 times

Page 26:

Find optimal batched cuFFT plan
- block size, number of blocks per call
- static plan memory allocation
- can use more cuFFT executions / less memory for similar performance

[Figure: memory storage required (MB) versus number of F-Fdot rows in IFFT.]

[Figure: execution time (ms) versus number of F-Fdot rows in IFFT. Legend: FFT size 1K, 2K, 4K, 8K.]

Using cuFFT

Page 27:

templates

FFT’d segments

f-fdot plane

Perform point-wise multiplication of each signal segment with a template and write the result to the corresponding $f - \dot{f}$ location

Using cuFFT

Convolution kernel

Page 29:

Convolution kernel

templates

FFT’d segments

f-fdot plane

Each signal segment convolves with all templates

Loop over segments

Using cuFFT
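The structure of the convolution kernel (loop over segments, multiply each FFT'd segment point-wise with every FFT'd template, write powers into the plane) can be mimicked on the CPU with NumPy broadcasting. This is a simplified sketch rather than the GPU kernel itself: the function name `ffdot_plane`, the segment length and the random stand-in templates are assumptions, and each row is left as a causal convolution instead of being centred on the template.

```python
import numpy as np

def ffdot_plane(spectrum, templates, nfft=256):
    """Power of `spectrum` filtered by each row of `templates`, one plane row per template."""
    n_templates, m = templates.shape
    n = len(spectrum)
    step = nfft - (m - 1)                       # valid outputs per segment
    n_seg = -(-n // step)                       # ceiling division
    padded = np.zeros(n_seg * step + m - 1, dtype=complex)
    padded[m - 1 : m - 1 + n] = spectrum
    templates_fft = np.fft.fft(templates, nfft, axis=1)   # shape (n_templates, nfft)
    plane = np.empty((n_templates, n_seg * step))
    for s in range(n_seg):                      # loop over segments
        seg_fft = np.fft.fft(padded[s * step : s * step + nfft])
        # Point-wise multiply this segment with all templates at once.
        filtered = np.fft.ifft(seg_fft[None, :] * templates_fft, axis=1)
        # Drop the wrapped edge and store the power at this segment's columns.
        plane[:, s * step : (s + 1) * step] = np.abs(filtered[:, m - 1 :]) ** 2
    return plane[:, :n]

rng = np.random.default_rng(2)
spectrum = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
templates = rng.standard_normal((5, 17)) + 1j * rng.standard_normal((5, 17))

plane = ffdot_plane(spectrum, templates)
reference = np.array([np.abs(np.convolve(spectrum, t)[:1000]) ** 2 for t in templates])
print(plane.shape, np.allclose(plane, reference))
```

In this layout every segment is read once and reused for all templates, while the full complex product has to be written out and read back around the inverse FFT, which is the memory traffic the cuFFT version is limited by.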

Page 30:

Execution breakdown

Using cuFFT

Page 31:

Using a custom FFT

• Small filters, per-thread block operations

• FFT, convolution and power spectrum in one kernel on shared memory

• Eliminating 4 complex plane transfers to/from global memory

• Eliminating ~2/3 of storage

• FFT algorithms: Cooley-Tukey, Stockham, Pease... (developed by Karel Adamek)

• Up to 2.8× speedup on Maxwell
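For readers unfamiliar with the algorithms named above, here is a plain-Python sketch of a radix-2 Stockham autosort FFT, which ping-pongs between two buffers and needs no bit-reversal pass. It illustrates the algorithm only and is not the custom GPU kernel described in the talk; the function name and the test length of 512 (the KFFT template size quoted in the comparison) are choices made here.

```python
import numpy as np

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward transform, power-of-two length)."""
    a = np.array(x, dtype=complex)
    n = len(a)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    b = np.empty_like(a)
    half, stride = n // 2, 1
    while half >= 1:
        for j in range(half):
            w = np.exp(-2j * np.pi * j / (2 * half))    # twiddle for this butterfly group
            for k in range(stride):
                c0 = a[k + j * stride]
                c1 = a[k + j * stride + half * stride]
                b[k + 2 * j * stride] = c0 + c1
                b[k + 2 * j * stride + stride] = w * (c0 - c1)
        a, b = b, a                                     # swap input and output buffers
        half //= 2
        stride *= 2
    return a

rng = np.random.default_rng(3)
v = rng.standard_normal(512) + 1j * rng.standard_normal(512)
agrees = np.allclose(stockham_fft(v), np.fft.fft(v))
print(agrees)
```

Each stage reads from one buffer and writes to the other in natural order, an access pattern that maps well onto two shared-memory buffers within a thread block.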

Page 32:

FDAS with cuFFT: global memory bandwidth limited

1. Compute, Bandwidth, or Latency Bound
The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by computation, memory bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuda_convolve_reg_1d_halftemps" is most likely limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to determine how it is limiting performance.

1.1. Kernel Performance Is Bound By Memory Bandwidth
For device "GeForce GTX TITAN X" the kernel's compute utilization is significantly lower than its memory utilization. These utilization levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kernel the limiting factor in the memory system is the bandwidth of the Device memory.

FDAS with custom FFT: shared memory bandwidth limited

1. Compute, Bandwidth, or Latency Bound
The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by computation, memory bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuda_convolve_customfft_wes" is most likely limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to determine how it is limiting performance.

1.1. Kernel Performance Is Bound By Memory Bandwidth
For device "GeForce GTX TITAN X" the kernel's compute utilization is significantly lower than its memory utilization. These utilization levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kernel the limiting factor in the memory system is the bandwidth of the Shared memory.

Using a custom FFT

Page 33:

§ Optimizations:

- Complex conjugate templates symmetry: compute on the fly (Prabu Thiagaraj, SKA group University of Manchester)

- Perform 2x IFFT per thread block, reducing synchronization costs

Using a custom FFT

3.2. Occupancy Is Not Limiting Kernel Performance
The kernel's block size, register usage, and shared memory usage allow it to fully utilize all warps on the GPU.

Page 34:

Performance comparison

Parameters

# input series: 1
Signal size: 2^23
# of templates: 96
Template size (BASIC): 8192
Template size (KFFT): 512

Page 38:

Real-time performance for 4500 series, 540 sec

Performance comparison

2 beams in one card

2 x M60: 300 W
1 x TitanX: 250 W

potential for big savings in energy and capital costs

Page 39:

Comparison with CPU OpenMP code (PRESTO)

Performance comparison

Page 40:

Conclusions

• An optimised matched-filtering implementation for pulsar searches was developed on GPUs, executing in ~1/5 of the time available for real-time processing with current NVIDIA technology

• This greatly increases the potential to enable some of the most exciting science that can be done with pulsars, which has not been possible so far

• These computational speedups are also crucial in reducing energy consumption and hence operational costs of the SKA facilities, which is an important factor in the overall project budget.

Page 41:

Acknowledgements

This work is funded by a Leverhulme Trust Project Grant: “ARTEMIS: Real-time discovery in Radio Astronomy”. PI: A. Karastergiou (Oxford Astrophysics); co-Is: W. Armour, M. Giles (OeRC).

The software developed forms part of the Astro-Accelerate library software development, PI W. Armour, at the Oxford e-Research Centre (www.oerc.ox.ac.uk).

The work is supported by additional members of the Oxford Pulsar Group, namely Christopher Williams and Jayanth Chennamangalam, as well as by the Time Domain Team, a collaboration between Oxford, Manchester and MPIfR Bonn, to design and build the SKA pulsar search capabilities.

Special thanks to Scott Ransom from NRAO for his help and advice.