© Copyright Khronos Group & CSIRO, 2013 - Page 1
Accelerated Science – use of OpenCL in Land Down Under
Tomasz Bednarz, Computational Research Scientist, CSIRO
Outline
• CSIRO - Positive Impact
• Bragg – the CSIRO GPU Cluster
• CSIRO Accelerated Computing Service
• MASSIVE
• Radiation therapy
• Level set segmentation
• CFD
• Cloud based imaging
• Other projects
• OpenCL for Game AI Pro
• Sydney Khronos Chapter
• ACW at OzViz 2013
CSIRO - Positive Impact
• In the world’s Top 10 institutions for 3 research fields.
• Top 1% of global research institutions in 14 of 22 research fields.
• Trusted advisor, innovative.
• 2000 doctorates, 500 masters.
Bragg – the CSIRO GPU Cluster
• A bit of history
- Heterogeneous compute cluster (CPUs + GPUs)
- Funded by CSS TCP (2009) - first of its kind in AU
• Continuously evolving:
- GPUs upgraded in 2010 to NVIDIA Fermi Teslas
- CPUs upgraded in 2012 to Intel Sandy Bridge, and a 3rd GPU added to each node
- GPUs upgraded in 2013 to NVIDIA Kepler Teslas
• Spec:
- 128 Dual Xeon 8-core E5-2650 (= 2048 cores)
- 128 GB RAM, FDR10 InfiniBand Interconnect
- 132 Fermi Tesla M2050 GPUs (= 59,136 cores)
- 254 Kepler Tesla K20 GPUs (= 633,984 cores)
- 16 Intel SC5110P Xeon Phi (= 960 cores)
- Ranked 289 on the Top500 list (as of June 2013)
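The bracketed totals in the spec follow from multiplying device counts by cores per device. A quick check, assuming per-device core counts taken from vendor specifications rather than from the slide (448 per Fermi M2050, 2496 per Kepler card, which is the K20 figure, and 60 per Xeon Phi 5110P):

```python
# Cores per device: assumptions from vendor specifications, not from the slide.
CORES_PER_DEVICE = {
    "Xeon E5-2650": 8,
    "Tesla M2050": 448,
    "Kepler Tesla": 2496,
    "Xeon Phi 5110P": 60,
}

# Device counts as listed on the slide.
DEVICE_COUNTS = {
    "Xeon E5-2650": 128 * 2,   # 128 dual-socket nodes
    "Tesla M2050": 132,
    "Kepler Tesla": 254,
    "Xeon Phi 5110P": 16,
}


def total_cores():
    """Return the total core count per device type."""
    return {name: DEVICE_COUNTS[name] * CORES_PER_DEVICE[name]
            for name in DEVICE_COUNTS}


if __name__ == "__main__":
    for name, cores in total_cores().items():
        print(f"{name}: {cores:,} cores")
```

Under these assumptions the products reproduce the four totals quoted in the spec list.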
CSIRO Accelerated Computing Service
• About the Service
- Supported by ASC Applications and Novel Technology Staff
- Not limited to GPU support
- Delivers:
  - Accelerated Computing Projects
  - HPC Workshops and Training
  - Benchmarking
• Activities include
- Porting and Parallelising
  - Targeting HPC clusters, GPUs
- Profiling
- Performance Tuning
- Debugging
- Numerical Error Analysis
- Algorithmic Improvements
Sam Moskwa – ACW at OzViz 2011
OpenCL, CUDA, Python, R, OpenMP, MPI, C/C++, OpenACC, Fortran
MASSIVE
- Two high performance computing facilities, located at the Australian Synchrotron and Monash University.
- Specialised imaging and visualisation software and databases.
- Expertise in visualisation, image processing, image analysis, HPC and GPU computing.
- Training and Outreach program.
- NCI Specialised Facility for Imaging and Visualisation.
For more info contact Dr Wojtek James Goscinski
Scientific Visualisation – CT reconstruction
Insect CT scan, rendered using Drishti (http://anusf.anu.edu.au/Vizlab/drishti/) by Sherry Mayo (CSIRO)
Radiation therapy applications
• Modern radiation therapy is to a large extent a computational discipline and can greatly benefit from the use of task- and data-parallelism. Some applications have already been demonstrated on GPUs:
- CT reconstructions
- Image registrations
- Treatment planning
- Dose computations (e.g. X Gu, U Jelen et al 2011 PMB 56)
• Need for speed: imaging and treatment verification can be used as feedback to improve the treatment (adaptive radiotherapy), currently offline (mostly population-based), one day online.
• Particle (proton/carbon ion) therapy with raster scanning @ University of Marburg:
- most precise external beam technique (only 5 centers worldwide: 3 active, 2 to start)
- increased precision = increased need for verification (more computations)
- longer computational times (small head case: 1 hour on a single thread)
• Collaborative project between CSIRO and University of Marburg
(Adapted from Schlegel and Mahr 2001)
Ammazzalorso, Bednarz, Jelen
Plan robustness in radiation therapy
• Automatic discovery of robust beam setups.
• Results (mean and sd for a single beam):
- 4-core Intel Xeon W3530 2.8GHz 12GB RAM + NVIDIA Tesla C2050 3GB RAM
- 10 skull base cases, 42 beam directions (10 runs each for timing stats)
- 4k-40k pencils of 120-350 samples, 2 mm analysis radius (0.5 mm step)
- Single-precision floating-point operations only (sufficient precision)
mean (sd), times in ms
Case  Native (1 thread)  GPU OpenCL  Gain (GPU)  CPU OpenCL   Gain (CPU)
P1    21299 (6628)       219 (109)   119 x (36)  6498 (1996)  3.3 x (0.0)
P2    9891 (2837)        122 (51)    98 x (34)   2552 (615)   3.8 x (0.4)
P3    6258 (1485)        88 (38)     87 x (30)   1898 (438)   3.3 x (0.0)
P4    15768 (4959)       148 (56)    123 x (36)  4810 (1495)  3.3 x (0.0)
P5    4342 (1136)        61 (24)     83 x (25)   1324 (331)   3.3 x (0.0)
P6    10888 (3179)       160 (65)    81 x (24)   3280 (944)   3.3 x (0.0)
P7    10117 (2849)       151 (64)    82 x (30)   3051 (841)   3.3 x (0.0)
P8    5464 (1470)        52 (22)     124 x (42)  1396 (310)   3.9 x (0.4)
P9    8155 (2195)        109 (46)    90 x (31)   2481 (649)   3.3 x (0.0)
P10   11388 (3936)       126 (61)    106 x (29)  2935 (818)   3.8 x (0.4)
Pool  10357 (5941)       124 (75)    99 x (36)   3022 (1798)  3.5 x (0.3)
F. Ammazzalorso (Uni-Marburg), T. Bednarz (CSIRO) and U. Jelen (Uni-Marburg)
- Oral poster presentation at Int. Conference on the Use of Computers in Radiation Therapy ICCR2013 (Melbourne, 6-9 May 2013)
- Accepted for journal publication in IOP JPCS (upcoming)
Fast particle therapy dose computation
• How can OpenCL 2.0 help?
- Radiation therapy (RT) needs both task- and data-parallelism and is a perfect potential application field for heterogeneous parallel systems.
- OpenCL 2.0 can help exploit these systems in RT applications and open the way to new/simpler usage scenarios.
- The robust planning approach requires continuous exchange of large datasets: for now latency can be hidden with ad-hoc synchronization
→ Great opportunity with OCL2.0 shared virtual memory (SVM).
- Dose computation (especially for ions, where biological weighting is involved) is a more complex problem, which OCL2.0 dynamic parallelism (DP) can help break into simpler blocks, enabling scheduling of different kernels without host-side and communication overhead.
- A combination of SVM and DP can open the way to treatment plan optimization on massively parallel architectures: large working datasets so far unsuitable for GPU computation (at decent resolutions) and complex scheduling to enable e.g. intermediate dose computation at each optimization step.
Image segmentation
• What is image segmentation?
- Wikipedia: Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
• Our Goal: fast, interactive and accurate segmentations even with noisy data
HCA-Vision
Multiple Myeloma: deadly cancer of blood plasma cells
• Facts
- Despite a rash of new drugs and advances in stem-cell therapy, this rare bloodborne cancer is still an almost certain death sentence
- A cure remains a long way off
- Plasma cell cancer
- Collections of abnormal cells accumulate in bones, where they cause bone lesions (abnormal areas of tissue), and in the bone marrow, where they interfere with the production of normal blood cells
- The disease develops in 1–4 per 100,000 people per year.
Micro-CT scans reveal bone damage from myeloma in a 5T2MM-bearing mouse.
PETER CROUCHER & KARIN VANDERKERKEN
Level sets segmentation - applications
Segmentation using Level Sets
• Embed a seed surface in an image
• Iteratively deform the surface along its normal according to local properties of the surface and the underlying image
• Good: competitive accuracy compared to manual segmentation.
• Advantages: arbitrarily complex shapes can be modelled, and topological changes such as merging and splitting are handled implicitly.
[Figure: segmentation evolving over iterations]
Level Set Workflow
- Initialize φ0 to Signed Euclidean Distance Transform from mask m
- Calculate Data Speed Term D(I)
- Calculate First and Second Order Spatial Derivatives
- Calculate Curvature Terms
- Calculate Gradient of φ
- Calculate Speed Term F
- Update Time Step, and Reinitialize φ if needed
(Repeat Until Converged)
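• Governing equation
$$\frac{\partial \phi}{\partial t} = -|\nabla \phi| \left[ \alpha D(x) + (1-\alpha)\, \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} \right]$$
The first bracketed term is the velocity term; the second is the mean curvature (smoothing term).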
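As an illustration of the governing equation, here is a minimal sketch of one explicit update on a 2D grid. It is a literal transcription of the equation using central differences, not the authors' OpenCL solver; the function name, the index-clamping boundary treatment and the fixed time step are assumptions, and production solvers typically use upwind differencing and an adaptive time step.

```python
import math


def level_set_step(phi, speed, alpha, dt):
    """One explicit Euler step of
    d(phi)/dt = -|grad phi| * (alpha * D + (1 - alpha) * curvature),
    with curvature = div(grad phi / |grad phi|) and D given by `speed`."""
    rows, cols = len(phi), len(phi[0])

    def at(i, j):
        # Clamp indices to the grid: a simple boundary extension (assumption).
        return phi[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    new_phi = [row[:] for row in phi]
    for i in range(rows):
        for j in range(cols):
            # First and second order spatial derivatives (central differences).
            px = (at(i, j + 1) - at(i, j - 1)) / 2.0
            py = (at(i + 1, j) - at(i - 1, j)) / 2.0
            pxx = at(i, j + 1) - 2.0 * at(i, j) + at(i, j - 1)
            pyy = at(i + 1, j) - 2.0 * at(i, j) + at(i - 1, j)
            pxy = (at(i + 1, j + 1) - at(i + 1, j - 1)
                   - at(i - 1, j + 1) + at(i - 1, j - 1)) / 4.0
            grad_sq = px * px + py * py
            grad = math.sqrt(grad_sq)
            if grad < 1e-12:
                continue  # |grad phi| = 0, so the right-hand side vanishes
            # Expanded form of div(grad phi / |grad phi|) in 2D.
            curvature = (pxx * py * py - 2.0 * px * py * pxy
                         + pyy * px * px) / (grad_sq * grad)
            force = alpha * speed[i][j] + (1.0 - alpha) * curvature
            new_phi[i][j] = phi[i][j] - dt * grad * force
    return new_phi
```

Each grid point is updated independently from the previous iterate, which is why the step maps naturally onto one GPU work-item per voxel.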
[Figure panels: Input Image, Applied Mask, φ0 as SDF; evolution over iterations]
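Level Set 2D: Effect of α
• Governing equation:
$$\frac{\partial \phi}{\partial t} = -|\nabla \phi| \left[ \alpha D(x) + (1-\alpha)\, \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} \right]$$
[Figure panels: 900 iterations, α = 0.06; 900 iterations, α = 0.95; 1800 iterations, α = 0.001]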
Execution time:
Xeon E5520     152.24 sec
Quadro FX580    15.43 sec
C2050            8.94 sec
C2050 opt        2.12 sec
Max speedup ~70x
Level Set Method
• Computational Problems
- Computationally intensive for moderate sized data sets
- SDF is not maintained
- Explicit schemes give CFL time step restriction
• Current Approach
- OpenCL to accelerate adaptive timestepping PDE solver
- OpenMPI to distribute processing to engage multiple GPUs
- Qt – interactive user interface
- Liar/C2Liar – Quantitative Imaging toolbox (CSIRO internal)
- OpenGL + OpenCL for interactive real time volume rendering
• Further Implementation Notes
- The Level Set PDE solver mapped across multiple GPUs using OpenCL and MPI
- 3D volume split evenly among MPI processes
- Ghost slices transmitted between processes at each iteration – minimal overhead
- Simple boundary extension is performed at each iteration
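The slab decomposition with ghost slices can be sketched without MPI by treating each list entry as one slice and each sub-list as the slab owned by one process. This is an illustrative sketch only: the function names are hypothetical, each slice is reduced to a single number for brevity, and the list copies stand in for the MPI send/receive pairs a real implementation would use.

```python
def split_evenly(slices, parts):
    """Split a list of slices into `parts` contiguous slabs of near-equal size.
    Assumes parts <= len(slices), so no slab is empty."""
    base, extra = divmod(len(slices), parts)
    slabs, start = [], 0
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        slabs.append(slices[start:start + size])
        start += size
    return slabs


def exchange_ghosts(slabs):
    """Pad every slab with one ghost slice on each side.
    Interior faces receive the neighbouring slab's edge slice; outer faces
    replicate the slab's own edge slice (simple boundary extension)."""
    padded = []
    for rank, slab in enumerate(slabs):
        lower = slabs[rank - 1][-1] if rank > 0 else slab[0]
        upper = slabs[rank + 1][0] if rank < len(slabs) - 1 else slab[-1]
        padded.append([lower] + slab + [upper])
    return padded


def smooth(padded_slab):
    """3-point average along the split axis; ghost slices are read, not written."""
    return [(padded_slab[k - 1] + padded_slab[k] + padded_slab[k + 1]) / 3.0
            for k in range(1, len(padded_slab) - 1)]


if __name__ == "__main__":
    volume = [float(k * k) for k in range(10)]
    slabs = split_evenly(volume, 3)
    result = [v for p in exchange_ghosts(slabs) for v in smooth(p)]
    print(result)
```

Because only one slice per face is exchanged per iteration, the communication volume stays small relative to the work done inside each slab, which is consistent with the "minimal overhead" noted above.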
“Safe Interactions” with bacteria: level set segmentation is also being researched to track biofilm formation.
Bone Model C4-data-set
Interactive volume visualisation – OpenGL + OpenCL
GPU Speed-up Results
[Chart: Execution Time (sec per 50 Iterations) vs Number Of Processes (3, 6, 9, 12), for GPU and CPU]
Cloud based image analysis and processing
• Available now: http://cloudimaging.net.au, see our demos
• Currently using the WebGL-based 3D viewer Slice:Drop (http://slicedrop.com)
• Need for WebCL/WebGL for interactive parameter tuning
Experiments with Fluids
Courtesy of High Field Magnet Laboratory, NL
[Figure panels: Wakayama Jet; Ozoe Lab, Candle; Ozoe Lab, 10T magnet. Labels: Air, N2 gas]
Verification – Physics vs Simulations
[Figure panels: PIT, numerical]
Methodology
• Governing equations → Discretization → Implementation → Verification → Results
• Transfer CPU code to heterogeneous architectures → Verification → Results
• Methods tested
- Finite Difference (for discretization), HSMAC Method (for mutual iteration of pressure and velocity fields), BFC (to solve N-S on complex geometry), Upwind and UTOPIA (for getting more stable and accurate results), OpenCL (speedup)
• Speedups achieved: ~200-230x
• Easily extended to 2D and 3D and applied to simulate different cases, e.g. smoke, free surface flows, ventilation, turbulent flows, etc.
• See: https://www.massive.org.au/images/stories/workshops-and-tutorials/computational-fluid-dynamics.pdf
Boundary Fitted Coordinates
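To illustrate the upwind idea listed among the methods, here is a minimal sketch of a first-order upwind step for 1D linear advection, u_t + c u_x = 0, on a periodic grid. It is far simpler than the Navier-Stokes solver described on the slide and is not the authors' code; the function name and the periodic boundary are assumptions.

```python
def upwind_step(u, c, dx, dt):
    """Advance u by one time step with first-order upwind differencing.
    The difference is taken on the side the flow comes from, which keeps the
    scheme stable for |c| * dt / dx <= 1."""
    n = len(u)
    cfl = c * dt / dx
    if abs(cfl) > 1.0:
        raise ValueError("time step violates the CFL condition")
    if c >= 0:
        # Flow moves towards larger i: use the backward difference.
        # u[i - 1] wraps to the last cell for i = 0 (periodic boundary).
        return [u[i] - cfl * (u[i] - u[i - 1]) for i in range(n)]
    # Flow moves towards smaller i: use the forward difference.
    return [u[i] - cfl * (u[(i + 1) % n] - u[i]) for i in range(n)]


if __name__ == "__main__":
    pulse = [0.0, 1.0, 0.0, 0.0]
    print(upwind_step(pulse, c=1.0, dx=1.0, dt=0.5))
```

Every cell reads only its own value and one neighbour from the previous step, so the update is data-parallel in the same way as the OpenCL kernels the slide refers to.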
Ground Penetrating Radar and FDTD
• Finite-difference time-domain method (FDTD) is a computational technique used to model electromagnetic fields (in our case, radio waves). It provides an approximation to both the spatial and temporal derivatives that appear in Maxwell’s equations.
• FDTD models how radio waves propagate through space and reflect at boundaries depending on the properties of the material.
• FDTD code is used to generate synthetic data for a ground penetrating radar (GPR) system.
• FDTD simulated data can be compared to real radar system data from a test environment with known geometry. The aim is to get the modelled information to match the real system so that predictions of unknown geometry can be made.
Implementation: Based on original code by Andrew Strange. The OpenCL version used mixed precision code, with single precision image buffers used for constant matrices.
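For readers unfamiliar with FDTD, the leapfrog update at its core can be sketched in one dimension. This is a generic textbook-style sketch in normalised units for vacuum, not the GPR code described here; the function name, grid size, Gaussian source and Courant factor of 0.5 are all assumptions.

```python
import math


def fdtd_1d(cells, steps, source_cell, courant=0.5):
    """Leapfrog update of a 1D electric/magnetic field pair in vacuum."""
    ez = [0.0] * cells   # electric field at integer grid points
    hy = [0.0] * cells   # magnetic field; hy[k] sits between ez[k] and ez[k + 1]
    for n in range(steps):
        # The spatial difference of E drives the time update of H ...
        for k in range(cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # ... and the spatial difference of H drives the time update of E.
        # ez[0] and ez[-1] are never updated, so they stay at zero and act
        # as perfectly reflecting walls.
        for k in range(1, cells - 1):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Soft source: add a Gaussian pulse at one cell.
        ez[source_cell] += math.exp(-((n - 30.0) / 10.0) ** 2)
    return ez


if __name__ == "__main__":
    field = fdtd_1d(cells=200, steps=50, source_cell=100)
    print(max(abs(v) for v in field))
```

Each inner loop touches every cell independently, which is the data-parallel structure that makes FDTD a good fit for GPUs.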
Contact: Josh Bowden CSIRO ASC
Other projects using OpenCL
- PCA NIPALS algorithm – rapid estimation of wood properties
- Sparse Hydrodynamic Ocean Code (SHOC)
- Finite Difference Time Domain (FDTD) - Ground Penetrating Radar
- PCA AWDA-DA Ground water data assimilation
Contact: Josh Bowden CSIRO ASC
GPGPU for Games
• GPU heavily used for game physics
- Performance vs Accuracy
• Nvidia PhysX and FLEX
- Particles, Fluids and Rigid Bodies on the GPU
• Bullet Physics
- Version 3.x to have 100% GPU support
• Not all games require cutting edge graphics, which leaves processing time available for other areas
- Audio Processing, Artificial Intelligence
Conan Bourke, AIE
GPGPU Artificial Intelligence
• Game Artificial Intelligence doesn’t easily lend itself to massively parallel implementations
- Inter-Agent Communication
- Complex decision making typically controlled by a non-programmer designer, resulting in lots of dynamic branching
- Numerous levels of world state data required
• Some aspects do…
- Flow Fields
- Influence Maps
- Path Finding
- Steering Behaviors
Conan Bourke, Tomasz Bednarz
GPGPU Steering Behavior
• Steering Behaviors are a set of commonly used algorithms for agent motion
- Craig Reynolds, “Steering Behaviors for Autonomous Characters”, GDC 1999
• Easily adaptable to parallel implementations due to close similarities to particle physics
- Position, Velocity etc
• Simple brute-force algorithm can be implemented in OpenCL within a couple of hours for vast performance gains
- Conan Bourke & Tomasz Bednarz, “Introduction to GPGPU for AI”, Game AI Pro 2013
[Chart: vertical axis 0 to 4000; horizontal axis 512, 1024, 4096, 8192, 16384; series: Intel i7-2670QM, Nvidia GTX560, Nvidia Tesla C2050]
Conan Bourke, Tomasz Bednarz
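A brute-force steering update of the kind described above can be sketched as follows. This is a minimal illustration of one behaviour (separation), not the code from the Game AI Pro chapter; the function names, the inverse-distance weighting and the speed clamp are assumptions.

```python
import math


def separation_forces(positions, radius):
    """Brute-force O(n^2) separation: each agent is pushed away from every
    neighbour closer than `radius`. The outer loop body is independent per
    agent, so it could map to one work-item per agent on a GPU."""
    forces = []
    for i, (xi, yi) in enumerate(positions):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            dist = math.hypot(dx, dy)
            if 0.0 < dist < radius:
                # Closer neighbours push harder (inverse-distance weighting).
                fx += dx / (dist * dist)
                fy += dy / (dist * dist)
        forces.append((fx, fy))
    return forces


def integrate(positions, velocities, forces, dt, max_speed):
    """Particle-style update: force changes velocity, velocity changes position."""
    new_positions, new_velocities = [], []
    for (x, y), (vx, vy), (fx, fy) in zip(positions, velocities, forces):
        vx, vy = vx + fx * dt, vy + fy * dt
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            # Clamp to the agent's maximum speed.
            vx, vy = vx * max_speed / speed, vy * max_speed / speed
        new_positions.append((x + vx * dt, y + vy * dt))
        new_velocities.append((vx, vy))
    return new_positions, new_velocities
```

The position/velocity update is the same as in a particle system, which is the similarity to particle physics that makes these behaviours easy to parallelise.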
Acknowledgments
• Collaborators, coworkers, coauthors:
- John Taylor
- Josh Bowden
- Luke Domanski
- Filippo Ammazzalorso
- Urszula Jelen
- Marc Piggott
- Wojtek Goscinski
- Conan Bourke
- Tim Gureyev
- Darren Thompson
- NeCTAR RT035 Project team
- Sam Moskwa
- Matt Adcock
- and many others…
Sydney Khronos Chapter
http://www.meetup.com/Sydney-GPU-Users
• Last meetup, 17th Oct
- Tomasz Bednarz: Welcome, the Khronos Group
- Conan Bourke: The Fall and Rise of OpenGL for Games
- Rob Manson: The Augmented Web
- Visit to the CSIRO Vis Lab
- Pizza
Accelerated Computing Workshop @ OzViz
• ACW and OzViz 2013 to be held on 8th-10th December 2013 in Melbourne
- Technologies: OpenCL, CUDA, etc.
- Hardware architectures: GPU, APU, MIC, etc.
• Previous events
- OpenCL Workshop at OzViz 2010, Brisbane
  - Presenters: Mike Houston, Mark Harris, Derek Gerstmann, Tomasz Bednarz, Craig James, Con Caris, John Taylor
- ACW at OzViz 2011, Sydney
  - Keynote: Prof. Takayuki Aoki
- ACW at OzViz 2012, Perth
https://sites.google.com/site/ozvizworkshop/ozviz-2013