
Motivation

“Every three minutes a woman is diagnosed with breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006)

• Explore the use of new technology for solving intensive computational problems

Objective

• Help to improve the efficiency of early breast cancer detection

• Minimize the processing cost of the Digital Breast Tomosynthesis Mammography technique

Tomosynthesis reconstruction process

• Reconstructs a 3D image from multiple x-ray radiograph images

Detects and diagnoses breast cancer and abnormalities

NVIDIA GPU - GeForce 8800

• Data-parallel programming On-chip

• SIMD

• Compute Unified Device Architecture (CUDA): a programming interface

Execute C code on NVIDIA GPU

CUDA libraries: FFT and BLAS

Porting Tomosynthesis reconstruction to the GPU

Evaluation environments

Tomosynthesis reconstruction

Execution time (sec) vs. number of iterations

Simplicity

• All software development stages (design, implementation, testing and deployment) are done in a single environment

• Allows novice users to run and work with the Tomosynthesis algorithm on Windows

Summary

• GPU performance is comparable to HPC clusters

Exploit inherent parallelism in algorithm

Reduce communication and synchronization

Launch high number of threads per multiprocessor

Hide memory latency (Implementation is memory bound)

• First implementation of algorithm

Further development can improve performance on both CPU and GPU

Improve memory allocation

Reduce CPU/GPU communication overhead

Optimize kernel threads (running on GPU)

Future work

• Optimize threads running on the GPU and improve CPU/GPU interaction

• Current performance enables further development of the Tomosynthesis algorithm: reducing image noise

• Explore opportunities for speeding up additional applications using the GPU

Acceleration of Digital Tomosynthesis Mammography using Graphics Processors

Diego Rivera, Micha Moffie, Dana Schaa and David Kaeli

Department of Electrical and Computer Engineering Northeastern University, Boston, MA

{drivera, mmoffie, dschaa, kaeli}@ece.neu.edu

Acknowledgement

This project is supported by the Gordon Center for Subsurface Sensing and Imaging Systems. Many thanks to Juemin Zhang (ECE NEU) and Leo Hill (ATS NEU) for their help during the early stages of this work.

Gordon-CenSSIS is a National Science Foundation Engineering Research Center supported in part by the Engineering Research Centers Program of the National Science Foundation (Award # EEC-9986821).

Taken From: National Cancer Institute

From the presentation “GeForce 8800 & NVIDIA CUDA: A New architecture for Computing on the GPU” by Ian Buck, NVIDIA Corporation, at the Supercomputing '06 workshop “General-Purpose GPU Computing: Practice and Experience”, November 13, 2006

[Figure: GeForce 8800 block diagram. Host, Input Assembler, Thread Execution Manager, eight groups of Thread Processors each paired with a Parallel Data Cache, Load/store, Device Memory]

128 Stream Processors, 768 MB, from $530

Taken from the presentation “Acceleration of Maximum Likelihood for Tomosynthesis Mammography” by Juemin Zhang, Waleed Meleis, David Kaeli, Tao Wu. ICPADS’06

[Figure: Tomosynthesis acquisition geometry (X-ray source, detector, X/Y/Z axes) and reconstruction flow chart. Initialization (set 3D volume), Forward (compute projections), Backward (correct 3D volume), Satisfied? (No: iterate again; Yes: exit). Data shown: X-ray projections, 3D volume]

Serial Code

do i = 0 .. 15 begin
    do j = 0 .. 1196 begin
        do k = 0 .. 2304 begin
            kernel code…
        end
    end
end

CUDA Code

do i = 0 .. 15 begin
    Call GPU
end

Thread Computation

Create 1196 x 2304 threads
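
The loop-to-thread mapping described above can be sketched in CUDA roughly as follows. This is a minimal illustration, not the poster's implementation: the outer loop over 16 projections stays on the host, and each kernel launch covers the 1196 x 2304 inner iterations with one thread per (j, k) element. The names `per_element_kernel`, `run_sketch`, `ceil_div`, the 16 x 16 block size and the placeholder arithmetic are all assumptions made for illustration, since the original kernel body is elided. The code is written against the current CUDA runtime API rather than the toolkit available for the GeForce 8800.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Dimensions taken from the poster's loop bounds.
constexpr int kNumProjections = 16;  // outer loop i, kept on the host
constexpr int kRows = 1196;          // j loop, one thread per row index
constexpr int kCols = 2304;          // k loop, one thread per column index
constexpr int kBlock = 16;           // assumed block edge (16 x 16 = 256 threads)

// Number of blocks needed to cover n elements with blocks of size b.
constexpr int ceil_div(int n, int b) { return (n + b - 1) / b; }

// Hypothetical stand-in for the elided "kernel code…": each thread handles
// one (j, k) element for projection i and accumulates a checkable value.
__global__ void per_element_kernel(float* out, int i, int rows, int cols)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (j >= rows || k >= cols) return;             // threads past the edge do nothing
    out[j * cols + k] += static_cast<float>(i + 1); // placeholder work
}

// Runs the host loop; returns the number of elements holding an unexpected
// value, or -1 if a CUDA call fails.
int run_sketch()
{
    const std::size_t n = static_cast<std::size_t>(kRows) * kCols;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, n * sizeof(float));

    dim3 block(kBlock, kBlock);
    dim3 grid(ceil_div(kCols, kBlock), ceil_div(kRows, kBlock));
    assert(block.x * block.y <= 1024);  // stay within the per-block thread limit

    // "do i = 0 .. 15: Call GPU". Launches on the same stream run in order.
    for (int i = 0; i < kNumProjections; ++i) {
        per_element_kernel<<<grid, block>>>(d_out, i, kRows, kCols);
    }
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Each element should hold the sum of (i + 1) for i = 0..15, which is 136.
    int mismatches = 0;
    for (float v : h_out) {
        if (v != 136.0f) ++mismatches;
    }
    return mismatches;
}
```

Calling `run_sketch()` from a `main` function and compiling with `nvcc` should report zero mismatches on a CUDA-capable device. Because 1196 is not a multiple of the assumed block edge, the grid is rounded up and the bounds check in the kernel discards the extra threads.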

Nvidia GTX8800 (GPU)

• 128 Stream Processors, 1.35 GHz

• 768 MB device memory (86.4 GB/sec)

• PCI-E x16

TeraCluster (Cluster)

• 33 servers

• 4 nodes per server (dual processor, dual core)

• Intel Xeon, 2.0 GHz (Pentium M)

• 8/16 GB RAM per server

• Gigabit Ethernet interconnect (among servers)

Opportunity (Cluster)

• 65 servers

• 2 nodes per server (dual processor)

• Xeon EMT 64, 3.2 GHz (Pentium IV)

• 4 GB RAM per server

• Gigabit Ethernet interconnect (among servers)

Workstation

• Intel Core2 CPU (using only 1), 1.86 GHz

• 3 GB RAM

[Figure: Execution time (sec) vs. number of iterations (1, 4, 8, 16). Series: GTX8800; TeraCluster with 32, 16 and 8 nodes; Opportunity with 32, 16 and 8 nodes; Workstation (Serial). Data labels shown on the chart, in groups of four: 72, 191, 349, 664; 27, 72, 131, 250; 65, 291, 565, 1248; 539, 2091, 4157, 8278. Y-axis ticks: 0, 1000, 2000, 3000]