Implementing Boolean matrix multiplication on a GPU

Alexander Okhotin

Department of Mathematics, University of Turku, Finland
Academy of Finland

DESY, Hamburg, Germany
12 April 2010 A. D.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18

Background

High-performance hardware is parallel.

Most algorithms are (partially) sequential.

Find the bottleneck and parallelize it.

The speaker’s case: syntax analysis for general context-free grammars.

- Sequential nature.
- Typically implemented combinatorially.
- Can be done via Boolean matrix multiplication.
  - Valiant (1975): theoretical bound.
  - Okhotin (2010): refactored and generalized.
    - Efficiently parallelized.

Implementing on a Graphics Processing Unit.
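For reference, the Boolean product replaces addition by OR and multiplication by AND: entry (i, j) of the result is 1 exactly when some k has a[i][k] = 1 and b[k][j] = 1. A minimal sequential sketch, assuming matrices stored as lists of 0/1 rows (the name `bool_matmul` is illustrative, not from the talk):

```python
def bool_matmul(a, b):
    """Return the Boolean product of two matrices given as lists of 0/1 rows."""
    n, m, p = len(a), len(b), len(b[0])
    # c[i][j] is 1 iff some k has a[i][k] = 1 and b[k][j] = 1 (an OR of ANDs).
    return [[int(any(a[i][k] and b[k][j] for k in range(m))) for j in range(p)]
            for i in range(n)]
```

Every entry of the result depends only on one row of `a` and one column of `b`, which is what makes the operation a natural candidate for parallel hardware.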

Part I

GPU programming

Graphics Processing Units

Designed for 3D graphics in computer games.

- Shading.
- Texturing.
- Per-pixel effects.
- The same function for each pixel.
- Function as a kernel (program).
- Pixel as a work item.

General purpose computation on GPUs.

- Tens of cores, each with multiple ALUs.
- Approaching 1 teraflop.
- Priced as a consumer toy.

Best price to performance ratio.

Special programming techniques.
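The "same function for each pixel" model can be imitated on a CPU. The sketch below is only an illustration: `run_kernel` and `invert_pixel` are hypothetical names, and the nested loops stand in for what a GPU would execute in parallel.

```python
def run_kernel(kernel, width, height, *args):
    """Call `kernel` once per work item (x, y); a GPU would run these in parallel."""
    for y in range(height):
        for x in range(width):
            kernel(x, y, *args)


def invert_pixel(x, y, src, dst):
    """Kernel: compute one output pixel from the matching input pixel."""
    dst[y][x] = 255 - src[y][x]


src = [[0, 64], [128, 255]]   # a tiny 2 x 2 greyscale image
dst = [[0, 0], [0, 0]]
run_kernel(invert_pixel, 2, 2, src, dst)
```

No work item reads what another one writes, so the order of the calls does not matter.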

GPU programming

Proprietary interfaces: NVIDIA CUDA, ATI Stream.

Device-independent language: OpenCL.

- Supported by NVIDIA and ATI drivers.
- CPU implementation.

Kernel: program running on GPU.

- Dialect of C.
- Computes one “work item”.
- Executed for a grid of work items.

Host code running on a CPU.

- Allocate GPU memory.
- Load and compile a kernel.
- Give arguments to the kernel.
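The split between kernel and host code might look as follows for a Boolean matrix product. This is a plain Python imitation, not OpenCL: in real OpenCL the host steps correspond to calls such as clCreateBuffer, clBuildProgram, clSetKernelArg and clEnqueueNDRangeKernel, none of which is used here. The names `bmm_kernel` and `host_multiply` are hypothetical.

```python
def bmm_kernel(i, j, a, b, c, n):
    """Kernel body: one work item computes one entry c[i][j] (flat, row-major)."""
    acc = 0
    for k in range(n):
        acc |= a[i * n + k] & b[k * n + j]
    c[i * n + j] = acc


def host_multiply(a_rows, b_rows):
    """Host side: allocate buffers, bind kernel arguments, launch an n x n grid."""
    n = len(a_rows)
    a = [x for row in a_rows for x in row]   # stands in for a device buffer
    b = [x for row in b_rows for x in row]   # stands in for a device buffer
    c = [0] * (n * n)                        # output buffer
    args = (a, b, c, n)                      # stands in for setting kernel arguments
    for i in range(n):                       # the launch: one call per work item
        for j in range(n):
            bmm_kernel(i, j, *args)
    return [c[i * n:(i + 1) * n] for i in range(n)]
```

The sketch assumes square n x n matrices with 0/1 entries; one work item per output entry is the simplest possible mapping and is not necessarily the one used in the talk.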

Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads running on a core, grouped into warps.

Main system memory (“host memory”): accessed through the bus.

Global memory: accessed by all GPU cores (up to 150 GB/s).

- 64–512-bit bus.
- Threads should preferably access adjacent words.

Local memory: shared by all threads on a core.

- Much faster.
- Often used to cache data.

Private memory, owned by a thread.

Computation divided into work-items.

- 1D, 2D or 3D grid of work-items.
- Block of work-items: work-group.
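One common way to combine work-groups with local memory is tiling: each work-group copies a small block of each input matrix into fast memory and reuses it for all of its work items. The sketch below imitates that pattern sequentially in Python. It is an assumption about how such a kernel could be organised, not the implementation from the talk; `TILE` and `tiled_bool_matmul` are hypothetical names, and n is assumed to be divisible by `TILE`.

```python
TILE = 2  # hypothetical work-group size: TILE x TILE work items


def tiled_bool_matmul(a, b):
    """Boolean product of n x n 0/1 matrices (n divisible by TILE), tile by tile."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for gi in range(0, n, TILE):          # first row handled by this work-group
        for gj in range(0, n, TILE):      # first column handled by this work-group
            for gk in range(0, n, TILE):
                # "Local memory": copy one tile of a and one tile of b.
                ta = [row[gk:gk + TILE] for row in a[gi:gi + TILE]]
                tb = [row[gj:gj + TILE] for row in b[gk:gk + TILE]]
                for li in range(TILE):    # local ids of the work items
                    for lj in range(TILE):
                        for k in range(TILE):
                            # global id = group offset + local id
                            c[gi + li][gj + lj] |= ta[li][k] & tb[k][lj]
    return c
```

Each tile is read from the slow memory once per work-group and then reused TILE times, which is the point of caching data in local memory.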

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 6 / 18

Page 40: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads running on a core, grouped into warps.

Main system memory (“host memory”): accessed through the bus.

Global memory: accessed by all GPU cores (up to 150 Gb/s).

I 64–512-bit bus.I Multiple threads would better access adjacent words.

Local memory: shared by all threads on a core.

I Much faster.I Often used to cache data.

Private memory, owned by a thread.

Computation divided into work-items.

I 1d, 2d or 3d grid of work-items.I Block of work-items: work-group.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 6 / 18

Page 41: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads running on a core, grouped into warps.

Main system memory (“host memory”): accessed through the bus.

Global memory: accessed by all GPU cores (up to 150 Gb/s).

I 64–512-bit bus.I Multiple threads would better access adjacent words.

Local memory: shared by all threads on a core.

I Much faster.I Often used to cache data.

Private memory, owned by a thread.

Computation divided into work-items.

I 1d, 2d or 3d grid of work-items.I Block of work-items: work-group.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 6 / 18

Page 42: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads running on a core, grouped into warps.

Main system memory (“host memory”): accessed through the bus.

Global memory: accessed by all GPU cores (up to 150 Gb/s).

I 64–512-bit bus.I Multiple threads would better access adjacent words.

Local memory: shared by all threads on a core.

I Much faster.I Often used to cache data.

Private memory, owned by a thread.

Computation divided into work-items.

I 1d, 2d or 3d grid of work-items.I Block of work-items: work-group.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 6 / 18

Page 43: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.

Many threads running on a core, grouped into warps.

Main system memory (“host memory”): accessed through the bus.

Global memory: accessed by all GPU cores (up to 150 Gb/s).I 64–512-bit bus.

I Multiple threads would better access adjacent words.

Local memory: shared by all threads on a core.

I Much faster.I Often used to cache data.

Private memory, owned by a thread.

Computation divided into work-items.

I 1d, 2d or 3d grid of work-items.I Block of work-items: work-group.


Primitive example

Example (Jacobi method)

1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for each of A and B.
3. Create the kernel with arguments (n, n, A, B).
4. Invoke it with work-items {0, …, n − 3} × {0, …, n − 3}.
5. Wait for termination.

It works, though very inefficiently:
  - Each value is read 4 times.
  - Memory alignment is ignored.
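The kernel itself did not survive in this transcript. The following CPU-side sketch shows what such a work-item might compute, assuming the usual four-neighbour averaging step of the Jacobi method on a row-major n × n array; the function names and the averaging formula are assumptions, not the speaker's code.

```python
def jacobi_work_item(i, j, n, A, B):
    """One work-item: update interior cell (i+1, j+1) from its four neighbours in A."""
    r, c = i + 1, j + 1
    B[r * n + c] = 0.25 * (A[(r - 1) * n + c] + A[(r + 1) * n + c]
                           + A[r * n + c - 1] + A[r * n + c + 1])

def jacobi_step(n, A):
    """Emulate the invocation over work-items {0..n-3} x {0..n-3}."""
    B = list(A)  # boundary values are carried over unchanged (an assumption)
    for i in range(n - 2):
        for j in range(n - 2):
            jacobi_work_item(i, j, n, A, B)
    return B

# 4 x 4 grid: top boundary row held at 4.0, everything else 0.0.
n = 4
A = [4.0] * n + [0.0] * (n * (n - 1))
B = jacobi_step(n, A)
```

Every work-item performs four separate reads, and a value in the middle of the grid is fetched by each of its four neighbours, which is the redundancy the slide points out.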


Part II

Boolean matrix multiplication


Matrix multiplication as such

S: a semiring.

$A \in S^{m \times \ell}$, $B \in S^{\ell \times n}$.

Their product, $C \in S^{m \times n}$:

$$C_{i,j} = \sum_{k=1}^{\ell} A_{i,k} \cdot B_{k,j}$$

$\ell m n$ multiplications, $(\ell - 1) m n$ additions.

In this talk:
  - S: $\{0, 1\} = \mathbb{B}$;
  - Sum: disjunction;
  - Product: conjunction;
  - Square matrices: $m = n = \ell$.

$\Theta(n^3)$ bit operations.
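A direct transcription of this definition for the Boolean semiring, as a minimal sketch: matrices are lists of rows with entries 0 or 1, and the function name `bool_matmul` is chosen here for illustration.

```python
def bool_matmul(A, B):
    """C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    m, l, n = len(A), len(B), len(B[0])
    return [[int(any(A[i][k] and B[k][j] for k in range(l)))
             for j in range(n)]
            for i in range(m)]

A = [[1, 0],
     [1, 1]]
B = [[0, 1],
     [1, 1]]
C = bool_matmul(A, B)  # [[0, 1], [1, 1]]
```

For n × n matrices the three nested loops perform on the order of n^3 bit operations, which is the baseline the rest of the talk tries to beat.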


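Fast matrix multiplication over a ring

Number of multiplications for 2 × 2 matrices? 8.

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \times \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$

Assume S is a ring.
  - $\forall x \in S \; \exists (-x) \in S : x + (-x) = 0$

Strassen (1969): 2 × 2 matrices using 7 multiplications.
  - First, compute 14 linear combinations.
  - Second, calculate their 7 pairwise products.
  - Linear combinations of these products yield the results.
  - Larger matrices: as block matrices,
    $$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}.$$
  - $O(n^{\log_2 7})$ operations for n × n matrices.

Coppersmith and Winograd (1990): $O(n^{2.376})$ operations.

$(\mathbb{B}, \wedge, \vee)$ is not a ring: disjunction has no inverse, since no $x$ satisfies $1 \vee x = 0$.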
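The slide does not list the seven products. The sketch below uses the formulas commonly given for Strassen's method, which is an assumption about what the slide refers to; entries are plain integers here, but the same formulas apply to blocks over any ring.

```python
def strassen_2x2(A, B):
    """Multiply two 2 x 2 matrices over a ring using 7 multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # 7 products of linear combinations (subtraction needs a ring).
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Linear combinations of the products give the four entries.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

C = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```

The subtractions are exactly where the ring assumption is used, which is why the method cannot be applied to Boolean matrices directly.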


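Applying fast matrix multiplication to the Boolean semiring

n × n Boolean matrices.

Multiplying them in $\mathbb{Z}_{n+1}$; every non-zero entry of the result corresponds to 1 in $\mathbb{B}$:

$$\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}}_{\text{in } \mathbb{Z}_3} = \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}}_{\text{in } \mathbb{B}}$$

One bit → $\lceil \log (n+1) \rceil$ bits.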
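The reduction can be sketched as follows. A plain cubic loop stands in for a fast ring algorithm, which is a simplification made here; both function names are hypothetical. Since every entry of the integer product is at most n, reducing modulo n + 1 never turns a non-zero entry into zero.

```python
def integer_product_mod(A, B, modulus):
    """Product of two square 0/1 matrices, computed in the ring Z_modulus."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % modulus
             for j in range(n)]
            for i in range(n)]

def bool_matmul_via_ring(A, B):
    """Boolean product obtained from the product in Z_{n+1}."""
    n = len(A)
    C = integer_product_mod(A, B, n + 1)
    # A non-zero count of witnesses k means the disjunction is true.
    return [[1 if entry != 0 else 0 for entry in row] for row in C]

A = [[1, 0], [1, 1]]
B = [[0, 1], [1, 1]]
in_Z3 = integer_product_mod(A, B, 3)  # [[0, 1], [1, 2]]
in_B = bool_matmul_via_ring(A, B)     # [[0, 1], [1, 1]]
```

The price is the one noted on the slide: each matrix entry now occupies several bits instead of one.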


An $O(n^3 / \log n)$ method for Boolean matrices

Arlazarov et al. (1970)

Fix $k \ll n$.

Multiplying 1 × k blocks of A by k × n blocks of B.

At most $2^k$ different 1 × k blocks.

Pre-compute all $2^k$ products with each block of B ($\frac{n}{k}$ blocks).

Look up n bits for each 1 × k block of A.

Time complexity:

$$\underbrace{2^k \cdot \frac{n}{k} \cdot n}_{\text{making the table}} + \underbrace{\frac{n^3}{k}}_{\text{multiplication}}$$

$\frac{2n^3}{\log n}$ operations for $k = \log n$.
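A CPU-side sketch of this table-lookup scheme, under some assumptions made here: rows are packed into Python integers (bit j holds column j), the last block of B may have fewer than k rows, and the function name `four_russians` is chosen for illustration.

```python
def four_russians(A, B, k):
    """Boolean product of n x n 0/1 matrices using lookup tables of 2^k entries."""
    n = len(A)
    # Pack each row of B into an integer: bit j holds B[r][j].
    b_rows = [sum(bit << j for j, bit in enumerate(row)) for row in B]
    c_rows = [0] * n
    for start in range(0, n, k):
        block = b_rows[start:start + k]  # one k x n block of B
        # table[mask] = OR of the block rows selected by the bits of mask.
        table = [0] * (1 << len(block))
        for mask in range(1, len(table)):
            low = (mask & -mask).bit_length() - 1  # index of the lowest set bit
            table[mask] = table[mask & (mask - 1)] | block[low]
        for i in range(n):
            # The 1 x k block of row i of A, read as a table index.
            mask = sum(bit << t for t, bit in enumerate(A[i][start:start + k]))
            c_rows[i] |= table[mask]  # look up n bits at once
    return [[(row >> j) & 1 for j in range(n)] for row in c_rows]

C = four_russians([[1, 0], [1, 1]], [[0, 1], [1, 1]], 1)  # [[0, 1], [1, 1]]
```

Each table entry is derived from an earlier entry with a single OR, so building one table takes $2^k$ row operations, in line with the first term of the time bound.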


An O( n3

log n) method for Boolean matrices

Arlazarov et al. (1970)Fix k << n.Multiplying 1× k blocks of A by k × n blocks of B.

At most 2k different 1× k blocks.

Pre-compute all 2k products with each of block of B ( nk blocks).Look up n bits for each 1× k block of A,Time complexity:

2k · nk· n︸ ︷︷ ︸

making the table

+n3

k︸︷︷︸multiplication

2n3

log n operations for k = log n.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18

Page 95: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

An O( n3

log n) method for Boolean matrices

Arlazarov et al. (1970)Fix k << n.Multiplying 1× k blocks of A by k × n blocks of B.

At most 2k different 1× k blocks.Pre-compute all 2k products with each of block of B ( nk blocks).

Look up n bits for each 1× k block of A,Time complexity:

2k · nk· n︸ ︷︷ ︸

making the table

+n3

k︸︷︷︸multiplication

2n3

log n operations for k = log n.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18

Page 96: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

An O( n3

log n) method for Boolean matrices

Arlazarov et al. (1970)Fix k << n.Multiplying 1× k blocks of A by k × n blocks of B.

At most 2k different 1× k blocks.Pre-compute all 2k products with each of block of B ( nk blocks).Look up n bits for each 1× k block of A,

Time complexity:

2k · nk· n︸ ︷︷ ︸

making the table

+n3

k︸︷︷︸multiplication

2n3

log n operations for k = log n.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18

Page 97: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

An O( n3

log n) method for Boolean matrices

Arlazarov et al. (1970)Fix k << n.Multiplying 1× k blocks of A by k × n blocks of B.

At most 2k different 1× k blocks.Pre-compute all 2k products with each of block of B ( nk blocks).Look up n bits for each 1× k block of A,Time complexity:

2k · nk· n︸ ︷︷ ︸

making the table

+n3

k︸︷︷︸multiplication

2n3

log n operations for k = log n.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18

Page 98: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

An O( n3

log n) method for Boolean matrices

Arlazarov et al. (1970)Fix k << n.Multiplying 1× k blocks of A by k × n blocks of B.

At most 2k different 1× k blocks.Pre-compute all 2k products with each of block of B ( nk blocks).Look up n bits for each 1× k block of A,Time complexity:

2k · nk· n︸ ︷︷ ︸

making the table

+n3

k︸︷︷︸multiplication

2n3

log n operations for k = log n.Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 12 / 18

Part III

Boolean matrix multiplication on a GPU

Joint work with Christian Reitwießner (Würzburg)

Main performance considerations

Matrices $A, B \in \mathbb{B}^{n \times n}$ are on the CPU:
  - either multiply them on the CPU,
  - or send to the GPU (and use which method?).

If n < 200, faster to multiply than to transfer.

If n > 50000, will not fit on the GPU.
  - Processing by parts.

Direct $n^3$ multiplication.
  - For n > 100 already superseded.

Arlazarov et al.: $\frac{n^3}{\log n}$ operations.
  - Basic operation: union of rows.
  - Works well on GPU.

Strassen's method: $O(n^{\log_2 7})$.
  - Have to multiply ints instead of bits!
  - Inductive on n, reducing to many small matrices.
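The size thresholds above amount to a simple dispatch rule. A minimal sketch, assuming the quoted limits (200 and 50000) as defaults; the function names, the strategy labels and the padding of rows to whole 64-bit words are illustrative assumptions, and the real limits depend on the hardware at hand.

```python
def packed_size_bytes(n):
    """Bytes for one n x n Boolean matrix stored at one bit per entry."""
    words_per_row = (n + 63) // 64  # assumption: rows padded to whole 64-bit words
    return n * words_per_row * 8


def choose_strategy(n, small=200, large=50000):
    """Pick where to multiply; thresholds default to the values quoted on the slide."""
    if n < small:
        return "cpu"            # the transfer would cost more than the multiplication
    if n > large:
        return "gpu-by-parts"   # the operands do not fit on the GPU at once
    return "gpu"


for n in (100, 2048, 60000):
    print(n, choose_strategy(n), packed_size_bytes(n))
```

For n = 50000 the packed size comes to roughly 313 MB per matrix, which gives a feeling for why matrices beyond that size have to be processed by parts.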


The $O(n^3 / \log n)$ method on a GPU
Making a table for B

Matrix $B \in \mathbb{B}^{n \times n}$ on the GPU.

For each block of lines $i \in \{0, \ldots, \frac{n}{k} - 1\}$, create table $T[i] \in \mathbb{B}^{2^k \times n}$.

Line $(b_{k-1} \ldots b_1 b_0)_2$ in T: disjunction of all lines with $b_j = 1$.

Work items: every 64 bits in each line.
  - $2^k$ disjunctions of longs.
  - Threads access adjacent words.

Another dimension: T[i] for different i.
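The table construction can be emulated on the CPU. The sketch below is an illustration under stated assumptions, not the kernel from the talk: `build_tables` and the layout `b_words[r][w]` (word w of line r of B) are hypothetical names, k is assumed to divide n, and the two outer loops stand in for work items that would run in parallel on a GPU. Each table line is derived from an earlier one with a single OR, which is one way to arrive at about $2^k$ disjunctions per work item.

```python
def build_tables(b_words, k):
    """Return tables with tables[i][mask][w] = OR of word w of lines i*k + j of B,
    taken over all j such that bit j of mask is set."""
    n, words = len(b_words), len(b_words[0])
    assert n % k == 0, "sketch assumes k divides n"
    tables = [[[0] * words for _ in range(1 << k)] for _ in range(n // k)]
    for i in range(n // k):          # one dimension of work items: block of lines
        for w in range(words):       # other dimension: 64-bit word within a line
            for mask in range(1, 1 << k):
                j = (mask & -mask).bit_length() - 1   # position of the lowest set bit
                rest = mask & (mask - 1)              # mask with that bit cleared
                # rest < mask, so its entry has already been filled in.
                tables[i][mask][w] = tables[i][rest][w] | b_words[i * k + j][w]
    return tables


# 4 x 4 identity matrix, one word per line (only the low 4 bits are used), k = 2.
b_words = [[0b0001], [0b0010], [0b0100], [0b1000]]
tables = build_tables(b_words, k=2)
print(tables)
```

Work items with neighbouring w touch neighbouring words of the same lines, which matches the remark that threads access adjacent words.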


The $O(n^3 / \log n)$ method on a GPU
Multiplying the matrices

Matrix $A \in \mathbb{B}^{n \times n}$ on the GPU.

$\frac{n}{k}$ tables $T[i] \in \mathbb{B}^{2^k \times n}$ on the GPU.

Compute the product A × B.

Work items: lines of A (and C).

Step 1: cache the line of A to local memory.

Block-column of A determines the number of the table.

1 × k block of A indexes the table.

Disjunction with the line of C.

Second dimension: every 64 bits in each line of T and C.
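The lookup phase can be sketched in the same spirit. This is a CPU emulation under assumptions, not the actual GPU kernel: all names are hypothetical, A is given as plain 0/1 lists, k is assumed to divide n, and bit j of a table index corresponds to column i·k + j of A. To keep the block self-contained, the tables are built here straight from their definition rather than efficiently.

```python
def tables_by_definition(b_words, k):
    """Unoptimized reference: T[i][mask] = OR of lines i*k + j of B with bit j of mask set."""
    n, words = len(b_words), len(b_words[0])
    tables = []
    for i in range(n // k):
        table = []
        for mask in range(1 << k):
            entry = [0] * words
            for j in range(k):
                if (mask >> j) & 1:
                    entry = [e | x for e, x in zip(entry, b_words[i * k + j])]
            table.append(entry)
        tables.append(table)
    return tables


def multiply_with_tables(a_rows, tables, k, words):
    """Return C = A x B as lines of 64-bit words, using the precomputed tables."""
    n = len(a_rows)
    c = [[0] * words for _ in range(n)]
    for r in range(n):             # first dimension of work items: lines of A and C
        line = a_rows[r]           # stands in for caching the line of A in local memory
        for w in range(words):     # second dimension: 64-bit words of T and C
            acc = 0
            for i in range(n // k):            # block-column i selects table T[i]
                index = 0
                for j in range(k):             # the 1 x k block of A is the table index
                    index |= line[i * k + j] << j
                acc |= tables[i][index][w]     # disjunction into the line of C
            c[r][w] = acc
    return c


a_rows = [[1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
b_words = [[0b0010], [0b0001], [0b1100], [0b1010]]  # bit c of a word = column c of B
c = multiply_with_tables(a_rows, tables_by_definition(b_words, 2), k=2, words=1)
print([[bin(x) for x in line] for line in c])
```

Line 0 of A selects lines 0 and 3 of B, so line 0 of C is their disjunction; the other lines follow the same pattern.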


Performance

n = 2048, k = 8.

                 CPU       Nvidia G210M             Nvidia GTS250
                           (low-end laptop GPU)     (average gaming card)
Time             234 ms    17.4 ms                  3.3 ms
Memory access    -         9.4 GB/s                 51.9 GB/s

Basically, bandwidth-limited.

The cores could compute more!

Optimization: cache more in local memory.
  - Local memory: usually 16 KB per core.
  - Compute the table by parts.
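A little arithmetic on the quoted figures shows the speedups and why a whole table does not fit in local memory. This is a back-of-the-envelope sketch: reading "16 KB" as 16 × 1024 bytes is an assumption, and the slides do not say how the table is actually split into parts.

```python
cpu_ms, g210m_ms, gts250_ms = 234.0, 17.4, 3.3   # times quoted on the slide
n, k = 2048, 8

speedup_g210m = cpu_ms / g210m_ms
speedup_gts250 = cpu_ms / gts250_ms

row_bytes = n // 8                      # one table line holds n bits
table_bytes = (1 << k) * row_bytes      # one table T[i] has 2^k lines
local_bytes = 16 * 1024                 # assumption: "16 KB" means 16 KiB
parts_needed = -(-table_bytes // local_bytes)    # ceiling division

print(f"speedup on G210M:  {speedup_g210m:.1f}x")
print(f"speedup on GTS250: {speedup_gts250:.1f}x")
print(f"one table: {table_bytes} bytes, at least {parts_needed} parts of {local_bytes} bytes")
```

With these parameters one table takes 64 KiB, four times the assumed local memory, so any scheme that caches table lines locally has to work on a fraction of a table at a time.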

Page 132: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Performance

n = 2048, k = 8.

CPUNvidia G210M

(low-end laptop GPU)

Nvidia GTS250(average gaming card)

Time

234 ms 17.4 ms 3.3 ms

Memory access

9.4 GB/s 51.9 GB/s

Basically, bandwidth-limited.

The cores could compute more!

Optimization: cache more in local memory.

I Local memory: usually 16 KB per core.I Compute the table by parts.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 17 / 18

Page 133: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Performance

n = 2048, k = 8.

CPUNvidia G210M

(low-end laptop GPU)

Nvidia GTS250(average gaming card)

Time 234 ms 17.4 ms 3.3 msMemory access

9.4 GB/s 51.9 GB/s

Basically, bandwidth-limited.

The cores could compute more!

Optimization: cache more in local memory.

I Local memory: usually 16 KB per core.I Compute the table by parts.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 17 / 18

Page 134: Implementing Boolean matrix multiplication on a GPU · I Per pixel e ects. I The same function for each pixel. I Function as a kernel (program). I Pixel as a work item. General purpose

Performance

n = 2048, k = 8.

CPUNvidia G210M

(low-end laptop GPU)

Nvidia GTS250(average gaming card)

Time 234 ms 17.4 ms 3.3 msMemory access 9.4 GB/s 51.9 GB/s

Basically, bandwidth-limited.

The cores could compute more!

Optimization: cache more in local memory.

I Local memory: usually 16 KB per core.I Compute the table by parts.


Future work in this project

1. Refactor to use more local memory.
2. Implement multiplication of huge matrices.
3. Do a practical comparison with Strassen.
4. For the parsing application: better performance on smaller matrices.
   - Large matrices are handled fast enough.
   - 128x128 and 256x256 matrices dominate the running time.
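For the planned comparison with Strassen, one point is worth making concrete. Strassen's algorithm needs subtraction, which Boolean OR/AND does not provide, so the usual route is to multiply the 0/1 matrices over the integers and then threshold the result. The following Python sketch illustrates that route for power-of-two sizes. It is not the speaker's implementation, and all function names are hypothetical.

```python
def _add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def _sub(X, Y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def strassen(X, Y):
    """Integer product of two n x n matrices, n a power of two."""
    n = len(X)
    if n <= 2:                                    # small base case: plain product
        return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2

    def quad(M):                                  # split into four h x h blocks
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

    a11, a12, a21, a22 = quad(X)
    b11, b12, b21, b22 = quad(Y)
    # Seven recursive products instead of eight.
    m1 = strassen(_add(a11, a22), _add(b11, b22))
    m2 = strassen(_add(a21, a22), b11)
    m3 = strassen(a11, _sub(b12, b22))
    m4 = strassen(a22, _sub(b21, b11))
    m5 = strassen(_add(a11, a12), b22)
    m6 = strassen(_sub(a21, a11), _add(b11, b12))
    m7 = strassen(_sub(a12, a22), _add(b21, b22))
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

def bool_matmul_strassen(A, B):
    """Boolean product of 0/1 matrices: integer product, then threshold."""
    return [[1 if v > 0 else 0 for v in row] for row in strassen(A, B)]
```

The integer entries can grow up to n, so this route gives up the one-bit-per-entry packing that a purely Boolean method can use. That trade-off is part of what a practical comparison would have to measure.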
