How to Write Fast Code
18-645, Spring 2008, 6th Lecture, Feb. 4th
Instructor: Markus Püschel
TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred)
Technicalities
Research project
First steps:
Precise problem statement
Correct implementation (create verification environment for future use)
Analyze arithmetic cost
Measure runtime and create a performance plot
If algorithm consists of several steps: identify bottleneck(s) w.r.t. both cost and runtime
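A minimal sketch of the measurement step (not from the lecture): time a kernel and convert runtime into one point of a performance plot in Gflop/s. The MMM kernel, the size range, and the use of gettimeofday are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Kernel under test (illustrative): C = A*B + C, n x n, row-major */
static void mmm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Wall-clock time in seconds */
static double wtime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    for (int n = 64; n <= 1024; n *= 2) {
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));
        double t = wtime();
        mmm(n, A, B, C);
        t = wtime() - t;
        /* MMM performs 2n^3 flops (see the MMM slides below) */
        printf("n = %4d: %6.2f Gflop/s\n", n, 2.0 * n * n * n / t / 1e9);
        free(A); free(B); free(C);
    }
    return 0;
}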
Temporal and Spatial Locality
Properties of a program:
Temporal locality: data that is referenced is likely to be referenced again in the near future. Promotes data reuse.
Spatial locality: if data is referenced, data in proximity (by address) is likely to be referenced in the near future. Promotes neighbor use.
Exists because: 1) this is how humans think; 2) structure of numerical algorithms
History of locality
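A small C sketch (not from the slides) to make the two properties concrete: both functions sum the same row-major n x n matrix, but only the first walks memory with stride 1; the accumulator s shows temporal locality, being reused in every iteration.

/* Good spatial locality: consecutive addresses (stride 1) */
double sum_rowwise(int n, const double *A) {
    double s = 0.0;                 /* temporal locality: s reused each step */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += A[i*n + j];
    return s;
}

/* Poor spatial locality: stride-n accesses jump across cache lines */
double sum_colwise(int n, const double *A) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += A[i*n + j];
    return s;
}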
Today
Linear algebra software: history, LAPACK and BLAS
Blocking: key to performance
MMM
ATLAS: MMM program generator
Linear Algebra Algorithms: Examples
Solving systems of linear equations
Eigenvalue problems
Singular value decomposition
LU/Cholesky/QR/… decompositions
… and many others
These algorithms make up most of the numerical computation across disciplines (sciences, computer science, engineering)
Efficient software is extremely relevant
The Path to LAPACK: EISPACK and LINPACK
Libraries for linear algebra algorithms
Developed in the early 70s
Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart, …
LINPACK still used as benchmark for the TOP500 (Wiki) list of most powerful supercomputers
Problem: Implementation “vector-based,” i.e., no locality in data access
Low performance on computers with deep memory hierarchy
Became apparent in the 80s
Solution: LAPACK
Reimplement the algorithms “block-based,” i.e., with locality
Developed late 1980s, early 1990s
Jim Demmel, Jack Dongarra et al.
LAPACK and BLAS
Basic idea: LAPACK (static) is built on top of BLAS (reimplemented for each platform)
BLAS = Basic Linear Algebra Subroutines (list)
BLAS1: vector-vector operations (e.g., vector sum)
BLAS2: matrix-vector operations (e.g., matrix-vector product)
BLAS3: matrix-matrix operations (mainly matrix-matrix product)
LAPACK is implemented on top of BLAS (web), as much as possible using block matrix operations (locality) = BLAS3
Implemented in F77 (to enable good compilation)
Open source
BLAS recreated for each platform to port performance
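To illustrate the layering, a hedged sketch of calling the BLAS3 routine dgemm through the CBLAS interface to compute C = 1.0*A*B + 1.0*C; square row-major matrices are an assumption of the example.

#include <cblas.h>

void mmm_blas(int n, const double *A, const double *B, double *C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,      /* m, n, k: all square here      */
                1.0, A, n,    /* alpha, A, leading dimension   */
                B, n,         /* B, leading dimension          */
                1.0, C, n);   /* beta, C, leading dimension    */
}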
Why is BLAS3 so important?
Explain on blackboard
Using BLAS3 = blocking
Motivate blocking
Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms
The introduction of multicore processors requires a reimplementation of LAPACK (just multithreading BLAS is not good enough)
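A minimal sketch of blocking for MMM (C = A*B + C, row-major): the three outer loops walk NB x NB tiles so that each tile of A and B is reused NB times while it is resident in cache. NB = 32 is an illustrative assumption (in practice it is tuned to the cache sizes, as ATLAS does later in this lecture), and NB is assumed to divide n.

#define NB 32   /* tile size: illustrative; tuned so 3 tiles fit in L1 */

void mmm_blocked(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB)
                /* mini-MMM on one NB x NB tile */
                for (int i0 = i; i0 < i + NB; i0++)
                    for (int j0 = j; j0 < j + NB; j0++)
                        for (int k0 = k; k0 < k + NB; k0++)
                            C[i0*n + j0] += A[i0*n + k0] * B[k0*n + j0];
}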
Matlab
Invented in the late 70s by Cleve Moler
Commercialized (MathWorks) in 1984
Motivation: Make LINPACK, EISPACK easy to use
Matlab uses LAPACK and other libraries, but can only call them if you operate on matrices and vectors and do not write your own loops, e.g.:
A*B (MMM)
A\b (solving linear system)
Today
Linear algebra software: history, LAPACK and BLAS
Blocking: key to performance
MMM
ATLAS: MMM program generator
MMM by Definition
Usually computed as C = AB + C
Cost as computed before:
n^3 multiplications + n^3 additions
= 2n^3 floating point operations
= O(n^3) runtime
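Written out as the standard triple loop (a sketch; row-major layout is an assumed convention): the innermost statement performs one multiplication and one addition and executes n^3 times, giving the 2n^3 flops above.

void mmm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)                 /* body runs n^3 times */
                C[i*n + j] += A[i*n + k] * B[k*n + j];  /* 2 flops per iteration */
}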
Blocking: increases locality (see previous example)
Does not decrease cost
Can we do better?
Strassen’s Algorithm
Strassen, V. “Gaussian Elimination is Not Optimal.” Numerische Mathematik 13, 354-356, 1969
Until then, MMM was thought to be Θ(n^3)
Check out algorithm at Mathworld
Recurrence T(n) = 7T(n/2) + O(n^2): multiplies two n x n matrices in O(n^log2(7)) ≈ O(n^2.808)
Similar to Karatsuba
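For concreteness, one level of Strassen's construction, written out for 2 x 2 matrices with scalar entries (a sketch of the published formulas): 7 multiplications instead of 8, at the price of 18 additions; applying it recursively to matrix blocks gives the O(n^2.808) bound.

/* Strassen's 7 products for C = A*B on 2 x 2 matrices */
void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2]) {
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) * b[0][0];
    double m3 = a[0][0] * (b[0][1] - b[1][1]);
    double m4 = a[1][1] * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) * b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}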
Crossover point, in terms of cost: n = 654, but…
Structure more complex
Numerical stability inferior
Can we do better?
MMM Complexity: What is known
Coppersmith, D. and Winograd, S. "Matrix Multiplication via Arithmetic Progressions." J. Symb. Comput. 9, 251-280, 1990
MMM is O(n^2.376) and (obviously) Ω(n^2)
It could well be Θ(n^2)
Compare this to matrix-vector multiplication, which is Θ(n^2) (Winograd), i.e., boring
MMM is the single most important computational kernel in linear algebra (probably in whole numerical computing)
Today
Linear algebra software: history, LAPACK and BLAS
Blocking: key to performance
MMM
ATLAS: MMM program generator
MMM: Memory Hierarchy Optimization
[Plot: performance [Gflop/s] (0 to 6) vs. matrix side (1 to 2048) for MMM (square real double) on a 3 GHz Core 2 Duo; curves: triple loop, ATLAS generated, theoretical peak]
Intel compiler icc -O2
Huge performance difference for large sizes
Great case study to learn memory hierarchy optimization
ATLAS
Successor of PhiPAC, BLAS program generator (web)
People can also contribute handwritten code
The generator uses empirical search over implementation alternatives to find the fastest implementation
No vectorization or parallelization
We focus on BLAS3 MMM
Search is only over algorithms of cost 2n^3 (equal to the triple loop)
ATLAS Architecture
[Diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, L*) → ATLAS Search Engine (MMSearch) → ATLAS MM Code Generator (MMCase), driven by NB, MU, NU, KU, xFetch, MulAdd, Latency → MiniMMM source → Compile, Execute, Measure → MFLOPS, fed back to the search engine]
Hardware parameters:
L1Size: size of L1 data cache
NR: number of registers
MulAdd: fused multiply-add available?
L*: latency of FP multiplication
Search parameters:
Span the search space
Specify the code
Found by orthogonal line search
source: Pingali, Yotov, Cornell U.
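A hedged sketch of the orthogonal line search mentioned above: parameters are optimized one at a time, each swept over its candidate values while the others stay fixed at their current best. measure() and candidates() are hypothetical stand-ins for ATLAS's compile-execute-measure loop and its per-parameter candidate ranges.

#define NPARAMS 4   /* e.g., NB, MU, NU, KU */

/* Hypothetical: build, run, and time a mini-MMM; return MFLOPS */
double measure(const int params[NPARAMS]);
/* Hypothetical: fill values[] with candidates for parameter p; return count */
int candidates(int p, int values[]);

void orthogonal_line_search(int best[NPARAMS]) {
    for (int p = 0; p < NPARAMS; p++) {      /* one line search per parameter */
        int values[64];
        int count = candidates(p, values);
        double best_perf = -1.0;
        int best_val = best[p];
        for (int v = 0; v < count; v++) {
            best[p] = values[v];             /* try one candidate value */
            double perf = measure(best);
            if (perf > best_perf) { best_perf = perf; best_val = values[v]; }
        }
        best[p] = best_val;                  /* freeze at the best value found */
    }
}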
How ATLAS Works
Blackboard