Page 1: VECTOR INSTRUCTIONS

Slides by: Pedro Tomás

Additional reading: "Computer Architecture: A Quantitative Approach", 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

ADVANCED COMPUTER ARCHITECTURES

ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Page 2: Outline

Flynn’s taxonomy of computer systems

Vector processing

Single Instruction, Multiple Data Stream (SIMD) processing

Implementation

Intel SIMD extensions

Examples

Page 3: Flynn’s taxonomy

In 1966 Michael J. Flynn introduced a classification of computer architectures:

Single Instruction, Single Data stream (SISD)

Each instruction processes a single data stream

Single Instruction, Multiple Data streams (SIMD)

Each instruction processes multiple data elements

Typical in vector architectures and ISA extensions

Multiple Instruction, Single Data stream (MISD)

Multiple instructions operate over a single data stream

Not typical; theoretically useful for fault-tolerant systems

Multiple Instruction, Multiple Data streams (MIMD)

Multiple programs process multiple sets of data

Typical in multicore architectures

Page 4: SIMD processing benefits

SIMD architectures can exploit significant data-level parallelism for:

matrix-oriented scientific computing

media-oriented image and sound processors

SIMD is more energy efficient than MIMD

Only needs to fetch one instruction per data operation

Makes SIMD attractive for personal mobile devices

SIMD allows the programmer to continue thinking sequentially

Page 5: Introduction to SIMD processing

Basic idea:

Read sets of data into “vector registers”

Operate on those registers

Disperse the results back into memory

[Figure: a 128-bit register holds four 32-bit words, WORD 3 (bits 127-96), WORD 2 (95-64), WORD 1 (63-32) and WORD 0 (31-0). The elements A3..A0 and B3..B0 are added pairwise in a single operation, producing A3+B3, A2+B2, A1+B1 and A0+B0; a minimal intrinsics version of this operation is sketched below.]
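A minimal sketch of this operation with SSE intrinsics (not from the slides; the function name add4 is hypothetical and the xmmintrin.h header is assumed):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Adds two arrays of four floats element-wise: res[i] = a[i] + b[i] */
void add4(const float *a, const float *b, float *res)
{
    __m128 va = _mm_loadu_ps(a);      /* read A3..A0 into a vector register */
    __m128 vb = _mm_loadu_ps(b);      /* read B3..B0 */
    __m128 vr = _mm_add_ps(va, vb);   /* one instruction performs the four additions */
    _mm_storeu_ps(res, vr);           /* disperse the results back into memory */
}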

Page 6: SIMD implementation

Special functional units allow parallel execution of SIMD instructions

Instruction latency depends on:

Instruction type

Length of operand vectors

Structural hazards

Data dependencies

Registers are typically controlled by the compiler

Used to hide memory latency

Leverage memory bandwidth

Page 7: Intel SIMD instructions

Intel MMX (MultiMedia Extensions) [1996]

8 registers (MM0-MM7), each 64 bits wide

Each MMx register can hold:

2x 32-bit Integer, or

4x 16-bit Integer, or

8x 8-bit Integer

Instructions include:

Shift operations

Logical operations (AND, OR, XOR)

ADD / SUB with or without saturation (see the sketch after this list)

Multiply

Load/Store (pack/unpack)
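To make the saturating vs. wrapping ADD/SUB point concrete, here is a small sketch using the equivalent 128-bit SSE2 integer intrinsics (an assumption for illustration; the original MMX intrinsics operate on __m64 values and additionally require _mm_empty()):

#include <emmintrin.h>   /* SSE2 integer intrinsics */

/* Wrapping vs. saturating addition of packed signed 16-bit integers. */
void saturation_demo(short *wrap_out, short *sat_out)
{
    __m128i a = _mm_set1_epi16(30000);    /* eight copies of 30000 */
    __m128i b = _mm_set1_epi16(10000);    /* eight copies of 10000 */

    __m128i wrap = _mm_add_epi16(a, b);   /* wraps around: 40000 becomes -25536 */
    __m128i sat  = _mm_adds_epi16(a, b);  /* saturates at 32767 */

    _mm_storeu_si128((__m128i *)wrap_out, wrap);
    _mm_storeu_si128((__m128i *)sat_out, sat);
}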

Page 8: More SIMD instructions

AMD 3DNow! Extensions [1997]

Extension of the MMX instructions to support vector processing of single-precision (32-bit) floating point numbers

New instructions including:

min, max

square root

Intel SSE (Streaming SIMD Extensions) [1999]

8 registers (XMM0-XMM7), each 128 bits wide

Each XMMx register can hold:

4x 32-bit single precision floating point

Page 9: More SIMD instructions

Intel SSE2 (Streaming SIMD Extensions) [2001]

8 registers (XMM0-XMM7), each 128 bits wide

XMMx registers can hold:

2x 64-bit double precision floating point

4x 32-bit single precision floating point

4x 32-bit / 8x 16-bit / 16x 8-bit integer

Instructions include:

Arithmetic operations (ADD, SUB, MUL, DIV, SQRT, MIN, MAX)

Move and shuffle

Bitwise logical operations

Cache pre-fetch instructions

Page 10: More SIMD instructions

Intel SSE3 (Streaming SIMD Extensions) [2003]

Addition of new instructions (e.g., horizontal add)

Intel SSE4 (Streaming SIMD Extensions) [2006]

New instructions:

SAD (Sum of absolute differences)

Dot product

Intel AVX (Advanced Vector eXtensions) [2010]

Expands the registers to 256-bit vectors (YMM0-YMM15)

Page 11: Programming with SIMD extensions

Intel 64 and IA-32 Architectures Software Developer Manuals http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Intel Architecture Instruction Set Extensions Programming Reference http://download-software.intel.com/sites/default/files/319433-014.pdf

Intel Intrinsics Guide

http://software.intel.com/en-us/articles/intel-intrinsics-guide

Intel C and C++ Compilers

http://software.intel.com/en-us/c-compilers/


Page 12: Programming with SIMD instructions

Two possibilities:

Using assembly language

// Multiply a constant vector by a constant scalar and return the result

for (i=0; i<4; i++)

Y[i] = X[i] * k;

Page 13: Programming with SIMD instructions

Two possibilities:

Using assembly language

// Multiply a constant vector by a constant scalar and return the result

Vector4 SSE_Multiply(const Vector4 &Op_A, const float &Op_B)

{

Vector4 Ret_Vector;

// Create a 128-bit vector with four copies of Op_B

__m128 F = _mm_set1_ps(Op_B);

// Enter Assembly mode

__asm

{

MOV EAX, Op_A // Load pointer into CPU reg

MOVUPS XMM0, [EAX] // Move the vector to an SSE reg

MULPS XMM0, F // Multiply vectors

MOVUPS [Ret_Vector], XMM0 // Save the return vector

}

return Ret_Vector;

}
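The slide leaves Vector4 undefined; a hypothetical definition such as the following (which would have to precede the function) keeps it at four packed floats, matching the 128-bit MOVUPS accesses, and shows a possible call site:

// Hypothetical definition (not shown on the slide): 4 floats = 128 bits
struct Vector4 {
    float v[4];
};

// Example usage: scales (1, 2, 3, 4) by 2.5, giving (2.5, 5, 7.5, 10)
void example_call(void)
{
    Vector4 x = { {1.0f, 2.0f, 3.0f, 4.0f} };
    Vector4 y = SSE_Multiply(x, 2.5f);
    (void)y;   // result used only for illustration
}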

Page 14: Programming with SIMD instructions

Two possibilities:

Using C/C++ intrinsic functions to hide assembly

/* compute SSE vector product, 4 products at a time */

for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {

// load packed single precision floating point

vect1 = _mm_loadu_ps(faux1);

vect2 = _mm_loadu_ps(faux2);

// multiply packed single precision floating point vectors

vectRes = _mm_mul_ps(vect1,vect2);

// store packed single precision floating point

_mm_storeu_ps(raux, vectRes);

}

Page 15: Programming with SIMD instructions

First steps to extract parallelism:

1. Resolve dependencies to make sure the loop is parallelizable

2. Apply loop unrolling as many times as the number of elements in the vector

3. Parallelize each new iteration using SIMD instructions

[Figure: statement dependence graphs over S1-S4, annotated with RAW(A), RAW(B), WAW(A) and WAR(C) dependencies, illustrating the dependence analysis of step 1.]

Page 16: Programming with SIMD instructions

Optional steps:

1. Apply further loop unrolling (see the sketch below)

Minimizes the overhead due to loop control instructions

May be limited by the number of available SIMD registers

2. Apply software pipelining

Minimizes conflicts due to instruction latencies

Requires fewer SIMD registers

3. Optimize the code to maximize the number of cache hits

Typically involves dividing the computation into blocks so as to maximize reuse of data in the L1/L2/L3 caches
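As a sketch of optional step 1 (not from the slides), the 128-bit intrinsics loop shown earlier can be unrolled once more, so that each iteration processes 8 floats with two independent XMM temporaries; the function name is hypothetical:

#include <xmmintrin.h>

/* Element-wise product with the SSE loop unrolled 2x (8 floats per iteration). */
void sse128_vector_product_unroll2(float *f1, float *f2, float *res, int len)
{
    int i;
    for (i = 0; i + 8 <= len; i += 8) {
        __m128 a0 = _mm_loadu_ps(f1 + i);
        __m128 a1 = _mm_loadu_ps(f1 + i + 4);
        __m128 b0 = _mm_loadu_ps(f2 + i);
        __m128 b1 = _mm_loadu_ps(f2 + i + 4);
        _mm_storeu_ps(res + i,     _mm_mul_ps(a0, b0));
        _mm_storeu_ps(res + i + 4, _mm_mul_ps(a1, b1));
    }
    for (; i < len; i++)        /* remaining elements */
        res[i] = f1[i] * f2[i];
}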

Page 17: Example 1

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 1 – typical solution

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i++) {

res[i] = f1[i] * f2[i];

}

}

Page 18: Example 1 - Parallelizing

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

Page 19: Example 1 - Using VMIPS Assembly instructions

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

; initialize registers
ADDI R1,R0,#f1

ADDI R2,R0,#f2

ADDI R3,R0,#res

LV.PS V1,0(R1)

LV.PS V2,0(R2)

MULVV.PS V1,V1,V2

SV.PS V1,0(R3)

Page 20: Example 1 - Using VMIPS Assembly instructions

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

; initialize registers

ADDI R1,R0,#f1

ADDI R2,R0,#f2

ADDI R3,R0,#res

LD R4,#&len(R0) ; load len to R4

ADD R5,R0,R0 ; i

BGE R5,R4,end ; check if i>=len

loop: LV.PS V1,0(R1)

LV.PS V2,0(R2)

MULVV.PS V1,V1,V2

SV.PS V1,0(R3)

ADDI R1,R1,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R2,R2,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R3,R3,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R5,R5,#4 ; increment i by 4 elements

BLT R5,R4,loop ; loop if i<len

end:

Page 21: Example 1 - Using VMIPS Assembly instructions

How to deal with loops where the number of iterations is not a multiple of the vector length?

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

; initialize registers

ADDI R1,R0,#f1

ADDI R2,R0,#f2

ADDI R3,R0,#res

LD R4,#&len(R0) ; load len to R4

ADD R5,R0,R0 ; i

BGE R5,R4,end ; check if i >= len

SRL R6,R4,2 ; divide R4 by 4 (vector length)

BGE R5,R6,final ; check if i >= (len>>2)

loop: LV.PS V1,0(R1)

LV.PS V2,0(R2)

MULVV.PS V1,V1,V2

SV.PS V1,0(R3)

ADDI R1,R1,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R2,R2,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R3,R3,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R5,R5,#1 ; increment by 1 vector iteration

BLT R5,R6,loop ; loop if i < (len>>2)

final: ...

Page 22: Example 1 - Using VMIPS Assembly instructions

How to deal with loops where the number of iterations is not a multiple of the vector length?

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

; initialize registers

...

loop: LV.PS V1,0(R1)

LV.PS V2,0(R2)

MULVV.PS V1,V1,V2

SV.PS V1,0(R3)

ADDI R1,R1,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R2,R2,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R3,R3,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R5,R5,#1 ; increment by 1 vector iteration

BLT R5,R6,loop ; loop if i < (len>>2)

final: SLL R5,R5,2 ; convert the vector-iteration count into an element index: i = 4*(len>>2)

BGE R5,R4,end ; skip if there are no remaining elements

rem: L.S F0,0(R1) ; execute the remaining iterations, one element at a time

L.S F1,0(R2)

MUL.S F0,F0,F1

S.S F0,0(R3)

ADDI R1,R1,#4 ; advance by one float (4 bytes)

ADDI R2,R2,#4

ADDI R3,R3,#4

ADDI R5,R5,#1 ; i = i + 1

BLT R5,R4,rem ; loop if i < len

end:

Page 23: Example 1 - Using VMIPS Assembly instructions

How to deal with loops where the number of iterations is not a multiple of the vector length?

void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

for (i=0; i<len; i+=4) {

res[i] = f1[i] * f2[i];

res[i+1] = f1[i+1] * f2[i+1];

res[i+2] = f1[i+2] * f2[i+2];

res[i+3] = f1[i+3] * f2[i+3];

}

}

; initialize registers

ADDI R1,R0,#f1

ADDI R2,R0,#f2

ADDI R3,R0,#res

LD R4,#&len(R0) ; load len to R4

ADD R5,R0,R0 ; i

BGE R5,R4,end ; check if i >= len

SRL R6,R4,2 ; divide R4 by 4 (vector length)

BGE R5,R6,final ; check if i >= (len>>2)

loop: LV.PS V1,0(R1)

LV.PS V2,0(R2)

MULVV.PS V1,V1,V2

SV.PS V1,0(R3)

ADDI R1,R1,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R2,R2,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R3,R3,#16 ; increment by 4 words (vector length) of 4 bytes (float)

ADDI R5,R5,#1 ; increment by 1 vector iteration

BLT R5,R6,loop ; loop if i < (len>>2)

final: ...

__m128 vect1, vect2, vectRes;

...

// load packed single precision floating point

vect1 = _mm_loadu_ps(faux1);

vect2 = _mm_loadu_ps(faux2);

// multiply packed single precision floating point vectors

vectRes = _mm_mul_ps(vect1,vect2);

// store packed single precision floating point

_mm_storeu_ps(raux, vectRes);

Page 24: Example 1 - Using Intel intrinsic functions

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 2 – step 2, parallelization

void sse128_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

__m128 vect1, vect2, vectRes;

float *faux1=f1, *faux2=f2, *raux=res;

/* compute SSE vector product, 4 products at a time */

for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {

// load packed single precision floating point

vect1 = _mm_loadu_ps(faux1);

vect2 = _mm_loadu_ps(faux2);

// multiply packed single precision floating point vectors

vectRes = _mm_mul_ps(vect1,vect2);

// store packed single precision floating point

_mm_storeu_ps(raux, vectRes);

}

/* compute remaining elements */

for (i=i<<2; i<len; i++) {

res[i] = f1[i] * f2[i];

}

}

Page 25: Example 1

Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i], i ∈ {0, …, LEN−1}

Solution 3 – Using 256-bit SIMD instructions (AVX)

void sse256_vector_dot_product(float *f1, float *f2, float *res, int len) {

int i;

__m256 vect1, vect2, vectRes;

float *faux1=f1, *faux2=f2, *raux=res;

/* compute AVX vector product, 8 products at a time */

for (i=0; i<(len>>3); i++,faux1+=8,faux2+=8,raux+=8) {

// load packed single precision floating point

vect1 = _mm256_loadu_ps(faux1);

vect2 = _mm256_loadu_ps(faux2);

// multiply packed single precision floating point vectors

vectRes = _mm256_mul_ps(vect1,vect2);

// store packed single precision floating point

_mm256_storeu_ps(raux, vectRes);

}

/* compute remaining elements */

for (i=i<<3; i<len; i++) {

res[i] = f1[i] * f2[i];

}

}

Page 26: Typical problem

Load/Store instructions have maximum efficiency when operands are memory aligned

Two types of memory access instructions:

Memory aligned load/store

_mm_load_ps(address)

_mm_store_ps(address)

Memory unaligned load/store

_mm_loadu_ps(address)

_mm_storeu_ps(address)

Using the aligned instructions is more efficient, but may lead to a segmentation fault if the address is not aligned (see the sketch below)
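A minimal sketch (assumed code, not from the slides) of using the aligned variants safely: allocate the buffers with _mm_malloc so that every 4-float group starts on a 16-byte boundary:

#include <xmmintrin.h>   /* _mm_malloc/_mm_free are pulled in by most compilers */

void aligned_example(int len)
{
    int i;
    /* 16-byte alignment is required by _mm_load_ps / _mm_store_ps */
    float *a = (float *)_mm_malloc(len * sizeof(float), 16);
    float *b = (float *)_mm_malloc(len * sizeof(float), 16);
    float *r = (float *)_mm_malloc(len * sizeof(float), 16);

    /* ... fill a and b ... */

    for (i = 0; i + 4 <= len; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load: a+i is on a 16-byte boundary */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(r + i, _mm_mul_ps(va, vb));
    }

    _mm_free(a);
    _mm_free(b);
    _mm_free(r);
}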

Page 27: Example 2

Write a function to compute the vector inner product between two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]

Page 28: Example 2

Write a function to compute the vector inner product between two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]

float sse128_vector_inner_product(float *f1, float *f2, int len) {

int i;

__m128 v1, v2, vectRes, vv;

float aux[4];

float *faux1=f1,*faux2=f2;

/* initialize all four words with zero */

vectRes = _mm_setzero_ps();

/* compute SSE vector product, 4 products at a time */

for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4) {

// load packed single precision floating point

v1 = _mm_loadu_ps(faux1);

v2 = _mm_loadu_ps(faux2);

// multiply packed single precision floating point vectors

vv = _mm_mul_ps(v1,v2);

// accumulate result

vectRes = _mm_add_ps(vectRes,vv);

}

_mm_storeu_ps(aux, vectRes);

aux[0] = aux[0] + aux[1] + aux[2] + aux[3];

/* compute remaining elements */

for (i=i<<2; i<len; i++) {

aux[0] += f1[i] * f2[i];

}

return aux[0];

}
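The final horizontal sum can also be kept in registers by using the SSE3 horizontal add mentioned earlier; a sketch of just that reduction step (assuming SSE3 and pmmintrin.h; hsum_ps is a hypothetical helper):

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

/* Reduces the four partial sums in a __m128 to a single float, replacing the aux[] store. */
static float hsum_ps(__m128 v)
{
    v = _mm_hadd_ps(v, v);    /* (v0+v1, v2+v3, v0+v1, v2+v3) */
    v = _mm_hadd_ps(v, v);    /* (v0+v1+v2+v3, ...) */
    return _mm_cvtss_f32(v);  /* extract the low element */
}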

Page 29: Example 2

Write a function to compute the vector inner product between two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]

Speed-up (compiled with the Intel C/C++ Compiler, flag -O0):

[Chart: speedup vs. vector length (8 to 524288, log2 scale), with curves for SIMD-128, SIMD-256, and SIMD-256 + 4x loop unrolling; the speedup axis ranges from 0 to 3.]

Page 30: Example 2

Write a function to compute the vector inner product between two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]

Speed-up

Problem:

For large vectors, the computation becomes memory-bound, i.e., the execution time is constrained by the time needed to load the data into the processor

In many cases, efficient cache management can improve performance

Page 31: More on SIMD instructions

Stream pre-fetch

_mm_prefetch( const char* address , int type )

Type = _MM_HINT_T0: load data into all cache levels

Type = _MM_HINT_T1: load data into all cache levels, except L1

Type = _MM_HINT_T2: load data into all cache levels, except L1 and L2

Type = _MM_HINT_NTA: non-temporal access (prefetch the data while minimizing cache pollution)

Stream store

_mm_stream_ps (float* address, __m128 float_vector_4)

_mm256_stream_ps (float* address, __m256 float_vector_8)

_mm_stream_pd (double* address, __m128 double_vector_2)

_mm256_stream_pd (double* address, __m256 double_vector_4)

Store packed values to memory without placing the data in the caches (non-temporal, write-combining stores); see the sketch below
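A sketch (assumed code, not from the slides) combining a prefetch hint with a streaming store in the element-wise product loop; _mm256_stream_ps requires a 32-byte-aligned destination, and a store fence makes the streamed data visible before returning:

#include <immintrin.h>   /* AVX intrinsics and _mm_prefetch */

/* res must be 32-byte aligned; len is assumed to be a multiple of 8 for brevity. */
void avx_product_stream(const float *f1, const float *f2, float *res, int len)
{
    int i;
    for (i = 0; i < len; i += 8) {
        /* fetch data a few iterations ahead; prefetching past the end is harmless */
        _mm_prefetch((const char *)(f1 + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char *)(f2 + i + 64), _MM_HINT_T0);

        __m256 v1 = _mm256_loadu_ps(f1 + i);
        __m256 v2 = _mm256_loadu_ps(f2 + i);
        _mm256_stream_ps(res + i, _mm256_mul_ps(v1, v2));   /* bypass the caches */
    }
    _mm_sfence();   /* order the non-temporal stores before returning */
}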

Page 32: More on SIMD instructions

Data gather

__m128 _mm_i32gather_ps (

float const* base_addr,

__m128i vindex,

const int scale

)

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

[Figure: each 32-bit word i of the packed __m128 result is loaded from M[base_addr + vindex[i]*scale], where vindex holds four 32-bit indices; a usage sketch follows below.]
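A usage sketch of the gather intrinsic (note that _mm_i32gather_ps requires AVX2, an assumption beyond the extensions listed earlier): load four non-contiguous floats, here every other element of an array:

#include <immintrin.h>   /* AVX2: _mm_i32gather_ps */

/* Gathers src[0], src[2], src[4] and src[6] into one 128-bit vector. */
__m128 gather_even(const float *src)
{
    __m128i idx = _mm_setr_epi32(0, 2, 4, 6);   /* four 32-bit indices */
    return _mm_i32gather_ps(src, idx, 4);       /* scale = 4 bytes per float */
}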

Page 33: More on SIMD instructions

Data permute

__m256 _mm256_permute_ps (__m256 a, int imm)

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm, and store the results in dst.

SELECT4(src, control){

CASE(control[1:0])

0: tmp[31:0] := src[31:0]

1: tmp[31:0] := src[63:32]

2: tmp[31:0] := src[95:64]

3: tmp[31:0] := src[127:96]

ESAC

RETURN tmp[31:0]

}

dst[31:0] := SELECT4(a[127:0], imm[1:0])

dst[63:32] := SELECT4(a[127:0], imm[3:2])

dst[95:64] := SELECT4(a[127:0], imm[5:4])

dst[127:96] := SELECT4(a[127:0], imm[7:6])

dst[159:128] := SELECT4(a[255:128], imm[1:0])

dst[191:160] := SELECT4(a[255:128], imm[3:2])

dst[223:192] := SELECT4(a[255:128], imm[5:4])

dst[255:224] := SELECT4(a[255:128], imm[7:6])

dst[MAX:256] := 0
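For example, following the SELECT4 pseudocode above, the control value _MM_SHUFFLE(0,1,2,3) = 0b00011011 reverses the four elements inside each 128-bit lane; a minimal sketch assuming AVX (the function name is hypothetical):

#include <immintrin.h>

/* Reverses the element order within each 128-bit lane of an 8-float vector. */
__m256 reverse_within_lanes(__m256 a)
{
    /* imm[1:0]=3, [3:2]=2, [5:4]=1, [7:6]=0 selects elements 3, 2, 1, 0 */
    return _mm256_permute_ps(a, _MM_SHUFFLE(0, 1, 2, 3));
}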

Page 34: More on SIMD instructions

Data shuffle

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm)

Shuffle single-precision (32-bit) floating-point elements in a and b within 128-bit lanes using the control in imm, and store the results in dst.

SELECT4(src, control){

CASE(control[1:0])

0: tmp[31:0] := src[31:0]

1: tmp[31:0] := src[63:32]

2: tmp[31:0] := src[95:64]

3: tmp[31:0] := src[127:96]

ESAC

RETURN tmp[31:0]

}

dst[31:0] := SELECT4(a[127:0], imm[1:0])

dst[63:32] := SELECT4(a[127:0], imm[3:2])

dst[95:64] := SELECT4(b[127:0], imm[5:4])

dst[127:96] := SELECT4(b[127:0], imm[7:6])

dst[159:128] := SELECT4(a[255:128], imm[1:0])

dst[191:160] := SELECT4(a[255:128], imm[3:2])

dst[223:192] := SELECT4(b[255:128], imm[5:4])

dst[255:224] := SELECT4(b[255:128], imm[7:6])

dst[MAX:256] := 0

Page 35: More on SIMD instructions

Data blend

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Blend packed single-precision (32-bit) floating-point elements from a and b using the control mask, and store the results in dst.

FOR j := 0 to 7

i := j*32

IF mask[i+31]

dst[i+31:i] := b[i+31:i]

ELSE

dst[i+31:i] := a[i+31:i]

FI

ENDFOR

dst[MAX:256] := 0
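As a usage sketch (assuming AVX; the function name is hypothetical), blending with a comparison mask gives a branch-free element-wise maximum:

#include <immintrin.h>

/* Element-wise maximum of a and b via compare-and-blend. */
__m256 max_blend(__m256 a, __m256 b)
{
    __m256 lt = _mm256_cmp_ps(a, b, _CMP_LT_OQ);   /* lane = all ones where a < b */
    return _mm256_blendv_ps(a, b, lt);             /* take b where the mask sign bit is set */
}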

Page 36: Next lesson

More on parallelism: Graphics Processing Units (GPUs)