Lesson 2 – Vector Instructions
VECTOR INSTRUCTIONS
Slides by: Pedro Tomás
Additional reading: "Computer Architecture: A Quantitative Approach", 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES
ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Advanced Computer Architectures, 2014
Outline
Flynn’s taxonomy of computer systems
Vector processing
Single Instruction, Multiple Data Stream (SIMD) processing
Implementation
Intel SIMD extensions
Examples
Flynn's taxonomy
In 1966 Michael J. Flynn introduced a classification of computer architectures:
Single Instruction, Single Data stream (SISD)
Each instruction processes a single data stream
Single Instruction, Multiple Data streams (SIMD)
Each instruction processes multiple data elements
Typical in vector architectures and ISA extensions
Multiple Instruction, Single Data stream (MISD)
Multiple instructions operate over a single data stream
Not typical; theoretically useful for fault-tolerant systems
Multiple Instruction, Multiple Data streams (MIMD)
Multiple programs process multiple sets of data
Typical in multicore architectures
SIMD processing benefits
SIMD architectures can exploit significant data-level parallelism for:
matrix-oriented scientific computing
media-oriented image and sound processors
SIMD is more energy-efficient than MIMD
Only needs to fetch one instruction per data operation
Makes SIMD attractive for personal mobile devices
SIMD allows the programmer to continue thinking sequentially
Introduction to SIMD processing
Basic idea:
Read sets of data into “vector registers”
Operate on those registers
Disperse the results back into memory
bits:  127..96    95..64    63..32    31..0
       WORD 3     WORD 2    WORD 1    WORD 0
         A3         A2        A1        A0
       +  B3      +  B2     +  B1     +  B0
       = A3+B3    = A2+B2   = A1+B1   = A0+B0
SIMD implementation
Special functional units allow parallel execution of SIMD instructions
Instruction latency depends on:
Instruction type
Length of operand vectors
Structural hazards
Data dependencies
Registers are typically controlled by the compiler
Used to hide memory latency
Leverage memory bandwidth
Intel SIMD instructions
Intel MMX (MultiMedia Extensions) [1996]
8 registers (MM0-MM7), each 64 bits wide
Each MMx register can hold:
2x 32-bit Integer, or
4x 16-bit Integer, or
8x 8-bit Integer
Instructions include:
Shift operations
Logical operations (AND, OR, XOR)
ADD / SUB with or without saturation
Multiply
Load/Store (pack/unpack)
More SIMD instructions
AMD 3DNow! Extensions [1997]
Extension of MMX instructions to support vector processing of single-precision (32-bit) floating point numbers
New instructions including:
min, max
square root
Intel SSE (Streaming SIMD Extensions) [1999]
8 registers (XMM0-XMM7), each 128 bits wide
Each XMMx register can hold:
4x 32-bit single precision floating point
More SIMD instructions
Intel SSE2 (Streaming SIMD Extensions 2) [2001]
8 registers (XMM0-XMM7), each 128 bits wide
XMMx registers can hold:
2x 64-bit double precision floating point
4x 32-bit single precision floating point
4x 32-bit / 8x 16-bit / 16x 8-bit integer
Instructions include:
Arithmetic operations (ADD, SUB, MUL, DIV, SQRT, MIN, MAX)
Move and shuffle
Bitwise logical operations
Cache pre-fetch instructions
More SIMD instructions
Intel SSE3 (Streaming SIMD Extensions 3) [2003]
Addition of new instructions (e.g., horizontal add)
Intel SSE4 (Streaming SIMD Extensions 4) [2006]
New instructions:
SAD (Sum of absolute differences)
Dot product
Intel AVX (Advanced Vector eXtensions) [2010]
Expand registers to support 256-bit vectors (YMM0-YMM15)
Intel 64 and IA-32 Architectures Software Developer Manuals: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Intel Architecture Instruction Set Extensions Programming Reference: http://download-software.intel.com/sites/default/files/319433-014.pdf
Intel Intrinsics Guide: http://software.intel.com/en-us/articles/intel-intrinsics-guide
Intel C and C++ Compilers: http://software.intel.com/en-us/c-compilers/
Programming with SIMD extensions

Programming with SIMD instructions
Two possibilities:
Using assembly language
// Multiply a constant vector by a constant scalar and return the result
for (i=0; i<4; i++)
Y[i] = X[i] * k;
Programming with SIMD instructions
Two possibilities:
Using assembly language
// Multiply a constant vector by a constant scalar and return the result
Vector4 SSE_Multiply(const Vector4 &Op_A, const float &Op_B)
{
Vector4 Ret_Vector;
// Create a 128-bit vector with four copies of Op_B
__m128 F = _mm_set1_ps(Op_B);
// Enter Assembly mode
__asm
{
MOV EAX, Op_A // Load pointer into CPU reg
MOVUPS XMM0, [EAX] // Move the vector to an SSE reg
MULPS XMM0, F // Multiply vectors
MOVUPS [Ret_Vector], XMM0 // Save the return vector
}
return Ret_Vector;
}
Programming with SIMD instructions
Two possibilities:
Using C/C++ intrinsic functions to hide assembly
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {
// load packed single precision floating point
vect1 = _mm_loadu_ps(faux1);
vect2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm_storeu_ps(raux, vectRes);
}
Programming with SIMD instructions
First steps to extract parallelism:
1. Resolve dependencies to make sure the loop is parallelizable
2. Apply loop unrolling as many times as the number of elements in the vector
3. Parallelize each new iteration using SIMD instructions
[Figure: dependence graphs over statements S1-S4, annotated with RAW(A), RAW(B), WAW(A) and WAR(C) dependencies]
Programming with SIMD instructions
Optional steps:
1. Apply further loop unrolling
Minimizes the overhead due to loop control instructions
May be limited by the number of available SIMD registers
2. Apply software pipelining
Minimizes conflicts due to instruction latencies
Requires fewer SIMD registers
3. Optimize the code to maximize the number of cache hits
Typically involves dividing the computation into blocks so as to maximize re-usage of data in the L1/L2/L3 caches
Example 1
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 1 – typical solution
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Example 1
Parallelizing
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
Example 1
Using VMIPS Assembly instructions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}

; initialize registers
ADD R1,R0,#f1
ADD R2,R0,#f2
ADD R3,R0,#res
LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
Example 1
Using VMIPS Assembly instructions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
ADDI R1,R0,#f1
ADDI R2,R0,#f2
ADDI R3,R0,#res
LD R4,#&len(R0) ; load len to R4
ADD R5,R0,R0 ; i = 0
BGE R5,R4,end ; check if i >= len
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#4 ; i += 4 elements
BLT R5,R4,loop ; loop if i < len
end:
Example 1
Using VMIPS Assembly instructions
How to deal with loops where the number of iterations is not a multiple of the vector length?
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
ADDI R1,R0,#f1
ADDI R2,R0,#f2
ADDI R3,R0,#res
LD R4,#&len(R0) ; load len to R4
ADD R5,R0,R0 ; i = 0 (counted in vectors)
BGE R5,R4,end ; check if len == 0
LSR R6,R4,2 ; divide len by 4 (vector length)
BGE R5,R6,final ; check if i >= (len>>2)
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#1 ; increment by 1 vector
BLT R5,R6,loop ; loop if i < (len>>2)
final: ...
Example 1
Using VMIPS Assembly instructions
How to deal with loops where the number of iterations is not a multiple of the vector length?
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
...
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#1 ; increment by 1 vector
BLT R5,R6,loop ; loop if i < (len>>2)
final: SLL R6,R5,2 ; element index i = 4 * (vectors processed)
BGE R6,R4,end ; done if no remaining elements
rem: L.S F0,0(R1) ; execute the remaining iterations one element at a time
L.S F1,0(R2)
MUL.S F0,F0,F1
S.S F0,0(R3)
ADDI R1,R1,#4 ; advance by one float (4 bytes)
ADDI R2,R2,#4
ADDI R3,R3,#4
ADDI R6,R6,#1
BLT R6,R4,rem ; loop if i < len
end:
Example 1
Using Intel intrinsic functions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 2, parallelization
void sse128_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
__m128 vect1, vect2, vectRes;
float *faux1=f1, *faux2=f2, *raux=res;
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {
// load packed single precision floating point
vect1 = _mm_loadu_ps(faux1);
vect2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm_storeu_ps(raux, vectRes);
}
/* compute remaining elements */
for (i=i<<2; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Example 1
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 3 – Using 256-bit SIMD instructions (AVX)
void sse256_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
__m256 vect1, vect2, vectRes;
float *faux1=f1, *faux2=f2, *raux=res;
/* compute AVX vector product, 8 products at a time */
for (i=0; i<(len>>3); i++,faux1+=8,faux2+=8,raux+=8) {
// load packed single precision floating point
vect1 = _mm256_loadu_ps(faux1);
vect2 = _mm256_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm256_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm256_storeu_ps(raux, vectRes);
}
/* compute remaining elements */
for (i=i<<3; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Typical problem
Load/Store instructions have maximum efficiency when operands are memory aligned
Two types of memory access instructions:
Memory-aligned load/store:
_mm_load_ps(address)
_mm_store_ps(address)
Memory-unaligned load/store:
_mm_loadu_ps(address)
_mm_storeu_ps(address)
Using the aligned instructions is more efficient, but may lead to a segmentation fault if the address is not 16-byte aligned
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
float sse128_vector_inner_product(float *f1, float *f2, int len) {
int i;
__m128 v1, v2, vectRes, vv;
float aux[4];
float *faux1=f1,*faux2=f2;
/* initialize all four words with zero */
vectRes = _mm_setzero_ps();
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4) {
// load packed single precision floating point
v1 = _mm_loadu_ps(faux1);
v2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vv = _mm_mul_ps(v1,v2);
// accumulate result
vectRes = _mm_add_ps(vectRes,vv);
}
_mm_storeu_ps(aux, vectRes);
aux[0] = aux[0] + aux[1] + aux[2] + aux[3];
/* compute remaining elements */
for (i=i<<2; i<len; i++) {
aux[0] += f1[i] * f2[i];
}
return aux[0];
}
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Speed-up (compiled with the Intel C/C++ Compiler, flag -O0):

[Figure: speedup vs. vector length (log2 scale, 8 to 524288) for three variants: SIMD-128, SIMD-256, and SIMD-256 + 4x loop unrolling; y-axis from 0.0 to 3.0]
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Problem:
For large vector lengths the computation becomes memory-bound, i.e., the execution time is constrained by the time needed to load data into the processor
In many cases, efficient cache management can improve the performance
More on SIMD instructions
Stream pre-fetch
_mm_prefetch( const char* address , int type )
Type = _MM_HINT_T0: load data into all cache levels
Type = _MM_HINT_T1: load data into all cache levels, except L1
Type = _MM_HINT_T2: load data into all cache levels, except L1 and L2
Type = _MM_HINT_NTA: non-temporal access (pre-fetch data while minimizing cache pollution)
Stream store
_mm_stream_ps (float* address, __m128 float_vector_4)
_mm256_stream_ps (float* address, __m256 float_vector_8)
_mm_stream_pd (double* address, __m128d double_vector_2)
_mm256_stream_pd (double* address, __m256d double_vector_4)
Store packed floats/doubles without loading data into the cache (data is written directly to memory, bypassing the cache hierarchy)
More on SIMD instructions
Data gather
__m128 _mm_i32gather_ps (
float const* base_addr,
__m128i vindex,
const int scale
)
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
vindex (__m128i), bits 127..96 | 95..64 | 63..32 | 31..0:  index3 | index2 | index1 | index0
result (__m128), same lanes:  M[base+index3*scale] | M[base+index2*scale] | M[base+index1*scale] | M[base+index0*scale]
More on SIMD instructions
Data permute
__m256 _mm256_permute_ps (__m256 a, int imm)
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm, and store the results in dst.
SELECT4(src, control){
CASE(control[1:0])
0: tmp[31:0] := src[31:0]
1: tmp[31:0] := src[63:32]
2: tmp[31:0] := src[95:64]
3: tmp[31:0] := src[127:96]
ESAC
RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm[1:0])
dst[63:32] := SELECT4(a[127:0], imm[3:2])
dst[95:64] := SELECT4(a[127:0], imm[5:4])
dst[127:96] := SELECT4(a[127:0], imm[7:6])
dst[159:128] := SELECT4(a[255:128], imm[1:0])
dst[191:160] := SELECT4(a[255:128], imm[3:2])
dst[223:192] := SELECT4(a[255:128], imm[5:4])
dst[255:224] := SELECT4(a[255:128], imm[7:6])
dst[MAX:256] := 0
More on SIMD instructions
Data shuffle
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm)
Shuffle single-precision (32-bit) floating-point elements in a and b within 128-bit lanes using the control in imm, and store the results in dst.
SELECT4(src, control){
CASE(control[1:0])
0: tmp[31:0] := src[31:0]
1: tmp[31:0] := src[63:32]
2: tmp[31:0] := src[95:64]
3: tmp[31:0] := src[127:96]
ESAC
RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm[1:0])
dst[63:32] := SELECT4(a[127:0], imm[3:2])
dst[95:64] := SELECT4(b[127:0], imm[5:4])
dst[127:96] := SELECT4(b[127:0], imm[7:6])
dst[159:128] := SELECT4(a[255:128], imm[1:0])
dst[191:160] := SELECT4(a[255:128], imm[3:2])
dst[223:192] := SELECT4(b[255:128], imm[5:4])
dst[255:224] := SELECT4(b[255:128], imm[7:6])
dst[MAX:256] := 0
More on SIMD instructions
Data blend
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
Blend packed single-precision (32-bit) floating-point elements from a and b using the control in mask, and store the results in dst.
FOR j := 0 to 7
i := j*32
IF mask[i+31]
dst[i+31:i] := b[i+31:i]
ELSE
dst[i+31:i] := a[i+31:i]
FI
ENDFOR
dst[MAX:256] := 0
More on parallelism:
Graphics Processing Units (GPUs)
Next lesson