Lesson 2 – Vector Instructions
VECTOR INSTRUCTIONS
Slides by: Pedro Tomás
Additional reading: "Computer Architecture: A Quantitative Approach", 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES
ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Advanced Computer Architectures, 2014
Outline
Flynn’s taxonomy of computer systems
Vector processing
Single Instruction, Multiple Data Stream (SIMD) processing
Implementation
Intel SIMD extensions
Examples
Flynn's taxonomy
In 1966 Michael J. Flynn introduced a classification of computer architectures:
Single Instruction, Single Data stream (SISD)
Each instruction processes a single data stream
Single Instruction, Multiple Data streams (SIMD)
Each instruction processes multiple data elements
Typical in vector architectures and ISA extensions
Multiple Instruction, Single Data stream (MISD)
Multiple instructions operate over a single data stream
Not typical; theoretically useful for fault-tolerant systems
Multiple Instruction, Multiple Data streams (MIMD)
Multiple programs process multiple sets of data
Typical in multicore architectures
SIMD processing benefits
SIMD architectures can exploit significant data-level parallelism for:
matrix-oriented scientific computing
media-oriented image and sound processors
SIMD is more energy-efficient than MIMD
Only needs to fetch one instruction per data operation
Makes SIMD attractive for personal mobile devices
SIMD allows the programmer to continue thinking sequentially
Introduction to SIMD processing
Basic idea:
Read sets of data into “vector registers”
Operate on those registers
Disperse the results back into memory
bits:  127..96    95..64    63..32    31..0
       WORD 3     WORD 2    WORD 1    WORD 0
         A3         A2        A1        A0
       +  B3      +  B2     +  B1     +  B0
       = A3+B3    = A2+B2   = A1+B1   = A0+B0
SIMD implementation
Special functional units allow parallel execution of SIMD instructions
Instruction latency depends on:
Instruction type
Length of operand vectors
Structural hazards
Data dependencies
Registers are typically controlled by the compiler
Used to hide memory latency
Leverage memory bandwidth
Intel SIMD instructions
Intel MMX (MultiMedia Extensions) [1996]
8 registers (MM0-MM7), each 64 bits wide
Each MMx register can hold:
2x 32-bit Integer, or
4x 16-bit Integer, or
8x 8-bit Integer
Instructions include:
Shift operations
Logical operations (AND, OR, XOR)
ADD / SUB with or without saturation
Multiply
Load/Store (pack/unpack)
More SIMD instructions
AMD 3DNow! Extensions [1997]
Extension of MMX instructions to support vector processing of single-precision (32-bit) floating point numbers
New instructions including:
min, max
square root
Intel SSE (Streaming SIMD Extensions) [1999]
8 registers (XMM0-XMM7), each 128 bits wide
Each XMMx register can hold:
4x 32-bit single precision floating point
More SIMD instructions
Intel SSE2 (Streaming SIMD Extensions 2) [2001]
8 registers (XMM0-XMM7), each 128 bits wide
XMMx registers can hold:
2x 64-bit double precision floating point
4x 32-bit single precision floating point
4x 32-bit / 8x 16-bit / 16x 8-bit integer
Instructions include:
Arithmetic operations (ADD, SUB, MUL, DIV, SQRT, MIN, MAX)
Move and shuffle
Bitwise logical operations
Cache pre-fetch instructions
More SIMD instructions
Intel SSE3 (Streaming SIMD Extensions 3) [2003]
Addition of new instructions (e.g., horizontal add)
Intel SSE4 (Streaming SIMD Extensions 4) [2006]
New instructions:
SAD (Sum of absolute differences)
Dot product
Intel AVX (Advanced Vector eXtensions) [2010]
Expand registers to support 256-bit vectors (YMM0-YMM15)
Intel 64 and IA-32 Architectures Software Developer Manuals: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Intel Architecture Instruction Set Extensions Programming Reference: http://download-software.intel.com/sites/default/files/319433-014.pdf
Intel Intrinsics Guide: http://software.intel.com/en-us/articles/intel-intrinsics-guide
Intel C and C++ Compilers: http://software.intel.com/en-us/c-compilers/
Programming with SIMD extensions

Programming with SIMD instructions
Two possibilities:
Using assembly language
// Multiply a constant vector by a constant scalar and return the result
for (i=0; i<4; i++)
Y[i] = X[i] * k;
Programming with SIMD instructions
Two possibilities:
Using assembly language
// Multiply a constant vector by a constant scalar and return the result
Vector4 SSE_Multiply(const Vector4 &Op_A, const float &Op_B)
{
Vector4 Ret_Vector;
// Create a 128-bit vector with four copies of Op_B
__m128 F = _mm_set1_ps(Op_B);
// Enter Assembly mode
__asm
{
MOV EAX, Op_A // Load pointer into CPU reg
MOVUPS XMM0, [EAX] // Move the vector to an SSE reg
MULPS XMM0, F // Multiply vectors
MOVUPS [Ret_Vector], XMM0 // Save the return vector
}
return Ret_Vector;
}
Programming with SIMD instructions
Two possibilities:
Using C/C++ intrinsic functions to hide assembly
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {
// load packed single precision floating point
vect1 = _mm_loadu_ps(faux1);
vect2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm_storeu_ps(raux, vectRes);
}
Programming with SIMD instructions
First steps to extract parallelism:
1. Resolve dependencies to make sure the loop is parallelizable
2. Apply loop unrolling as many times as the number of elements in the vector
3. Parallelize each new iteration using SIMD instructions
[Figure: dependence graphs over statements S1-S4, annotated with RAW(A), RAW(B), WAW(A) and WAR(C) dependencies]
Programming with SIMD instructions
Optional steps:
1. Apply further loop unrolling
Minimizes the overhead due to loop control instructions
May be limited by the number of available SIMD registers
2. Apply software pipelining
Minimizes conflicts due to instruction latencies
Requires fewer SIMD registers
3. Optimize the code to maximize the number of cache hits
Typically involves dividing the computation into blocks so as to maximize re-usage of data in the L1/L2/L3 caches
Example 1
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 1 – typical solution
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Example 1
Parallelizing
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
Example 1
Using VMIPS Assembly instructions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}

; initialize registers
ADD R1,R0,#f1
ADD R2,R0,#f2
ADD R3,R0,#res
LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
Example 1
Using VMIPS Assembly instructions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 1, loop unrolling for 128-bit SIMD instructions
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
ADDI R1,R0,#f1
ADDI R2,R0,#f2
ADDI R3,R0,#res
LD R4,#&len(R0) ; load len to R4
ADD R5,R0,R0 ; i = 0
BGE R5,R4,end ; check if i >= len
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#4 ; i += 4 elements
BLT R5,R4,loop ; loop if i < len
end:
Example 1
Using VMIPS Assembly instructions
How to deal with loops where the number of iterations is not a multiple of the vector length?
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
ADDI R1,R0,#f1
ADDI R2,R0,#f2
ADDI R3,R0,#res
LD R4,#&len(R0) ; load len to R4
ADD R5,R0,R0 ; i = 0 (counted in vectors)
BGE R5,R4,end ; check if len == 0
LSR R6,R4,2 ; divide len by 4 (vector length)
BGE R5,R6,final ; check if i >= (len>>2)
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#1 ; increment by 1 vector
BLT R5,R6,loop ; loop if i < (len>>2)
final: ...
Example 1
Using VMIPS Assembly instructions
How to deal with loops where the number of iterations is not a multiple of the vector length?
void normal_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
for (i=0; i<len; i+=4) {
res[i] = f1[i] * f2[i];
res[i+1] = f1[i+1] * f2[i+1];
res[i+2] = f1[i+2] * f2[i+2];
res[i+3] = f1[i+3] * f2[i+3];
}
}
; initialize registers
...
loop: LV.PS V1,0(R1)
LV.PS V2,0(R2)
MULVV.PS V1,V1,V2
SV.PS V1,0(R3)
ADDI R1,R1,#16 ; advance by 4 words (vector length) of 4 bytes (float)
ADDI R2,R2,#16
ADDI R3,R3,#16
ADDI R5,R5,#1 ; increment by 1 vector
BLT R5,R6,loop ; loop if i < (len>>2)
final: SLL R6,R5,2 ; element index i = 4 * (vectors processed)
BGE R6,R4,end ; done if no remaining elements
rem: L.S F0,0(R1) ; execute the remaining iterations one element at a time
L.S F1,0(R2)
MUL.S F0,F0,F1
S.S F0,0(R3)
ADDI R1,R1,#4 ; advance by one float (4 bytes)
ADDI R2,R2,#4
ADDI R3,R3,#4
ADDI R6,R6,#1
BLT R6,R4,rem ; loop if i < len
end:
Example 1
Using Intel intrinsic functions
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 2 – step 2, parallelization
void sse128_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
__m128 vect1, vect2, vectRes;
float *faux1=f1, *faux2=f2, *raux=res;
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4,raux+=4) {
// load packed single precision floating point
vect1 = _mm_loadu_ps(faux1);
vect2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm_storeu_ps(raux, vectRes);
}
/* compute remaining elements */
for (i=i<<2; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Example 1
Write a function to compute the element-wise product of two vectors v1 and v2:

VResult[i] = V1[i] × V2[i],   i ∈ {0, …, LEN−1}

Solution 3 – Using 256-bit SIMD instructions (AVX)
void sse256_vector_dot_product(float *f1, float *f2, float *res, int len) {
int i;
__m256 vect1, vect2, vectRes;
float *faux1=f1, *faux2=f2, *raux=res;
/* compute AVX vector product, 8 products at a time */
for (i=0; i<(len>>3); i++,faux1+=8,faux2+=8,raux+=8) {
// load packed single precision floating point
vect1 = _mm256_loadu_ps(faux1);
vect2 = _mm256_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vectRes = _mm256_mul_ps(vect1,vect2);
// store packed single precision floating point
_mm256_storeu_ps(raux, vectRes);
}
/* compute remaining elements */
for (i=i<<3; i<len; i++) {
res[i] = f1[i] * f2[i];
}
}
Typical problem
Load/Store instructions have maximum efficiency when operands are memory aligned
Two types of memory access instructions:
Memory-aligned load/store:
_mm_load_ps(address)
_mm_store_ps(address)
Memory-unaligned load/store:
_mm_loadu_ps(address)
_mm_storeu_ps(address)
Using the aligned instructions is more efficient, but may lead to a segmentation fault if the address is not 16-byte aligned
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
float sse128_vector_inner_product(float *f1, float *f2, int len) {
int i;
__m128 v1, v2, vectRes, vv;
float aux[4];
float *faux1=f1,*faux2=f2;
/* initialize all four words with zero */
vectRes = _mm_setzero_ps();
/* compute SSE vector product, 4 products at a time */
for (i=0; i<(len>>2); i++,faux1+=4,faux2+=4) {
// load packed single precision floating point
v1 = _mm_loadu_ps(faux1);
v2 = _mm_loadu_ps(faux2);
// multiply packed single precision floating point vectors
vv = _mm_mul_ps(v1,v2);
// accumulate result
vectRes = _mm_add_ps(vectRes,vv);
}
_mm_storeu_ps(aux, vectRes);
aux[0] = aux[0] + aux[1] + aux[2] + aux[3];
/* compute remaining elements */
for (i=i<<2; i<len; i++) {
aux[0] += f1[i] * f2[i];
}
return aux[0];
}
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Speed-up (compiled with the Intel C/C++ Compiler, flag -O0):

[Figure: speedup vs. vector length (log2 scale, 8 to 524288) for three variants: SIMD-128, SIMD-256, and SIMD-256 + 4x loop unrolling; y-axis from 0.0 to 3.0]
Example 2
Write a function to compute the inner product of two vectors v1 and v2:

VResult = Σ_{i=0}^{LEN−1} V1[i] × V2[i]
Problem:
For large vector lengths the computation becomes memory-bound, i.e., the execution time is constrained by the time needed to load data into the processor
In many cases, efficient cache management can improve the performance
More on SIMD instructions
Stream pre-fetch
_mm_prefetch( const char* address , int type )
Type = _MM_HINT_T0: load data into all cache levels
Type = _MM_HINT_T1: load data into all cache levels, except L1
Type = _MM_HINT_T2: load data into all cache levels, except L1 and L2
Type = _MM_HINT_NTA: non-temporal access (pre-fetch data while minimizing cache pollution)
Stream store
_mm_stream_ps (float* address, __m128 float_vector_4)
_mm256_stream_ps (float* address, __m256 float_vector_8)
_mm_stream_pd (double* address, __m128d double_vector_2)
_mm256_stream_pd (double* address, __m256d double_vector_4)
Store packed floats/doubles without loading data into the cache (data is written directly to memory, bypassing the cache hierarchy)
More on SIMD instructions
Data gather
__m128 _mm_i32gather_ps (
float const* base_addr,
__m128i vindex,
const int scale
)
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
vindex (__m128i), bits 127..96 | 95..64 | 63..32 | 31..0:  index3 | index2 | index1 | index0
result (__m128), same lanes:  M[base+index3*scale] | M[base+index2*scale] | M[base+index1*scale] | M[base+index0*scale]
More on SIMD instructions
Data permute
__m256 _mm256_permute_ps (__m256 a, int imm)
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm, and store the results in dst.
SELECT4(src, control){
CASE(control[1:0])
0: tmp[31:0] := src[31:0]
1: tmp[31:0] := src[63:32]
2: tmp[31:0] := src[95:64]
3: tmp[31:0] := src[127:96]
ESAC
RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm[1:0])
dst[63:32] := SELECT4(a[127:0], imm[3:2])
dst[95:64] := SELECT4(a[127:0], imm[5:4])
dst[127:96] := SELECT4(a[127:0], imm[7:6])
dst[159:128] := SELECT4(a[255:128], imm[1:0])
dst[191:160] := SELECT4(a[255:128], imm[3:2])
dst[223:192] := SELECT4(a[255:128], imm[5:4])
dst[255:224] := SELECT4(a[255:128], imm[7:6])
dst[MAX:256] := 0
More on SIMD instructions
Data shuffle
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm)
Shuffle single-precision (32-bit) floating-point elements in a and b within 128-bit lanes using the control in imm, and store the results in dst.
SELECT4(src, control){
CASE(control[1:0])
0: tmp[31:0] := src[31:0]
1: tmp[31:0] := src[63:32]
2: tmp[31:0] := src[95:64]
3: tmp[31:0] := src[127:96]
ESAC
RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm[1:0])
dst[63:32] := SELECT4(a[127:0], imm[3:2])
dst[95:64] := SELECT4(b[127:0], imm[5:4])
dst[127:96] := SELECT4(b[127:0], imm[7:6])
dst[159:128] := SELECT4(a[255:128], imm[1:0])
dst[191:160] := SELECT4(a[255:128], imm[3:2])
dst[223:192] := SELECT4(b[255:128], imm[5:4])
dst[255:224] := SELECT4(b[255:128], imm[7:6])
dst[MAX:256] := 0
More on SIMD instructions
Data blend
__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)
Blend packed single-precision (32-bit) floating-point elements from a and b using the control in mask, and store the results in dst.
FOR j := 0 to 7
i := j*32
IF mask[i+31]
dst[i+31:i] := b[i+31:i]
ELSE
dst[i+31:i] := a[i+31:i]
FI
ENDFOR
dst[MAX:256] := 0
More on parallelism:
Graphics Processing Units (GPUs)
Next lesson