SIMD - Peter Elderon

IBM Software Group, © 2015 IBM Corporation

SIMD, Spring 2015, Peter Elderon

IBM Software Group | Rational software


SIMD

Overview

PL/I use in SEARCH and VERIFY

PL/I use in INLIST

Possible PL/I use in the future


SIMD/Vector Overview

SIMD – Single Instruction Multiple Data

Also referred to as Vector Instructions

Each vector contains multiple data elements of a fixed size:

16 bytes

8 halfwords

4 fullwords

2 doublewords

1 quadword

So each vector is 16 bytes long

There are 32 vector registers, named V0, V1, …, V31


SIMD/Vector Overview

Vector instructions can operate on all of the elements in one or more vectors

So, a vector add of V1 and V2 as 16 one-byte integers into V0 would perform 16 adds in one instruction

[Figure: the 16 bytes of V1 are added pairwise to the 16 bytes of V2, giving the 16 bytes of V0]


SIMD/Vector Overview

And a vector add of V1 and V2 as 8 2-byte integers would perform 8 adds in one instruction

Bits in the instruction encode the element size, and the instruction mnemonic will reveal it as well

[Figure: the 8 halfwords of V1 are added pairwise to the 8 halfwords of V2, giving the 8 halfwords of V0]
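As a rough illustration of the two adds above, the following Python sketch models a 16-byte vector as a `bytes` value and adds it element by element. The name `vector_add`, the big-endian element layout, and the wraparound on overflow inside each element are assumptions of this sketch, not details taken from the slides.

```python
def vector_add(a: bytes, b: bytes, elem_size: int) -> bytes:
    """Add two 16-byte vectors element-wise; elem_size is 1, 2, 4, 8 or 16 bytes."""
    assert len(a) == len(b) == 16 and elem_size in (1, 2, 4, 8, 16)
    mask = (1 << (8 * elem_size)) - 1          # keep each sum inside its own element
    out = bytearray()
    for i in range(0, 16, elem_size):
        x = int.from_bytes(a[i:i + elem_size], "big")
        y = int.from_bytes(b[i:i + elem_size], "big")
        out += ((x + y) & mask).to_bytes(elem_size, "big")
    return bytes(out)

v1 = bytes(range(16))      # 00 01 02 ... 0F
v2 = bytes([0xFF] * 16)    # every byte is FF

as_bytes = vector_add(v1, v2, 1)      # 16 independent one-byte adds
as_halfwords = vector_add(v1, v2, 2)  # 8 independent two-byte adds
```

The same 32 input bytes give different results for the two element sizes, because a carry never crosses an element boundary but does propagate inside an element.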


SIMD/Vector Overview

Most vector instructions do not set/change the condition code

Instead, a separate vector instruction can be used to extract a result

Vector loads and stores

Can handle any byte alignment

Are most efficient with 8 byte boundaries

There is also a vector load instruction that will load only those bytes up to the next 4K page boundary

Very useful in handling null-terminated strings
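To see why a load that stops at the page boundary helps with null-terminated strings, here is a small Python model. `load_to_boundary` and `c_strlen` are hypothetical names, memory is modeled as a flat byte buffer, and the 16-byte width and 4096-byte block come from the slides; it sketches the idea only, not the instruction's actual encoding or operands.

```python
PAGE = 4096

def load_to_boundary(memory, addr, width=16, block=PAGE):
    """Return up to `width` bytes starting at addr, never crossing the next block boundary."""
    boundary = (addr // block + 1) * block
    limit = min(addr + width, boundary, len(memory))
    return bytes(memory[addr:limit])

def c_strlen(memory, addr):
    """Length of the null-terminated string at addr, scanning one bounded load at a time."""
    pos = addr
    while True:
        chunk = load_to_boundary(memory, pos)
        if not chunk:
            raise ValueError("no terminator found in modeled memory")
        hit = chunk.find(0)
        if hit >= 0:
            return pos + hit - addr
        pos += len(chunk)

memory = bytearray(2 * PAGE)
memory[4090:4100] = b"ABCDEFGHIJ"           # string straddles the page boundary at 4096
first_load = load_to_boundary(memory, 4090)
length = c_strlen(memory, 4090)
```

The first load returns only the six bytes before the boundary, so the scan never touches the following page unless the string really continues into it.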


Overlaid Vector/Floating-Point registers

The first 16 of the 32 vector registers overlay the 16 FPRs

Bits 0:63 of SIMD registers 0-15 will correspond to FPRs 0-15

When using an FPR, bits 64:127 of the corresponding vector register will become unpredictable

[Figure: registers 0 to 31 shown against bits 0 to 127; the FPRs occupy bits 0 to 63 of registers 0 to 15]


Application Considerations

Be very aware that any use of an FPR will change all 16 bytes of the corresponding VR

Linkage convention (what the caller may assume across a call)

– VRs 0 to 7 are volatile

– VRs 8 to 15

– bytes 0-7 are non-volatile

– bytes 8-15 are volatile

– VRs 16 to 23 are non-volatile

– VRs 24 to 31 are volatile


Instruction Overview

There are 4 classes of vector instructions

Support instructions

Integer instructions

Floating-point instructions

String instructions

The (many) integer instructions allow for add, multiply, compare, logical and, shifts, and so on

Currently not exploited by PL/I

The floating-point instructions support only IEEE float binary

hence of little use to PL/I (or COBOL)


String instructions - VFAE

Vector Find Any Equal

VFAE v1,v2,v3,m4,m5

Compares v2 to v3 from left to right looking for an element in v2 equal to any of the elements in v3

Stores the byte index (0-15) of the leftmost hit or 16 if none in v1

m4 is a 4-bit nibble indicating the element size

0 – byte for 16 * 16 compares

1 – halfword for 8 * 8 compares

2 – word for 4 * 4 compares

m5 is a 4-bit nibble providing some variations


String instructions - VFAE

Vector Find Any Equal

VFAE v1,v2,v3,m4,m5

If m5 is ‘1…’b, then

Compares v2 to v3 from left to right looking for an element in v2 that is not equal to any of the elements in v3, that is, one that matches none of them

Stores the byte index (0-15) of the leftmost hit or 16 if none in v1

So, m5 equal to ‘0…’b is useful for SEARCH

And, m5 equal to ‘1…’b is useful for VERIFY

When m5 is ‘1…’b, the instruction is also known as

Vector Find Any Not Equal


String instructions - VFAE

Vector Find Any Equal

VFAE v1,v2,v3,m4,m5

If m5 is ‘..1.’b, then

The comparison will stop if an element in v2 is equal to zero

Stores the byte index (0-15) of the leftmost hit or 16 if none in v1

So, m5 equal to ‘..1.’b is useful for VARYINGZ strings

When m5 is ‘..1.’b, the instruction is also known as

Vector Find Any Equal or Zero

Vector Find Any Not Equal or Zero
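The variations described on the last three slides can be summarized in a simplified Python model. The function `vfae` and its keyword arguments `invert` (standing for m5 = ‘1…’b) and `zero_search` (standing for m5 = ‘..1.’b) are names made up for this sketch, and it returns the index directly rather than storing it in a register; treat it as a reading aid under those assumptions, not as a description of the hardware.

```python
def vfae(v2: bytes, v3: bytes, elem_size: int = 1,
         invert: bool = False, zero_search: bool = False) -> int:
    """Byte index (0-15) of the leftmost hit in v2, or 16 if there is none."""
    assert len(v2) == len(v3) == 16 and elem_size in (1, 2, 4)
    candidates = {v3[i:i + elem_size] for i in range(0, 16, elem_size)}
    zero = bytes(elem_size)
    for i in range(0, 16, elem_size):
        elem = v2[i:i + elem_size]
        if zero_search and elem == zero:       # stop at a terminating zero element
            return i
        if (elem in candidates) != invert:     # equal to any (or, inverted, to none)
            return i
    return 16

HEX = b"0123456789ABCDEF"                      # exactly 16 one-byte elements

search_hit = vfae(b"xyz7" + b"q" * 12, HEX)                         # SEARCH flavor
verify_hit = vfae(b"12AG" + b"0" * 12, HEX, invert=True)            # VERIFY flavor
no_hit = vfae(b"q" * 16, HEX)
zero_hit = vfae(b"xy\x007" + b"q" * 12, HEX, zero_search=True)      # VARYINGZ flavor
halfword_hit = vfae(b"\x00Z\x00B" + b"\x00Z" * 6, b"\x00A\x00B" * 4, elem_size=2)
```

With byte elements, one call stands for 16 * 16 comparisons; with halfword elements, 8 * 8, and the result is still a byte index.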


String instructions - VSTRC

Vector String Range Compare

VSTRC v1,v2,v3,v4,m5,m6

Compares each element in v2 against every range defined by an even-odd pair of elements in v3, with each bound compared according to the indicator bits in the corresponding even-odd pair of elements in v4

Stores the byte index (0-15) of the leftmost hit or 16 if none in v1

m5 is a 4-bit nibble indicating the element size

0 – byte for 16 sets of 8 range compares

1 – halfword for 8 sets of 4 range compares

2 – word for 4 sets of 2 range compares

m6 is a 4-bit nibble providing variations a la VFAE


String instructions - VSTRC

Typically the even-odd element pairs defining the comparisons will be constants indicating GE and LE

But they could consist of EQ and EQ for a degenerate range

Duplicate ranges are allowed

The number of range pairs may also be fewer than would fill the vector

This instruction is very useful for tests of WCHAR

As with VFAE, the m6 value determines

if it implements SEARCH or VERIFY

if it also looks for a terminating zero (for VARYINGZ)
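A simplified Python model of the range compare may help here. In this sketch the source vector is a list of element values, the bounds are an even-odd list, and the comparison for each bound is given by a symbolic name (`"GE"`, `"LE"`, `"EQ"`) instead of indicator bits. The function name `vstrc` and the keywords `invert` and `zero_search` are assumptions made for illustration.

```python
import operator

OPS = {"GE": operator.ge, "LE": operator.le, "EQ": operator.eq}

def vstrc(v2, bounds, ops, elem_size=1, invert=False, zero_search=False):
    """Byte index of the leftmost element of v2 that is in some range, or 16 if none.

    bounds and ops are even-odd lists: pair k is (bounds[2k], bounds[2k+1]).
    """
    assert len(bounds) == len(ops) and len(bounds) % 2 == 0
    for i, elem in enumerate(v2):
        if zero_search and elem == 0:
            return i * elem_size
        in_range = any(
            OPS[ops[k]](elem, bounds[k]) and OPS[ops[k + 1]](elem, bounds[k + 1])
            for k in range(0, len(bounds), 2)
        )
        if in_range != invert:
            return i * elem_size
    return 16

# Three ranges, 0-9, A-F and a-f, each as a GE/LE pair (fewer pairs than the vector could hold).
bounds = [ord(c) for c in "09AFaf"]
ops = ["GE", "LE"] * 3

not_hex_at = vstrc(list(b"12aFzz0000000000"), bounds, ops, invert=True)
hex_at = vstrc(list(b"xyz7qqqqqqqqqqqq"), bounds, ops)
# A degenerate EQ/EQ range on halfword elements: look for an underscore.
underscore_at = vstrc([0x41, 0x5F] + [0x41] * 6, [0x5F, 0x5F], ["EQ", "EQ"], elem_size=2)
```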


SEARCH and VERIFY of CHAR

SEARCH and VERIFY of CHAR are done inline if arch < 11

But the old TRT instruction is used only if the first argument is NONVARYING with length known at compile time

Otherwise the characters are tested one at a time

But with the vector instructions, we do much better


SEARCH and VERIFY of CHAR

For example, this simple code tests if a VARYING CHAR string is hex

ishex: proc( s );
  dcl s  char(*) var;
  dcl x  char value( '0123456789ABCDEF' );
  dcl sx fixed bin(31);
  sx = verify( s, x );
  if sx > 0 then ...

It can now be done with a loop of vector find-any-not-equal


SEARCH and VERIFY of CHAR

Namely, with a series of find-any-not-equal (FANE) tests, each covering up to 16 bytes

E720 1000 0006 VL v2,+CONSTANT_AREA(,r1,0)

E3F0 E000 0095 LLH r15,_shadow2(,r14,0)

4120 E002 LA r2,#AddressShadow(,r14,2)

ECFC 003F 007E CIJNH r15,H'0',@1L9

@1L5 DS 0H

A7FE 0010 CHI r15,H'16'

4140 0010 LA r4,16

B9F2 404F LOCRL r4,r15

B9FA 00E2 ALRK r14,r2,r0

E704 E000 0037 VLL v0,r4,_shadow1(r14,0)

E700 2080 0082 VFAE v0,v0,v2,b'0000',b'1000'

E7E0 0001 2021 VLGV r14,v0,1,2

EC4E 000C 2076 CRJH r4,r14,@1L6

A7FA FFF0 AHI r15,H'-16'

A70A 0010 AHI r0,H'16'

ECFC 0024 007E CIJNH r15,H'0',@1L9

A7F4 FFE5 J @1L5

@1L6 DS 0H
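The control flow of the listing above can be approximated in Python as follows. This is a sketch of the loop structure only: the function name `verify` and its variables are invented here, the comments map only loosely onto the instructions, and the character set is ASCII rather than whatever the compiled program uses.

```python
def verify(s: bytes, valid: bytes) -> int:
    """PL/I-style VERIFY: 1-based position of the first byte of s not in valid, else 0."""
    assert 1 <= len(valid) <= 16               # the value string must fit in one vector
    allowed = set(valid)
    offset, remaining = 0, len(s)
    while remaining > 0:
        n = min(remaining, 16)                 # cap the chunk at 16 bytes
        chunk = s[offset:offset + n]           # length-limited load of the next chunk
        # find-any-not-equal: index of the leftmost byte outside the set, or 16
        hit = next((i for i, b in enumerate(chunk) if b not in allowed), 16)
        if hit < n:                            # only a hit inside the loaded bytes counts
            return offset + hit + 1
        remaining -= 16
        offset += 16
    return 0

HEX = b"0123456789ABCDEF"
all_hex = verify(b"DEADBEEF", HEX)
bad_late = verify(b"0123456789ABCDEF0G", HEX)
bad_first = verify(b"xA", HEX)
```

Each pass through the loop tests up to 16 bytes, so an 18-byte string needs two passes instead of 18 single-character tests.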


SEARCH and VERIFY of CHAR

In this example, the value string ‘0123456789ABCDEF’ was 16 bytes long

So it was loaded into a vector register that was repeatedly tested

If the value string were ‘0123456789ABCDEFabcdef’, this would not work, since its 22 bytes do not fit in one 16-byte vector

But then the vector range compare instruction could be used

The test string would then be compared 16 bytes at a time to see if any byte was not in one of the ranges 0-9, A-F or a-f


SEARCH and VERIFY of CHAR

So, under arch(11), for SEARCH(x,y) and VERIFY(x,y) where x is char, if y is a literal with 1 <= length(y) <= 16, then the compiler will generate code using the vector find-any-equal (VFAE) instruction (or VFANE for verify)

If y is a literal that the compiler can regroup as 8 or fewer ranges, then it will generate code using the vector string range compare instruction

And this will be done if x is NONVARYING, VARYING or VARYINGZ
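What "regroup as ranges" might mean can be sketched in a few lines of Python. The helpers `to_ranges` and `choose_strategy` are hypothetical, the order of the two checks simply follows the order on the slide, and the `"not covered here"` result stands for whatever the compiler does in other cases, which the slides do not describe. ASCII codes are used for the demonstration.

```python
def to_ranges(chars: bytes) -> list:
    """Collapse a set of byte values into sorted (low, high) runs of consecutive codes."""
    ranges = []
    for code in sorted(set(chars)):
        if ranges and code == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], code)    # extend the current run
        else:
            ranges.append((code, code))           # start a new run
    return ranges

def choose_strategy(literal: bytes) -> str:
    """Pick an instruction family for SEARCH/VERIFY of CHAR against a literal."""
    if 1 <= len(literal) <= 16:
        return "find-any-equal"
    if len(to_ranges(literal)) <= 8:
        return "range-compare"
    return "not covered here"

short_set = choose_strategy(b"0123456789ABCDEF")
long_set = choose_strategy(b"0123456789ABCDEFabcdef")
hex_ranges = to_ranges(b"0123456789ABCDEFabcdef")
```

The 22-character hex literal is too long for one vector of bytes but collapses to three ranges, well under the limit of 8.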


SEARCH and VERIFY of WIDECHAR

SEARCH and VERIFY of WIDECHAR are done via library calls if arch < 11

But under arch(11), for SEARCH(x,y) and VERIFY(x,y) where x is widechar, when y is a literal with 1 <= length(y) <= 8, the compiler will generate code using the vector find-any-equal (VFAE) instruction (or VFANE for verify)

Length(y) must be <= 8 since the set of wchars to be tested has to fit in one vector of 16 bytes


SEARCH and VERIFY of WIDECHAR

For example, this simple code tests if a UTF-16 string is octal

woctal: proc( s );
  dcl s  wchar(*) var;
  dcl o  wchar value( '01234567' );
  dcl sx fixed bin(31);
  sx = verify( s, o );
  if sx > 0 then ...

It is done with an expensive library call with ARCH <= 10


SEARCH and VERIFY of WIDECHAR

With ARCH(11), the vector instruction facility is used to inline it as

E720 1000 0006 VL v2,+CONSTANT_AREA(,r1,0)

@1L5 DS 0H

A7FE 0010 CHI r15,H'16'

4140 0010 LA r4,16

B9F2 404F LOCRL r4,r15

B9FA 00E2 ALRK r14,r2,r0

E704 E000 0037 VLL v0,r4,_shadow1(r14,0)

E700 2080 1082 VFAE v0,v0,v2,b'0001',b'1000'

E7E0 0001 2021 VLGV r14,v0,1,2

EC4E 000C 2076 CRJH r4,r14,@1L6

A7FA FFF0 AHI r15,H'-16'

A70A 0010 AHI r0,H'16'

ECFC 0026 007E CIJNH r15,H'0',@1L9

A7F4 FFE5 J @1L5

@1L6 DS 0H


SEARCH and VERIFY of WIDECHAR

Here the value string ‘01234567’ has 8 wide characters

So it can be loaded as a vector of 8 2-byte integers

And then the test string can be compared against the vector with up to 8 characters tested at a time

If the value string had more than 8 wchars, FAE and FANE could not be used

But a vector string range compare could be used instead


SEARCH and VERIFY of WIDECHAR

For example, this simple code tests if a UTF-16 string is numeric

wnumb: proc( s );
  dcl s  wchar(*) var;
  dcl n  wchar value( '0123456789' );
  dcl sx fixed bin(31);
  sx = verify( s, n );
  if sx > 0 then ...

It is done with an expensive library call with ARCH <= 10


SEARCH and VERIFY of WIDECHAR

With ARCH(11), the vector instruction facility is used to inline it as

E700 E000 0006 VL v0,+CONSTANT_AREA(,r14,0)

E740 E010 0006 VL v4,+CONSTANT_AREA(,r14,16)

@1L2 DS 0H

A74E 0010 CHI r4,H'16'

4150 0010 LA r5,16

B9F2 4054 LOCRL r5,r4

B9FA F0E2 ALRK r14,r2,r15

E725 E000 0037 VLL v2,r5,_shadow1(r14,0)

E722 0180 408A VSTRC v2,v2,v0,v4,b'0001',b'1000'

E7E2 0001 2021 VLGV r14,v2,1,2

EC5E 000D 2076 CRJH r5,r14,@1L3

A74A FFF0 AHI r4,H'-16'

A7FA 0010 AHI r15,H'16'

EC4C 000E 007E CIJNH r4,H'0',@1L4

A7F4 FFE5 J @1L2


SEARCH and VERIFY of WIDECHAR

Where the range and comparison vectors are

0030_0039 0000_0000 0000_0000 0000_0000

A000_C000 0000_0000 0000_0000 0000_0000

If the value string were ‘_0123456789’, then the vectors would be:

0030_0039 005F_005F 0000_0000 0000_0000

A000_C000 8000_8000 0000_0000 0000_0000

Where the second “range” is a degenerate range to test for a wchar _
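The two pairs of vectors above can be reproduced with a short Python sketch. The halfword constants `GE = 0xA000`, `LE = 0xC000` and `EQ = 0x8000` are read off the slide itself (the first range uses A000_C000 and the degenerate range uses 8000_8000); how the individual bits are defined is not stated there, so nothing beyond those three values is assumed. `build_vectors` and `fmt` are hypothetical helper names.

```python
GE, LE, EQ = 0xA000, 0xC000, 0x8000   # control halfwords as they appear on the slide

def build_vectors(ranges, slots=4):
    """Return (bounds, controls) as lists of halfwords for up to `slots` wchar ranges."""
    assert len(ranges) <= slots
    bounds, controls = [], []
    for low, high in ranges:
        bounds += [low, high]
        controls += [EQ, EQ] if low == high else [GE, LE]
    unused = 2 * (slots - len(ranges))     # unused pairs are left as zeros
    return bounds + [0] * unused, controls + [0] * unused

def fmt(halfwords):
    """Format halfwords in the slide's style: pairs joined by an underscore."""
    return " ".join(
        f"{halfwords[i]:04X}_{halfwords[i + 1]:04X}" for i in range(0, len(halfwords), 2)
    )

digits_bounds, digits_controls = build_vectors([(0x0030, 0x0039)])
ident_bounds, ident_controls = build_vectors([(0x0030, 0x0039), (0x005F, 0x005F)])
```

Here `fmt(digits_bounds)` and `fmt(digits_controls)` give the first pair of lines, and the `ident_` values give the pair for the value string that also contains the underscore.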


SEARCH and VERIFY of WIDECHAR

For SEARCH of wchar under arch(11), when y is a literal with 16 or fewer ranges, the compiler will generate code using the vector string range compare (VSTRC) instruction

If there are more than 4 ranges, the source bytes are loaded once and repeated VSTRC tests are made against that source vector until a range is hit


SEARCH and VERIFY of WIDECHAR

For VERIFY of wchar under arch(11), when y is a literal with 4 or fewer ranges, the compiler will generate code using the inverse vector string range compare (VSTRC) instruction

However, loading the source once and using repeated inverse VSTRC tests against that vector won't work as simply with VERIFY (unlike SEARCH)


SEARCH and VERIFY of WIDECHAR

For example, suppose the literal defines a set of 6 ranges

c-d g-h k-l o-p s-t w-x

To perform SEARCH of ‘quvx’ against this set of ranges, we can simply test to see if any of the characters fall in one of the first 4 ranges, and if not, in one of the next 4, and so on; i.e., test first against

c-d g-h k-l o-p

and then, if necessary, test against

s-t w-x


SEARCH and VERIFY of WIDECHAR

But for VERIFY of ‘clot’ against this set of ranges, we would find a character not in the first 4 ranges and 3 characters not in the next 2 ranges

c-d g-h k-l o-p s-t w-x

That would lead us to produce a non-zero result

But every character is in the full set of ranges and we want a result of zero!

The key here is that every vector string range compare instruction is comparing multiple characters against a set of ranges, unlike a traditional, simple test of a single character against a set of ranges
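The difference between the two cases can be checked with a small Python model of the ‘quvx’ and ‘clot’ examples. All function names are invented for this sketch, the text is assumed to fit in a single vector, and `search_by_groups` takes the minimum over the groups so that the leftmost position is always reported, which is a choice made here rather than something the slides specify.

```python
RANGES = [("c", "d"), ("g", "h"), ("k", "l"), ("o", "p"), ("s", "t"), ("w", "x")]

def in_any(ch, ranges):
    return any(low <= ch <= high for low, high in ranges)

def first_in(text, ranges):
    """Normal range compare: index of the first character inside a range, or len(text)."""
    return next((i for i, ch in enumerate(text) if in_any(ch, ranges)), len(text))

def first_not_in(text, ranges):
    """Inverse range compare: index of the first character outside every range, or len(text)."""
    return next((i for i, ch in enumerate(text) if not in_any(ch, ranges)), len(text))

def groups(ranges, size=4):
    return [ranges[i:i + size] for i in range(0, len(ranges), size)]

def search_by_groups(text, ranges):
    hit = min(first_in(text, g) for g in groups(ranges))
    return hit + 1 if hit < len(text) else 0

def naive_verify_by_groups(text, ranges):
    """Wrong on purpose: applies the inverse test to each group of 4 ranges separately."""
    for g in groups(ranges):
        hit = first_not_in(text, g)
        if hit < len(text):
            return hit + 1
    return 0

def verify(text, ranges):
    hit = first_not_in(text, ranges)
    return hit + 1 if hit < len(text) else 0

search_result = search_by_groups("quvx", RANGES)
naive_result = naive_verify_by_groups("clot", RANGES)
correct_result = verify("clot", RANGES)
```

Splitting the ranges works for SEARCH because "in any range" is an OR over the groups, whereas VERIFY needs "outside every range", which a per-group inverse test cannot decide on its own.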


SEARCH and VERIFY of WIDECHAR

This would suggest limiting VERIFY of wchar to 4 ranges

But that restriction is worse than it might seem

Testing no more than 8 ranges for char may be ok since there are only 256 char values and 8 ranges of 16 cover half of that

But there are 64K wchar values and 4 ranges won’t cover much of that

And one major European bank runs some important code (over 1M times a day) that has a VERIFY against the following string


SEARCH and VERIFY of WIDECHAR

With 16 ranges

dcl test_chars wchar value(

'002B002C002D002E'wx

|| '0030003100320033003400350036003700380039'wx

|| '0660066106620663066406650666066706680669'wx

|| '06F006F106F206F306F406F506F606F706F806F9'wx

|| '0966096709680969096A096B096C096D096E096F'wx

|| '09E609E709E809E909EA09EB09EC09ED09EE09EF'wx

|| '0A660A670A680A690A6A0A6B0A6C0A6D0A6E0A6F'wx

|| '0AE60AE70AE80AE90AEA0AEB0AEC0AED0AEE0AEF'wx

|| '0B660B670B680B690B6A0B6B0B6C0B6D0B6E0B6F'wx

|| '0BE70BE80BE90BEA0BEB0BEC0BED0BEE0BEF'wx

|| '0C660C670C680C690C6A0C6B0C6C0C6D0C6E0C6F'wx

|| '0CE60CE70CE80CE90CEA0CEB0CEC0CED0CEE0CEF'wx

|| '0D660D670D680D690D6A0D6B0D6C0D6D0D6E0D6F'wx

|| '0E500E510E520E530E540E550E560E570E580E59'wx

|| '0ED00ED10ED20ED30ED40ED50ED60ED70ED80ED9'wx

|| '0F200F210F220F230F240F250F260F270F280F29'wx );


SEARCH and VERIFY of WIDECHAR

However, we can finesse this problem:

We flip VERIFY( x, y ) to SEARCH( x, not y )

And so VERIFY and SEARCH for widechar will both be inlined if the number of ranges is 16 or less

Although for VERIFY this may require testing 17 ranges


SEARCH and VERIFY of WIDECHAR

For example, suppose the literal defines a set of 6 ranges

c-d g-h k-l o-p s-t w-x

VERIFY against this is the same as SEARCH against the “missing” ranges, and so we can inline this via two normal (non-inverse) VSTRC tests against this set of ranges

a-b e-f i-j m-n q-r u-v y-z

But note that the 6 ranges when flipped became 7 ranges; hence if there are 16 ranges, we might have to test against 17
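The flip can be sketched in Python by computing the complement of a sorted, non-overlapping list of ranges. The helper names are hypothetical, and the lowercase alphabet is used as the universe only to match the slide; for real wchar data the universe would be all 64K values.

```python
def complement(ranges, low, high):
    """Ranges covering every code in [low, high] that the sorted, disjoint input misses."""
    out, nxt = [], low
    for a, b in ranges:
        if a > nxt:
            out.append((nxt, a - 1))   # the gap before this range
        nxt = b + 1
    if nxt <= high:
        out.append((nxt, high))        # the gap after the last range
    return out

def in_any(code, ranges):
    return any(a <= code <= b for a, b in ranges)

def search(text, ranges):
    return next((i + 1 for i, ch in enumerate(text) if in_any(ord(ch), ranges)), 0)

def verify(text, ranges):
    return next((i + 1 for i, ch in enumerate(text) if not in_any(ord(ch), ranges)), 0)

RANGES = [(ord(p[0]), ord(p[1])) for p in ("cd", "gh", "kl", "op", "st", "wx")]
MISSING = complement(RANGES, ord("a"), ord("z"))
missing_names = [f"{chr(a)}-{chr(b)}" for a, b in MISSING]
```

For text drawn from that universe, `verify(text, RANGES)` and `search(text, MISSING)` give the same position. A set of n ranges that touches neither end of the universe always flips to n + 1 ranges, which is where the 17 comes from.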


INLIST

This built-in function is useful in determining if a value belongs to a set of values and allows you to put a SELECT in the middle of an IF

It requires a minimum of 3 arguments and accepts a maximum of 64

INLIST( x, a, b, c, … ) is equivalent to ( x = a ) | ( x = b ) | ( x = c ) …

All the arguments must have computational type

The compiler will optimize this when possible


INLIST

If the first argument is “nice” and the rest are all similar, “close” values, then the compiler will turn the inlist reference into a branch table. For example,

inlist( x, 2, 3, 5, 7, 11, 13, 17, 19 )

would become a branch table if x is FIXED BIN(31) or if x is FIXED DEC(5)

And if all are CHAR(1), a simple table look-up is generated

But if all are CHAR(2) or CHAR(4) or WCHAR(1) or WCHAR(2), a series of compares is generated (since the values are unlikely to be “close” and the branch table would be huge)


INLIST

But, consider this snippet of code to validate a 2-byte country code

checkcc:
  proc( countryCode )
  options(nodescriptor);
  dcl countryCode char(2);
  if inlist( countryCode,
             'AT', 'DE', 'CH', 'NL', 'DK', 'FI', 'SE', 'NO' ) then;
  else
    signal error;


INLIST

Under arch(10) and opt(3), it becomes 8 compares and branches

5810 1000 L r1,_addrCOUNTRYCODE(,r1,0)

4800 1000 LH r0,_shadow1(,r1,0)

A70E C1E3 CHI r0,H'-15901'

A784 0026 JE @1L13

A70E C4C5 CHI r0,H'-15163'

A784 0022 JE @1L13

A70E C3C8 CHI r0,H'-15416'

A784 001E JE @1L13

A70E D5D3 CHI r0,H'-10797'

A784 001A JE @1L13

A70E C4D2 CHI r0,H'-15150'

A784 0016 JE @1L13

A70E C6C9 CHI r0,H'-14647'

A784 0012 JE @1L13

A70E E2C5 CHI r0,H'-7483'

A784 000E JE @1L13

A70E D5D6 CHI r0,H'-10794'

A784 000A JE @1L13
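The halfword immediates in this listing are simply the two EBCDIC bytes of each country code read as a signed 16-bit integer. The Python sketch below checks this, assuming code page 037 (Python's `cp037` codec) for the encoding; the helper name `halfword_immediate` is made up for the example.

```python
def halfword_immediate(code: str) -> int:
    """Two-character code -> signed 16-bit value of its EBCDIC (cp037) bytes."""
    raw = code.encode("cp037")                 # 'AT' -> C1 E3
    assert len(raw) == 2
    return int.from_bytes(raw, "big", signed=True)

codes = ["AT", "DE", "CH", "NL", "DK", "FI", "SE", "NO"]
immediates = {code: halfword_immediate(code) for code in codes}
```

For instance, 'AT' encodes as X'C1E3', which is -15901 as a signed halfword, matching the first CHI in the listing.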


INLIST

But under arch(11) and opt(3), one vector-find-any-equal and one branch do it faster and more simply!

5810 1000 L r1,_addrCOUNTRYCODE(,r1,0)

4100 0002 LA r0,2

E700 1000 0037 VLL v0,r0,_shadow1(r1,0)

E720 E000 0006 VL v2,+CONSTANT_AREA(,r14,0)

E700 2000 1082 VFAE v0,v0,v2,b'0001',b'0000'

E700 0001 2021 VLGV r0,v0,1,2

EC08 000B 007E CIJE r0,H'0',@1L4


INLIST

And if there were 16 codes to be tested, then instead of 16 compares and branches, under arch(11), 2 vector-find-any-equal and 2 branches suffice!

5810 1000 L r1,_addrCOUNTRYCODE(,r1,0)

4100 0002 LA r0,2

C0E0 0000 LARL r14,F'48'

E720 1000 0037 VLL v2,r0,#AddressShadow(r1,0)

E700 E000 0006 VL v0,+CONSTANT_AREA(,r14,0)

E702 0000 1082 VFAE v0,v2,v0,b'0001',b'0000'

E700 0001 2021 VLGV r0,v0,1,2

EC08 0017 007E CIJE r0,H'0',@1L6

E700 E010 0006 VL v0,+CONSTANT_AREA(,r14,16)

E702 0000 1082 VFAE v0,v2,v0,b'0001',b'0000'

E700 0001 2021 VLGV r0,v0,1,2

EC08 000B 007E CIJE r0,H'0',@1L6


INLIST

And one vector operation will also suffice for

8 compares of WCHAR(1)

4 compares of CHAR(4)

4 compares of WCHAR(2)
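The counts above follow from how many values of a given size fit in one 16-byte vector. A minimal Python sketch, with the hypothetical helper `inlist_vector` returning both the answer and the number of vector tests it needed, and with each group test standing in for one find-any-equal:

```python
def inlist_vector(x: bytes, values: list) -> tuple:
    """Return (found, vector_tests) for INLIST(x, values...) done 16 bytes at a time."""
    size = len(x)
    assert 16 % size == 0 and all(len(v) == size for v in values)
    per_vector = 16 // size                    # 8 for CHAR(2)/WCHAR(1), 4 for CHAR(4)/WCHAR(2)
    tests = 0
    for start in range(0, len(values), per_vector):
        group = values[start:start + per_vector]   # one constant vector of candidates
        tests += 1                                 # one find-any-equal plus one branch
        if x in group:
            return True, tests
    return False, tests

codes8 = [b"AT", b"DE", b"CH", b"NL", b"DK", b"FI", b"SE", b"NO"]
codes16 = codes8 + [b"BE", b"FR", b"IT", b"ES", b"PT", b"PL", b"CZ", b"HU"]

eight = inlist_vector(b"NO", codes8)
sixteen = inlist_vector(b"HU", codes16)
missing = inlist_vector(b"XX", codes16)
char4 = inlist_vector(b"EURO", [b"EURO", b"USD ", b"GBP ", b"CHF "])
```

The extra country codes in `codes16` are invented for the example; the slide with 16 codes does not list them.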


BETWEEN

This built-in function is useful in determining if a value is in an interval

It requires exactly 3 arguments

BETWEEN( x, a, b ) is equivalent to ( x >= a ) & ( x <= b )

All the arguments must be ordinals or have real numeric type

The compiler will optimize this when possible

For example, if x, a, and b are all FIXED BIN(p,0) with p <= 31, then the compiler will turn BETWEEN( x, a, b ) into one comparison (not two!)

ORDINAL, CHAR(1), and WCHAR(1) are optimized in the same way
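The slides do not say how the two comparisons become one. A common technique, shown here purely as an assumption about what such an optimization could look like, is to subtract the lower bound and do a single unsigned comparison against the width of the interval. The sketch models 32-bit arithmetic with a mask and is only valid when a <= b.

```python
MASK32 = 0xFFFFFFFF

def between_two_compares(x: int, a: int, b: int) -> bool:
    return (x >= a) and (x <= b)

def between_one_compare(x: int, a: int, b: int) -> bool:
    """Single unsigned comparison; assumes a <= b and all values fit in signed 32 bits."""
    return ((x - a) & MASK32) <= ((b - a) & MASK32)

# Values below a wrap around to very large unsigned numbers and fail the single test.
checks = [
    between_one_compare(x, -10, 20) == between_two_compares(x, -10, 20)
    for x in range(-50, 50)
]
```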


BETWEEN

If this function were allowed to have more arguments to test if a value was in any one of several ranges, for example

BETWEEN( x, a, b, c, d, e, f ) would be equivalent to

BETWEEN( x, a, b ) | BETWEEN( x, c, d ) | BETWEEN( x, e, f )

Then for certain types of x, the compiler could use the vector range compare instruction to generate nice code to do these tests


Other built-in functions

USUPPLEMENTARY

Essentially a range compare

JSONGetComma, JSONGetColon, etc

Requires an initial “VERIFY” against the possible whitespace values


Arrays

PL/I has always had array language

Vector instructions could be used to optimize code such as

A = B; where A is an array of FIXED BIN(31) and B an array of FIXED BIN(15)

A = B + C * D; (etc) where the elements are arrays of FIXED BIN

ALL or ANY when applied to various integer or string arrays

And so on


© Copyright IBM Corporation 2008. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, the on-demand business logo, Rational, the Rational logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
