Cvim half precision floating point

Half Precision Floating Point Number

-half-
@tomoaki_teshima

How big is the image?
• Multiplying two images (floating point operation)

Size! Size!! Size!!!
• RGB: 3 bytes / pixel
• float: 4 bytes / pixel
• Is there any more space to save?

Summary
• Explanation of half
• Example on ARM
• Example on ARM w/ SIMD instructions
• Example on Intel, AMD (x86)
• Example on CUDA

Format of floating-point numbers (IEEE 754)

Type                             | Sign bit | Exponent | Significand
64bit = double, double precision | 1bit     | 11bit    | 52bit
32bit = float, single precision  | 1bit     | 8bit     | 23bit
16bit = half, half precision     | 1bit     | 5bit     | 10bit

ARM has fp16

https://ja.wikipedia.org/wiki/半精度浮動小数点数


What to prepare
• An ARM machine which runs Linux
  • Raspberry Pi zero/1/2/3
  • ODROID XU4/C2
  • Jetson TK1/TX1
  • PINE64
• The ones shown in red on the slide (Raspberry Pi 3, ODROID C2, Jetson TX1, PINE64) are 64bit architecture

• Buy one for better understanding

Example on ARM

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello World !!\n");
    __fp16 halfPrecision = 1.5f;
    // the half value is promoted to floating point when passed to printf
    printf("half precision:%f\n", halfPrecision);
    printf("half precision:sizeof %zu\n", sizeof(halfPrecision));
    // dump the raw 16bit pattern
    printf("half precision:0x%04x\n", *(unsigned short*)(void*)&halfPrecision);
    float original[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f,
                        9.0f,10.0f,11.0f,12.0f,13.0f,14.0f,15.0f,16.0f,};
    for (unsigned int i = 0; i < 16; i++)
    {
        __fp16 stub = original[i];   // float -> half
        printf("%2d 0x%04x\n", (int)original[i], *(unsigned short*)(void*)&stub);
    }
    return 0;
}

https://github.com/tomoaki0705/sampleFp16



Build it

• The option "-mfp16-format" is required
• Try it with ARM gcc; on other targets you get an "unknown option" error

$ gcc -std=c99 -mfp16-format=ieee main.c

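Result
0 | 1 0 0 0 1 | 1 1 0 0 0 0 0 0 0 0
Sign bit (+) | Exponent (17) | Significand (bit weights from the left: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024)

$2^{(17-15)} \times \left(1 + \frac{1}{2} + \frac{1}{4}\right) = 2^{2} \times \frac{7}{4} = 7$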

When the exponent is all 0, the number is zero or subnormal.
When the exponent is all 1, the number is Inf or NaN.
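The decoding rule can also be written out in portable C, without relying on __fp16. The following is a minimal sketch and not part of the original slides: half_to_float is a hypothetical helper name, and the code assumes that float is an IEEE 754 single precision type.

```c
#include <stdint.h>
#include <string.h>

/* Decode an IEEE 754 half precision bit pattern into a float.
 * half_to_float is a hypothetical helper name, not a library function.
 * Assumption: float is IEEE 754 single precision (binary32). */
static float half_to_float(uint16_t h)
{
    uint32_t sign     = (uint32_t)(h & 0x8000u) << 16; /* 1 sign bit          */
    uint32_t exponent = (h >> 10) & 0x1fu;             /* 5 exponent bits     */
    uint32_t fraction = h & 0x3ffu;                    /* 10 significand bits */
    uint32_t bits;
    float result;

    if (exponent == 0) {
        /* Zero or subnormal: no implicit leading 1, value = fraction * 2^-24 */
        float magnitude = (float)fraction * (1.0f / 16777216.0f);
        return sign ? -magnitude : magnitude;
    }
    if (exponent == 0x1f) {
        /* All ones: Inf (fraction == 0) or NaN (fraction != 0) */
        bits = sign | 0x7f800000u | (fraction << 13);
    } else {
        /* Normal number: rebias the exponent from 15 to 127 (difference 112) */
        bits = sign | ((exponent + 112u) << 23) | (fraction << 13);
    }
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, half_to_float(0x4700) yields 7.0, which is the bit pattern 0 10001 1100000000 decoded by hand above, and half_to_float(0x3e00) yields 1.5.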

Summary
• The floating-point format is more complicated than the integer format
• Half can express floating-point numbers in 2 bytes

Check in Assembly
• The conversion is implemented in software
• What's the point of doing it on the SW side?

$ gcc -S -std=c99 -mfp16-format=ieee -o main.c.s main.c

movw r3, #15872              <- 0x3e00
strh r3, [r7, #8] @ __fp16   <- store to stack
ldrh r3, [r7, #8] @ __fp16   <- load from stack
mov r0, r3 @ __fp16          <- copy to r0
bl __gnu_h2f_ieee            <- function call (half2float)

Half conversion instructions
• Conversion instructions between half and float
  • VCVTB.F16.F32 (float -> half)
  • VCVTB.F32.F16 (half -> float)
  • VCVTT.F16.F32 (float -> half)
  • VCVTT.F32.F16 (half -> float)

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html




Half instructions
• An ARM CPU might not have an FPU
• To use the FPU, the compiler has to know about it
• Give an option to tell gcc

$ gcc -mfp16-format=ieee main.c
↓
$ gcc -mfp16-format=ieee -mfpu=vfpv4 main.c

Check in Assembler 2

w/o FPU option
movw r3, #15872
strh r3, [r7, #8] @ __fp16
ldrh r3, [r7, #8] @ __fp16
mov r0, r3 @ __fp16
bl __gnu_h2f_ieee

mfpu=vfpv4
movw r3, #15872
strh r3, [r7, #8] @ __fp16
add r2, r7, #8
vld1.16 {d7[2]}, [r2]
vcvtb.f32.f16 s15, s15

fp16 instructions on ARM
• Conversion between half <-> float only
  • VCVTB.F16.F32
  • VCVTB.F32.F16
  • VCVTT.F16.F32
  • VCVTT.F32.F16
• If you perform an operation on a half number, the number is promoted to single precision float just before the operation





Summary
• ARM
  • To use the HW instruction, specify the FPU
  • No arithmetic instructions, only conversion to and from fp32
• ARM(SIMD)
• Intel, AMD (x86)
• CUDA

fp16 instruction on ARM (SIMD)
• vcvt stands for vector convert
• Let’s try using SIMD instructions
• Conversion instructions using SIMD
  • float16x4_t vcvt_f16_f32(float32x4_t a);
    VCVT.F16.F32 d0, q0
  • float32x4_t vcvt_f32_f16(float16x4_t a);
    VCVT.F32.F16 q0, d0

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html



Try the operation in vector

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uint8x8_t   srcInteger      = vld1_u8(src + x);                        // load 64bits (8 uchars)
    float16x4_t gainHalfLow     = *(float16x4_t*)(gain + x    );           // load 64bits (4 halves)
    float16x4_t gainHalfHigh    = *(float16x4_t*)(gain + x + 4);           // load 64bits (4 halves)
    uint16x8_t  srcIntegerShort = vmovl_u8(srcInteger);                    // uchar -> ushort
    uint32x4_t  srcIntegerLow   = vmovl_u16(vget_low_u16 (srcIntegerShort)); // ushort -> uint
    uint32x4_t  srcIntegerHigh  = vmovl_u16(vget_high_u16(srcIntegerShort)); // ushort -> uint
    float32x4_t srcFloatLow     = vcvtq_f32_u32(srcIntegerLow );           // uint -> float
    float32x4_t srcFloatHigh    = vcvtq_f32_u32(srcIntegerHigh);           // uint -> float
    float32x4_t gainFloatLow    = vcvt_f32_f16(gainHalfLow );              // half -> float
    float32x4_t gainFloatHigh   = vcvt_f32_f16(gainHalfHigh);              // half -> float
    float32x4_t dstFloatLow     = vmulq_f32(srcFloatLow,  gainFloatLow );  // float * float
    float32x4_t dstFloatHigh    = vmulq_f32(srcFloatHigh, gainFloatHigh);  // float * float
    uint32x4_t  dstIntegerLow   = vcvtq_u32_f32(dstFloatLow );             // float -> uint
    uint32x4_t  dstIntegerHigh  = vcvtq_u32_f32(dstFloatHigh);             // float -> uint
    uint16x8_t  dstIntegerShort = vcombine_u16(vmovn_u32(dstIntegerLow),
                                               vmovn_u32(dstIntegerHigh)); // uint -> ushort
    uint8x8_t   dstInteger      = vmovn_u16(dstIntegerShort);              // ushort -> uchar
    vst1_u8(dst + x, dstInteger);                                          // store
}

https://github.com/tomoaki0705/sampleFp16Vector



Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                       // load 64bits (8 uchars)
    half4   gainHalfLow     = load_half4(gain + x    );                   // load 64bits (4 halves)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                   // load 64bits (4 halves)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);         // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);  // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);  // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );       // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);       // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );         // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);         // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );  // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);  // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );         // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);         // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);    // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                    // store
}

Let’s build
• Specify one of the FPU options marked in red on the slide
• The FPU has to support both SIMD and half, e.g. neon-fp16 or neon-vfpv4

List of FPU options
vfp, vfpv3, vfpv3-fp16, vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16, neon, neon-fp16, vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4, fp-armv8, neon-fp-armv8, crypto-neon-fp-armv8

http://dench.flatlib.jp/opengl/fpu_vfp
http://tessy.org/wiki/index.php?ARM%A4%CEFPU


Check in Assembly

VCVT instruction

Summary
• ARM
  • Done
• ARM(SIMD)
  • Specify an FPU which is capable of both SIMD and half
• Intel, AMD (x86)
• CUDA

half instructions on x86
• F16C instruction set

https://en.wikipedia.org/wiki/F16C



Try the operation in vector

const unsigned int cParallel = 8;
const __m128i v_zero = _mm_setzero_si128();                               // all-zero register used for unpacking
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    __m128i srcInteger      = _mm_loadl_epi64((__m128i const *)(src + x));       // load 64bits (8 uchars)
    __m128i gainHalfLow     = _mm_loadl_epi64((__m128i const *)(gain + x    ));  // load 64bits (4 halves)
    __m128i gainHalfHigh    = _mm_loadl_epi64((__m128i const *)(gain + x + 4));  // load 64bits (4 halves)
    __m128i srcIntegerShort = _mm_unpacklo_epi8(srcInteger, v_zero);       // uchar -> ushort
    __m128i srcIntegerLow   = _mm_unpacklo_epi16(srcIntegerShort, v_zero); // ushort -> uint
    __m128i srcIntegerHigh  = _mm_unpackhi_epi16(srcIntegerShort, v_zero); // ushort -> uint
    __m128  srcFloatLow     = _mm_cvtepi32_ps(srcIntegerLow );             // uint -> float
    __m128  srcFloatHigh    = _mm_cvtepi32_ps(srcIntegerHigh);             // uint -> float
    __m128  gainFloatLow    = _mm_cvtph_ps(gainHalfLow );                  // half -> float
    __m128  gainFloatHigh   = _mm_cvtph_ps(gainHalfHigh);                  // half -> float
    __m128  dstFloatLow     = _mm_mul_ps(srcFloatLow , gainFloatLow );     // float * float
    __m128  dstFloatHigh    = _mm_mul_ps(srcFloatHigh, gainFloatHigh);     // float * float
    __m128i dstIntegerLow   = _mm_cvtps_epi32(dstFloatLow );               // float -> uint
    __m128i dstIntegerHigh  = _mm_cvtps_epi32(dstFloatHigh);               // float -> uint
    __m128i dstIntegerShort = _mm_packs_epi32(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    __m128i dstInteger      = _mm_packus_epi16(dstIntegerShort, v_zero);   // ushort -> uchar
    _mm_storel_epi64((__m128i *)(dst + x), dstInteger);                    // store
}




Little bit of improvements

const unsigned int cParallel = 8;
for (unsigned int x = 0; x <= cSize - cParallel; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                       // load 64bits (8 uchars)
    half4   gainHalfLow     = load_half4(gain + x    );                   // load 64bits (4 halves)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                   // load 64bits (4 halves)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);         // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);  // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);  // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );       // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);       // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );         // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);         // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );  // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);  // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );         // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);         // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);    // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                    // store
}

$ gcc -mf16c main.cpp

Check in Assembly
• Note that inline functions are not expanded inline when built in Debug mode

Check in Assembly
• Build in RelWithDebInfo mode
• Instructions are more packed
• Conversion instruction (vcvtph2ps)

Check in Assembly (gcc)
• Same behavior as Visual Studio: inline functions are kept as function calls

Check in Assembly (gcc)
• Assembly of Release mode
• Much more packed instructions
• Conversion instruction (vcvtph2ps)

Summary
• ARM
  • Done
• ARM(SIMD)
  • Done
• Intel, AMD (x86)
  • x86 has half conversion as one of its SIMD instructions
  • Implemented on Ivy Bridge and later CPUs (Intel)
  • Implemented on Piledriver and later CPUs (AMD)
  • Done
• CUDA

CUDA

unsigned short a = g_indata[y*imgw+x];
float gain;
gain = __half2float(a);

float b = imageData[(y*imgw+x)*3  ];
float g = imageData[(y*imgw+x)*3+1];
float r = imageData[(y*imgw+x)*3+2];

g_odata[(y*imgw+x)*3  ] = clamp(b * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+1] = clamp(g * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+2] = clamp(r * gain, 0.0f, 255.0f);

The best point of using half
• The size of the data transferred to GPU memory is reduced

Summary
• ARM
  • Done
• ARM(SIMD)
  • Done
• Intel, AMD (x86)
  • Done
• CUDA
  • CUDA 7.5 and later will support half natively
  • Pascal has been announced to support direct operations on half (announced on April 5th)
  • Partially available on Jetson TX1
  • The conversion instruction itself has existed for a long time
http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html


Summary of each platform

Platform                 | Conversion (single variable) | Conversion (vector) | Direct operation with fp16
ARM                      | ◯                            | ◯                   | ×
x86                      | ×                            | ◯                   | ×
CUDA (Maxwell and older) | ◯                            | ◯                   | ×
CUDA (Pascal and later)  | ◯                            | ◯ <-New!            | ◯ <-New!

Limit of half precision -Overflow-
• The maximum of float (32bit)
  • Exponent 8bits, significand 23bits -> up to about 3.4 × 10^38
  • This is larger than the maximum of signed int (+2,147,483,647)
• The maximum of half (16bit)
  • Exponent 5bits, significand 10bits -> up to 65504
  • This is smaller than the maximum of unsigned short (65535)
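The overflow limit can be observed with a software float -> half conversion. The following is a minimal portable sketch and not part of the original slides: float_to_half is a hypothetical helper name, the code assumes that float is IEEE 754 single precision, and it rounds to nearest with ties to even. It is not claimed to match the rounding of any particular hardware instruction or library routine.

```c
#include <stdint.h>
#include <string.h>

/* Encode a float as an IEEE 754 half precision bit pattern.
 * float_to_half is a hypothetical helper name, not a library function.
 * Assumptions: float is binary32; rounding is to nearest, ties to even. */
static uint16_t float_to_half(float f)
{
    uint32_t x, magnitude, h, rest;
    uint16_t sign;

    memcpy(&x, &f, sizeof x);
    sign      = (uint16_t)((x >> 16) & 0x8000u);
    magnitude = x & 0x7fffffffu;

    if (magnitude > 0x7f800000u)           /* NaN stays NaN               */
        return (uint16_t)(sign | 0x7e00u);
    if (magnitude >= 0x47800000u)          /* |f| >= 65536 or Inf -> Inf  */
        return (uint16_t)(sign | 0x7c00u);

    if (magnitude < 0x38800000u) {         /* |f| < 2^-14: subnormal or 0 */
        float scaled, frac;
        memcpy(&scaled, &magnitude, sizeof scaled);
        scaled *= 16777216.0f;             /* express in units of 2^-24   */
        h    = (uint32_t)scaled;
        frac = scaled - (float)h;
        if (frac > 0.5f || (frac == 0.5f && (h & 1u)))
            h++;
        return (uint16_t)(sign | h);
    }

    /* Normal range: rebias the exponent (127 -> 15), keep the top 10 bits */
    h    = (((magnitude >> 23) - 112u) << 10) | ((magnitude >> 13) & 0x3ffu);
    rest = magnitude & 0x1fffu;            /* the 13 bits that are dropped */
    if (rest > 0x1000u || (rest == 0x1000u && (h & 1u)))
        h++;                               /* a carry may reach 0x7c00 (Inf) */
    return (uint16_t)(sign | h);
}
```

With this sketch, float_to_half(65504.0f) returns 0x7bff, the largest finite half, while float_to_half(70000.0f) returns 0x7c00, the bit pattern of +Inf.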

Limit of half precision -Rounding Error-
• Rounding error of float (32bit)
  • Exact integers can be expressed up to 16777216 (=2^24)
• Rounding error of half (16bit)
  • Exact integers can be expressed up to 2048 (=2^11)
  • Between 1024 and 2047, half can only express integers
  • Between 512 and 1023, half can only express numbers in steps of 0.5
  • Ex. 180.5 + 178.2 + 185.2 + 150.3 + 160.3 = 854.5
  • Correct average: 854.5/5 = 170.9
  • Computing with half: 171.0 <- rounding error
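The step sizes listed above follow from the 10bit significand: inside each power-of-two interval [2^k, 2^(k+1)) the spacing between neighbouring half values is 2^(k-10). The following is a minimal sketch and not part of the original slides; half_spacing is a hypothetical helper name, and it only covers positive values in the normal range of half (about 6.1e-5 to 65504).

```c
/* Spacing between adjacent half precision values around x.
 * half_spacing is a hypothetical helper name, not a library function.
 * Assumption: x is positive and within the normal range of half.
 * Returns 0 for non-positive input. */
static float half_spacing(float x)
{
    float lower = 1.0f;            /* lower bound of the interval [1, 2) */
    float step  = 1.0f / 1024.0f;  /* 2^-10: spacing inside [1, 2)       */

    if (x <= 0.0f)
        return 0.0f;
    while (x >= lower * 2.0f) {    /* move up one power of two at a time */
        lower *= 2.0f;
        step  *= 2.0f;
    }
    while (x < lower) {            /* move down for values below 1       */
        lower *= 0.5f;
        step  *= 0.5f;
    }
    return step;
}
```

For example, half_spacing(600.0f) returns 0.5 and half_spacing(1024.0f) returns 1.0, matching the two ranges listed above, and half_spacing(2048.0f) returns 2.0, which is why odd integers above 2048 can no longer be expressed exactly.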

Summary
• Explanation of FP16, half precision floating point
• Available platforms
  • ARM (single variable / SIMD, storage only)
  • x86 (SIMD only, storage only)
  • CUDA (operation of fp16 coming on TX1)
