Cvim half precision floating point
Uploaded by tomoaki0705
Half Precision Floating Point Number
-half-
@tomoaki_teshima

How big is the image?
• Multiplying two images (floating point operation)

Size! Size!! Size!!!
• RGB: 3 bytes / pixel
• float: 4 bytes / pixel
• Is there any room left to reduce the size?

Summary
• Explanation of half
• Example on ARM
• Example on ARM w/ SIMD instruction
• Example on Intel, AMD (x86)
• Example on CUDA

Format of floating-point numbers (IEEE 754)

| Format | Sign bit | Exponent | Significand |
|---|---|---|---|
| 64bit = double, double precision | 1 bit | 11 bits | 52 bits |
| 32bit = float, single precision | 1 bit | 8 bits | 23 bits |
| 16bit = half, half precision | 1 bit | 5 bits | 10 bits |
ARM has fp16
https://ja.wikipedia.org/wiki/半精度浮動小数点数

What to prepare
• An ARM machine which runs Linux
  • Raspberry Pi zero/1/2/3
  • ODROID XU4/C2
  • Jetson TK1/TX1
  • PINE64
• The boards shown in red on the slide are 64bit architecture (Raspberry Pi 3, ODROID C2, Jetson TX1, PINE64)
• Buy one for better understanding

Example on ARM
#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello World !!\n");
    __fp16 halfPrecision = 1.5f;
    /* cast explicitly so that printf receives a double */
    printf("half precision:%f\n", (double)halfPrecision);
    printf("half precision:sizeof %zu\n", sizeof(halfPrecision));
    /* reinterpret the 2 bytes of the half as an integer to show the bit pattern */
    printf("half precision:0x%04x\n", *(unsigned short*)(void*)&halfPrecision);
    float original[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f,
                        9.0f,10.0f,11.0f,12.0f,13.0f,14.0f,15.0f,16.0f,};
    for (unsigned int i = 0; i < 16; i++)
    {
        __fp16 stub = original[i];   /* float -> half conversion */
        printf("%2d 0x%04x\n", (int)original[i], *(unsigned short*)&stub);
    }
    return 0;
}
https://github.com/tomoaki0705/sampleFp16

Build it
• The option "-mfp16-format" is required
• Build with gcc for ARM; otherwise gcc reports an "unknown option" error
$ gcc -std=c99 -mfp16-format=ieee main.c

Result
Bit pattern of 7.0: 0 10001 1100000000 (0x4700)
• Sign bit: 0 (+)
• Exponent: 10001 (17)
• Significand: 1100000000; the bit weights, from the most significant bit, are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024
$$2^{17-15} \times \left(1 + \frac{1}{2} + \frac{1}{4}\right) = 2^{2} \times \frac{7}{4} = 7$$
• When the exponent is all 0s, the number is zero or subnormal.
• When the exponent is all 1s, the number is Inf or NaN.
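
The decoding rule on this slide can be written down in a few lines of portable C. The following is a minimal sketch, not code from the slides: `half_bits_to_float` is a hypothetical helper name, and the half value is assumed to arrive as a raw 16-bit pattern such as the ones printed by the example program.

```c
#include <stdint.h>
#include <string.h>

/* Decode an IEEE 754 half-precision bit pattern into a float.
 * half_bits_to_float is a hypothetical helper, not a library function. */
float half_bits_to_float(uint16_t bits)
{
    uint32_t sign        = (bits >> 15) & 0x1u;   /* 1 bit            */
    uint32_t exponent    = (bits >> 10) & 0x1fu;  /* 5 bits, bias 15  */
    uint32_t significand = bits & 0x3ffu;         /* 10 bits          */
    uint32_t floatBits;
    float value;

    if (exponent == 0) {
        /* zero or subnormal: significand * 2^-24, no implicit leading 1 */
        value = (float)significand * (1.0f / 16777216.0f);
        return sign ? -value : value;
    }
    if (exponent == 0x1f)
        exponent = 0xff;          /* all ones: Inf or NaN in float, too */
    else
        exponent += 127 - 15;     /* re-bias: half bias 15 -> float bias 127 */

    /* the 10 significand bits become the top 10 of float's 23 */
    floatBits = (sign << 31) | (exponent << 23) | (significand << 13);
    memcpy(&value, &floatBits, sizeof value);
    return value;
}
```

Calling `half_bits_to_float(0x4700)` walks through the same steps as the slide: exponent field 17, significand bits 1/2 + 1/4, result 7.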

Summary
• The floating point format is more complicated than the integer format
• Half can express floating point numbers in 2 bytes

Check in Assembly
• The conversion is implemented in software
• What's the point of doing it on the SW side?
$ gcc -S -std=c99 -mfp16-format=ieee -o main.c.s main.c
movw r3, #15872            <- 0x3e00
strh r3, [r7, #8] @ __fp16 <- store to stack
ldrh r3, [r7, #8] @ __fp16 <- load from stack
mov r0, r3 @ __fp16        <- copy to r0
bl __gnu_h2f_ieee          <- function call (half2float)
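
To give a feel for how much work a software conversion involves, here is a minimal sketch of the opposite direction (float -> half) with round-to-nearest-even. It is an illustration only: `float_to_half_bits` is a hypothetical name, and this is not the code of the runtime routine called in the listing above.

```c
#include <stdint.h>
#include <string.h>

/* Convert a float to IEEE 754 half-precision bits (round to nearest even).
 * float_to_half_bits is a hypothetical helper, not the libgcc routine. */
uint16_t float_to_half_bits(float value)
{
    uint32_t f;
    memcpy(&f, &value, sizeof f);

    uint16_t sign     = (uint16_t)((f >> 16) & 0x8000u);
    uint32_t exponent = (f >> 23) & 0xffu;
    uint32_t mantissa = f & 0x7fffffu;

    if (exponent == 0xff)   /* Inf stays Inf, NaN stays NaN */
        return (uint16_t)(sign | 0x7c00u | (mantissa ? 0x0200u : 0u));

    int32_t e = (int32_t)exponent - 127 + 15;   /* re-bias to half */
    if (e >= 31)
        return (uint16_t)(sign | 0x7c00u);      /* too large: Inf */

    if (e <= 0) {
        /* result is subnormal (or zero) in half */
        if (e < -10)
            return sign;                        /* rounds to signed zero */
        mantissa |= 0x800000u;                  /* make the leading 1 explicit */
        uint32_t shift   = (uint32_t)(14 - e);  /* 14..24 */
        uint32_t half    = mantissa >> shift;
        uint32_t rest    = mantissa & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1);
        if (rest > halfway || (rest == halfway && (half & 1u)))
            half++;
        return (uint16_t)(sign | half);
    }

    /* normal number: keep the top 10 mantissa bits, round on the 13 dropped */
    uint32_t half = ((uint32_t)e << 10) | (mantissa >> 13);
    uint32_t rest = mantissa & 0x1fffu;
    if (rest > 0x1000u || (rest == 0x1000u && (half & 1u)))
        half++;   /* a carry into the exponent field is the correct result */
    return (uint16_t)(sign | half);
}
```

For 1.5f this returns 0x3e00, the constant that the compiler loads with `movw r3, #15872` in the listing. The branches for subnormals, overflow and rounding are the kind of work a dedicated conversion instruction does in hardware.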

Half conversion instructions
• Conversion instructions between half and float
  • VCVTB.F16.F32 (float -> half, half value in the bottom 16 bits of the register)
  • VCVTB.F32.F16 (half -> float, half value in the bottom 16 bits of the register)
  • VCVTT.F16.F32 (float -> half, half value in the top 16 bits of the register)
  • VCVTT.F32.F16 (half -> float, half value in the top 16 bits of the register)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html

Half instructions
• An ARM CPU might not have an FPU
• To use the FPU, the compiler has to know about it
• Give an option to tell gcc
$ gcc -mfp16-format=ieee main.c
↓
$ gcc -mfp16-format=ieee -mfpu=vfpv4 main.c

Check in Assembler 2
With -mfpu=vfpv4:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
add r2, r7, #8
vld1.16 {d7[2]}, [r2]
vcvtb.f32.f16 s15, s15
w/o FPU option:
movw r3, #15872
strh r3, [r7, #8] @ __fp16
ldrh r3, [r7, #8] @ __fp16
mov r0, r3 @ __fp16
bl __gnu_h2f_ieee
fp16 instructions on ARM• Conversion between half <-> float only• VCVTB.F16.F32• VCVTB.F32.F16• VCVTT.F16.F32• VCVTT.F32.F16
• If you perfume an operation with half number, the number will be promoted to single precision float just before the operation
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ij/CJAGIFIJ.html

Summary
• ARM
  • To use the HW instruction, specify the FPU
  • No arithmetic instructions for half, only conversion to and from fp32
• ARM (SIMD)
• Intel, AMD (x86)
• CUDA

fp16 instruction on ARM (SIMD)
• vcvt stands for "vector convert"
• Let's try using SIMD instructions
• Conversion instructions using SIMD
  • float16x4_t vcvt_f16_f32(float32x4_t a);
    • VCVT.F16.F32 d0, q0
  • float32x4_t vcvt_f32_f16(float16x4_t a);
    • VCVT.F32.F16 q0, d0
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348bj/BABGABJH.html

Try the operation in vector
const unsigned int cParallel = 8;
for (unsigned int x = 0; x + cParallel <= cSize; x += cParallel)
{
    uint8x8_t   srcInteger      = vld1_u8(src + x);                           // load 8 uchar (64bits)
    float16x4_t gainHalfLow     = *(float16x4_t*)(gain + x    );              // load 4 half (64bits)
    float16x4_t gainHalfHigh    = *(float16x4_t*)(gain + x + 4);              // load 4 half (64bits)
    uint16x8_t  srcIntegerShort = vmovl_u8(srcInteger);                       // uchar -> ushort
    uint32x4_t  srcIntegerLow   = vmovl_u16(vget_low_u16 (srcIntegerShort));  // ushort -> uint
    uint32x4_t  srcIntegerHigh  = vmovl_u16(vget_high_u16(srcIntegerShort));  // ushort -> uint
    float32x4_t srcFloatLow     = vcvtq_f32_u32(srcIntegerLow );              // uint -> float
    float32x4_t srcFloatHigh    = vcvtq_f32_u32(srcIntegerHigh);              // uint -> float
    float32x4_t gainFloatLow    = vcvt_f32_f16(gainHalfLow );                 // half -> float
    float32x4_t gainFloatHigh   = vcvt_f32_f16(gainHalfHigh);                 // half -> float
    float32x4_t dstFloatLow     = vmulq_f32(srcFloatLow,  gainFloatLow );     // float * float
    float32x4_t dstFloatHigh    = vmulq_f32(srcFloatHigh, gainFloatHigh);     // float * float
    uint32x4_t  dstIntegerLow   = vcvtq_u32_f32(dstFloatLow );                // float -> uint
    uint32x4_t  dstIntegerHigh  = vcvtq_u32_f32(dstFloatHigh);                // float -> uint
    uint16x8_t  dstIntegerShort = vcombine_u16(vmovn_u32(dstIntegerLow),
                                               vmovn_u32(dstIntegerHigh));    // uint -> ushort
    uint8x8_t   dstInteger      = vmovn_u16(dstIntegerShort);                 // ushort -> uchar
    vst1_u8(dst + x, dstInteger);                                             // store
}
https://github.com/tomoaki0705/sampleFp16Vector
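
For comparison, the same per-pixel computation can be sketched in plain scalar C, which is handy as a reference when checking a vector loop or for handling the pixels left over when the size is not a multiple of 8. This is a minimal sketch under assumptions: `multiply_gain_scalar` and `half_bits_to_float` are hypothetical names, the gain image is assumed to be stored as raw 16-bit half patterns, and the truncation and clamping to 0..255 are choices made here, not necessarily what the vector loop does at the range limits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* half bits -> float in software (hypothetical helper) */
static float half_bits_to_float(uint16_t bits)
{
    uint32_t sign        = (bits >> 15) & 0x1u;
    uint32_t exponent    = (bits >> 10) & 0x1fu;
    uint32_t significand = bits & 0x3ffu;
    uint32_t floatBits;
    float value;

    if (exponent == 0) {                       /* zero or subnormal */
        value = (float)significand * (1.0f / 16777216.0f);
        return sign ? -value : value;
    }
    exponent = (exponent == 0x1f) ? 0xffu : exponent + (127 - 15);
    floatBits = (sign << 31) | (exponent << 23) | (significand << 13);
    memcpy(&value, &floatBits, sizeof value);
    return value;
}

/* dst[i] = src[i] * gain[i], one pixel at a time.
 * The result is clamped to 0..255 and truncated toward zero (assumption). */
void multiply_gain_scalar(const uint8_t *src, const uint16_t *gain,
                          uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        float product = (float)src[i] * half_bits_to_float(gain[i]);
        if (!(product >= 0.0f))   /* also catches NaN */
            product = 0.0f;
        if (product > 255.0f)
            product = 255.0f;
        dst[i] = (uint8_t)product;
    }
}
```

The gain image costs 2 bytes per pixel here instead of 4, which is the saving the talk is after; the arithmetic itself still happens in float.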

Little bit of improvements
const unsigned int cParallel = 8;
for (unsigned int x = 0; x + cParallel <= cSize; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                               // load 8 uchar (64bits)
    half4   gainHalfLow     = load_half4(gain + x    );                           // load 4 half (64bits)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                           // load 4 half (64bits)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);                 // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);          // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);          // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );               // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);               // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );                 // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);                 // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );       // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);       // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );                 // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);                 // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);            // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                            // store
}

Let's build
• Specify one of the FPU options shown in red on the slide
• The FPU has to support both SIMD and half
• Of the options listed below, neon-fp16, neon-vfpv4, neon-fp-armv8 and crypto-neon-fp-armv8 provide both
List of FPU options
• vfp
• vfpv3
• vfpv3-fp16
• vfpv3-d16
• vfpv3-d16-fp16
• vfpv3xd
• vfpv3xd-fp16
• neon
• neon-fp16
• vfpv4
• vfpv4-d16
• fpv4-sp-d16
• neon-vfpv4
• fp-armv8
• neon-fp-armv8
• crypto-neon-fp-armv8
http://dench.flatlib.jp/opengl/fpu_vfp
http://tessy.org/wiki/index.php?ARM%A4%CEFPU
Check in Assembly
VCVT instruction

Summary
• ARM
  • Done
• ARM (SIMD)
  • Specify an FPU which is capable of both SIMD and half
• Intel, AMD (x86)
• CUDA

half instructions on x86
• F16C instruction set
https://en.wikipedia.org/wiki/F16C

Try the operation in vector
const unsigned int cParallel = 8;
const __m128i v_zero = _mm_setzero_si128();   // zero vector used for unpacking
for (unsigned int x = 0; x + cParallel <= cSize; x += cParallel)
{
    __m128i srcInteger      = _mm_loadl_epi64((__m128i const *)(src + x));       // load 8 uchar (64bits)
    __m128i gainHalfLow     = _mm_loadl_epi64((__m128i const *)(gain + x    ));  // load 4 half (64bits)
    __m128i gainHalfHigh    = _mm_loadl_epi64((__m128i const *)(gain + x + 4));  // load 4 half (64bits)
    __m128i srcIntegerShort = _mm_unpacklo_epi8(srcInteger, v_zero);             // uchar -> ushort
    __m128i srcIntegerLow   = _mm_unpacklo_epi16(srcIntegerShort, v_zero);       // ushort -> uint
    __m128i srcIntegerHigh  = _mm_unpackhi_epi16(srcIntegerShort, v_zero);       // ushort -> uint
    __m128  srcFloatLow     = _mm_cvtepi32_ps(srcIntegerLow );                   // uint -> float
    __m128  srcFloatHigh    = _mm_cvtepi32_ps(srcIntegerHigh);                   // uint -> float
    __m128  gainFloatLow    = _mm_cvtph_ps(gainHalfLow );                        // half -> float
    __m128  gainFloatHigh   = _mm_cvtph_ps(gainHalfHigh);                        // half -> float
    __m128  dstFloatLow     = _mm_mul_ps(srcFloatLow , gainFloatLow );           // float * float
    __m128  dstFloatHigh    = _mm_mul_ps(srcFloatHigh, gainFloatHigh);           // float * float
    __m128i dstIntegerLow   = _mm_cvtps_epi32(dstFloatLow );                     // float -> uint
    __m128i dstIntegerHigh  = _mm_cvtps_epi32(dstFloatHigh);                     // float -> uint
    __m128i dstIntegerShort = _mm_packs_epi32(dstIntegerLow, dstIntegerHigh);    // uint -> ushort
    __m128i dstInteger      = _mm_packus_epi16(dstIntegerShort, v_zero);         // ushort -> uchar
    _mm_storel_epi64((__m128i *)(dst + x), dstInteger);                          // store
}
https://github.com/tomoaki0705/sampleFp16Vector

Little bit of improvements
const unsigned int cParallel = 8;
for (unsigned int x = 0; x + cParallel <= cSize; x += cParallel)
{
    uchar8  srcInteger      = load_uchar8(src + x);                               // load 8 uchar (64bits)
    half4   gainHalfLow     = load_half4(gain + x    );                           // load 4 half (64bits)
    half4   gainHalfHigh    = load_half4(gain + x + 4);                           // load 4 half (64bits)
    ushort8 srcIntegerShort = convert_uchar8_ushort8(srcInteger);                 // uchar -> ushort
    uint4   srcIntegerLow   = convert_ushort8_lo_uint4(srcIntegerShort);          // ushort -> uint
    uint4   srcIntegerHigh  = convert_ushort8_hi_uint4(srcIntegerShort);          // ushort -> uint
    float4  srcFloatLow     = convert_uint4_float4(srcIntegerLow );               // uint -> float
    float4  srcFloatHigh    = convert_uint4_float4(srcIntegerHigh);               // uint -> float
    float4  gainFloatLow    = convert_half4_float4(gainHalfLow );                 // half -> float
    float4  gainFloatHigh   = convert_half4_float4(gainHalfHigh);                 // half -> float
    float4  dstFloatLow     = multiply_float4(srcFloatLow , gainFloatLow );       // float * float
    float4  dstFloatHigh    = multiply_float4(srcFloatHigh, gainFloatHigh);       // float * float
    uint4   dstIntegerLow   = convert_float4_uint4(dstFloatLow );                 // float -> uint
    uint4   dstIntegerHigh  = convert_float4_uint4(dstFloatHigh);                 // float -> uint
    ushort8 dstIntegerShort = convert_uint4_ushort8(dstIntegerLow, dstIntegerHigh); // uint -> ushort
    uchar8  dstInteger      = convert_ushort8_uchar8(dstIntegerShort);            // ushort -> uchar
    store_uchar8(dst + x, dstInteger);                                            // store
}
$ gcc -mf16c main.cpp

Check in Assembly
• Note that inline functions are not expanded inline when built in Debug mode

Check in Assembly
• Build with RelWithDebInfo mode
• Instructions are more packed
• Conversion instruction (vcvtph2ps)

Check in Assembly (gcc)
• Same behavior as Visual Studio: inline functions are kept as function calls

Check in Assembly (gcc)
• Assembly of Release mode
• Much more packed instructions
• Conversion instruction (vcvtph2ps)

Summary
• ARM
  • Done
• ARM (SIMD)
  • Done
• Intel, AMD (x86)
  • x86 has half conversion as one of the SIMD instructions
  • Implemented on Ivy Bridge and later CPUs (Intel)
  • Implemented on Piledriver and later CPUs (AMD)
  • Done
• CUDA

CUDA
unsigned short a = g_indata[y*imgw+x];
float gain;
gain = __half2float(a);

float b = imageData[(y*imgw+x)*3  ];
float g = imageData[(y*imgw+x)*3+1];
float r = imageData[(y*imgw+x)*3+2];

g_odata[(y*imgw+x)*3  ] = clamp(b * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+1] = clamp(g * gain, 0.0f, 255.0f);
g_odata[(y*imgw+x)*3+2] = clamp(r * gain, 0.0f, 255.0f);

The biggest benefit of using half
• The amount of data transferred to GPU memory is reduced

Summary
• ARM
  • Done
• ARM (SIMD)
  • Done
• Intel, AMD (x86)
  • Done
• CUDA
  • CUDA 7.5 and later support half natively
  • Pascal has been announced to have direct operations on half (announced on 5 April)
  • Partially available on Jetson TX1
  • The conversion instruction itself has existed for a long time
http://www.slideshare.net/NVIDIAJapan/1071-gpu-cuda-75maxwell
http://pc.watch.impress.co.jp/docs/news/event/20160406_751833.html

Summary of each platform

| Platform | Conversion (single variable) | Conversion (vector) | Direct operation with fp16 |
|---|---|---|---|
| ARM | ◯ | ◯ | × |
| x86 | × | ◯ | × |
| CUDA (Maxwell and older) | ◯ | ◯ | × |
| CUDA (Pascal and later) | ◯ | ◯ | ◯ (New!) |

Limit of half precision -Overflow-
• The maximum of float (32bit)
  • Exponent 8 bits, significand 23 bits -> up to about 3.4 × 10^38
  • This is larger than the maximum of signed int (+2,147,483,647)
• The maximum of half (16bit)
  • Exponent 5 bits, significand 10 bits -> up to 65504
  • This is smaller than the maximum of unsigned short (65535)
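
The value 65504 follows directly from the field widths: the largest usable exponent field is 30 (31 is reserved for Inf and NaN) and the significand field is all ones. A minimal sketch of that arithmetic, with hypothetical helper names `half_max_value` and `is_within_half_range`:

```c
/* Largest finite half: exponent field 30 (unbiased 15), significand all ones.
 * Both function names are hypothetical, chosen for this illustration. */
float half_max_value(void)
{
    const int unbiasedExponent = 30 - 15;             /* bias is 15 */
    float value = 1.0f + 1023.0f / 1024.0f;           /* 1.1111111111 in binary */
    for (int i = 0; i < unbiasedExponent; i++)
        value *= 2.0f;                                /* scale by 2^15 */
    return value;
}

/* Returns 1 when value lies inside the finite range of half, 0 otherwise. */
int is_within_half_range(float value)
{
    return value >= -65504.0f && value <= 65504.0f;
}
```

An 8-bit pixel value always fits, but a 16-bit unsigned value such as 65535 does not, so a range check like this is worth running before storing 16-bit image data as half.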

Limit of half precision -Rounding Error-
• Rounding error of float (32bit)
  • Exact integers can be expressed up to 16777216 (=2^24)
• Rounding error of half (16bit)
  • Exact integers can be expressed up to 2048 (=2^11)
  • Between 1024 and 2047, half can only express whole integers
  • Between 512 and 1024, half can only express numbers in steps of 0.5
  • Ex. 180.5 + 178.2 + 185.2 + 150.3 + 160.3 = 854.5
  • Correct average: 854.5/5 = 170.9
  • Computing with half: 171.0 <- rounding error
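
The step sizes quoted above come from the 10-bit significand: for a value between 2^e and 2^(e+1), adjacent half values are 2^(e-10) apart. The sketch below computes that step; `half_spacing` is a hypothetical helper, and it only covers magnitudes below 65536 (larger inputs simply report the step of the top range).

```c
/* Distance between adjacent half values around the given magnitude.
 * half_spacing is a hypothetical helper for illustration. */
float half_spacing(float value)
{
    float magnitude = value < 0.0f ? -value : value;
    float spacing = 1.0f / 16777216.0f;  /* 2^-24: step below 2^-13 (incl. subnormals) */
    float upper   = 1.0f / 8192.0f;      /* 2^-13: upper bound of the current range    */

    /* each doubling of the range doubles the step */
    while (magnitude >= upper && upper < 65536.0f) {
        spacing *= 2.0f;
        upper   *= 2.0f;
    }
    return spacing;
}
```

For the example on this slide, the running sum 854.5 lies where the step is 0.5, and the average 170.9 lies where the step is 0.125, so 170.9 itself cannot be stored exactly in half.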

Summary
• Explanation of FP16, half precision floating point
• Available on platform
  • ARM (single variable / SIMD, storage only)
  • x86 (SIMD only, storage only)
  • CUDA (operation of fp16 coming on TX1)