Integrated Systems Laboratory

Exercise: RISC – Programming

Increasing efficiency of a RISC core with simple instruction extensions

Michael Gautschi

04.04.2016

Introduction

• Today's exercises will be performed on the Pulpino platform

– Open source platform [www.pulp-platform.org]

– OpenRISC / RISC-V core

– 32kB Instruction memory

– 32kB Data memory

– SPI (load/unload data)

– UART (for printf)

– Small event unit

Exercise Overview

1. Introduction example
   – Compile & execute Helloworld
2. RTL – Simulator basics
   – Run motion_detection application [perf counters, traces, read]
3. Benchmarking
   – Analyze performance improvements of the new instructions
4. Efficient matrix multiplications and convolutions with Dot-product
   – Program a convolution and show the benefit of the dot product
5. Motion detection with efficient convolution
   – Plug the optimized convolution into the application and observe the speedup
6. Compressed instructions on RISC-V
   – Coremark analysis


Getting Started – 1/2

• Copy data from master account:
  $ mkdir 2_OpenRISC
  $ cp /home/soc_master/2_OpenRISC/pulpino.tar.gz 2_OpenRISC/.
  $ cd 2_OpenRISC
  $ tar -xzf pulpino.tar.gz

• We will be working in the software (sw) and build directories
  – rtl/ips-dir:
    • Contains HDL source code
  – sw-dir:
    • Contains application source code (in apps)
  – build-dir:
    • Contains compiler and simulator outputs
    • RTL simulations will be run here
  – vsim-dir:
    • Contains all scripts for RTL compilation

(Figure: 2_OpenRISC directory)

Getting Started – 2/2

• We will be working in /scratch because we are going to generate some data

1. Create a build directory and set up the compiler
   • $ mkdir /scratch/soc_xx/build_or10n
2. Configure the build directory
   • $ cd /scratch/soc_xx/build_or10n
   • $ cp ~soc_master/2_OpenRISC/pulpino/sw/cmake_configure.or1k.gcc.sh .
   • In the configure script, set the path to your exercise folder:
     PULP_GIT_DIRECTORY="/home/soc_xx/2_OpenRISC/pulpino"
   • $ or1k -g2.0.11 ./cmake_configure.or1k.gcc.sh
   • You have successfully set up the build directory!
3. Compile the RTL code
   • $ make vcompile
   • Let's get started with exercise 1!

Exercise 1 – Introduction

a) The build directory is created, the compiler is configured, the RTL is compiled. We are ready to start with a simple helloworld.

b) Compile helloworld
   • helloworld.c is located in sw/apps/helloworld/.
   • To compile the application, enter the build folder and run the makefile:
     $ cd /scratch/soc_xx/build_or10n
     $ make helloworld.read : to generate the assembly
     $ make helloworld.slm.cmd : to generate input data for RTL simulations

c) Compile & Run helloworld
   • The application can be run in modelsim (gui) or in batch mode:
     $ make helloworld.vsim : to start modelsim (then type run -all)
     $ make helloworld.vsimc : to run in batch mode
   • Console should output "helloworld"
   • Output is also written to the file: apps/helloworld/stdout/uart

Exercise 2 – Basic Tests 1/4

• We are now looking at a more complicated application: the "motion_detection" application
• To compile & run the application:
  $ make motion_detection.vsimc
• A timer tracks how many cycles were required to compute the image
  – The printf output is sent over UART, and the testbench dumps the received data to the file:
    build_or10n/apps/sequential_tests/motion_detection/stdout/uart
  – The testbench also outputs a trace file which shows the sequence in which the instructions were executed:
    build_or10n/apps/sequential_tests/motion_detection/trace_core00.log

Exercise 2 – Basic Tests 2/4

• To better understand what the compiler generated, you can have a look at the disassembled code:
  $ make motion_detection.read
  (Figure: annotated disassembly listing, with labels for the PC, the instruction encoding, the disassembled instructions, and the absolute and relative jump/branch targets)
• Trace file:
  (Figure: annotated trace, with labels for the Time, Cycle, and PC columns and for the logged effects: ALU register update; load data to register; write to memory)

Exercise 2 – Basic Tests 3/4

• Performance counters:
  – In order to profile an application, the core supports several performance counter events.
  – Only one counter exists in the micro-architecture, to keep the area overhead small
    • To count multiple events, the program has to be run several times in sequence, each time with a different event configured
  – The following events are of interest:

    Name                 ID   Counts
    SPR_PCER_CYCLES      0    # cycles
    SPR_PCER_INSTR       1    # instructions
    SPR_PCER_LD_STALL    2    # load hazards
    SPR_PCER_LD          7    # load insn.
    SPR_PCER_ST          8    # store insn.
    SPR_PCER_JUMP        9    # jumps
    SPR_PCER_BRANCH      10   # branches
    SPR_PCER_DELAY_NOP   11   # delay nops

• Functions to set up performance counters:
  perf_reset()         : reset the counters
  perf_enable_id(ID)   : start counting event ID
  perf_stop()          : stop counting
  cpu_perf_get(ID)     : read the counter

Exercise 2 – Basic Tests 4/4

• Tasks:
  – How many kB is the binary?
    • How big is the convolution_rect function?
  – Profile the motion_detection algorithm:
    • How many instructions are executed?
    • How many load/stores were used?
    • How many cycles were counted?
    • What is the IPC (# instructions per cycle)?
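Because only one event can be counted per run, the instruction count and the cycle count come from two separate runs of the same program. A minimal sketch of the IPC calculation on those two readings follows; the function name `ipc` is hypothetical and not part of the Pulpino software.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions per cycle from two counter readings.
 * Assumption: `instructions` and `cycles` were read in two separate runs
 * of the same, deterministic program (one event per run). */
double ipc(uint32_t instructions, uint32_t cycles)
{
    assert(cycles != 0); /* zero cycles would indicate a measurement error */
    return (double)instructions / (double)cycles;
}
```

For example, a run with 50 instructions in 100 cycles (made-up numbers) gives an IPC of 0.5.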


Exercise 3 – Benchmarking 1/6

• We will benchmark a simple matrix multiplication:
  sw/apps/sequential_tests/matrixMul8/matrixMul.c
  sw/apps/sequential_tests/matrixMul8/matmul_kernels.c
  – To get quick cycle count feedback, the timer is used:
  – Include "timer.h" and use the functions:
    • reset_timer()   start_timer()
    • stop_timer()    get_time()
• Hardware loops
  – Hardware loops are enabled by default
    • To prevent the use of hardware loops in your application, a flag has to be set: "-mnohwloop"
    • Open ../matrixMul8/CMakeLists.txt and remove the flag "-mnohwloop" to enable hardware loops
    • If you recompile the application, the flags set in CMakeLists.txt will be used for compilation automatically
  – The compiler will generate the following hwloop instructions to produce efficient loops:
    - lp.start   - lp.end
    - lp.count   - lp.counti
    - lp.setup   - lp.setupi

Exercise 3 – Benchmarking 2/6


• Tasks:– Check if hardware loops are generated (in the matrixMul8.read file)

– What speedup do you expect when enabling hardware loops?

– How many instructions are actually saved? Compare the matrixMul8.read with and w/o hwloops.

– Do your measurements match your estimations?

– How do your results change if you set N, M to a constant? (in matMul8() )
  • int M = SIZE;
  • int N = SIZE;

                                  Execution time (# cycles / % improvement)   Codesize [B]
Baseline                          -
Hardware loop (2 register set)
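To make the constant-bounds question concrete, here is a minimal sketch of a baseline 8-bit matrix multiplication whose loop bounds are all the compile-time constant SIZE. It is not the code from matmul_kernels.c: the name `matmul8_ref`, the value of SIZE, the flat row-major layout, and the 32-bit accumulator are assumptions made for illustration. With constant bounds the trip counts are known at compile time, which may let the compiler pick the immediate forms of the hardware loop setup.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 4 /* hypothetical value; the exercise defines its own SIZE */

/* C = A * B for SIZE x SIZE matrices stored row-major.
 * Inputs are 8-bit, the accumulator is 32-bit (assumption). */
void matmul8_ref(const signed char *A, const signed char *B, int *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            int acc = 0;
            /* innermost loop: one multiply-accumulate per iteration */
            for (int k = 0; k < SIZE; k++)
                acc += A[i * SIZE + k] * B[k * SIZE + j];
            C[i * SIZE + j] = acc;
        }
    }
}
```

Comparing the .read file of such a kernel with and without "-mnohwloop" shows which loop-control instructions the hardware loops replace.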


Exercise 3 – Benchmarking 3/6

• Post increment immediate:
  – Activated by default!
  – Deactivate with -mnopostmod
• Post increment register:
  – From a hardware perspective, what is the drawback of this instruction?
• Multiply-accumulate instruction:
  – Old architecture:
    • Accumulation register stored in a special register
    • Accumulation result can be accessed in two cycles
  – New architecture:
    • Enabled by default!
    • Accumulates directly on the register file
    • Disable with -mnomac

(Figures: Old MAC, New MAC)

Exercise 3 – Benchmarking 4/6

• Vector Instructions:
  – Add, sub, comparisons are all supported in vector mode
  – It is possible to process in parallel:
    • One word
    • Two halfwords, or
    • Four bytes
  – Check in the matrixMul.read if vector code is generated. Vector instructions have the format:
    • lv.{sub,add,dotp,…}
• Tasks:
  – Run the matrixMul application with the different compiler options
    1. no extensions: "-mnohwloop -mnopostmod -mnomac"
    2. with hardware loops: "-mnopostmod -mnomac"
    3. with post increment: "-mnomac"
    4. with register mac: "."
  – Summarize your results in the first table on the next page

Exercise 3 – Benchmarking 5/6

• Use constant values for N, M to get a fair comparison
• What can be done better?
  – Try to improve the matrix multiplication by using dot product operations (see next slide)

                  Instructions             Cycles                 Codesize
                  Total   Reduction [%]    Total   Speedup [%]    [B]
Baseline                  -                        -
+Hardware-loop
+Post increment
+mac
+Dot product
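The relative columns of the table can be filled in with two small helpers. This is a sketch under an assumption: the slides do not define "Reduction [%]" or "Speedup", so the percentage is taken here as the relative saving against the baseline and the speedup as a plain ratio. Both function names are hypothetical.

```c
#include <assert.h>

/* Relative saving in percent: 100 * (baseline - optimized) / baseline. */
double reduction_percent(double baseline, double optimized)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - optimized) / baseline;
}

/* Speedup as a factor: baseline / optimized (2.0 means twice as fast). */
double speedup_factor(double baseline, double optimized)
{
    assert(optimized > 0.0);
    return baseline / optimized;
}
```

With made-up numbers, going from 200 to 150 instructions is a 25 % reduction, and going from 200 to 100 cycles is a speedup factor of 2.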


Exercise 3 – Benchmarking 6/6

• In order to speed up the multiplication with dot products, we first transpose matrix B (this leads to more efficient access patterns when loading vectors in the multiplication)

• In the second step we can load vectors of 4 chars, and use the Dot-product and Sum of Dot-product instruction to compute one output pixel

• How many cycles are required to compute one output pixel?
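The two steps can be modelled on a host machine in plain C. This is a sketch, not the exercise solution: `dotp4` and `sdotp4` are hypothetical stand-ins for the Dot-product and Sum of Dot-product instructions (assumed to compute a 4-element dot product, and an accumulator plus a 4-element dot product), and MAT_N is an arbitrary size that must be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

#define MAT_N 8 /* hypothetical matrix size, must be a multiple of 4 */

/* Model of a 4 x 8-bit dot product (assumed semantics). */
static int dotp4(const signed char *a, const signed char *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Model of the sum of dot product: accumulator + 4-element dot product. */
static int sdotp4(int acc, const signed char *a, const signed char *b)
{
    return acc + dotp4(a, b);
}

void matmul8_dotp(const signed char *A, const signed char *B, int *C)
{
    signed char Bt[MAT_N * MAT_N];
    assert(A != NULL && B != NULL && C != NULL);

    /* Step 1: transpose B so that each column of B is contiguous in memory */
    for (int k = 0; k < MAT_N; k++)
        for (int j = 0; j < MAT_N; j++)
            Bt[j * MAT_N + k] = B[k * MAT_N + j];

    /* Step 2: each output is one dotp4 followed by sdotp4 on 4-byte vectors */
    for (int i = 0; i < MAT_N; i++) {
        for (int j = 0; j < MAT_N; j++) {
            const signed char *row = &A[i * MAT_N];
            const signed char *col = &Bt[j * MAT_N];
            int acc = dotp4(row, col);
            for (int k = 4; k < MAT_N; k += 4)
                acc = sdotp4(acc, row + k, col + k);
            C[i * MAT_N + j] = acc;
        }
    }
}
```

In this model each output needs MAT_N/4 dot-product operations, so the vector loads and the loop overhead decide how many cycles one output actually costs on the core.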


Exercise 4: Efficient Convolutions (1/4)

• Convolutions are important kernels in image processing

• Convolutions are defined as:

• Let us consider a 5x5 window to compute the convolution
  – For each output pixel we need 25 multiplications and 24 additions, or 1 multiplication and 24 mac operations
• The Dot product instruction can do 4 multiplications and 3 additions in a single cycle
  – Hence, ceil(25/4) = 7 instructions are sufficient: 1 Dot Product and 6 Sum of Dot Product instructions
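The instruction count can be checked with a host-side model of one output pixel. This is a sketch under assumptions: `dotp4` is a hypothetical stand-in for the Dot-product instruction, the 25 window values and 25 kernel weights are given in the same order, and both are padded with zeros to 28 entries so that they split into 7 groups of 4.

```c
#include <assert.h>
#include <string.h>

#define TAPS   25 /* 5x5 window */
#define PADDED 28 /* next multiple of 4; the 3 extra lanes stay zero */

/* Model of a 4 x 8-bit dot product (assumed semantics). */
static int dotp4(const signed char *a, const signed char *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* One output pixel: weighted sum of 25 window values. */
int conv5x5_pixel_model(const signed char *window, const signed char *kernel)
{
    signed char w[PADDED] = {0};
    signed char k[PADDED] = {0};
    assert(window != NULL && kernel != NULL);
    memcpy(w, window, TAPS);
    memcpy(k, kernel, TAPS);

    int acc = dotp4(w, k);              /* 1 dot product */
    for (int i = 4; i < PADDED; i += 4) /* 6 sum of dot products */
        acc += dotp4(w + i, k + i);
    return acc;
}
```

The zero padding wastes 3 of the 28 multiplier lanes, which is why 7 instructions are needed rather than 25/4 = 6.25.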


Exercise 4: Efficient Convolutions (2/4)

• Look at the code given in (appname = convolution) "apps/sequential_tests/convolution/conv_kernels.c"
• The 5x5 convolution exists in 2 versions
  – conv5x5_Byte() and conv5x5_Scalar()
  – Check the difference in execution time
• In order to keep the complexity under control we will now look at a 3x3 kernel
  – The scalar version conv3x3_Scalar() is already functional
  – The vector version conv3x3_Byte() needs to be completed
• Task:
  – Compare the two 5x5 convolution kernels
  – Complete the 3x3 convolution kernel (see also next slide)

Exercise 4: Efficient Convolutions (3/4)

• The idea of the vector 3x3 convolution is:
  1. Load vectors instead of bytes
  2. Process one output pixel in each iteration
  3. Use Dotp to maximize the throughput
• For each vertical column of the image:
  – Initialize the vectors V1, V2
  – In each iteration:
    • Move V1 -> V0
    • Move V2 -> V1
    • Load V2 (fresh data)
    • Compute the convolution with three dot product instructions
    • Move kernel 1 pixel down

• Switch to next vertical column
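The column-wise scheme can be tried out on a host machine before writing it with vector instructions. The following is a sketch of the data flow only, not the solution for conv3x3_Byte(): the `v4s` type, `load_v4`, `dotp4`, the function name, and the row padding requirement (`stride >= w + 1`, so that a 4-byte load never leaves the buffer) are all assumptions. The fourth lane of each kernel row is zero, so the extra byte that each vector load brings in does not contribute.

```c
#include <assert.h>
#include <string.h>

/* Model of a register holding four 8-bit values. */
typedef struct { signed char b[4]; } v4s;

static v4s load_v4(const signed char *p)
{
    v4s v;
    memcpy(v.b, p, 4);
    return v;
}

/* Model of a 4 x 8-bit dot product (assumed semantics). */
static int dotp4(v4s a, v4s k)
{
    return a.b[0] * k.b[0] + a.b[1] * k.b[1] + a.b[2] * k.b[2] + a.b[3] * k.b[3];
}

/* img: h rows of `stride` bytes, w valid pixels per row.
 * out: (h-2) rows of (w-2) results, row-major. */
void conv3x3_vec_model(const signed char *img, int w, int h, int stride,
                       const signed char *kernel, int *out)
{
    assert(w >= 3 && h >= 3 && stride >= w + 1);
    v4s k0 = {{kernel[0], kernel[1], kernel[2], 0}};
    v4s k1 = {{kernel[3], kernel[4], kernel[5], 0}};
    v4s k2 = {{kernel[6], kernel[7], kernel[8], 0}};

    for (int x = 0; x + 2 < w; x++) {           /* one vertical column */
        v4s v0;
        v4s v1 = load_v4(&img[0 * stride + x]); /* initialize V1, V2 */
        v4s v2 = load_v4(&img[1 * stride + x]);
        for (int y = 0; y + 2 < h; y++) {       /* one iteration per output pixel */
            v0 = v1;                            /* move V1 -> V0 */
            v1 = v2;                            /* move V2 -> V1 */
            v2 = load_v4(&img[(y + 2) * stride + x]); /* load V2 (fresh data) */
            out[y * (w - 2) + x] =
                dotp4(v0, k0) + dotp4(v1, k1) + dotp4(v2, k2);
        }
    }
}
```

Each iteration issues a single vector load instead of nine byte loads, which is where the reduction in load operations is expected to come from.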


Exercise 4: Efficient Convolutions (4/4)

• Tasks:
  – What speedup do you expect?
  – Complete the table using the performance counters
  – How many cycles are required to compute one output pixel?

                        Total instructions       Cycles                 Load operations
                        Total   Reduction [x]    Total   Speedup [x]    Total   Reduction [x]
5x5: w/o dot product            1                        1                      1
5x5: With dot product
3x3: w/o dot product            1                        1                      1
3x3: With dot product

• Discuss your results with an assistant

Exercise 5: Motion detection with fast convolution (1/3)

• In this exercise we will focus again on the motion detection algorithm.
  – "apps/sequential_tests/motion_detection/motion_detection.c"
• The algorithm performs several image processing steps:
  – Dilation
  – Erosion
  – Convolution
  – Etc.
• The computationally heaviest part is the convolution
  – It uses a 3x3 convolution with a Sobel filter
• Datatypes are shorts (not bytes!)

Exercise 5: Motion detection with fast convolution (2/3)

• Tasks:
  – Modify the convolution of exercise 4 in order to work with shorts
  – See "conv_fast.c"
• Hints:
  – Define 5 vectors V0-V4
  – Initialize V1-V4 in the beginning of a new column
  – Use the shuffle instruction to combine V3 and V4 into V3

Exercise 5: Motion detection with fast convolution (3/3)

• Tasks:
  – Complete the table below (use performance counters to get the instructions/load operations)
  – How do you expect your performance to change if you increase the image size?
    • You can include the header img_40_40.h to see the difference
    • Runtime will increase! Make sure debug outputs are deactivated!

                          Total instructions       Cycles                 Load operations
                          Total   Reduction [%]    Total   Speedup [%]    Total   Reduction [%]
10x10: w/o dot product            -                        -                      -
10x10: With dot product
40x40: w/o dot product            -                        -                      -
40x40: With dot product

Exercise 6 – RISC-V compressed Instructions (1/2)

• In this exercise we are going to use the new RISC-V core
  – Not all instructions have been ported yet
  – The core supports 32-bit and compressed 16-bit instructions
• Create a build folder for RISC-V:
  $ mkdir /scratch/soc_xx/build_riscv
  $ cd /scratch/soc_xx/build_riscv
• Configure the build folder:
  $ cp ~soc_master/2_OpenRISC/pulpino/sw/cmake_configure.riscv.gcc.sh .
  In the configure script, set the path to your exercise folder:
  PULP_GIT_DIRECTORY="/home/soc_xx/2_OpenRISC/pulpino"
  $ riscv -g2.2.8 ./cmake_configure.riscv.gcc.sh
• To switch between compressed and uncompressed instructions, set the RVC flag
  – Set RVC=1 in cmake_configure.riscv.gcc.sh to enable compressed instructions
  – Source the configure script again
• Compile the RTL:
  $ make vcompile : compiles Pulpino with the RISC-V core

Exercise 6 – RISC-V compressed Instructions (2/2)

• Coremark is a core comparison benchmark
  – Independent of frequency
  – Coremark/MHz score = 10^6 / (#ticks)
  – The higher the better
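The score formula is a one-liner in C. In this sketch `ticks` is assumed to mean the number of clock cycles needed for one Coremark iteration; under that reading, iterations per second divided by the clock frequency in MHz reduces to 10^6 / ticks, which is why the score does not depend on the frequency. The function name is hypothetical.

```c
#include <assert.h>

/* Coremark/MHz as given on the slide: 10^6 / ticks.
 * Assumption: `ticks` is the cycle count of one Coremark iteration. */
double coremark_per_mhz(double ticks)
{
    assert(ticks > 0.0);
    return 1.0e6 / ticks;
}
```

As a made-up example, 500000 ticks per iteration would correspond to a score of 2.0 Coremark/MHz.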

• Tasks:
  – Run coremark on RISC-V and compute the score (make coremark.vsimc)
  – Run coremark with compressed instructions
  – Go to the ARM homepage and compare the Cortex M0 and Cortex M4 scores to your results

             RISC-V           RISC-V (Compressed)   Cortex M0   Cortex M4
             Score   Size     Score   Size          Score       Score

• You have successfully completed the exercise

• After the exercise, you can find sample solutions under:
  ~soc_master/2_OpenRISC/solutions
• If you are interested in a mini-project we can offer you:
  – Implement a program on Pulpino (e.g. a game)
    • Use the LCD display of the Zedboard
  – Implementation and optimization of a benchmark using the multicore pulp environment
    • See last exercise about the pulp architecture
  – RISC-V core architecture development. Analysis of:
    • Mini core
    • VLIW architecture
  – We are open to your own ideas!

– We are open to your own ideas!

Questions & Answers