![Page 1: EE382M.20 SOC Design HW Accelerators and Co-Processorsusers.ece.utexas.edu/~gerstl/ee382m...CoProcessors.pdf–Optional Vectra LX DSP registers §General Purpose AR Register File –32](https://reader030.fdocuments.net/reader030/viewer/2022040901/5e70bb12f0f41665443576c1/html5/thumbnails/1.jpg)
EE382M.20 SOC Design
HW Accelerators and
Co-Processors
Mark McDermott
Fall 2018
Motivation for HW Acceleration
§ OPs/$ or OPs/Joule
– Exploit problem-specific parallelism at the thread and instruction level
– Custom operational units or “instructions” match the set of operations needed for the algorithm (replace multiple instructions with one), custom word-width arithmetic, etc.
– Remove overhead of instruction storage and fetch, ALU multiplexing
10/5/18
Wawrzynek, 2013
Co-Processors, Reconfigurable Architectures and Custom ISAs
Tightly Coupled Coprocessors
§ Integrated with processor control logic
– Task typically completes in a few cycles
– Small amounts of data
– Processor stalls waiting for the coprocessor
– Communication with coprocessor typically via registers and dedicated control signals
Loosely-Coupled Coprocessors
§ Loosely-Coupled Coprocessors
– Used for larger tasks than tightly-coupled coprocessors
– Task runs in parallel with main processor
– May take many cycles per task
– Large amounts of data that the coprocessor may access independently of the main processor
– May or may not use the standard coprocessor interface
https://www.xilinx.com/support/documentation/application_notes/xapp1170-zynq-hls.pdf
Accelerator Coherency Port (ACP)
§ The accelerator coherency port (ACP) is a 64-bit AXI slave interface on the SCU (Snoop Control Unit) that provides an asynchronous cache-coherent access point directly from the PL (programmable logic) to the Cortex-A9 MPCore processor subsystem.
§ A range of system PL masters can use this interface to access the
caches and the memory subsystem exactly the way the APU
processors do to simplify software, increase overall system
performance, or improve power consumption.
ACP Usage
§ The ACP provides a low latency path between the PS and the
accelerators implemented in the PL when compared with a legacy
cache flushing and loading scheme. Steps that must take place in
an example of a PL-based accelerator are as follows:
1. The CPU prepares input data for the accelerator within its local cache
space.
2. The CPU sends a message to the accelerator using one of the general
purpose AXI master interfaces to the PL.
3. The accelerator fetches the data through the ACP, processes the data, and
returns the result through the ACP.
4. The accelerator sets a flag by writing to a known location to indicate that
the data processing is complete. The processor can poll the status of this flag.
ACP Caveats
§ NOTE: Compared with a tightly-coupled coprocessor, ACP access latencies are relatively long, so ACP is not recommended for fine-grained, instruction-level acceleration.
§ For coarse-grain acceleration such as video frame-level processing, ACP has no clear advantage over traditional memory-mapped PL acceleration: the transaction overhead is small relative to the transaction time, and moving that much data through the ACP might cause undesirable cache thrashing.
§ ACP is therefore best suited to medium-grain acceleration, such as block-level crypto acceleration and video macro-block-level processing.
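The granularity argument can be made concrete with a small model. This is only a sketch: the fixed 100-cycle per-transaction cost and the work sizes are hypothetical numbers chosen for illustration, not ACP measurements.

```python
def overhead_fraction(setup_cycles: int, work_cycles: int) -> float:
    """Fraction of one transaction's total time spent on fixed overhead."""
    return setup_cycles / (setup_cycles + work_cycles)


# Hypothetical fixed cost per transaction (not a measured ACP figure).
SETUP_CYCLES = 100

# Hypothetical amounts of useful work per transaction at each granularity.
GRANULARITIES = {
    "fine (single instruction)": 10,
    "medium (macro-block)": 1_000,
    "coarse (video frame)": 1_000_000,
}

if __name__ == "__main__":
    for name, work in GRANULARITIES.items():
        share = overhead_fraction(SETUP_CYCLES, work)
        print(f"{name:28s} overhead = {share:.2%}")
```

With these assumed numbers the overhead dominates at fine grain and becomes negligible at coarse grain, which is the trade-off the bullets describe.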
Performance-Driven ISA Extensions
§ Adding instructions that do more work per cycle
– Shift-add: replace two instructions with one (e.g., multiply by 5)
– Multiply-add: replace two instructions with one (x := c + a × b)
– Multiply-accumulate: reduce round-off error (s := s + a × b)
– Conditional copy: to avoid some branches (e.g., in if-then-else)
§ Sub-word parallelism (for multimedia applications)
– Intel MMX: multimedia extension; 64-bit registers can hold multiple integer operands
– Intel SSE: streaming SIMD extension; 128-bit registers can hold several floating-point operands
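As a rough software illustration of two of these ideas, the sketch below shows multiply-by-5 as a single shift-add expression and a branch-free conditional copy. The function names and the 32-bit word size are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit machine word


def times5_shift_add(x: int) -> int:
    """Multiply by 5 as (x << 2) + x, truncated to 32 bits."""
    return ((x << 2) + x) & MASK32


def conditional_copy(cond: bool, a: int, b: int) -> int:
    """Select a when cond is true, else b, using a mask instead of a branch."""
    mask = -int(cond) & MASK32          # all 1s if cond, all 0s otherwise
    return (a & mask) | (b & ~mask & MASK32)


if __name__ == "__main__":
    print(times5_shift_add(7))
    print(conditional_copy(True, 3, 9), conditional_copy(False, 3, 9))
```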
Intel MMX ISA Extension
| Class | Instruction | Vector | Op type | Function or results |
|---|---|---|---|---|
| Copy | Register copy | 32 bits | | Integer register ↔ MMX register |
| | Parallel pack | 4, 2 | Saturate | Convert to narrower elements |
| | Parallel unpack low | 8, 4, 2 | | Merge lower halves of 2 vectors |
| | Parallel unpack high | 8, 4, 2 | | Merge upper halves of 2 vectors |
| Arithmetic | Parallel add | 8, 4, 2 | Wrap/Saturate# | Add; inhibit carry at boundaries |
| | Parallel subtract | 8, 4, 2 | Wrap/Saturate# | Subtract with carry inhibition |
| | Parallel multiply low | 4 | | Multiply, keep the 4 low halves |
| | Parallel multiply high | 4 | | Multiply, keep the 4 high halves |
| | Parallel multiply-add | 4 | | Multiply, add adjacent products* |
| | Parallel compare equal | 8, 4, 2 | | All 1s where equal, else all 0s |
| | Parallel compare greater | 8, 4, 2 | | All 1s where greater, else all 0s |
| Shift | Parallel left shift logical | 4, 2, 1 | | Shift left, respect boundaries |
| | Parallel right shift logical | 4, 2, 1 | | Shift right, respect boundaries |
| | Parallel right shift arith | 4, 2 | | Arith shift within each (half)word |
| Logic | Parallel AND | 1 | Bitwise | dest ← (src1) ∧ (src2) |
| | Parallel ANDNOT | 1 | Bitwise | dest ← (src1) ∧ (src2)′ |
| | Parallel OR | 1 | Bitwise | dest ← (src1) ∨ (src2) |
| | Parallel XOR | 1 | Bitwise | dest ← (src1) ⊕ (src2) |
| Memory access | Parallel load MMX reg | 32 or 64 bits | | Address given in integer register |
| | Parallel store MMX reg | 32 or 64 bits | | Address given in integer register |
| Control | Empty FP tag bits | | | Required for compatibility$ |
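To illustrate the "inhibit carry at boundaries" and wrap/saturate entries, here is a plain-Python emulation of a parallel add on eight unsigned bytes packed into one 64-bit value. It is a behavioral sketch only; the helper names are made up and it does not model the actual MMX encodings or the signed variants.

```python
def unpack_bytes(value: int) -> list:
    """Split a 64-bit integer into eight unsigned byte lanes (lane 0 = lowest)."""
    return [(value >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes) -> int:
    """Pack eight byte lanes back into one 64-bit integer."""
    out = 0
    for i, lane in enumerate(lanes):
        out |= (lane & 0xFF) << (8 * i)
    return out


def parallel_add_bytes(a: int, b: int, saturate: bool = False) -> int:
    """Add lane by lane; a carry out of one lane never reaches the next."""
    result = []
    for x, y in zip(unpack_bytes(a), unpack_bytes(b)):
        total = x + y
        # Wrap drops the carry-out; saturate clamps to the largest byte value.
        result.append(min(total, 0xFF) if saturate else total & 0xFF)
    return pack_bytes(result)


if __name__ == "__main__":
    print(hex(parallel_add_bytes(0x01F0, 0x0020)))
    print(hex(parallel_add_bytes(0x01F0, 0x0020, saturate=True)))
```

In the example, lane 0 overflows (0xF0 + 0x20) while lane 1 stays at 0x01 in both modes, showing that the carry is not propagated across the lane boundary.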
MMX Multiplication and Multiply-Add
Figure: (a) Parallel multiply low: four element-wise products, of which the low halves are kept. (b) Parallel multiply-add: four element-wise products, with adjacent products added in pairs (s + t, u + v).
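The two operations in the figure can be emulated on lists of lanes. This sketch assumes four signed 16-bit lanes per operand and 32-bit sums for the multiply-add; the function names are illustrative, not instruction mnemonics.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def parallel_multiply_low(a_lanes, b_lanes):
    """Multiply lane by lane and keep only the low 16 bits of each product."""
    return [(to_signed(x, 16) * to_signed(y, 16)) & 0xFFFF
            for x, y in zip(a_lanes, b_lanes)]


def parallel_multiply_add(a_lanes, b_lanes):
    """Multiply four lanes, then add adjacent products into two 32-bit sums."""
    products = [to_signed(x, 16) * to_signed(y, 16)
                for x, y in zip(a_lanes, b_lanes)]
    return [to_signed(products[0] + products[1], 32),
            to_signed(products[2] + products[3], 32)]


if __name__ == "__main__":
    print(parallel_multiply_low([300, 2, 0, 0], [300, 3, 0, 0]))
    print(parallel_multiply_add([1, 2, 3, 4], [5, 6, 7, 8]))
```

The multiply-add form is the building block of dot products, which is why it appears in multimedia extensions.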
MMX Parallel Comparisons
Figure: (a) Parallel compare equal: each result element is 255 (all 1s) where the operand elements are equal, else 0. (b) Parallel compare greater: each result element is 65 535 (all 1s) where the first operand element is greater, else 0.
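A minimal emulation of the two comparisons, assuming unsigned lane values and a configurable lane width (8 bits gives 255 for "all 1s", 16 bits gives 65 535). The function names are hypothetical.

```python
def parallel_compare_equal(a_lanes, b_lanes, bits: int = 8):
    """Return an all-1s mask in each lane where the elements are equal."""
    all_ones = (1 << bits) - 1
    return [all_ones if x == y else 0 for x, y in zip(a_lanes, b_lanes)]


def parallel_compare_greater(a_lanes, b_lanes, bits: int = 16):
    """Return an all-1s mask in each lane where a's element exceeds b's."""
    all_ones = (1 << bits) - 1
    return [all_ones if x > y else 0 for x, y in zip(a_lanes, b_lanes)]


if __name__ == "__main__":
    print(parallel_compare_equal([5, 12, 3], [5, 1, 3]))
    print(parallel_compare_greater([5, 12], [5, 1]))
```

The resulting masks can feed the parallel AND/ANDNOT/OR operations to select elements without branching.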
Custom ISA: HC12 Fuzzy Logic Acceleration
§ Native fuzzy instructions:
– MEM: evaluate membership functions
– REV: rule evaluation (IF a is x THEN b is y)
– WAV: weighted averaging
§ Additional related instructions
– MINA (place smaller of two unsigned 8-bit values in accumulator A)
– EMIND (place smaller of two unsigned 16-bit values in accumulator D)
– MAXM (place larger of two unsigned 8-bit values in memory)
– EMAXM (place larger of two unsigned 16-bit values in memory)
– TBL (table lookup and interpolate)
– ETBL (extended table lookup and interpolate)
– EMACS (extended multiply and accumulate signed 16-bit by 16-bit to 32-bit)
– EDIV (extended divide)
15,000 times faster than HC11
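The kind of work these instructions speed up can be sketched in software. The code below is a generic min/max fuzzy-inference fragment with weighted-average defuzzification; it is not a model of the HC12 instruction semantics, and the function names and sample values are assumptions.

```python
def rule_strength(antecedents):
    """Fuzzy AND of the antecedent truth values: take the minimum."""
    return min(antecedents)


def accumulate_output(current: int, strength: int) -> int:
    """Fuzzy OR into a rule output: keep the larger of the two values."""
    return max(current, strength)


def weighted_average(strengths, singletons) -> int:
    """Defuzzify: sum(strength * position) / sum(strength), integer result."""
    numerator = sum(f * s for f, s in zip(strengths, singletons))
    denominator = sum(strengths)
    return numerator // denominator if denominator else 0


if __name__ == "__main__":
    # Hypothetical 8-bit truth values (0..255) and output positions.
    print(rule_strength([200, 64, 128]))
    print(accumulate_output(64, 100))
    print(weighted_average([0, 128, 64], [10, 50, 90]))
```

Each helper is a short loop of compares, multiplies, and a divide, which is the pattern that dedicated min/max, multiply-accumulate, and divide instructions collapse into a few operations.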
Reconfigurable Architectures
Taxonomy of Reconfigurable Architectures
Figure: taxonomy tree of RECONFIGURABLE ARCHITECTURES (R-SOC).
§ Top-level classes: FINE GRAIN (FPGA); MULTI GRANULARITY (Heterogeneous); COARSE GRAIN (Systolic)
§ Sub-classes shown: Processor + Coprocessor; Tile-Based Architecture; Coarse Grain Coprocessor; Fine Grain Coprocessor; Island Topology; Hierarchical Topology; Linear Topology; Hierarchical Topology; Mesh Topology
§ Example groups, as listed in the figure:
– Chameleon, REMARC, Morphosys
– Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC, Tensilica
– Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
– Altera Stratix, Altera Apex, Altera Cyclone
– Systolic Ring, RaPiD, PipeRench
– DART, FPFA
– RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp
– aSoC, E-FPFA
Customizable ISA: Cadence/Tensilica Xtensa
§ 32-bit ALU
§ 1 or 2 Load/Store Model
§ Registers
– 32-bit general purpose register file
– 32-bit program counter
– 16 optional 1-bit Boolean registers
– 16 optional 32-bit floating point registers
– 4 optional 32-bit MAC16 data registers
– Optional Vectra LX DSP registers
§ General Purpose AR Register File
– 32 or 64 registers
– Instructions have access through a “sliding window” of 16 registers. The window can rotate by 4, 8, or 12 registers
– The register window reduces code size by limiting the number of bits for the register address, and it eliminates the need to save and restore register files
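The sliding window can be pictured as modular address arithmetic. The sketch below assumes a window base counted in groups of 4 registers that is added to the 4-bit register number from the instruction; the function names and this exact mapping are assumptions for illustration, not taken from the slide.

```python
def physical_register(window_base: int, reg: int, num_physical: int = 64) -> int:
    """Map a windowed register number (0..15) to a physical AR register."""
    if not 0 <= reg < 16:
        raise ValueError("only 16 registers are visible through the window")
    return (window_base * 4 + reg) % num_physical


def rotate_window(window_base: int, by_registers: int, num_physical: int = 64) -> int:
    """Rotate the window by 4, 8, or 12 registers and return the new base."""
    if by_registers not in (4, 8, 12):
        raise ValueError("window rotates by 4, 8, or 12 registers")
    return (window_base + by_registers // 4) % (num_physical // 4)


if __name__ == "__main__":
    base = 0
    print(physical_register(base, 5))   # register 5 before rotating
    base = rotate_window(base, 8)       # e.g. on a call
    print(physical_register(base, 0))   # callee's register 0
```

Because each instruction only names one of 16 registers, 4 bits per operand are enough even with a 64-entry file, which is where the code-size saving comes from.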
Hardware Acceleration
Common HW Acceleration Applications
§ Graphics
§ Data Compression/Decompression
§ Data Streaming: Audio/Video Encoding/Decoding, Network, I/O
§ Image sensing and processing
§ Logic Simulation
§ Data Encryption: RSA, DES, AES
§ FFT, DCT, EXP, LOG, …
§ Neural Networks
§ Neuromorphic
Decision Tree: When do you use a hardware accelerator?
Can an existing algorithm be implemented using existing ISA?
Can a new algorithm be devised to solve problem using existing ISA?
Can API be modified to expose necessary functionality or make it easier to exploit?
Can the datapath be modified to better support algorithm, without breaking others?
Can ISA be modified to better support algorithm?
Can HW accelerator be added as a co-processor instruction?
(The figure orders these questions along a scale from Easy to Hard.)
Hardware Acceleration
§ Ad hoc interface to controlling processor
– Accelerator registers are memory-mapped
– Bus-based, FIFO, or register data interfaces
– Uses DMA for high speed transfers
§ Typically, the processor transfers data to the accelerator, issues a go command, and then collects result data later.
– Polled or interrupt-based interface
§ Accelerator may have its own path to/from memory
§ Often fixed function but can be microcoded for programmability
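The transfer, go, collect sequence can be sketched against a software stand-in for the accelerator. Everything here is hypothetical: the register names, the squaring "function", and the rule that each status read advances the simulated device by one step.

```python
class FakeAccelerator:
    """Software stand-in for an accelerator with memory-mapped registers."""

    def __init__(self, latency_polls: int = 3):
        self.data_in = []           # models an input buffer
        self.data_out = []          # models an output buffer
        self.done = False           # models a DONE bit in a status register
        self._latency = max(1, latency_polls)
        self._remaining = 0

    def write_go(self) -> None:
        """Models writing the GO bit: start processing the input buffer."""
        self.done = False
        self._remaining = self._latency

    def read_status(self) -> bool:
        """Models a status-register read; advances the simulated device."""
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.data_out = [x * x for x in self.data_in]
                self.done = True
        return self.done


def run_polled(acc: FakeAccelerator, data):
    """Transfer data, issue go, poll until done, then collect the results."""
    acc.data_in = list(data)
    acc.write_go()
    wasted_polls = 0
    while not acc.read_status():
        wasted_polls += 1           # CPU does no useful work here
    return acc.data_out, wasted_polls


if __name__ == "__main__":
    print(run_polled(FakeAccelerator(), [1, 2, 3]))
```

An interrupt-based variant would replace the polling loop with a handler that runs when the device raises its done signal, freeing the CPU in the meantime.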
Hardware Accelerator Topologies
§ Accelerator appears as a device on a bus
§ Accelerator is tightly coupled into the processor memory system
CPU-Accelerator Interface Example
Figure: block diagram spanning the PS and PL, showing the ARM core, slave I/C, DDR controller, DDR, AXI, block RAM, and accelerator, with numbered callouts 1 to 6.
§ AXI
– 32-bit bus
– Access to DRAM data and programmable logic fabric
– Runs at 1/2 CPU frequency
– Big penalty if bus is busy during first attempt to access bus
§ AHB (AMBA Advanced High-performance Bus)
– 64-bit bus
– Runs at CPU clock frequency
– Access to DDR controller to provide addresses to SDRAM

| Bus | First access: read | First access: write | Pipelined access: read | Pipelined access: write | Arbitration |
|---|---|---|---|---|---|
| ARM → I/C | 2 | 2 | 2 | 2 | |
| I/C → AXI | 8 | 8 | 3 | 3 | 5 |
| AHB → DDRC | 4 | 4 | 4 | 4 | |
| DDRC → DRAM | 8 | 9 | 3 | 3 | 5 |
| AXI ↔ BRAM | 20 | 20 | 8 | 8 | 12 |
| BRAM ↔ ACC | 2 | 2 | 2 | 2 | |
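One way to use the per-hop figures from this slide is a simple additive cost model for a burst read. The sketch assumes the table values are cycle counts, that hop costs simply add up with no overlap between hops, and that arbitration is ignored; none of these assumptions is stated on the slide.

```python
# (first read, first write, pipelined read, pipelined write) per hop,
# copied from the slide's table; units are assumed to be cycles.
HOP_COST = {
    "ARM->I/C": (2, 2, 2, 2),
    "I/C->AXI": (8, 8, 3, 3),
    "AHB->DDRC": (4, 4, 4, 4),
    "DDRC->DRAM": (8, 9, 3, 3),
    "AXI<->BRAM": (20, 20, 8, 8),
    "BRAM<->ACC": (2, 2, 2, 2),
}


def burst_read_cost(hops, beats: int) -> int:
    """First access pays the full cost; later beats pay the pipelined cost."""
    if beats < 1:
        raise ValueError("a burst has at least one beat")
    first = sum(HOP_COST[h][0] for h in hops)
    pipelined = sum(HOP_COST[h][2] for h in hops)
    return first + (beats - 1) * pipelined


if __name__ == "__main__":
    path = ["ARM->I/C", "I/C->AXI", "AXI<->BRAM"]
    print(burst_read_cost(path, 1), burst_read_cost(path, 4))
```

Even under this pessimistic model, the gap between first-access and pipelined cost shows why longer bursts amortize the bus penalty.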
Four Programmer's Models of Accelerator Design
§ Base: HW I/F only, no OS service (in simple embedded systems)
§ OS service: accelerator accessed as a user-space memory-mapped I/O device
§ Virtualized device with OS scheduling support
Figure: four stack diagrams, each with an Application above a CPU and Accelerator, one without an OS layer and three with one; two of the OS-based stacks are labeled mmap() and dev() driver.
Hybrid Hardware/Software Execution Model
§ Hardware Accelerator as a Kernel Module (KM)
– Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
– The KM approach enables transparent interchange of software and hardware components
§ Application-level execution model
– Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
– FPGA bitmaps as hardware counterparts to existing software modules
– The same dynamic linking library interfaces and stubs apply to both software and hardware implementations
§ OS resource management
– Services (API) for allocation, partial reconfiguration, saving and restoring the status, and monitoring
– Multiprogramming scheduler can pre-fetch hardware accelerators in time for next use
– Control the access to the new hardware to ensure trust under private or shared use
Figure labels: Compile Time, User Runtime, Kernel Runtime; source code (user-level function or device driver); compiler analysis/transformations; synthesis; human-designed hardware; soft object; hard object; application; DLL; Linker/Loader; Linux OS; OS modules; resource manager; CPU; FPGA accelerators; memory; devices.
Hardware Accelerator Interface: Interrupts or Polling?
§ Polling interfaces usually require the processor to read a memory-mapped register to determine the state of the accelerator.
– Can the accelerator accept new input data?
– Is the accelerator done with its current task?
– Has the accelerator generated an error condition?
§ Polling interfaces offer minimal latency between the setting of a condition on the accelerator and its discovery by the controlling processor.
– But the processor isn’t doing useful work while it polls…
Hardware Accelerator Interface: Interrupts or Polling?
§ Interrupt-based interfaces allow the accelerator to signal conditions to the controlling processor.
– Interrupt latency is longer than is achievable via the polling method.
– But the processor can more easily proceed with other work while the accelerator is busy with a task.
§ Interrupts are more efficient for coarse-grained parallelism (i.e., larger tasks with looser and less frequent synchronization requirements)
§ Interrupts may not work for real-time control tasks with tight schedules
Zedboard Interrupt latency measurement results
Figure: interrupt latency in microseconds (log scale, 1 to 100,000) versus number of samples (0 to 3000; NOTE: each sample is 10,000 interrupts), with MAX and MIN traces. 3 Million Samples. MAX: 43333 μsecs (Note: heavy CPU load).
Typical CPU → Accelerator Transaction
Figure: timeline (Time →) with three lanes: Application, Operating System, Hardware.
§ Application
– open(/dev/accel); /* only once */
– … /* construct macroblocks */ macroblock = …
– syscall(&macroblock, num_blocks)
– … /* macroblock now has transformed data */ …
§ Operating System
– Enable accelerator access for application
– Data copy
– Flush cache range
– Set up DMA transfer
– Poll
– Set up DMA transfer
– Invalidate cache range
– Data copy
§ Hardware
– ARM, memory, AXI, DMA controller, accelerator (executing)
Device Driver Access Cost
Accelerator Speedup
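§ Assume loop is executed n times.
§ Compare accelerated system to non-accelerated system:
$\text{Speedup} = n\,(t_{CPU} - t_{accel}) = n\,\bigl(t_{CPU} - (t_{in} + t_{exec} + t_{out})\bigr)$
– Here $t_{accel} = t_{in} + t_{exec} + t_{out}$ (input transfer, accelerator execution, output transfer), so "Speedup" is the total time saved over n iterations, not a ratio.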
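The speedup expression on this slide, n(tCPU − (tin + texec + tout)), is easy to evaluate numerically. The timing values in the sketch are hypothetical; the function name is also an assumption.

```python
def time_saved(n: int, t_cpu: float, t_in: float, t_exec: float, t_out: float) -> float:
    """Total time saved over n loop iterations: n * (t_cpu - t_accel)."""
    t_accel = t_in + t_exec + t_out   # transfer in + execute + transfer out
    return n * (t_cpu - t_accel)


if __name__ == "__main__":
    # Hypothetical per-iteration times, all in the same unit.
    print(time_saved(n=100, t_cpu=50, t_in=5, t_exec=10, t_out=5))
    # If transfers dominate, the result goes negative: the accelerator loses.
    print(time_saved(n=100, t_cpu=15, t_in=5, t_exec=10, t_out=5))
```

The second call shows that a fast accelerator can still lose once the data transfer times are counted.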
Single-threaded vs Multi-threading
§ One critical factor is the available parallelism in the application:
– Single-threaded/blocking: CPU waits for the accelerator call to complete
– Multithreaded/non-blocking: CPU continues to execute along with the accelerator
§ For multithreading, the CPU must have useful work to do while the accelerator performs its task.
– Software environment must also support multithreading.
G. Khan
Determining total execution time
§ Single-threaded CPU: count execution time of all component processes
§ Multi-threaded CPU: find longest-path execution time of component processes
§ Example (figure: processes P1 to P4 on the CPU and task A1 on the accelerator): P2 and P3 are independent; after P1, the CPU starts P3; P2 depends on A1
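The two rules can be compared on the example graph. In this sketch the task times are hypothetical, and the edges P1 → A1, P1 → P3, and P2, P3 → P4 are assumptions read into the figure; only "P2 depends on A1" and "after P1, CPU starts P3" come from the slide. The longest-path rule also ignores contention between P2 and P3 for the CPU.

```python
# Hypothetical execution times for each task.
TIMES = {"P1": 4, "A1": 6, "P2": 3, "P3": 5, "P4": 2}

# Assumed dependencies: each task lists the tasks it must wait for.
DEPS = {"P1": [], "A1": ["P1"], "P3": ["P1"], "P2": ["A1"], "P4": ["P2", "P3"]}


def single_threaded_total(times) -> int:
    """Blocking CPU: every task, including A1, runs one after another."""
    return sum(times.values())


def longest_path(times, deps) -> int:
    """Non-blocking CPU: total time is the longest dependency chain."""
    finish = {}

    def finish_time(task):
        if task not in finish:
            start = max((finish_time(d) for d in deps[task]), default=0)
            finish[task] = start + times[task]
        return finish[task]

    return max(finish_time(task) for task in times)


if __name__ == "__main__":
    print(single_threaded_total(TIMES), longest_path(TIMES, DEPS))
```

With these numbers P3 runs on the CPU while A1 runs on the accelerator, so the multi-threaded total is set by the chain P1, A1, P2, P4.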
Caching Issues with Accelerators
§ Main memory provides the primary data transfer mechanism to the accelerator.
§ Programs must ensure that caching does not invalidate main memory data.
– CPU reads location S.
– Accelerator writes location S.
– CPU writes location S.
• BAD: the copy of S in the CPU cache is stale after the accelerator's write, so the program does not see the proper value of S
§ The bus interface may provide mechanisms for accelerators to tell the CPU of required cache changes…
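The hazard can be reproduced with a toy write-back cache. This is a deliberately simplified model (one word per line, no coherence hardware); the class and method names are made up for the example.

```python
class WriteBackCache:
    """Toy CPU cache in front of a dict that stands in for main memory."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}
        self.dirty = set()

    def read(self, addr):
        if addr not in self.lines:            # miss: fill from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: memory is not consulted

    def write(self, addr, value):
        self.lines[addr] = value              # stays in the cache until flushed
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:                # write the CPU's copy back
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.lines.pop(addr, None)            # force the next read to miss
        self.dirty.discard(addr)


def stale_read_demo():
    memory = {"S": 1}
    cache = WriteBackCache(memory)
    cache.read("S")                           # CPU reads S (now cached)
    memory["S"] = 42                          # accelerator writes S in memory
    stale = cache.read("S")                   # CPU still sees the old value
    cache.invalidate("S")                     # software fix: invalidate first
    fresh = cache.read("S")
    return stale, fresh


if __name__ == "__main__":
    print(stale_read_demo())
```

Flushing before the accelerator reads and invalidating before the CPU reads results back is the software discipline that keeps the two views consistent when no coherent port is used.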
Synchronization and Memory
§ As with caches, writes to shared main memory by both the CPU and the accelerator may cause invalidation (memory incoherence).
– CPU reads location S
– Accelerator writes S
– CPU writes S
§ Many CPU buses implement test-and-set atomic operations that the accelerator can use to implement a semaphore. This can serve as a highly efficient means of synchronizing inter-process communication (IPC).
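A test-and-set semaphore can be sketched with threads standing in for the CPU and the accelerator. The internal lock below only models the atomicity that the bus would provide in hardware; the class and function names are assumptions.

```python
import threading
import time


class TestAndSetCell:
    """One shared word with an atomic test-and-set operation."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()   # stands in for bus-level atomicity

    def test_and_set(self) -> int:
        """Atomically set the word to 1 and return its previous value."""
        with self._atomic:
            old = self._value
            self._value = 1
            return old

    def clear(self) -> None:
        with self._atomic:
            self._value = 0


def acquire(cell: TestAndSetCell) -> None:
    while cell.test_and_set() == 1:       # someone else holds the semaphore
        time.sleep(0)                     # yield, then retry


def release(cell: TestAndSetCell) -> None:
    cell.clear()


def run_demo(workers: int = 2, iterations: int = 200) -> int:
    cell = TestAndSetCell()
    shared = {"count": 0}

    def worker():
        for _ in range(iterations):
            acquire(cell)
            shared["count"] += 1          # critical section
            release(cell)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


if __name__ == "__main__":
    print(run_demo())
```

Only the party that reads back 0 from test-and-set owns the shared buffer, so the CPU and accelerator never update it at the same time.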
Logic Simulation Acceleration
Metrics
§ Performance:
– 400–600X faster than SW simulator
– 400K Evaluations/Sec
– I/O speed: 100 MB/sec
§ Simulation algorithm
– 2-pass event-driven, selective trace
• Evaluation pass
• Update pass
§ Supported functions:
– Logic Verification
• Delay assigned per element
• Delay assigned per pin type
• 4 value logic
• 16 value logic
– Rise and Fall delays
– Setup and Hold time analysis
– Minimum pulse width detection
– Worst case analysis
– Wire delay
– Transmission gates
– Fault simulation
– Behavioral simulation
Simulation Processing Memory
768 Bits wide
High Level Block Diagram
Detailed Block Diagram
Final observations
§ 2 hours to compile 64K gate design
– No incremental compile
§ 75 I/O pins
§ 500+ observation points
§ 30 minutes to download compiled descriptors to accelerator
§ 11 seconds to simulate 2000 µSec of sim-time
§ 3-4 hours to unload accelerator data
– Pins & observation points
§ Only marginally faster than SW simulation
– Amdahl’s Law at work…
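The Amdahl's Law remark can be checked with the figures on this slide. The sketch uses the lower end of the 3-4 hour unload range and treats the four phases as strictly sequential, which is an assumption; the function names are illustrative.

```python
# Phase durations in seconds, taken from the slide.
PHASES = {
    "compile": 2 * 3600,
    "download": 30 * 60,
    "simulate": 11,
    "unload": 3 * 3600,     # lower end of the 3-4 hour range
}


def total_seconds(phases) -> int:
    return sum(phases.values())


def fraction(phases, name: str) -> float:
    """Share of the end-to-end run spent in one phase."""
    return phases[name] / total_seconds(phases)


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only a fraction of the run gets faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    share = fraction(PHASES, "simulate")
    print(total_seconds(PHASES), f"{share:.4%}")
    # Making the 11-second phase even 1000x faster barely changes the total.
    print(amdahl_speedup(share, 1000.0))
```

The simulation itself is well under 0.1% of the run, so compile, download, and unload time set the end-to-end result no matter how fast the accelerated phase becomes.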