
Hardware-Software Codesign
Kermin Fleming
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology

Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar

March 14, 2011 L12-1 http://csg.csail.mit.edu/6.375

Hello, world!

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char* argv[])
    {
        int n = atoi(argv[1]);

        for (int i = 0; i < n; i++) {
            printf("Hello, world!\n");
        }
        return 0;
    }


    module mkHello#(TOP_LEVEL_WIRES wires);

        CHANNEL_IFC channel <- mkChannel(wires); // has a software counterpart

        Reg#(Bit#(8)) count <- mkReg(0);
        Reg#(Bit#(5)) state <- mkReg(0);

        // wait for software to send the repeat count
        rule init (count == 0);
            let n <- channel.recv();
            count <= n;
            state <= 0;
        endrule

        // send one character per firing; state 16 decrements the remaining count
        rule hello (count != 0);
            case (state)
                0: channel.send('H');
                1: channel.send('e');
                2: channel.send('l');
                3: channel.send('l');
                ...
                16: count <= count - 1;
            endcase

            if (state != 16)
                state <= state + 1;
            else
                state <= 0;
        endrule
    endmodule

Today’s Lecture
Case Study: IMDCT
    Interfacing with HW
    Extracting Parallelism
Automated Solutions
    Bluespec Inc.: SCE-MI
    Intel/MIT: LEAP RRR


Ogg Vorbis Pipeline
Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA.

Input is a stream of compressed bits
Parsed into frame residues and floor “predictions”
The summed frequency results are converted to time-valued sequences
Final frames are windowed to smooth out irregularities

IMDCT takes the most computation

[Figure: decoder pipeline. Bits -> Stream Parser -> Floor Decoder and Residue Decoder -> IMDCT -> Windowing -> PCM Output]


IMDCT

    Array imdct(int N, Array vx)
    {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]   = convertLo(i, N, vx[i]);
            vin[i+N] = convertHi(i, N, vx[i]);
        }

        // do the IFFT
        vifft = ifft(2*N, vin);

        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }

Suppose we want to use hardware to accelerate FFT/IFFT computation


IMDCT

    Array imdct(int N, Array vx)
    {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]   = convertLo(i, N, vx[i]);
            vin[i+N] = convertHi(i, N, vx[i]);
        }

        // call the hardware
        // (replaces the software call: vifft = ifft(2*N, vin);)
        vifft = call_hw(2*N, vin);

        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }

Implement or find a hardware IFFT
How will the HW/SW communication work?
How do we explore design alternatives?


HW Accelerator in a system
Communication via bus
    DMA transfer?
Accelerators are all multiplexed on bus
    Possibly introduces conflicts
    Fair sharing of bus bandwidth

[Figure: Software on the CPU connected over a bus (PCI Express) to HW IFFT Accelerator 1 and HW IFFT Accelerator 2]


The HW Interface
SW calls turn into a set of memory-mapped calls through the bus
Three communication tasks
    Set size of IFFT
    Enter data stream
    Take output out

[Figure: Bus (PCI Express) connected to the accelerator's setSize, inputData, and outputData ports]
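The three tasks above can be pictured as loads and stores to three memory-mapped registers. The following is a minimal sketch under stated assumptions: the register layout `IfftRegs`, the wrapper class `IfftPort`, and its method names are hypothetical, and an ordinary struct in memory stands in for the mapped bus region.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register layout; a real device defines its own offsets.
struct IfftRegs {
    volatile uint32_t size;         // written by setSize
    volatile uint32_t input_data;   // written once per input element
    volatile uint32_t output_data;  // read once per output element
};

// Thin wrapper that turns the three tasks into loads and stores.
class IfftPort {
public:
    explicit IfftPort(IfftRegs* regs) : regs_(regs) { assert(regs != nullptr); }

    void setSize(uint32_t n) { regs_->size = n; }
    void put(uint32_t word)  { regs_->input_data = word; }
    uint32_t get() const     { return regs_->output_data; }

private:
    IfftRegs* regs_;
};
```

On a real platform the pointer passed to `IfftPort` would come from mapping the device's address range, which this sketch does not attempt.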


Data Compatibility Issue
IFFT takes Complex fixed point numbers.
How do we represent such numbers in C and in RTL?

C++

    template <typename F, typename I>
    struct FixedPt {
        F fract;
        I integer;
    };

    template <typename T>
    struct Complex {
        T rel;
        T img;
    };

Verilog

    typedef struct {
        bit [31:0] fract;
        bit [31:0] integer;
    } FixedPt;

    typedef struct {
        FixedPt rel;
        FixedPt img;
    } Complex_FixedPt;

Data Compatibility
Keeping HW and SW representations consistent is tedious and error prone
    Issues of endianness (bit and byte)
    Layout changes based on C compiler (gcc vs. icc vs. msvc++)
Some SW representations do not have a natural HW analog
    What is a pointer?
    Do we disallow passing trees and lists directly?
Ideally translation should be automatically generated
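One way to avoid depending on compiler layout or host byte order is to serialize each field explicitly. The sketch below shows what hand-written (or generated) translation code might look like. The helper names `put_le32`, `get_le32`, `pack`, and `unpack`, the 16-byte wire format, and the choice of little-endian with field order rel.fract, rel.integer, img.fract, img.integer are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <typename F, typename I>
struct FixedPt { F fract; I integer; };

template <typename T>
struct Complex { T rel; T img; };

using Sample = Complex<FixedPt<int32_t, int32_t>>;
using Wire   = std::array<uint8_t, 16>;   // assumed wire format: four 32-bit words

// Store a 32-bit value least-significant byte first, whatever the host order.
inline void put_le32(Wire& w, std::size_t off, uint32_t v) {
    assert(off + 4 <= w.size());
    for (std::size_t i = 0; i < 4; ++i)
        w[off + i] = static_cast<uint8_t>(v >> (8 * i));
}

inline uint32_t get_le32(const Wire& w, std::size_t off) {
    assert(off + 4 <= w.size());
    uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i)
        v |= static_cast<uint32_t>(w[off + i]) << (8 * i);
    return v;
}

// Assumed field order on the wire: rel.fract, rel.integer, img.fract, img.integer.
inline Wire pack(const Sample& s) {
    Wire w{};
    put_le32(w, 0,  static_cast<uint32_t>(s.rel.fract));
    put_le32(w, 4,  static_cast<uint32_t>(s.rel.integer));
    put_le32(w, 8,  static_cast<uint32_t>(s.img.fract));
    put_le32(w, 12, static_cast<uint32_t>(s.img.integer));
    return w;
}

inline Sample unpack(const Wire& w) {
    Sample s;
    s.rel.fract   = static_cast<int32_t>(get_le32(w, 0));
    s.rel.integer = static_cast<int32_t>(get_le32(w, 4));
    s.img.fract   = static_cast<int32_t>(get_le32(w, 8));
    s.img.integer = static_cast<int32_t>(get_le32(w, 12));
    return s;
}
```

The hardware side would have to agree on the same byte and field order, which is exactly the bookkeeping that is tedious to keep in sync by hand.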

Let us assume that data compatibility issues have been solved and focus on control issues


First Attempt at Acceleration

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
    {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]   = convertLo(i, N, vx[i]);
            vin[i+N] = convertHi(i, N, vx[i]);
        }

        pcie_ifc.setSize(2*N);            // sets size
        for (i = 0; i < 2*N; i++)
            pcie_ifc.put(vin[i]);         // sends 1 element
        for (i = 0; i < 2*N; i++)
            vifft[i] = pcie_ifc.get();    // gets 1 element

        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }

Software blocks until response exists


Exposing more details

    // mem-mapped hw register
    volatile int* hw_flag = …
    // mem-mapped hw frame buffer
    volatile int* fbuffer = …

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
    {
        …
        assert(*hw_flag == IDLE);
        for (cnt = 0; cnt < n; cnt++)
            *(fbuffer + cnt) = frame[cnt];
        *hw_flag = GO;
        while (*hw_flag != IDLE) {;}
        for (cnt = 0; cnt < n*2; cnt++)
            frame[cnt] = *(fbuffer + cnt);
        …
    }

What happens if SW has a cache?

Issues
Are the internal hardware conditions exposed correctly by the hw_flag control register?
Blocking SW is problematic:
    Prevents the processor from doing anything while the accelerator is in use
    Hard to pipeline the accelerator
    Does not handle variation in timing well
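One alternative to spinning on the flag is to poll it between units of other work. The sketch below illustrates that idea against a software model, not real hardware: `FakeAccel`, its `step()` method, the fixed latency, and the negate-each-element "computation" are all hypothetical stand-ins. It also does not address the cache question; on a real system the mapped region would typically need to be uncached or explicitly synchronized.

```cpp
#include <cassert>
#include <vector>

enum Flag { IDLE, GO };

// Hypothetical model of the accelerator: after GO is written, a frame takes
// `latency` calls to step() before the flag returns to IDLE.
class FakeAccel {
public:
    explicit FakeAccel(int latency) : latency_(latency) { assert(latency > 0); }

    Flag flag = IDLE;            // models the memory-mapped control register
    std::vector<int> fbuffer;    // models the memory-mapped frame buffer

    // Advance the model by one time step.
    void step() {
        if (flag != GO) return;
        if (remaining_ == 0) remaining_ = latency_;
        if (--remaining_ == 0) {
            for (int& x : fbuffer) x = -x;   // placeholder for the IFFT
            flag = IDLE;
        }
    }

private:
    int latency_;
    int remaining_ = 0;
};

// Start a frame, then do other work until the flag reads IDLE again.
// Returns how many units of other work were completed while waiting.
template <typename Work>
int run_frame(FakeAccel& hw, std::vector<int>& frame, Work other_work) {
    assert(hw.flag == IDLE);
    hw.fbuffer = frame;
    hw.flag = GO;
    int units = 0;
    while (hw.flag != IDLE) {
        other_work();   // useful work instead of an empty spin loop
        ++units;
        hw.step();      // stands in for the real hardware making progress
    }
    frame = hw.fbuffer;
    return units;
}
```

With real hardware the `hw.step()` call disappears, and the amount of other work done per frame varies with the accelerator's timing.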


Driving a Pipelined HW

    …
    int pid = fork();
    if (pid) {
        // producer process
        while (…) {
            …
            for (i = 0; i < 2*N; i++)
                pcie.put(vin[i]);
        }
    } else {
        // consumer process
        while (…) {
            for (i = 0; i < 2*N; i++)
                v[i] = pcie.get();
            …
        }
    }

Multiple processes exploit pipeline parallelism in the IFFT accelerator.
How does the BSV exert back pressure on the producer thread?
How does the consumer thread exert back pressure on the BSV module?
What if our frames are really large? Could the HW begin working before the entire frame is transmitted?
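Back pressure in both directions can be illustrated with a bounded FIFO: a full FIFO stalls the producer, an empty one stalls the consumer. The sketch below is a software analogy only, using threads rather than the forked processes on the slide. `BoundedFifo` and `stream_sum` are hypothetical names, and the FIFO stands in for whatever buffering the real channel provides.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Bounded FIFO: put() blocks while full, get() blocks while empty.
template <typename T>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void put(T v) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }

    T get() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }

private:
    std::size_t capacity_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// Producer and consumer threads joined by a small FIFO.
// Returns the sum of everything the consumer received.
inline long stream_sum(int count, std::size_t capacity) {
    BoundedFifo<int> fifo(capacity);
    long sum = 0;
    std::thread producer([&] { for (int i = 1; i <= count; ++i) fifo.put(i); });
    std::thread consumer([&] { for (int i = 0; i < count; ++i) sum += fifo.get(); });
    producer.join();
    consumer.join();
    return sum;
}
```

Because elements flow one at a time, the consumer can start before the producer has sent a whole frame, which is the software counterpart of the last question above.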


Data Parallelism 1

    …
    SyncQueue<Complex<…>> workQ;
    int pid = fork();
    // both threads do same work
    while (…) {
        Complex<FixedPt>* vin = workQ.pop();
        …
        for (i = 0; i < 2*N; i++)
            pcie.put(vin[i]);
        for (i = 0; i < 2*N; i++)
            v[i] = pcie.get();
        …
    }

How do we isolate each thread’s use of the HW accelerator?

Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?
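One simple way to isolate each thread's use of a shared accelerator is to hold a lock for the whole frame transaction, so elements from different frames never interleave. The sketch below is a hedged illustration with hypothetical names (`SharedAccel`, `worker`, `frames_isolated`, `run_two_workers`); a log of worker ids stands in for the actual `pcie.put` traffic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared device that records which worker sent each element.
struct SharedAccel {
    std::mutex frame_lock;      // held for a whole put/get frame transaction
    std::vector<int> put_log;   // worker id per element, in arrival order
};

// Send `frames` frames of `len` elements each; the lock covers a full frame.
inline void worker(SharedAccel& hw, int id, int frames, int len) {
    for (int f = 0; f < frames; ++f) {
        std::lock_guard<std::mutex> hold(hw.frame_lock);
        for (int i = 0; i < len; ++i)
            hw.put_log.push_back(id);   // stands in for pcie.put(vin[i])
        // the matching pcie.get() loop would run under the same lock
    }
}

// True if every aligned run of `len` log entries came from a single worker.
inline bool frames_isolated(const std::vector<int>& log, std::size_t len) {
    if (len == 0 || log.size() % len != 0) return false;
    for (std::size_t base = 0; base < log.size(); base += len)
        for (std::size_t i = 1; i < len; ++i)
            if (log[base + i] != log[base]) return false;
    return true;
}

// Run two workers against one device and check that their frames stayed apart.
inline bool run_two_workers(int frames, int len) {
    assert(frames > 0 && len > 0);
    SharedAccel hw;
    std::thread a(worker, std::ref(hw), 0, frames, len);
    std::thread b(worker, std::ref(hw), 1, frames, len);
    a.join();
    b.join();
    return hw.put_log.size() == static_cast<std::size_t>(2 * frames * len)
        && frames_isolated(hw.put_log, static_cast<std::size_t>(len));
}
```

In this sketch a worker never holds two locks at once, which sidesteps lock-ordering deadlocks, but it also serializes the accelerator, so the two threads gain no overlap on the hardware itself.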


Data Parallelism 2

    …
    SyncQueue<Complex<…>> workQ;
    int pid = fork();
    // both threads do same work
    while (…) {
        Complex<FixedPt>* vin = workQ.pop();
        …
        for (i = 0; i < 2*N; i++)
            get_hw(pid).put(vin[i]);
        for (i = 0; i < 2*N; i++)
            v[i] = get_hw(pid).get();
        …
    }

    PCIE get_hw(int pid)
    {
        if (pid == 0) return pcieA;
        else return pcieB;
    }

By giving each thread its own HW accelerator, we have further increased data parallelism

If the HW is not the bottleneck this could be a waste of resources.

Do we multiplex the use of the physical BUS between the two threads?


Multithreading without threads or processes

    int icnt = 0, ocnt = 0;
    Complex iframe[sz];
    Complex oframe[sz];
    …
    // IMDCT loop
    while (…) {
        …
        // producer “thread”
        for (i = 0; i < 2 && icnt < n; i++)
            if (pcie.can_put()) pcie.put(iframe[icnt++]);
        // consumer “thread”
        for (i = 0; i < 2 && ocnt < n*2; i++)
            if (pcie.can_get()) oframe[ocnt++] = pcie.get();
        …
    }

Embedded execution environments often have little or no OS support, so multithreading must be emulated in user code

Getting the arbitration right is a complex task

All existing issues are compounded with the complexity of the duplicated states for each “thread”
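The interleaved producer and consumer loops can be exercised against a small software model of the device. This is a sketch under stated assumptions: `FifoDevice`, its `tick()` method, the one-output-per-input behavior, and the add-one "computation" are hypothetical and chosen only to make the loop runnable.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical device model: a small input FIFO and an output FIFO.
class FifoDevice {
public:
    explicit FifoDevice(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    bool can_put() const { return in_.size() < depth_; }
    bool can_get() const { return !out_.empty(); }

    void put(int v) { assert(can_put()); in_.push_back(v); }
    int get() {
        assert(can_get());
        int v = out_.front();
        out_.pop_front();
        return v;
    }

    // One step of hardware progress: process a single queued input.
    void tick() {
        if (in_.empty()) return;
        out_.push_back(in_.front() + 1);   // placeholder for the real computation
        in_.pop_front();
    }

private:
    std::size_t depth_;
    std::deque<int> in_, out_;
};

// Emulate two "threads" in one loop: each gets at most two attempts per pass.
inline std::vector<int> run_interleaved(FifoDevice& hw, const std::vector<int>& iframe) {
    std::vector<int> oframe;
    std::size_t icnt = 0;
    while (oframe.size() < iframe.size()) {
        // producer "thread"
        for (int i = 0; i < 2 && icnt < iframe.size(); ++i)
            if (hw.can_put()) hw.put(iframe[icnt++]);
        // consumer "thread"
        for (int i = 0; i < 2 && oframe.size() < iframe.size(); ++i)
            if (hw.can_get()) oframe.push_back(hw.get());
        hw.tick();   // models time passing in the hardware
    }
    return oframe;
}
```

Even in this toy version, the counters `icnt` and the output size are per-"thread" state that user code must carry by hand, which is the duplicated-state burden noted above.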


The message
Writing SW which can safely exploit HW parallelism is difficult…

Particularly difficult if shared resources (e.g. bus) are involved

Automated solutions that do this job well are needed


Today’s Lecture
Case Study: IMDCT
    Interfacing with HW
    Extracting Parallelism
Automated Solutions
    Bluespec Inc.: SCE-MI
    Intel/MIT: LEAP RRR


Bluespec Co-design: SCE-MI

Circuit verification is difficult
    Billions of cycles of gate-level simulation
How do we retain cycle accuracy?
    Use SCE-MI

Technology    Cycles/sec*
Bluesim       4K
FPGA          20M
ASIC          60M+

*Target: WiFi Transceiver
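To see what these rates mean in practice, the small helper below converts a cycle count into wall-clock time. The function name `seconds_to_run` and the choice of one billion cycles as the workload are assumptions for illustration; the rates are the ones in the table.

```cpp
#include <cassert>

// Seconds of wall-clock time needed to run `cycles` at `cycles_per_sec`.
inline double seconds_to_run(double cycles, double cycles_per_sec) {
    assert(cycles_per_sec > 0.0);
    return cycles / cycles_per_sec;
}
```

For one billion cycles this gives 250,000 seconds (about 2.9 days) at 4K cycles/sec, 50 seconds at 20M cycles/sec, and under 17 seconds at 60M cycles/sec.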

SCE-MI
Use gated clocks to preserve cycle-accuracy
    Circuit internals run at “Model Clock”
    “Model Clock” ticks only when inputs and outputs to the circuit stabilize
Another co-design problem


Bluespec SCE-MI
Used already in Lab
    With a controlled clock on the FPGA
Bluespec has a rich SCE-MI library
    Get/Put transactors provided
    User provides C++ and HW transactors for exotic interfaces


Intel/MIT: LEAP RRR
Asynchronous Remote Request-Response stack for FPGA
    Uses common Client/Server paradigm
Similar in many respects to Bluespec SCE-MI
    Constrained user interface
    Open, many platforms supported



Client/Server interfaces
Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

    interface Client #(req_t, resp_t);
        interface Get#(req_t) request;
        interface Put#(resp_t) response;
    endinterface

    interface Server #(req_t, resp_t);
        interface Put#(req_t) request;
        interface Get#(resp_t) response;
    endinterface

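[Figure: a client and a server connected back to back. The client's get port (req_t) feeds the server's put port, and the server's get port (resp_t) feeds the client's put port; each port has data, enable, and ready signals]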
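The duality between Client and Server can be mimicked in software. The sketch below is a C++ analogy, not BSV library code: the names `Get`, `Put`, `Fifo`, `Client`, `Server`, and `connect_step` are hypothetical, and `connect_step` only loosely mirrors what connecting a client to a server does in hardware.

```cpp
#include <cassert>
#include <deque>

// C++ stand-ins for the Get and Put interfaces.
template <typename T>
struct Get {
    virtual ~Get() = default;
    virtual bool can_get() const = 0;   // a value is available
    virtual T get() = 0;
};

template <typename T>
struct Put {
    virtual ~Put() = default;
    virtual bool can_put() const = 0;   // there is room for a value
    virtual void put(const T& v) = 0;
};

// An unbounded FIFO offers both ends.
template <typename T>
class Fifo : public Get<T>, public Put<T> {
public:
    bool can_get() const override { return !q_.empty(); }
    T get() override {
        assert(can_get());
        T v = q_.front();
        q_.pop_front();
        return v;
    }
    bool can_put() const override { return true; }
    void put(const T& v) override { q_.push_back(v); }

private:
    std::deque<T> q_;
};

// Client: requests come out, responses go in. Server is the dual.
template <typename Req, typename Resp>
struct Client { Get<Req>& request; Put<Resp>& response; };

template <typename Req, typename Resp>
struct Server { Put<Req>& request; Get<Resp>& response; };

// Move at most one request and one response across the link.
template <typename Req, typename Resp>
void connect_step(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    if (c.request.can_get() && s.request.can_put())
        s.request.put(c.request.get());
    if (s.response.can_get() && c.response.can_put())
        c.response.put(s.response.get());
}
```

Because the two interface types are exact mirrors, a single `connect_step` works for any request and response types, which is the convenience the paired interface definitions are meant to provide.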

RRR Specification Language

    // ----------------------------------------
    // create a new service called ISA_EMULATOR
    // ----------------------------------------
    service ISA_EMULATOR
    {
        // --------------------------------
        // declare services provided by CPU
        // --------------------------------
        server CPU <- FPGA;
        {
            method UpdateRegister(in REG_INDEX, in REG_VALUE);
            method Emulate(in INST_INFO, out INST_ADDR);
        };

        // ---------------------------------
        // declare services provided by FPGA
        // ---------------------------------
        server FPGA <- CPU;
        {
            method SyncRegister(in REG_INDEX, in REG_VALUE);
        };
    };


LEAP Abstraction Layers: RRR

[Figure: matching layered stacks on the FPGA and the CPU. Physical Devices with the FPGA Platform and Kernel Driver at the bottom, Channel IO above them, then the RRR Client/Server Manager, and on top the Client Stub and Server Stub generated from RRR specification files]


User Code, client side (uses the Client Stub)

    ClientStubs.ISA_EMULATOR iemu;
    ......
    iemu.UpdateRegister.Request(REG_R27, regFile[REG_R27]);
    ......
    iemu.Emulate.Request(inst);
    ......
    tgtPC <- iemu.Emulate.Response();

User Code, server side (called from the Server Stub)

    ISA_EMULATOR::UpdateRegister(REG_INDEX i, REG_VALUE v)
    {
        regFile[i] = v;
    }

    ISA_EMULATOR::Emulate(INST_INFO inst)
    {
        // emulate the instruction

        return target_PC;
    }
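What a stub pair does can be sketched as marshalling a call into words on a channel and decoding it on the other side. The code below is an illustration only and does not reproduce the code LEAP generates: the `Channel` type, the `MethodId` numbering, the three-word message format, and the function names `UpdateRegister_Request` and `dispatch_one` are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>

using Channel = std::deque<uint32_t>;   // stand-in for Channel IO

enum MethodId : uint32_t { UPDATE_REGISTER = 0 };   // hypothetical numbering

// Client stub: flatten the call into words on the channel.
inline void UpdateRegister_Request(Channel& ch, uint32_t index, uint32_t value) {
    ch.push_back(UPDATE_REGISTER);
    ch.push_back(index);
    ch.push_back(value);
}

// User code on the server side.
struct IsaEmulator {
    std::map<uint32_t, uint32_t> regFile;
    void UpdateRegister(uint32_t i, uint32_t v) { regFile[i] = v; }
};

// Server stub: decode one message and call the matching user method.
// Returns false if the channel does not hold a complete message.
inline bool dispatch_one(Channel& ch, IsaEmulator& emu) {
    if (ch.size() < 3) return false;
    uint32_t method = ch.front(); ch.pop_front();
    assert(method == UPDATE_REGISTER);   // only one method in this sketch
    uint32_t index = ch.front(); ch.pop_front();
    uint32_t value = ch.front(); ch.pop_front();
    emu.UpdateRegister(index, value);
    return true;
}
```

The user code on each side sees only ordinary method calls; the message format stays private to the stubs, which is why it can be generated from a specification file.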



[Figure: on both the FPGA and the CPU, several stubs sit on top of Channel IO and serve the User Application]


Conclusion
Writing SW which can safely exploit HW parallelism is difficult…

Several automated tools are available
    Development ongoing

