Hardware-Software Codesign Kermin Fleming Computer Science & Artificial Intelligence Lab

Hardware-Software Codesign Kermin Fleming Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar March 14, 2011 L12-1 http://csg.csail.mit.edu/6.375


Page 1: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Hardware-Software Codesign

Kermin Fleming
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology

Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar

March 14, 2011 L12-1 http://csg.csail.mit.edu/6.375

Page 2: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Hello, world!

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[])
{
    int n = atoi(argv[1]);

    for (int i = 0; i < n; i++) {
        printf("Hello, world!\n");
    }
    return 0;
}

module mkHello#(TOP_LEVEL_WIRES wires);

    CHANNEL_IFC channel <- mkChannel(wires); // has a software counterpart

    Reg#(Bit#(8)) count <- mkReg(0);
    Reg#(Bit#(5)) state <- mkReg(0);

    rule init (count == 0);
        count <- channel.recv();
        state <= 0;
    endrule

    rule hello (count != 0);
        case (state)
            0: channel.send('H');
            1: channel.send('e');
            2: channel.send('l');
            3: channel.send('l');
            ...
            16: count <= count - 1;
        endcase

        if (state != 16)
            state <= state + 1;
        else
            state <= 0;
    endrule
endmodule

Page 3: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Today’s Lecture

Case Study: IMDCT
  Interfacing with HW
  Extracting Parallelism
Automated Solutions
  Bluespec Inc.: SCE-MI
  Intel/MIT: LEAP RRR

Page 4: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Ogg Vorbis Pipeline

Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA.

Input is a stream of compressed bits
Parsed into frame residues and floor “predictions”
The summed frequency results are converted to time-valued sequences
Final frames are windowed to smooth out irregularities

IMDCT takes the most computation

[Figure: decoder pipeline. Bits -> Stream Parser -> Floor Decoder and Residue Decoder -> IMDCT -> Windowing -> PCM Output]

Page 5: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

IMDCT

Array imdct(int N, Array vx)
{
    // preprocessing loop
    for (i = 0; i < N; i++) {
        vin[i]   = convertLo(i, N, vx[i]);
        vin[i+N] = convertHi(i, N, vx[i]);
    }

    // do the IFFT
    vifft = ifft(2*N, vin);

    // postprocessing loop
    for (i = 0; i < N; i++) {
        int idx = bitReverse(i);
        vout[idx] = convertResult(i, N, vifft[i]);
    }
    return vout;
}

Suppose we want to use hardware to accelerate FFT/IFFT computation
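The postprocessing loop scatters results through bitReverse(i), which the slides do not define. The following is a minimal sketch of one common definition, assuming the index is reversed over log2 of the array length. The explicit `bits` parameter and the helper name `bit_reverse_permute` are assumptions made for illustration; they are not from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reverse the low `bits` bits of `i`. Hypothetical helper: the slides call
// bitReverse(i) without showing how the bit width is known.
std::uint32_t bit_reverse(std::uint32_t i, unsigned bits) {
    assert(bits <= 32);
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  // append the current low bit of i to r
        i >>= 1;
    }
    return r;
}

// Scatter `in` into bit-reversed order, mirroring the vout[idx] = ... store
// in the postprocessing loop. The length must be a power of two.
std::vector<int> bit_reverse_permute(const std::vector<int>& in) {
    const std::size_t n = in.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while ((std::size_t{1} << bits) < n) ++bits;  // bits = log2(n)
    std::vector<int> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[bit_reverse(static_cast<std::uint32_t>(i), bits)] = in[i];
    return out;
}
```

For an 8-element input, index 1 (binary 001) lands at position 4 (binary 100), and index 6 (110) lands at position 3 (011).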

Page 6: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

IMDCT

Array imdct(int N, Array vx)
{
    // preprocessing loop
    for (i = 0; i < N; i++) {
        vin[i]   = convertLo(i, N, vx[i]);
        vin[i+N] = convertHi(i, N, vx[i]);
    }

    // call the hardware
    // (replaces the software call: vifft = ifft(2*N, vin);)
    vifft = call_hw(2*N, vin);

    // postprocessing loop
    for (i = 0; i < N; i++) {
        int idx = bitReverse(i);
        vout[idx] = convertResult(i, N, vifft[i]);
    }
    return vout;
}

Implement or find a hardware IFFT
How will the HW/SW communication work?
How do we explore design alternatives?

Page 7: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

HW Accelerator in a system

Communication via bus
  DMA transfer?
Accelerators are all multiplexed on bus
  Possibly introduces conflicts
  Fair sharing of bus bandwidth

[Figure: Software on the CPU connected over the bus (PCI Express) to HW IFFT Accelerator 1 and HW IFFT Accelerator 2]

Page 8: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

The HW Interface

SW calls turn into a set of memory-mapped calls through the bus
Three communication tasks
  Set size of IFFT
  Enter data stream
  Take output out

[Figure: bus (PCI Express) connected to three accelerator ports: setSize, inputData, outputData]
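One way to picture the three communication tasks from the software side is as an abstract interface with one method per task. The sketch below is an illustration only: the method names follow the slide, but the argument types, the `IfftPort` and `LoopbackPort` class names, and the loopback behaviour are assumptions. The loopback class echoes each frame back unchanged so that host code can be exercised without hardware; a real device would return the IFFT of the frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// The three communication tasks named on the slide, as an abstract port.
struct IfftPort {
    virtual ~IfftPort() = default;
    virtual void setSize(std::size_t n) = 0;        // set size of IFFT
    virtual void inputData(std::int32_t word) = 0;  // enter data stream
    virtual std::int32_t outputData() = 0;          // take output out
};

// Software stand-in for the accelerator (hypothetical). Once a full frame of
// `size_` words has arrived, the frame becomes available on the output side.
class LoopbackPort : public IfftPort {
public:
    void setSize(std::size_t n) override { size_ = n; }
    void inputData(std::int32_t word) override {
        pending_.push_back(word);
        if (pending_.size() == size_) {  // a full frame has arrived
            ready_.insert(ready_.end(), pending_.begin(), pending_.end());
            pending_.clear();
        }
    }
    std::int32_t outputData() override {
        assert(!ready_.empty());  // real hardware would make the caller wait
        std::int32_t w = ready_.front();
        ready_.pop_front();
        return w;
    }
    std::size_t available() const { return ready_.size(); }
private:
    std::size_t size_ = 0;
    std::deque<std::int32_t> pending_, ready_;
};
```

Host code written against `IfftPort` can later be pointed at a class whose methods perform the memory-mapped accesses, without changing the caller.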

Page 9: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Data Compatibility Issue

IFFT takes Complex fixed point numbers.
How do we represent such numbers in C and in RTL?

C++

template <typename F, typename I>
struct FixedPt {
    F fract;
    I integer;
};

template <typename T>
struct Complex {
    T rel;
    T img;
};

Verilog

typedef struct {
    bit [31:0] fract;
    bit [31:0] integer;
} FixedPt;

typedef struct {
    FixedPt rel;
    FixedPt img;
} Complex_FixedPt;
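To make the two representations agree, the software side can convert each value to an explicit wire format instead of relying on the compiler's struct layout. The following is a sketch under stated assumptions: the wire format is taken to be four 32-bit words in declaration order (rel.fract, rel.integer, img.fract, img.integer). That ordering, and the names `Sample`, `Wire`, `pack`, `unpack`, and `check_round_trip`, are assumptions for illustration; the hardware side would have to be written to match.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

template <typename F, typename I>
struct FixedPt { F fract; I integer; };

template <typename T>
struct Complex { T rel; T img; };

using Sample = Complex<FixedPt<std::int32_t, std::int32_t>>;

// Assumed wire format: word 0 = rel.fract, 1 = rel.integer,
// 2 = img.fract, 3 = img.integer.
using Wire = std::array<std::uint32_t, 4>;

Wire pack(const Sample& s) {
    return Wire{{ static_cast<std::uint32_t>(s.rel.fract),
                  static_cast<std::uint32_t>(s.rel.integer),
                  static_cast<std::uint32_t>(s.img.fract),
                  static_cast<std::uint32_t>(s.img.integer) }};
}

Sample unpack(const Wire& w) {
    Sample s;
    s.rel.fract   = static_cast<std::int32_t>(w[0]);
    s.rel.integer = static_cast<std::int32_t>(w[1]);
    s.img.fract   = static_cast<std::int32_t>(w[2]);
    s.img.integer = static_cast<std::int32_t>(w[3]);
    return s;
}

// Round-trip check: unpack(pack(s)) must reproduce s field by field.
void check_round_trip(const Sample& s) {
    const Sample t = unpack(pack(s));
    assert(t.rel.fract == s.rel.fract && t.rel.integer == s.rel.integer);
    assert(t.img.fract == s.img.fract && t.img.integer == s.img.integer);
}
```

Because the field order is written out by hand, the result does not depend on padding or member ordering chosen by a particular C++ compiler.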

Page 10: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Data Compatibility

Keeping HW and SW representations consistent is tedious and error-prone
  Issues of endianness (bit and byte)
  Layout changes based on C compiler (gcc vs. icc vs. msvc++)
Some SW representations do not have a natural HW analog
  What is a pointer?
  Do we disallow passing trees and lists directly?
Ideally, translation should be automatically generated

Let us assume that the data compatibility issues have been solved and focus on control issues
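As a small illustration of the byte-endianness point, the sketch below serializes a 32-bit word with shifts rather than by copying memory, so the byte order on the wire is the same on any host. Choosing most-significant-byte-first is an assumption made for the example, as are the function names; the slides do not say which order the hardware expects.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Serialize most-significant byte first, regardless of host byte order.
std::array<std::uint8_t, 4> to_big_endian(std::uint32_t v) {
    return {{ static_cast<std::uint8_t>(v >> 24),
              static_cast<std::uint8_t>(v >> 16),
              static_cast<std::uint8_t>(v >> 8),
              static_cast<std::uint8_t>(v) }};
}

// Inverse of to_big_endian: rebuild the word from its four bytes.
std::uint32_t from_big_endian(const std::array<std::uint8_t, 4>& b) {
    return (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
           (std::uint32_t{b[2]} << 8)  |  std::uint32_t{b[3]};
}
```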

Page 11: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

First Attempt at Acceleration

Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
{
    // preprocessing loop
    for (i = 0; i < N; i++) {
        vin[i]   = convertLo(i, N, vx[i]);
        vin[i+N] = convertHi(i, N, vx[i]);
    }

    // sets size
    pcie_ifc.setSize(2*N);
    // each put sends 1 element
    for (i = 0; i < 2*N; i++)
        pcie_ifc.put(vin[i]);
    // each get receives 1 element
    for (i = 0; i < 2*N; i++)
        vifft[i] = pcie_ifc.get();

    // postprocessing loop
    for (i = 0; i < N; i++) {
        int idx = bitReverse(i);
        vout[idx] = convertResult(i, N, vifft[i]);
    }
    return vout;
}

Software blocks until response exists

Page 12: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Exposing more details

// mem-mapped hw register
volatile int* hw_flag = …
// mem-mapped hw frame buffer
volatile int* fbuffer = …

Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
{
    …
    assert(*hw_flag == IDLE);
    for (cnt = 0; cnt < n; cnt++)
        *(fbuffer + cnt) = frame[cnt];
    *hw_flag = GO;
    while (*hw_flag != IDLE) {;}
    for (cnt = 0; cnt < n*2; cnt++)
        frame[cnt] = *(fbuffer + cnt);
    …
}

What happens if SW has a cache?

Page 13: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Issues

Are the internal hardware conditions exposed correctly by the hw_flag control register?
Blocking SW is problematic:
  Prevents the processor from doing anything while the accelerator is in use
  Hard to pipeline the accelerator
  Does not handle variation in timing well

Page 14: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Driving a Pipelined HW

…
int pid = fork();
if (pid) {
    // producer process
    while (…) {
        …
        for (i = 0; i < 2*N; i++)
            pcie.put(vin[i]);
    }
} else {
    // consumer process
    while (…) {
        for (i = 0; i < 2*N; i++)
            v[i] = pcie.get();
        …
    }
}

Multiple processes exploit pipeline parallelism in the IFFT accelerator.
How does the BSV exert back pressure on the producer thread?
How does the consumer thread exert back pressure on the BSV module?
What if our frames are really large: could the HW begin working before the entire frame is transmitted?
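The back-pressure questions can be made concrete with a bounded FIFO between producer and consumer. The sketch below is an illustration, not the lecture's mechanism: it uses two threads of one process rather than fork(), and a hypothetical `BoundedFifo` class stands in for the link to the accelerator, passing data through unchanged. `put` blocks when the FIFO is full, which stalls the producer, and `get` blocks when it is empty, which stalls the consumer.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Bounded FIFO standing in for the accelerator link (hypothetical).
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : cap_(capacity) {
        assert(capacity > 0);
    }
    void put(int v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < cap_; });  // back pressure on producer
        q_.push(v);
        not_empty_.notify_one();
    }
    int get() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });  // consumer waits for data
        int v = q_.front();
        q_.pop();
        not_full_.notify_one();
        return v;
    }
private:
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<int> q_;
};

// Run producer and consumer concurrently; return values in arrival order.
std::vector<int> run_pipeline(const std::vector<int>& frame, std::size_t capacity) {
    BoundedFifo link(capacity);
    std::vector<int> out;
    std::thread producer([&] { for (int v : frame) link.put(v); });
    std::thread consumer([&] {
        for (std::size_t i = 0; i < frame.size(); ++i) out.push_back(link.get());
    });
    producer.join();
    consumer.join();
    return out;
}
```

With a capacity smaller than the frame, the consumer starts receiving elements before the producer has sent the whole frame, which is the overlap the last question on the slide asks about.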

Page 15: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Data Parallelism 1

…
SyncQueue<Complex<…>> workQ;
int pid = fork();
// both threads do same work
while (…) {
    Complex<FixedPt>* vin = workQ.pop();
    …
    for (i = 0; i < 2*N; i++)
        pcie.put(vin[i]);
    for (i = 0; i < 2*N; i++)
        v[i] = pcie.get();
    …
}

How do we isolate each thread’s use of the HW accelerator?

Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?
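One simple answer to the isolation question is to hold a single lock for the whole put-then-get sequence of a frame, so that elements from two threads cannot interleave inside the accelerator. The sketch below shows that idea only; it is not presented in the lecture. `FakeAccel` is a hypothetical stand-in that returns each value negated, and `ExclusiveAccel` is an assumed wrapper name.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Minimal stand-in for the shared accelerator link (hypothetical): values
// come back in the order sent, negated so the effect is visible.
class FakeAccel {
public:
    void put(int v) { fifo_.push_back(-v); }
    int get() {
        assert(!fifo_.empty());
        int v = fifo_.front();
        fifo_.pop_front();
        return v;
    }
private:
    std::deque<int> fifo_;
};

// Serializes whole-frame transactions on the shared accelerator.
class ExclusiveAccel {
public:
    std::vector<int> transform(const std::vector<int>& frame) {
        std::lock_guard<std::mutex> hold(m_);  // held across all puts and gets
        for (int v : frame) hw_.put(v);
        std::vector<int> out;
        out.reserve(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i) out.push_back(hw_.get());
        return out;
    }
private:
    std::mutex m_;
    FakeAccel hw_;
};
```

In this arrangement a thread would take its work item from the queue first and only then acquire the accelerator lock, never waiting on the queue while holding the lock, which avoids one obvious deadlock. The cost is that the accelerator handles one frame at a time, so pipelining across threads is lost.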

Page 16: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Data Parallelism 2

…
SyncQueue<Complex<…>> workQ;
int pid = fork();
// both threads do same work
while (…) {
    Complex<FixedPt>* vin = workQ.pop();
    …
    for (i = 0; i < 2*N; i++)
        get_hw(pid).put(vin[i]);
    for (i = 0; i < 2*N; i++)
        v[i] = get_hw(pid).get();
    …
}

PCIE get_hw(int pid)
{
    if (pid == 0) return pcieA;
    else return pcieB;
}

By giving each thread its own HW accelerator, we have further increased data parallelism

If the HW is not the bottleneck, this could be a waste of resources.

Do we multiplex the use of the physical bus between the two threads?

Page 17: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Multithreading without threads or processes

int icnt = 0, ocnt = 0;
Complex iframe[sz];
Complex oframe[sz];
…
// IMDCT loop
while (…) {
    …
    // producer “thread”
    for (i = 0; i < 2 && icnt < n; i++)
        if (pcie.can_put())
            pcie.put(iframe[icnt++]);
    // consumer “thread”
    for (i = 0; i < 2 && ocnt < n*2; i++)
        if (pcie.can_get())
            oframe[ocnt++] = pcie.get();
    …
}

Embedded execution environments often have little or no OS support, so multithreading must be emulated in user code

Getting the arbitration right is a complex task

All existing issues are compounded by the complexity of the duplicated state for each “thread”
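The polling loop can be tried out against a software stand-in for the link. The sketch below is a simulation under stated assumptions: `PolledLink` is hypothetical, its `step()` method moves one element from input to output per call to imitate hardware that advances on its own, and the output frame is taken to be the same length as the input. None of these details come from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Stand-in for the slide's pcie object (hypothetical).
class PolledLink {
public:
    explicit PolledLink(std::size_t depth) : depth_(depth) { assert(depth > 0); }
    bool can_put() const { return in_.size() < depth_; }
    bool can_get() const { return !out_.empty(); }
    void put(int v) { assert(can_put()); in_.push_back(v); }
    int get() {
        assert(can_get());
        int v = out_.front();
        out_.pop_front();
        return v;
    }
    // Simulation only: move one element from the input to the output side.
    void step() {
        if (!in_.empty()) {
            out_.push_back(in_.front());
            in_.pop_front();
        }
    }
private:
    std::size_t depth_;
    std::deque<int> in_, out_;
};

// Interleave a producer "thread" and a consumer "thread" in one loop. Each
// gets at most `burst` attempts per iteration so that neither starves.
std::vector<int> drive(PolledLink& link, const std::vector<int>& iframe, int burst = 2) {
    std::vector<int> oframe;
    std::size_t icnt = 0;
    while (oframe.size() < iframe.size()) {
        for (int i = 0; i < burst && icnt < iframe.size(); ++i)
            if (link.can_put()) link.put(iframe[icnt++]);
        link.step();  // real hardware would make progress by itself
        for (int i = 0; i < burst && oframe.size() < iframe.size(); ++i)
            if (link.can_get()) oframe.push_back(link.get());
    }
    return oframe;
}
```

The `burst` limit is the arbitration policy here: raising it for one side favours that side, which is the kind of tuning the slide describes as hard to get right.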

Page 18: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

The message

Writing SW which can safely exploit HW parallelism is difficult…
  Particularly difficult if shared resources (e.g. bus) are involved
There is a need for automated solutions that do a good job

Page 19: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Today’s Lecture

Case Study: IMDCT
  Interfacing with HW
  Extracting Parallelism
Automated Solutions
  Bluespec Inc.: SCE-MI
  Intel/MIT: LEAP RRR

Page 20: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Bluespec Co-design: SCE-MI

Circuit verification is difficult
  Billions of cycles of gate-level simulation
How do we retain cycle accuracy?
  Use SCE-MI

Technology   Cycles/sec*
Bluesim      4K
FPGA         20M
ASIC         60M+

*Target: WiFi Transceiver

Page 21: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

SCE-MI

Use gated clocks to preserve cycle-accuracy
  Circuit internals run at “Model Clock”
  “Model Clock” ticks only when inputs and outputs to the circuit stabilize
Another Co-design problem

Page 22: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Bluespec SCE-MI

Used already in Lab
  With a controlled clock on the FPGA
Bluespec has a rich SCE-MI library
  Get/Put transactors provided
  User provides C++ and HW transactors for exotic interfaces

Page 23: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Intel/MIT: LEAP RRR

Asynchronous Remote Request-Response stack for FPGA
  Uses common Client/Server paradigm
Similar in many respects to Bluespec SCE-MI
  Constrained user interface
  Open, many platforms supported

Page 24: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

http://csg.csail.mit.edu/6.375 March 14, 2009 L12-24

Client/Server interfaces

Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

interface Client #(req_t, resp_t);
    interface Get#(req_t)  request;
    interface Put#(resp_t) response;
endinterface

interface Server #(req_t, resp_t);
    interface Put#(req_t)  request;
    interface Get#(resp_t) response;
endinterface

[Figure: a client and a server connected back to back. The client's get port carries req_t to the server's put port, and the server's get port carries resp_t to the client's put port. Each port has data, ready, and enable signals.]
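The duality between Client and Server can also be sketched in C++. This is an analogy written for illustration, not the BSV library: the `Fifo`, `Server`, `Client`, and `transfer` names and their behaviour are assumptions. A server has requests put into it and responses taken out of it; a client is the mirror image, so one generic function can connect any matching pair.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// A FIFO with a Put side and a Get side.
template <typename T>
class Fifo {
public:
    void put(const T& v) { q_.push_back(v); }
    bool ready() const { return !q_.empty(); }
    T get() {
        assert(ready());
        T v = q_.front();
        q_.pop_front();
        return v;
    }
private:
    std::deque<T> q_;
};

// Server: the outside world Puts requests in and Gets responses out.
template <typename Req, typename Resp>
struct Server {
    std::function<Resp(const Req&)> handler;  // what the server computes
    Fifo<Resp> responses;
    void request_put(const Req& r) { responses.put(handler(r)); }
    Resp response_get() { return responses.get(); }
};

// Client: the dual. The outside world Gets requests out and Puts responses in.
template <typename Req, typename Resp>
struct Client {
    Fifo<Req> requests;
    Fifo<Resp> responses;
    Req request_get() { return requests.get(); }
    void response_put(const Resp& r) { responses.put(r); }
};

// Each Get on one side feeds the matching Put on the other.
template <typename Req, typename Resp>
void transfer(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    while (c.requests.ready()) s.request_put(c.request_get());
    while (s.responses.ready()) c.response_put(s.response_get());
}
```

A client that queues the request 21 and is connected to a doubling server will find 42 in its response FIFO after one call to `transfer`.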

Page 25: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

RRR Specification Language

// ----------------------------------------
// create a new service called ISA_EMULATOR
// ----------------------------------------
service ISA_EMULATOR
{
    // --------------------------------
    // declare services provided by CPU
    // --------------------------------
    server CPU <- FPGA;
    {
        method UpdateRegister(in REG_INDEX, in REG_VALUE);
        method Emulate(in INST_INFO, out INST_ADDR);
    };

    // ---------------------------------
    // declare services provided by FPGA
    // ---------------------------------
    server FPGA <- CPU;
    {
        method SyncRegister(in REG_INDEX, in REG_VALUE);
    };
};

Page 26: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

LEAP Abstraction Layers: RRR

[Figure: layered stack spanning the FPGA and the CPU. Labels: Physical Devices, FPGA Platform, Kernel Driver, Channel IO (FPGA side), Channel IO (CPU side)]

Page 27: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

LEAP Abstraction Layers: RRR

[Figure: the same layered stack (Physical Devices, FPGA Platform, Kernel Driver, Channel IO on the FPGA and CPU sides), now with an RRR Client/Server Manager on each side and a Client Stub and Server Stub above them. The stubs are generated from RRR specification files.]

Page 28: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

LEAP Abstraction Layers: RRR

[Figure: the same layered stack with RRR Client/Server Managers, a Client Stub, and a Server Stub. User Code sits on top of each stub.]

User Code (calling the Client Stub):

ClientStubs.ISA_EMULATOR iemu;
......
iemu.UpdateRegister.Request(REG_R27, regFile[REG_R27]);
......
iemu.Emulate.Request(inst);
......
tgtPC <- iemu.Emulate.Response();

User Code (called by the Server Stub):

ISA_EMULATOR::UpdateRegister(REG_INDEX i, REG_VALUE v)
{
    regFile[i] = v;
}

ISA_EMULATOR::Emulate(INST_INFO inst)
{
    // emulate the instruction

    return target_PC;
}

Page 29: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

LEAP Abstraction Layers: RRR

[Figure: the same layered stack with RRR Client/Server Managers on the FPGA and CPU sides. Several stubs sit on each side, and a single User Application spans the top.]

Page 30: Hardware-Software  Codesign Kermin  Fleming Computer Science & Artificial Intelligence Lab

Conclusion

Writing SW which can safely exploit HW parallelism is difficult…

Several automated tools are available
  Development ongoing