Hardware-Software Codesign
Kermin Fleming
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar
March 14, 2011 L12-1 http://csg.csail.mit.edu/6.375

Hello, world!
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[])
{
  int n = atoi(argv[1]);

  for (int i = 0; i < n; i++) {
    printf("Hello, world!\n");
  }
  return 0;
}

module mkHello#(TOP_LEVEL_WIRES wires);
  CHANNEL_IFC channel <- mkChannel(wires); // has a software counterpart
  Reg#(Bit#(8)) count <- mkReg(0);
  Reg#(Bit#(5)) state <- mkReg(0);

  rule init (count == 0);
    count <- channel.recv();
    state <= 0;
  endrule

  rule hello (count != 0);
    case (state)
      0: channel.send('H');
      1: channel.send('e');
      2: channel.send('l');
      3: channel.send('l');
      ...
      16: count <= count - 1;
    endcase
    if (state != 16)
      state <= state + 1;
    else
      state <= 0;
  endrule
endmodule

Today’s Lecture
Case Study: IMDCT
  Interfacing with HW
  Extracting Parallelism
Automated Solutions
  Bluespec Inc.: SCE-MI
  Intel/MIT: LEAP RRR

Ogg Vorbis Pipeline
Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA.
  Input is a stream of compressed bits
  Parsed into frame residues and floor “predictions”
  The summed frequency results are converted to time-valued sequences
  Final frames are windowed to smooth out irregularities
IMDCT takes the most computation
[Figure: decoder pipeline. Bits -> Stream Parser -> Floor Decoder and Residue Decoder -> IMDCT -> Windowing -> PCM Output]

IMDCT
Array imdct(int N, Array vx)
{
  // preprocessing loop
  for(i = 0; i < N; i++){
    vin[i]   = convertLo(i,N,vx[i]);
    vin[i+N] = convertHi(i,N,vx[i]);
  }

  // do the IFFT
  vifft = ifft(2*N, vin);

  // postprocessing loop
  for(i = 0; i < N; i++){
    int idx = bitReverse(i);
    vout[idx] = convertResult(i,N,vifft[i]);
  }
  return vout;
}
Suppose we want to use hardware to accelerate FFT/IFFT computation
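
The postprocessing loop calls bitReverse(i) without showing its body. A minimal sketch of one common definition follows, assuming the index width is passed explicitly. The name bit_reverse and the bits parameter are illustrative and do not come from the lecture, whose bitReverse takes only i.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the low `bits` bits of `x`; higher bits of `x` are ignored.
// Hypothetical helper: the lecture's bitReverse(i) does not show its width.
std::uint32_t bit_reverse(std::uint32_t x, unsigned bits) {
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1u);  // shift the lowest bit of x into r
        x >>= 1;
    }
    return r;
}
```

For a 3-bit index, bit_reverse(1, 3) gives 4 (001 becomes 100), and applying the function twice with the same width returns the original index.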

IMDCT
Array imdct(int N, Array vx)
{
  // preprocessing loop
  for(i = 0; i < N; i++){
    vin[i]   = convertLo(i,N,vx[i]);
    vin[i+N] = convertHi(i,N,vx[i]);
  }

  // call the hardware (replaces the software call: vifft = ifft(2*N, vin);)
  vifft = call_hw(2*N, vin);

  // postprocessing loop
  for(i = 0; i < N; i++){
    int idx = bitReverse(i);
    vout[idx] = convertResult(i,N,vifft[i]);
  }
  return vout;
}
Implement or find a hardware IFFT
How will the HW/SW communication work?
How do we explore design alternatives?

HW Accelerator in a system
Communication via bus
  DMA transfer?
Accelerators are all multiplexed on bus
  Possibly introduces conflicts
  Fair sharing of bus bandwidth
[Figure: Software (CPU) connected over the Bus (PCI Express) to HW IFFT Accelerator 1 and HW IFFT Accelerator 2]

The HW Interface
SW calls turn into a set of memory-mapped calls through the bus
Three communication tasks
  Set size of IFFT
  Enter data stream
  Take output out
[Figure: Bus (PCI Express) exposing three interface points: setSize, inputData, outputData]

Data Compatibility Issue
IFFT takes Complex fixed point numbers.
How do we represent such numbers in C and in RTL?

C++
template <typename F, typename I>
struct FixedPt{
  F fract;
  I integer;
};
template <typename T>
struct Complex{
  T rel;
  T img;
};

Verilog
typedef struct {
  bit [31:0] fract;
  bit [31:0] integer;
} FixedPt;
typedef struct {
  FixedPt rel;
  FixedPt img;
} Complex_FixedPt;

Data Compatibility
Keeping HW and SW representations consistent is tedious and error prone
  Issues of endianness (bit and byte)
  Layout changes based on C compiler (gcc vs. icc vs. msvc++)
Some SW representations do not have a natural HW analog
  What is a pointer?
  Do we disallow passing trees and lists directly?
Ideally translation should be automatically generated
Let us assume that the data compatibility issues have been solved and focus on control issues
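
One way to sidestep compiler-dependent layout and host endianness is to serialize each field explicitly into a fixed byte order instead of copying the struct's memory. The sketch below is hand-written, not the automatically generated translation the slide asks for. It assumes 32-bit fields, little-endian byte order on the wire, and the field order rel.fract, rel.integer, img.fract, img.integer. The names Sample, Wire, pack and unpack are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

template <typename F, typename I> struct FixedPt { F fract; I integer; };
template <typename T> struct Complex { T rel; T img; };

using Sample = Complex<FixedPt<std::int32_t, std::int32_t>>;
using Wire = std::array<std::uint8_t, 16>;  // assumed wire format: 4 x 32 bits

// Store a 32-bit word least-significant byte first, regardless of host order.
inline void store_le32(std::uint8_t* p, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<std::uint8_t>(v >> (8 * i));
}

// Load a 32-bit word that was stored least-significant byte first.
inline std::uint32_t load_le32(const std::uint8_t* p) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<std::uint32_t>(p[i]) << (8 * i);
    return v;
}

// Field order on the wire is an assumption of this sketch.
Wire pack(const Sample& s) {
    Wire w{};
    store_le32(&w[0],  static_cast<std::uint32_t>(s.rel.fract));
    store_le32(&w[4],  static_cast<std::uint32_t>(s.rel.integer));
    store_le32(&w[8],  static_cast<std::uint32_t>(s.img.fract));
    store_le32(&w[12], static_cast<std::uint32_t>(s.img.integer));
    return w;
}

Sample unpack(const Wire& w) {
    Sample s;
    s.rel.fract   = static_cast<std::int32_t>(load_le32(&w[0]));
    s.rel.integer = static_cast<std::int32_t>(load_le32(&w[4]));
    s.img.fract   = static_cast<std::int32_t>(load_le32(&w[8]));
    s.img.integer = static_cast<std::int32_t>(load_le32(&w[12]));
    return s;
}
```

Because every byte position is written explicitly, struct padding and host byte order no longer affect what the hardware sees. The hardware side still has to agree on the same field order and width.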

First Attempt at Acceleration
Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
{
  // preprocessing loop
  for(i = 0; i < N; i++){
    vin[i]   = convertLo(i,N,vx[i]);
    vin[i+N] = convertHi(i,N,vx[i]);
  }

  pcie_ifc.setSize(2*N);          // sets size
  for(i = 0; i < 2*N; i++)
    pcie_ifc.put(vin[i]);         // sends 1 element
  for(i = 0; i < 2*N; i++)
    vifft[i] = pcie_ifc.get();    // gets 1 element

  // postprocessing loop
  for(i = 0; i < N; i++){
    int idx = bitReverse(i);
    vout[idx] = convertResult(i,N,vifft[i]);
  }
  return vout;
}
Software blocks until response exists

Exposing more details
// mem-mapped hw register
volatile int* hw_flag = …
// mem-mapped hw frame buffer
volatile int* fbuffer = …

Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx)
{
  …
  assert(*hw_flag == IDLE);
  for(cnt = 0; cnt<n; cnt++)
    *(fbuffer+cnt) = frame[cnt];
  *hw_flag = GO;
  while(*hw_flag != IDLE) {;}
  for(cnt = 0; cnt<n*2; cnt++)
    frame[cnt] = *(fbuffer+cnt);
  …
}
What happens if SW has a cache?

Issues
Are the internal hardware conditions exposed correctly by the hw_flag control register?
Blocking SW is problematic:
  Prevents the processor from doing anything while the accelerator is in use
  Hard to pipeline the accelerator
  Does not handle variation in timing well

Driving a Pipelined HW
…
int pid = fork();
if(pid){ // producer process
  while(…) {
    …
    for(i = 0; i < 2*N; i++)
      pcie.put(vin[i]);
  }
} else { // consumer process
  while(…){
    for(i = 0; i < 2*N; i++)
      v[i] = pcie.get();
    …
  }
}
Multiple processes exploit pipeline parallelism in the IFFT accelerator.
How does the BSV exert back pressure on the producer thread?
How does the consumer thread exert back pressure on the BSV module?
What if our frames are really large? Could the HW begin working before the entire frame is transmitted?

Data Parallelism 1
…
SyncQueue<Complex<…>> workQ();
int pid = fork();
// both threads do same work
while(…) {
  Complex<FixedPt>* vin = workQ.pop();
  …
  for(i = 0; i < 2*N; i++)
    pcie.put(vin[i]);
  for(i = 0; i < 2*N; i++)
    v[i] = pcie.get();
  …
}
How do we isolate each thread’s use of the HW accelerator?
Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?

Data Parallelism 2
…
SyncQueue<Complex<…>> workQ();
int pid = fork();
// both threads do same work
while(…) {
  Complex<FixedPt>* vin = workQ.pop();
  …
  for(i = 0; i < 2*N; i++)
    get_hw(pid).put(vin[i]);
  for(i = 0; i < 2*N; i++)
    v[i] = get_hw(pid).get();
  …
}

PCIE get_hw(int pid)
{
  if(pid==0) return pcieA;
  else return pcieB;
}
By giving each thread its own HW accelerator, we have further increased data parallelism
If the HW is not the bottleneck this could be a waste of resources.
Do we multiplex the use of the physical BUS between the two threads?

Multithreading without threads or processes
int icnt = 0, ocnt = 0;
Complex iframe[sz];
Complex oframe[sz];
…
// IMDCT loop
while(…){
  …
  // producer "thread"
  for(i = 0; i<2 && icnt<n; i++)
    if(pcie.can_put())
      pcie.put(iframe[icnt++]);
  // consumer "thread"
  for(i = 0; i<2 && ocnt<n*2; i++)
    if(pcie.can_get())
      oframe[ocnt++] = pcie.get();
  …
}
Embedded execution environments often have little or no OS support, so multithreading must be emulated in user code
Getting the arbitration right is a complex task
All existing issues are compounded with the complexity of the duplicated states for each “thread”
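
The interleaved producer/consumer loop can be exercised without real hardware. The sketch below is a self-contained illustration, assuming a mock device with bounded FIFOs. MockAccel, its step() method, the doubling "computation" and run_frame are all hypothetical stand-ins and not part of the lecture's pcie interface, and the output frame is the same length as the input for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for the accelerator: bounded FIFOs on both sides.
class MockAccel {
public:
    explicit MockAccel(std::size_t depth) : depth_(depth) {}
    bool can_put() const { return in_.size() < depth_; }  // back pressure on SW
    bool can_get() const { return !out_.empty(); }
    void put(int v) { assert(can_put()); in_.push_back(v); }
    int get() {
        assert(can_get());
        int v = out_.front();
        out_.pop_front();
        return v;
    }
    // One "hardware cycle": move one element if the output FIFO has room.
    void step() {
        if (!in_.empty() && out_.size() < depth_) {
            out_.push_back(in_.front() * 2);  // placeholder for the real IFFT
            in_.pop_front();
        }
    }
private:
    std::size_t depth_;
    std::deque<int> in_, out_;
};

// A single loop plays both roles; each "thread" keeps its own progress state.
std::vector<int> run_frame(MockAccel& hw, const std::vector<int>& iframe) {
    std::vector<int> oframe;
    std::size_t icnt = 0;
    while (oframe.size() < iframe.size()) {
        // producer "thread": at most two puts per pass, only if HW can accept
        for (int i = 0; i < 2 && icnt < iframe.size(); ++i)
            if (hw.can_put()) hw.put(iframe[icnt++]);
        hw.step();  // in a real system the hardware advances on its own
        // consumer "thread": at most two gets per pass, only if data is ready
        for (int i = 0; i < 2 && oframe.size() < iframe.size(); ++i)
            if (hw.can_get()) oframe.push_back(hw.get());
    }
    return oframe;
}
```

The loop never blocks: when can_put() or can_get() is false it moves on and retries on the next pass, which is how the FIFO depth turns into back pressure on the software side.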

The message
Writing SW which can safely exploit HW parallelism is difficult…
  Particularly difficult if shared resources (e.g. bus) are involved
Need for automated solutions that do a good job

Today’s Lecture
Case Study: IMDCT
  Interfacing with HW
  Extracting Parallelism
Automated Solutions
  Bluespec Inc.: SCE-MI
  Intel/MIT: LEAP RRR

Bluespec Co-design: SCE-MI
Circuit verification is difficult
  Billions of cycles of gate-level simulation
How do we retain cycle accuracy?
  Use SCE-MI

Technology   Cycles/sec*
Bluesim      4K
FPGA         20M
ASIC         60M+
*Target: WiFi Transceiver
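
To put these rates in perspective, here is the arithmetic for simulating one billion cycles at each listed rate. The figure of 10^9 cycles is an illustrative round number, not one taken from the lecture.

```latex
\begin{align*}
t_{\text{Bluesim}} &= \frac{10^{9}\ \text{cycles}}{4\times10^{3}\ \text{cycles/s}}
                    = 2.5\times10^{5}\ \text{s} \approx 69\ \text{hours} \\
t_{\text{FPGA}}    &= \frac{10^{9}\ \text{cycles}}{2\times10^{7}\ \text{cycles/s}}
                    = 50\ \text{s} \\
t_{\text{ASIC}}    &\le \frac{10^{9}\ \text{cycles}}{6\times10^{7}\ \text{cycles/s}}
                    \approx 17\ \text{s}
\end{align*}
```

The gap between days and under a minute is the motivation for moving the design onto an FPGA while keeping cycle accuracy.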

SCE-MI
Use gated clocks to preserve cycle-accuracy
  Circuit internals run at “Model Clock”
  “Model Clock” ticks only when inputs and outputs to the circuit stabilize
Another Co-design problem

Bluespec SCE-MI
Used already in Lab
  With a controlled clock on the FPGA
Bluespec has a rich SCE-MI library
  Get/Put transactors provided
  User provides C++ and HW transactors for exotic interfaces

Intel/MIT: LEAP RRR
Asynchronous Remote Request-Response stack for FPGA
  Uses common Client/Server paradigm
Similar in many respects to Bluespec SCE-MI
  Constrained user interface
  Open, many platforms supported

Client/Server interfaces
Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

interface Client #(req_t, resp_t);
  interface Get#(req_t) request;
  interface Put#(resp_t) response;
endinterface

interface Server #(req_t, resp_t);
  interface Put#(req_t) request;
  interface Get#(resp_t) response;
endinterface
[Figure: a client and a server joined by get/put pairs, each with data, ready and enable signals; requests (req_t) flow from client to server and responses (resp_t) flow from server to client]
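
The duality between the two interfaces can be illustrated in software. The sketch below is a C++ analogy only and is not the Bluespec library: Get, Put, Client, Server, transfer, get_from, put_into and run_doubling_example are hypothetical names, and the explicit ready() check stands in for the implicit ready/enable handshake that BSV methods carry.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Hypothetical C++ stand-ins for the Get/Put interfaces.
template <typename T>
struct Get {
    std::function<bool()> ready;  // is a value available?
    std::function<T()> get;       // remove and return it (call only when ready)
};

template <typename T>
struct Put {
    std::function<void(const T&)> put;  // hand a value to the other side
};

// Client and Server hold the same pair with the roles swapped.
template <typename Req, typename Resp>
struct Client {
    Get<Req> request;    // the client produces requests
    Put<Resp> response;  // the client consumes responses
};

template <typename Req, typename Resp>
struct Server {
    Put<Req> request;    // the server consumes requests
    Get<Resp> response;  // the server produces responses
};

// Move everything currently available across the connection.
template <typename Req, typename Resp>
void transfer(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    while (c.request.ready()) s.request.put(c.request.get());
    while (s.response.ready()) c.response.put(s.response.get());
}

// Helpers that expose a queue through a Get or a Put view.
template <typename T>
Get<T> get_from(std::deque<T>& q) {
    return Get<T>{
        [&q]() -> bool { return !q.empty(); },
        [&q]() -> T { T v = q.front(); q.pop_front(); return v; }};
}

template <typename T>
Put<T> put_into(std::deque<T>& q) {
    return Put<T>{[&q](const T& v) { q.push_back(v); }};
}

// Example: a server that doubles each request it receives.
std::deque<int> run_doubling_example() {
    std::deque<int> client_out{1, 2, 3}, client_in, server_out;
    Client<int, int> c{get_from(client_out), put_into(client_in)};
    Server<int, int> s{
        Put<int>{[&server_out](const int& r) { server_out.push_back(2 * r); }},
        get_from(server_out)};
    transfer(c, s);
    return client_in;
}
```

Because a Client's Get lines up with a Server's Put and vice versa, a single generic transfer function can connect any matching pair, which is the convenience the paired interface types provide.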

RRR Specification Language
// ----------------------------------------
// create a new service called ISA_EMULATOR
// ----------------------------------------
service ISA_EMULATOR
{
  // --------------------------------
  // declare services provided by CPU
  // --------------------------------
  server CPU <- FPGA;
  {
    method UpdateRegister(in REG_INDEX, in REG_VALUE);
    method Emulate(in INST_INFO, out INST_ADDR);
  };

  // ---------------------------------
  // declare services provided by FPGA
  // ---------------------------------
  server FPGA <- CPU;
  {
    method SyncRegister(in REG_INDEX, in REG_VALUE);
  };
};

LEAP Abstraction Layers: RRR
[Figure: FPGA and CPU stacks, each with a Channel IO layer; other labels: Kernel Driver, FPGA Platform, Physical Devices]

LEAP Abstraction Layers: RRR
[Figure: the same FPGA and CPU stacks, now with an RRR Client/Server Manager above Channel IO on each side, and a Client Stub and Server Stub generated from RRR specification files]

LEAP Abstraction Layers: RRR
[Figure: the same FPGA and CPU stacks with RRR Client/Server Managers; User Code sits above the Client Stub on one side and above the Server Stub on the other]

User Code (Client Stub side)
ClientStubs.ISA_EMULATOR iemu;
...
iemu.UpdateRegister.Request(REG_R27, regFile[REG_R27]);
...
iemu.Emulate.Request(inst);
...
tgtPC <- iemu.Emulate.Response();

User Code (Server Stub side)
ISA_EMULATOR::UpdateRegister(REG_INDEX i, REG_VALUE v)
{
  regFile[i] = v;
}

ISA_EMULATOR::Emulate(INST_INFO inst)
{
  // emulate the instruction
  return target_PC;
}

LEAP Abstraction Layers: RRR
[Figure: the same FPGA and CPU stacks with RRR Client/Server Managers; several Stubs on each side, with the User Application spanning both]

Conclusion
Writing SW which can safely exploit HW parallelism is difficult…
Several automated tools are available
  Development ongoing