VI Escola de Sistemas Embarcados (ESSE2016_Nacif.pdf)
Lucas Bragança da Silva
Fredy Alves
José Nacif (Presenter)
Ricardo Ferreira (Presenter)
Universidade Federal de Viçosa
Financial Support: Intel Brasil, Intel Labs, CAPES, CNPq, FAPEMIG
VI Escola de Sistemas Embarcados
ESSE 2016 / VI Brazilian Symposium on Computing Systems Engineering
CPU/FPGA Heterogeneous Architectures
“CGRA HARP”
VI Brazilian Symposium on Computing Systems Engineering, November 2016
Outline
• Motivation
• FPGA and CPU
• OpenCL and FPGA accelerators
• HARP Platform
• HARP Layers
• Demo
• HARP CGRA
Motivation
Moore's Law continues…
(Figure: around 2005, single-thread performance, frequency, and power stopped scaling.)
Moore's Law continues… now delivering multiple cores instead of higher frequency.
Motivation: IoT and Cloud Computing
• “Coherently-attached FPGA accelerator for Xeon processors in the datacenter which is estimated to have a $1B market opportunity by 2020” - Prabhat K. Gupta, General Manager of Xeon+FPGA Product at Intel Corporation
• Microsoft Catapult: a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling the FPGAs to communicate directly at datacenter scale - IEEE Micro 2016, “A Cloud-Scale Acceleration Architecture”
• Baidu, Inc. (NASDAQ: BIDU), the leading Chinese-language Internet search provider:
• Accelerators = greater throughput at low latency while retaining practical power levels
• 10-20X performance/watt improvement
• Baidu-optimized FPGA platforms are tuned for machine learning applications such as image and speech recognition
Motivation FPGA
FPGA is scalable!
FPGAs
• Scalable
• Energy efficient
• Parallel and distributed computing
• Temporal and spatial parallelism
• From low-cost embedded to high-performance cloud
FPGAs and Tools
• Hardware Description Languages
• Compilers
• High-Level Synthesis
• General Purpose
FPGAs and Tools: specific tools for specific applications
CPU and FPGAs
• Heterogeneous applications and heterogeneous hardware
• Real world:
• HARP - Intel/Altera platform
• Microsoft Catapult
FPL 2016 - PK Gupta - Intel
Accelerating DataCenter Workloads
Microsoft Catapult v2
172.6K ALMs, 4 GB DDR3
Round trip across 250,000 machines in 20 microseconds
Microsoft Catapult v2
• Microsoft's FPGA translates Wikipedia in less than a tenth of a second
• FPGA network - breaking the “chicken and egg”: accelerators cannot be added until enough applications need them, but applications will not rely upon the accelerators until they are present in the infrastructure
• By decoupling the servers and FPGAs, software services that demand more FPGA capacity can draw on FPGAs beyond their own server
Temporal and Spatial Parallelism
OpenCL
OpenCL example
__attribute__((num_compute_units(4,4)))
kernel void PE() {
  ...
}
(Diagram: one row of replicated compute units, PE 0,0 through PE 0,3.)
How to build a systolic computer…..
__attribute__((num_compute_units(4,4)))
kernel void PE() {
  ...
}
(Diagram: a 4x4 grid of replicated compute units, PE 0,0 through PE 3,3.)
OpenCL example
__attribute__((num_compute_units(4,4)))
kernel void PE() {
  ...
  row = get_compute_id(0);
  col = get_compute_id(1);
  ...
}
(Diagram: each PE in the 4x4 grid identified by its row and column compute IDs.)
OpenCL example
channel float4 ch_bottom[4];
...
PE() {
  ...
  float4 a, b;
  if (row == 0)
    a = read_channel(ch_bottom[col]);
OpenCL example
channel float4 ch_bottom[4];
channel float4 ch_PE_col[4][4];
...
PE() {
  ...
  float4 a, b;
  if (row == 0)
    a = read_channel(ch_bottom[col]);
  else
    a = read_channel(ch_PE_col[row-1][col]);
}
Coarse-Grained Reconfigurable Array: CGRA vs FPGA
• FPGA: fine-grained, huge bitstream
• CGRA: word-level, small bitstream
• CGRA as a virtual layer on top of the FPGA
HARP - Legal Disclaimer
Copyright (C) 2008-2016 Intel Corporation. All Rights Reserved. The source code contained or described herein and all documents related to the source code ("Material") are owned by Intel Corporation or its suppliers or licensors. Title to the Material remains with Intel Corporation or its suppliers and licensors. The Material contains trade secrets and proprietary and confidential information of Intel or its suppliers and licensors. The Material is protected by worldwide copyright and trade secret laws and treaty provisions. No part of the Material may be copied, reproduced, modified, published, uploaded, posted, transmitted, distributed, or disclosed in any way without Intel's prior express written permission.
HARP Prototype Xeon+FPGA system disclaimer
This talk is about prototype hardware and software which has been made available to universities in the HARP program.
Details of production Xeon+FPGA systems will be made available at a later date.
Results and details in this presentation were generated using pre-production hardware and software, and may not reflect production or future systems.
HARP: Accelerating workloads using Xeon and a coherently attached FPGA in-socket
QPI (QuickPath Interconnect): 6 GB/s
Heterogeneous architecture with homogeneous platform support
HARP-1 – Development platform
• 96 GB RAM
• Xeon, 10 cores
• FPGA Stratix V:
  - 622K LUTs
  - 1M registers
  - 2.5K memory blocks (M20K)
  - 512 DSPs
HARP-1 – USB Programmer
HARP HDL Programming
HARP – General Architecture
HARP - Accelerator Abstraction Layer (AAL)
• A set of software tools for developing and deploying systems composed of asymmetric computing resources
• CPUs, GPUs, and FPGAs exposed as a server
• An application uses the server by requesting resources
HARP - Accelerator Abstraction Layer (AAL)
• Resource manager: ensures exclusive use of a resource
• Service-oriented and object-oriented: interface definitions, attributes, and the objects which implement those interfaces
HARP - Accelerator Abstraction Layer (AAL)
• Service-Oriented Architecture
• Service: encapsulation of functionality which consumes computing resources
• Registrar: registers services and APIs; used to locate and acquire service interfaces
• Client: executable that uses a service by acquiring its API from the registrar
HARP – AAL Object Communications
• AAL uses asynchronous communication: the call returns to the application while the requested service executes in parallel.
HARP – Services, Interfaces, Composition
• The client accesses the service through virtual interfaces published on the Registrar, which do not expose the implementation.
• Component objects implement the interface.

class IMyInterface {
public:
    virtual void doThis(void) = 0;
    virtual void doThat(void) = 0;
    virtual ~IMyInterface() {}
};
HARP – Abstraction, Resource Management
• AAL abstracts service instantiation from the application.
• Services can be created dynamically.
• When multiple implementations of a service are available in one or more compute resources, AAL returns the most suitable one.
• AAL Resource Manager controls the allocation and provisioning of compute resources to services.
• Resource management is important for precious and shared resources such as accelerators on FPGAs.
HARP – AAL Service Broker and Registrar
• Service Broker gets the information required to instantiate a service from the Registrar.
• Service libraries are loadable software such as DLLs.
HARP – AAL Service Broker and Registrar
• Client 1 consults Service Broker for Service Compute.
• Service Broker obtains data record describing Service Compute from Service Registrar.
• The Service Broker consults the Resource Manager, which checks the available implementations and computing resources.
HARP – AAL Service Broker and Registrar
• The Resource Manager returns the information that allows the Broker to load the service package.
• The Service Broker calls the Service Factory to instantiate the service.
HARP – Core Cache Interface (CCI)
• Interface between the AFU and QPI:
  - Read and write requests to the system coherent memory.
  - Coherent memory is mapped to CPU DRAM.
• The FPGA implements Intel QPI:
  - The processor uses QPI to access the system cache.
(Figure: QPI link 6 GB/s; FPGA cache 64 KB, 64 B lines.)
HARP – Core Cache Interface (CCI)
• Accelerated Function Units (AFUs) accelerate an application kernel in the FPGA.
• The blue dotted box is the multiprocessor boundary.
• The red dotted box is the cache access domain.
HARP – Interface Definitions: Attach points
• QPI-FPGA implements the Caching and Configuration agents:
  - The Caching Agent ensures memory coherence.
  - The Configuration Agent receives and handles read and write cycles from the processor.
  - System Protocol Layer (SPL): virtual address translation.
HARP – Interface Definitions: Attach points
• Processor-to-FPGA is RX.
• FPGA-to-processor is TX.
• Designed to accept one read and one write per clock cycle.
• An AFU with CCI-E connects via SPL2; read responses are ordered, writes complete out of order.
• SPL2 provides up to 2 GB of pinned virtual address space to an AFU.
HARP – Interface Definitions: Attach points
• An AFU connects via CCI Standard (CCI-S) or CCI Extended (CCI-E).
• CCI-S uses physical addressing and out-of-order responses.
• CCI-E uses virtual addressing.
• Intel provides the SPL2 IP to translate virtual to physical addresses.
• The AFU connects to SPL2 via CCI-E.
Intel HARP “Hello World” example. Are you ready?
HARP – Accelerator “Hello World”
• An AFU capable of adding two values in CPU memory:
  - SPL2 RTL for address translation
  - SW application (C++) and AFU RTL (Verilog)
• This example demonstrates the use of:
  - the AAL runtime
  - the service API
  - an example AFU as a starting point for a user AFU
AAL Application code: Run method
m_runtimClient->getRuntime()->allocService(dynamic_cast<IBase *>(this), Manifest);
m_Sem.Wait();
if (0 == m_Result) {
    MSG("Running Test");
    btVirtAddr pWSUsrVirt = m_pWkspcVirt;   // Address of workspace
    const btWSSize WSLen = m_WkspcSize;     // Length of workspace
    INFO("Allocated " << WSLen << "-byte Workspace at virtual address " << std::hex << (void *)pWSUsrVirt);
    // Number of bytes in each of the source and destination buffers (4 MiB in this case)
    btUnsigned32bitInt a_num_bytes = (btUnsigned32bitInt)((WSLen - sizeof(VAFU2_CNTXT)) / 2);
• Allocates the service and a workspace in Device Status Memory (DSM) using allocService().
• If the service is successfully allocated, runs the test.
• Gets the address and the length of the workspace.
• Defines the size of the source and destination buffers in bytes.
btUnsigned32bitInt a_num_cl = a_num_bytes / CL(1);   // number of cache lines in buffer
// VAFU Context is at the beginning of the buffer
VAFU2_CNTXT *pVAFU2_cntxt = reinterpret_cast<VAFU2_CNTXT *>(pWSUsrVirt);
// The source buffer is right after the VAFU Context
btVirtAddr pSource = pWSUsrVirt + sizeof(VAFU2_CNTXT);
// The destination buffer is right after the source buffer
btVirtAddr pDest = pSource + a_num_bytes;
• Defines the number of cache lines in each buffer (a_num_cl).
• Gets the pointers to the AFU context (pVAFU2_cntxt), the source buffer (pSource), and the destination buffer (pDest).
• pDest is pSource plus the size of the source buffer in bytes (a_num_bytes).
AAL Application code: Run method
// Initialize the command buffer
::memset(pVAFU2_cntxt, 0, sizeof(VAFU2_CNTXT));
pVAFU2_cntxt->num_cl = a_num_cl;
pVAFU2_cntxt->pSource = pSource;
pVAFU2_cntxt->pDest = pDest;
INFO("Starting SPL Transaction with Workspace");
m_Sem.Wait();
int numa = 3;
int numb = 2;
int *inputs_ADD = (int*)malloc(sizeof(int)*2);
volatile int *addIn = (int*)pSource;
inputs_ADD[0] = numa;
inputs_ADD[1] = numb;
memcpy((void*)addIn, inputs_ADD, sizeof(int)*2);
m_SPLService->StartTransactionContext(TransactionID(), pWSUsrVirt, 100);
• Initializes the AFU context and copies it to the context pointer.
• Defines two numbers (numa, numb) and copies them to the source buffer using memcpy.
• Starts the transaction with StartTransactionContext, which asserts the start signal on the AFU and resets it.
AAL Application code: Run method
• The AFU writes its AFU_ID to the DSM.
• The AFU will be running after the CPU reads the AFU_ID from the DSM.
AAL Application code: Run method
// Wait for SPL VAFU to finish
volatile bt32bitInt done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
while (!done && --count) {
    SleepMilli(delay);
    done = pVAFU2_cntxt->Status & VAFU2_CNTXT_STATUS_DONE;
}
if (!done) {
    // must have dropped out of loop due to count -- never saw update
    ERR("AFU never signaled it was done. Timing out anyway. Results may be strange.\n");
}
int *pu32 = reinterpret_cast<int*>(&pDestCL[0]);
for (int i = 0; i < results_num; i++) {
    cout << *pu32 << "\n";
    ++pu32;
}
• Gets a reference to the AFU DONE status bit.
• Waits for done to be set to 1.
• If the AFU does not answer before the time limit, prints an error message.
• If the AFU answers in time, gets a pointer to the destination buffer and prints the results.
AAL Application code: Run method
// Issue Stop Transaction and wait for OnTransactionStopped
INFO("Stopping SPL Transaction");
m_SPLService->StopTransactionContext(TransactionID());
m_Sem.Wait();
}
// Clean up and exit
INFO("Workspace verification complete, freeing workspace.");
m_SPLService->WorkspaceFree(m_pWkspcVirt, TransactionID());
m_Sem.Wait();
m_runtimClient->end();
return m_Result;
}
• After the transaction is done, stops the transaction.
• Stopping resets the AFU and sets the start signal to 0.
• Frees the workspace.
AAL Application code: Run method
• Based on the Sudoku example.
• SPL RTL: Provided by Intel
• AFU RTL
The AFU
- afu_user: implements the communication interface with the SPL module.

module afu_user #(CACHE_WIDTH = 512)
(
  input  clk,
  input  reset_n,
  // Read Request
  output [ADDR_LMT-1:0]   rd_req_addr,
  output [MDATA-1:0]      rd_req_mdata,
  output reg              rd_req_en,
  input                   rd_req_almostfull,
  // Read Response
  input                   rd_rsp_valid,
  input [MDATA-1:0]       rd_rsp_mdata,
  input [CACHE_WIDTH-1:0] rd_rsp_data,
• CACHE_WIDTH is the size of the cache line in bits.
• The SW application starts the transaction (reset).
• The read request signals are used to request a read from the source buffer.
• The read response carries the read data; in this case we only use rd_rsp_data, which holds the cache line for the last read request.
The AFU USER Interface
  // Write Request
  output [ADDR_LMT-1:0]    wr_req_addr,
  output [MDATA-1:0]       wr_req_mdata,
  output [CACHE_WIDTH-1:0] wr_req_data,
  output reg               wr_req_en,
  input                    wr_req_almostfull,
  // Write Response
  input                    wr_rsp0_valid,
  input [MDATA-1:0]        wr_rsp0_mdata,
  input                    wr_rsp1_valid,
  input [MDATA-1:0]        wr_rsp1_mdata,
  // Start input signal
  input                    start,
  // Done output signal
  output reg               done,
  // Control info from software
  input [511:0]            afu_context);
• The write request signals mirror the read request signals, for cache write operations.
• wr_rsp1_valid is used to identify when the writing process finishes.
• “start” is the signal set by the CPU in the AFU to start a transaction.
• “done” is the signal sent to the CPU to indicate that the transaction processing is over.
The AFU USER Interface
• Read data from the source buffer, process it (AFU), and write the results back to the destination buffer.
The AFU Control States: FSM_IDLE
FSM_IDLE: begin
if(start) begin
fsm_ns = FSM_RD_REQ;
end
end
• Waits for the start signal to be set to one by the CPU.
• Changes to FSM_RD_REQ to start reading from the source buffer.
The AFU Control States: FSM_RD_REQ
FSM_RD_REQ: begin
// If there's no more data to copy
if(addr_cnt >= num_clines)
begin
fsm_ns = FSM_RUN_ADD;
addr_cnt_clr = 1'b1;
end
// There's more data to copy
else begin
// Issue rd_req
if(!rd_req_almostfull) begin
rd_req_en = 1'b1;
fsm_ns = FSM_RD_RSP;
end
end
end
• addr_cnt keeps track of which line is being read.
• If addr_cnt reaches the number of lines to be read (num_clines), change state to run the user AFU.
• Otherwise, if the read buffer is not full, send a read request to the SPL (rd_req_en) and change state to wait for the read response.
The AFU Control States: FSM_RD_REQ
always@(posedge clk)
begin
if(rd_rsp_valid)
begin
case(addr_cnt)
'd0:
begin
inputs_add <= rd_rsp_data;
end
endcase // case (addr_cnt)
end // if (rd_rsp_valid)
end // always@ (posedge clk)
adder add0(
.clk(clk),
.start(start),
.numA(inputs_add[31:0]),
.numB(inputs_add[63:32]),
.result(w_outGrid),
.done(w_done)
);
• This always block waits for a response (rd_rsp_valid is 1) and then saves the data from the source buffer (rd_rsp_data), in this case into inputs_add.
• inputs_add is connected to the input of the user AFU, which in this case is the adder.
The AFU Control States: FSM_RD_RSP
FSM_RD_RSP:
begin
  // Receive rd_rsp, put read data into data_buf
  if(rd_rsp_valid)
  begin
    addr_cnt_inc = 1'b1;
    fsm_ns = FSM_RD_REQ;
  end
end
• Waits for the response.
• addr_cnt_inc is set to one, which increases addr_cnt by 1, so the next line in the source buffer will be read.
• Goes back to FSM_RD_REQ.
The AFU Control States: FSM_RD_RSP
// --- Address counter
reg [31:0] addr_cnt;
always @ (posedge clk) begin
if(!reset_n)
addr_cnt <= 0;
else
if(addr_cnt_inc)
addr_cnt <= addr_cnt + 1;
else if(addr_cnt_clr)
addr_cnt <= 'd0;
end
• This always block controls the changes to addr_cnt.
• When addr_cnt_inc is set to 1, increments addr_cnt by 1 to move to the next buffer line.
• When addr_cnt_clr is set to 1, clears addr_cnt.
The AFU Control States: FSM_RUN_ADD
FSM_RUN_ADD:
begin
t_start = 1'b1;
fsm_ns = FSM_WAIT_ADD;
n_cnt = 'd0;
end
adder add0(
.clk(clk),
.start(t_start),
.numA(inputs_add[31:0]),
.numB(inputs_add[63:32]),
.result(w_outGrid),
.done(w_done)
);
• Sets t_start, connected to the adder, to 1.
• The adder starts.
• Goes to the state that waits for the adder to finish.
The AFU Control States: FSM_WAIT_ADD
FSM_WAIT_ADD:
begin
if(w_done | w_error)
begin
fsm_ns = FSM_WR_REQ;
end
end
adder add0(
.clk(clk),
.start(t_start),
.numA(inputs_add[31:0]),
.numB(inputs_add[63:32]),
.result(w_outGrid),
.done(w_done)
);
• Waits for the w_done wire, connected to the adder's done signal, to be set to one, meaning the adder has finished.
• When finished, goes to the state that starts writing results to the destination buffer.
The AFU Control States: FSM_WR_REQ
FSM_WR_REQ:
begin
if(addr_cnt >= num_clines)
begin
fsm_ns = FSM_DONE;
end
else if(!wr_req_almostfull)
begin
wr_req_en = 1'b1; // issue wr_req
fsm_ns = FSM_WR_RSP;
end
end
• Requests writes; the data to be written to the destination buffer is the output of the adder.
The AFU Control States: FSM_WR_RSP
FSM_WR_RSP:
begin
if(wr_rsp0_valid | wr_rsp1_valid)
begin
fsm_ns = FSM_WR_REQ;
addr_cnt_inc = 1'b1; // address counter ++
end
end
• The data to be written to the destination buffer is the output of the adder.
The AFU Control States: FSM_DONE
FSM_DONE:
begin
done = 1'b1;
fsm_ns = FSM_DONE;
end
• Sets done to one, which finishes the transaction, stops the SPL, and sends the done signal to the CPU.
Collision detection algorithm
• Detects collisions between rigid bodies in a space and calculates the results of these collisions.
• Used in a wide variety of applications such as games, simulations, and robotics.
• Implemented in physics engines. In our case we integrate the HARP platform with ODE (Open Dynamics Engine), an open-source engine.
Case study: sphere collision detection
• Inputs: position, speed, and shape of bodies in space.
• Outputs: contact points (potential, fake, and true); collision results (new positions for the spheres).
In-game gameplay from the game Besiege.
The system is composed of:
• The ODE application integrated with AAL.
• The FPGA with the collision detection AFU and SPL for address translation.
• A source buffer to hold input data for the AFU.
• A destination buffer to hold the collision detection results for the application.
For each simulation step:
• The CPU sends the collision data to the source buffer.
• The CPU sends start and reset signals to the SPL, which propagates them to the AFU.
• The AFU sends its AFU ID to the source buffer; the CPU reads it, indicating that the AFU has started.
• After it finishes processing the collisions, the AFU sends the results to the destination buffer.
• The AFU indicates to the CPU that it is done processing the transaction.
• The CPU retrieves the results from the destination buffer.
FPL 2016 - PK Gupta - Intel
Accelerating DataCenter Workloads
Two application examples
FPGA Board Evaluation - DAC 2016
FCCM 2016: DNA accelerator for short sequences on HARP/Intel
Inside a PE
Applications mapped on HARP
• “Runtime Parameterizable Regular Expression Operators for Databases”
  - Tradeoff between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL).
• “High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform”
  - 2.9x and 1.9x speedups compared with CPU-only and FPGA-only baselines.
  - 2.3x vs. an FPGA implementation for sorting.
Ongoing work
• Previous work:
  - Modulo scheduling
  - Virtual CGRA
• High-level stream computation mapped onto HARP
Every day, we create 2.5 quintillion bytes of data —
so much that 90% of the data in the world today
has been created in the last two years alone.
IBM Big-Data
[Figure: a sequential loop body with operations A-J is rewritten as a parallel dataflow graph connected by streams, which is then mapped at runtime onto a physical parallel architecture of functional units (FUs): the CGRA.]
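The sequential-to-dataflow transformation can be sketched by grouping the operations into dependence levels: operations in the same level have no dependences between them and can run in parallel. The edge set below is an assumption for illustration; the slides show only the node names A-J, not the actual dependences.

```python
# Hypothetical dependence edges for the loop body A..J (op -> its inputs).
deps = {
    "A": [], "B": [],
    "C": ["A"], "E": ["B"], "D": ["A", "B"],
    "F": ["C", "E"],
    "G": ["F", "D"], "H": ["F"],
    "I": ["G"], "J": ["H"],
}

def levels(deps):
    """ASAP levelling: each op sits one level below its deepest input."""
    lvl = {}
    def depth(op):
        if op not in lvl:
            lvl[op] = 1 + max((depth(p) for p in deps[op]), default=-1)
        return lvl[op]
    for op in deps:
        depth(op)
    out = {}
    for op, l in lvl.items():
        out.setdefault(l, []).append(op)
    return out

# Ops in the same level are independent and can execute in parallel.
for l, ops in sorted(levels(deps).items()):
    print(l, sorted(ops))
```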
Loop Unrolling - Modulo Scheduling
[Figure: the dataflow graph A-J replicated for iterations i, i+1, i+2, i+3. Overlap iterations!]
[Figure: with the iterations overlapped, all operations execute at the same time: one-clock-cycle throughput, ILP = 10.]
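The one-result-per-cycle claim can be checked with a tiny pipeline simulation: once the pipeline is full, one iteration completes every cycle. The graph depth of 5 levels is an assumption carried over from the illustrative dependence edges; only the overlap idea comes from the slides.

```python
# Software-pipelining throughput sketch: start one new iteration per cycle;
# an iteration takes DEPTH cycles to traverse the dataflow graph.
DEPTH = 5     # assumed number of dependence levels in the A..J graph
N_ITER = 8    # iterations to simulate

completed = []
for cycle in range(N_ITER + DEPTH):
    finishing = cycle - DEPTH          # iteration leaving the pipeline
    if 0 <= finishing < N_ITER:
        completed.append((cycle, finishing))

# After the fill phase, consecutive iterations finish in consecutive cycles.
gaps = [b[0] - a[0] for a, b in zip(completed, completed[1:])]
print(gaps)   # all gaps equal 1: one-clock-cycle throughput
```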
[Figure: the overlapped iterations mapped onto a physical architecture of 10 FUs.]
[Figure: operations A-J placed onto the FUs of the physical architecture.]
[Figure: the 10 operations must now share only 6 functional units: 10 OPs, 6 units.]
[Figure: time-multiplexed schedule on 6 units over cycles t0-t7; at t3 the units execute A, B, F, G, H.]
[Figure: at t4 the units execute C, D, E, I, J.]
[Figure: at t5 the units execute A, B, F, G, H of the next iteration.]
[Figure: at t6 the units execute C, D, E, I, J.]
A new result every 2 cycles: ILP = 5
Initiation Interval (II) = 2 cycles
[Figure: placement: operations assigned to FUs form configuration C0.]
[Figure: placement and routing: the operations are split into two FU configurations, C0 and C1.]
[Figure: the schedule alternates configurations C0 and C1 every cycle, overlapping iterations i and i+1.]
[Figure: configurations C0 and C1 stored in the configuration memory of the physical CGRA architecture.]
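The configuration memory can be sketched as a small list of contexts the CGRA cycles through: every cycle the fabric loads context (t mod II) and fires the operations it holds. The C0/C1 operation groups below follow the slides' schedule; the data structure itself is an illustrative stand-in for real per-FU configuration bits.

```python
# Time-multiplexed CGRA sketch: the fabric steps through configuration
# memory, loading one context per cycle (C0, C1, C0, C1, ...).
config_memory = [
    {"A", "B", "F", "G", "H"},   # configuration C0
    {"C", "D", "E", "I", "J"},   # configuration C1
]

def run(cycles):
    trace = []
    for t in range(cycles):
        ctx = config_memory[t % len(config_memory)]  # context for cycle t
        trace.append(sorted(ctx))                    # ops fired this cycle
    return trace

trace = run(4)
for t, fired in enumerate(trace):
    print(f"t{t}: {fired}")
```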
Virtual CGRA on the top of a commercial FPGA (XILINX XC6VLX75T)
[Figure: a grid of FUs with local register files (RF) and a global register.]
Resource usage, two variants:
FlipFlop: 2.5 % | 2.7 %
LUTs: 14.7 % | 17.6 %
Mem Bank: 16.0 % | 4.5 %
Clock: 110 MHz | 90 MHz
Virtualization
Questions?