An FPGA-based in-line accelerator for Memcached · Memcached: A Highly-Used Server Application Web...
Transcript of An FPGA-based in-line accelerator for Memcached · Memcached: A Highly-Used Server Application Web...
An FPGA-based In-line Accelerator for Memcached
MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU
THE UNIVERSITY OF TEXAS AT AUSTIN
1
Challenges for Server Processors
Workload changes Social networking Cloud applications Big data applications
Power wall Dark silicon More and more cores insufficient
Applications
Architectures
Devices
2
Memcached: A Highly-Used Server Application
Web Server
Database
Memcached
1-Get
2-Hit
3
An application level cache Database queries Server computations
Used in social networking sites Facebook YouTube Twitter Reddit
Web Server 1-Get
2-Miss
3-Request 4-Reply
5-Set
4
On Memcached Miss Access database Update Memcached (Set)
Database
Memcached
Memcached: A Highly-Used Server Application
Simplified Memcached Control Flow Read-parse command
Update Retrieval
Add / Inc / Dec
Append / Prepend
Set CAS
Remove Flush Statistics Delete
Find item
Send data
Remove the expired item
Create new item and/or detach the old item
Attach the new item
Clean up/Acknowledge
Copy the data to new item
Find item in hash
Remove the item
Find old item
5
Get Gets
Memcached on an FPGA?
Memory system
Memcached control flow graph
Update Retrieva
l Remove s
t
t
6
Pros: Specialization benefit Cons: Complex to be implemented as hardware
Memcached entirely on an FPGA Waste of hardware resources
Deal with application modifications
Rx Fifo
Tx Fifo
FPGA
Network
Hybrid Architecture: Optimize the common case
Rx Fifo
Tx Fifo
FPGA
7
General purpose core
$
Interconnect/Shared memory
Memcached control flow graph
Update Retrieva
l Remove s
t
t
Size problem App Modifications ? Common case extraction
May depend on state
Network
In-Line Acceleration
8
CAL 2013
In-line Acceleration – Fast Path: Hot Trace
$
Last level cache
Control / Status
NIC/In-line accelerator
Rx Fifo
Tx Fifo
FPGA resources
Application Control Flow Graph
Update Retrieva
l Remove s
t
t
9
Network
General purpose core
Compiler
Memory
In-line Acceleration – Slow Path: Original App
$
Last level cache
Control / Status
NIC/In-line accelerator
Rx Fifo
Tx Fifo
FPGA resources
Application Control Flow Graph
Update Retrieva
l
Append/Prepend
Remove s
10
General purpose core
Network
Memory
Bail out Issue
Problem: Fast path may modify some global data and later decides to bail out Transfer the computation to the slow path without causing any inconsistency
Solution: Roll back global updates User-defined roll back routine
Many server applications are roll back friendly Rollback code is already available in transactional applications (i.e. databases) Partial updates are isolated to provide atomicity Memcached bail out : 30 Lines of Code
11
In-line Accelerator Generation Process
Slicing
Programming
Profiling
Template-based HLS tool
Write the roll back code Annotate memory types
Packet IO Shared coherent Private
Input application
Verilog model
12
Supported applications Small Hot trace Roll back friendly Worker thread model
Synthesizes the code based on pre-defined Templates: Multiple threads Multiple engines (FPGA’12)
Required programming << Hardware design efforts
Memcached Results
13
The Memcached In-line Accelerator
Slice registers: 6570 Slice LUTs: 8739 Clock cycle: 7.30 ns
Single engine/Single thread - 5% of FPGA’s LUTs
14
FPGA
Xilinx Virtex 6vhx250T
0
100
200
300
400
500
600
700
800
900
50 100 200 400 800
Thro
ugh
pu
t K
Get
s/se
c
Offered load K Gets/sec
1x Xeon core 2x Xeon core 1x FPGA-based accelerator 2x FPGA-based accelerator
Fast Path Performance/Power
8 Watts, 175K RPS, 21.8 Reqs/J
1 Watt, 590K RPS, 590 Reqs/J
27x more energy efficient One FPGA engine vs. single Xeon core
15
Xeon frequency: 2.2 Ghz FPGA frequency: 100 Mhz
Assumes accelerator has dedicated last level cache port
0
100
200
300
400
500
600
700
32 64 128 256 512 1024
K G
ets/
sec
Item size (bytes) 0 K sets/sec 10 K sets/sec 15 K sets/sec 20 K sets/sec
Cache coherency, Locking overhead
Less overhead for larger items
Full system stack - Gem5 simulation of Client and server
16
Server: LLC-connected FPGA-based accelerator (single engine fast path) + one Alpha core (slow path)
Fast Path/Slow path Interaction Cost
Assumes accelerator has dedicated last level cache port
Projecting the Hybrid Architecture Performance
Active Xeon cores
Accelerator engines
Processor + FPGA power (watts)
Performance (requests/sec)
Energy efficiency (requests/J)
Xeon processor 8 0 92 1.4 M 15.2 K
Xeon processor + In-line accelerator
1 4 33 1.6 M 48 K
17
Assumes each engine has dedicated last level cache port
Conclusion/Future work
18
Conclusion Accelerated Memcached Performance/energy efficiency: Important Software-like programmability: Also important
Future work In-line acceleration of other applications Tradeoffs in the selection of fast path Alternative bail out solutions Reduce cache port assumptions
Thank you
19
This material is based upon work supported in part by the National Science Foundation under grants 0747438 and 0917158.
Recent work Reference Platform Main contribution
Berezecki et al. , IGCC 2011
Tilera Many-small-cores are more efficient
Hetherington et al., ISPASS 2012
GPU GPU can improve the performance
Chalamalasetti et al., FPGA 2012
FPGA Get requests on FPGA
Lim et al., ISCA 2013
ASIC/FPGA + General purpose cores
Get on FPGA, the rest on CPU
Blott et al., HotClouds 2013
FPGA + General purpose cores
Improving the throughput
Lavasani et al., CAL 2013
FPGA + General purpose cores
Automatic slicing and HW generation
20
Generating Memcached In-line Accelerator
Slicing
Programming
Profiling
Template-based HLS
Input application
C simulation model
Verilog model
Memcached 1.4.10 10687 LOCs User-level working thread: Libevent Profiled and sliced Hot trace: Almost all get operations on UDP Hot routines LOCs: 963 Hot instructions: <2000
Bail-out code is 30 lines of code Memory type annotation
21
IP Routers Are Similar
Control plane
Interconnect network
Network processor
Line cards
Network
Rx Fifo
Tx Fifo
Routing application control flow graph (abstract)
Update Retrieva
l
re
Remove s
t
t
22
Synchronization/Address Translation
$
Last level cache
Lock table
NIC/In-line accelerator
Rx Fifo
Tx Fifo
FPGA resources
Application Control Flow Graph
Update Retrieva
l Remove s
t
t
23
Network
General purpose core
TLB
DRAM