An FPGA-based in-line accelerator for Memcached · Memcached: A Highly-Used Server Application Web...

An FPGA-based In-line Accelerator for Memcached

MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU

THE UNIVERSITY OF TEXAS AT AUSTIN

1

Challenges for Server Processors

Workload changes Social networking Cloud applications Big data applications

Power wall Dark silicon More and more cores insufficient

Applications

Architectures

Devices

2

Memcached: A Highly-Used Server Application

Web Server

Database

Memcached

1-Get

2-Hit

3

An application level cache Database queries Server computations

Used in social networking sites Facebook YouTube Twitter Reddit

Web Server 1-Get

2-Miss

3-Request 4-Reply

5-Set

4

On Memcached Miss Access database Update Memcached (Set)

Database

Memcached

Memcached: A Highly-Used Server Application

Simplified Memcached Control Flow Read-parse command

Update Retrieval

Add / Inc / Dec

Append / Prepend

Set CAS

Remove Flush Statistics Delete

Find item

Send data

Remove the expired item

Create new item and/or detach the old item

Attach the new item

Clean up/Acknowledge

Copy the data to new item

Find item in hash

Remove the item

Find old item

5

Get Gets

Memcached on an FPGA?

Memory system

Memcached control flow graph

Update Retrieva

l Remove s

t

t

6

Pros: Specialization benefit Cons: Complex to be implemented as hardware

Memcached entirely on an FPGA Waste of hardware resources

Deal with application modifications

Rx Fifo

Tx Fifo

FPGA

Network

Hybrid Architecture: Optimize the common case

Rx Fifo

Tx Fifo

FPGA

7

General purpose core

$

Interconnect/Shared memory

Memcached control flow graph

Update Retrieva

l Remove s

t

t

Size problem App Modifications ? Common case extraction

May depend on state

Network

In-Line Acceleration

8

CAL 2013

In-line Acceleration – Fast Path: Hot Trace

$

Last level cache

Control / Status

NIC/In-line accelerator

Rx Fifo

Tx Fifo

FPGA resources

Application Control Flow Graph

Update Retrieva

l Remove s

t

t

9

Network


Compiler

Memory

In-line Acceleration – Slow Path: Original App

$

Last level cache

Control / Status


Rx Fifo

Tx Fifo

FPGA resources


Update Retrieva

l

Append/Prepend

Remove s

10


Network

Memory

Bail out Issue

Problem: Fast path may modify some global data and later decides to bail out Transfer the computation to the slow path without causing any inconsistency

Solution: Roll back global updates User-defined roll back routine

Many server applications are roll back friendly Rollback code is already available in transactional applications (i.e. databases) Partial updates are isolated to provide atomicity Memcached bail out : 30 Lines of Code

11

In-line Accelerator Generation Process

Slicing

Programming

Profiling

Template-based HLS tool

Write the roll back code Annotate memory types

Packet IO Shared coherent Private

Input application

Verilog model

12

Supported applications Small Hot trace Roll back friendly Worker thread model

Synthesizes the code based on pre-defined Templates: Multiple threads Multiple engines (FPGA’12)

Required programming << Hardware design efforts

Memcached Results

13

The Memcached In-line Accelerator

Slice registers: 6570 Slice LUTs: 8739 Clock cycle: 7.30 ns

Single engine/Single thread - 5% of FPGA’s LUTs

14

FPGA

Xilinx Virtex 6vhx250T

0

100

200

300

400

500

600

700

800

900

50 100 200 400 800

Thro

ugh

pu

t K

Get

s/se

c

Offered load K Gets/sec

1x Xeon core 2x Xeon core 1x FPGA-based accelerator 2x FPGA-based accelerator

Fast Path Performance/Power

8 Watts, 175K RPS, 21.8 Reqs/J

1 Watt, 590K RPS, 590 Reqs/J

27x more energy efficient One FPGA engine vs. single Xeon core

15

Xeon frequency: 2.2 Ghz FPGA frequency: 100 Mhz

Assumes accelerator has dedicated last level cache port

0

100

200

300

400

500

600

700

32 64 128 256 512 1024

K G

ets/

sec

Item size (bytes) 0 K sets/sec 10 K sets/sec 15 K sets/sec 20 K sets/sec

Cache coherency, Locking overhead

Less overhead for larger items

Full system stack - Gem5 simulation of Client and server

16

Server: LLC-connected FPGA-based accelerator (single engine fast path) + one Alpha core (slow path)

Fast Path/Slow path Interaction Cost

Assumes accelerator has dedicated last level cache port

Projecting the Hybrid Architecture Performance

Active Xeon cores

Accelerator engines

Processor + FPGA power (watts)

Performance (requests/sec)

Energy efficiency (requests/J)

Xeon processor 8 0 92 1.4 M 15.2 K

Xeon processor + In-line accelerator

1 4 33 1.6 M 48 K

17

Assumes each engine has dedicated last level cache port

Conclusion/Future work

18

Conclusion Accelerated Memcached Performance/energy efficiency: Important Software-like programmability: Also important

Future work In-line acceleration of other applications Tradeoffs in the selection of fast path Alternative bail out solutions Reduce cache port assumptions

Thank you

19

This material is based upon work supported in part by the National Science Foundation under grants 0747438 and 0917158.

Recent work Reference Platform Main contribution

Berezecki et al. , IGCC 2011

Tilera Many-small-cores are more efficient

Hetherington et al., ISPASS 2012

GPU GPU can improve the performance

Chalamalasetti et al., FPGA 2012

FPGA Get requests on FPGA

Lim et al., ISCA 2013

ASIC/FPGA + General purpose cores

Get on FPGA, the rest on CPU

Blott et al., HotClouds 2013

FPGA + General purpose cores

Improving the throughput

Lavasani et al., CAL 2013

FPGA + General purpose cores

Automatic slicing and HW generation

20

Generating Memcached In-line Accelerator

Slicing

Programming

Profiling

Template-based HLS

Input application

C simulation model

Verilog model

Memcached 1.4.10 10687 LOCs User-level working thread: Libevent Profiled and sliced Hot trace: Almost all get operations on UDP Hot routines LOCs: 963 Hot instructions: <2000

Bail-out code is 30 lines of code Memory type annotation

21

IP Routers Are Similar

Control plane

Interconnect network

Network processor

Line cards

Network

Rx Fifo

Tx Fifo

Routing application control flow graph (abstract)

Update Retrieva

l

re

Remove s

t

t

22

Synchronization/Address Translation

$

Last level cache

Lock table


Rx Fifo

Tx Fifo

FPGA resources


Update Retrieva

l Remove s

t

t

23

Network


TLB

DRAM

An FPGA-based in-line accelerator for Memcached · Memcached: A Highly-Used Server Application Web...

Documents

Transcript of An FPGA-based in-line accelerator for Memcached · Memcached: A Highly-Used Server Application Web...