Trading Communication with Computing Near Storagegunjaeko/pubs/Gunjae_MICRO17_slides.pdf · Trading...

39
Summarizer: Trading Communication with Computing Near Storage Gunjae Koo *, Kiran Kumar Matam*, Te I , H.V. Krishina Giri Nara*, Jing Li , Hung-Wei Tseng , Steven Swanson , Murali Annavaram* *University of Southern California North Carolina State University University of California, San Diego

Transcript of Trading Communication with Computing Near Storagegunjaeko/pubs/Gunjae_MICRO17_slides.pdf · Trading...

Summarizer: Trading Communication with Computing Near Storage

Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishina Giri Nara*, Jing Li‡,Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*

*University of Southern California†North Carolina State University

‡University of California, San Diego

Host

Motivation – High Data Movement Cost

2

CPU Storage interface

Data computation @ host Data transfer from storage

External (host -- storage) Internal

Limited data bandwidth High access latency

External (host -- storage) Internal

StorageProcessor

(SP)

Host

Near Data Processing (NDP)

3

CPU Storage interface

Data computation @ host Data transfer from storage

InternalExternal (host – storage)

Host

CPU

Near Data Processing (NDP)

4

Storage interface

StorageProcessor

(SP)

Data computation @ host Data transfer from storage

InternalExternal (host – storage)

W/O NDP

With NDPData computation @ storage

Host

Near Data Processing (NDP) on SSDs

5

CPU Storage interface SP

Data computation @ host Data transfer from storage

InternalExternal (host – storage)

W/O NDP

With NDPData computation @ storage

Garbage collection

Wear-leveling

Data computation @ storage

Host

Near Data Processing (NDP) on SSDs

6

CPU Storage interface SP

Data computation @ host Data transfer from storage

InternalExternal (host – storage)

W/O NDP

With NDP

Garbage collection

Wear-leveling

Data computation @ storage

Obstacles to in-SSD processing

• Less powerful embedded processor

• Dynamic computation resource availability

• Manual workload partitioning is difficult Summarizer: Dynamic NDP framework for SSD

Host

CPU

Summarizer – Basic Concept

7

Storage interface AP

Monitoring resources

Host

CPU

Summarizer – Basic Concept

8

Storage interface AP

Monitoring resources

Summarizer – Detailed Firmware Architecture

9

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

SSD Embedded Processors

Normal Page Read Request

10

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

RD ( LBA)

(RD) PPA

Normal Page Read Request

11

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

RD(PPA 1)RD(PPA 2)

Page data

Normal Page Read Request

12

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

Page data

Summarizer – Initialization (Function Offloading)

13

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

INIT ( foo)

foo()

foo()f#1Function offloading

Function registration

New NVMe command

Summarizer – Computation (Dynamic mode)

14

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

foo()f#1

RD&PROC( LBA,foo)

New NVMe command

New NVMe command decode

RD&PROC(PPA,foo)

goo()f#2

Summarizer – Computation (Dynamic mode)

15

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

foo()f#1

RD&PROC(PPA,foo)

RD&P(PPA1,foo)

RD&P(PPA2,foo)

Page data

RD&P(PPA1,foo)

goo()f#2

Summarizer – Computation (Dynamic mode)

16

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

foo1()f#1

RD&PROC(PPA,foo)

Page data

RD&P(PPA1,foo)

buf1, foo

CC/Proc

Register in TQ

goo()f#2

Summarizer – Computation (Dynamic mode)

17

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

foo()f#1

RD&PROC(PPA,foo)

Page data

RD&P(PPA1,foo)

CC

TQ is full

goo()f#2

Summarizer – Finalization

18

Host Memory

SQ CQ

Host CPU

Sto

rag

e I

nte

rfa

ce (P

CIe

/ N

VM

e)

SSD Firmware

NAND FlashNAND FlashNAND FlashNAND Flash

Flash Controller

SSD DRAM

DRAM Controller

Summarizer

User Functions

TQ

Re

qu

est

qu

eu

e

Re

spo

nse

qu

eu

e

I/O Controller(NVMe command decoder)

SSD SoC Interconnection

Flash Translation Layer (FTL)

NVMe Host Driver

User Applications /Operating Systems

Task Controller

FINAL ( foo)

New NVMe command

foo()f#1

Results

goo()f#2

Summarizer API and NVMe commands

19

Initialization

Finalization

Computation

• NVMe command: INIT_TSKn• Transfer a in-SSD procedure to SSD memory• Initialize data structure and temporal variables for in-SSD

computation

• NVMe command: READ_PROC_TSKn• Page read command is issued with the flag indicating the user

procedure embedded in SSD memory• Return the special code if the requested page is processed in SSD• Page data is transferred to the host if the requested page is NOT

computed in SSD

• NVMe command: FINAL_TSKn• Gather final in-SSD computation results and transfer to the host

Evaluation Platform

• LS2085a intelligent SSD development platform

• ARM cores running FTL and Summarizerfirmware

• FPGA implementing NAND flash controller

• PCIe Gen. 3 4x lanes for host communication

20

LS2085a

Interconnection

DDR4 Memory Controller

DRAM DRAM

CPU

L1D(32KB)

L2(1MB)

L1I(48KB)

CPU

L1D(32KB)

L1I(48KB)

PC

Ie(h

ost

–L

S2

08

5a

)

PC

Ie(L

S2

08

5a

-F

PG

A)

FPGA(ALTERA Stratix V)

NAND flash DIMMNAND flash DIMMs

CPU

L1D(32KB)

L2(1MB)

L1I(48KB)

CPU

L1D(32KB)

L1I(48KB)

Evaluation - Performance

21

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

Static workload offloading

Evaluation - Performance

22

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

CPU only processing (baseline) SSD only processing

Evaluation - Performance

23

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

Summarizer Dynamic Offloading

Evaluation - Performance

24

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

SSD processing + transfer time(internal + external + In-SSD processing)

Host CPU processing time

Evaluation - Performance

25

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host timeExecution time normalized to baseline (CPU only)

Evaluation - Performance

26

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

Ex

ecu

tio

n t

ime

(no

rma

lize

d t

o b

ase

lin

e)

Evaluation - Performance

27

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

0.70 0.60

0.30

0.24

0.0

0.2

0.4

0.6

0.8

1.0

1.2

CPU only Dynamic

Chart TitleSDD time Host timeE

xe

cuti

on

tim

e (n

orm

ali

zed

to

ba

seli

ne

)

Evaluation - Performance

28

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

0.70 0.62

0.30

0.24

0.0

0.2

0.4

0.6

0.8

1.0

1.2

CPU only Dynamic

Chart TitleSDD time Host time

Performance improved by 14%

Data computation @ host Data transfer from storage

InternalExternal (host – storage)

W/O NDP

With NDPData computation @ storage

Evaluation - Performance

29

0

1

2

3

4

0 0.2 0.4 0.6 0.8 1

Static Dynamic

TPC-H Query6

SDD time Host time

Performance degraded by static NDP

Evaluation - Performance

30

16% 10%

20% 7%

Ex

ecu

tio

n t

ime

(no

rma

lize

d t

o b

ase

lin

e)

Ex

ecu

tio

n t

ime

(no

rma

lize

d t

o b

ase

lin

e)

Ex

ecu

tio

n t

ime

(no

rma

lize

d t

o b

ase

lin

e)

Ex

ecu

tio

n t

ime

(no

rma

lize

d t

o b

ase

lin

e)

Design Exploration – Higher Internal Bandwidth

31

Host

CPU Storage interface

Data transfer bottleneck

Commercial SSD maintains internal bandwidth ≈ external bandwidth

Design Exploration – Higher Internal Bandwidth

32

Host

CPU Storage interface SP

Data transfer bottleneck

Higher internal bandwidth without increasing external bandwidth

0%

20%

40%

60%

80%

100%

1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4

TPC-H Query 6 TPC-H Query 1 TPC-H Query 14 String Similarity Join Average

Sp

ee

du

pDesign Exploration – Higher Internal Bandwidth

33

External : Internal bandwidth ratio

0%

20%

40%

60%

80%

100%

1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4

TPC-H Query 6 TPC-H Query 1 TPC-H Query 14 String Similarity Join Average

Sp

ee

du

pDesign Exploration – Higher Internal Bandwidth

34

Summarizer is effective if an SSD platform has higher internal bandwidth

Design Exploration – Better SSD Processor

35

Host

CPU Storage interface

Better embedded processor is cost effective

AP

Design Exploration – Higher Internal Bandwidth

36

0%

20%

40%

60%

80%

100%

120%

X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16

TPC-H Query6 TPC-H Query1 TPC-H Query14 String Similarity Join Average

Sp

ee

du

p

Embedded processor performance

Design Exploration – Higher Internal Bandwidth

37

0%

20%

40%

60%

80%

100%

120%

X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16

TPC-H Query6 TPC-H Query1 TPC-H Query14 String Similarity Join Average

Sp

ee

du

p

Summarizer is a cost effective NDP solution with powerful storage processors

Conclusion

38

▪Dynamic computation offloading framework• Opportunistic in-SSD computation

• Page-level task control

• Optimal performance improvement

▪ Summrizer programming model

✓ Dynamic NDP framework for SSDs• Opportunistically enables in-SSD processing• Page-level NDP control• Automatic workload partitioning

✓ Summarizer programming model• Evaluation on the real development platform• Explored design space for future SSDs

Thank you

Summarizer:Trading Communication with Computing Near Storage

Gunjae Koo, Kiran Kumar Matam, Te I, H. V. Krishna Giri Nara, Jing Li,Hung-Wei Tseng, Steven Swanson, Murali Annavaram

(We thank to Dell EMC for supporting the SSD development board)