Reducing OLTP Instruction Misses with Thread Migration

Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos
University of Toronto; École Polytechnique Fédérale de Lausanne


Transcript of Reducing OLTP Instruction Misses with Thread Migration

Page 1: Reducing OLTP Instruction Misses with Thread Migration

Reducing OLTP Instruction Misses with Thread Migration

Islam Atta, Pınar Tözün, Anastasia Ailamaki, Andreas Moshovos

University of Toronto; École Polytechnique Fédérale de Lausanne

Page 2: Reducing OLTP Instruction Misses with Thread Migration

OLTP on an Intel Xeon X5660 (Shore-MT, Hyper-threading disabled)

[Figure: instructions per cycle (y-axis 0 to 0.9) for TPC-C and TPC-E.]

IPC < 1 on a 4-issue machine

[Figure: breakdown of core stalls (y-axis 0% to 100%) for TPC-C and TPC-E, split into resource stalls (includes data) and instruction stalls.]

70-80% of stalls are instruction stalls

Page 3: Reducing OLTP Instruction Misses with Thread Migration

OLTP L1 Instruction Cache Misses

Trace simulation, 4-way L1-I cache, Shore-MT

[Figure: misses per k-instruction (y-axis 0 to 60) versus cache size (16 KB to 1024 KB) for TPC-C and TPC-E. One cache size is annotated "Most common today!".]

~512KB is enough for OLTP instruction footprint

Page 4: Reducing OLTP Instruction Misses with Thread Migration

Reducing Instruction Stalls at the hardware level

• Larger L1-I cache size
– Higher access latency
• Different replacement policies
– Does not really affect OLTP workloads
• Advanced prefetching
– Has too much space overhead (40KB per core)
• Simultaneous multi-threading
– Increases IPC per hardware context
– Pollutes the cache

Page 5: Reducing OLTP Instruction Misses with Thread Migration

Alternative: Thread Migration

• Enables usage of aggregate L1-I capacity
– Large cache size without increased latency
• Can exploit instruction commonality
– Localizes common transaction instructions
• Dynamic hardware solution
– More general purpose

Page 6: Reducing OLTP Instruction Misses with Thread Migration

Transactions Running in Parallel

[Figure: threads T1, T2, and T3 each run a transaction; each transaction's instruction stream is divided into parts that can fit into the L1-I.]

Common instructions among concurrent threads

Page 7: Reducing OLTP Instruction Misses with Thread Migration

Scheduling Threads

[Figure: two schedules of threads T1, T2, and T3 on cores 0 to 3 over time, labelled "Traditional" and "TMi". Each panel shows the L1-I contents and a running count of total misses; the values 3, 6, 9, 10 and 1, 2, 3, 4, 4 survive from the figure.]

Page 8: Reducing OLTP Instruction Misses with Thread Migration

TMi

• Group threads
• Wait till L1-I is almost full
– Count misses
– Record last N misses
– Misses > threshold => Migrate

[Figure: cores 0 and 1 over time, each with its L1-I. Threads T1 and T2 run Transaction A; threads T3 and T4 run Transaction B.]
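The trigger on this slide can be sketched in a few lines. This is only an illustration of one possible reading, in which the miss count stands in for how full the L1-I is; the class name, method names, and the reset-on-migration behaviour are assumptions, not part of the talk. The defaults of 256 and 6 are the miss threshold and history length listed on the Experimental Setup slide.

```python
from collections import deque


class MigrationTrigger:
    """Per-thread L1-I miss tracking (hypothetical helper, not the authors' hardware)."""

    def __init__(self, miss_threshold=256, history_len=6):
        self.miss_threshold = miss_threshold
        self.misses = 0
        # Bounded history: only the last N missed block addresses are kept.
        self.recent = deque(maxlen=history_len)

    def on_miss(self, block_addr):
        """Record one L1-I miss; return True once misses exceed the threshold."""
        self.misses += 1
        self.recent.append(block_addr)
        return self.misses > self.miss_threshold

    def reset(self):
        """Assumption: counters restart after the thread migrates."""
        self.misses = 0
        self.recent.clear()
```

A caller would invoke `on_miss` on every instruction miss of the running thread and, when it returns True, consult the recorded addresses in `recent` to decide where to go.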

Page 9: Reducing OLTP Instruction Misses with Thread Migration

TMi

Where to migrate?
• Check the last N misses recorded in other caches
1) No matching cache => Move to an idle core if exists
2) Matching cache => Move to that core
3) None of above => Do not move

[Figure: cores 0 and 1 over time, each with its L1-I; threads T1 and T2 of Transaction A move between the two cores.]
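The three rules can be written as a small selection function. This is a sketch under stated assumptions: the slide does not define what counts as a "matching" cache, so here a cache matches when it already holds every one of the thread's recorded misses. The function name, its parameters, and the lowest-core-id tie-break are hypothetical.

```python
def choose_target(recent_misses, l1i_contents, idle_cores, current_core):
    """Return the core to migrate to, or None to stay put.

    recent_misses: block addresses of the thread's last N misses.
    l1i_contents:  dict mapping core id -> set of blocks in that core's L1-I.
    idle_cores:    iterable of core ids with no thread running.
    """
    # Matching cache: a remote L1-I that holds all recently missed blocks
    # (assumed definition of "matching").
    for core, blocks in sorted(l1i_contents.items()):
        if core == current_core or not recent_misses:
            continue
        if all(block in blocks for block in recent_misses):
            return core
    # No matching cache: move to an idle core if one exists.
    for core in sorted(idle_cores):
        if core != current_core:
            return core
    # None of the above: do not move.
    return None
```

Checking for a matching cache before falling back to an idle core gives the same outcome as the slide's numbering, since the idle-core rule only applies when no cache matches.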

Page 10: Reducing OLTP Instruction Misses with Thread Migration

Experimental Setup

• Trace Simulation
– PIN to extract instructions & data accesses per transaction
– 16-core system
– 32KB 8-way set-associative L1 caches
– Miss-threshold is 256
– Last 6 misses are kept
• Shore-MT as the storage manager
– Workloads: TPC-C, TPC-E
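To make the simulated cache configuration concrete, here is a minimal trace-driven model of one 32KB 8-way set-associative cache that reports misses per k-instruction. It is a sketch, not the simulator used in the talk: the 64-byte block size and LRU replacement are assumptions, and the class and function names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes=32 * 1024, ways=8, block_bytes=64):
        self.ways = ways
        self.block_bytes = block_bytes  # 64 B is an assumption, not from the slides
        self.num_sets = size_bytes // (ways * block_bytes)
        # One ordered dict per set; the oldest entry is the LRU block.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        """Look up a byte address; return True on a hit, False on a miss."""
        block = addr // self.block_bytes
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        return False


def mpki(misses, instructions):
    """Misses per thousand instructions."""
    return 1000.0 * misses / instructions
```

With the defaults, the cache has 32 KB / (8 ways x 64 B) = 64 sets. Feeding it the instruction addresses of a trace and dividing `misses` by the instruction count gives the metric plotted on the miss slides.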

Page 11: Reducing OLTP Instruction Misses with Thread Migration

Impact on L1-I Misses

[Figure: instruction misses per k-instruction (y-axis 0 to 45) for No Migration, TMi, and TMi Blind on TPC-C and TPC-E.]

Instruction misses reduced by half

Page 12: Reducing OLTP Instruction Misses with Thread Migration

Impact on L1-D Misses

[Figure: misses per k-instruction (y-axis 0 to 45), stacked as instruction, read data, and write data misses, for No Migration, TMi, and TMi Blind on TPC-C and TPC-E.]

Cannot ignore increased data misses

Page 13: Reducing OLTP Instruction Misses with Thread Migration

TMi's Challenges

• Dealing with the data left behind
– Prefetching
• Depends on thread identification
– Software assisted
– Hardware detection
• OS support needed
– Disabling OS control over thread scheduling

Page 14: Reducing OLTP Instruction Misses with Thread Migration

Conclusion

• ~50% of the time OLTP stalls on instructions
• Spread computation through thread migration
• TMi
– Halves L1-I misses
– Time-wise ~30% expected improvement
– Data misses should be handled

Thank you!

Page 15: Reducing OLTP Instruction Misses with Thread Migration


BACKUP

Page 16: Reducing OLTP Instruction Misses with Thread Migration

L1-I Misses per K-Instruction

[Figure: instruction MPKI (y-axis 0 to 50), broken down into capacity, conflict, and compulsory misses, for cache sizes from 16K to 1M and for 2-way, 4-way, 8-way, and fully associative (FA) caches, on TPC-C and TPC-E.]

Page 17: Reducing OLTP Instruction Misses with Thread Migration

L1-D Misses per K-Instruction

[Figure: data MPKI (y-axis 0 to 10), broken down into capacity, conflict, and compulsory misses, for cache sizes from 16K to 1M and for 8-way and fully associative (FA) caches, on TPC-C and TPC-E.]

Page 18: Reducing OLTP Instruction Misses with Thread Migration

Replacement Policies

[Figure: I-MPKI and D-MPKI (y-axis 0 to 30) under the LRU, LIP, BIP, and DIP replacement policies, on TPC-C and TPC-E.]

Page 19: Reducing OLTP Instruction Misses with Thread Migration

Experimental Setup

Intel Xeon X5660 Server
#Sockets              2
#Cores in a Socket    6 (OoO)
#HW Contexts          24
Clock Speed           2.80GHz
Memory                48GB
LLC (L3)              12MB
L2 (per core)         256KB
L1 (per core)         32KB (both I and D)
Hyper-Threading       Enabled
OS                    Ubuntu 10.04 with Linux kernel 2.6.32

• Intel VTune 2011
– Interface for hardware counters
• Working set fits in RAM
• Log flushed to RAM
• Each run:
– Starts with initial database
– Each worker executes 1000 transactions before VTune starts collecting numbers for 60 secs

Page 20: Reducing OLTP Instruction Misses with Thread Migration

Formulas

• IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD
• Data Stalls = RESOURCE_STALLS.ANY
• Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
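The three formulas can be applied to raw counter readings as follows. This is a sketch for illustration: the function name and the returned dictionary keys are hypothetical, and the counter values must come from a profiler run.

```python
def stall_breakdown(inst_retired, clk_unhalted, core_stall_cycles, resource_stalls):
    """Apply the slide's formulas to four hardware-counter readings.

    inst_retired:      INST_RETIRED.ANY_P
    clk_unhalted:      CPU_CLK_UNHALTED.THREAD
    core_stall_cycles: UOPS_ISSUED.CORE_STALL_CYCLES
    resource_stalls:   RESOURCE_STALLS.ANY
    """
    ipc = inst_retired / clk_unhalted
    data_stalls = resource_stalls
    instruction_stalls = core_stall_cycles - resource_stalls
    return {
        "ipc": ipc,
        "data_stalls": data_stalls,
        "instruction_stalls": instruction_stalls,
        # Share of all stall cycles attributed to instruction stalls.
        "instruction_stall_share": instruction_stalls / core_stall_cycles,
    }
```

For example, with made-up readings (not measurements from the talk) of 800 retired instructions, 1000 unhalted cycles, 500 stall cycles, and 125 resource-stall cycles, the function reports an IPC of 0.8 and an instruction-stall share of 0.75.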

Page 21: Reducing OLTP Instruction Misses with Thread Migration

OLTP L1 Instruction Cache Misses

Trace simulation, 4-way L1-I cache, Shore-MT

[Figure: capacity misses per k-instruction (y-axis 0 to 50) versus cache size (16K to 1M) for TPC-C and TPC-E. One cache size is annotated "Most common today!".]

~512KB is enough for OLTP instruction footprint