Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware



Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson


SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2

Motivation

Network servers
• Throughput matters
• Hardware intensive

Simultaneous Multithreading (SMT)
• Processor support for high throughput
• Simulated since mid-90s
• Now: Intel Xeon/Pentium 4 (Hyper-Threading) and IBM POWER5 available


How Does SMT Work?

• Simultaneous execution of multiple jobs
• Higher utilization of functional units

[Figure: functional-unit usage over cycles (direction of data flow) for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 together on an SMT processor. Colored blocks are functional units currently in use.]
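A toy sketch of the utilization argument, not a model of any real processor. The number of functional units, the two per-cycle demand traces, and the `utilization` helper are all hypothetical and chosen only for illustration.

```python
UNITS = 4  # hypothetical number of functional units per core


def utilization(per_cycle_usage, units=UNITS):
    """Fraction of functional-unit slots in use over the whole trace."""
    # Demand above the unit count is simply capped; a real core would
    # delay the excess work to a later cycle instead.
    used = sum(min(u, units) for u in per_cycle_usage)
    return used / (units * len(per_cycle_usage))


# Hypothetical per-cycle unit demand of two jobs (made-up numbers).
job1 = [2, 1, 0, 3, 1, 0]
job2 = [1, 2, 2, 0, 1, 3]

# Separate processors: each job leaves many units idle.
u1 = utilization(job1)
u2 = utilization(job2)

# SMT: both jobs issue in the same cycle and share one set of units.
smt = utilization([a + b for a, b in zip(job1, job2)])

print(f"job 1 alone: {u1:.0%}, job 2 alone: {u2:.0%}, SMT: {smt:.0%}")
```

In this made-up trace the combined demand never exceeds the unit count, so the SMT core does the work of both jobs in the same number of cycles. Real workloads contend for the shared units and caches, which is what the measurements in this talk examine.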


SMT Architecture

Appears as multiple processors to the OS and applications.

[Diagram: two sets of architectural state registers (#1 and #2) are the duplicated resource. The shared resources are the pipeline execution units, cache hierarchy, system bus, and main memory.]
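One way to see the "appears as multiple processors" effect on Linux is that `/proc/cpuinfo` lists each hardware thread as its own `processor` entry, while the `physical id` field identifies the package it belongs to. The sketch below groups logical CPUs by package. The sample text is a hypothetical, trimmed excerpt (real files contain many more fields), and `logical_cpus_by_package` is a helper name invented here.

```python
from collections import defaultdict

# Hypothetical excerpt in the style of Linux /proc/cpuinfo for one
# physical package with SMT enabled (values made up for illustration).
SAMPLE = """\
processor : 0
physical id : 0

processor : 1
physical id : 0
"""


def logical_cpus_by_package(cpuinfo_text):
    """Map each physical package id to its list of logical CPU numbers."""
    packages = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            packages[int(value)].append(current)
    return dict(packages)


groups = logical_cpus_by_package(SAMPLE)
print(groups)  # one package exposing two logical CPUs
```

To try it on a live Linux machine, pass the contents of `/proc/cpuinfo` instead of `SAMPLE`.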


Contributions

Detailed analysis of multiple real hardware platforms and server packages
• Includes previously ignored OS overheads

Micro-architectural performance analysis
• Demonstrates dominance of memory hierarchy

Comparison with simulation studies
• Explains why SMT provides relatively small benefits on real hardware
• Overly aggressive memory simulation yielded higher expected benefits


Outline

• Background
• Measurement methodology
• Throughput & improvement
• Micro-architectural performance
• Discussion


Measurements Overview

Metrics
• Server throughput
• Throughput improvements (relative speedups)
• Architectural features (CPI, miss ratio, etc.)

Multiple configurations
• Hardware platforms (clock speed, cache, etc.)
• Server software (Apache, Flash, TUX, etc.)
• Kernel configuration (uniprocessor and multiprocessor)


Hardware Platforms

Three models of Xeon processors

Clock rate             2.0GHz    3.06GHz    3.06GHz L3
L3 cache               -         -          1MB
Mem latency (cycles)   220       350        350

L1/L2 cache sizes, main memory, buses and # threads/processor are the same. The models differ in clock rate and cache.


Web Servers

5 Web server packages
• Apache-MP: multi-process
• Apache-MT: multi-thread
• Flash: event-driven
• TUX: in-kernel
• Haboob: Java server, staged multi-thread model

Benchmarks
• SPECweb96 and SPECweb99


System Configuration

5 configuration labels: # CPUs, SMT on/off, kernel type

Label     1P-UP          1P-MP            2T               2P               4T
# CPUs    1              1                1                2                2
SMT       off            off              on               off              on
Kernel    Uniprocessor   Multiprocessor   Multiprocessor   Multiprocessor   Multiprocessor

(T: # threads, P: # processors)
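The five labels can also be written down as data, which makes the later comparisons (2T vs. 1P-MP, 2T vs. 1P-UP, 4T vs. 2P) easy to state. This is only a sketch: the `Config` class and its field names are invented here, and the SMT and kernel settings follow the label definitions given on the improvement slides, with two hardware threads per processor when SMT is on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    cpus: int     # physical processors
    smt: bool     # SMT on/off
    kernel: str   # "UP" (uniprocessor) or "MP" (multiprocessor)

    @property
    def logical_cpus(self):
        # Two hardware threads per processor when SMT is enabled.
        return self.cpus * (2 if self.smt else 1)


CONFIGS = {
    "1P-UP": Config(cpus=1, smt=False, kernel="UP"),
    "1P-MP": Config(cpus=1, smt=False, kernel="MP"),
    "2T": Config(cpus=1, smt=True, kernel="MP"),
    "2P": Config(cpus=2, smt=False, kernel="MP"),
    "4T": Config(cpus=2, smt=True, kernel="MP"),
}

# (SMT configuration, baseline) pairs compared in the talk.
COMPARISONS = [("2T", "1P-MP"), ("2T", "1P-UP"), ("4T", "2P")]

for smt_label, base_label in COMPARISONS:
    smt_cfg, base_cfg = CONFIGS[smt_label], CONFIGS[base_label]
    print(f"{smt_label} ({smt_cfg.logical_cpus} logical CPUs) vs. "
          f"{base_label} ({base_cfg.logical_cpus} logical CPUs)")
```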


Outline

• Background
• Measurement methodology
• Throughput & improvement
  • Single processor
  • Dual-processor
• Micro-architectural performance
• Discussion


Throughput Evaluation

[Figure: Apache-MP, 3.06GHz. Throughput (Mb/s, axis 0 to 1200) for 1P-UP, 1P-MP, and 2T (w/ SMT) on a single processor, and for 2P and 4T (w/ SMT) on a dual-processor. Highlighted comparisons: 2T vs. 1P-MP, 2T vs. 1P-UP, and 4T vs. 2P.]


Improvement on Single Processor

2T: 2 threads, multiprocessor kernel
1P-MP: 1 thread, multiprocessor kernel

[Figure: 2T vs. 1P-MP. Throughput improvement (%, axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]
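The slides do not spell out the formula behind "throughput improvement (%)". A natural reading, assumed here, is the relative change of the SMT configuration over its baseline. The function name and the throughput numbers below are hypothetical.

```python
def improvement_pct(smt_throughput, baseline_throughput):
    """Relative throughput change of the SMT run over the baseline, in %."""
    return 100.0 * (smt_throughput - baseline_throughput) / baseline_throughput


# Hypothetical throughputs in Mb/s (not measurements from the talk).
gain = improvement_pct(smt_throughput=880.0, baseline_throughput=800.0)
loss = improvement_pct(smt_throughput=700.0, baseline_throughput=800.0)

print(f"gain: {gain:+.1f}%, loss: {loss:+.1f}%")
```

A negative value means enabling SMT lowered throughput relative to the baseline, which is why the chart axes extend below zero.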


Improvement on Single Processor

2T: 2 threads, multiprocessor kernel
1P-UP: 1 thread, uniprocessor kernel

[Figure: 2T vs. 1P-UP. Throughput improvement (%, axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors. Annotation: kernel overhead.]


Improvement on Dual-processor

4T: 4 threads (2 processors, 2T/processor)
2P: 2 physical processors (SMT disabled)

[Figure: 4T vs. 2P. Throughput improvement (%, axis -20 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]

• 2.0GHz & 3.06GHz with L3 are better
• Memory is still the bottleneck


Micro-architectural Analysis

Use Oprofile
• In-house patch to measure extra events

About 25 performance events
• Cache miss/hit
• TLB miss/hit
• Branches
• Pipeline stall, clear, etc.
• Bus utilization


L1 Instruction Cache Miss Rate

[Figure: L1 instruction cache miss rate (axis 0% to 20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under the 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT) configurations.]


L2 Cache Miss Rate

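• Instruction & data unified
• Lower rate in SMT due to higher L1 misses (more L1 misses mean more L2 accesses, which enlarges the denominator of the L2 miss rate)

[Figure: L2 cache miss rate (axis 0% to 10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under the 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT) configurations.]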


Putting Events Together

[Figure: Apache-MP. Stacked cycles per instruction (CPI, axis 0 to 16) for 1P-UP, 1P-MP, 2T, 2P, and 4T, broken down into work, L1 Miss, L2 Miss, ITLB, DTLB, Branch, Clear, and Buffer. Callouts mark work, L1 Miss, L2 Miss, and others.]


Non-overlapped CPI

L1/L2 miss penalty dominates
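The slides do not give the exact method for the breakdown. One plausible reading of "non-overlapped", assumed here, is that every event is charged its full penalty as if no latency were hidden, so each component is count times penalty divided by instructions retired. All counts and the L1 and branch penalties below are hypothetical. Only the 350-cycle memory latency comes from the hardware slide.

```python
def cpi_breakdown(instructions, events):
    """Per-event CPI contribution, assuming no overlap between events.

    events maps an event name to (count, penalty_cycles).
    """
    return {
        name: count * penalty / instructions
        for name, (count, penalty) in events.items()
    }


# Hypothetical event counts for one million retired instructions.
events = {
    "L1 miss": (50_000, 18),   # penalty is a made-up L2 hit latency
    "L2 miss": (5_000, 350),   # memory latency from the hardware slide
    "branch": (2_000, 20),     # made-up misprediction penalty
}

breakdown = cpi_breakdown(1_000_000, events)
dominant = max(breakdown, key=breakdown.get)

for name, cpi in breakdown.items():
    print(f"{name}: {cpi:.2f} CPI")
```

Because real processors overlap some of these latencies, a sum built this way is an upper bound rather than the measured CPI.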


Measuring Bus Utilization

Event: FSB_DATA_ACTIVITY
• CPU cycles when the bus is busy

Normalized to CPU speed
• Comparable across all CPU clock rates
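A minimal sketch of the normalization, assuming utilization is the share of CPU cycles during which the bus was busy. The function name, the elapsed time, and the cycle counts are hypothetical. How the raw FSB_DATA_ACTIVITY count was actually collected and scaled is not described in the slides.

```python
def bus_utilization_pct(busy_cpu_cycles, elapsed_seconds, cpu_hz):
    """Bus-busy cycles as a percentage of all CPU cycles in the interval."""
    total_cycles = elapsed_seconds * cpu_hz
    return 100.0 * busy_cpu_cycles / total_cycles


# Hypothetical one-second samples on two clock rates.
slow = bus_utilization_pct(busy_cpu_cycles=2.0e8, elapsed_seconds=1.0, cpu_hz=2.0e9)
fast = bus_utilization_pct(busy_cpu_cycles=3.06e8, elapsed_seconds=1.0, cpu_hz=3.06e9)

print(f"2.0GHz: {slow:.1f}%, 3.06GHz: {fast:.1f}%")
```

Dividing by the total cycle count is what makes the two machines comparable: the faster part accumulates more busy cycles per second, yet both samples work out to the same percentage.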


Bus Utilization Results

• 2.0GHz & 3.06GHz L3 have fewer data transfer cycles
• Lower memory latency in 2.0GHz & 3.06GHz with L3
• Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95

[Figure: Apache-MP. Bus utilization (%, axis 0 to 20) for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]
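For reference, a coefficient of correlation like the one quoted can be computed as the standard Pearson coefficient. The sketch below is generic. The paired values are hypothetical and are not the talk's measurements, and the slides do not say which data points were paired.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (bus utilization %, speedup %) pairs for illustration.
bus_util = [5.0, 8.0, 10.0, 14.0, 16.0]
speedup = [2.0, 6.0, 5.0, 11.0, 12.0]

r = pearson(bus_util, speedup)
print(f"r = {r:.2f}")
```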


Outline

• Background
• Measurement parameters
• Throughput speedup
• Micro-architectural performance
• Discussion
  • Compare to simulation
  • Other Web workloads


SMT Performance on Web Servers

[Figure: throughput improvement (axis -10% to 100%) from SMT on Web servers, comparing simulation with measurements under the multiprocessor kernel, the uniprocessor kernel, and the dual-processor.]


Compare to Simulation

               Simulation               Measurement
               Size       Miss rate     Size       Miss rate
L1-I           128 KB     2.0%          12 KB      17%
L1-D           128 KB     3.6%          8 KB       5.7%
L2             16 MB      1.4%          512 KB     3.9%
Mem latency    90 cycles                220 ~ 350 cycles
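A back-of-envelope reading of these numbers, not a calculation from the talk: multiplying the L2 miss rate by the memory latency gives a rough expected memory stall per L2 access. The helper name is invented here, and the estimate ignores any overlap of misses with other work.

```python
def mem_stall_per_l2_access(l2_miss_rate, mem_latency_cycles):
    """Rough expected memory-stall cycles per L2 access (no overlap)."""
    return l2_miss_rate * mem_latency_cycles


# Values taken from the comparison above.
simulated = mem_stall_per_l2_access(0.014, 90)
measured_low = mem_stall_per_l2_access(0.039, 220)
measured_high = mem_stall_per_l2_access(0.039, 350)

print(f"simulation:  {simulated:.2f} cycles")
print(f"measurement: {measured_low:.2f} to {measured_high:.2f} cycles")
print(f"ratio: {measured_low / simulated:.1f}x to {measured_high / simulated:.1f}x")
```

Under this crude estimate the measured machines pay about 6.8 to 10.8 times more memory stall per L2 access than the simulated configuration, which is consistent with the talk's point that the simulated memory system was overly aggressive.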


Processor Development Trend

                      1996            2000            2003
Simulated models:     62-cycle mem    90-cycle mem    90-cycle mem
                      32 KB L1        128 KB L1       64 KB L1
                      256 KB L2       16384 KB L2     16384 KB L2

Actual processors:    74-cycle mem    94-cycle mem    350-cycle mem
                      16 KB L1        16 KB L1        8-12 KB L1
                      256 KB L2       512 KB L2       512 KB L2


SMT on SPECweb99

SPECweb99 results in paper
• Dynamic + static
• Multiple programs
  • CGI requests, user profile logging, etc.

Speedup very close to static-only workloads
• No more negative speedups in Flash
• May be due to better sharing of resources of different programs


Summary

More realistic speedup evaluation of SMT
• 3 processors, 5 servers, 2 kernels
• Exposed factors not previously examined
• 5~15% speedup in our best cases

Detailed analysis of memory hierarchy impact on SMT performance
• All other architecture overheads secondary
• Reasons why simulation results were overly optimistic


Thank you

http://www.cs.princeton.edu/~yruan


Future Work

Ways of improving Simultaneous Multithreading performance
• Server performance on POWER5
• Using execution-driven simulation for deeper understanding

Study Chip Multiprocessor (CMP)
• Intel, AMD, and IBM


Pipeline Clears (per Byte)

Conditions when the whole pipeline needs to be flushed

[Figure: pipeline clears per byte (axis 0.00 to 0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob. Legend: 1T-UP, 1T-MP, 2T, 2P, 4T.]