Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
SIGMETRICS’05 http://www.cs.princeton.edu/~yruan
Motivation
• Network servers
    • Throughput matters
    • Hardware intensive
• Simultaneous Multithreading (SMT)
    • Processor support for high throughput
    • Simulated since mid-90s
    • Now available: Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5
How Does SMT Work?
• Simultaneous execution of multiple jobs
• Higher utilization of functional units
[Figure: functional-unit usage over cycles (direction of data flow) for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 together on an SMT processor. Colored blocks are functional units currently in use.]
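The utilization argument on this slide can be illustrated with a toy model. This is a sketch that is not from the talk: the 4-unit processor, the per-cycle demand lists, and the function names are all hypothetical. Each job is a list giving how many functional units it wants in each cycle, and the SMT processor lets a second thread fill the slots the first one leaves idle.

```python
UNITS = 4  # hypothetical number of functional units per processor

def utilization(demands, units=UNITS):
    """Fraction of unit-slots used when one job runs alone, one entry per cycle."""
    return sum(demands) / (units * len(demands))

def smt_cycles(job_a, job_b, units=UNITS):
    """Cycles needed when two jobs share the units of one SMT processor."""
    assert all(d <= units for d in job_a + job_b), "demand exceeds unit count"
    ia = ib = cycles = 0
    while ia < len(job_a) or ib < len(job_b):
        free = units
        # Thread A issues first; thread B fills whatever slots remain.
        if ia < len(job_a) and job_a[ia] <= free:
            free -= job_a[ia]
            ia += 1
        if ib < len(job_b) and job_b[ib] <= free:
            free -= job_b[ib]
            ib += 1
        cycles += 1
    return cycles

def smt_utilization(job_a, job_b, units=UNITS):
    """Fraction of unit-slots used when both jobs share one SMT processor."""
    work = sum(job_a) + sum(job_b)
    return work / (units * smt_cycles(job_a, job_b, units))

job1 = [2, 1, 3, 1]  # hypothetical per-cycle demands
job2 = [1, 2, 1, 2]
print(utilization(job1), utilization(job2))  # each job alone on its own processor
print(smt_utilization(job1, job2))           # both jobs on one SMT processor
```

The model deliberately ignores the shared cache hierarchy and memory bus, so it shows only the best-case motivation for SMT, not the memory effects measured later in the talk.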
SMT Architecture
• Appears as multiple processors to the OS and applications
• Duplicated resource: architectural state registers (#1 and #2)
• Shared resources: pipeline execution units, cache hierarchy, system bus, main memory
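Because each hardware thread appears to the OS as a processor, the OS-visible CPU count alone does not reveal whether SMT is active. A minimal sketch, assuming a Linux x86 system whose /proc/cpuinfo reports the `siblings` and `cpu cores` fields (the function name is hypothetical, and this is not how the talk's measurements were configured):

```python
def smt_enabled(cpuinfo_text):
    """Return True if a package reports more logical siblings than cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        # More logical CPUs than cores in one package implies SMT is on.
        if siblings is not None and cores is not None and siblings > cores:
            return True
    return False

# Hypothetical excerpts in the /proc/cpuinfo layout.
with_smt = "processor\t: 0\nsiblings\t: 2\ncpu cores\t: 1\n"
without_smt = "processor\t: 0\nsiblings\t: 1\ncpu cores\t: 1\n"
print(smt_enabled(with_smt), smt_enabled(without_smt))
```

On a live system the text would come from reading /proc/cpuinfo; the function takes a string so it can be tried anywhere.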
Contributions
• Detailed analysis of multiple real hardware platforms and server packages
    • Includes previously ignored OS overheads
• Micro-architectural performance analysis
    • Demonstrates dominance of memory hierarchy
• Comparison with simulation studies
    • Explains why SMT provides relatively small benefits on real hardware
    • Overly aggressive memory simulation yielded higher expected benefits
Outline
• Background
• Measurement methodology
• Throughput & improvement
• Micro-architectural performance
• Discussion
Measurements Overview
• Metrics
    • Server throughput
    • Throughput improvements (relative speedups)
    • Architectural features (CPI, miss ratio, etc.)
• Multiple configurations
    • Hardware platforms (clock speed, cache, etc.)
    • Server software (Apache, Flash, TUX, etc.)
    • Kernel configuration (uniprocessor and multiprocessor)
Hardware Platforms
Three models of Xeon processors: 2.0GHz, 3.06GHz, and 3.06GHz with L3
• L3 cache: none, except 1MB on the 3.06GHz L3 model
• Memory latency: 220 ~ 350 cycles
• L1/L2 cache sizes, main memory, buses, and # threads/processor are the same
• The models differ in clock rate and cache
Web Servers
• 5 Web server packages
    • Apache-MP: multi-process
    • Apache-MT: multi-thread
    • Flash: event-driven
    • TUX: in-kernel
    • Haboob: Java server, staged multi-thread model
• Benchmarks: SPECweb96 and SPECweb99
System Configuration
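5 configuration labels, each defined by # CPUs, SMT on/off, and kernel type
(T: number of threads, P: number of processors)

Label    # CPUs   SMT   Kernel
1P-UP    1        off   Uniprocessor
1P-MP    1        off   Multiprocessor
2T       1        on    Multiprocessor
2P       2        off   Multiprocessor
4T       2        on    Multiprocessor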
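The five configuration labels can be written down as data, which makes the naming rule explicit. This is an illustrative sketch: the `Config` class and its field names are hypothetical, and it assumes two hardware threads per processor whenever SMT is on, as in the 2T and 4T labels.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    label: str
    cpus: int     # physical processors
    smt: bool     # SMT (Hyper-Threading) enabled
    kernel: str   # "UP" (uniprocessor) or "MP" (multiprocessor)

    @property
    def hw_threads(self):
        # Assumption: two hardware threads per processor when SMT is on.
        return self.cpus * (2 if self.smt else 1)

CONFIGS = [
    Config("1P-UP", 1, False, "UP"),
    Config("1P-MP", 1, False, "MP"),
    Config("2T", 1, True, "MP"),
    Config("2P", 2, False, "MP"),
    Config("4T", 2, True, "MP"),
]

for c in CONFIGS:
    print(f"{c.label}: {c.cpus} CPU(s), {c.hw_threads} hardware thread(s), {c.kernel} kernel")
```

Keeping 1P-UP and 1P-MP as separate entries matters because they differ only in kernel type, which is what lets the multiprocessor kernel overhead be separated from the SMT effect.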
Outline
• Background
• Measurement methodology
• Throughput & improvement
    • Single processor
    • Dual-processor
• Micro-architectural performance
• Discussion
Throughput Evaluation
[Figure: Apache-MP, 3.06GHz. Throughput (Mb/s, axis 0 to 1200) for 1P-UP, 1P-MP, and 2T (w/ SMT) on a single processor, and for 2P and 4T (w/ SMT) on a dual-processor.]
Comparisons:
• 2T vs. 1P-MP
• 2T vs. 1P-UP
• 4T vs. 2P
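The three comparisons reduce to a relative-improvement calculation over per-configuration throughput. A minimal sketch, assuming the usual definition of relative improvement; the throughput numbers are made up for illustration and are not the measured values.

```python
def improvement_pct(baseline, with_smt):
    """Relative throughput improvement of the SMT run over its baseline, in percent."""
    return (with_smt - baseline) / baseline * 100.0

# (SMT configuration, baseline configuration) pairs named on the slide.
COMPARISONS = [("2T", "1P-MP"), ("2T", "1P-UP"), ("4T", "2P")]

# Hypothetical throughputs in Mb/s, for illustration only.
throughput = {"1P-UP": 500.0, "1P-MP": 450.0, "2T": 495.0, "2P": 800.0, "4T": 840.0}

for smt_cfg, base_cfg in COMPARISONS:
    pct = improvement_pct(throughput[base_cfg], throughput[smt_cfg])
    print(f"{smt_cfg} vs. {base_cfg}: {pct:+.1f}%")
```

With these made-up numbers the same 2T run looks like a gain against 1P-MP and a small loss against 1P-UP, which is why the choice of baseline kernel matters.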
Improvement on Single Processor: 2T vs. 1P-MP
• 2T: 2 threads, multiprocessor kernel
• 1P-MP: 1 thread, multiprocessor kernel
[Figure: throughput improvement (%, axis -10 to 40) of 2T over 1P-MP for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]
Improvement on Single Processor: 2T vs. 1P-UP
• 2T: 2 threads, multiprocessor kernel
• 1P-UP: 1 thread, uniprocessor kernel
[Figure: throughput improvement (%, axis -10 to 40) of 2T over 1P-UP for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors. Annotation: kernel overhead.]
Improvement on Dual-processor: 4T vs. 2P
• 4T: 4 threads (2 processors, 2 threads per processor)
• 2P: 2 physical processors (SMT disabled)
[Figure: throughput improvement (%, axis -20 to 40) of 4T over 2P for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]
• 2.0GHz & 3.06GHz with L3 are better
• Memory is still the bottleneck
Micro-architectural Analysis
• Use Oprofile
    • In-house patch to measure extra events
• About 25 performance events
    • Cache miss/hit
    • TLB miss/hit
    • Branches
    • Pipeline stall, clear, etc.
    • Bus utilization
L1 Instruction Cache Miss Rate
[Figure: L1 instruction cache miss rate (axis 0% to 20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]
L2 Cache Miss Rate
• Instruction & data unified
• Lower rate in SMT due to higher L1 misses
[Figure: L2 cache miss rate (axis 0% to 10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]
Putting Events Together
[Figure: Apache-MP. Cycles per instruction (CPI, axis 0 to 16) for 1P-UP, 1P-MP, 2T, 2P, and 4T, broken down into work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, and buffer. Callouts mark work, L1 miss, L2 miss, and others.]
Non-overlapped CPI
L1/L2 miss penalty dominates
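A non-overlapped CPI breakdown can be sketched by charging every event a fixed penalty and assuming no penalties overlap. This is an assumed formulation, not the authors' exact method: the function name, event counts, and penalties below are hypothetical, apart from 350 cycles, which is the upper memory latency quoted in the talk.

```python
def non_overlapped_cpi(instructions, work_cpi, events):
    """Per-component CPI, assuming event penalties never overlap.

    events maps an event name to (event count, penalty in cycles).
    """
    breakdown = {"work": work_cpi}
    for name, (count, penalty) in events.items():
        breakdown[name] = count * penalty / instructions
    return breakdown

# Hypothetical counts and penalties, for illustration only.
events = {
    "L1 miss": (50_000, 18),
    "L2 miss": (10_000, 350),
    "branch": (5_000, 20),
}
breakdown = non_overlapped_cpi(1_000_000, 1.0, events)
for name, cpi in breakdown.items():
    print(f"{name}: {cpi:.2f}")
print(f"total: {sum(breakdown.values()):.2f}")
```

Because overlap is ignored, a sum of this kind is an upper bound on the stall contribution; real hardware hides part of these penalties.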
Measuring Bus Utilization
• Event: FSB_DATA_ACTIVITY
    • CPU cycles when the bus is busy
• Normalized to CPU speed
    • Comparable across all CPU clock rates
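One reading of this normalization is to divide the busy-cycle count by the total CPU cycles elapsed in the measurement interval, which yields a percentage that does not depend on clock rate. The sketch below follows that reading; the function name and sample counts are hypothetical.

```python
def bus_utilization_pct(busy_cpu_cycles, clock_hz, elapsed_seconds):
    """Share of CPU cycles during which the bus was busy, in percent."""
    total_cycles = clock_hz * elapsed_seconds
    return 100.0 * busy_cpu_cycles / total_cycles

# Hypothetical one-second samples on two clock rates.
print(bus_utilization_pct(2.0e8, 2.0e9, 1.0))    # 2.0GHz part
print(bus_utilization_pct(3.06e8, 3.06e9, 1.0))  # 3.06GHz part
```

Both samples spend the same fraction of their cycles on bus activity, so they report the same utilization even though the raw counts differ.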
Bus Utilization Results
• 2.0GHz & 3.06GHz L3 have fewer data transfer cycles
    • Lower memory latency in 2.0GHz & 3.06GHz with L3
• Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95
[Figure: Apache-MP. Bus utilization (%, axis 0 to 20) for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors.]
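The slide does not say which correlation coefficient was used. Assuming it is the Pearson coefficient, it can be computed from paired bus-utilization and speedup samples as in this sketch; the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical paired samples: bus utilization (%) and speedup (%).
bus_util = [6.0, 9.0, 12.0, 15.0]
speedup = [4.0, 7.0, 8.0, 13.0]
print(round(pearson(bus_util, speedup), 3))
```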
Outline
• Background
• Measurement parameters
• Throughput speedup
• Micro-architectural performance
• Discussion
    • Compare to simulation
    • Other Web workloads
SMT Performance on Web Servers
[Figure: throughput improvement from SMT (axis -10% to 100%), comparing simulation with measurements on the multiprocessor kernel, the uniprocessor kernel, and the dual-processor configuration.]
Compare to Simulation
              Simulation              Measurement
              Size      Miss rate     Size      Miss rate
L1-I          128 KB    2.0%          12 KB     17%
L1-D          128 KB    3.6%          8 KB      5.7%
L2            16 MB     1.4%          512 KB    3.9%
Mem latency   90 cycles               220 ~ 350 cycles
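A back-of-the-envelope estimate shows how these parameters compound. The sketch below multiplies the L1-D miss rate, the L2 miss rate, and the memory latency to get expected memory stall cycles per data access. It assumes the L2 figure is a local miss rate (misses per L2 access) and ignores instruction fetches and L2 hit latency, so it is an illustration rather than the authors' analysis.

```python
def mem_stall_per_access(l1_miss_rate, l2_miss_rate, mem_latency_cycles):
    """Expected main-memory stall cycles per data access.

    Assumes l2_miss_rate is local (per L2 access) and that penalties do not overlap.
    """
    return l1_miss_rate * l2_miss_rate * mem_latency_cycles

simulated = mem_stall_per_access(0.036, 0.014, 90)
measured_low = mem_stall_per_access(0.057, 0.039, 220)
measured_high = mem_stall_per_access(0.057, 0.039, 350)

print(f"simulation:  {simulated:.3f} cycles per access")
print(f"measurement: {measured_low:.3f} to {measured_high:.3f} cycles per access")
print(f"ratio: {measured_low / simulated:.1f}x to {measured_high / simulated:.1f}x")
```

Under these assumptions the measured machines pay roughly an order of magnitude more memory stall per access than the simulated configuration, which is consistent with the talk's point that the simulated memory system was overly aggressive.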
Processor Development Trend
                     1996          2000          2003
Simulated models:
    Memory latency   62 cycles     90 cycles     90 cycles
    L1               32 KB         128 KB        64 KB
    L2               256 KB        16384 KB      16384 KB
Actual processors:
    Memory latency   74 cycles     94 cycles     350 cycles
    L1               16 KB         16 KB         8-12 KB
    L2               256 KB        512 KB        512 KB
SMT on SPECweb99
• SPECweb99 results in paper
    • Dynamic + static
    • Multiple programs
        • CGI requests, user profile logging, etc.
• Speedup very close to static-only workloads
    • No more negative speedups in Flash
    • May be due to better sharing of resources among different programs
Summary
• More realistic speedup evaluation of SMT
    • 3 processors, 5 servers, 2 kernels
    • Exposed factors not previously examined
    • 5~15% speedup in our best cases
• Detailed analysis of memory hierarchy impact on SMT performance
    • All other architecture overheads secondary
    • Reasons why simulation results were overly optimistic
Thank you
http://www.cs.princeton.edu/~yruan
Future Work
• Ways of improving Simultaneous Multithreading performance
• Server performance on POWER5
• Using execution-driven simulation for deeper understanding
• Study Chip Multiprocessor (CMP)
    • Intel, AMD, and IBM
Pipeline Clears (per Byte)
Conditions when the whole pipeline needs to be flushed
[Figure: pipeline clears per byte (axis 0.00 to 0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T.]