Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...
Transcript of Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...
IBM Research - Tokyo
IISWC 2010 @ Atlanta © 2010 IBM Corporation
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors
Hiroshi Inoue and Toshio NakataniIBM Research – Tokyo
2
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
An Old Question on New Platforms
Threads vs. Processes: Which is better to achieve higher performance?
– Each process has own virtual memory space
Using processes provides better inter-process isolation
– Threads in one process shares a virtual memory space
Multi-thread processing is better for performance due to its memory efficiency (smaller footprint)
Is this answer still valid on today’s processors with multiple cores and multiple SMT threads in a core?
3
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Approach
Comparing multi-thread model and multi-process model on two types of hardware parallelism
– SMT scalability
– Core scalability
4
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
SMT Scalability and Core Scalability
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
SMT scalability: performance improvement using increasing number of SMT threads in one core
threadthreadthreadthread
5
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
SMT Scalability and Core Scalability
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
threadthreadthread thread
threadthread
threadthreadthread
threadthreadthread
threadthreadthread
threadthreadthread
threadthreadthread
threadthreadthread
SMT scalability: performance improvement using increasing number of SMT threads in one core
Core scalability: performance improvement using increasing number of cores with one thread in each core
thread thread thread thread thread thread threadthread
6
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Experimental Setup
Systems– Niagara system
• UltraSPARC T1 (Niagara 1) 1.2 GHz • 8 cores with 4 SMT threads in each core• Solaris 10
– Nehalem system• Xeon X5570 (Nehalem) 2.93 GHz• 4 cores with 2 SMT threads in each core• Red Hat Enterprise Linux 5.4
Software– Benchmarks: SPECjbb2005, SPECjvm2008 – 32-bit HotSpot Server VM for Java 6 Update 17– Java heap size: 256 MB per thread using large page
7
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0
2000
4000
6000
8000
10000
12000
14000
0 1 2 3 4number of SMT threads
thro
ughp
ut (j
bb_o
ps).
multi-thread modelmulti-proces model
0
10000
20000
30000
40000
50000
60000
70000
0 1 2number of SMT threads
thro
ughp
ut (j
bb_o
ps) .
multi-thread modelmulti-proces model
SMT Scalability of SPECjbb2005
multi-thread model was 9.2% faster multi-thread model was 5.5% faster
multi-thread model uses 1 JVM
multi-process model uses 4 JVMs
higher is faster
on Niagara on Nehalem
8
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0
5000
10000
15000
20000
25000
30000
35000
40000
0 1 2 3 4 5 6 7 8number of cores
thro
ughp
ut (j
bb_o
ps).
multi-thread modelmulti-proces model
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
0 1 2 3 4number of cores
thro
ughp
ut (j
bb_o
ps).
multi-thread modelmulti-proces model
Core Scalability of SPECjbb2005
multi-thread model was 2.1% slower
higher is faster
on Niagara
multi-thread model was 3.4% faster
on Nehalem
9
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Core Scalability and SMT Scalability on Niagara
multi-thread model was 9.6% faster on average
Core scalabilitySMT scalability
No performance advantage for multi-thread model
(please refer to the paper on results for Nehalem)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
SPECjbb20
05
compil
er.co
mpiler
compil
er.su
nflow
xml.v
alida
tion
xml.tr
ansfo
rmse
rial
crypto
.aes
compre
ssmpe
gaud
ioav
erage
spee
dup
with
four
thre
ads
multi-thread modelmulti-process model
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
SPECjbb20
05
compil
er.co
mpiler
compil
er.su
nflow
xml.v
alida
tion
xml.tr
ansfo
rmse
rial
crypto
.aes
compre
ssmpe
gaud
ioav
erage
spee
dup
with
eig
ht c
ores
multi-thread modelmulti-process model
higher is faster
10
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0.0
0.2
0.4
0.6
0.8
1.0
1.2
L1I cache miss L2 instructionmiss
L1D cache miss L2 data miss DTLB missrela
tive
num
ber o
f eve
nts
for m
ulti-
thre
adm
odel
ove
r mul
ti-pr
oces
s m
odel
1 thread2 threads 3 threads 4 threads
Micro Architectural Statistics for SPECjbb2005
using increasing number of SMT threads (up to 4 threads)
multi-process model = 1.0
multi-thread m
odelis better
multi-process m
odelis better
11
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
L1I cache miss L2 instructionmiss
L1D cache miss L2 data miss DTLB miss
rela
tive
num
ber o
f eve
nts
for m
ulti-
thre
adm
odel
ove
r mul
ti-pr
oces
s m
odel 1 core
2 cores4 cores6 cores8 cores
Micro Architectural Statistics for SPECjbb2005
significant increase in DTLB misses for multi-thread modelwith increasing number of cores used.(7.4x on 8 cores of Niagara and 3.3x on 4 cores on Nehalem)
multi-process model = 1.0
multi-thread m
odelis better
multi-process m
odelis better
using increasing number of cores (up to 8 cores)
12
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Difference in Memory Access Patterns
core 0 core 1
JVM 0 JVM 1
core 2 core 3
JVM 2 JVM 3
Javaheap
Javaheap
Javaheap
Javaheap
256 MB 256 MB 256 MB 256 MB
core 0 core 1 core 2 core 3
JVM
Javaheap
1 GB = 256 MB x 4
☺ each core accesses only 256-MB heap☺ each memory page is accessed
from only 1 core
each core accesses 1-GB memory spaceeach memory page is accessed from 4 cores
multi-process (multi-JVM) modelmulti-thread (one-JVM) model
13
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Experimental Setup for a larger PHP workload
Benchmark– MediaWiki (wiki server used in Wikipedia)
PHPruntimes
mysqldclientemulator
lighttpd
client
x86 / Linux
application server
UltraSPARC T1 1.2 GHz / Solaris
databaseserver
FastCGIover Unix domain socket
x86 / Linux
14
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
PHP runtime configuration
each runtime instance handles independent requests no communication among PHP runtime instances
processsharing virtual memory space
PHP runtimeinstance 1
PHP runtimeinstance 2
PHP runtimeinstance 3
PHP runtimeinstance 4
PHP runtimeinstance 1
PHP runtimeinstance 2
PHP runtimeinstance 3
PHP runtimeinstance 4
process
HTTPserver
multi-process PHP runtime (default) multi-threaded PHP runtime
HTTPserver
15
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0
2
4
6
8
10
12
14
16
0 1 2 3 4number of SMT threads
thro
ughp
ut (t
rans
actio
ns/s
ec) .
multi-thread modelmulti-proces model
05
101520253035404550
0 1 2 3 4 5 6 7 8number of cores
thro
ughp
ut (t
rans
actio
ns/s
ec) .
multi-thread modelmulti-proces model
Core Scalability and SMT Scalability of MediaWiki
consistent with results for Java benchmarks
core scalabilitySMT scalability
higher is faster
multi-thread model was 2.5% slowermulti-thread model was 5.5% faster
16
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0.0
0.2
0.4
0.6
0.8
1.0
1.2
L1I cache miss L2 instructionmiss
L1D cache miss L2 data miss DTLB missrela
tive
num
ber o
f eve
nts
for m
ulti-
thre
adm
odel
ove
r mul
ti-pr
oces
s m
odel
1 thread2 threads 3 threads 4 threads
Micro Architectural Statistics for MediaWiki
multi-process model = 1.0
using increasing number of SMT threads (up to 4 threads)
multi-thread m
odelis better
multi-process m
odelis better
17
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
L1I cache miss L2 instructionmiss
L1D cache miss L2 data miss DTLB missrela
tive
num
ber o
f eve
nts
for m
ulti-
thre
adm
odel
ove
r mul
ti-pr
oces
s m
odel
1 core2 cores4 cores6 cores8 cores
Micro Architectural Statistics for MediaWiki
multi-process model = 1.0
using increasing number of cores (up to 8 cores)
multi-thread m
odelis better
multi-process m
odelis better
18
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Performance of MediaWiki using All SMT Threads
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
threadthreadthreadthread thread
threadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
☺ multi-thread model was 5.5% faster
☺ TLB misses were reduced by 60%
19
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Performance of MediaWiki using All SMT Threads
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
threadthreadthreadthread
☺ multi-thread model was 5.5% faster
☺ TLB misses were reduced by 60%
multi-thread model was only 1.7% faster
TLB misses were reduced by only 19%
20
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Our Technique: Core-aware Memory Allocation
core 0 core 1 core 2 core 3
multi-threaded PHP runtime
heap
21
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Our Technique: Core-aware Memory Allocation
core 0 core 1 core 2 core 3
multi-threaded PHP runtime
physical page size (4 MB)
22
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Our Technique: Core-aware Memory Allocation
core 0 core 1 core 2 core 3
multi-threaded PHP runtime
core 0 core 1 core 2 core 3
multi-threaded PHP runtime
physical page size (4 MB)physical page size (4 MB)
- avoid sharing the memory space among cores within a physical page
Core-aware Memory Allocation
23
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Performance of MediaWiki with Our Core-aware Malloc
Our core-aware allocator improved the performance of multi-thread model by 3.0% over the default allocator in libc
relative throughput of multi-thread model over multi-process model
0.96
0.98
1
1.02
1.04
1.06
1.08
0 1 2 3 4 5 6 7 8number of cores
rela
tive
thro
ughp
ut o
f mul
ti-th
read
mod
el .
over
mul
ti-pr
oces
s m
odel
default allocatorour core-aware allocator
higher is faster
multi-process model = 1.0
(non-zeroorigin)
3.0%
24
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
DTLB misses with Our Core-aware Malloc
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1 2 3 4 5 6 7 8number of cores
rela
tive
num
ber o
f DTL
B m
isse
s fo
r mul
ti-th
read
mod
el o
ver m
ulti-
proc
ess
mod
el
default allocatorour core-aware allocator
multi-process model = 1.0
shorter is faster
reduced DTLB missesby 46.7%
Our core-aware allocator reduced the DTLB misses for the multi-thread model by 46.7%
25
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Summary
The multi-thread model tends to generate fewer cache misses but more DTLB misses on multi-core processors
The increase in DTLB misses becomes more significant with increasing number of cores
Core-aware memory allocation can maximize the benefit of multi-thread processing by reducing DTLB misses
26
IBM Research - Tokyo
Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation
Our Answer to the Question
Threads vs. Processes: Which is better to achieve higher performance?
Multi-thread model has advantage over multi-process model, but memory allocator need to be enhanced