Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...

IBM Research - Tokyo

IISWC 2010 @ Atlanta © 2010 IBM Corporation

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors

Hiroshi Inoue and Toshio NakataniIBM Research – Tokyo

2


Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

An Old Question on New Platforms

Threads vs. Processes: Which is better to achieve higher performance?

– Each process has own virtual memory space

Using processes provides better inter-process isolation

– Threads in one process shares a virtual memory space

Multi-thread processing is better for performance due to its memory efficiency (smaller footprint)

Is this answer still valid on today’s processors with multiple cores and multiple SMT threads in a core?

3



Approach

Comparing multi-thread model and multi-process model on two types of hardware parallelism

– SMT scalability

– Core scalability

4



SMT Scalability and Core Scalability

core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7

threadthreadthreadthread







SMT scalability: performance improvement using increasing number of SMT threads in one core


5



SMT Scalability and Core Scalability


threadthreadthread thread

threadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

SMT scalability: performance improvement using increasing number of SMT threads in one core

Core scalability: performance improvement using increasing number of cores with one thread in each core

thread thread thread thread thread thread threadthread

6



Experimental Setup

Systems– Niagara system

• UltraSPARC T1 (Niagara 1) 1.2 GHz • 8 cores with 4 SMT threads in each core• Solaris 10

– Nehalem system• Xeon X5570 (Nehalem) 2.93 GHz• 4 cores with 2 SMT threads in each core• Red Hat Enterprise Linux 5.4

Software– Benchmarks: SPECjbb2005, SPECjvm2008 – 32-bit HotSpot Server VM for Java 6 Update 17– Java heap size: 256 MB per thread using large page

7



0

2000

4000

6000

8000

10000

12000

14000

0 1 2 3 4number of SMT threads

thro

ughp

ut (j

bb_o

ps).

multi-thread modelmulti-proces model

0

10000

20000

30000

40000

50000

60000

70000

0 1 2number of SMT threads

thro

ughp

ut (j

bb_o

ps) .


SMT Scalability of SPECjbb2005

multi-thread model was 9.2% faster multi-thread model was 5.5% faster

multi-thread model uses 1 JVM

multi-process model uses 4 JVMs

higher is faster

on Niagara on Nehalem

8



0

5000

10000

15000

20000

25000

30000

35000

40000

0 1 2 3 4 5 6 7 8number of cores

thro

ughp

ut (j

bb_o

ps).


0

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 1 2 3 4number of cores

thro

ughp

ut (j

bb_o

ps).


Core Scalability of SPECjbb2005

multi-thread model was 2.1% slower

higher is faster

on Niagara

multi-thread model was 3.4% faster

on Nehalem

9



Core Scalability and SMT Scalability on Niagara

multi-thread model was 9.6% faster on average

Core scalabilitySMT scalability

No performance advantage for multi-thread model

(please refer to the paper on results for Nehalem)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

SPECjbb20

05

compil

er.co

mpiler

compil

er.su

nflow

xml.v

alida

tion

xml.tr

ansfo

rmse

rial

crypto

.aes

compre

ssmpe

gaud

ioav

erage

spee

dup

with

four

thre

ads

multi-thread modelmulti-process model

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

SPECjbb20

05

compil

er.co

mpiler

compil

er.su

nflow

xml.v

alida

tion

xml.tr

ansfo

rmse

rial

crypto

.aes

compre

ssmpe

gaud

ioav

erage

spee

dup

with

eig

ht c

ores

multi-thread modelmulti-process model

higher is faster

10



0.0

0.2

0.4

0.6

0.8

1.0

1.2

L1I cache miss L2 instructionmiss

L1D cache miss L2 data miss DTLB missrela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 thread2 threads 3 threads 4 threads

Micro Architectural Statistics for SPECjbb2005

using increasing number of SMT threads (up to 4 threads)

multi-process model = 1.0

multi-thread m

odelis better

multi-process m

odelis better

11



0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0


L1D cache miss L2 data miss DTLB miss

rela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel 1 core

2 cores4 cores6 cores8 cores

Micro Architectural Statistics for SPECjbb2005

significant increase in DTLB misses for multi-thread modelwith increasing number of cores used.(7.4x on 8 cores of Niagara and 3.3x on 4 cores on Nehalem)


multi-thread m

odelis better

multi-process m

odelis better

using increasing number of cores (up to 8 cores)

12



Difference in Memory Access Patterns

core 0 core 1

JVM 0 JVM 1

core 2 core 3

JVM 2 JVM 3

Javaheap

Javaheap

Javaheap

Javaheap

256 MB 256 MB 256 MB 256 MB

core 0 core 1 core 2 core 3

JVM

Javaheap

1 GB = 256 MB x 4

☺ each core accesses only 256-MB heap☺ each memory page is accessed

from only 1 core

each core accesses 1-GB memory spaceeach memory page is accessed from 4 cores

multi-process (multi-JVM) modelmulti-thread (one-JVM) model

13



Experimental Setup for a larger PHP workload

Benchmark– MediaWiki (wiki server used in Wikipedia)

PHPruntimes

mysqldclientemulator

lighttpd

client

x86 / Linux

application server

UltraSPARC T1 1.2 GHz / Solaris

databaseserver

FastCGIover Unix domain socket

x86 / Linux

14



PHP runtime configuration

each runtime instance handles independent requests no communication among PHP runtime instances

processsharing virtual memory space

PHP runtimeinstance 1








process

HTTPserver

multi-process PHP runtime (default) multi-threaded PHP runtime

HTTPserver

15



0

2

4

6

8

10

12

14

16

0 1 2 3 4number of SMT threads

thro

ughp

ut (t

rans

actio

ns/s

ec) .


05

101520253035404550


thro

ughp

ut (t

rans

actio

ns/s

ec) .


Core Scalability and SMT Scalability of MediaWiki

consistent with results for Java benchmarks

core scalabilitySMT scalability

higher is faster

multi-thread model was 2.5% slowermulti-thread model was 5.5% faster

16



0.0

0.2

0.4

0.6

0.8

1.0

1.2



tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 thread2 threads 3 threads 4 threads

Micro Architectural Statistics for MediaWiki


using increasing number of SMT threads (up to 4 threads)

multi-thread m

odelis better

multi-process m

odelis better

17



0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4



tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 core2 cores4 cores6 cores8 cores

Micro Architectural Statistics for MediaWiki


using increasing number of cores (up to 8 cores)

multi-thread m

odelis better

multi-process m

odelis better

18



Performance of MediaWiki using All SMT Threads


threadthreadthreadthread thread

threadthreadthread







☺ multi-thread model was 5.5% faster

☺ TLB misses were reduced by 60%

19



Performance of MediaWiki using All SMT Threads










☺ multi-thread model was 5.5% faster

☺ TLB misses were reduced by 60%

multi-thread model was only 1.7% faster

TLB misses were reduced by only 19%

20



Our Technique: Core-aware Memory Allocation


multi-threaded PHP runtime

heap

21






physical page size (4 MB)

22








physical page size (4 MB)physical page size (4 MB)

- avoid sharing the memory space among cores within a physical page

Core-aware Memory Allocation

23



Performance of MediaWiki with Our Core-aware Malloc

Our core-aware allocator improved the performance of multi-thread model by 3.0% over the default allocator in libc

relative throughput of multi-thread model over multi-process model

0.96

0.98

1

1.02

1.04

1.06

1.08


rela

tive

thro

ughp

ut o

f mul

ti-th

read

mod

el .

over

mul

ti-pr

oces

s m

odel

default allocatorour core-aware allocator

higher is faster


(non-zeroorigin)

3.0%

24



DTLB misses with Our Core-aware Malloc

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1


rela

tive

num

ber o

f DTL

B m

isse

s fo

r mul

ti-th

read

mod

el o

ver m

ulti-

proc

ess

mod

el

default allocatorour core-aware allocator


shorter is faster

reduced DTLB missesby 46.7%

Our core-aware allocator reduced the DTLB misses for the multi-thread model by 46.7%

25



Summary

The multi-thread model tends to generate fewer cache misses but more DTLB misses on multi-core processors

The increase in DTLB misses becomes more significant with increasing number of cores

Core-aware memory allocation can maximize the benefit of multi-thread processing by reducing DTLB misses

26



Our Answer to the Question

Threads vs. Processes: Which is better to achieve higher performance?

Multi-thread model has advantage over multi-process model, but memory allocator need to be enhanced

Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...

Documents

Transcript of Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...