Memory Performance Profiling via Sampled Performance Monitor Event Traces

Diana Villa, Patricia J. Teller, and Jaime Acosta
The University of Texas at El Paso, Department of Computer Science

Trevor Morgan
Exxon/Mobil

Bret Olszewski
IBM Corporation-Austin

5th Annual IBM Austin CAS Conference – 20 February 2004


Outline

Motivation
Data
  Events Profiled
  Information Collected
Analysis Approach
Performance Evaluation Framework
Results
Conclusions and Future Work


Motivation

Overall research goal
  General workload characterization model

Project goal
  Develop a performance evaluation framework to facilitate analysis of large sampled event traces
  Study load access patterns of key applications
  Identify and remedy performance impediments


Data Collection Environment

IBM eserver p-Series 690 architecture
  8- and 32-processor configurations

TPC-C benchmark

Data collected via event trace sampling:
  Timestamp
  Effective instruction and data addresses
  CPU id
  Process id
  Thread id
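The fields listed above can be pictured as one record per sampled event. The following is a minimal sketch only: the class name `TraceSample`, the function `parse_sample`, and the whitespace-separated text layout are assumptions made for illustration, since the slides do not specify the on-disk trace format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSample:
    """One sampled event, holding the fields listed on the slide."""
    timestamp: int
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address
    cpu_id: int
    pid: int
    tid: int


def parse_sample(line: str) -> TraceSample:
    """Parse one record from a hypothetical text layout:
    timestamp instr_addr data_addr cpu_id pid tid (addresses in hex)."""
    ts, iaddr, daddr, cpu, pid, tid = line.split()
    return TraceSample(
        timestamp=int(ts),
        instr_addr=int(iaddr, 16),
        data_addr=int(daddr, 16),
        cpu_id=int(cpu),
        pid=int(pid),
        tid=int(tid),
    )


# Made-up record, for illustration only.
sample = parse_sample("1000 0x10000a3c 0x0700000001234abc 3 4242 17")
print(sample.cpu_id, hex(sample.data_addr))
```

Keeping addresses as integers makes later bit-level grouping (by segment, page, or cache line) straightforward.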


Platform - 1

[Figure: 8-processor p690 configuration, showing MCM 0 and MCM 1 with elements labeled P, X, L2, and L3.]


Platform - 2

[Figure: 32-processor p690 configuration, showing MCM 0 through MCM 3, each with elements labeled P, L2, and L3.]


Events

Resolution of L2-cache data-load misses
  L2.5
    L2.5 shared
    L2.5 modified
  L2.75
    L2.75 shared
    L2.75 modified
  L3
  L3.5


L2.5

[Figure: diagram of MCM 0 and MCM 1 (elements labeled P, X, L2, and L3) illustrating the L2.5 case.]

Penalty: 73 cycles


L2.75

[Figure: diagram of MCM 0 and MCM 1 (elements labeled P, X, L2, and L3) illustrating the L2.75 case.]

Penalty: 96 cycles


L3

[Figure: diagram of MCM 0 and MCM 1 (elements labeled P, X, L2, and L3) illustrating the L3 case.]

Penalty: 112 cycles


L3.5

[Figure: diagram of MCM 0 and MCM 1 (elements labeled P, X, L2, and L3) illustrating the L3.5 case.]

Penalty: 143 cycles
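To show how the per-level penalties could be combined with sampled event counts, here is a small sketch. The penalties are the ones stated on the preceding slides; the function names and the sample counts are hypothetical and are not measured data. The slides give one penalty per level, so the shared and modified variants are not distinguished here.

```python
# Penalties in cycles, as stated on the slides.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def estimated_penalty_cycles(sample_counts):
    """Sum count * penalty over the levels present in sample_counts."""
    return sum(PENALTY_CYCLES[level] * n for level, n in sample_counts.items())


def mean_penalty(sample_counts):
    """Average penalty per sampled load, or 0.0 if there are no samples."""
    total = sum(sample_counts.values())
    return estimated_penalty_cycles(sample_counts) / total if total else 0.0


# Hypothetical sample counts, for illustration only.
counts = {"L2.5": 10, "L2.75": 5, "L3": 20, "L3.5": 5}
print(estimated_penalty_cycles(counts), mean_penalty(counts))
```

Because the events are sampled, such totals are estimates of relative cost rather than exact cycle counts.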


Analysis

Identify application-specific sources of performance degradation associated with data references

[Figure: drill-down hierarchy used in the analysis. Level of Memory Hierarchy, then Address Space (regions including kernel, text, buffer pool, and data,bss,heap), then Segment, then Page, then Page Offset/Cache line.]
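The drill-down from segment to page to cache line can be sketched as bit-field extraction on an effective data address. The sizes below are assumptions, not values stated in the slides: 256 MB segments, 4 KB pages, and 128-byte cache lines. The segment and page sizes are at least consistent with the 9-hex-digit segment identifiers and the "Page [0-65536]" axis label that appear in the results.

```python
SEGMENT_BITS = 28  # assumed: 256 MB segments
PAGE_BITS = 12     # assumed: 4 KB pages
LINE_BITS = 7      # assumed: 128-byte cache lines


def decompose(addr):
    """Split an effective address into segment, page, and cache-line fields."""
    segment = addr >> SEGMENT_BITS
    seg_offset = addr & ((1 << SEGMENT_BITS) - 1)
    page = seg_offset >> PAGE_BITS                # page number within the segment
    page_offset = seg_offset & ((1 << PAGE_BITS) - 1)
    cache_line = page_offset >> LINE_BITS         # cache-line index within the page
    return {
        "segment": segment,
        "page": page,
        "page_offset": page_offset,
        "cache_line": cache_line,
    }


# Made-up address, for illustration only.
fields = decompose(0x0700000001234ABC)
print(f"{fields['segment']:09X}", fields["page"], fields["cache_line"])
```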


Performance Evaluation Framework

[Figure: framework data flow. The Data Collection Environment (p690, TPC-C) produces Sampled Event Traces with the fields PID, TID, Timestamp, Instr. Addr., and Data Addr. The Load DB Java Tool loads the traces into a Database, and the Report Generation Java Tool produces Reports and Graphs.]

Report excerpt (column headings not recoverable): 5 BufferPool 56893 293846 Data,BSS,Heap 8799 48551 Kernel 23485 9840

Sample graphs shown: "Distribution of L3 Data Load Hits" (fraction of data loads by address region; series Unique cache line and Hit %) and "Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment" (hit/cache line count by page; series Total loads and Unique cache line).
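The load-then-report flow can be illustrated with a small sketch. The slides describe Java tools and an unnamed database; the code below substitutes Python and SQLite purely for illustration. The table layout, the function names, and the 256 MB segment size are all assumptions.

```python
import sqlite3

SEGMENT_BITS = 28  # assumed segment size of 256 MB


def load_samples(conn, rows):
    """Insert (pid, tid, ts, instr_addr, data_addr) tuples into a samples table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr TEXT, data_addr TEXT, segment TEXT)"
    )
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
        [
            (pid, tid, ts, f"{ia:016X}", f"{da:016X}", f"{da >> SEGMENT_BITS:09X}")
            for pid, tid, ts, ia, da in rows
        ],
    )
    conn.commit()


def loads_per_segment(conn):
    """Report the number of sampled loads per segment, busiest first."""
    cur = conn.execute(
        "SELECT segment, COUNT(*) FROM samples "
        "GROUP BY segment ORDER BY COUNT(*) DESC, segment"
    )
    return cur.fetchall()


# Made-up samples, for illustration only.
rows = [
    (4242, 17, 1000, 0x10000A3C, 0x0700000001234ABC),
    (4242, 17, 1010, 0x10000A40, 0x0700000001234B00),
    (4243, 18, 1020, 0x10000B00, 0x0700000010000080),
]
conn = sqlite3.connect(":memory:")
load_samples(conn, rows)
report = loads_per_segment(conn)
print(report)
```

Addresses are stored as fixed-width hexadecimal text because 64-bit addresses with the high bit set would not fit SQLite's signed 64-bit integer type.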


Results

[Figure: "Resolution of L2 Data Load Misses". Bar chart of the fraction of loads satisfied (axis 0 to 0.6) for each event (L2.5 Shared, L2.5 Modified, L2.75 Shared, L2.75 Modified, L3, L3.5, Memory), with series for the 32-way and 8-way configurations.]


Results - Memory Regions

[Figure: "Distribution of Memory Data Load Hits". Bar chart (axis 0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, Ublock&KernelStack, M_BUF, KERN_HEAP), with series Unique cache line and Hit %.]


Results - L3 Cache

[Figure: "Distribution of L3 Data Load Hits". Bar chart (axis 0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, UBlockandKernelStack, M_BUF, KERN_HEAP), with series Unique cache line and Hit %.]


Results - Segment

[Figure: "Distribution of L3 Data Load Hits in Buffer Pool by Segment". Bar chart (axis 0 to 0.5) for segments 070000000, 070000001, 070000002, 070000009, and 07000039C, with series Unique cache line and Hit %.]


Results - Pages

[Figure: "Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment". Hit/cache line count (axis 0 to 400) plotted against Page [0-65536] (ticks 100 to 7600), with series Total loads and Unique cache line.]
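The two series in this graph, total loads and unique cache lines per page, could be derived from the sampled data addresses roughly as follows. This is a sketch under assumptions: 256 MB segments, 4 KB pages, and 128-byte cache lines are assumed sizes, the function name is hypothetical, and all addresses are assumed to belong to the same segment.

```python
from collections import defaultdict

SEGMENT_BITS, PAGE_BITS, LINE_BITS = 28, 12, 7  # assumed sizes


def page_profile(data_addrs):
    """Map page number -> (total loads, unique cache lines touched)."""
    totals = defaultdict(int)
    lines = defaultdict(set)
    for addr in data_addrs:
        seg_offset = addr & ((1 << SEGMENT_BITS) - 1)
        page = seg_offset >> PAGE_BITS
        totals[page] += 1
        lines[page].add(seg_offset >> LINE_BITS)  # line index within the segment
    return {page: (totals[page], len(lines[page])) for page in sorted(totals)}


# Made-up addresses within one segment, for illustration only.
addrs = [
    0x0700000001234ABC,
    0x0700000001234A80,  # same cache line as the address above
    0x0700000001234000,
    0x0700000001235010,
]
profile = page_profile(addrs)
print(profile)
```

A page whose total loads far exceed its unique cache-line count indicates repeated hits on a few lines rather than accesses spread across the page.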


Results - Cache Lines

[Figure: "Distribution of L3 Data Load Hits by Cache line". Cache line (axis 0 to 30) plotted against Time (s) (axis 0 to 600).]


Results - Instructions

Lock Operations       Atomic Operations
simple_lock           fetch_and_add
simple_lock_ppc       fetch_and_add_h
simple_unlock         fetch_and_addlp
disable_lock          fetch_and_or
unlock_enable         fetch_and_orlp
simple_unlock_mem     fetch_and_and
unlock_enable_mem     fetch_and_andlp
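One way to gauge how much of the sampled load activity falls in these routines is to classify each sample by the routine containing its instruction address. The sketch below assumes that the instruction addresses have already been resolved to routine names by some symbol lookup, which is not shown; the function names and the example input are hypothetical.

```python
from collections import Counter

# Routine names as listed on the slide.
LOCK_OPS = {
    "simple_lock", "simple_lock_ppc", "simple_unlock", "disable_lock",
    "unlock_enable", "simple_unlock_mem", "unlock_enable_mem",
}
ATOMIC_OPS = {
    "fetch_and_add", "fetch_and_add_h", "fetch_and_addlp", "fetch_and_or",
    "fetch_and_orlp", "fetch_and_and", "fetch_and_andlp",
}


def classify(routine):
    """Label a routine name as 'lock', 'atomic', or 'other'."""
    if routine in LOCK_OPS:
        return "lock"
    if routine in ATOMIC_OPS:
        return "atomic"
    return "other"


def share_by_class(routines):
    """Fraction of samples in each class; empty dict if there are no samples."""
    counts = Counter(classify(r) for r in routines)
    total = sum(counts.values())
    if not total:
        return {}
    return {name: counts[name] / total for name in ("lock", "atomic", "other")}


# Made-up routine names for four samples; "memcpy" stands in for any other routine.
shares = share_by_class(["simple_lock", "fetch_and_add", "memcpy", "memcpy"])
print(shares)
```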


Conclusions

Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  buffer pool
  data, bss, heap

TPC-C lock instructions are not key to performance degradation

The 8- and 32-processor data have the same reference pattern; thus, a model of TPC-C memory access may be possible


Future Work

Suggest ways to improve performance of applications executed on p690

Enhance performance evaluation framework

Quantify representativeness of sampled event traces

Expand study of application data load behavior
  Process characterization
  Process migration
  Other performance issues
    Compulsory vs. capacity/conflict misses
    False sharing
    Contention for resources

Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance


Questions?
