Memory Performance Profiling via Sampled Performance Monitor Event Traces
Diana Villa, Patricia J. Teller, and Jaime Acosta
The University of Texas at El Paso, Department of Computer Science
Trevor Morgan
Exxon/Mobil
Bret Olszewski
IBM Corporation-Austin
5th Annual IBM Austin CAS Conference – 20 February 2004
Outline
Motivation
Data
  Events Profiled
  Information Collected
Analysis Approach
Performance Evaluation Framework
Results
Conclusions and Future Work
Motivation
Overall research goal: General workload characterization model
Project goal: Develop a performance evaluation framework to facilitate analysis of large sampled event traces
  Study load access patterns of key applications
  Identify and remedy performance impediments
Data Collection Environment
IBM eServer pSeries 690 architecture, 8- and 32-processor configurations
TPC-C benchmark
Data collected via event trace sampling:
  Timestamp
  Effective instruction and data addresses
  CPU id
  Process id
  Thread id
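The fields collected per sample suggest a simple record type. The sketch below is illustrative only: the whitespace-separated text layout, the field order, and the names `TraceSample` and `parse_sample` are assumptions, not the trace format actually produced on the p690.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSample:
    """One sampled performance monitor event (hypothetical layout)."""
    timestamp: int    # time of the sample, in whatever units the trace uses
    instr_addr: int   # effective instruction address
    data_addr: int    # effective data address
    cpu_id: int
    pid: int          # process id
    tid: int          # thread id


def parse_sample(line: str) -> TraceSample:
    """Parse one whitespace-separated record; addresses are hexadecimal."""
    ts, iaddr, daddr, cpu, pid, tid = line.split()
    return TraceSample(
        timestamp=int(ts),
        instr_addr=int(iaddr, 16),
        data_addr=int(daddr, 16),
        cpu_id=int(cpu),
        pid=int(pid),
        tid=int(tid),
    )


# Made-up record, used only to exercise the parser.
sample = parse_sample("1024 0x10000a3c 0x0700000012345680 3 4242 17")
print(sample.cpu_id, hex(sample.data_addr))
```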
Platform - 1
[Figure: 8-processor p690 configuration. Two multi-chip modules (MCM 0 and MCM 1), showing processors (P), positions marked X, L2 caches, and L3 caches.]
Platform - 2
[Figure: 32-processor p690 configuration. Four multi-chip modules (MCM 0 to MCM 3), each showing eight processors (P), four L2 caches, and one L3 cache.]
Events
Resolution of L2-cache data-load misses:
  L2.5
    L2.5 shared
    L2.5 modified
  L2.75
    L2.75 shared
    L2.75 modified
  L3
  L3.5
L2.5
[Figure: L2.5 access shown on the 8-processor p690 diagram (MCM 0 and MCM 1).]
Penalty: 73 cycles
L2.75
[Figure: L2.75 access shown on the 8-processor p690 diagram (MCM 0 and MCM 1).]
Penalty: 96 cycles
L3
[Figure: L3 access shown on the 8-processor p690 diagram (MCM 0 and MCM 1).]
Penalty: 112 cycles
L3.5
[Figure: L3.5 access shown on the 8-processor p690 diagram (MCM 0 and MCM 1).]
Penalty: 143 cycles
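The penalties quoted for L2.5, L2.75, L3, and L3.5 can be combined with per-level sample counts to estimate an average penalty per L2 data-load miss. The sketch below shows only the arithmetic: the sample counts are hypothetical, the name `average_penalty` is an assumption, and loads satisfied from memory are left out because no memory penalty is given here.

```python
from typing import Mapping

# Penalties (in cycles) as quoted on the slides.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def average_penalty(counts: Mapping[str, int]) -> float:
    """Sample-weighted mean penalty over the levels listed in PENALTY_CYCLES."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    return sum(PENALTY_CYCLES[level] * n for level, n in counts.items()) / total


# Hypothetical sample counts, chosen only to illustrate the calculation.
example_counts = {"L2.5": 10, "L2.75": 20, "L3": 50, "L3.5": 20}
print(average_penalty(example_counts))
```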
Analysis
Identify application-specific sources of performance degradation associated with data references
[Figure: levels at which data references are analyzed, from coarsest to finest: Level of Memory Hierarchy; Address Space (kernel; text; buffer pool; data, bss, heap; ...); Segment; Page; Page Offset/Cache line.]
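Drilling down from segment to page to cache line amounts to slicing bit fields out of the effective data address. The sketch below assumes 256 MB segments, 4 KB pages, and 128-byte cache lines. These sizes are assumptions rather than values stated here, although 65,536 pages per segment matches the page range [0-65536] in the results, and the function name `decompose` is hypothetical.

```python
SEGMENT_SHIFT = 28   # assumed 256 MB segments
PAGE_SHIFT = 12      # assumed 4 KB pages
LINE_SHIFT = 7       # assumed 128-byte cache lines


def decompose(data_addr: int) -> tuple:
    """Split an address into (segment, page in segment, line in page, byte in line)."""
    segment = data_addr >> SEGMENT_SHIFT
    page = (data_addr >> PAGE_SHIFT) & ((1 << (SEGMENT_SHIFT - PAGE_SHIFT)) - 1)
    line = (data_addr >> LINE_SHIFT) & ((1 << (PAGE_SHIFT - LINE_SHIFT)) - 1)
    offset = data_addr & ((1 << LINE_SHIFT) - 1)
    return segment, page, line, offset


# Made-up address, used only to exercise the bit slicing.
seg, page, line, offset = decompose(0x0700000012345680)
print(f"segment={seg:09X} page={page} line={line} offset={offset}")
```

Under these assumptions a 64-bit address yields a nine-hex-digit segment identifier, the same width as the segment labels in the buffer pool results.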
Performance Evaluation Framework
[Figure: framework components. Data Collection Environment (p690, TPC-C); Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.); Load DB Java Tool; Database; Report Generation Java Tool; Reports; Graphs.]
Sample report excerpt: 5 BufferPool 56893 293846 Data,BSS,Heap 8799 48551 Kernel 23485 9840
Sample graphs: Distribution of L3 Data Load Hits; Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment
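The load-then-report flow of the framework can be imitated in a few lines. The sketch below is not the authors' Java tooling: it uses Python's standard `sqlite3` module, a hypothetical table layout, made-up samples, and an assumed 128-byte cache line. It only illustrates how a database makes per-region summaries a single query.

```python
import sqlite3

# Hypothetical samples: (pid, tid, timestamp, instr_addr, data_addr, region).
# The region label is assumed to have been derived from the data address already.
SAMPLES = [
    (4242, 17, 1000, 0x10000A3C, 0x0700000012345680, "BufferPool"),
    (4242, 17, 1010, 0x10000A40, 0x0700000012345684, "BufferPool"),
    (4242, 18, 1020, 0x10000B00, 0x0700000012346000, "BufferPool"),
    (4300, 21, 1030, 0x10000C10, 0x20001000, "Data,BSS,Heap"),
]


def load_db(conn, samples):
    """Create the samples table and bulk-insert the trace records."""
    conn.execute(
        "CREATE TABLE samples (pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER, region TEXT)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", samples)


def region_report(conn):
    """Per region: sampled loads and distinct cache lines (128-byte lines assumed)."""
    return conn.execute(
        "SELECT region, COUNT(*), COUNT(DISTINCT data_addr >> 7) "
        "FROM samples GROUP BY region ORDER BY COUNT(*) DESC"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_db(conn, SAMPLES)
for region, loads, lines in region_report(conn):
    print(region, loads, lines)
```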
Results
[Figure: Resolution of L2 Data Load Misses. Bar chart comparing the 32-way and 8-way configurations. Horizontal axis: fraction of loads satisfied (0 to 0.6). Vertical axis: events (L2.5 Shared, L2.5 Modified, L2.75 Shared, L2.75 Modified, L3, L3.5, Memory).]
Results - Memory Regions
[Figure: Distribution of Memory Data Load Hits. Horizontal axis: fraction of data loads (0 to 0.5). Vertical axis: address region (Kernel, Text, Data/BSS/Heap, BufferPool, Stack, Ublock & KernelStack, M_BUF, KERN_HEAP). Series: Unique cache line, Hit %.]
Results - L3 Cache
[Figure: Distribution of L3 Data Load Hits. Horizontal axis: fraction of data loads (0 to 0.5). Vertical axis: address region (Kernel, Text, Data/BSS/Heap, BufferPool, Stack, UBlock and KernelStack, M_BUF, KERN_HEAP). Series: Unique cache line, Hit %.]
Results - Segment
[Figure: Distribution of L3 Data Load Hits in Buffer Pool by Segment. Horizontal axis: fraction of data loads (0 to 0.5). Vertical axis: segment (070000000, 070000001, 070000002, 070000009, 07000039C). Series: Unique cache line, Hit %.]
Results - Pages
[Figure: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment. Horizontal axis: Page [0-65536], ticks from 100 to 7600. Vertical axis: Hit/Cache line count (0 to 400). Series: Total loads, Unique cache line.]
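The two series in the page-level chart, total loads and unique cache lines per page, can be derived from the sampled data addresses of one segment. The sketch below assumes 4 KB pages, 128-byte cache lines, and 65,536 pages per segment. The addresses and the name `page_histogram` are hypothetical.

```python
from collections import defaultdict

PAGE_SHIFT = 12              # assumed 4 KB pages
LINE_SHIFT = 7               # assumed 128-byte cache lines
PAGES_PER_SEGMENT = 1 << 16  # assumed 65,536 pages per segment


def page_histogram(data_addrs):
    """Map page number within the segment to (total loads, unique cache lines)."""
    loads_by_page = defaultdict(int)
    lines_by_page = defaultdict(set)
    for addr in data_addrs:
        page = (addr >> PAGE_SHIFT) % PAGES_PER_SEGMENT
        loads_by_page[page] += 1
        lines_by_page[page].add(addr >> LINE_SHIFT)
    return {p: (loads_by_page[p], len(lines_by_page[p])) for p in sorted(loads_by_page)}


# Hypothetical data addresses, all within one segment.
addrs = [
    0x0700000000064000,
    0x0700000000064010,  # same cache line as the first address
    0x0700000000064080,  # next cache line on the same page
    0x0700000000640000,  # a different page
]
print(page_histogram(addrs))
```

A page with many loads but few unique cache lines points to a small set of hot lines, which is the kind of contrast the two series are meant to expose.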
Results - Cache Lines
[Figure: Distribution of L3 Data Load Hits by Cache line. Horizontal axis: Time (s), 0 to 600. Vertical axis: Cache line (0 to 30).]
Results - Instructions
Lock Operations      Atomic Operations
simple_lock          fetch_and_add
simple_lock_ppc      fetch_and_add_h
simple_unlock        fetch_and_addlp
disable_lock         fetch_and_or
unlock_enable        fetch_and_orlp
simple_unlock_mem    fetch_and_and
unlock_enable_mem    fetch_and_andlp
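One way to measure how many sampled loads come from the lock routines listed above is to map each sampled instruction address to its enclosing routine and count the matches. The sketch below takes the routine names from the list. The symbol table, its addresses, and the helper names `routine_for` and `lock_fraction` are hypothetical.

```python
from bisect import bisect_right

LOCK_ROUTINES = {
    "simple_lock", "simple_lock_ppc", "simple_unlock", "disable_lock",
    "unlock_enable", "simple_unlock_mem", "unlock_enable_mem",
}

# Hypothetical symbol table: (start address, routine name), sorted by address.
SYMBOLS = [
    (0x1000, "simple_lock"),
    (0x1100, "simple_unlock"),
    (0x1200, "fetch_and_add"),
    (0x2000, "some_other_routine"),
]
STARTS = [start for start, _ in SYMBOLS]


def routine_for(instr_addr):
    """Return the routine with the greatest start address <= instr_addr, or None."""
    i = bisect_right(STARTS, instr_addr) - 1
    return SYMBOLS[i][1] if i >= 0 else None


def lock_fraction(instr_addrs):
    """Fraction of samples whose instruction address falls in a lock routine."""
    if not instr_addrs:
        return 0.0
    hits = sum(1 for a in instr_addrs if routine_for(a) in LOCK_ROUTINES)
    return hits / len(instr_addrs)


print(lock_fraction([0x1004, 0x1180, 0x1210, 0x2040]))
```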
Conclusions
Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  buffer pool
  data, bss, heap
TPC-C lock instructions are not key to performance degradation
8- and 32-processor data have the same reference pattern; thus, a model of TPC-C memory access may be possible
Future Work
Suggest ways to improve performance of applications executed on p690
Enhance performance evaluation framework
Quantify representativeness of sampled event traces
Expand study of application data load behavior
  Process characterization
  Process migration
  Other performance issues
    Compulsory vs. capacity/conflict misses
    False sharing
    Contention for resources
Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance
Questions?