Memory Performance Profiling via Sampled Performance Monitor Event Traces

Diana Villa, Patricia J. Teller, and Jaime Acosta
The University of Texas at El Paso, Department of Computer Science

Trevor Morgan
Exxon/Mobil

Bret Olszewski
IBM Corporation-Austin

5th Annual IBM Austin CAS Conference – 20 February 2004

Outline

Motivation
Data
    Events Profiled
    Information Collected
Analysis
    Approach
    Performance Evaluation Framework
Results
Conclusions and Future Work

Motivation

Overall research goal
    General workload characterization model
Project goal
    Develop a performance evaluation framework to facilitate analysis of large sampled event traces
    Study load access patterns of key applications
    Identify and remedy performance impediments

Data Collection Environment

IBM eServer pSeries 690 (p690) architecture
    8- and 32-processor configurations
TPC-C benchmark
Data collected via event trace sampling:
    Timestamp
    Effective instruction and data addresses
    CPU id
    Process id
    Thread id
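
As an illustration of the fields listed above, one sampled trace record could be represented as shown here. This is a minimal sketch only: the whitespace-separated text layout, the field order, the hexadecimal address encoding, and the names `TraceSample` and `parse_sample` are assumptions made for illustration, not the format produced by the actual collection tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSample:
    """One sampled performance monitor event (hypothetical layout)."""
    timestamp: int   # time of the sample, in arbitrary ticks
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address
    cpu_id: int
    pid: int
    tid: int


def parse_sample(line: str) -> TraceSample:
    """Parse 'timestamp instr_addr data_addr cpu pid tid' (addresses in hex)."""
    ts, iaddr, daddr, cpu, pid, tid = line.split()
    return TraceSample(
        timestamp=int(ts),
        instr_addr=int(iaddr, 16),
        data_addr=int(daddr, 16),
        cpu_id=int(cpu),
        pid=int(pid),
        tid=int(tid),
    )


# Hypothetical record, used only to exercise the parser.
sample = parse_sample("1000 0x10000abc 0x0700000012345680 3 4242 77")
```

Keeping the addresses as integers makes the later per-segment, per-page, and per-cache-line grouping a matter of shifts and masks.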

Platform - 1

[Figure: 8-processor p690 configuration. The diagram shows MCM 0 and MCM 1 with processors (P), positions marked X, L2 caches, and L3 caches.]

Platform - 2

[Figure: 32-processor p690 configuration. The diagram shows MCM 0 through MCM 3, each with processors (P), L2 caches, and an L3 cache.]

Events

Resolution of L2-cache data-load misses
    L2.5
        L2.5 shared
        L2.5 modified
    L2.75
        L2.75 shared
        L2.75 modified
    L3
    L3.5

L2.5

Penalty: 73 cycles

[Figure: 8-processor p690 diagram (MCM 0 and MCM 1) illustrating an L2.5 data load.]

L2.75

Penalty: 96 cycles

[Figure: 8-processor p690 diagram (MCM 0 and MCM 1) illustrating an L2.75 data load.]

L3

Penalty: 112 cycles

[Figure: 8-processor p690 diagram (MCM 0 and MCM 1) illustrating an L3 data load.]

L3.5

Penalty: 143 cycles

[Figure: 8-processor p690 diagram (MCM 0 and MCM 1) illustrating an L3.5 data load.]
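
The penalties quoted for L2.5, L2.75, L3, and L3.5 loads can be combined with per-source sample counts to estimate a mean penalty for L2 data-load misses. The sketch below is illustrative only: the function name `mean_penalty` and the sample counts are hypothetical, and loads satisfied from memory are left out because no memory penalty is given on the slides.

```python
from typing import Mapping

# Penalties (cycles) as quoted on the slides for each data source.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def mean_penalty(sample_counts: Mapping[str, int]) -> float:
    """Sample-weighted mean penalty over the sources with a known penalty."""
    total = sum(sample_counts.get(src, 0) for src in PENALTY_CYCLES)
    if total == 0:
        raise ValueError("no samples for any source with a known penalty")
    weighted = sum(
        cycles * sample_counts.get(src, 0)
        for src, cycles in PENALTY_CYCLES.items()
    )
    return weighted / total


# Hypothetical sample counts, chosen only to exercise the function.
counts = {"L2.5": 100, "L2.75": 100, "L3": 100, "L3.5": 100}
average = mean_penalty(counts)
```

With real counts, a shift of loads from L2.5 toward L3.5 shows up directly as a higher mean penalty.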

Analysis

Identify application-specific sources of performance degradation associated with data references

[Figure: drill-down used in the analysis: level of memory hierarchy, then address space region (kernel, text, buffer pool, data/bss/heap, ...), then segment, then page, then page offset/cache line.]
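
The drill-down from address space to segment, page, and cache line amounts to slicing each sampled effective address into bit fields. The following sketch assumes 256 MB segments and 4 KB pages (which is consistent with the "Page [0-65536]" axis used later in the results, since 2^28 / 2^12 = 65536) and a 128-byte cache line; these sizes and the name `decompose` are assumptions, not values stated on the slides.

```python
SEGMENT_SHIFT = 28  # assumed: 256 MB segments
PAGE_SHIFT = 12     # assumed: 4 KB pages
LINE_SHIFT = 7      # assumed: 128-byte cache lines


def decompose(addr: int) -> dict:
    """Split an effective address into segment, page, and cache line indices."""
    seg_offset = addr & ((1 << SEGMENT_SHIFT) - 1)
    page_offset = seg_offset & ((1 << PAGE_SHIFT) - 1)
    return {
        "segment": addr >> SEGMENT_SHIFT,              # segment identifier
        "page": seg_offset >> PAGE_SHIFT,              # page within the segment
        "line": page_offset >> LINE_SHIFT,             # cache line within the page
        "byte": page_offset & ((1 << LINE_SHIFT) - 1), # byte within the line
    }


# Hypothetical data address, used only to exercise the function.
parts = decompose(0x0700000012345680)
```

Grouping samples by the `segment`, `page`, and `line` keys in turn gives the successively finer views listed in the figure.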

Performance Evaluation Framework

[Figure: framework pipeline. Data Collection Environment (p690, TPC-C) produces Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.); a Load DB Java Tool loads them into a Database; a Report Generation Java Tool produces Reports and Graphs.]

Sample report excerpt: 5 BufferPool 56893 293846 Data,BSS,Heap 8799 48551 Kernel 23485 9840

Sample graphs: "Distribution of L3 Data Load Hits" (by address region) and "Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment"
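
The load-then-report flow of the framework can be sketched in a few lines. The tools named on the slide are written in Java; this sketch uses Python with the standard `sqlite3` module purely to illustrate the idea. The table schema, the cache line size, the pre-assigned region labels, and the sample rows are all hypothetical.

```python
import sqlite3

LINE_SIZE = 128  # assumed cache line size in bytes

# Hypothetical samples: (pid, tid, timestamp, instr_addr, data_addr, region).
# The region label is assumed to have been assigned before loading.
SAMPLES = [
    (1, 10, 100, 0x1000, 0x0700000000001000, "BufferPool"),
    (1, 10, 105, 0x1004, 0x0700000000001040, "BufferPool"),
    (1, 11, 110, 0x1008, 0x0700000000002000, "BufferPool"),
    (2, 20, 120, 0x2000, 0x0000000020000100, "Data,BSS,Heap"),
]


def load_db(conn, samples):
    """Create the sample table and insert one row per sampled event."""
    conn.execute(
        "CREATE TABLE sample (pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr TEXT, data_addr TEXT, cache_line INTEGER, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            # Addresses are stored as hex text; the cache line number is
            # stored as an integer so it can be counted with DISTINCT.
            (pid, tid, ts, hex(ia), hex(da), da // LINE_SIZE, region)
            for pid, tid, ts, ia, da, region in samples
        ],
    )


def region_report(conn):
    """Return (region, sampled loads, unique cache lines) rows."""
    return conn.execute(
        "SELECT region, COUNT(*), COUNT(DISTINCT cache_line) "
        "FROM sample GROUP BY region ORDER BY COUNT(*) DESC"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_db(conn, SAMPLES)
report = region_report(conn)
```

Keeping the samples in a database means each new view (by segment, by page, by cache line) is another query rather than another pass over the raw trace.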

Results

[Figure: "Resolution of L2 Data Load Misses". Bar chart of the fraction of loads satisfied (axis 0 to 0.6) for each event (L2.5 Shared, L2.5 Modified, L2.75 Shared, L2.75 Modified, L3, L3.5, Memory), with one series for the 32-way and one for the 8-way configuration.]

Results - Memory Regions

[Figure: "Distribution of Memory Data Load Hits". Bar chart by address region (Kernel, Text, Data/BSS/Heap, BufferPool, Stack, Ublock & KernelStack, M_BUF, KERN_HEAP); axis: fraction of data loads (0 to 0.5); series: Unique cache line, Hit %.]

Results - L3 Cache

[Figure: "Distribution of L3 Data Load Hits". Bar chart by address region (Kernel, Text, Data/BSS/Heap, BufferPool, Stack, Ublock & KernelStack, M_BUF, KERN_HEAP); axis: fraction of data loads (0 to 0.5); series: Unique cache line, Hit %.]

Results - Segment

[Figure: "Distribution of L3 Data Load Hits in Buffer Pool by Segment". Bar chart by segment (070000000, 070000001, 070000002, 070000009, 07000039C); axis: fraction of data loads (0 to 0.5); series: Unique cache line, Hit %.]

Results - Pages

[Figure: "Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment". x-axis: Page [0-65536] (ticks shown from 100 to 7600); y-axis: Hit/Cache line count (0 to 400); series: Total loads, Unique cache line.]

Results - Cache Lines

[Figure: "Distribution of L3 Data Load Hits by Cache line". x-axis: Time (s), 0 to 600; y-axis: Cache line, 0 to 30.]

Results - Instructions

Lock Operations        Atomic Operations
simple_lock            fetch_and_add
simple_lock_ppc        fetch_and_add_h
simple_unlock          fetch_and_addlp
disable_lock           fetch_and_or
unlock_enable          fetch_and_orlp
simple_unlock_mem      fetch_and_and
unlock_enable_mem      fetch_and_andlp
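
One way to attribute sampled loads to routines such as those listed above is to look up each sampled instruction address in a sorted symbol table. The sketch below is an assumption about how such attribution could be done, not a description of the authors' tool: the symbol addresses, the routine `other_routine`, and the function names are hypothetical.

```python
from bisect import bisect_right

# Lock routine names taken from the table above.
LOCK_ROUTINES = {
    "simple_lock", "simple_lock_ppc", "simple_unlock", "disable_lock",
    "unlock_enable", "simple_unlock_mem", "unlock_enable_mem",
}

# Hypothetical symbol table: (start address, routine name), sorted by address.
SYMBOLS = [
    (0x1000, "simple_lock"),
    (0x1100, "simple_unlock"),
    (0x1200, "fetch_and_add"),
    (0x1300, "other_routine"),
]
STARTS = [start for start, _ in SYMBOLS]


def routine_for(instr_addr):
    """Return the routine with the closest start address at or below instr_addr."""
    i = bisect_right(STARTS, instr_addr) - 1
    return SYMBOLS[i][1] if i >= 0 else None


def lock_fraction(instr_addrs):
    """Fraction of sampled instruction addresses that fall in lock routines."""
    if not instr_addrs:
        return 0.0
    hits = sum(1 for a in instr_addrs if routine_for(a) in LOCK_ROUTINES)
    return hits / len(instr_addrs)


# Hypothetical sampled instruction addresses.
fraction = lock_fraction([0x1004, 0x1104, 0x1204, 0x1304])
```

A small fraction for the lock routines would be in line with the observation that lock instructions are not the main source of degradation.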

Conclusions

Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
    buffer pool
    data, bss, heap

TPC-C lock instructions are not key to performance degradation

The 8- and 32-processor data show the same reference pattern; thus, a model of TPC-C memory access may be possible

Future Work

Suggest ways to improve performance of applications executed on p690

Enhance performance evaluation framework

Quantify representativeness of sampled event traces

Expand study of application data load behavior
    Process characterization
    Process migration
    Other performance issues
        Compulsory vs. capacity/conflict misses
        False sharing
        Contention for resources

Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance

Questions?