PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf ·...
![Page 1: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/1.jpg)
UNDERSTANDING PROCESSOR CACHE EFFECTS WITH VALGRIND & VTUNE
Chester Rebeiro
Embedded Lab
IIT Kharagpur
![Page 2: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/2.jpg)
Is Time Proportional to Iterations?
SIZE = 64 MBytes; unsigned int A[SIZE];
Iterations A: for(i=0; i<SIZE; i+=1) A[i] *= 3;
Iterations B: for(i=0; i<SIZE; i+=16) A[i] *= 3;
Is Time(A) / Time(B) = 16 ?
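As a rough sketch of this experiment, the two loops can be folded into one function that takes the stride as a parameter. The function name `scale_with_stride` and the returned update count are assumptions added for illustration; the slide gives only the loop bodies. Timing each call (for example with `clock()`) is left out, since the result depends on the machine.

```c
#include <stddef.h>

/* Multiply every stride-th element of a[0..n) by 3.
 * stride = 1 corresponds to "Iterations A", stride = 16 to "Iterations B".
 * Returns the number of elements updated, so a caller can confirm that
 * loop B does 1/16th of the arithmetic work of loop A. */
size_t scale_with_stride(unsigned int *a, size_t n, size_t stride)
{
    size_t updates = 0;
    if (stride == 0)
        return 0;                 /* avoid an infinite loop */
    for (size_t i = 0; i < n; i += stride) {
        a[i] *= 3;
        updates++;
    }
    return updates;
}
```

Calling it once with stride 1 and once with stride 16 on the same large array, and timing both calls, reproduces the comparison posed on this slide.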
![Page 3: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/3.jpg)
Is Time Proportional to Iterations?
Not really! We get Time(A)/Time(B) = 3!
Straightforward pencil-and-paper analysis will not suffice.
A deeper understanding is needed.
For this we use profiling tools.
![Page 4: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/4.jpg)
Tools for Profiling Software
Static Program Modification: automatic insertion of code to record performance attributes at run time.
Examples: QPT (Quick Program Profiling and Tracing) for MIPS and SPARC systems, Gprof, ATOM.
Hardware Counters: require support from the processor for hardware performance monitoring.
Examples: VTune (commercial, Intel), oprofile, perfmon.
Simulators: simulate the behavior of the platform.
Examples: Valgrind (x86 simulation), SimpleScalar.
![Page 5: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/5.jpg)
Valgrind
Open source: http://valgrind.org/
Valgrind is an instrumentation framework for building dynamic analysis tools. There are tools for:
Memory checking: detects memory management problems such as use of uninitialized data, memory leaks, overlapping memcpy() calls, etc.
Cachegrind: a cache profiler.
Callgrind: extends Cachegrind and additionally provides information about call graphs.
Massif: a heap profiler.
Helgrind: useful for multi-threaded programs.
![Page 6: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/6.jpg)
Cachegrind
Pinpoints the sources of cache misses in the code.
Can simulate the I1, D1, and L2 cache memories.
On modern processors:
An L1 cache miss costs around 10 clock cycles.
An L2 cache miss can cost as much as 200 clock cycles.
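To make these costs concrete, a back-of-the-envelope estimate of the cycles lost to misses can be computed from miss counts. The function name and the fixed penalties below are assumptions taken from the approximate figures on this slide; real penalties vary from processor to processor.

```c
#include <stdint.h>

/* Assumed penalties, taken from the approximate figures quoted above. */
enum { L1_MISS_CYCLES = 10, L2_MISS_CYCLES = 200 };

/* Rough estimate of cycles lost to cache misses.
 * l1_misses: accesses that missed L1 but were served by L2.
 * l2_misses: accesses that also missed L2 and went to main memory. */
uint64_t estimated_miss_cycles(uint64_t l1_misses, uint64_t l2_misses)
{
    return l1_misses * L1_MISS_CYCLES + l2_misses * L2_MISS_CYCLES;
}
```

Under these assumptions, 1000 L1 misses and 10 L2 misses cost 1000*10 + 10*200 = 12000 cycles, so a handful of L2 misses can outweigh a much larger number of L1 misses.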
![Page 7: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/7.jpg)
Iteration Example Revisited with Cachegrind
SIZE = 64 MBytes; unsigned int A[SIZE];
Iterations A: for(i=0; i<SIZE; i+=1) A[i] *= 3;
Iterations B: for(i=0; i<SIZE; i+=16) A[i] *= 3;
Is the ratio of Time(A) / Time(B) = 16 ?
![Page 8: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/8.jpg)
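Running Cachegrind
[Figure: the Cachegrind command line, with callouts for the tool, the output file name, and the executable]
[Figure: console output, with callouts for the number of instructions and the number of misses in I1]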
![Page 9: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/9.jpg)
Output of Cachegrind (cg1.out)
[Figure: annotated contents of cg1.out. Callouts label the number of instructions, instructions missing the L1 cache, instructions missing the L2 cache, data reads, data reads missing L1, data reads missing L2, all data writes, data writes missing L1, and data writes missing L2]
![Page 10: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/10.jpg)
cg_annotate
![Page 11: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/11.jpg)
Effects of Cache Line
An unsigned int takes 4 bytes.
A data cache line is 64 bytes.
So every 16th element falls in a new cache line and results in a cache miss.
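The arithmetic behind this can be checked with a small sketch that counts how many distinct cache lines a strided loop touches. The function name `lines_touched` and the assumption that the array starts on a cache-line boundary are illustrative and not part of the slides.

```c
#include <stddef.h>

/* Count the distinct cache lines touched when visiting every stride-th
 * element of an n-element array, assuming the array starts on a
 * cache-line boundary. elem_size and line_size are in bytes. */
size_t lines_touched(size_t n, size_t stride, size_t elem_size, size_t line_size)
{
    size_t count = 0;
    size_t last_line = (size_t)-1;    /* sentinel: no line seen yet */
    if (stride == 0 || line_size == 0)
        return 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t line = (i * elem_size) / line_size;
        if (line != last_line) {      /* indices only grow, so lines never repeat */
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With 4-byte elements and 64-byte lines, stride 1 and stride 16 touch the same number of cache lines, which suggests why the two loops are much closer in running time than the 16x difference in iteration count would predict.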
![Page 12: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/12.jpg)
Direct Mapped Cache
Consider a direct-mapped cache with
1024 bytes
32-byte cache line
Number of cache lines = 1024/32 = 32
Assume the memory address is 32 bits wide
tag (22 bits) | line (5 bits) | offset (5 bits)
For example: Address = 0x12345678
Offset: (11000)2
Line: (10011)2
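The field split can be written as a few shifts and masks. This is a minimal sketch for the exact cache on this slide (1024 bytes, 32-byte lines, direct mapped); the type and function names are made up for illustration.

```c
#include <stdint.h>

/* Field widths for a 1024-byte direct-mapped cache with 32-byte lines. */
enum { OFFSET_BITS = 5, LINE_BITS = 5 };

typedef struct {
    uint32_t tag;     /* upper 22 bits */
    uint32_t line;    /* 5-bit cache line index */
    uint32_t offset;  /* 5-bit byte offset within the line */
} dm_fields;

/* Split a 32-bit address into tag, line index and byte offset. */
dm_fields dm_split(uint32_t addr)
{
    dm_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1);
    f.line   = (addr >> OFFSET_BITS) & ((1u << LINE_BITS) - 1);
    f.tag    = addr >> (OFFSET_BITS + LINE_BITS);
    return f;
}
```

For 0x12345678 this yields offset 0x18 (11000 in binary) and line 0x13 (10011 in binary), matching the values above; the remaining 22 bits form the tag 0x48D15.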
![Page 13: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/13.jpg)
Direct Mapped Cache
![Page 14: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/14.jpg)
Cachegrind Results for Direct Mapped
[Figure: Cachegrind results showing thrashing in cache memories, with callouts at A[31][0] and A[32][0]]
![Page 15: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/15.jpg)
Set Associative Cache
Consider a set-associative cache with
1024 bytes, 32-byte cache line
2-way set-associative
Number of cache lines = 1024/32 = 32 (5 bits)
Number of sets = 32/2 = 16 (4 bits)
Assume the memory address is 32 bits wide
tag (23 bits) | set (4 bits) | offset (5 bits)
For example: Address = 0x12345678
Offset: (11000)2
Set: (0011)2
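The same decomposition can be derived from the cache parameters instead of hard-coding the field widths. The sketch below assumes that the cache size, line size and associativity are all powers of two; the names `sa_fields` and `sa_split` are hypothetical.

```c
#include <stdint.h>

/* Integer log2 for exact powers of two (e.g. 32 -> 5). */
static unsigned log2_u32(uint32_t x)
{
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

typedef struct {
    uint32_t tag, set, offset;
    unsigned tag_bits, set_bits, offset_bits;
} sa_fields;

/* Split a 32-bit address for a cache of cache_bytes total size,
 * line_bytes per line and `ways` lines per set. All three are assumed
 * to be powers of two, with ways >= 1. */
sa_fields sa_split(uint32_t addr, uint32_t cache_bytes,
                   uint32_t line_bytes, uint32_t ways)
{
    sa_fields f;
    uint32_t sets = cache_bytes / line_bytes / ways;
    f.offset_bits = log2_u32(line_bytes);
    f.set_bits    = log2_u32(sets);
    f.tag_bits    = 32 - f.set_bits - f.offset_bits;
    f.offset = addr & (line_bytes - 1);
    f.set    = (addr >> f.offset_bits) & (sets - 1);
    f.tag    = addr >> (f.offset_bits + f.set_bits);
    return f;
}
```

With 1024 bytes, 32-byte lines and 2 ways, address 0x12345678 gives a 23-bit tag, set 0x3 and offset 0x18. Doubling the associativity halves the number of sets, so one bit moves from the set index into the tag.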
![Page 16: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/16.jpg)
2-way Cache Prevents Thrashing
Direct Mapped
2-way set associative
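The effect can be reproduced with a toy cache model. The sketch below is not the program measured in the slides: the access pattern (two addresses exactly 1024 bytes apart, read alternately), the LRU replacement policy, and all names are assumptions chosen to show the mechanism.

```c
#include <stdint.h>
#include <string.h>

#define MAX_LINES 64

/* Tiny cache model: tags only, LRU replacement within each set. */
typedef struct {
    uint32_t line_bytes, sets, ways;
    uint32_t tag[MAX_LINES];
    uint8_t  valid[MAX_LINES];
    uint32_t last_used[MAX_LINES];   /* timestamp of the most recent access */
    uint32_t clock;
} cache_model;

void cache_init(cache_model *c, uint32_t cache_bytes,
                uint32_t line_bytes, uint32_t ways)
{
    memset(c, 0, sizeof *c);
    c->line_bytes = line_bytes;
    c->ways = ways;
    c->sets = cache_bytes / line_bytes / ways;  /* sets * ways must be <= MAX_LINES */
}

/* Returns 1 on a miss and 0 on a hit, updating the model either way. */
int cache_access(cache_model *c, uint32_t addr)
{
    uint32_t block  = addr / c->line_bytes;
    uint32_t set    = block % c->sets;
    uint32_t tag    = block / c->sets;
    uint32_t base   = set * c->ways;
    uint32_t victim = base;

    c->clock++;
    for (uint32_t w = base; w < base + c->ways; w++) {
        if (c->valid[w] && c->tag[w] == tag) {
            c->last_used[w] = c->clock;
            return 0;
        }
        /* Prefer an empty way, otherwise the least recently used one. */
        if (!c->valid[w] ||
            (c->valid[victim] && c->last_used[w] < c->last_used[victim]))
            victim = w;
    }
    c->valid[victim] = 1;
    c->tag[victim] = tag;
    c->last_used[victim] = c->clock;
    return 1;
}

/* Alternate between two addresses 1024 bytes apart, which map to the same
 * line (or set) of a 1024-byte cache, and count the misses. */
unsigned thrash_misses(uint32_t ways, unsigned rounds)
{
    cache_model c;
    unsigned misses = 0;
    cache_init(&c, 1024, 32, ways);
    for (unsigned r = 0; r < rounds; r++) {
        misses += (unsigned)cache_access(&c, 0x0000);
        misses += (unsigned)cache_access(&c, 0x0400);
    }
    return misses;
}
```

In this model, `thrash_misses(1, 100)` misses on all 200 accesses because the two blocks keep evicting each other, while `thrash_misses(2, 100)` misses only twice because both blocks fit in the same set.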
![Page 17: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/17.jpg)
Traversal for Large Matrices
ROW MAJOR: Miss Rate/Iteration: 8/B
COLUMN MAJOR: Miss Rate/Iteration: 1
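The two traversal orders look like this in code. The sketch assumes a square matrix of `double` stored in row-major order; reading the 8/B figure as 8-byte elements on B-byte cache lines is likewise an assumption, since the slide does not define B.

```c
#include <stddef.h>

/* Sum an n x n row-major matrix by walking along rows:
 * consecutive accesses are adjacent in memory. */
double sum_row_major(const double *m, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += m[i * n + j];
    return s;
}

/* Same sum, walking down columns: consecutive accesses are
 * n * sizeof(double) bytes apart. */
double sum_col_major(const double *m, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += m[i * n + j];
    return s;
}
```

Both functions do the same arithmetic; only the order of memory accesses differs. For a matrix much larger than the cache, the column-order walk lands on a different cache line at almost every step.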
![Page 18: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/18.jpg)
Matrix Multiplication Example
We need to multiply C = A*B
Matrix A is accessed in row major
Matrix B is accessed in column major
![Page 19: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/19.jpg)
Analysis of Matrix Multiplication
Huge miss rate because B is accessed in column-major fashion.
So, each access to B results in a cache miss.
A solution is to find B transpose; then only row-major traversal is required.
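One way this idea might look in code is sketched below: B is first copied into a transposed buffer, and the inner product then reads both operands with unit stride. The function name, the row-major `double` layout and the use of a heap-allocated temporary are assumptions, not details taken from the slides.

```c
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for n x n row-major matrices, using a temporary transpose of B
 * so that the inner loop reads both operands with unit stride.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int matmul_transposed(const double *a, const double *b, double *c, size_t n)
{
    if (n == 0)
        return 0;
    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;

    /* Naive transpose: bt[j][i] = b[i][j]. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];

    /* Row i of A times row j of B-transpose gives C[i][j]. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
    free(bt);
    return 0;
}
```

The transpose costs one extra pass over B, which is amortized over the n passes the multiplication makes over it.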
![Page 20: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/20.jpg)
Matrix Multiplication (Naïve Transpose)
Reduction in the number of misses by almost 98%
![Page 21: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/21.jpg)
A Better Transpose
Partition the matrix into tiles.
Tile: each sub-matrix Ar,s is known as a tile.
[Figure: matrix A with tiles Ar,s and As,r, alongside the cache memory]
![Page 22: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/22.jpg)
A Better Transpose (load)
[Figure: tiles Ar,s and As,r of A are loaded into cache memory]
![Page 23: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/23.jpg)
A Better Transpose (transpose)
[Figure: the tiles held in cache memory are transposed, shown as (As,r)T]
![Page 24: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/24.jpg)
A Better Transpose (transfer)
[Figure: the transposed tiles, shown as (As,r)T, are transferred from cache memory back into A]
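The load, transpose and transfer steps can be approximated in code by a tiled in-place transpose. The sketch below simplifies the scheme in the figures: rather than copying tiles into a separate buffer, it swaps elements of tile (r,s) with those of tile (s,r) directly and relies on the hardware cache to keep both tiles resident. The function name and the `tile` parameter are illustrative.

```c
#include <stddef.h>

/* In-place transpose of an n x n row-major matrix, processed tile by tile.
 * For each pair of tiles (r,s) and (s,r) with s >= r, elements are swapped
 * across the diagonal, so the working set is two tiles at a time. */
void transpose_tiled(double *a, size_t n, size_t tile)
{
    if (tile == 0)
        return;
    for (size_t r = 0; r < n; r += tile) {
        for (size_t s = r; s < n; s += tile) {
            size_t r_end = (r + tile < n) ? r + tile : n;
            size_t s_end = (s + tile < n) ? s + tile : n;
            for (size_t i = r; i < r_end; i++) {
                /* On diagonal tiles, start past the diagonal so that
                 * each pair of elements is swapped only once. */
                size_t j0 = (s == r) ? i + 1 : s;
                for (size_t j = j0; j < s_end; j++) {
                    double tmp = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = tmp;
                }
            }
        }
    }
}
```

The tile size should be chosen so that two tiles fit in the cache at once; the matrix size does not need to be a multiple of it.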
![Page 25: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/25.jpg)
Cache Oblivious Algorithms
An algorithm designed to take advantage of a CPU cache without explicit knowledge of the cache parameters.
New branch of algorithm design.
Optimal cache-oblivious algorithms are known for:
Cooley-Tukey FFT algorithm
Matrix multiplication
Sorting
Matrix transposition
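As an illustration of the idea for matrix transposition, the sketch below recursively halves the longer dimension until the sub-block is small, so at some depth the blocks fit in the cache without the code ever being told the cache size. It is a simplified out-of-place variant written for this note; the names and the base-case size of 4 (a constant that only limits recursion overhead) are assumptions.

```c
#include <stddef.h>

/* Transpose the sub-block rows [r0, r1) x cols [c0, c1) of src
 * (R x C, row-major) into dst (C x R, row-major). */
static void co_transpose_rec(const double *src, double *dst,
                             size_t R, size_t C,
                             size_t r0, size_t r1, size_t c0, size_t c1)
{
    size_t dr = r1 - r0, dc = c1 - c0;
    if (dr <= 4 && dc <= 4) {
        /* Small base case: copy element by element. */
        for (size_t i = r0; i < r1; i++)
            for (size_t j = c0; j < c1; j++)
                dst[j * R + i] = src[i * C + j];
    } else if (dr >= dc) {
        /* Split the longer dimension in half and recurse. */
        size_t rm = r0 + dr / 2;
        co_transpose_rec(src, dst, R, C, r0, rm, c0, c1);
        co_transpose_rec(src, dst, R, C, rm, r1, c0, c1);
    } else {
        size_t cm = c0 + dc / 2;
        co_transpose_rec(src, dst, R, C, r0, r1, c0, cm);
        co_transpose_rec(src, dst, R, C, r0, r1, cm, c1);
    }
}

/* dst (cols x rows) receives the transpose of src (rows x cols). */
void co_transpose(const double *src, double *dst, size_t rows, size_t cols)
{
    co_transpose_rec(src, dst, rows, cols, 0, rows, 0, cols);
}
```

Unlike an explicitly tiled transpose, there is no tile-size parameter to tune for a particular machine.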
![Page 26: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/26.jpg)
Summary for Cachegrind
Easy-to-use tool to analyze cache memory behavior for various configurations.
Slow: around 20x to 100x slower than normal.
What you simulate is not what you may get!
What is needed is a way to analyze software at run time.
![Page 27: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/27.jpg)
Related vs Unrelated Memory Accesses
Related Data Accesses | Unrelated Data Accesses
Time(Related Data Accesses) = 5 x Time(Unrelated Data Accesses)
![Page 28: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/28.jpg)
VTune
VTune is a tool for real-time performance analysis of software.
It has less overhead than Valgrind.
Uses MSRs: model-specific performance-monitoring counters.
Model-specific because the MSRs for one processor may not be compatible with another.
There are two banks of registers:
IA32_PERFEVTSELx: performance event select MSRs
IA32_PMCx: performance monitoring event counters
![Page 29: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/29.jpg)
References
Valgrind website : http://valgrind.org/
Intel, Vtune : http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
Igor Ostrovsky, Gallery of Processor Cache Effects : http://igoro.com/archive/gallery-of-processor-cache-effects/
Siddhartha Chatterjee and Sandeep Sen, Cache Friendly Matrix Transposition
![Page 30: PROCESSOR CACHE EFFECTScse.iitkgp.ac.in/~debdeep/courses_iitkgp/COA2011/slides/Valgrind.pdf · Valgrind is an instrumentation framework for building dynamic analysis tools. There](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbcc3a4d9aeee3f46076d22/html5/thumbnails/30.jpg)
Thank You