ENHANCEMENT OF CACHE PERFORMANCE IN
MULTI-CORE PROCESSORS
A THESIS
Submitted by
MUTHUKUMAR S.
Under the Guidance of
Dr. P.K. JAWAHAR
in partial fulfillment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
November 2014
BONAFIDE CERTIFICATE
Certified that this thesis ENHANCEMENT OF CACHE PERFORMANCE
IN MULTI-CORE PROCESSORS is the bonafide work of MUTHUKUMAR S.
(RRN: 1184211) who carried out the thesis work under my supervision. Certified
further, that to the best of my knowledge the work reported herein does not form
part of any other thesis or dissertation on the basis of which a degree or award
was conferred on an earlier occasion on this or any other candidate.
SIGNATURE
Dr. P.K. JAWAHAR
RESEARCH SUPERVISOR
Professor and Head
Department of ECE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
First and foremost, I am forever thankful to the Almighty for having
bestowed this life and knowledge upon me. Next, I would like to express my
sincere gratitude to my supervisor, Professor Dr. P.K. Jawahar, for his
continuous support during my PhD study and research, and for his patience,
motivation, enthusiasm, and immense knowledge. He gave me the freedom to
do research in my chosen area, as a result of which I felt encouraged
and motivated to achieve my objectives. Besides my advisor, I would also like to
thank the doctoral committee members, Dr. P. Rangarajan and Dr. T.R.
Rangaswamy for their valuable suggestions, encouragement and insightful
comments.
My sincere thanks to Dr. Kaja Mohideen, Dean, School of Electrical and
Communication Sciences for having provided me with this opportunity.
I would like to thank Professor Dr. V. Sankaranarayanan for his advice and
valuable suggestions throughout my research.
I would also like to thank Dr. M. Arunachalam, Secretary, Sri Venkateswara
College of Engineering for his encouragement and support.
Last but not least, I would like to thank my family members for their immense
support. Special thanks go to my wife, Mrs. M. Shanthi, my son, M.
Subramanian, and my daughter, M. Uma, for the patience and moral support
that kept me going throughout the entire course.
I am also indebted to my parents who have stimulated my research interests
from the very early stage of my life. I sincerely hope that I can make my family
members proud.
ABSTRACT
The cache replacement policy in use is a main deciding factor of system
performance and efficiency in Chip Multi-core Processors (CMPs). It is
imperative for any level of cache memory in a multi-core architecture to have a
well-defined, dynamic replacement algorithm in place to ensure consistently
superlative performance. Many existing cache replacement policies, such as the
Least Recently Used (LRU) policy, have proved to work well in the shared
Level 2 (L2) cache for most of the data-set patterns generated by single-
threaded applications. But for parallel multi-threaded applications,
which generate differing patterns of workload at different intervals, the
conventional replacement policies can prove sub-optimal, as they generally do
not exploit spatial and temporal locality and do not adapt
dynamically to changes in the workload.
This dissertation proposes three novel cache replacement policies for L2
cache that are targeted towards such applications. Context Based Data Pattern
Exploitation Technique assigns a counter for every block of the L2 cache for
monitoring the data access patterns of various threads and modifies the counter
values appropriately to maximize the performance. Logical Cache Partitioning
technique logically partitions the cache elements into four zones based on their
‘likeliness’ to be referenced by the processor in the near future. Replacement
candidates are chosen from the zones in the ascending order of their ‘likeliness
factor’ to minimize the number of cache misses. Sharing and Hit based
Prioritizing replacement technique takes the sharing status of the data elements
into consideration while making replacement decisions. This information is
combined with the number of hits received by the element to arrive at a
priority, based on which the replacement decisions are made.
The proposed techniques are effective at improving cache performance and
offer an improvement of up to 9% in overall hits at the L2 cache level and an IPC
(Instructions Per Cycle) speedup of up to 1.08 times that of LRU for a wide
range of multithreaded benchmarks.
REFERENCES
[1] John L. Hennessy and David A. Patterson, "Computer architecture: a
quantitative approach", Fifth Edition, Elsevier, 2012.
[2] Tola John Odule and Idowun Ademola Osinuga, “Dynamically Self-
Adjusting Cache Replacement Algorithm”, International Journal of Future
Generation Communication and Networking, Vol. 6, No. 1, Feb. 2013.
[3] Andhi Janapsatya, Aleksandar Ignjatovic, Jorgen Peddersen and Sri
Parameswaran, "Dueling clock: adaptive cache replacement policy based
on the clock algorithm", IEEE Design, Automation & Test Europe
Conference & Exhibition (DATE), pp. 920-925, 2010.
[4] Geng Tian and Michael Liebelt, "An effectiveness-based adaptive cache
replacement policy", Microprocessors and Microsystems, Vol 38, No.1, pp.
98-111, Elsevier, Feb. 2014.
[5] Anil Kumar Katti and Vijaya Ramachandran, "Competitive cache
replacement strategies for shared cache environments", IEEE 26th
International Parallel & Distributed Processing Symposium (IPDPS), pp.
215-226, 2012.
[6] Martin Kampe, Per Stenstrom and Michel Dubois, "Self-correcting LRU
replacement policies", 1st Conference on Computing Frontiers, ACM
Proceedings, pp. 181-191, 2004.
[7] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik Jhon,
"Adaptive block management for victim cache by exploiting l1 cache history
information", Embedded and Ubiquitous Computing, pp. 1-11, Springer
Berlin Heidelberg, 2004.
[8] Jorge Albericio, Ruben Gran, Pablo Ibanez, Víctor Vinals and Jose Maria
Llaberia, "ABS: A low-cost adaptive controller for prefetching in a banked
shared last-level cache", ACM Transactions on Architecture and Code
Optimization (TACO), Vol. 8, No. 4, Article. 19, 2012.
[9] Ke Zhang, Zhensong Wang, Yong Chen, Huaiyu Zhu and Xian-He Sun,
"PAC-PLRU: A Cache Replacement Policy to Salvage Discarded
Predictions from Hardware Prefetchers", 11th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing, pp. 265-274, May 2011.
[10] Jayesh Gaur, Mainak Chaudhuri and Sreenivas Subramoney, "Bypass and
insertion algorithms for exclusive last-level caches", ACM SIGARCH
Computer Architecture News, Vol. 39, No. 3, 2011.
[11] Hongliang Gao and Chris Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing", JWAC 1st JILP Workshop
on Computer Architecture Competitions: cache replacement
Championship, 2010.
[12] Saurabh Gupta, Hongliang Gao and Huiyang Zhou, "Adaptive Cache
Bypassing for Inclusive Last Level Caches", IEEE 27th International
Symposium on Parallel & Distributed Processing (IPDPS), pp. 1243-1253,
2013.
[13] Mazen Kharbutli and Yan Solihin, "Counter-based cache replacement and
bypassing algorithms", IEEE Transactions on Computers, Vol. 57, No. 4,
pp. 433-447, 2008.
[14] Pierre Michaud, "Replacement policies for shared caches on symmetric
multicores: a programmer-centric point of view", 6th International
Conference on High Performance and Embedded Architectures and
Compilers, pp. 187-196, 2011.
[15] Moinuddin K Qureshi, "Adaptive spill-receive for robust high-performance
caching in CMPs", IEEE 15th International Symposium on High
Performance Computer Architecture (HPCA), pp. 45-54, 2009.
[16] Kathlene Hurt and Byeong Kil Lee, “Proposed Enhancements to Fixed
Segmented LRU Cache Replacement Policy”, IEEE 32nd International
Conference on Performance and Computing and Communications
(IPCCC), pp. 1-2, Dec. 2013.
[17] R. Manikantan, Kaushik Rajan and Ramaswamy Govindarajan, "NUcache:
An efficient multi-core cache organization based on next-use distance",
IEEE 17th International Symposium on High Performance Computer
Architecture (HPCA), pp. 243-253, 2011.
[18] Kamil Kedzierski, Miquel Moreto, Francisco J. Cazorla and Mateo Valero,
"Adapting cache partitioning algorithms to pseudo-lru replacement
policies", IEEE International Symposium on Parallel & Distributed
Processing (IPDPS), pp. 1-12, 2010.
[19] Yuejian Xie and Gabriel H. Loh, "PIPP: promotion/insertion pseudo-
partitioning of multi-core shared caches", ACM SIGARCH Computer
Architecture News, Vol. 37, No. 3, pp. 174-183, 2009.
[20] William Hasenplaugh, Pritpal S. Ahuja, Aamer Jaleel, Simon Steely Jr and
Joel Emer, "The gradient-based cache partitioning algorithm", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 8, No.
4, Article 44, Jan 2012.
[21] Haakon Dybdahl, Per Stenstrom and Lasse Natvig, "A cache-partitioning
aware replacement policy for chip multiprocessors", High Performance
Computing-HiPC, Springer, pp. 22-34, 2006.
[22] Naoki Fujieda, and Kenji Kise, "A partitioning method of cooperative
caching with hit frequency counters for many-core processors", Second
International Conference on Networking and Computing (ICNC), pp. 160-
165, 2011.
[23] Konstantinos Nikas, Matthew Horsnell and Jim Garside, "An adaptive
bloom filter cache partitioning scheme for multi-core architectures”,
International Conference on Embedded Computer Systems, Architectures,
Modeling, and Simulation, SAMOS, pp. 25-32, 2008.
[24] Moinuddin K Qureshi and Yale N. Patt, "Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches", 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 423-432, 2006.
[25] Daniel A Jimenez, "Insertion and promotion for tree-based PseudoLRU
last-level caches", 46th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 284-296, 2013.
[26] Javier Lira, Carlos Molina and Antonio Gonzalez, "LRU-PEA: A smart
replacement policy for non-uniform cache architectures on chip
multiprocessors", IEEE International Conference on Computer Design, pp.
275-281, 2009.
[27] Nikos Hardavellas, Michael Ferdman, Babak Falsafi and Anastasia
Ailamaki, "Reactive NUCA: near-optimal block placement and replication
in distributed caches", ACM SIGARCH Computer Architecture News, Vol.
37, No. 3, pp. 184-195, 2009.
[28] Fazal Hameed, Bauer L and Henkel J, “Dynamic Cache Management in
Multi-Core Architectures Through Runtime Adaptation”, Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 485-
490, March 2012.
[29] Alberto Ros, Marcelo Cintra, Manuel E. Acacio and Jose M. García.
"Distance-aware round-robin mapping for large NUCA caches",
International Conference on High Performance Computing (HiPC), pp. 79-
88, 2009.
[30] Javier Lira, Timothy M. Jones, Carlos Molina and Antonio Gonzalez. "The
migration prefetcher: Anticipating data promotion in dynamic NUCA
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 45, 2012.
[31] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "Dynamic
cache clustering for chip multiprocessors", ACM 23rd International
Conference on Supercomputing, pp. 56-67, 2009.
[32] O. Adamo, A. Naz, K. Kavi, T. Janjusic and C. P. Chung, “Smaller split L-
1 data caches for multi-core processing systems”, IEEE 10th International
Symposium on Pervasive Systems, Algorithms and Networks(I-SPAN
2009), Dec 2009.
[33] Yong Chen, Surendra Byna and Xian-He Sun, "Data access history cache
and associated data prefetching mechanisms", ACM/IEEE Conference
on Supercomputing, pp. 1-12, 2007.
[34] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "ACM: An
efficient approach for managing shared caches in chip multiprocessors",
High Performance Embedded Architectures and Compilers, Springer Berlin
Heidelberg, pp. 355-372, 2009.
[35] Sourav Roy, "H-NMRU: a low area, high performance cache replacement
policy for embedded processors", 22nd International Conference on VLSI
Design, pp. 553-558, 2009.
[36] Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa and Tatsuo Ohtsuki,
"Exact and fast L1 cache simulation for embedded systems", Asia and
South Pacific Design Automation Conference, pp. 817-822, 2009.
[37] Sangyeun Cho, and Lory Al Moakar, "Augmented FIFO Cache
Replacement Policies for Low-Power Embedded Processors", Journal of
Circuits, Systems and Computers, Vol. 18, No. 6, pp. 1081-1092, 2009.
[38] Kimish Patel, Luca Benini, Enrico Macii and Massimo Poncino, "Reducing
conflict misses by application-specific reconfigurable indexing", IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 25, No. 12, pp. 2626-2637, 2006.
[39] Yun Liang, and Tulika Mitra, "Instruction cache locking using temporal
reuse profile", 47th Design Automation Conference, pp. 344-349, 2010.
[40] Yun Liang, and Tulika Mitra, "Improved procedure placement for set
associative caches", International Conference on Compilers, Architectures
and Synthesis for Embedded Systems, pp. 147-156, 2010.
[41] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik Jhon,
"Adaptive block management for victim cache by exploiting l1 cache history
information", Embedded and Ubiquitous Computing, Springer Berlin
Heidelberg, pp. 1-11, 2004.
[42] Cristian Ungureanu, Biplob Debnath, Stephen Rago and Akshat Aranya,
"TBF: A memory-efficient replacement policy for flash-based caches",
IEEE 29th International Conference on Data Engineering (ICDE), pp.
1117-1128, 2013.
[43] Pepijn de Langen and Ben Juurlink, "Reducing Conflict Misses in Caches
by Using Application Specific Placement Functions", 17th Annual
Workshop on Circuits, Systems and Signal Processing, pp. 300-305. 2006.
[44] Dongxing Bao and Xiaoming Li, "A cache scheme based on LRU-like
algorithm", IEEE International Conference on Information and Automation
(ICIA), pp. 2055-2060, 2010.
[45] Livio Soares, David Tam and Michael Stumm, "Reducing the harmful
effects of last-level cache polluters with an OS-level, software-only pollute
buffer", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 258-269, 2008.
[46] Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose
Duato, "MRU-Tour-based Replacement Algorithms for Last-Level Caches",
23rd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 112-119, 2011.
[47] Gangyong Jia, Xi Li, Chao Wang, Xuehai Zhou and Zongwei Zhu, "Cache
Promotion Policy Using Re-reference Interval Prediction", IEEE
International Conference on Cluster Computing (CLUSTER), pp. 534-537,
2012.
[48] Junmin Wu, Xiufeng Sui, Yixuan Tang, Xiaodong Zhu, Jing Wang and
Guoliang Chen, "Cache Management with Partitioning-Aware Eviction and
Thread-Aware Insertion/Promotion Policy", International Symposium on
Parallel and Distributed Processing with Applications (ISPA), pp. 374-381,
2010.
[49] Ali A. El-Moursy and Fadi N. Sibai, "V-Set cache design for LLC of multi-
core processors", IEEE 14th International Conference on High
Performance Computing and Communication (HPCC), pp. 995-1000,
2012.
[50] Aamer Jaleel, Matthew Mattina and Bruce Jacob, "Last level cache (LLC)
performance of data mining workloads on a CMP-a case study of parallel
bioinformatics workloads", The Twelfth International Symposium on High-
Performance Computer Architecture, pp. 88-98, 2006.
[51] Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri and
Jose Martinez, "Scavenger: A new last level cache architecture with global
block priority", 40th Annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 421-432, 2007.
[52] Miao Zhou, Yu Du, Bruce Childers, Rami Melhem and Daniel Mossé,
"Writeback-aware partitioning and replacement for last-level caches in
phase change main memory systems", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 4, Article. 53, 2012.
[53] Fazal Hameed, Lars Bauer and Jorg Henkel, "Adaptive cache
management for a combined SRAM and DRAM cache hierarchy for multi-
cores", Design, Automation & Test in Europe Conference & Exhibition
(DATE), pp. 77-82, 2013.
[54] Fang Juan and Li Chengyan, "An improved multi-core shared cache
replacement algorithm", 11th International Symposium on Distributed
Computing and Applications to Business, Engineering & Science
(DCABES), pp. 13-17, 2012.
[55] Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas
Subramoney and Joseph Nuzman, "Introducing hierarchy-awareness in
replacement and bypass algorithms for last-level caches", International
Conference on Parallel Architectures and Compilation Techniques, pp.
293-304, 2012.
[56] Yingying Tian, Samira M. Khan and Daniel A. Jimenez, "Temporal-based
multilevel correlating inclusive cache replacement", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article. 33,
2013.
[57] Liqiang He, Yan Sun and Chaozhong Zhang, "Adaptive Subset Based
Replacement Policy for High Performance Caching", JWAC-1st JILP
Workshop on Computer Architecture Competitions: Cache Replacement
Championship, 2010.
[58] Baozhong Yu, Yifan Hu, Jianliang Ma and Tianzhou Chen, "Access Pattern
Based Re-reference Interval Table for Last Level Cache", 12th
International Conference on Parallel and Distributed Computing,
Applications and Technologies (PDCAT), pp. 251-256, 2011.
[59] Haiming Liu, Michael Ferdman, Jaehyuk Huh and Doug Burger, "Cache
bursts: A new approach for eliminating dead blocks and increasing cache
efficiency", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 222-233, 2008.
[60] Vineeth Mekkat, Anup Holey, Pen-Chung Yew and Antonia Zhai,
"Managing shared last-level cache in a heterogeneous multicore
processor", 22nd International Conference on Parallel Architectures and
Compilation Techniques, pp. 225-234, 2013.
[61] Rami Sheikh and Mazen Kharbutli, "Improving cache performance by
combining cost-sensitivity and locality principles in cache replacement
algorithms", IEEE International Conference on Computer Design (ICCD),
pp. 76-83, 2010.
[62] Adwan AbdelFattah and Aiman Abu Samra, "Least recently plus five least
frequently replacement policy (LR+ 5LF)", Int. Arab J. Inf. Technology,
Vol. 9, No. 1, pp. 16-21, 2012.
[63] Richa Gupta and Sanjiv Tokekar, "Proficient Pair of Replacement
Algorithms on L1 and L2 Cache for Merge Sort", Journal of Computing, Vol.
2, No. 3, pp. 171-175, March 2010.
[64] Jan Reineke, Daniel Grund, “Sensitivity of Cache Replacement Policies”,
ACM Transactions on Embedded Computer Systems, Vol. 9, No. 4, Article
39, March 2010.
[65] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu and Yale N. Patt, "A
case for MLP-aware cache replacement", ACM SIGARCH Computer
Architecture News, Vol. 34, No. 2, pp. 167-178, 2006.
[66] Marios Kleanthous and Yiannakis Sazeides, "CATCH: A mechanism for
dynamically detecting cache-content-duplication in instruction caches",
ACM Transactions on Architecture and Code Optimization (TACO), Vol. 8,
No. 3, Article. 11, 2011.
[67] Moinuddin K. Qureshi, David Thompson and Yale N. Patt, "The V-Way
cache: demand-based associativity via global replacement", 32nd
International Symposium on Computer Architecture, ISCA'05 Proceedings,
pp. 544-555, 2005.
[68] Jichuan Chang and Gurindar S. Sohi, “Cooperative caching for chip
multiprocessors”, IEEE Computer Society, Vol. 34, No. 2, 2006.
[69] Shekhar Srikantaiah, Mahmut Kandemir and Mary Jane Irwin, "Adaptive
set pinning: managing shared caches in chip multiprocessors", ACM
SIGARCH Computer Architecture News, Vol. 36, No. 1, pp. 135-144, 2008.
[70] Ranjith Subramanian, Yannis Smaragdakis and Gabriel H. Loh, “Adaptive
Caches: Effective Shaping of Cache Behavior to Workloads”, IEEE
MICRO, Vol. 39, 2006.
[71] Tripti Warrier S, B. Anupama and Madhu Mutyam, "An application-aware
cache replacement policy for last-level caches", Architecture of Computing
Systems–ARCS, Springer Berlin Heidelberg, pp. 207-219, 2013.
[72] Chongmin Li, Dongsheng Wang, Yibo Xue, Haixia Wang and Xi Zhang.
"Enhanced adaptive insertion policy for shared caches", Advanced Parallel
Processing Technologies, Springer Berlin Heidelberg, pp. 16-30, 2011.
[73] Xiaoning Ding, Kaibo Wang and Xiaodong Zhang, "SRM-buffer: an OS
buffer management technique to prevent last level cache from thrashing in
multi-cores", ACM 6th Conference on Computer Systems, pp. 243-256,
2011.
[74] Jiayuan Meng and Kevin Skadron, "Avoiding cache thrashing due to private
data placement in last-level cache for many core scaling", IEEE
International Conference on Computer Design (ICCD), pp. 282-288, 2009.
[75] Vivek Seshadri, Onur Mutlu, Michael A. Kozuch and Todd C. Mowry, "The
Evicted-Address Filter: A unified mechanism to address both cache
pollution and thrashing", 21st International Conference on Parallel
Architectures and Compilation Techniques, pp. 355-366, 2012.
[76] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and
Joel Emer, “Adaptive Insertion Policies for High Performance Caching”,
ACM SIGARCH Computer Architecture News, Vol. 35, No. 2, pp. 381-391,
June 2007.
[77] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero
and Alexander V. Veidenbaum, "Improving cache management policies
using dynamic reuse distances", 45th Annual IEEE/ACM International
Symposium on Microarchitecture, IEEE Computer Society, pp. 389-400,
2012.
[78] Xi Zhang, Chongmin Li, Haixia Wang and Dongsheng Wang, "A cache
replacement policy using adaptive insertion and re-reference prediction",
22nd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 95-102, 2010.
[79] Min Feng, Chen Tian, Changhui Lin and Rajiv Gupta, "Dynamic access
distance driven cache replacement", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 3, Article 14, 2011.
[80] Jun Zhang, Xiaoya Fan and Song-He Liu, "A pollution alleviative L2 cache
replacement policy for chip multiprocessor architecture", International
Conference on Networking, Architecture and Storage, NAS'08, pp. 310-
316, 2008.
[81] Yuejian Xie and Gabriel H. Loh, "Scalable shared-cache management by
containing thrashing workloads", High Performance Embedded
Architectures and Compilers, Springer Berlin Heidelberg, pp. 262-276,
2010.
[82] Seongbeom Kim, Dhruba Chandra and Yan Solihin, "Fair cache sharing
and partitioning in a chip multiprocessor architecture", 13th International
Conference on Parallel Architectures and Compilation Techniques, IEEE
Computer Society, pp. 111-122, 2004.
[83] Jichuan Chang and Gurindar S. Sohi, "Cooperative cache partitioning for
chip multiprocessors", 21st ACM International Conference on
Supercomputing, pp. 242-252, 2007.
[84] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot,
Simon Steely Jr and Joel Emer, "Adaptive insertion policies for managing
shared caches", 17th International Conference on Parallel Architectures
and Compilation Techniques, pp. 208-219, 2008.
[85] Carole-Jean Wu and Margaret Martonosi, "Adaptive timekeeping
replacement: Fine-grained capacity management for shared CMP
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 1, Article. 3, 2011.
[86] Mainak Chaudhuri, "Pseudo-LIFO: the foundation of a new family of
replacement policies for last-level caches", 42nd Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 401-412, 2009.
[87] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr and Joel Emer, "High
performance cache replacement using re-reference interval prediction
(RRIP)", ACM SIGARCH Computer Architecture News, Vol. 38, No. 3, pp.
60-71, 2010.
[88] Viacheslav V Fedorov, Sheng Qiu, A. L. Reddy and Paul V. Gratz, "ARI:
Adaptive LLC-memory traffic management", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article. 46,
2013.
[89] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and Joel
Emer, "Set-dueling-controlled adaptive insertion for high-performance
caching", Micro IEEE, Vol. 28, No. 1, pp. 91-98, 2008.
[90] Aamer Jaleel, Hashem H. Najaf-Abadi, Samantika Subramaniam, Simon
C. Steely and Joel Emer, "Cruise: cache replacement and utility-aware
scheduling", ACM SIGARCH Computer Architecture News, Vol. 40, No. 1,
pp. 249-260, 2012.
[91] George Kurian, Srinivas Devadas and Omer Khan, "Locality-Aware Data
Replication in the Last-Level Cache", High-Performance Computer
Architecture, 20th International Symposium on IEEE, 2014.
[92] Jason Zebchuk, Srihari Makineni and Donald Newell, "Re-examining cache
replacement policies", IEEE International Conference on Computer Design
(ICCD), pp. 671-678, 2008.
[93] Samira Manabi Khan, Zhe Wang and Daniel A. Jimenez, "Decoupled
dynamic cache segmentation", IEEE 18th International Symposium
on High Performance Computer Architecture (HPCA), pp. 1-12, 2012.
[94] Song Jiang and Xiaodong Zhang, "LIRS: an efficient low inter-reference
recency set replacement policy to improve buffer cache performance" ACM
SIGMETRICS Performance Evaluation Review, Vol. 30, No.1, pp. 31-42,
2002.
[95] Song Jiang and Xiaodong Zhang, "Making LRU friendly to weak locality
workloads: A novel replacement algorithm to improve buffer cache
performance", IEEE Transactions on Computers, Vol. 54, No. 8, pp. 939-
952, 2005.
[96] Izuchukwu Nwachukwu, Krishna Kavi, Fawibe Ademola and Chris Yan,
"Evaluation of techniques to improve cache access uniformities",
International Conference on Parallel Processing (ICPP), pp. 31-40, 2011.
[97] Dyer Rolán, Basilio B. Fraguela and Ramon Doallo, “Virtually split cache:
An efficient mechanism to distribute instructions and data”, ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10, No.
4, Article. 27, Dec 2013.
[98] Huang ZhiBin, Zhu Mingfa and Xiao Limin, "LvtPPP: live-time protected
pseudopartitioning of multicore shared caches", IEEE Transactions
on Parallel and Distributed Systems, Vol. 24, No. 8, pp. 1622-1632, 2013.
[99] Bradford M Beckmann, Michael R. Marty and David A. Wood, "ASR:
Adaptive selective replication for CMP caches", 39th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-39, pp. 443-454,
2006.
[100] Antonio Garcia-Guirado, Ricardo Fernandez-Pascual, Alberto Ros and
José M. García, "DAPSCO: Distance-aware partially shared cache
organization", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 25, 2012.
[101] Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi,
Simon C. Steely Jr and Joel Emer, "SHiP: signature-based hit predictor for
high performance caching", 44th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 430-441, 2011.
[102] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, Derek. R. Hower, Tushar Krishna,
Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay
Vaish, Mark D. Hill and David A. Wood, "The gem5 simulator", ACM
SIGARCH Computer Architecture News, Vol. 39, No. 2 pp. 1-7, 2011.
[103] Christian Bienia, Sanjeev Kumar and Kai Li, "PARSEC vs. SPLASH-2: A
quantitative comparison of two multithreaded benchmark suites on chip-
multiprocessors", IEEE International Symposium on Workload
Characterization (IISWC), pp. 47-56, 2008.
[104] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh and Kai Li, "The
PARSEC benchmark suite: Characterization and architectural
implications", 17th International Conference on Parallel Architectures and
Compilation Techniques, pp. 72-81, 2008.
[105] Mark Gebhart, Joel Hestness, Ehsan Fatehi, Paul Gratz and Stephen W.
Keckler, "Running PARSEC 2.1 on M5", University of Texas at Austin,
Department of Computer Science, Technical Report# TR-09-32, 2009.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iv
LIST OF TABLES viii
LIST OF FIGURES ix
LIST OF SYMBOLS AND ABBREVIATIONS xi

1. INTRODUCTION 1
1.1 Chip Multi-core Processors 2
1.2 Memory Management in CMP 3
1.3 Overview of Existing Replacement Policies 3
1.4 The Problem 6
1.5 Scope of Work 6
1.6 Research Objectives 7
1.7 Contributions 7
1.8 Outline of the Thesis 8
1.9 Summary 9

2. LITERATURE SURVEY 10
2.1 Replacement Policies 10
2.2 Data Prefetching and Cache Bypassing 11
2.3 Effectiveness of Cache Partitioning 11
2.4 Embedded System Processors 12
2.5 Performance of Shared Last-Level Caches 13
2.6 Cache Thrashing and Cache Pollution 15
2.7 Prevalence of Multi-Threaded Applications 16
2.8 Summary 17

3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE 18
3.1 Introduction 18
3.1.1 Different Types of Application Workloads 19
3.2 Basic Structure of CB-DPET 22
3.3 DPET 22
3.4 DPET-V: A Thrash Proof Version 24
3.5 Policy Selection Technique 31
3.6 Thread Based Decision Making 33
3.7 Block Status Based Decision Making 34
3.8 Modifications in the Proposed Policies 34
3.9 Comparison between LRU, NRU and CB-DPET 35
3.9.1 Performance of LRU 35
3.9.2 Performance of NRU 36
3.9.3 Performance of DPET 37
3.10 Results 38
3.10.1 CB-DPET Performance 39
3.11 Summary 43

4. LOGICAL CACHE PARTITIONING TECHNIQUE 45
4.1 Introduction 45
4.2 Logical Cache Partitioning Method 46
4.3 Replacement Policy 47
4.4 Insertion Policy 47
4.5 Promotion Policy 48
4.6 Boundary Condition 49
4.7 Comparison with LRU 53
4.8 Results 53
4.8.1 LCP Performance 54
4.9 Summary 57

5. SHARING AND HIT-BASED PRIORITIZING TECHNIQUE 58
5.1 Introduction 58
5.2 Sharing and Hit-Based Prioritizing Method 59
5.3 Replacement Priority 60
5.4 Replacement Policy 61
5.5 Insertion Policy 66
5.6 Promotion Policy 66
5.7 Results 66
5.7.1 SHP Performance 66
5.8 Summary 69

6. EXPERIMENTAL METHODOLOGY 70
6.1 Simulation Environment 70
6.2 PARSEC Benchmark 71
6.3 Evaluation of Many-Core Systems 71
6.4 Storage Overhead 76
6.4.1 CB-DPET 76
6.4.2 LCP 76
6.4.3 SHP 76

7. CONCLUSION AND FUTURE WORK 78
7.1 Conclusion 78
7.2 Scope for Further Work 79

REFERENCES 81
LIST OF PUBLICATIONS 97
TECHNICAL BIOGRAPHY 98
LIST OF PUBLICATIONS
[1] Muthukumar S and Jawahar P.K., “Hit Rate Maximization by Logical Cache
Partitioning in a Multi-Core Environment”, Journal of Computer Science,
Vol. 10, Issue 3, pp. 492-498, 2014.
[2] Muthukumar S and Jawahar P.K., “Sharing and Hit Based Prioritizing
Replacement Algorithm for Multi-Threaded Applications”, International
Journal of Computer Applications, Vol. 90, Issue 12, pp. 34-38, 2014.
[3] Muthukumar S and Jawahar P.K., “Cache Replacement for Multi-Threaded
Applications using Context-Based Data Pattern Exploitation Technique”,
Malaysian Journal of Computer Science, Vol. 26, Issue 4, pp. 277-293,
2014.
[4] Muthukumar S and Jawahar P.K., “Redundant Cache Data Eviction in a
Multi-Core Environment”, International Journal of Advances in Engineering
and Technology, Vol. 5, Issue 2, pp.168-175, 2013.
[5] Muthukumar S and Jawahar P.K., “Cache Replacement Policies for
Improving LLC Performance in Multi-Core Processors”, International
Journal of Computer Applications, Vol. 105, Issue 8, pp. 12-17, 2014.
1. INTRODUCTION
The advent of multi-core computers, consisting of two or more processor cores
working in tandem on a single integrated circuit, has produced a phenomenal
breakthrough in performance in recent years. Efficient memory
management is required to realize the full potential of Chip Multi-core
Processors (CMPs).
One of the main concerns in recent times is the processor-memory gap: the
scenario where the processor stalls while a memory operation completes.
Fig 1.1 shows that this situation is getting worse, as processor performance
continues to increase manifold while memory latency remains almost the same.
On-chip cache memory clock speeds have hit a wall, where practically no
further improvement can be achieved. The use of multiple cores has mitigated
this effect to some extent, but the gap is expected to widen further in the
future, stressing the need for effective memory management techniques.
Fig 1.1 Processor memory gap [1]
1.1. Chip Multi-core Processors (CMP)
In a CMP, multiple processor cores are packaged into a single physical
processor chip that is capable of executing multiple tasks simultaneously.
Multi-core processors allow faster execution of applications by exploiting the
available parallelism (mainly thread-level parallelism), which implies the
ability to work on multiple problems simultaneously. Parallel multi-threaded
applications have high computational and throughput requirements. To expedite
their execution, these applications spawn multiple threads which run in
parallel. Each thread has its own execution context and is assigned a
particular task. In a multi-core architecture, these threads can run on the
same core or on different cores. A two-core processor with two threads T1 and
T2 is depicted in Fig 1.2.
Fig 1.2 Two core processor-T1 and T2 represent the threads
Every core now behaves like an individual processor and has its own
execution pipeline. The cores can execute multiple threads in parallel,
expediting the running time of the program as a whole and boosting overall
system performance and throughput. The efficiency of CMPs largely relies on
how well a software application can leverage the multiple cores during
execution. Many applications that have emerged in recent times have been
optimized to exploit multiple cores (when present) to improve performance.
Networking applications and other CPU-intensive applications have benefitted
hugely from the CMP architecture.
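As a small illustration of this execution model, the following Python sketch (illustrative only; the worker function and the data split between the threads are hypothetical) spawns two threads T1 and T2, each with its own task; the OS scheduler decides whether they run on the same core or on different cores:

```python
import threading

results = {}

def worker(name, data):
    # Each thread has its own execution context and a particular task:
    # here, summing one half of the input range.
    results[name] = sum(data)

# Two threads T1 and T2, as in Fig 1.2.
t1 = threading.Thread(target=worker, args=("T1", range(0, 500)))
t2 = threading.Thread(target=worker, args=("T2", range(500, 1000)))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["T1"] + results["T2"])  # 499500, i.e. sum(range(1000))
```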
1.2. Memory Management in CMP
Memory management forms an integral part in determining the performance of
CMPs. Cache memory is used to hide the latency of main memory and also helps
reduce processing time in CMPs. Generally, every core in a CMP environment has
a private L1 cache, and all the available cores share a relatively larger
unified L2 cache (which in most cases is also referred to as the shared Last
Level Cache). The L1 cache is split into an ‘instruction cache’ and a ‘data
cache’ as shown in Fig 1.3. The mapping method used is ‘set-associative’
mapping [1], where the cache is divided into a specified number of sets, each
set holding a specified number of blocks.
The number of blocks in each set indicates the associativity of the cache.
The size of the L1 cache has to be kept small owing to chip constraints. Since
it is directly integrated into the chip that carries the processor, its data
retrieval time is short.
Fig 1.3 General memory hierarchy in a CMP environment
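To make the set-associative mapping concrete, the following sketch shows how a byte address is split into tag, set index and block offset. The cache parameters here (64-byte blocks, 512 sets, 8-way) are assumptions chosen for illustration, not the configuration evaluated in this thesis:

```python
# Assumed (hypothetical) geometry: 64-byte blocks, 512 sets, 8-way
# associativity -> a 256 KB cache.
BLOCK_SIZE = 64   # bytes per block
NUM_SETS = 512
ASSOC = 8         # blocks per set = associativity of the cache

def decompose(addr):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % BLOCK_SIZE            # byte within the block
    block_addr = addr // BLOCK_SIZE
    set_index = block_addr % NUM_SETS     # which set the block maps to
    tag = block_addr // NUM_SETS          # identifies the block in the set
    return tag, set_index, offset

print(decompose(0x12345))  # (2, 141, 5)
```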
1.3. Overview of Existing Replacement Policies
Replacement policies play a vital role in determining the performance of a
system. In a multi-core environment, where processing speed and throughput
are of the essence, selecting a suitable replacement technique for the
underlying cache memories becomes crucial. The primary objective of any
replacement algorithm is to utilize the available cache space effectively and
reduce the number of data misses to the maximum extent possible. Misses are of
three types, namely compulsory misses, conflict misses and capacity misses.
Compulsory misses happen when a data item is fetched from main memory and
stored into the cache for the first time; in most cases they are unavoidable.
Conflict misses are common in direct-mapped cache architectures, where more
than one data item competes for the same cache line, resulting in eviction of
the previously inserted data item. As the name suggests, capacity misses arise
from the inability of the cache to hold the entire working set, which can be
observed in most applications owing to the small cache size. An efficient
replacement algorithm targets the capacity misses and tries to reduce them as
much as possible.
Since the L2 cache is accessed by multiple cores and holds shared data,
judicious replacement strategies need to be adopted here in order to improve
performance. At present, many applications make use of the LRU cache
replacement algorithm [1], which discards the ‘oldest’ cache line available,
i.e. the cache line that has not been accessed for a period of time. This
implies that the ‘age’ of every cache line needs to be updated after every
access, which is done with the help of either a counter or a queue data
structure. In the queue-based approach, data referenced by the processor is
moved to the top of the queue, pushing all elements below it down by one step.
At any point of time, to accommodate an incoming block, LRU deletes the block
at the bottom of the queue (the least recently used data).
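The queue-based LRU bookkeeping described above can be sketched as follows (a minimal single-set model; `OrderedDict` plays the role of the queue):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set managed by queue-based LRU."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.blocks = OrderedDict()   # insertion order doubles as the queue

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # promote to most recent (top)
            return True                      # hit
        if len(self.blocks) >= self.assoc:
            self.blocks.popitem(last=False)  # evict least recent (bottom)
        self.blocks[block] = None
        return False                         # miss

s = LRUSet(assoc=2)
hits = [s.access(b) for b in [1, 2, 1, 3, 2]]
print(hits)  # [False, False, True, False, False]
```

The final access to block 2 misses because the insertion of block 3 evicted block 2, which had become the least recently used line.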
This method works well for many current applications. However, for certain
workloads with an unpredictable data reference pattern that fluctuates
frequently with time, LRU can result in poor performance. A closer look at LRU
reveals the cause of this problem. It assumes that all applications follow the
same reference pattern, i.e. that they abide by the spatial and temporal
locality principles. Any data item that is not referenced by the processor for
a considerable amount of time is tagged as ‘least recently used’ and evicted
from the cache. When such a data item is required by the processor in the
future, it is not found in the cache and has to be fetched from main memory.
The Most Recently Used (MRU) policy works in a similar fashion, with the
slight difference that the victim is chosen from the opposite end of the data
structure, i.e. the most recently referenced data is evicted to accommodate an
incoming block. The above algorithms work fine when the ‘locality of
reference’ is high, but for applications whose data sets do not exhibit any
degree of regularity in their access pattern, they might result in sub-optimal
performance. An example of such a pattern is given in Fig 1.4. In the ensuing
sections, individual data items are referred to by enclosing them within curly
braces.
Fig 1.4 Data sequence containing a recurring pattern {1,5,7}
As observed from Fig 1.4, the starting pattern {1,5,7} recurs after a long
burst of random data. If LRU is applied to this set of data, it results in
poor performance, as there is no regularity in the access pattern.
Pseudo-LRU keeps a single status bit for each cache block. Initially, all
bits are reset; every access to a block sets its bit, indicating that the
block was recently used. This too may degrade performance for the above data
sequence.
Another replacement technique commonly employed apart from LRU and MRU is
the Not Recently Used (NRU) algorithm. NRU associates a single-bit counter
with every cache line to aid replacement decisions. It can be effective
compared to LRU in certain cases, but a 1-bit counter may not be powerful
enough to provide the required dynamicity.
The Random replacement algorithm arbitrarily selects a victim for
replacement when necessary; it keeps no information about the cache access
history. The Least Frequently Used (LFU) technique tracks the number of times
a block is referenced using a counter, and the block with the lowest count is
selected for replacement.
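The single-bit NRU bookkeeping can be illustrated with the following simplified sketch (an assumption-laden model: real hardware may pick any line whose bit is clear, whereas this code picks the first such line deterministically):

```python
class NRUSet:
    """One cache set with a 1-bit 'recently used' flag per line."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.ru = {}  # block -> 1-bit recently-used flag

    def access(self, block):
        if block in self.ru:
            self.ru[block] = 1            # hit: mark recently used
            return True
        if len(self.ru) >= self.assoc:    # miss on a full set: evict
            victims = [b for b, bit in self.ru.items() if bit == 0]
            if not victims:               # every bit set: clear them all
                for b in self.ru:
                    self.ru[b] = 0
                victims = list(self.ru)
            del self.ru[victims[0]]       # deterministic choice (sketch)
        self.ru[block] = 1                # insert with the bit set
        return False

s = NRUSet(assoc=2)
outcome = [s.access(b) for b in [1, 2, 1, 2, 3]]
print(outcome)  # [False, False, True, True, False]
```

When block 3 arrives, both lines have their bits set, so all bits are cleared before a victim is chosen; with only one bit of history, the policy cannot distinguish which of the two lines is the better victim, which is the lack of dynamicity noted above.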
1.4. The Problem
An effective replacement algorithm is important for cache management in a
CMP. For the on-chip cache it becomes even more crucial, as the cache memory
is considerably small. To fine-tune cache performance, it is imperative to
have an efficient replacement technique in place. Apart from being efficient,
it should also be dynamic, i.e. it should adjust itself to changing data
access patterns.
The most prevalently used replacement algorithms, such as Least Recently
Used, Most Recently Used and Not Recently Used, have been shown to work well
for many applications, but they are not dynamic in nature, as discussed in the
earlier section. Irrespective of the data pattern exhibited by an application,
they follow the same technique to select the replacement victim.
It was observed that if a replacement algorithm could make decisions based
on the access history and the current access pattern, the number of cache hits
can increase considerably, thereby reducing the number of misses and improving
overall performance. The primary aim of this dissertation is to propose and
design such dynamic, novel cache replacement algorithms and evaluate them
against the conventional LRU approach.
1.5. Scope of Work
Managing cache memory involves many challenges that need to be addressed
from various angles. The behavior of various existing replacement policies on
the shared last level cache is analyzed. In multi-threaded applications,
shared data has to be handled with care: when there is a miss on a data item
shared by multiple threads, the execution of many threads gets stalled.
Replacement decisions need to be dynamic and self-adjusting to changes in the
workload pattern. Since the cache is small, the available cache space must be
utilized efficiently. The techniques proposed in this work cover all the above
aspects while refraining from in-depth treatment of cache memory hardware
architecture optimizations.
1.6. Research Objectives
The following are the objectives of this research:
- To augment the cache subsystem of a CMP with novel replacement algorithms
for improving the performance of the shared Last Level Cache (LLC).
- To propose a dynamic and self-adjusting cache replacement policy with low
storage overhead and robustness across workloads.
- To manage the shared cache efficiently between multiple applications.
- To make the LLC thrash-resistant and scan-resistant and maximize its
utilization, which helps accelerate overall performance for memory-intensive
workloads.
1.7. Contributions
In order to address the shortcomings of conventional replacement
algorithms, three novel counter-based replacement strategies targeted at the
shared LLC have been proposed and evaluated over a set of multi-threaded
benchmarks.
- The first of the three methods, the Context Based Data Pattern Exploitation
Technique, maximizes cache performance by closely monitoring and adapting
itself to the pattern occurring in the workload sequence with the help of a
block-level 3-bit counter. Results show that the overall number of hits
obtained at the L2 cache level is on average 8% higher than with the LRU
approach, and an IPC speedup of up to 6% is achieved.
- Logical Cache Partitioning, the second technique, logically partitions the
cache into four zones, namely Most likely to be referenced, Likely to be
referenced, Less likely to be referenced and Never likely to be referenced,
in decreasing order of likeliness. Replacement decisions are made based on
the likeliness of a block being referenced by the processor in the near
future. Results show an improvement of up to 9% in the overall number of hits
and 5% in IPC speedup compared to LRU.
- Finally, Sharing and Hit-based Prioritizing is a novel counter-based
replacement technique that makes replacement decisions based on the sharing
nature of the cache blocks. Every cache block is classified into one of four
degrees of sharing, namely not shared (or private), lightly shared, heavily
shared and very heavily shared, with the help of a 2-bit counter. This
sharing-degree information is combined with the number of hits received by
the block, based on which replacement decisions are made. Results show that
this method achieves a 9% improvement in the overall number of hits and an 8%
improvement in IPC speedup compared to the baseline LRU policy.
1.8. Outline of the Thesis
This thesis is organized into seven chapters, as follows.
Chapter 2 consists of the literature survey, which discusses a versatile set
of studies on cache management in single-core and multi-core architectures.
Chapter 3 presents a novel counter-based replacement technique, namely the
Context Based Data Pattern Exploitation Technique. It first presents the
different types of workloads and the basic structure of the policy, then
elaborates on the algorithm and results, and finally summarizes the method.
Chapter 4 presents the Logical Cache Partitioning technique. It first
presents the proposed replacement policy and then evaluates its significance
and results compared to LRU.
Chapter 5 describes the novel replacement policy, the Sharing and Hit-based
Prioritizing technique, and evaluates its performance.
Chapter 6 discusses the simulation environment and the benchmarks used to
evaluate all three methods, together with the results obtained for all the
techniques, comparing and contrasting them with the results obtained for the
same set of benchmarks under the conventional LRU approach. The final section
of this chapter presents the storage overhead.
Chapter 7 wraps up by presenting the conclusion and directions for future
work.
1.9. Summary
This chapter has provided an overview of the problems associated with
existing cache replacement policies for the shared last level cache in chip
multiprocessors; this forms the essential foundation for Chapters 3, 4 and 5.
It has presented an outline of the thesis, specifically devoting three
chapters to the drawbacks associated with the shared LLC in CMPs. The next
chapter presents a survey of the literature on cache replacement policies for
multi-core processors.
2. LITERATURE SURVEY
2.1. Replacement Policies
Replacement policies play a crucial role in CMP cache management.
Considering the on-chip cache memory of a processor, choosing the right
replacement technique is extremely important in determining the performance of
the system. The more adaptive a replacement technique is to the incoming
workload, the lower the overhead involved in frequently transferring data to
and from secondary memory. John Odule and Ademola Osingua have proposed a
dynamically self-adjusting cache replacement algorithm [2] that makes page
replacement decisions based on changing access patterns by adapting to the
spatial and temporal features of the workload. Though it adapts dynamically to
workload needs, it does not detect changes or make decisions at a finer level,
i.e. at the block level, which can be crucial when the working set changes
frequently.
Many algorithms that work better than LRU on specific workloads have
evolved in recent times. One such algorithm is the Dueling Clock (DC)
algorithm [3], designed for the shared L2 cache; however, it does not consider
applications that involve multiple threads executing in parallel. The
Effectiveness-Based Replacement technique (EBR) [4] ranks the cache elements
within a set based on recency and frequency of access and chooses the
replacement victim based on those ranks. In a multi-threaded environment, a
miss on shared data can be pivotal and calls for more attention than is
provided here. Competitive cache replacement [5] proposes a replacement
algorithm for inclusive, exclusive and partially-inclusive caches by
considering two knowledge models. Self-correcting LRU [6] proposes a technique
in which LRU is augmented with a framework to constantly monitor and correct
the replacement mistakes LRU tends to make during execution. Adaptive block
management [7] proposes a method to use the victim cache more efficiently,
thereby reducing the number of accesses to the L2 cache.
2.2. Data Prefetching and Cache Bypassing
Data prefetching can be very useful in any cache architecture, since it
helps reduce the cache miss latency [8]. The Prediction-Aware
Confidence-based Pseudo LRU (PAC-PLRU) algorithm [9] efficiently makes use of
prefetched information and also prevents that information from being evicted
from the cache unnecessarily. There can also be scenarios where the overhead
can be reduced by promoting the data fetched from main memory directly to the
processor, bypassing the cache [10]. Hongliang Gao and Chris Wilkerson’s Dual
Segmented LRU algorithm [11] has explored the benefits of cache bypassing in
detail. Though cache bypassing [12] can help reduce transfer overhead, it may
not agree with the ‘locality principles’, i.e. bypassing a data item likely to
be referenced in the near future might not be a good idea.
The counter-based cache replacement and cache bypassing algorithm [13]
tries to locate dead cache lines well in advance and chooses them as
replacement victims, thus preventing such lines from polluting the cache. An
augmented cache architecture is one that comprises two parts, a larger
direct-mapped cache and a smaller fully-associative cache, which are accessed
in parallel. A replacement technique similar to LRU, known as the LRU-Like
algorithm [14], was proposed for augmented cache architectures.
Backing up a few of the evicted lines in another cache can help increment
the overall hits received. The Adaptive Spill-Receive Cache [15] works along
similar lines, with a spiller cache (one that ‘spills’ the evicted line) and a
receiver cache (one that stores the ‘spilled’ lines). But if the cache line
reuse prediction is not efficient, this can result in many stale cache lines
as the algorithm executes.
2.3. Effectiveness of Cache Partitioning
In a multi-processor environment, efficient cache partitioning can help
improve the performance and throughput of the system [17,18,19].
Partition-aware replacement algorithms exploit the cache partitions and try to
minimize cache misses [20,21]. Filter-based partitioning techniques have been
widely proposed [22,24]. Adaptive Bloom Filter Cache Partitioning (ABFCP) [23]
involves dynamic cache partitioning at periodic intervals, using Bloom filters
and counters to adapt to the needs of concurrent applications. But since every
processor core is allotted an array of Bloom filters, the hardware complexity
can increase as the number of cores grows. The Pseudo-LRU algorithm (similar
to LRU but with lower hardware complexity) has been shown to work well with
partitioned caches [25]. Studies have also been conducted on enhancing the
performance of Non-Uniform Cache Architectures (NUCA) [26,27,29]. A dynamic
cache management scheme [28] aimed at the Last-Level Caches in NUCA claims to
have improved system performance.
Although many schemes work well by adapting to the dynamic data
requirements of the applications, they do not attach importance to data that
is shared between the threads of parallel workloads. Hits on such data items
not only improve the performance of those workloads, but also reduce the
overhead involved in frequently transferring the data to and from secondary
memory. Dynamic cache clustering [31] groups cache blocks into multiple
dynamic cache banks on a per-core basis to expedite data access; placing new
incoming data items into the appropriate cache bank at run time can induce
transfer overheads, though. The smaller split L-1 data caches proposal [32]
offers a split design for the L1 data cache using separate L-1 array and L-2
scalar caches, which has shown performance improvements. Yong Chen et al.
suggest the use of a separate cache to store the data access histories and
study its associated prefetching mechanism, which has produced performance
gains [33].
2.4. Embedded System Processors
Embedded processors differ from conventional processors in that they are
tailor-made for very specific applications. Replacement policies have been
proposed that target embedded systems [36,37,38]. The Hierarchical
Non-Most-Recently-Used (H-NMRU) replacement policy [35] is one such algorithm
that has been shown to work well on those processors. [39] deals with the
cache block locking problem and provides a heuristic and efficient solution,
whereas [40] proposes improved procedure placement policies in the cache
memory. A Two Bloom Filter (TBF) based approach [42], proposed primarily for
flash-based cache systems, has been shown to work well compared to LRU. The
proposal by Pepijn de Langen et al. [43] targets the performance of
direct-mapped caches for application-specific processors. Dongxing Bao et al.
suggest a method [44] to improve the performance of direct-mapped caches in
which blocks are evicted based on their usage efficiency; this method has
shown improvement in miss rate.
2.5. Performance of Shared Last-Level Caches
The performance of the shared LLC in a multi-core architecture plays a
crucial role in deciding the overall performance of the system, since all the
cores access the data present there. A variety of studies [45,46,47] have been
conducted on improving shared LLC performance. Phase Change Memory (PCM) with
an LLC forms an effective low-power alternative to the traditional DRAMs in
existence. Writeback-aware Cache Partitioning (WCP) [52] efficiently
partitions the LLC in PCM among multiple applications, and the Write Queue
Balancing (WQB) replacement policy [52] helps improve the throughput of PCM.
But one of the main drawbacks of PCM is that writes are much slower than in
DRAM.
The Adaptive Cache Management technique [53] combines SRAM and DRAM
architectures into the LLC and proposes an adaptive DRAM replacement policy to
accommodate the changing access patterns of applications. Though it reports
improvement over LRU, it does not discuss the hardware overhead that might
result from the change in architecture. Frequency-based LRU (FLRU) [54] takes
the recent access information, as LRU does, along with partition and frequency
information, when making replacement decisions.
Cache Hierarchy Aware Replacement (CHAR) [55] makes replacement decisions
in the LLC based on activities in the inner levels of the cache: it studies
the data patterns that hit or miss in the higher levels of the cache and makes
appropriate decisions in the LLC. Yingying Tian et al. [56] have proposed a
method that functions along similar lines for making decisions at the LLC. But
propagating such information frequently across the various levels of cache
might slow the system down over time. Cache space is another area of concern.
Instead of utilizing the entire cache space, an Adaptive Subset Based
Replacement Policy (ASRP) [57] has been proposed, where the cache space is
divided into multiple subsets and at any point of time only one subset is
active; whenever a replacement needs to be made, the victim is selected only
from the active subset. But there is a possibility of under-utilization of the
available cache space in this technique. For any replacement algorithm to be
successful, it is essential to make good use of the available cache space.
This can be achieved by eradicating the dead cache lines, i.e. the lines that
will never be referenced by the processor in the future, and much research has
been carried out in this area [58,59]. The problem here is that for
applications with an unpredictable data access pattern, dead-line prediction
becomes intricate, and in the long run the predictor might lose its accuracy.
Vineeth Mekkat et al. [60] have proposed a method that focuses on improving
LLC performance on heterogeneous multi-core processors by targeting and
optimizing certain Graphics Processing Unit (GPU) based applications, since
they spawn a huge number of threads compared to other applications.
The effect of miss penalty can have a serious impact on system efficiency.
The Locality Aware Cost Sensitive (LACS) replacement technique [61] works on
reducing the miss penalty in caches. It calculates the number of instructions
issued during a miss for every cache block, and when a victim needs to be
selected for replacement, it selects the block that issued the minimum number
of instructions during its miss. While this helps reduce the miss penalty, it
does not attach importance to the data access pattern of the application under
execution. Another replacement policy, called the LR+5LF policy [62], combines
the concepts of LRU and LFU to find a replacement candidate; hardware
complexity is an issue here. Replacement algorithms that work well with the L1
cache may not suit the L2 cache, so choosing an efficient pair of replacement
algorithms for L1 and L2 becomes important for maximizing the hit rate [63].
Studies have indicated that execution history, such as the initial hardware
state of the cache, can also impact the sensitivity of replacement algorithms
[64]. Memory Level Parallelism (MLP) involves servicing multiple memory
accesses simultaneously, which saves the processor from stalling on a single
long-latency memory request. The replacement technique proposed in [65]
incorporates MLP and dynamically makes a replacement decision on a miss such
that memory-related stalls are reduced.
Tag-based data duplication may occur in caches, which can be exploited to
eliminate lower-level memory accesses, thereby reducing memory latency [66].
Demand-based associativity [67] involves varying the associativity of a cache
set (the total number of cache blocks per set) according to the requirements
of the application’s data access pattern; storage complexity is high in this
method, though. Cooperative caching [68] proposes a unified framework for
managing the on-chip cache. Set pinning [69] proposes a technique for avoiding
inter-core misses and reducing intra-core misses in a shared cache, which has
shown improvement in miss rate.
Adaptive caches [70] propose a method that dynamically selects between two
of four cache replacement algorithms (LRU, LFU, FIFO and Random) by tracking
locality information. The application-aware cache [71] tracks the lifetime of
cache blocks in the shared LLC at runtime for parallel applications and
thereby utilizes the cache space more efficiently. EDIP [72] proposes a method
for memory-intensive workloads in which a new block is inserted at the LRU
position, and claims to obtain a lower miss rate than LRU for workloads with
large working sets.
2.6. Cache Thrashing and Cache Pollution
One predominant problem with LRU is that it may result in thrashing (a
condition where most of the time is spent on replacement and there are hardly
any hits) when applied to memory-intensive applications. Investigations have
been conducted into reducing the effect of thrashing while making replacement
decisions [73,74]. Adaptive insertion policies [76] tweak the insertion policy
of LRU to reduce cache misses for memory-intensive workloads; the resulting
LRU Insertion Policy (LIP) outperforms LRU with respect to thrashing. The OS
buffer management technique [73] tries to reduce thrashing in LLCs by
improving the management of OS buffer caches; since it has to operate at the
OS level, it may pull down system performance. The Evicted-Address Filter
mechanism [75] keeps track of recently evicted blocks to predict the reuse of
incoming cache blocks, thereby mitigating the effects of thrashing, but at the
cost of implementation complexity.
Cache pollution happens when the cache is predominantly filled with stale
data that the processor will never reference in the future. Inefficient cache
replacement algorithms can lead to cache pollution, and cache line reuse
prediction techniques [77,78,79] are required to avoid or mitigate it. The
software-only pollute-buffer approach [45] works on eliminating LLC polluters
dynamically using a special buffer controlled at the operating system level,
but as the algorithm executes it may burden the OS, degrading performance. A
pollution-alleviative L2 cache [80] proposes a method to avoid the cache
pollution that might arise from wrong-path execution in the case of
misprediction, and claims to have improved the IPC and cache hit rate. The
scalable shared cache [81] proposes a method for managing thrashing
applications in the shared LLC of embedded multi-core architectures.
2.7. Prevalence of Multi-Threaded Applications
Concurrent multi-threaded applications have become prevalent in today’s
CMPs. Work has been done on how effectively a cache can be shared by multiple
threads running in parallel [82]. Co-operative cache partitioning [83] works
by efficiently allocating cache resources to threads concurrently executing on
multiple cores. The Thread Aware Dynamic Insertion Policy (TADIP) [84] takes
the memory requirements of individual applications into consideration when
making replacement decisions. The Adaptive Timekeeping Replacement (ATR)
policy [85] is designed to be applied primarily at the LLC. It assigns a
‘lifetime’ to every cache line; unlike other replacement strategies such as
LRU, it takes process priority levels and application memory access patterns
into consideration and adjusts the cache line lifetime accordingly to improve
performance. But the conditions that can lead to thrashing have not been
considered in either method [84,85]. Pseudo-LIFO [86] suggests a method for
the early eviction of dead blocks and retention of live blocks, which improves
cache utilization and average CPI for multi-programmed workloads.
2.8. Summary
This chapter has presented prior work on various cache replacement
algorithms, including the use of cache partitioning and bypassing, together
with the additional overhead involved in their implementation. An exhaustive
study of prior work on cache thrashing and cache pollution has helped in the
design of dynamic cache replacement schemes that can be applied in
high-performance processors. The next chapter discusses a counter-based cache
replacement technique for the shared last level cache in multi-core
processors.
3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE
Data access patterns of workloads vary across applications. A replacement
algorithm can aid in improving system performance only when it works well for
the majority of applications in existence. In short, if a replacement
algorithm could make decisions based on the access history and the current
access pattern of the workloads, the number of hits received can increase
considerably [87,88,89].
The cache replacement policy in use acts as a main deciding factor of
system performance and efficiency in chip multi-core processors. Conventional
replacement approaches work well for applications that have a well-defined
and predictable access pattern for their workloads. But when the workload of
an application is dynamic and constantly changes pattern, traditional
replacement techniques may fail to improve the cache metrics. Hence this
chapter discusses a novel cache replacement policy targeted at applications
with dynamically varying workloads. The Context Based Data Pattern
Exploitation Technique assigns a counter to every block of the L2 cache. It
then closely observes the data access patterns of various threads and modifies
the counter values appropriately to maximize overall performance. Experimental
results obtained using the PARSEC benchmarks show an average improvement of 8%
in overall hits at the L2 cache level when compared to the conventional LRU
algorithm.
3.1. Introduction
Memory management plays an essential role in determining the performance of
CMPs. Cache memory is indispensable in CMPs for enhancing data retrieval time
and achieving high performance. The level 1 (L1) cache needs to be small owing
to hardware limitations, but since it is directly integrated into the chip
that carries the processor, accessing it typically takes only a few processor
cycles compared to the access time of the next higher levels of memory.
Accessing the L2 cache may not be as fast as accessing the L1 cache, but since
it is larger and is shared by all the available cores, managing it effectively
becomes more important. This chapter puts forth a dynamic replacement approach
called CB-DPET that is targeted mainly at the shared L2 cache.
3.1.1. Different Types of Application Workloads
The following are the types of workload patterns [90] exhibited by many
applications. Let 'x' and 'y' denote sequences of references, where 'xi' denotes
the address of the i-th cache block. 'a', 'p' and 'n' are integers, where 'a' denotes
the number of times a sequence repeats itself.
Thrashing Workload:
[… (x1, x2, ..., xn)^a …], a > 0, n > cache size
These workloads have a working set size that is greater than the size of the
available shared cache. They may not perform well under LRU replacement
policy. Sample data pattern for a thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 80, n = 11 and a = any value > 0.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘10’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘16’, ‘80’ and so on.
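The thrashing behaviour can be reproduced with a short simulation. The sketch below (an illustration, not part of the proposed design) replays a working set of 11 distinct blocks through an 8-block LRU set; because each block is evicted just before it is reused, repeated passes over the pattern never hit.

```python
from collections import OrderedDict

def lru_hits(sequence, capacity):
    """Count LRU hits for one fully associative set of 'capacity' blocks."""
    cache, hits = OrderedDict(), 0          # least recently used first
    for item in sequence:
        if item in cache:
            hits += 1
            cache.move_to_end(item)         # promote to most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict the LRU block
            cache[item] = True
    return hits

pattern = [8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80]   # n = 11 > cache size 8
assert lru_hits(pattern * 3, 8) == 0               # every pass misses entirely
```

With a = 3 repetitions of the pattern, all 33 accesses miss, illustrating why LRU performs poorly on thrashing workloads.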
Regular Workload:
[… (x1, x2, ..., xn-1, xn, xn-1, ..., x2, x1)^a …], n > 0, a > 0
Applications possessing this type of workload benefit hugely from the cache
usage. The LRU policy suits them perfectly, since the data elements follow a
stack-based pattern. Sample data pattern for a regular workload for a cache which can hold 8
blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 4, 20, 11 . . .
Here, cache_size = 8 blocks, x1 = 8, x7 = 4, x8 = 5, n = 8 and a = any value > 0.
After the cache gets filled up with the eighth element '5', the remaining elements
'4', '20' and '11' result in back-to-back cache hits.
Random Workload:
[… (x1, x2, ..., xn) …], n → ∞
Random workloads possess poor cache re-use. They generally tend to thrash
under any replacement policy. Sample data pattern for a random workload for a
cache which can hold 8 blocks at a time can be as follows
. . . 1, 9, 2, 8, 0, 12, 22, 99, 101, 50, 3 . . .
Recurring Thrashing Workload:
[… (x1, x2, ..., xn)^a … (y1, y2, ..., yp) … (x1, x2, ..., xn)^a …], a > 0, p > 0, n > cache size
Sample data pattern for a recurring thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 7, 17, 46, 10, 16, 80, 90, 99, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 46, y1= 10, y5= 99, n = 11 and a = 1.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘7’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘17’, ‘46’ and so on. By
the time the initial pattern recurs towards the end, it would have been replaced
resulting in cache misses towards the end.
Recurring Non-Thrashing Workload:
[… (x1, x2, ..., xn)^a … (y1, y2, ..., yp) … (x1, x2, ..., xn)^a …], a > 0, p > 0, n <= cache size
It is similar to the recurring thrashing workload, but the recurring working set fits
within the cache (n <= cache size).
Sample data pattern for a recurring non-thrashing workload for a cache which can
hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x8 = 5, y1= 10, y3= 80, n = 8 and a = 1.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘10’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘16’ and ‘80’. Again the
initial pattern recurs towards the end. If LRU is applied here, it will result in 3 cache
misses towards the end as it would have replaced data items ‘8’, ‘9’ and ‘2’.
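A quick simulation (illustrative only) confirms this behaviour: under LRU the recurring prefix {8, 9, 2} has already been evicted by the intervening burst, so the last three accesses all miss.

```python
from collections import OrderedDict

def lru_trace(sequence, capacity):
    """Return 'h'/'m' per access for an LRU-managed set of 'capacity' blocks."""
    cache, outcome = OrderedDict(), []
    for item in sequence:
        if item in cache:
            cache.move_to_end(item)          # hit: becomes most recently used
            outcome.append('h')
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[item] = True
            outcome.append('m')
    return outcome

seq = [8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2]
assert lru_trace(seq, 8)[-3:] == ['m', 'm', 'm']   # recurring {8, 9, 2} misses
```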
In the recurring non-thrashing pattern, if 'n' is large, LRU can result in sub-
optimal performance. Data items x1 to xn, which recur after a short burst of
random data (y1 to yp), may result in cache misses. The problem here is that even
if a data item has received many cache hits in the past, if it is not referenced by the
processor for a brief period of time, LRU evicts the item from the cache. When
the processor looks for that data item again it results in a miss. This stresses the
need for a replacement technique that adapts itself to the changing workload and
makes replacement decisions appropriately.
The contributions of this chapter include:
A novel counter based replacement algorithm called CB-DPET which tries
to maximize the cache performance by closely monitoring and adapting
itself to the pattern that occurs in the sequence.
Association of a 3-bit counter (which will be referred to as the PET counter
from here onwards) with every cache block in the cache set.
A set of rules to adjust the counter value whenever an insertion, promotion
or a replacement happens such that the overall hit percentage is
maximized.
Extension of the method to suit multi-threaded applications and also to
avoid cache thrashing.
PET counter boundary determination such that there will not be any stale
data in the cache for longer periods of time, i.e. the counters of non-accessed
blocks 'mature' gradually and such blocks get removed at some point of time to
pave the way for new incoming data sets.
3.2. Basic Structure of CB-DPET
This section and the following sub-sections elaborate on the basic concepts
that make up CB-DPET, namely DPET and DPET-V. The mapping method used is
set-associative mapping [1]. A counter called the PET counter is associated with
every block in the cache set. Conceptually, this counter holds the 'age' of each
cache block. The maximum value that the counter can hold is set to '4', which
implies that only 3 bits are required to implement it at the hardware level, thereby
not increasing the complexity. The minimum value that the counter can hold is '0'.
The maximum upper bound that a 3-bit counter can hold is actually '7' (2^3 - 1),
but the value of '4' has been chosen. This is because any value chosen past '6'
can result in stale data polluting the cache for longer periods of time. Similarly, if a
value below '3' had been chosen, the performance would be more or less similar to
that of NRU. Thus the upper bound was taken to be '4'. The algorithm which
implements the task of assigning the counter values based on the access pattern
is referred to as the 'Age Manipulation Algorithm' (AMA).
3.3. DPET
A block with a lower PET counter value can be regarded as ‘younger’ when
compared to a block having a higher PET value. Initially the counter values of all
the blocks are set to ‘4’. Conceptually every block is regarded as ‘oldest’. So in
this sense, the block which is considered to be the ‘oldest’ (PET counter value 4)
is chosen as a replacement candidate when the cache gets filled up. If the oldest
block cannot be found, the counter value of every block is incremented
continuously till the oldest cache block is found. This block is then chosen for
replacement. Once the new block has been inserted into this slot, AMA
decrements the counter associated with it by '1' (it is set to '3'). This is done
because the block can no longer be regarded as 'oldest', as it has been accessed
recently. Also, it cannot be given a very low value, as the AMA is not sure about its
future access pattern. Hence it is inserted with a value of '3'. When a data hit
occurs, that block is promoted to the ‘youngest’ level irrespective of which level it
was in previously. In this way, data items which repeat themselves in a workload
are given more time to stay in the cache.
To put this in more concrete terms,
A. DPET associates a counter, which iterates from ‘0’ to ‘4’, with every cache
block.
B. Initially the counter values of all the blocks are set to ‘4’. Conceptually every
block is regarded as ‘oldest’.
C. Now when a new data item comes in, the oldest block needs to be replaced.
Since contention can arise in this case, AMA chooses the first ‘oldest’ block
from the bottom of the cache as a tie-breaking mechanism.
D. Once the insertion is made, AMA decrements the counter associated with it by
'1' (it is set to '3'). This is done because the block can no longer be regarded
as 'oldest', as it has been accessed recently. Also, it cannot be given a very
low value, as the AMA is not sure about its future access pattern. Hence it is
inserted with the value of '3'.
E. In the case where all the blocks are filled up (without any hits in between),
there will be a situation where all the blocks will have a PET counter value of
'3'. Now when a new data item arrives, there is no block with a counter value
of '4'. As a result, AMA increments all the counter values by '1' and then
performs the check again. If a replacement candidate is found this time, AMA
executes steps C and D again. Otherwise it again increments the values. This
is carried out recursively until a suitable replacement candidate is found.
F. Now consider the situation where a hit occurs for a particular data item in the
cache. This might be the start of a specific data pattern. So AMA gives high
priority to this data item and restores its associated PET counter value to '0',
terming it the 'youngest' block.
On the other hand, if a miss is encountered, the situation becomes quite similar
to what AMA dealt with in the initial steps, i.e. it scans all the blocks from the
bottom to find the ‘oldest’ block. If found it proceeds to replace it. Otherwise the
counter values are incremented and once again the same search is performed
and this process goes on till a replacement candidate is found.
3.4. DPET-V : A Thrash-Proof Version
This method is a variant of DPET (in the following sections, it will be referred to
as DPET-V). Sometimes, for workloads with a working set greater than the
available cache size, there is a possibility that there will not be any hits at all. The
entire time will be spent only in replacement. This condition is often referred to as
thrashing. The performance of the cache can be improved if some fraction of a
randomly selected working set is retained in the cache for some more time, which
will result in an increase in cache hits. Hence, very rarely, AMA decrements the PET
counter value by ‘2’ (instead of ‘1’) after an insertion is made. So the new counter
value post insertion will be ‘2’ instead of ‘3’. Conceptually it allows more time for
the data item to stay in the cache. This technique is referred to as DPET-V. The
probability of inserting with ‘2’ is chosen approximately as 5/100, i.e. out of 100
blocks, the PET counter value of (approximately) 5 blocks will be set to ‘2’ while
the others will be set to ‘3’ during insertion.
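The probabilistic insertion rule of DPET-V can be sketched as follows; the exact random source is an implementation detail not fixed by the scheme, and the function name here is illustrative:

```python
import random

def dpetv_insert_value(p_low=0.05):
    """DPET-V insertion: roughly 5 of every 100 insertions use PET value '2',
    letting those blocks 'mature' one step later and so stay in the cache
    longer; the remaining insertions use '3', exactly as in DPET."""
    return 2 if random.random() < p_low else 3

random.seed(0)
values = [dpetv_insert_value() for _ in range(10000)]
low = values.count(2)      # close to 5% of all insertions
```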
Fig 3.1 Data sequence which will result in thrashing for DPET
Algorithm 1: Implementation for Cache Miss, Hit and Insert Operations
Init:
  Initialize all blocks with PET counter value '4'.
Input:
  A cache set instance
Output:
  removing_index /** Block id of the victim **/
Miss:
  /** Invoke the FindVictim method; the set is passed as parameter **/
  FindVictim(set):
    while 1 do /** Infinite loop **/
      removing_index = -1;
      for i = associativity-1 to 0 do /** Scanning from the bottom **/
        if set.blk[i].pet == 4 then
          removing_index = i; /** Block found. Exit loop **/
          break;
      if removing_index == -1 then
        for i = associativity-1 to 0 do
          set.blk[i].pet = set.blk[i].pet + 1; /** Age all blocks by 1 and search again **/
      else
        break; /** Break from the main while loop **/
    return removing_index;
Hit:
  set.blk.pet = 0;
Insert:
  set.blk.pet = 3;
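Algorithm 1 can be rendered as runnable Python for a single cache set. This is an illustrative sketch, not the thesis's hardware implementation; the class and field names are assumptions.

```python
class CacheSet:
    """One cache set managed by DPET: a PET counter (0-4) per block."""
    def __init__(self, associativity):
        self.blk_tag = [None] * associativity
        self.blk_pet = [4] * associativity       # every block starts 'oldest'

    def find_victim(self):
        # Scan from the bottom for a block whose counter has matured to '4';
        # if none exists, age every block by 1 and scan again. This always
        # halts, since every counter is within four increments of maturing.
        while True:
            for i in range(len(self.blk_pet) - 1, -1, -1):
                if self.blk_pet[i] == 4:
                    return i
            self.blk_pet = [p + 1 for p in self.blk_pet]

    def access(self, tag):
        """Return True on a hit, False on a miss (with replacement)."""
        if tag in self.blk_tag:
            self.blk_pet[self.blk_tag.index(tag)] = 0   # hit: 'youngest'
            return True
        victim = self.find_victim()
        self.blk_tag[victim] = tag
        self.blk_pet[victim] = 3                        # insert one step young
        return False
```

For example, on an empty 4-way set the first insertion lands in the bottom-most block with a counter of '3', and a subsequent access to the same tag is a hit that resets that counter to '0'.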
Correctness of the Algorithm
Invariant:
At any iteration i, the PET counter value of set.blk[i] holds either the value '4' or
a value in the range '0' to '3'.
Initialization:
When i = 0, set.blk[i].pet is '4' initially. After the block has been selected as
victim and the new item has been inserted, set.blk[i].pet becomes '3'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].pet contains
either '0' (if a cache hit has happened), or
'3' (if the block has been newly inserted), or
'1', '2' or '4' (if a replacement needs to be made but a block with PET counter
'4' has not yet been found). Hence the invariant holds for i = n. The same applies
when i = n+1, as the individual iterations are mutually exclusive. Thus the
invariant holds for i = n + 1.
Termination:
The internal infinite loop will terminate when a replacement candidate is found.
A replacement candidate will be found in every run, since the counter value of
each block is proven to lie in the range '0' to '4'. A candidate is found as soon as
a counter value reaches '4'. Since all the blocks are incremented by '1' until a
value of '4' is found, the boundary condition "if set.blk[i].pet == '4'" is bound to
be satisfied, terminating the loop after a finite number of iterations. The invariant
can also be seen to hold for all the cache blocks present when the loop
terminates.
It is observed that if DPET is applied over the working set shown in Fig 3.1 there
will not be any hits at all (Fig 3.2(i)). After the first burst {8,9,2,6,5,0}, the PET
counter value of all blocks is set to ‘3’. When the next burst {11,12,1,3,4,15}
arrives, none of them results in a hit. Now when the third burst arrives, again all
the old blocks are replaced with new ones. Fig 3.2(ii) shows the action of DPET-
V over the same data pattern. According to DPET-V, AMA has randomly selected
two blocks (containing data items {5} and {2}) whose counter values have been
set to ‘2’ instead of ‘3’ while inserting the data burst 1. Now when burst 2 arrives,
those two blocks are left untouched as their PET counter values have not matured
to '4' yet. Finally, when the third burst of data {2,9,5,8,6} arrives, it results in two
more hits (for {5} and {2}) compared to DPET. Hence, by giving more time for
those two blocks to stay in cache, the hit rate has been improved. Fig 3.3(i), 3.3(ii),
3.3(iii) and 3.3(iv) shows the working of CB-DPET in the form of a flow diagram.
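The example can be replayed with a small simulation. To keep the run deterministic, the sketch below fixes DPET-V's random choice so that exactly the blocks holding {5} and {2} are inserted with a PET value of '2'; everything else follows the DPET rules. Names and structure are illustrative assumptions.

```python
def dpet_hits(sequence, associativity, low_pet_items=()):
    """Count hits for one set under DPET; items in low_pet_items are inserted
    with PET '2' (DPET-V's occasional younger insertion, made deterministic
    here so the example can be checked)."""
    tags, pet = [None] * associativity, [4] * associativity
    hits = 0
    for item in sequence:
        if item in tags:
            pet[tags.index(item)] = 0                  # hit: youngest
            hits += 1
            continue
        while 4 not in pet:                            # age until one matures
            pet = [p + 1 for p in pet]
        victim = max(i for i, p in enumerate(pet) if p == 4)  # bottom-most '4'
        tags[victim] = item
        pet[victim] = 2 if item in low_pet_items else 3
    return hits

bursts = [8, 9, 2, 6, 5, 0,  11, 12, 1, 3, 4, 15,  2, 9, 5, 8, 6]
assert dpet_hits(bursts, 6) == 0                        # plain DPET: no hits
assert dpet_hits(bursts, 6, low_pet_items={5, 2}) == 2  # DPET-V keeps {5},{2}
```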
Fig 3.2 (i) Contents of the cache at the end of each burst for DPET
(ii) Contents of the cache at the end of each burst for DPET-V
Fig 3.3 (i) CB-DPET flow Diagram
Fig 3.3 (ii) Victim Selection flow
Fig 3.3 (iii) Insert flow - DPET
Fig 3.3 (iv) Insert flow - DPET-V
3.5. Policy Selection Technique
A few test sets in the cache are allocated to choose between DPET and DPET-
V. Among those test sets, DPET is applied over a few sets and DPET-V is applied
over the remaining sets and they are closely observed. Based on the performance
obtained here, either DPET or DPET-V is applied across the remaining sets in the
cache (apart from the test sets). A separate global register (which from here
onwards will be referred to as Decision Making register or DM register) is
maintained to keep track of the number of misses encountered in the test sets for
both the techniques. Initially the value of this register is set to zero.
Fig 3.4 Policy selection using DM register
Fig 3.5 CB-DPET Policy selection flow
For every miss encountered in a test set which has DPET running on it, the
register value is incremented by one, and for every miss which DPET-V produces
in its test sets, AMA decrements the register value by one. Effectively, if at any
point of time the DM register value is positive, then it means that DPET has been
issuing many misses. Similarly if the register value goes negative, it indicates that
DPET-V has resulted in more misses. Based on this value, AMA makes the apt
replacement decision at run time. This entire algorithm, which alternates
dynamically between DPET and DPET-V, is termed Context Based DPET. Fig 3.4
depicts the policy selection process and Fig 3.5 shows the policy selection flow.
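The DM-register arbitration described above can be sketched as follows. The class and method names are illustrative assumptions; a real design would use a small saturating hardware counter rather than an unbounded integer.

```python
class PolicySelector:
    """Set-dueling sketch: a signed DM register arbitrates between DPET and
    DPET-V for all non-test (follower) sets."""
    def __init__(self, dpet_test_sets, dpetv_test_sets):
        self.dpet_sets = set(dpet_test_sets)
        self.dpetv_sets = set(dpetv_test_sets)
        self.dm = 0                              # DM register starts at zero

    def record_miss(self, set_index):
        if set_index in self.dpet_sets:
            self.dm += 1                         # a DPET test set missed
        elif set_index in self.dpetv_sets:
            self.dm -= 1                         # a DPET-V test set missed

    def policy(self, set_index):
        if set_index in self.dpet_sets:
            return 'DPET'
        if set_index in self.dpetv_sets:
            return 'DPET-V'
        # Positive DM: DPET has been missing more, so followers use DPET-V;
        # zero or negative (including the very first access) defaults to DPET.
        return 'DPET-V' if self.dm > 0 else 'DPET'
```

For example, two misses in DPET test sets against one in a DPET-V test set leave the register positive, so follower sets switch to DPET-V.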
3.6. Thread Based Decision Making
The concept of CB-DPET is extended to suit multi-threaded applications with a
slight modification. For every thread, a DM register and a few test sets are allocated.
Before the arrival of the actual workload, in its allocated test sets each thread is
assigned to run DPET for one half and DPET-V for the remaining half of the test
sets. The sets running DPET (or DPET-V) need not be consecutive. These sets
are selected in a random manner. They can occur intermittently i.e. a set assigned
to run DPET can immediately follow a set that is assigned to run DPET-V or vice
versa. On the whole, the total number of test sets allocated for each thread is
equally split between DPET and DPET-V. Now when the actual workload arrives,
the thread starts to access the cache sets. Two possibilities arise here.
Case 1:
The thread can access one of its own test sets. Here, if there is a data miss, the
thread updates its DM register appropriately, depending on whether DPET or
DPET-V was assigned to that set, as discussed in the previous section.
Case 2:
The thread gets to access any of the other sets apart from its own test sets. Here,
either DPET or DPET-V is applied depending on the thread’s DM register value.
If this access turns out to be the very first cache access of that thread, then its DM
register will have a value of ‘0’, in which case the policy is chosen as DPET.
3.7. Block Status Based Decision Making
When more than one thread tries to access a data item, it can be regarded as
shared data. Compared to data that is accessed only by one thread (private data),
shared data has to be placed in the cache for a longer duration. This is because
multiple threads try to access those data at different points of time and if a miss is
encountered, the resulting overhead would be relatively high. Taking this into
consideration, it is possible to find out whether each block has shared or private
data (using the id of the thread that accesses it) and decide on the counter values
appropriately. To serve this purpose, there is a ‘blockstatus’ register associated
with every cache block. When a thread accesses that block, its thread id is stored
into the register. Now when another thread tries to access the same block, the
value of the register is set to some number that does not fall within the valid
thread-id range. So if the register holds that number, the corresponding block is
regarded as shared. Otherwise it is treated as private.
3.8. Modification in the Proposed Policies
In case of DPET, when a block is found to be shared, insertion is made with a
counter value of ‘2’ instead of ‘3’. Similarly in DPET-V, for shared data, insertion
is always made with a PET counter value of ‘2’. If a hit is encountered in either of
the methods, the PET counter value is set to ‘0’ (as discussed earlier) only if the
block is shared. For private blocks, the counter value is set to ‘1’ when there is a
hit.
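The block-status bookkeeping and the modified counter assignments can be summarised in a small sketch. The SHARED sentinel stands in for the "number outside the thread-id range" described above; all names here are illustrative assumptions.

```python
SHARED = -1   # sentinel outside the valid thread-id range (assumption)

def update_blockstatus(status, thread_id):
    """First accessor's id is recorded; a different accessor marks SHARED."""
    if status is None or status == thread_id:
        return thread_id
    return SHARED

def pet_on_insert(status):
    # Shared blocks are inserted 'younger' (PET '2') so they survive longer;
    # private blocks follow the ordinary DPET insertion value of '3'.
    return 2 if status == SHARED else 3

def pet_on_hit(status):
    # Only shared blocks are promoted all the way to 'youngest' on a hit.
    return 0 if status == SHARED else 1

status = update_blockstatus(None, 7)      # private: owned by thread 7
status = update_blockstatus(status, 3)    # a second thread makes it shared
assert status == SHARED
assert pet_on_insert(status) == 2 and pet_on_hit(status) == 0
```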
Fig 3.6 Data sequence containing a recurring pattern {1,5,7}
3.9. Comparison between LRU, NRU and DPET
3.9.1. Performance of LRU
The behaviour of LRU for the data sequence in Fig 3.6 is shown in Fig 3.7. It is
assumed that there are five blocks in the cache set under consideration. The
workload sequence (divided into bursts) is specified in Fig. 3.7. Below every
data, ‘m’ is used to indicate a miss in the cache and ‘h’ is used to indicate a hit.
Alphabet ‘N’ in the cache denotes Null or Invalid data. The cache set can be
viewed as a queue with insertions taking place from the top of the queue and
deletion taking place from the bottom. Additionally, if a hit occurs, that particular
data is promoted to the top of the queue, pushing other elements down by one
step. Hence at any point of time, the element that has been accessed recently is
kept at the top and the ‘least recently used’ element is found at the bottom, which
becomes the immediate candidate for replacement. In the above example, initially
the stream {1, 5, 7} results in misses and the data is fetched from the main memory
and placed in the cache.
Fig 3.7 Behaviour of LRU for the data sequence
Now the data item {7} hits and is brought to the top of the queue (in this case it is
already at the top). Data items {1} and {5} result in two more consecutive hits,
bringing {5} to the top. The ensuing elements {9, 6} result in misses and are
brought into the cache one after another. Among the burst {5, 2, 1}, the data item
{2} alone misses. Then {0} too results in a miss and is brought from the main
memory. In the stream {4, 0, 8}, the data item {0} hits. Finally, the burst {1, 5, 7}
re-occurs towards the end, where {5, 7} result in two misses, as indicated in
Fig. 3.7. The overall number of misses encountered here amounts to eleven.
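The walk-through can be checked with a short simulation. The access sequence below is reconstructed from the description above (an assumption, since the figure itself is not reproduced here); a 5-block LRU set indeed records eleven misses.

```python
from collections import OrderedDict

def lru_miss_count(sequence, capacity):
    """Count LRU misses for one fully associative set of 'capacity' blocks."""
    cache, misses = OrderedDict(), 0        # least recently used first
    for item in sequence:
        if item in cache:
            cache.move_to_end(item)         # hit: promoted to the top
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict from the bottom
            cache[item] = True
    return misses

seq = [1, 5, 7, 7, 1, 5, 9, 6, 5, 2, 1, 0, 4, 0, 8, 1, 5, 7]
assert lru_miss_count(seq, 5) == 11
```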
3.9.2. Performance of NRU
The working of NRU on the given data set is shown in Fig. 3.8. NRU associates
a single bit counter with every cache line which is initialized to ‘1’. When there is
a hit, the value is changed to ‘0’. A ‘1’ indicates that the data block has remained
unreferenced in the cache for a longer period of time compared to other blocks.
Thus the block from the top which has a counter value of ‘1’ is selected as a victim
for replacement. Scanning from the top helps in breaking the tie. If such a block
is not found, the counter values of all the blocks are set to ‘1’ and a search is
performed again. In Fig. 3.8, counter values which get affected in every burst are
underlined for more clarity. As observed from the figure, NRU results in one
miss more than the overall misses encountered with LRU. Towards the end, the
burst {5,7} results in two misses. This is primarily because NRU tags the stream
{5,7} as 'not recently used', as it had not seen them for some time, and thus
chooses them as replacement victims. In the long run, as the data set size
increases in a similar fashion, the number of misses will increase considerably.
One feature that is common to both algorithms is that they follow the
same technique irrespective of the pattern that the data set follows. This might
minimize the hit rate in some cases. In the case of LRU, if an item is accessed by
the processor, it is immediately taken to the top of the queue, thereby pushing the
remaining set of elements one step below, irrespective of whatever pattern the
data set might follow thereafter. Thus, even if any element in this group that gets
pushed down repeats itself after some point of time, it might have been marked
as 'least recently used' and evicted from the cache. NRU is also not powerful
enough, as it has only a single-bit counter, which does not allocate enough time
for data items which might be accessed in the near future.
Fig 3.8 Behaviour of NRU for the data sequence
3.9.3. Performance of DPET
Fig. 3.9 shows the contents of the cache at various points when DPET is
applied, along with the burst of input data which is processed at that moment. The
PET counter value associated with every block is shown, as a sub-script to the
actual data item and the counter value of the block which is directly affected at
that instant is underlined for more clarity. The input data set has the letters ‘m’ and
‘h’ under every data item to indicate a miss or a hit respectively. The behaviour is
studied with a cache set consisting of five blocks. Initially AMA assigns a value of
'4' to the PET counters associated with all the blocks. As seen from the figure, the burst
{1,5,7,7} arrives initially. Data items {1,5,7} are placed in the cache in a bottom up
fashion and their PET values are decremented by ‘1’. Now when the data item {7}
arrives again, it hits in the cache and thus the counter value associated with {7} is
made ‘0’. Data items {1,5} arrive individually and result in two hits. The
corresponding PET values are set to ‘0’ as indicated in the figure. The next burst
contains elements {9,6}, both resulting in cache misses. They are fetched from
the main memory and stored in the remaining two blocks of the cache set and
their counter values are decremented appropriately. The fifth data burst comprises
the data items {5,2,1}. Among these, {5} is already present in the cache and hence
its counter value is set to ‘0’. Data item {2} is not present in cache. So as with
DPET, AMA searches for a replacement candidate bottom up, which has its
counter value set to ‘4’. Since such a block could not be found, the PET values of
all the blocks are incremented by ‘1’ and the search is performed again. Now the
PET value of the block containing {9} would have matured to ‘4’ and it is replaced
by {2} and its counter value is decremented to ‘3’. Data item {1} hits in the cache.
The overall number of misses is reduced compared to LRU and NRU.
Fig 3.9 Behaviour of DPET for the data sequence
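Replaying the same reconstructed sequence (an assumption, as the figure itself is not reproduced here) through the PET-counter rules confirms the reduction: DPET records nine misses against LRU's eleven on a 5-block set.

```python
def dpet_miss_count(sequence, associativity):
    """Replay DPET over one cache set and count misses (illustrative sketch)."""
    tags, pet = [None] * associativity, [4] * associativity
    misses = 0
    for item in sequence:
        if item in tags:
            pet[tags.index(item)] = 0              # hit: youngest
            continue
        misses += 1
        while 4 not in pet:                        # age all blocks until
            pet = [p + 1 for p in pet]             # one matures to '4'
        victim = max(i for i, p in enumerate(pet) if p == 4)   # bottom-most '4'
        tags[victim], pet[victim] = item, 3        # insert with PET '3'
    return misses

seq = [1, 5, 7, 7, 1, 5, 9, 6, 5, 2, 1, 0, 4, 0, 8, 1, 5, 7]
assert dpet_miss_count(seq, 5) == 9                # vs. eleven misses under LRU
```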
3.10. Results
To evaluate the effectiveness of the replacement policy on real workloads, a
performance analysis has been done with the Gem5 simulator using PARSEC
benchmark programs, discussed in Chapter 6, as input sets. The results are
shown in this section.
3.10.1. CB-DPET Performance
The effectiveness of this method is visible in the following figures, as it has
surpassed LRU. Fig. 3.10 shows the overall number of hits obtained for each
workload with respect to CB-DPET and LRU at the L2 cache. The overall number
of hits obtained in the ROI is taken. These numbers have been scaled down by
appropriate factors to keep them around the same range. It is observed from the
figure that CB-DPET has resulted in an improvement in the number of hits
compared to LRU for a majority of the workloads, with fluidanimate showing high
performance with respect to the baseline policy.
Fig 3.10 Overall number of hits recorded in L2 cache
Fig. 3.11 shows the hit rate obtained for individual benchmarks. Hit rate is given
by the following expression.
Hit Rate = (Overall number of hits in ROI) / (Overall number of accesses in ROI)
Six of the seven benchmarks have reported improvements in the hit rate.
Fluidanimate shows a maximum improvement of 8%, whereas swaptions and vips
have exhibited hit rates on par with LRU.
Fig. 3.12 shows the number of hits obtained in the individual parallel phases. As
discussed earlier, the ROI can be further divided into two phases denoted by PP1
(Parallel Phase 1) and PP2 (Parallel Phase 2). These phases need not be equal
in complexity, number of instructions, etc. There can be cases where PP1 might
be more complex with a huge instruction set compared to PP2 or vice versa. The
number of hits recorded in these phases is shown in the graph in Fig. 3.12. CB-
DPET has shown an increase in hits in most of the phases of majority of the
benchmarks.
Fig 3.11 L2 cache hit rate
Miss latency at a memory level refers to the time spent by the processor in
fetching data from the next level of memory (which is the primary memory in this
case) following a miss at that level. Miss latency at the L2 cache for individual
cores is shown in Fig. 3.13. It is usually measured in cycles. To present it in a
more readable format, it has been converted to seconds using the following
formula:
Miss latency in secs = (Miss latency in cycles) / (CPU Clock Frequency)
The impact of miss latency is an important factor to be taken into consideration. If
the miss latency is high, the overhead involved in fetching the data will be high,
thereby resulting in a performance bottleneck.
Fig 3.12 Hits recorded in individual phases of ROI
Fig 3.13 Miss latency at L2 cache for individual cores
Overall miss latency at L2 is also shown in Fig. 3.13. Note that this is with respect
to the ROI of the benchmarks. Considerable reduction in latency is observed
across the majority of the benchmarks. Vips has shown the maximum reduction in
overall miss latency. Except for ferret and swaptions, all the other workloads have
shown a reduction in the overall miss latency.
Percentage increase in Hits = ((Hits in CB-DPET − Hits in LRU) / Hits in LRU) × 100
Fig. 3.14 shows the percentage increase in hits obtained from CB-DPET
compared to LRU. The figure shows that there is an increase in the hit percentage
for all the benchmarks except ferret. Swaptions exhibits the highest percentage
increase of 13.5%, whereas dedup has shown a 2% increase. An approximate 8%
improvement over LRU is obtained as the average hit percentage when CB-DPET
is applied at the L2 cache level. This is because CB-DPET adapts to the changing
pattern of the incoming workload and retains the frequently used data in the
cache, thus maximizing the number of hits. Fig. 3.15 shows the overall number of
misses recorded by each benchmark. A decrease in the number of misses is
observed across a majority of the benchmarks compared to LRU.
Fig 3.14 Percentage increase in L2 cache hits
Fig 3.15 Overall number of misses at L2 Cache
3.11. Summary
When it comes to concurrent multi-threaded applications in a CMP
environment, LLC plays a crucial role in the overall system efficiency.
Conventional replacement strategies may not be effective in many instances as
they do not adapt dynamically to the data requirements. Numerous research
efforts [45,46,79,90] are being made to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. Both LRU and NRU schemes tend
to perform poorly on data sets possessing irregular patterns because they do not
store any extra information regarding the recent access pattern of the data items.
They make decisions purely based on the current access only [15,45,51,79]. This
chapter hence proposes a scheme called CB-DPET which tries to address the
shortcomings arising from the conventional replacement methods and improve
the performance.
The conclusions of this chapter are as follows:
CB-DPET takes a more systematic and access-history based decision
compared to LRU. A counter associated with every cache block tracks the
behavior of the application's access pattern, based on which the replacement,
insertion and promotion decisions are made.
DPET-V is a thrash-proof version of DPET. Here some randomly selected data
items are allowed to stay in the cache for a longer time by adjusting their counter
values appropriately.
In multi-threaded applications, a DM register and some test sets are allocated
for every thread. Based on the number of misses in the test sets, a best policy
is chosen for the remaining sets.
These algorithms are made thread-aware by using the thread id, which identifies
the thread that is trying to access a cache block. When multiple threads try to
access the same block, it is tagged as shared and allowed to stay in the cache
longer than other blocks by appropriately adjusting its counter value.
Evaluation results on the PARSEC benchmarks have shown that CB-DPET
achieves an average improvement of up to 8% in the overall hits when compared
to LRU at the L2 cache level.
4. LOGICAL CACHE PARTITIONING TECHNIQUE
It is imperative for any level of cache memory in a multi-core architecture to have
a well-defined, dynamic replacement algorithm in place to ensure consistently
high performance [92,93,94,95]. At the L2 cache level, the most prevalently used
LRU replacement policy does not adapt dynamically to changes in the workload.
As a result, it can lead to sub-optimal performance for certain applications whose
workloads exhibit frequently fluctuating patterns. Hence many works have strived
to improve performance at the shared L2 cache level [91,98,99].
To overcome this limitation of the conventional LRU approach, this chapter
proposes a novel counter-based replacement technique which logically partitions
the cache elements into four zones based on their ‘likeliness’ to be referenced by
the processor in the near future. Experimental results obtained by using the
PARSEC benchmarks have shown almost 9% improvement in the overall number
of hits and 3% improvement in the average cache occupancy percentage when
compared to LRU algorithm.
4.1. Introduction
Cache hits play a vital role in boosting the system performance. When there are
more hits, the number of transactions with the next levels of memories is reduced
thereby expediting the overall data access time. Many replacement techniques
use the cache miss metric to determine the replacement victim but in this chapter,
choosing the victim is based on the number of hits received by the data items in
the cache. Based on the hits received, the cache blocks are logically partitioned
into separate zones. These virtual zones are searched in a pre-determined order
to select the replacement candidate.
Contributions of this chapter include:
A novel counter based replacement technique that logically partitions the
cache into four different zones namely – Most Likely to be Referenced (MLR),
Likely to be Referenced (LR), Less Likely to be Referenced (LLR), Never
Likely to be Referenced (NLR) in the decreasing order of their likeliness
factor.
Replacement, insertion and promotion of data elements within these zones
in such a manner that the overall hit rate is maximized.
Association of a 3-bit counter with every cache line to categorize the
elements into different zones. On a cache hit, the corresponding element
is promoted from one zone to another zone.
Selection of replacement candidates from the zones in the ascending order
of their 'likeliness factor', i.e. the first search space for the victim would be
the never likely to be referenced zone, followed by the subsequent zones
till the most likely to be referenced zone is reached.
Periodic zone demotion of elements also occurs to make sure that stale
data does not pollute the cache.
4.2. Logical Cache Partitioning Method
LCP uses a 3-bit counter to dynamically shuffle the cache elements and logically
partition them into four different zones based on their likeliness to be referenced by
the processor in the near future. This counter will be referred to as LCP (Logical
Cache Partitioning) counter in subsequent sections. The lower and upper bounds for
the counter are set to ‘0’ and ‘7’ respectively. The prediction about the likeliness of
reference of the data items is made with the help of the hits encountered in the cache.
As the number of hits for a particular element increases, it is moved up the zone
list till it reaches the MLR region.
Only if it stays unreferenced for a considerable amount of time is it evicted from
the cache. As their names imply, the zones are arranged in the decreasing order
of their likeliness factor. Table 4.1 shows the counter values and their
corresponding zones. Like any replacement policy, this method consists of three
phases: Replacement, Insertion and Promotion, each of which is explained in the
subsequent sub-sections.
Table 4.1 LCP zone categorization

Counter Value Range   Zone
6-7                   MLR
3-5                   LR
1-2                   LLR
0                     NLR
4.3. Replacement Policy
When a miss is encountered in the cache, the data item is fetched from the
secondary memory and brought into the cache thereby replacing one of the
elements which was already present in the cache. This element is often referred to
as the victim. The process of selecting a victim needs to be efficient in order to
improve the hit rate. In this method, the victim is selected from the zones in the
increasing order of their likeliness factor. Any cache line which comes under the
NLR zone is considered first for replacement. If no such line is found, the search is
performed again to check whether any element falls in the LLR region, and so on till MLR
is reached. If two or more lines possess the same counter value that is considered
for replacement during that iteration, the line that is encountered first is chosen as
the victim.
Once the replacement is made, the LCP counters of all the other data elements are
decremented by '1'. This is done to carry out gradual zone demotion, as discussed
earlier, to flush out unused data items from the cache.
4.4. Insertion Policy
This phase is encountered as soon as the replacement candidate is found. The
new incoming data item is inserted into the corresponding cache set and its LCP
counter value is set to ‘2’ (LLR zone).
4.5. Promotion Policy
A hit on any data item in the cache triggers the promotion phase. The LCP value
associated with the cache line is set to the maximum value of its immediate
upper zone. For example, if it was earlier in the LLR zone (LCP value '1' or '2'),
its LCP value is set to '5', i.e. the element now falls within the LR zone. By
moving elements between zones according to the workload pattern, LCP adapts
more dynamically than LRU.
Flow: the processor looks for a particular block in the cache, and the LCP
algorithm starts searching for the block. On a HIT, the block's LCP counter value
is set to the maximum value of the next higher zone. On a MISS, a replacement
victim is found in the cache and the processor brings the data item from the next
level of memory.
Fig 4.1(i) LCP flow diagram
4.6. Boundary Condition
It is essential that the LCP counter value does not overshoot its specified range.
Thus whenever it is modified, a boundary condition check is carried out to ensure
that the value does not go below ‘0’ or beyond ‘7’. Fig. 4.1(i), 4.2(ii) shows the
working of LCP.
Flow: the LCP algorithm starts searching for a replacement candidate in the
cache from zone X, where X denotes NLR initially. It searches for the first block
whose LCP counter value falls within the range of zone X. If such a block is
found, it is chosen as the replacement victim; the new data item is inserted there
and its LCP counter value is set to '2' (LLR zone). If not, X is incremented to
point to the next higher zone and the search is repeated.
Fig 4.1 (ii) Victim Selection flow for LCP
Algorithm 2: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with LCP counter value ‘0’.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. Set is passed as parameter **/
FindVictim (set): i = 0
removing_index = -1
while i < 8 do
    /** Scanning the set **/
    for j = 0 to associativity do
        /** Search from the NLR zone. LCP counter = 0 **/
        if set.blks[j].lcp == i then
            /** Block Found. Exit Loop **/
            removing_index = j;
            break;
    if removing_index != -1 then
        break;
    i = i + 1;
for j = 0 to associativity do
    /** Gradual Zone Demotion **/
    if set.blks[j].lcp > 0 then
        set.blks[j].lcp = set.blks[j].lcp − 1;
return removing_index
Hit:
/** Boundary Condition Check **/
if set.blk.lcp ! = 7 then
/** NLR Zone. Promote to LLR **/
if set.blk.lcp == 0 then
set.blk.lcp = 2;
/** LLR Zone. Promote to LR **/
else if set.blk.lcp == 1 or set.blk.lcp == 2 then
set.blk.lcp = 5;
/** LR Zone. Promote to MLR **/
else
set.blk.lcp = 7
Insert:
/** LLR Zone **/
set.blk.lcp = 2;
Correctness of Algorithm
Invariant: At any iteration i, the LCP counter value of set.blk[i] holds a value in
the range [0,7].
Initialization:
When i = 0, set.blk[i].lcp will be ‘0’ initially. After being selected as victim and
after the new item has been inserted, set.blk[i].lcp will be set to ‘2’. Hence
invariant holds good at initialization.
Maintenance:
At i = n, set.blk[i].lcp will contain
either zero (Cache block yet to be occupied for the first time) or
set.blk[i].lcp will have ‘2’ (If it has been newly inserted) or
set.blk[i].lcp will have ‘1’ or ‘3’ or ‘4’ or ‘6’ (As a part of gradual zone demotion)
set.blk[i].lcp will hold ‘5’ or ‘7’ (On a cache hit from previous zone).
“if blk.lcp ! = 7” condition ensures that the lcp counter value does not overshoot
its range. Hence invariant holds good for i = n.
Same case when i = n+1 as the individual iterations are mutually exclusive. Thus
invariant holds good for i = n + 1.
Termination:
Increment of the loop control variable 'i' ensures that the loop terminates within a
finite number of runs, which is '8' in this case. The condition "if set.blks[j].lcp > 0"
ensures that the lcp counter value of the block does not go below '0'. The
invariant can also be found to hold good for all the cache blocks present
when the loop terminates.
Fig 4.2 Working of LRU and LCP for the data set
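Algorithm 2 can be transcribed into a short runnable model. The following Python sketch is illustrative only (the thesis implements the policy inside the Gem5 cache model); the class and method names are assumptions:

```python
class LCPSet:
    """One cache set under the LCP policy; blocks hold only their 3-bit counter."""

    def __init__(self, associativity):
        self.lcp = [0] * associativity  # all blocks start in the NLR zone

    def find_victim(self):
        """Scan counter values from the NLR zone upward; on eviction,
        demote every other block and insert the new item in the LLR zone."""
        removing_index = -1
        for value in range(8):                 # LCP counter values 0..7
            for j, c in enumerate(self.lcp):
                if c == value:
                    removing_index = j         # first match in lowest zone wins
                    break
            if removing_index != -1:
                break
        for i in range(len(self.lcp)):         # gradual zone demotion
            if i != removing_index and self.lcp[i] > 0:
                self.lcp[i] -= 1
        self.lcp[removing_index] = 2           # insert: LLR zone
        return removing_index

    def hit(self, j):
        """Promote block j to the top value of its immediate upper zone."""
        c = self.lcp[j]
        if c == 0:
            self.lcp[j] = 2    # NLR -> LLR
        elif c <= 2:
            self.lcp[j] = 5    # LLR -> LR
        else:
            self.lcp[j] = 7    # LR/MLR -> MLR top (boundary check)
```

For example, in an empty 4-way set the first victim is block 0; after a hit its counter jumps from the LLR zone to '5' in the LR zone.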
4.7. Comparison with LRU
The following data set is considered.
. . . 2 9 1 7 6 1 7 5 9 1 0 7 5 4 8 3 6 1 7 . . .
It is observed that data items 1 and 7 occur frequently. Fig. 4.2 shows the
working of LRU and LCP on this data pattern. Assume that the cache can hold 4
blocks at a time. Incoming data items at that point of time are shown in the leftmost
column.
In the rightmost column, 'm' indicates a miss and 'h' indicates a hit. Initially the
cache contains invalid blocks. Counter values associated with all the blocks are
set to -1. After applying both the techniques, LCP has resulted in 3 hits more than
LRU. Frequently occurring data items 1 and 7 have resulted in hits towards the
end unlike LRU. This is primarily because LRU follows the same approach
irrespective of the workload pattern and tags 1 and 7 as ‘least recently used’.
4.8. Results
Full system simulation using Gem5 simulator with PARSEC benchmarks
(discussed in Chapter 6) as input sets shows improvement in many cache metrics.
To measure the performance of LCP, metrics like hit percentage increase, number
of replacements made at L2, average cache occupancy and miss rate are used.
Results are shown for two, four and eight core systems.
4.8.1. LCP Performance
Subsequently, the term LCP will be used to represent this method in the graphs.
Percentage increase in the overall number of hits at L2 is shown in Fig. 4.3.
Fig 4.3 Percentage increase in overall number of hits at L2 cache
Fig 4.4 Difference in number of replacements made at L2 cache
Fig. 4.4 shows the total number of replacements made at L2. Fig. 4.5 shows the
average L2 cache occupancy percentage for all the workloads. Fig. 4.6 shows the
miss rate recorded by the individual cores and also the overall miss rate at L2.
Fig 4.5 Average cache occupancy percentage
Miss rate is defined as the ratio of number of misses to the total number of
accesses. As observed from Fig. 4.6, there is a reduction in the core-wise miss
rate and overall miss rate across the majority of the benchmarks when compared
to LRU.
Apart from two workloads, all the others have shown significant improvement in
hits. Percentage increase in hits varies from a minimum of 0.1% (dedup) to
maximum of up to 11% (ferret) as shown in Fig. 4.3.
Number of replacements reflects the efficiency of any replacement algorithm. A
maximum of almost 20% reduction in the number of replacements is observed
across the given workloads compared to LRU from Fig. 4.4. Cache occupancy
refers to the amount of cache that is being effectively utilized to improve the
performance for any workload. Fig. 4.5 indicates cache utilization is higher for
majority of the benchmarks when LCP is applied compared to LRU. Vips utilizes
the cache space to the maximum possible extent.
Fig 4.6 Miss rate recorded by individual cores
4.9. Summary
This chapter discussed a dynamic and structured replacement
technique that can be adopted at the LLC. Key points pertaining to this
replacement policy are as follows:
Elements of the cache are logically partitioned into four zones based on their
likeliness to be referenced by the processor with the help of a 3-bit LCP counter
associated with every cache line. The minimum value that the counter can hold
is ‘0’ and the maximum value is ‘7’.
Replacement candidates are chosen from the zones in the increasing order of
their likeliness factor starting from the NLR zone. Initially all the counter values
are set to ‘-1’. Conceptually all the blocks contain invalid data.
For every hit, the corresponding element is moved up by one zone by adjusting
the LCP counter value. If it has reached the top most zone (MLR) the counter
value is left untouched.
For every miss, the counter value is decremented by '1' to prevent stale data
items from polluting the cache as the algorithm executes. When a new data item
arrives, its LCP value is set to '2'. A boundary condition check needs to be applied
whenever the LCP counter value is modified to make sure that it does not
overshoot its designated range.
Experimental results obtained by applying the method on PARSEC
benchmarks have shown a maximum improvement of 9% in the overall number
of hits and 3% improvement in the average cache occupancy percentage when
compared to the conventional LRU approach.
5. SHARING AND HIT BASED PRIORITIZING TECHNIQUE
The efficiency of cache memory can be enhanced by applying different
replacement algorithms. An optimal replacement algorithm must exhibit good
performance with minimal overhead. Traditional replacement methods that are
currently being deployed across multi-core architecture platforms try to classify
elements purely based on the number of hits they receive during their stay in the
cache. In multi-threaded applications, data can be shared by multiple threads
(which might run on the same core or across different cores). Those data items
need to be given more priority than private data because a miss on them may
stall the functioning of multiple threads, resulting in a performance bottleneck.
Since the traditional replacement approaches do not possess this additional
capability, they might lead to inadequate performance for most current
multi-threaded applications.
To address this limitation, this chapter proposes a Sharing and Hit based
Prioritizing (SHP) replacement technique that takes the sharing status of the data
elements into consideration while making replacement decisions. Evaluation
results obtained using multi-threaded workloads derived from the PARSEC
benchmark suite show an average improvement of 9% in the overall hit rate when
compared to the LRU algorithm.
5.1. Introduction
In multi-threaded applications, when threads run across different cores, the
activity in the L2 cache tends to shoot up as multiple threads try to store and
retrieve shared data. At this point there are many challenges that need to be addressed.
Cache coherence has to be maintained, the replacement decisions made need to
be judicious and the available cache space must be utilized efficiently. When there
is a miss on a data item which is shared by multiple threads, the execution of many
threads gets stalled. This is because when the first thread, which had encountered
a miss, is attempting to fetch the data from the next level of memory, any
subsequent thread which tries to access the same data will result in miss.
Hence it becomes imperative to handle shared data with more care than other
data items. Traditional replacement algorithms like LRU, MRU, etc. do not possess
this capability. They classify elements based on when they will be required by the
processor but do not check the status of the cache block, i.e. whether it is shared
or private at any point of time. Also, when thousands of threads are running in
parallel, simply tagging a block as 'shared' or 'private' may not be very useful. In
those cases, additional information about the shared status can prove handy. This
chapter provides an overview of a novel counter-based
cache replacement technique that associates a ‘sharing degree’ with every cache
block. This degree specifies a set of ranges which includes – not shared (or private),
lightly shared, heavily shared and very heavily shared. This information combined
along with the number of hits received by the element during its stay in the cache is
used to produce a priority for that element based on which judicious replacement
decisions are made.
Contributions of this chapter include:
A novel counter based replacement algorithm that makes replacement
decisions based on the sharing nature of the cache blocks
Association of a 'sharing degree' with every cache line, which indicates the
extent to which the element is shared depending on the number of threads
that try to access that element.
Four degrees of sharing namely – not shared (or private), lightly shared,
heavily shared and very heavily shared.
A replacement priority for every cache line which is arrived upon by
combining the sharing degree along with the number of hits received by
the element based on which the replacement decisions are made.
5.2. Sharing and Hit Based Prioritizing Method
Similar to the other two replacement algorithms, every cache block is associated
with a counter. This 2-bit counter is called the Sharing Degree Counter or SD
counter. Table 5.1 shows all four possible values that the counter can hold
and their individual descriptions. Depending on the number of threads that access
a cache block, it is classified into any one of the sharing categories as shown in
the table. To collect the sharing status of the cache blocks, it is essential to have
an efficient data structure in place to track the number of threads that access
the data item. For this purpose there is a filter referred to as the Thread
Tracker filter (TT filter). It is a flexible, dynamic, software-based array that is
created and associated with every cache block at run time.
When a thread tries to access a data item, a search is conducted in the TT filter
to check if the thread id is already present in it. If not, then the id is stored in the
corresponding cache block’s TT filter. Size of the filter expands as and when a
thread id gets added to it. Once a cache block is about to be evicted, the memory
allocated for the associated TT filter is freed. At any point of time, with the help of
thread ids that are found in the TT filter, the sharing degree counter is populated
for every cache block.
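Assuming the sharing degree is derived purely from the TT filter length, the mapping of Table 5.1 can be sketched as follows (the function name is illustrative):

```python
def sharing_degree(num_threads: int) -> int:
    """Map the number of distinct accessing threads (TT filter length)
    to the 2-bit sharing degree counter value, per Table 5.1."""
    if num_threads <= 1:
        return 0   # private / not shared
    if num_threads <= 3:
        return 1   # lightly shared
    if num_threads <= 7:
        return 2   # heavily shared
    return 3       # very heavily shared
```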
5.3. Replacement Priority
The sharing status of the blocks alone cannot be used to make replacement
decisions. Number of hits garnered by the element is also an important factor to
consider when performing a replacement. Hence a priority for every individual
cache block has been arrived at, that not only gives more weightage to sharing
nature but also attaches importance to the number of cache hits garnered by the
element. The priority computation equation is as follows:
Priority = (Sharing Degree + 1) × (Hits Count / 2) … … … (4)
Sharing degree indicates the sharing degree counter value for the particular
cache block at that particular point of time. A software based hit counter is
associated with every cache block that provides an estimate of the hits received
by the element during its stay in the cache. This counter is refreshed every time a
new element enters the block. The upper bound of the counter is fixed to a specific
value. If that value is reached, the counter is considered saturated and any hits
received beyond that will not be taken into account. Experimental results have
shown that in a unified cache memory, instructions account for about 50% of the
contents on average and the remaining 50% forms the data. For the calculation
of priority, the analysis rests mainly upon the data part, as it is accessed much
more frequently than instructions. Also, due to locality of reference, several
threads can access the data several times. To reflect the former, the hit count
value is divided by 2; to reflect the latter, the hit count value is scaled by a factor
of the sharing degree counter value. Whenever either of these counters changes,
the priority of the element needs to be recomputed.
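The priority computation of equation (4) reduces to a one-line function (a sketch; both counters are modelled as plain integers):

```python
def shp_priority(sharing_degree: int, hit_count: int) -> float:
    """Priority = (Sharing Degree + 1) * Hits Count / 2 -- equation (4)."""
    return (sharing_degree + 1) * hit_count / 2
```

For instance, a private block (degree 0) with 4 hits gets priority 2.0, while a very heavily shared block (degree 3) with the same 4 hits gets 8.0, illustrating the extra weight given to sharing.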
Cache replacement policy consists of deletion, insertion and promotion
operations. Each phase of the algorithm is explained in detail in the subsequent
sections. The mapping method used is set associative mapping.
5.4. Replacement Policy
When the cache becomes full, replacement has to be made to pave way for new
incoming data items. SHP makes replacement decisions based on the computed
priority values. Elements are evicted in the increasing order of their priority. The
element with the least priority in the list is chosen as the victim. If more than one
element has the same least priority, then the one which is encountered first while
scanning the cache is taken as the victim as a tie-breaking mechanism. It is also
essential to ensure that stale data do not pollute the cache for longer periods of
time.
From this aspect the hit counter can be regarded as an ageing counter. Every
time a victim is found, the hit counter of all the other elements in the cache is
decremented and their corresponding priorities are re-computed. If any element
remains unreferenced for a long period of time, its priority will gradually decrease
and the element will eventually be flushed out of the cache.
Algorithm 3: Implementation for Cache Miss, Hit and Insert Operations
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. Set id is passed as parameter **/
FindVictim(set)
FindVictim(set):
min = ∞
for i = 0 to associativity do
/** Choose the minimum priority element **/
if set.blk[i].priority < min then
min = set.blk[i].priority;
removing_index = i;
return removing_index;
Hit:
set.blk.hit_counter = set.blk.hit_counter + 1;
/** Scan the TT Filter **/
thread_found = 0;
for i = 0 to length(TT_Filter) do
    if TT_Filter[i] == accessing_thread_id then
        thread_found = 1;
        break;
/** If thread id not found in TT Filter, append it **/
if thread_found != 1 then
    append accessing_thread_id to TT_Filter;
set set.blk.sharing_degree counter value [0 to 3] according to TT_Filter length
set.blk.priority = (set.blk.sharing_degree + 1) * (set.blk.hit_counter) / 2;
Insert:
/** Private block **/
set.blk.sharing_degree = 0;
refresh blk.hit_counter;
Correctness of Algorithm
Invariant: At any iteration i, the sharing degree counter value of set.blk[i] holds a
value in the range [0,3].
Initialization:
When i = 0, set.blk[i]. sharing_degree will be set to ‘0’ initially. Hence invariant
holds good at initialization.
Maintenance:
At i = n, set.blk[i].sharing_degree will contain
either zero (Private) or
set.blk[i].sharing_degree will have ‘1’ (Lightly shared) or
set.blk[i].sharing_degree will have ‘2’ (Heavily Shared) or
set.blk[i].sharing_degree will hold ‘3’ (Very Heavily Shared).
Hence invariant holds good for i = n.
Same case when i = n+1 as the individual iterations are mutually exclusive. Thus
invariant holds good for i = n + 1.
Termination:
The priority computation loop is limited by the associativity of the cache which
will always be finite. TT filter iteration is limited by the size of the TT filter which
will be equal to the total number of threads executing. It will also be finite. The
invariant can also be found to hold good for all the cache blocks present
when the loops terminate.
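Algorithm 3, together with the ageing step of Section 5.4, can be exercised with a small Python model. This is an illustrative sketch (the names are assumptions; the real policy lives inside the simulator):

```python
class SHPBlock:
    """One cache block under SHP: TT filter, hit counter and priority."""

    def __init__(self):
        self.tt_filter = []       # distinct thread ids that accessed this block
        self.hit_counter = 0
        self.priority = 0.0

    def sharing_degree(self):
        n = len(self.tt_filter)   # Table 5.1 ranges
        return 0 if n <= 1 else 1 if n <= 3 else 2 if n <= 7 else 3

    def recompute_priority(self):
        # Equation (4): (Sharing Degree + 1) * Hits Count / 2
        self.priority = (self.sharing_degree() + 1) * self.hit_counter / 2

    def on_hit(self, thread_id):
        self.hit_counter += 1
        if thread_id not in self.tt_filter:   # TT filter scan + append
            self.tt_filter.append(thread_id)
        self.recompute_priority()


def find_victim(blocks):
    """Evict the first minimum-priority block; age the survivors (Section 5.4)."""
    removing_index = min(range(len(blocks)), key=lambda i: blocks[i].priority)
    for i, blk in enumerate(blocks):
        if i != removing_index and blk.hit_counter > 0:
            blk.hit_counter -= 1              # ageing counter decrement
            blk.recompute_priority()
    blocks[removing_index] = SHPBlock()       # insert: private, refreshed counter
    return removing_index
```

Note that `min` over indices naturally implements the tie-breaking rule: among equal priorities, the block encountered first is chosen.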
Flow: the processor looks for a particular block in the cache, and the SHP
algorithm starts searching for the block. On a HIT, the hit counter is incremented,
the TT filter and the sharing degree counter are updated if needed, and the
priority is re-computed. On a MISS, a replacement victim is found in the cache
and the processor brings the data item from the next level of memory.
Fig 5.1 (i) SHP flow diagram
Flow: the SHP algorithm starts searching for a replacement candidate in the
cache. It searches for the first block that has the minimum priority value and
chooses it as the replacement victim. The new data item is inserted there, its
sharing degree counter value is set to '0' (private), and its priority is
re-computed.
Fig 5.1 (ii) Victim Selection flow for SHP
Table 5.1 Sharing degree values and their descriptions

Number of Accessing Threads   Sharing Degree Counter Value   Nature of Sharing
1                             0                              Private/Not Shared
2-3                           1                              Lightly Shared
4-7                           2                              Heavily Shared
8-10                          3                              Very Heavily Shared
5.5. Insertion Policy
After evicting the victim, the new data item needs to be inserted into the cache.
Sharing degree counter is set to ‘0’ to indicate that the incoming block is currently
not shared by any other threads. The corresponding hit counter is refreshed.
5.6. Promotion Policy
When a cache hit happens, the hit counter of the corresponding block is
incremented. Also the accessing thread id is compared against the ids which are
already present in the TT filter and if it is not present there, it is added into the TT
filter. Sharing degree counter is adjusted accordingly and the priority is re-
computed. Fig. 5.1(i), 5.1(ii) shows the basic flow of SHP technique.
5.7. Results
To evaluate the performance of SHP, simulation has been carried out with the
GEM5 simulator using PARSEC workloads (discussed in Chapter 6), and the
results are compared with LRU. Most of the applications benefit from this
replacement policy. The results are shown in this section.
5.8. SHP Performance
It is observed from the following figures, that the performance of the proposed
replacement technique is better than LRU for multithreaded workloads. The
overall number of hits obtained at LLC cache for SHP method compared to LRU
is shown in the graph in Fig. 5.2. In Fig. 5.2, blackscholes has shown the
maximum improvement. On average, a 9% improvement in the overall
number of hits is observed across the given benchmarks.
Fig 5.2 Overall number of hits at L2
Fig. 5.3 shows the overall number of replacements made at L2 for SHP and
LRU. The more replacements made, the greater the overhead involved. So it is
always desirable to keep this parameter as low as possible. With this method, the
average number of replacements made at L2 has decreased compared to LRU.
Compared to the other benchmarks, blackscholes has exhibited the least number
of replacements (1.6% lesser than that of LRU) and swaptions has shown a
0.55% decrease when compared to LRU.
Fig. 5.4 shows the L2 miss rate for different workloads. Miss rate is computed
from the overall number of misses and the overall number of accesses. To obtain
a high performance, LLC must provide a lower miss rate. Majority of the
benchmarks have shown improvement in this metric when compared to LRU with
dedup showing superior performance.
Fig 5.3 Percentage decrease in number of replacements made at L2
Fig 5.4 Overall miss rate and core-wise miss rate at L2 cache
5.9. Summary
Shared data plays a critical role in determining the performance of cache
memory systems, especially in a multi-threaded environment. The conventional
LRU approach does not attach importance to such data, and hence this chapter
proposed a novel counter-based prioritizing algorithm.
Every cache block is associated with a 2-bit sharing degree counter which
ranges from 0 to 3.
A dynamic software based TT filter is associated with every block to keep track
of the threads that are accessing that block.
A hit counter is used to keep track of the hits received by the data item.
Values of the sharing degree and the hit counters are used to compute a priority
for each cache block.
This priority is then used to make judicious replacement decisions.
Evaluation results have shown an average improvement of up to 9% in the
overall number of hits when compared to the traditional LRU approach.
6. EXPERIMENTAL METHODOLOGY
This section outlines the simulation infrastructure and benchmarks used to
evaluate the dynamic replacement policies. To evaluate the proposed techniques,
Gem5 [102] simulator is used and PARSEC benchmark suites [103,104,105] are
used as input sets.
6.1. Simulation Environment
For the simulation studies, Gem5, a modular, discrete-event-driven, fully
object-oriented, open-source computer-system simulator platform, has been
chosen. It can simulate a complete system with devices and an operating system
in full system mode (FS mode), or user-space-only programs, where system
services are provided directly by the simulator, in syscall emulation mode (SE
mode). The FS mode is used in this work. Gem5 is also capable of simulating a
variety of Instruction Set Architectures (ISAs).
Table 6.1 Summary of the key characteristics of the PARSEC benchmarks

Program        Application Domain   Working Set
Blackscholes   Financial Analysis   Small
Canneal        Computer Vision      Medium
Dedup          Enterprise Storage   Unbounded
Ferret         Similarity Search    Unbounded
Fluidanimate   Animation            Large
Swaptions      Financial Analysis   Medium
Vips           Media Processing     Medium
x264           Media Processing     Medium
6.2. PARSEC Benchmark
The Princeton Application Repository for Shared-Memory Computers
(PARSEC) is a benchmark suite that comprises numerous large scale commercial
multi-threaded workloads for CMPs. For simulation purposes, seven workloads
have been selected from the PARSEC suite. Every workload is divided into three
phases: an initial sequential phase, a parallel phase (which is further split into two
phases), and a final sequential phase.
This work focuses on the performance obtained in the parallel phase, which is
also known as the Region of Interest (ROI). All the statistics collected to
measure cache performance correspond to the ROI of the workloads. The basic
characteristics of the chosen benchmarks are shown in Table 6.1, and Table 6.2
highlights the cache configuration parameters used for simulation.
Table 6.2 Cache architecture configuration parameters

Parameter                                  L1 Cache           L2 Cache
Total Cache Size                           32 KB i-cache,     2 MB
                                           64 KB d-cache
Associativity                              2-way              8-way
Miss Information/Status Handling
Registers (MSHR)                           10                 20
Cache Block Size                           64 B               64 B
Latency                                    1 ns               10 ns
Replacement Algorithm                      LRU                CB-DPET
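The set count used in the storage-overhead calculations of Section 6.4 follows directly from these parameters; a minimal sketch of the arithmetic, using the 2 MB, 8-way, 64-byte-block L2 configuration from Table 6.2:

```python
# Derive the L2 geometry from Table 6.2 (2 MB, 8-way set associative, 64 B blocks).
cache_size_bytes = 2 * 1024 * 1024   # 2 MB L2 cache
block_size_bytes = 64                # cache block (line) size
associativity = 8                    # 8-way set associative

blocks = cache_size_bytes // block_size_bytes   # total number of cache blocks
sets = blocks // associativity                  # sets = blocks / ways

print(blocks)  # 32768
print(sets)    # 4096
```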
6.3. Evaluation Of Many-Core Systems
Evaluation using the PARSEC benchmarks for all three methods on two-, four- and
eight-core systems with a 2 MB LLC has yielded better results than LRU in terms
of parameters such as the number of hits, hit rate, throughput and IPC speedup.
Throughput is measured from the total number of instructions executed per clock
cycle (Instructions Per Cycle, or IPC). In multi-threaded applications,
throughput is the sum of the individual thread IPCs, as shown below.

Throughput = Σ (i = 1 to n) IPC_i

IPC Speedup of Algorithm k over LRU = IPC_k / IPC_LRU

where 'n' refers to the total number of threads and k refers to any one of the
methods CB-DPET, LCP or SHP.
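The two formulas above translate directly into code; a minimal sketch, where the per-thread IPC values are purely illustrative and not simulation results:

```python
def throughput(thread_ipcs):
    """Throughput of a multi-threaded run = sum of the individual thread IPCs."""
    return sum(thread_ipcs)

def ipc_speedup(ipc_method, ipc_lru):
    """IPC speedup of a replacement method k over the LRU baseline."""
    return ipc_method / ipc_lru

# Illustrative per-thread IPCs for a hypothetical 4-thread workload.
ipcs_method = [0.42, 0.38, 0.40, 0.41]
ipcs_lru    = [0.40, 0.36, 0.38, 0.39]

print(throughput(ipcs_method))                                  # ≈ 1.61
print(ipc_speedup(throughput(ipcs_method), throughput(ipcs_lru)))  # ≈ 1.05
```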
Evaluation results using the PARSEC benchmarks for two cores have shown average
IPC speedup improvements of up to 1.06, 1.05 and 1.08 times that of LRU for the
CB-DPET, LCP and SHP techniques respectively (Fig. 6.1, 6.2 and 6.3).
Fig 6.1 Performance comparison for 2 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)
Fig 6.2 Performance comparison for 4 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)

Fig 6.3 Performance comparison for 8 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)
Fig 6.4 CB-DPET - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)

Fig 6.5 LCP - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)
Fig 6.6 SHP - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)
In the 2-core scenario for CB-DPET, canneal has shown the highest IPC speedup
improvement (1.2 times that of LRU), and dedup has recorded the least, at 1.02
times that of LRU. Vips and swaptions have shown the maximum IPC speedup among
the workloads for the LCP and SHP methods in the 2-core system. In the 4-core
scenario, fluidanimate has shown the maximum IPC speedup, at 1.075 and 1.07
times that of LRU for the CB-DPET and LCP methods respectively. Blackscholes
has shown an IPC speedup of 1.07 times that of LRU, the maximum for the SHP
method in the 4-core environment. For 8 cores, ferret has recorded the maximum
IPC speedup for the CB-DPET and LCP methods, whereas swaptions has shown the
maximum IPC speedup, of almost 1.1 times that of LRU, for the SHP method.
An improvement in overall hits of up to 8% has been observed using CB-DPET when
compared to LRU at the L2 cache level (Fig. 6.4). LCP and SHP have each shown
an average improvement of up to 9% in the overall number of hits (Fig. 6.5 and
Fig. 6.6 respectively).
6.4. Storage Overhead
6.4.1. CB-DPET
The PET counter requires frequent lookup, so it is kept as part of the cache
hardware to expedite access. Consider an 8-way associative, 2 MB cache with a
64-byte block size, which has 4096 sets. The additional storage required is
calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
                    = 4096 × 8 × 3 = 98,304 bits, or 12 KB

The DM register is allocated on a per-thread basis, and its size depends on the
application at run time, so it is kept as a software-based register rather than
a hardware one. Whether the block status register is implemented in hardware or
software is an implementation-time decision, since the range of values it can
hold is not yet fixed. Assuming that it can take only '0' (say, shared) and '1'
(not shared), and that the thread-id range does not start from '0', the
additional storage required for the block status register, if implemented in
hardware, is approximately 4 KB. The total additional storage required is
therefore approximately 16 KB, which amounts to a 0.78% overall increase in
area for a 2 MB L2 cache.
6.4.2. LCP
Similar to PET, LCP counter also requires frequent lookup so keeping it as a
part of the cache hardware to expedite the access time. Storage overhead is
similar to that of PET counter which is approximately 12KB since LCP is also a 3-
bit counter. As no other registers and counters are required, LCP results in 0.58%
overall increase in area for a 2MB L2 Cache.
6.4.3. SHP
SHP requires a 2-bit Sharing Degree counter which requires frequent look up
and hence keeping it as a part of the hardware. Additional storage required for
Sharing Degree counter is calculated as follows
Additional Overhead = no. of sets ∗ no. of blocks per set ∗ no. of bits per block = ∗ ∗ = 𝑏𝑖 𝑜 𝐾𝐵
Thread Tracker filter which is used to keep track of the threads that accesses the
blocks can be implemented as a part of software which gives the flexibility of
dynamically allocating and freeing the memory as and when required. Hence SHP
results in 0.39% overall increase in area for a 2MB L2 Cache.
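The three per-method overhead figures can be cross-checked with a short calculation; a minimal sketch of the arithmetic used above (counter widths as stated: PET and LCP are 3-bit, SHP's sharing degree is 2-bit, plus the assumed ~4 KB block status register for CB-DPET):

```python
# Storage overhead of the per-block counters for a 2 MB, 8-way, 64 B-block L2
# cache (4096 sets, 8 blocks per set).
SETS, WAYS = 4096, 8
CACHE_KB = 2 * 1024  # 2 MB L2 expressed in KB

def counter_overhead_kb(bits_per_block):
    total_bits = SETS * WAYS * bits_per_block
    return total_bits / 8 / 1024  # convert bits -> KB

cbdpet_kb = counter_overhead_kb(3) + 4   # 12 KB PET counters + ~4 KB block status
lcp_kb    = counter_overhead_kb(3)       # 12 KB LCP counters
shp_kb    = counter_overhead_kb(2)       # 8 KB sharing-degree counters

for name, kb in [("CB-DPET", cbdpet_kb), ("LCP", lcp_kb), ("SHP", shp_kb)]:
    print(f"{name}: {kb:.0f} KB ({100 * kb / CACHE_KB:.2f}% of 2 MB)")
```

This reproduces 16 KB (0.78%), 12 KB and 8 KB (0.39%); note that 12 KB of 2 MB is 0.586%, which the text above truncates to 0.58%.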
7. CONCLUSION AND FUTURE WORK
7.1. Conclusion
When it comes to concurrent multi-threaded applications in a CMP environment,
the LLC plays a crucial role in overall system efficiency. Conventional
replacement strategies like LRU, NRU, etc. may not be effective in many
instances, as they do not adapt dynamically to the data requirements.
Considerable research is being conducted to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. In this
dissertation, three novel counter-based replacement strategies have been
proposed.
A novel replacement algorithm called CB-DPET has been proposed which
takes a more systematic and access-history based decision compared to LRU.
It is further split into DPET and DPET-V. DPET associates a counter with every
cache block to keep track of its recent access information. Based on this
counter’s values, the algorithm makes replacement, insertion and promotion
decisions.
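The per-block counter mechanism described above can be sketched as follows. This is a hypothetical illustration, not the thesis's exact policy: the counter is assumed to be 3-bit (as with the PET counter of Section 6.4), and the insertion and promotion values are assumptions.

```python
# Hypothetical sketch of a counter-based replacement decision in the spirit of
# DPET: each block carries a small saturating counter tracking recent accesses;
# the block with the lowest counter value in the set is chosen as the victim.
PET_MAX = 7  # 3-bit saturating counter (assumed width)

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.pet = 1  # assumed insertion value for a newly filled block

def on_hit(block):
    # Promotion on reuse: increment the counter, saturating at the maximum.
    block.pet = min(block.pet + 1, PET_MAX)

def choose_victim(cache_set):
    # Replacement: evict the block with the least recent-access evidence.
    return min(cache_set, key=lambda b: b.pet)

# Usage: block B is reused twice, A never, so A becomes the victim.
s = [Block("A"), Block("B")]
on_hit(s[1]); on_hit(s[1])
print(choose_victim(s).tag)  # A
```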
A dynamic and a structured replacement technique called LCP has been
proposed to be adopted across the LLC. Here the elements of the cache are
logically partitioned into four zones based on their likeliness to be referenced
by the processor with the help of a 3-bit LCP counter associated with every
cache line. Replacement candidates are chosen from the zones in the
increasing order of their likeliness factor starting from the NLR zone. For every
hit, the corresponding element is moved up by one zone by adjusting the LCP
counter value. If it has reached the top most zone (MLR) the counter value is
left untouched. For every miss the counter value is decremented by ‘1’ to
prevent stale data items from polluting the cache.
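The LCP zone mechanics described above can be sketched in a few lines. The mapping of 3-bit counter values to the four zones (two counter values per zone) and the intermediate zone names are assumptions for illustration; the thesis names only NLR and MLR.

```python
# Hedged sketch of LCP: a 3-bit counter per cache line partitions the set into
# four logical zones, from NLR (least likely to be referenced) to MLR (most
# likely). Hits promote a line one zone; misses decrement its counter by 1.
LCP_MAX = 7                            # 3-bit counter, range 0..7
ZONES = ["NLR", "LLR", "LR", "MLR"]    # middle zone names are assumptions

def zone(lcp):
    return ZONES[lcp // 2]             # assumed: two counter values per zone

def on_hit(lcp):
    # Move the line up one zone, unless it is already in the topmost (MLR) zone.
    return lcp if zone(lcp) == "MLR" else min(lcp + 2, LCP_MAX)

def on_miss(lcp):
    # Decrement by 1 so stale lines drift toward NLR and become eviction candidates.
    return max(lcp - 1, 0)

def choose_victim(set_lcps):
    # Search zones in increasing order of likeliness, starting from NLR.
    return min(range(len(set_lcps)), key=lambda i: set_lcps[i])

print(zone(on_hit(3)))  # a hit on a line in the LLR zone promotes it to LR
```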
A novel counter based prioritizing algorithm called SHP has been proposed to
be applied across the LLC. Here every cache block is associated with a 2-bit
sharing degree counter which iterates from 0 to 3. A dynamic software based
TT filter is associated with every block to keep track of the threads that are
accessing that block. A hit counter is used to keep track of the hits received by
the data item. Values of the sharing degree and the hit counters are used to
compute a priority for each cache block. This priority is then used to make
judicious replacement decisions.
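The SHP priority computation can be sketched as below. The thesis states only that the sharing degree and hit counter together determine the priority; the weighted-sum combining function here is an assumption for illustration.

```python
# Hedged sketch of SHP: each block carries a 2-bit sharing degree (0..3) and a
# hit counter; both feed a replacement priority. The weights are assumptions --
# the exact combining function is not specified in this summary.
def priority(sharing_degree, hits, w_share=2, w_hits=1):
    assert 0 <= sharing_degree <= 3    # 2-bit sharing degree counter
    return w_share * sharing_degree + w_hits * hits

def choose_victim(blocks):
    # Evict the block with the lowest priority: least shared and least reused.
    return min(blocks, key=lambda b: priority(b["deg"], b["hits"]))

blocks = [
    {"tag": "A", "deg": 3, "hits": 5},  # heavily shared and frequently hit
    {"tag": "B", "deg": 0, "hits": 1},  # private, barely reused
]
print(choose_victim(blocks)["tag"])  # B
```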
Table 7.1 Percentage increase in overall hits at L2 Cache compared to LRU
Method 2-Core 4-Core 8-Core
CB-DPET 8% 4% 3%
LCP 6% 9% 5%
SHP 7% 9% 5%
Table 7.1 provides the percentage improvement obtained in the overall number of
hits when compared to LRU for all three methods. It is observed that up to a 9%
improvement in overall hits has been obtained at the L2 cache level (for LCP),
along with an IPC speedup of up to 1.08 times that of LRU (for SHP).
Simulation studies show that all three proposed counter-based replacement
techniques have outperformed LRU with minimal storage overhead.
7.2. Scope for Further work
Dynamic replacement strategies have proven to work well with versatile set of
application workloads that currently exist. Increasing the dynamicity without
inducing much hardware overhead is crucial. This research has experimented
novel replacement strategies on the shared L2 cache while the modifications that
may be required to run the same algorithms in the relatively smaller private L1
cache is an interesting area to explore.
Appending additional bits per cache block in CB-DPET can help in storing more
information about the access pattern of the data item but storage overhead needs
to be taken into consideration. In LCP, optimization of the victim search process
could be done to expedite the replacement candidate selection. Multi-threading
could be a better solution here. Instead of having one thread searching the cache
set in multiple passes there can be multiple threads searching in different zones
but inter-thread communication may have to be explored to make sure the process
stops once victim is found. In sharing and hit-based prioritizing algorithm,
replacement priority of cache blocks is computed from the sharing degree and
number of hits received. In order to boost the accuracy, various other parameters
like miss penalty incurred while replacing that block etc can also be considered
while computing priority.
TECHNICAL BIOGRAPHY
Muthukumar S (RRN: 1184211) was born on 15th April 1965, in Erode,
Tamilnadu. He received the BE degree in Electronics and Communication
Engineering and the ME degree in Applied Electronics from Bharathiar University,
Coimbatore in the year 1989 and 1992 respectively. He is currently pursuing his
Ph.D. Degree in the Department of Electronics and Communication Engineering
of B.S. Abdur Rahman University, Chennai, India. He has more than 22 years
of teaching experience in various institutions. He is working as a Professor in the
Department of Electronics and Communication Engineering at Sri Venkateswara
College of Engineering, Chennai, India. His current research interests include
Multi-Core Processor Architectures, High Performance Computing and
Embedded Systems.
E-mail ID : [email protected]
Contact number : 9941437738