ENHANCEMENT OF CACHE PERFORMANCE IN
MULTI-CORE PROCESSORS
A THESIS
Submitted by
MUTHUKUMAR S.
Under the Guidance of
Dr. P.K. JAWAHAR
in partial fulfillment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
November 2014
BONAFIDE CERTIFICATE
Certified that this thesis ENHANCEMENT OF CACHE PERFORMANCE
IN MULTI-CORE PROCESSORS is the bonafide work of MUTHUKUMAR S.
(RRN: 1184211) who carried out the thesis work under my supervision. Certified
further, that to the best of my knowledge the work reported herein does not form
part of any other thesis or dissertation on the basis of which a degree or award
was conferred on an earlier occasion on this or any other candidate.
SIGNATURE
Dr. P.K. JAWAHAR
RESEARCH SUPERVISOR
Professor and Head
Department of ECE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
First and foremost, I am forever thankful to the Almighty for having
bestowed this life and knowledge upon me. Next, I would like to express my
sincere gratitude to my supervisor, Professor Dr. P.K. Jawahar, for his
continuous support during my PhD study and research, and for his patience,
motivation, enthusiasm, and immense knowledge. He gave me the freedom to
do research in my chosen area, as a result of which I felt encouraged
and motivated to achieve my objectives. Besides my advisor, I would also like to
thank the doctoral committee members, Dr. P. Rangarajan and Dr. T.R.
Rangaswamy for their valuable suggestions, encouragement and insightful
comments.
My sincere thanks to Dr. Kaja Mohideen, Dean, School of Electrical and
Communication Sciences for having provided me with this opportunity.
I would like to thank Professor Dr. V. Sankaranarayanan for his advice and
valuable suggestions throughout my research.
I would also like to thank Dr. M. Arunachalam, Secretary, Sri Venkateswara
College of Engineering for his encouragement and support.
Last but not least, I would like to thank my family members for their immense
support. Special thanks go to my wife, Mrs. M. Shanthi, my son, M.
Subramanian, and my daughter, M. Uma, for the patience and moral support
that kept me going throughout the entire course.
I am also indebted to my parents who have stimulated my research interests
from the very early stage of my life. I sincerely hope that I can make my family
members proud.
ABSTRACT
The cache replacement policy in use is a main deciding factor of system
performance and efficiency in Chip Multi-core Processors (CMPs). It is
imperative for any level of cache memory in a multi-core architecture to have a
well-defined, dynamic replacement algorithm in place to ensure consistently
superlative performance. Many existing cache replacement policies, such as the
Least Recently Used (LRU) policy, have proved to work well in the shared
Level 2 (L2) cache for most of the data-set patterns generated by single-
threaded applications. But for parallel multi-threaded applications,
which generate differing patterns of workload at different intervals, the
conventional replacement policies can prove sub-optimal, as they generally do
not exploit spatial and temporal locality and do not adapt
dynamically to changes in the workload.
This dissertation proposes three novel cache replacement policies for L2
cache that are targeted towards such applications. Context Based Data Pattern
Exploitation Technique assigns a counter for every block of the L2 cache for
monitoring the data access patterns of various threads and modifies the counter
values appropriately to maximize the performance. Logical Cache Partitioning
technique logically partitions the cache elements into four zones based on their
‘likeliness’ to be referenced by the processor in the near future. Replacement
candidates are chosen from the zones in the ascending order of their ‘likeliness
factor’ to minimize the number of cache misses. Sharing and Hit based
Prioritizing replacement technique takes the sharing status of the data elements
into consideration while making replacement decisions. This information is
combined with the number of hits received by the element to arrive at a
priority, based on which the replacement decisions are made.
The proposed techniques are effective at improving cache performance and
offer an improvement of up to 9% in overall hits at the L2 cache level and an IPC
(Instructions Per Cycle) speedup of up to 1.08 times that of LRU for a wide
range of multithreaded benchmarks.
REFERENCES
[1] John L. Hennessy and David A. Patterson, "Computer architecture: a
quantitative approach", Fifth Edition, Elsevier, 2012.
[2] Tola John Odule and Idowun Ademola Osinuga, “Dynamically Self-
Adjusting Cache Replacement Algorithm”, International Journal of Future
Generation Communication and Networking, Vol. 6, No. 1, Feb. 2013.
[3] Andhi Janapsatya, Aleksandar Ignjatovic, Jorgen Peddersen and Sri
Parameswaran, "Dueling clock: adaptive cache replacement policy based
on the clock algorithm", IEEE Design, Automation & Test Europe
Conference & Exhibition (DATE), pp. 920-925, 2010.
[4] Geng Tian and Michael Liebelt, "An effectiveness-based adaptive cache
replacement policy", Microprocessors and Microsystems, Vol 38, No.1, pp.
98-111, Elsevier, Feb. 2014.
[5] Anil Kumar Katti and Vijaya Ramachandran, "Competitive cache
replacement strategies for shared cache environments", IEEE 26th
International Parallel & Distributed Processing Symposium (IPDPS), pp.
215-226, 2012.
[6] Martin Kampe, Per Stenstrom and Michel Dubois, "Self-correcting LRU
replacement policies", 1st Conference on Computing Frontiers, ACM
Proceedings, pp. 181-191, 2004.
[7] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik Jhon,
"Adaptive block management for victim cache by exploiting l1 cache history
information", Embedded and Ubiquitous Computing, pp. 1-11, Springer
Berlin Heidelberg, 2004.
[8] Jorge Albericio, Ruben Gran, Pablo Ibanez, Víctor Vinals and Jose Maria
Llaberia, "ABS: A low-cost adaptive controller for prefetching in a banked
shared last-level cache", ACM Transactions on Architecture and Code
Optimization (TACO), Vol. 8, No. 4, Article. 19, 2012.
[9] Ke Zhang, Zhensong Wang, Yong Chen, Huaiyu Zhu and Xian-He Sun,
"PAC-PLRU: A Cache Replacement Policy to Salvage Discarded
Predictions from Hardware Prefetchers", 11th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing, pp. 265-274, May 2011.
[10] Jayesh Gaur, Mainak Chaudhuri and Sreenivas Subramoney, "Bypass and
insertion algorithms for exclusive last-level caches", ACM SIGARCH
Computer Architecture News, Vol. 39, No. 3, 2011.
[11] Hongliang Gao and Chris Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing", JWAC 1st JILP Workshop
on Computer Architecture Competitions: cache replacement
Championship, 2010.
[12] Saurabh Gupta, Hongliang Gao and Huiyang Zhou, "Adaptive Cache
Bypassing for Inclusive Last Level Caches", IEEE 27th International
Symposium on Parallel & Distributed Processing (IPDPS), pp. 1243-1253,
2013.
[13] Mazen Kharbutli and Yan Solihin, "Counter-based cache replacement and
bypassing algorithms", IEEE Transactions on Computers, Vol. 57, No. 4,
pp. 433-447, 2008.
[14] Pierre Michaud, "Replacement policies for shared caches on symmetric
multicores: a programmer-centric point of view", 6th International
Conference on High Performance and Embedded Architectures and
Compilers, pp. 187-196, 2011.
[15] Moinuddin K Qureshi, "Adaptive spill-receive for robust high-performance
caching in CMPs", IEEE 15th International Symposium on High
Performance Computer Architecture (HPCA), pp. 45-54, 2009.
[16] Kathlene Hurt and Byeong Kil Lee, “Proposed Enhancements to Fixed
Segmented LRU Cache Replacement Policy”, IEEE 32nd International
Conference on Performance and Computing and Communications
(IPCCC), pp. 1-2, Dec. 2013.
[17] R. Manikantan, Kaushik Rajan and Ramaswamy Govindarajan, "NUcache:
An efficient multi-core cache organization based on next-use distance",
IEEE 17th International Symposium on High Performance Computer
Architecture (HPCA), pp. 243-253, 2011.
[18] Kamil Kedzierski, Miquel Moreto, Francisco J. Cazorla and Mateo Valero,
"Adapting cache partitioning algorithms to pseudo-lru replacement
policies", IEEE International Symposium on Parallel & Distributed
Processing (IPDPS), pp. 1-12, 2010.
[19] Yuejian Xie and Gabriel H. Loh, "PIPP: promotion/insertion pseudo-
partitioning of multi-core shared caches", ACM SIGARCH Computer
Architecture News, Vol. 37, No. 3, pp. 174-183, 2009.
[20] William Hasenplaugh, Pritpal S. Ahuja, Aamer Jaleel, Simon Steely Jr and
Joel Emer, "The gradient-based cache partitioning algorithm", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 8, No.
4, Article 44, Jan 2012.
[21] Haakon Dybdahl, Per Stenstrom and Lasse Natvig, "A cache-partitioning
aware replacement policy for chip multiprocessors", High Performance
Computing-HiPC, Springer, pp. 22-34, 2006.
[22] Naoki Fujieda, and Kenji Kise, "A partitioning method of cooperative
caching with hit frequency counters for many-core processors", Second
International Conference on Networking and Computing (ICNC), pp. 160-
165, 2011.
[23] Konstantinos Nikas, Matthew Horsnell and Jim Garside, "An adaptive
bloom filter cache partitioning scheme for multi-core architectures”,
International Conference on Embedded Computer Systems, Architectures,
Modeling, and Simulation, SAMOS, pp. 25-32, 2008.
[24] Moinuddin K Qureshi and Yale N. Patt, "Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches", 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 423-432, 2006.
[25] Daniel A Jimenez, "Insertion and promotion for tree-based PseudoLRU
last-level caches", 46th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 284-296, 2013.
[26] Javier Lira, Carlos Molina and Antonio Gonzalez, "LRU-PEA: A smart
replacement policy for non-uniform cache architectures on chip
multiprocessors", IEEE International Conference on Computer Design, pp.
275-281, 2009.
[27] Nikos Hardavellas, Michael Ferdman, Babak Falsafi and Anastasia
Ailamaki, "Reactive NUCA: near-optimal block placement and replication
in distributed caches", ACM SIGARCH Computer Architecture News, Vol.
37, No. 3, pp. 184-195, 2009.
[28] Fazal Hameed, Bauer L and Henkel J, “Dynamic Cache Management in
Multi-Core Architectures Through Runtime Adaptation”, Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 485-
490, March 2012.
[29] Alberto Ros, Marcelo Cintra, Manuel E. Acacio and Jose M. García.
"Distance-aware round-robin mapping for large NUCA caches",
International Conference on High Performance Computing (HiPC), pp. 79-
88, 2009.
[30] Javier Lira, Timothy M. Jones, Carlos Molina and Antonio Gonzalez. "The
migration prefetcher: Anticipating data promotion in dynamic NUCA
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 45, 2012.
[31] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "Dynamic
cache clustering for chip multiprocessors", ACM 23rd International
Conference on Supercomputing, pp. 56-67, 2009.
[32] O. Adamo, A. Naz, K. Kavi, T. Janjusic and C. P. Chung, “Smaller split L-
1 data caches for multi-core processing systems”, IEEE 10th International
Symposium on Pervasive Systems, Algorithms and Networks(I-SPAN
2009), Dec 2009.
[33] Yong Chen, Surendra Byna and Xian-He Sun, "Data access history cache
and associated data prefetching mechanisms", ACM/IEEE Conference
on Supercomputing, pp. 1-12, 2007.
[34] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "ACM: An
efficient approach for managing shared caches in chip multiprocessors",
High Performance Embedded Architectures and Compilers, Springer Berlin
Heidelberg, pp. 355-372, 2009.
[35] Sourav Roy, "H-NMRU: a low area, high performance cache replacement
policy for embedded processors", 22nd International Conference on VLSI
Design, pp. 553-558, 2009.
[36] Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa and Tatsuo Ohtsuki,
"Exact and fast L1 cache simulation for embedded systems", Asia and
South Pacific Design Automation Conference, pp. 817-822, 2009.
[37] Sangyeun Cho, and Lory Al Moakar, "Augmented FIFO Cache
Replacement Policies for Low-Power Embedded Processors", Journal of
Circuits, Systems and Computers, Vol. 18, No. 6, pp. 1081-1092, 2009.
[38] Kimish Patel, Luca Benini, Enrico Macii and Massimo Poncino, "Reducing
conflict misses by application-specific reconfigurable indexing", IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 25, No. 12, pp. 2626-2637, 2006.
[39] Yun Liang, and Tulika Mitra, "Instruction cache locking using temporal
reuse profile", 47th Design Automation Conference, pp. 344-349, 2010.
[40] Yun Liang, and Tulika Mitra, "Improved procedure placement for set
associative caches", International Conference on Compilers, Architectures
and Synthesis for Embedded Systems, pp. 147-156, 2010.
[41] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik Jhon,
"Adaptive block management for victim cache by exploiting l1 cache history
information", Embedded and Ubiquitous Computing, Springer Berlin
Heidelberg, pp. 1-11, 2004.
[42] Cristian Ungureanu, Biplob Debnath, Stephen Rago and Akshat Aranya,
"TBF: A memory-efficient replacement policy for flash-based caches",
IEEE 29th International Conference on Data Engineering (ICDE), pp.
1117-1128, 2013.
[43] Pepijn de Langen and Ben Juurlink, "Reducing Conflict Misses in Caches
by Using Application Specific Placement Functions", 17th Annual
Workshop on Circuits, Systems and Signal Processing, pp. 300-305. 2006.
[44] Dongxing Bao and Xiaoming Li, "A cache scheme based on LRU-like
algorithm", IEEE International Conference on Information and Automation
(ICIA), pp. 2055-2060, 2010.
[45] Livio Soares, David Tam and Michael Stumm, "Reducing the harmful
effects of last-level cache polluters with an OS-level, software-only pollute
buffer", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 258-269, 2008.
[46] Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose
Duato, "MRU-Tour-based Replacement Algorithms for Last-Level Caches",
23rd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 112-119, 2011.
[47] Gangyong Jia, Xi Li, Chao Wang, Xuehai Zhou and Zongwei Zhu, "Cache
Promotion Policy Using Re-reference Interval Prediction", IEEE
International Conference on Cluster Computing (CLUSTER), pp. 534-537,
2012.
[48] Junmin Wu, Xiufeng Sui, Yixuan Tang, Xiaodong Zhu, Jing Wang and
Guoliang Chen, "Cache Management with Partitioning-Aware Eviction and
Thread-Aware Insertion/Promotion Policy", International Symposium on
Parallel and Distributed Processing with Applications (ISPA), pp. 374-381,
2010.
[49] Ali A. El-Moursy and Fadi N. Sibai, "V-Set cache design for LLC of multi-
core processors", IEEE 14th International Conference on High
Performance Computing and Communication (HPCC), pp. 995-1000,
2012.
[50] Aamer Jaleel, Matthew Mattina and Bruce Jacob, "Last level cache (LLC)
performance of data mining workloads on a CMP-a case study of parallel
bioinformatics workloads", The Twelfth International Symposium on High-
Performance Computer Architecture, pp. 88-98, 2006.
[51] Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri and
Jose Martinez, "Scavenger: A new last level cache architecture with global
block priority", 40th Annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 421-432, 2007.
[52] Miao Zhou, Yu Du, Bruce Childers, Rami Melhem and Daniel Mossé,
"Writeback-aware partitioning and replacement for last-level caches in
phase change main memory systems", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 4, Article. 53, 2012.
[53] Fazal Hameed, Lars Bauer and Jorg Henkel, "Adaptive cache
management for a combined SRAM and DRAM cache hierarchy for multi-
cores", Design, Automation & Test in Europe Conference & Exhibition
(DATE), pp. 77-82, 2013.
[54] Fang Juan and Li Chengyan, "An improved multi-core shared cache
replacement algorithm", 11th International Symposium on Distributed
Computing and Applications to Business, Engineering & Science
(DCABES), pp. 13-17, 2012.
[55] Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas
Subramoney and Joseph Nuzman, "Introducing hierarchy-awareness in
replacement and bypass algorithms for last-level caches", International
Conference on Parallel Architectures and Compilation Techniques, pp.
293-304, 2012.
[56] Yingying Tian, Samira M. Khan and Daniel A. Jimenez, "Temporal-based
multilevel correlating inclusive cache replacement", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article. 33,
2013.
[57] Liqiang He, Yan Sun and Chaozhong Zhang, "Adaptive Subset Based
Replacement Policy for High Performance Caching", JWAC-1st JILP
Workshop on Computer Architecture Competitions: Cache Replacement
Championship, 2010.
[58] Baozhong Yu, Yifan Hu, Jianliang Ma and Tianzhou Chen, "Access Pattern
Based Re-reference Interval Table for Last Level Cache", 12th
International Conference on Parallel and Distributed Computing,
Applications and Technologies (PDCAT), pp. 251-256, 2011.
[59] Haiming Liu, Michael Ferdman, Jaehyuk Huh and Doug Burger, "Cache
bursts: A new approach for eliminating dead blocks and increasing cache
efficiency", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 222-233, 2008.
[60] Vineeth Mekkat, Anup Holey, Pen-Chung Yew and Antonia Zhai,
"Managing shared last-level cache in a heterogeneous multicore
processor", 22nd International Conference on Parallel Architectures and
Compilation Techniques, pp. 225-234, 2013.
[61] Rami Sheikh and Mazen Kharbutli, "Improving cache performance by
combining cost-sensitivity and locality principles in cache replacement
algorithms", IEEE International Conference on Computer Design (ICCD),
pp. 76-83, 2010.
[62] Adwan AbdelFattah and Aiman Abu Samra, "Least recently plus five least
frequently replacement policy (LR+ 5LF)", Int. Arab J. Inf. Technology,
Vol. 9, No. 1, pp. 16-21, 2012.
[63] Richa Gupta and Sanjiv Tokekar, "Proficient Pair of Replacement
Algorithms on L1 and L2 Cache for Merge Sort", Journal of Computing, Vol.
2, No. 3, pp. 171-175, March 2010.
[64] Jan Reineke, Daniel Grund, “Sensitivity of Cache Replacement Policies”,
ACM Transactions on Embedded Computer Systems, Vol. 9, No. 4, Article
39, March 2010.
[65] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu and Yale N. Patt, "A
case for MLP-aware cache replacement", ACM SIGARCH Computer
Architecture News, Vol. 34, No. 2, pp. 167-178, 2006.
[66] Marios Kleanthous and Yiannakis Sazeides, "CATCH: A mechanism for
dynamically detecting cache-content-duplication in instruction caches",
ACM Transactions on Architecture and Code Optimization (TACO), Vol. 8,
No. 3, Article. 11, 2011.
[67] Moinuddin K. Qureshi, David Thompson and Yale N. Patt, "The V-Way
cache: demand-based associativity via global replacement", 32nd
International Symposium on Computer Architecture, ISCA'05 Proceedings,
pp. 544-555, 2005.
[68] Jichuan Chang and Gurindar S. Sohi, “Cooperative caching for chip
multiprocessors”, IEEE Computer Society, Vol. 34, No. 2, 2006.
[69] Shekhar Srikantaiah, Mahmut Kandemir and Mary Jane Irwin, "Adaptive
set pinning: managing shared caches in chip multiprocessors", ACM
SIGARCH Computer Architecture News, Vol. 36, No. 1, pp. 135-144, 2008.
[70] Ranjith Subramanian, Yannis Smaragdakis and Gabriel H. Loh, “Adaptive
Caches: Effective Shaping of Cache Behavior to Workloads”, IEEE
MICRO, Vol. 39, 2006.
[71] Tripti Warrier S, B. Anupama and Madhu Mutyam, "An application-aware
cache replacement policy for last-level caches", Architecture of Computing
Systems–ARCS, Springer Berlin Heidelberg, pp. 207-219, 2013.
[72] Chongmin Li, Dongsheng Wang, Yibo Xue, Haixia Wang and Xi Zhang.
"Enhanced adaptive insertion policy for shared caches", Advanced Parallel
Processing Technologies, Springer Berlin Heidelberg, pp. 16-30, 2011.
[73] Xiaoning Ding, Kaibo Wang and Xiaodong Zhang, "SRM-buffer: an OS
buffer management technique to prevent last level cache from thrashing in
multi-cores", ACM 6th Conference on Computer Systems, pp. 243-256,
2011.
[74] Jiayuan Meng and Kevin Skadron, "Avoiding cache thrashing due to private
data placement in last-level cache for many core scaling", IEEE
International Conference on Computer Design (ICCD), pp. 282-288, 2009.
[75] Vivek Seshadri, Onur Mutlu, Michael A. Kozuch and Todd C. Mowry, "The
Evicted-Address Filter: A unified mechanism to address both cache
pollution and thrashing", 21st International Conference on Parallel
Architectures and Compilation Techniques, pp. 355-366, 2012.
[76] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and
Joel Emer, “Adaptive Insertion Policies for High Performance Caching”,
ACM SIGARCH Computer Architecture News, Vol. 35, No. 2, pp. 381-391,
June 2007.
[77] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero
and Alexander V. Veidenbaum, "Improving cache management policies
using dynamic reuse distances", 45th Annual IEEE/ACM International
Symposium on Microarchitecture, IEEE Computer Society, pp. 389-400,
2012.
[78] Xi Zhang, Chongmin Li, Haixia Wang and Dongsheng Wang, "A cache
replacement policy using adaptive insertion and re-reference prediction",
22nd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 95-102, 2010.
[79] Min Feng, Chen Tian, Changhui Lin and Rajiv Gupta, "Dynamic access
distance driven cache replacement", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 3, Article 14, 2011.
[80] Jun Zhang, Xiaoya Fan and Song-He Liu, "A pollution alleviative L2 cache
replacement policy for chip multiprocessor architecture", International
Conference on Networking, Architecture and Storage, NAS'08, pp. 310-
316, 2008.
[81] Yuejian Xie and Gabriel H. Loh, "Scalable shared-cache management by
containing thrashing workloads", High Performance Embedded
Architectures and Compilers, Springer Berlin Heidelberg, pp. 262-276,
2010.
[82] Seongbeom Kim, Dhruba Chandra and Yan Solihin, "Fair cache sharing
and partitioning in a chip multiprocessor architecture", 13th International
Conference on Parallel Architectures and Compilation Techniques, IEEE
Computer Society, pp. 111-122, 2004.
[83] Jichuan Chang and Gurindar S. Sohi, "Cooperative cache partitioning for
chip multiprocessors", 21st ACM International Conference on
Supercomputing, pp. 242-252, 2007.
[84] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot,
Simon Steely Jr and Joel Emer, "Adaptive insertion policies for managing
shared caches", 17th International Conference on Parallel Architectures
and Compilation Techniques, pp. 208-219, 2008.
[85] Carole-Jean Wu and Margaret Martonosi, "Adaptive timekeeping
replacement: Fine-grained capacity management for shared CMP
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 1, Article. 3, 2011.
[86] Mainak Chaudhuri, "Pseudo-LIFO: the foundation of a new family of
replacement policies for last-level caches", 42nd Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 401-412, 2009.
[87] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr and Joel Emer, "High
performance cache replacement using re-reference interval prediction
(RRIP)", ACM SIGARCH Computer Architecture News, Vol. 38, No. 3, pp.
60-71, 2010.
[88] Viacheslav V Fedorov, Sheng Qiu, A. L. Reddy and Paul V. Gratz, "ARI:
Adaptive LLC-memory traffic management", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article. 46,
2013.
[89] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and Joel
Emer, "Set-dueling-controlled adaptive insertion for high-performance
caching", Micro IEEE, Vol. 28, No. 1, pp. 91-98, 2008.
[90] Aamer Jaleel, Hashem H. Najaf-Abadi, Samantika Subramaniam, Simon
C. Steely and Joel Emer, "Cruise: cache replacement and utility-aware
scheduling", ACM SIGARCH Computer Architecture News, Vol. 40, No. 1,
pp. 249-260, 2012.
[91] George Kurian, Srinivas Devadas and Omer Khan, "Locality-Aware Data
Replication in the Last-Level Cache", High-Performance Computer
Architecture, 20th International Symposium on IEEE, 2014.
[92] Jason Zebchuk, Srihari Makineni and Donald Newell, "Re-examining cache
replacement policies", IEEE International Conference on Computer Design
(ICCD), pp. 671-678, 2008.
[93] Samira Manabi Khan, Zhe Wang and Daniel A. Jimenez, "Decoupled
dynamic cache segmentation", IEEE 18th International Symposium
on High Performance Computer Architecture (HPCA), pp. 1-12, 2012.
[94] Song Jiang and Xiaodong Zhang, "LIRS: an efficient low inter-reference
recency set replacement policy to improve buffer cache performance" ACM
SIGMETRICS Performance Evaluation Review, Vol. 30, No.1, pp. 31-42,
2002.
[95] Song Jiang and Xiaodong Zhang, "Making LRU friendly to weak locality
workloads: A novel replacement algorithm to improve buffer cache
performance", IEEE Transactions on Computers, Vol. 54, No. 8, pp. 939-
952, 2005.
[96] Izuchukwu Nwachukwu, Krishna Kavi, Fawibe Ademola and Chris Yan,
"Evaluation of techniques to improve cache access uniformities",
International Conference on Parallel Processing (ICPP), pp. 31-40, 2011.
[97] Dyer Rolán, Basilio B. Fraguela and Ramon Doallo, “Virtually split cache:
An efficient mechanism to distribute instructions and data”, ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10, No.
4, Article. 27, Dec 2013.
[98] Huang ZhiBin, Zhu Mingfa and Xiao Limin, "LvtPPP: live-time protected
pseudopartitioning of multicore shared caches", IEEE Transactions
on Parallel and Distributed Systems, Vol. 24, No. 8, pp. 1622-1632, 2013.
[99] Bradford M Beckmann, Michael R. Marty and David A. Wood, "ASR:
Adaptive selective replication for CMP caches", 39th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-39, pp. 443-454,
2006.
[100] Antonio Garcia-Guirado, Ricardo Fernandez-Pascual, Alberto Ros and
José M. García, "DAPSCO: Distance-aware partially shared cache
organization", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 25, 2012.
[101] Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi,
Simon C. Steely Jr and Joel Emer, "SHiP: signature-based hit predictor for
high performance caching", 44th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 430-441, 2011.
[102] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, Derek. R. Hower, Tushar Krishna,
Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay
Vaish, Mark D. Hill and David A. Wood, "The gem5 simulator", ACM
SIGARCH Computer Architecture News, Vol. 39, No. 2 pp. 1-7, 2011.
[103] Christian Bienia, Sanjeev Kumar and Kai Li, "PARSEC vs. SPLASH-2: A
quantitative comparison of two multithreaded benchmark suites on chip-
multiprocessors", IEEE International Symposium on Workload
Characterization (IISWC), pp. 47-56, 2008.
[104] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh and Kai Li, "The
PARSEC benchmark suite: Characterization and architectural
implications", 17th International Conference on Parallel Architectures and
Compilation Techniques, pp. 72-81, 2008.
[105] Mark Gebhart, Joel Hestness, Ehsan Fatehi, Paul Gratz and Stephen W.
Keckler, "Running PARSEC 2.1 on M5", University of Texas at Austin,
Department of Computer Science, Technical Report# TR-09-32, 2009.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iv
LIST OF TABLES viii
LIST OF FIGURES ix
LIST OF SYMBOLS AND ABBREVIATIONS xi

1. INTRODUCTION 1
1.1 Chip Multi-core Processors 2
1.2 Memory Management in CMP 3
1.3 Overview of Existing Replacement Policies 3
1.4 The Problem 6
1.5 Scope of Work 6
1.6 Research Objectives 7
1.7 Contributions 7
1.8 Outline of the Thesis 8
1.9 Summary 9

2. LITERATURE SURVEY 10
2.1 Replacement Policies 10
2.2 Data Prefetching and Cache Bypassing 11
2.3 Effectiveness of Cache Partitioning 11
2.4 Embedded System Processors 12
2.5 Performance of Shared Last-Level Caches 13
2.6 Cache Thrashing and Cache Pollution 15
2.7 Prevalence of Multi-Threaded Applications 16
2.8 Summary 17

3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE 18
3.1 Introduction 18
3.1.1 Different Types of Application Workloads 19
3.2 Basic Structure of CB-DPET 22
3.3 DPET 22
3.4 DPET-V: A Thrash Proof Version 24
3.5 Policy Selection Technique 31
3.6 Thread Based Decision Making 33
3.7 Block Status Based Decision Making 34
3.8 Modifications in the Proposed Policies 34
3.9 Comparison between LRU, NRU and CB-DPET 35
3.9.1 Performance of LRU 35
3.9.2 Performance of NRU 36
3.9.3 Performance of DPET 37
3.10 Results 38
3.10.1 CB-DPET Performance 39
3.11 Summary 43

4. LOGICAL CACHE PARTITIONING TECHNIQUE 45
4.1 Introduction 45
4.2 Logical Cache Partitioning Method 46
4.3 Replacement Policy 47
4.4 Insertion Policy 47
4.5 Promotion Policy 48
4.6 Boundary Condition 49
4.7 Comparison with LRU 53
4.8 Results 53
4.8.1 LCP Performance 54
4.9 Summary 57

5. SHARING AND HIT-BASED PRIORITIZING TECHNIQUE 58
5.1 Introduction 58
5.2 Sharing and Hit-Based Prioritizing Method 59
5.3 Replacement Priority 60
5.4 Replacement Policy 61
5.5 Insertion Policy 66
5.6 Promotion Policy 66
5.7 Results 66
5.7.1 SHP Performance 66
5.8 Summary 69

6. EXPERIMENTAL METHODOLOGY 70
6.1 Simulation Environment 70
6.2 PARSEC Benchmark 71
6.3 Evaluation of Many-Core Systems 71
6.4 Storage Overhead 76
6.4.1 CB-DPET 76
6.4.2 LCP 76
6.4.3 SHP 76

7. CONCLUSION AND FUTURE WORK 78
7.1 Conclusion 78
7.2 Scope for Further Work 79

REFERENCES 81
LIST OF PUBLICATIONS 97
TECHNICAL BIOGRAPHY 98
LIST OF PUBLICATIONS
[1] Muthukumar S and Jawahar P.K., “Hit Rate Maximization by Logical Cache
Partitioning in a Multi-Core Environment”, Journal of Computer Science,
Vol. 10, Issue 3, pp. 492-498, 2014.
[2] Muthukumar S and Jawahar P.K., “Sharing and Hit Based Prioritizing
Replacement Algorithm for Multi-Threaded Applications”, International
Journal of Computer Applications, Vol. 90, Issue 12, pp. 34-38, 2014.
[3] Muthukumar S and Jawahar P.K., “Cache Replacement for Multi-Threaded
Applications using Context-Based Data Pattern Exploitation Technique”,
Malaysian Journal of Computer Science, Vol. 26, Issue 4, pp. 277-293,
2014.
[4] Muthukumar S and Jawahar P.K., “Redundant Cache Data Eviction in a
Multi-Core Environment”, International Journal of Advances in Engineering
and Technology, Vol. 5, Issue 2, pp.168-175, 2013.
[5] Muthukumar S and Jawahar P.K., “Cache Replacement Policies for
Improving LLC Performance in Multi-Core Processors”, International
Journal of Computer Applications, Vol. 105, Issue 8, pp. 12-17, 2014.
1. INTRODUCTION
The advent of multi-core computers, consisting of two or more processor cores
working in tandem on a single integrated circuit, has produced a phenomenal
breakthrough in performance in recent years. Efficient memory
management is required to realize the full potential of Chip Multi-core
Processors (CMPs).
One of the main concerns in recent times is the processor-memory gap: the
scenario where the processor stalls while a memory operation completes.
Fig 1.1 shows that this situation is getting worse, as processor performance
continues to increase manifold while memory latency remains almost the same.
On-chip cache memory clock speeds have hit a wall, where practically no
further improvement can be achieved. The use of multiple cores has mitigated
this effect to some extent, but the gap is expected to widen further in the
future, stressing the need for effective memory management techniques.
Fig 1.1 Processor memory gap [1]
1.1. Chip Multi-core Processors (CMP)
In a CMP, multiple processor cores are packaged into a single physical
processor chip that is capable of executing multiple tasks simultaneously.
Multi-core processors allow faster execution of applications by exploiting the
available parallelism (mainly thread-level parallelism), which implies the
ability to work on multiple problems simultaneously. Parallel multi-threaded
applications have high computational and throughput requirements. To expedite
their execution, these applications spawn multiple threads which run in
parallel. Each thread has its own execution context and is assigned a
particular task. In a multi-core architecture, these threads can run on the
same core or on different cores. A two-core processor with two threads T1 and
T2 is depicted in Fig 1.2.
Fig 1.2 Two core processor-T1 and T2 represent the threads
Every core now behaves like an individual processor and has its own
execution pipeline. The cores can execute multiple threads in parallel,
expediting the running time of the program as a whole and boosting overall
system performance and throughput. The efficiency of CMPs largely relies on
how well a software application can leverage the multiple cores during
execution. Many applications that have emerged in recent times have been
optimized to exploit multiple cores (when present) to improve performance.
Networking applications and other CPU-intensive applications have benefitted
hugely from the CMP architecture.
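As a small illustration of this execution model, the following Python sketch (illustrative only; the worker function and the data split between the threads are hypothetical) spawns two threads T1 and T2, each with its own task; the OS scheduler decides whether they run on the same core or on different cores:

```python
import threading

results = {}

def worker(name, data):
    # Each thread has its own execution context and a particular task:
    # here, summing one half of the input range.
    results[name] = sum(data)

# Two threads T1 and T2, as in Fig 1.2.
t1 = threading.Thread(target=worker, args=("T1", range(0, 500)))
t2 = threading.Thread(target=worker, args=("T2", range(500, 1000)))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["T1"] + results["T2"])  # 499500, i.e. sum(range(1000))
```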
1.2. Memory Management in CMP
Memory management forms an integral part in determining the performance of
CMPs. Cache memory is used to hide the latency of main memory and also helps
reduce processing time in CMPs. Generally, every core in a CMP environment has
a private L1 cache, and all the available cores share a relatively larger
unified L2 cache (which in most cases is also referred to as the shared Last
Level Cache). The L1 cache is split into an ‘instruction cache’ and a ‘data
cache’ as shown in Fig 1.3. The mapping method used is ‘set-associative’
mapping [1], where the cache is divided into a specified number of sets, each
set holding a specified number of blocks.
The number of blocks in each set indicates the associativity of the cache.
The size of the L1 cache has to be kept small owing to chip constraints. Since
it is directly integrated into the chip that carries the processor, its data
retrieval time is short.
Fig 1.3 General memory hierarchy in a CMP environment
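To make the set-associative mapping concrete, the following sketch shows how a byte address is split into tag, set index and block offset. The cache parameters here (64-byte blocks, 512 sets, 8-way) are assumptions chosen for illustration, not the configuration evaluated in this thesis:

```python
# Assumed (hypothetical) geometry: 64-byte blocks, 512 sets, 8-way
# associativity -> a 256 KB cache.
BLOCK_SIZE = 64   # bytes per block
NUM_SETS = 512
ASSOC = 8         # blocks per set = associativity of the cache

def decompose(addr):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % BLOCK_SIZE            # byte within the block
    block_addr = addr // BLOCK_SIZE
    set_index = block_addr % NUM_SETS     # which set the block maps to
    tag = block_addr // NUM_SETS          # identifies the block in the set
    return tag, set_index, offset

print(decompose(0x12345))  # (2, 141, 5)
```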
1.3. Overview of Existing Replacement Policies
Replacement policies play a vital role in determining the performance of a
system. In a multi-core environment, where processing speed and throughput
are of the essence, selecting a suitable replacement technique for the
underlying cache memories becomes crucial. The primary objective of any
replacement algorithm is to utilize the available cache space effectively and
reduce the number of data misses to the maximum extent possible. Misses are of
three types, namely compulsory misses, conflict misses and capacity misses.
Compulsory misses happen when a data item is fetched from main memory and
stored into the cache for the first time; in most cases they are unavoidable.
Conflict misses are common in direct-mapped cache architectures, where more
than one data item competes for the same cache line, resulting in eviction of
the previously inserted data item. As the name suggests, capacity misses arise
from the inability of the cache to hold the entire working set, which can be
observed in most applications owing to the small cache size. An efficient
replacement algorithm targets the capacity misses and tries to reduce them as
much as possible.
Since the L2 cache is accessed by multiple cores and holds shared data,
judicious replacement strategies need to be adopted here in order to improve
performance. At present, many applications make use of the LRU cache
replacement algorithm [1], which discards the ‘oldest’ cache line available,
i.e. the cache line that has not been accessed for a period of time. This
implies that the ‘age’ of every cache line needs to be updated after every
access, which is done with the help of either a counter or a queue data
structure. In the queue-based approach, data referenced by the processor is
moved to the top of the queue, pushing all elements below it down by one step.
At any point of time, to accommodate an incoming block, LRU deletes the block
at the bottom of the queue (the least recently used data).
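The queue-based LRU bookkeeping described above can be sketched as follows (a minimal single-set model; `OrderedDict` plays the role of the queue):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set managed by queue-based LRU."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.blocks = OrderedDict()   # insertion order doubles as the queue

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # promote to most recent (top)
            return True                      # hit
        if len(self.blocks) >= self.assoc:
            self.blocks.popitem(last=False)  # evict least recent (bottom)
        self.blocks[block] = None
        return False                         # miss

s = LRUSet(assoc=2)
hits = [s.access(b) for b in [1, 2, 1, 3, 2]]
print(hits)  # [False, False, True, False, False]
```

The final access to block 2 misses because the insertion of block 3 evicted block 2, which had become the least recently used line.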
This method works well for many current applications. However, for certain
workloads with an unpredictable data reference pattern that fluctuates
frequently with time, LRU can result in poor performance. A closer look at LRU
reveals the cause of this problem. It assumes that all applications follow the
same reference pattern, i.e. that they abide by the spatial and temporal
locality principles. Any data item that is not referenced by the processor for
a considerable amount of time is tagged as ‘least recently used’ and evicted
from the cache. When such a data item is required by the processor in the
future, it is not found in the cache and has to be fetched from main memory.
The Most Recently Used (MRU) policy works in a similar fashion, with the
slight difference that the victim is chosen from the opposite end of the data
structure, i.e. the most recently referenced data is evicted to accommodate an
incoming block. The above algorithms work fine when the ‘locality of
reference’ is high, but for applications whose data sets do not exhibit any
degree of regularity in their access pattern, they might result in sub-optimal
performance. An example of such a pattern is given in Fig 1.4. In the ensuing
sections, individual data items are referred to by enclosing them within curly
braces.
Fig 1.4 Data sequence containing a recurring pattern {1,5,7}
As observed from Fig 1.4, the starting pattern {1,5,7} recurs after a long
burst of random data. If LRU is applied to this set of data, it results in
poor performance, as there is no regularity in the access pattern.
Pseudo-LRU keeps a single status bit for each cache block. Initially, all
bits are reset; every access to a block sets its bit, indicating that the
block was recently used. This too may degrade performance for the above data
sequence.
Another replacement technique commonly employed apart from LRU and MRU is
the Not Recently Used (NRU) algorithm. NRU associates a single-bit counter
with every cache line to aid replacement decisions. It can be effective
compared to LRU in certain cases, but a 1-bit counter may not be powerful
enough to provide the required dynamicity.
The Random replacement algorithm arbitrarily selects a victim for
replacement when necessary; it keeps no information about the cache access
history. The Least Frequently Used (LFU) technique tracks the number of times
a block is referenced using a counter, and the block with the lowest count is
selected for replacement.
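The single-bit NRU bookkeeping can be illustrated with the following simplified sketch (an assumption-laden model: real hardware may pick any line whose bit is clear, whereas this code picks the first such line deterministically):

```python
class NRUSet:
    """One cache set with a 1-bit 'recently used' flag per line."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.ru = {}  # block -> 1-bit recently-used flag

    def access(self, block):
        if block in self.ru:
            self.ru[block] = 1            # hit: mark recently used
            return True
        if len(self.ru) >= self.assoc:    # miss on a full set: evict
            victims = [b for b, bit in self.ru.items() if bit == 0]
            if not victims:               # every bit set: clear them all
                for b in self.ru:
                    self.ru[b] = 0
                victims = list(self.ru)
            del self.ru[victims[0]]       # deterministic choice (sketch)
        self.ru[block] = 1                # insert with the bit set
        return False

s = NRUSet(assoc=2)
outcome = [s.access(b) for b in [1, 2, 1, 2, 3]]
print(outcome)  # [False, False, True, True, False]
```

When block 3 arrives, both lines have their bits set, so all bits are cleared before a victim is chosen; with only one bit of history, the policy cannot distinguish which of the two lines is the better victim, which is the lack of dynamicity noted above.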
1.4. The Problem
An effective replacement algorithm is important for cache management in a
CMP. For the on-chip cache it becomes even more crucial, as the cache memory
is considerably small. To fine-tune cache performance, it is imperative to
have an efficient replacement technique in place. Apart from being efficient,
it should also be dynamic, i.e. it should adjust itself to changing data
access patterns.
The most prevalently used replacement algorithms, such as Least Recently
Used, Most Recently Used and Not Recently Used, have been shown to work well
for many applications, but they are not dynamic in nature, as discussed in the
earlier section. Irrespective of the data pattern exhibited by an application,
they follow the same technique to select the replacement victim.
It was observed that if a replacement algorithm could make decisions based
on the access history and the current access pattern, the number of cache hits
can increase considerably, thereby reducing the number of misses and improving
overall performance. The primary aim of this dissertation is to propose and
design such dynamic, novel cache replacement algorithms and evaluate them
against the conventional LRU approach.
1.5. Scope of Work
Managing cache memory involves many challenges that need to be addressed
from various angles. The behavior of various existing replacement policies on
the shared last level cache is analyzed. In multi-threaded applications,
shared data has to be handled with care: when there is a miss on a data item
shared by multiple threads, the execution of many threads gets stalled.
Replacement decisions need to be dynamic and self-adjusting to changes in the
workload pattern. Since the cache is small, the available cache space must be
utilized efficiently. The techniques proposed in this work cover all the above
aspects while refraining from in-depth treatment of cache memory hardware
architecture optimizations.
1.6. Research Objectives
The following are the objectives of this research:
- To augment the cache subsystem of a CMP with novel replacement algorithms
for improving the performance of the shared Last Level Cache (LLC).
- To propose a dynamic and self-adjusting cache replacement policy with low
storage overhead and robustness across workloads.
- To manage the shared cache efficiently between multiple applications.
- To make the LLC thrash-resistant and scan-resistant and maximize its
utilization, which helps accelerate overall performance for memory-intensive
workloads.
1.7. Contributions
In order to address the shortcomings of conventional replacement
algorithms, three novel counter-based replacement strategies targeted at the
shared LLC have been proposed and evaluated over a set of multi-threaded
benchmarks.
- The first of the three methods, the Context Based Data Pattern Exploitation
Technique, maximizes cache performance by closely monitoring and adapting
itself to the pattern occurring in the workload sequence with the help of a
block-level 3-bit counter. Results show that the overall number of hits
obtained at the L2 cache level is on average 8% higher than with the LRU
approach, and an IPC speedup of up to 6% is achieved.
- Logical Cache Partitioning, the second technique, logically partitions the
cache into four zones, namely Most likely to be referenced, Likely to be
referenced, Less likely to be referenced and Never likely to be referenced,
in decreasing order of likeliness. Replacement decisions are made based on
the likeliness of a block being referenced by the processor in the near
future. Results show an improvement of up to 9% in the overall number of hits
and 5% in IPC speedup compared to LRU.
- Finally, Sharing and Hit-based Prioritizing is a novel counter-based
replacement technique that makes replacement decisions based on the sharing
nature of the cache blocks. Every cache block is classified into one of four
degrees of sharing, namely not shared (or private), lightly shared, heavily
shared and very heavily shared, with the help of a 2-bit counter. This
sharing-degree information is combined with the number of hits received by
the block, based on which replacement decisions are made. Results show that
this method achieves a 9% improvement in the overall number of hits and an 8%
improvement in IPC speedup compared to the baseline LRU policy.
1.8. Outline of the Thesis
This thesis is organized into seven chapters, as follows.
Chapter 2 consists of the literature survey, which discusses a versatile set
of studies on cache management in single-core and multi-core architectures.
Chapter 3 presents a novel counter-based replacement technique, namely the
Context Based Data Pattern Exploitation Technique. It first presents the
different types of workloads and the basic structure of the policy, then
elaborates on the algorithm and results, and finally summarizes the method.
Chapter 4 presents the Logical Cache Partitioning technique. It first
presents the proposed replacement policy and then evaluates its significance
and results compared to LRU.
Chapter 5 describes the novel replacement policy, the Sharing and Hit-based
Prioritizing technique, and evaluates its performance.
Chapter 6 discusses the simulation environment and the benchmarks used to
evaluate all three methods, together with the results obtained for all the
techniques, comparing and contrasting them with the results obtained for the
same set of benchmarks under the conventional LRU approach. The final section
of this chapter presents the storage overhead.
Chapter 7 wraps up by presenting the conclusion and directions for future
work.
1.9. Summary
This chapter has provided an overview of the problems associated with
existing cache replacement policies for the shared last level cache in chip
multiprocessors; this forms the essential foundation for Chapters 3, 4 and 5.
It has presented an outline of the thesis, specifically devoting three
chapters to the drawbacks associated with the shared LLC in CMPs. The next
chapter presents a survey of the literature on cache replacement policies for
multi-core processors.
2. LITERATURE SURVEY
2.1. Replacement Policies
Replacement policies play a crucial role in CMP cache management.
Considering the on-chip cache memory of a processor, choosing the right
replacement technique is extremely important in determining the performance of
the system. The more adaptive a replacement technique is to the incoming
workload, the lower the overhead involved in frequently transferring data to
and from secondary memory. John Odule and Ademola Osingua have proposed a
dynamically self-adjusting cache replacement algorithm [2] that makes page
replacement decisions based on changing access patterns by adapting to the
spatial and temporal features of the workload. Though it adapts dynamically to
workload needs, it does not detect changes or make decisions at a finer level,
i.e. at the block level, which can be crucial when the working set changes
frequently.
Many algorithms that work better than LRU on specific workloads have
evolved in recent times. One such algorithm is the Dueling Clock (DC)
algorithm [3], designed for the shared L2 cache; however, it does not consider
applications that involve multiple threads executing in parallel. The
Effectiveness-Based Replacement technique (EBR) [4] ranks the cache elements
within a set based on recency and frequency of access and chooses the
replacement victim based on those ranks. In a multi-threaded environment, a
miss on shared data can be pivotal and calls for more attention than is
provided here. Competitive cache replacement [5] proposes a replacement
algorithm for inclusive, exclusive and partially-inclusive caches by
considering two knowledge models. Self-correcting LRU [6] proposes a technique
in which LRU is augmented with a framework to constantly monitor and correct
the replacement mistakes LRU tends to make during execution. Adaptive block
management [7] proposes a method to use the victim cache more efficiently,
thereby reducing the number of accesses to the L2 cache.
2.2. Data Prefetching and Cache Bypassing
Data prefetching can be very useful in any cache architecture, since it
helps reduce the cache miss latency [8]. The Prediction-Aware
Confidence-based Pseudo LRU (PAC-PLRU) algorithm [9] efficiently makes use of
prefetched information and also prevents that information from being evicted
from the cache unnecessarily. There can also be scenarios where the overhead
can be reduced by promoting the data fetched from main memory directly to the
processor, bypassing the cache [10]. Hongliang Gao and Chris Wilkerson’s Dual
Segmented LRU algorithm [11] has explored the benefits of cache bypassing in
detail. Though cache bypassing [12] can help reduce transfer overhead, it may
not agree with the ‘locality principles’, i.e. bypassing a data item likely to
be referenced in the near future might not be a good idea.
The counter-based cache replacement and cache bypassing algorithm [13]
tries to locate dead cache lines well in advance and chooses them as
replacement victims, thus preventing such lines from polluting the cache. An
augmented cache architecture is one that comprises two parts, a larger
direct-mapped cache and a smaller fully-associative cache, which are accessed
in parallel. A replacement technique similar to LRU, known as the LRU-Like
algorithm [14], was proposed for augmented cache architectures.
Backing up a few of the evicted lines in another cache can help increment
the overall hits received. The Adaptive Spill-Receive Cache [15] works along
similar lines, with a spiller cache (one that ‘spills’ the evicted line) and a
receiver cache (one that stores the ‘spilled’ lines). But if the cache line
reuse prediction is not efficient, this can result in many stale cache lines
as the algorithm executes.
2.3. Effectiveness of Cache Partitioning
In a multi-processor environment, efficient cache partitioning can help
improve the performance and throughput of the system [17,18,19].
Partition-aware replacement algorithms exploit the cache partitions and try to
minimize cache misses [20,21]. Filter-based partitioning techniques have been
widely proposed [22,24]. Adaptive Bloom Filter Cache Partitioning (ABFCP) [23]
involves dynamic cache partitioning at periodic intervals, using Bloom filters
and counters to adapt to the needs of concurrent applications. But since every
processor core is allotted an array of Bloom filters, the hardware complexity
can increase as the number of cores grows. The Pseudo-LRU algorithm (similar
to LRU but with lower hardware complexity) has been shown to work well with
partitioned caches [25]. Studies have also been conducted on enhancing the
performance of Non-Uniform Cache Architectures (NUCA) [26,27,29]. A dynamic
cache management scheme [28] aimed at the Last-Level Caches in NUCA claims to
have improved system performance.
Although many schemes work well by adapting to the dynamic data
requirements of the applications, they do not attach importance to data that
is shared between the threads of parallel workloads. Hits on such data items
not only improve the performance of those workloads, but also reduce the
overhead involved in frequently transferring the data to and from secondary
memory. Dynamic cache clustering [31] groups cache blocks into multiple
dynamic cache banks on a per-core basis to expedite data access; placing new
incoming data items into the appropriate cache bank at run time can induce
transfer overheads, though. The smaller split L-1 data caches proposal [32]
offers a split design for the L1 data cache using separate L-1 array and L-2
scalar caches, which has shown performance improvements. Yong Chen et al.
suggest the use of a separate cache to store the data access histories and
study its associated prefetching mechanism, which has produced performance
gains [33].
2.4. Embedded System Processors
Embedded processors differ from conventional processors in that they are
tailor-made for very specific applications. Replacement policies have been
proposed that target embedded systems [36,37,38]. The Hierarchical
Non-Most-Recently-Used (H-NMRU) replacement policy [35] is one such algorithm
that has been shown to work well on those processors. [39] deals with the
cache block locking problem and provides a heuristic and efficient solution,
whereas [40] proposes improved procedure placement policies in the cache
memory. A Two Bloom Filter (TBF) based approach [42], proposed primarily for
flash-based cache systems, has been shown to work well compared to LRU. The
proposal by Pepijn de Langen et al. [43] targets the performance of
direct-mapped caches for application-specific processors. Dongxing Bao et al.
suggest a method [44] to improve the performance of direct-mapped caches in
which blocks are evicted based on their usage efficiency; this method has
shown improvement in miss rate.
2.5. Performance of Shared Last-Level Caches
The performance of the shared LLC in a multi-core architecture plays a
crucial role in deciding the overall performance of the system, since all the
cores access the data present there. A variety of studies [45,46,47] have been
conducted on improving shared LLC performance. Phase Change Memory (PCM) with
an LLC forms an effective low-power alternative to the traditional DRAMs in
existence. Writeback-aware Cache Partitioning (WCP) [52] efficiently
partitions the LLC in PCM among multiple applications, and the Write Queue
Balancing (WQB) replacement policy [52] helps improve the throughput of PCM.
But one of the main drawbacks of PCM is that writes are much slower than in
DRAM.
The Adaptive Cache Management technique [53] combines SRAM and DRAM
architectures into the LLC and proposes an adaptive DRAM replacement policy to
accommodate the changing access patterns of applications. Though it reports
improvement over LRU, it does not discuss the hardware overhead that might
result from the change in architecture. Frequency-based LRU (FLRU) [54] takes
the recent access information, as LRU does, along with partition and frequency
information, when making replacement decisions.
Cache Hierarchy Aware Replacement (CHAR) [55] makes replacement decisions
in the LLC based on activities in the inner levels of the cache: it studies
the data patterns that hit or miss in the higher levels of the cache and makes
appropriate decisions in the LLC. Yingying Tian et al. [56] have proposed a
method that functions along similar lines for making decisions at the LLC. But
propagating such information frequently across the various levels of cache
might slow the system down over time. Cache space is another area of concern.
Instead of utilizing the entire cache space, an Adaptive Subset Based
Replacement Policy (ASRP) [57] has been proposed, where the cache space is
divided into multiple subsets and at any point of time only one subset is
active; whenever a replacement needs to be made, the victim is selected only
from the active subset. But there is a possibility of under-utilization of the
available cache space in this technique. For any replacement algorithm to be
successful, it is essential to make good use of the available cache space.
This can be achieved by eradicating the dead cache lines, i.e. the lines that
will never be referenced by the processor in the future, and much research has
been carried out in this area [58,59]. The problem here is that for
applications with an unpredictable data access pattern, dead-line prediction
becomes intricate, and in the long run the predictor might lose its accuracy.
Vineeth Mekkat et al. [60] have proposed a method that focuses on improving
LLC performance on heterogeneous multi-core processors by targeting and
optimizing certain Graphics Processing Unit (GPU) based applications, since
they spawn a huge number of threads compared to other applications.
The effect of miss penalty can have a serious impact on system efficiency.
The Locality Aware Cost Sensitive (LACS) replacement technique [61] works on
reducing the miss penalty in caches. It calculates the number of instructions
issued during a miss for every cache block, and when a victim needs to be
selected for replacement, it selects the block that issued the minimum number
of instructions during its miss. While this helps reduce the miss penalty, it
does not attach importance to the data access pattern of the application under
execution. Another replacement policy, called the LR+5LF policy [62], combines
the concepts of LRU and LFU to find a replacement candidate; hardware
complexity is an issue here. Replacement algorithms that work well with the L1
cache may not suit the L2 cache, so choosing an efficient pair of replacement
algorithms for L1 and L2 becomes important for maximizing the hit rate [63].
Studies have indicated that execution history, such as the initial hardware
state of the cache, can also impact the sensitivity of replacement algorithms
[64]. Memory Level Parallelism (MLP) involves servicing multiple memory
accesses simultaneously, which saves the processor from stalling on a single
long-latency memory request. The replacement technique proposed in [65]
incorporates MLP and dynamically makes a replacement decision on a miss such
that memory-related stalls are reduced.
Tag-based data duplication may occur in caches, which can be exploited to
eliminate lower-level memory accesses, thereby reducing memory latency [66].
Demand-based associativity [67] involves varying the associativity of a cache
set (the total number of cache blocks per set) according to the requirements
of the application’s data access pattern; storage complexity is high in this
method, though. Cooperative caching [68] proposes a unified framework for
managing the on-chip cache. Set pinning [69] proposes a technique for avoiding
inter-core misses and reducing intra-core misses in a shared cache, which has
shown improvement in miss rate.
Adaptive caches [70] propose a method that dynamically selects between two
of four cache replacement algorithms (LRU, LFU, FIFO and Random) by tracking
locality information. The application-aware cache [71] tracks the lifetime of
cache blocks in the shared LLC at runtime for parallel applications and
thereby utilizes the cache space more efficiently. EDIP [72] proposes a method
for memory-intensive workloads in which a new block is inserted at the LRU
position, and claims to obtain a lower miss rate than LRU for workloads with
large working sets.
2.6. Cache Thrashing and Cache Pollution
One predominant problem with LRU is that it may result in thrashing (a
condition where most of the time is spent on replacement and there are hardly
any hits) when applied to memory-intensive applications. Investigations have
been conducted into reducing the effect of thrashing while making replacement
decisions [73,74]. Adaptive insertion policies [76] tweak the insertion policy
of LRU to reduce cache misses for memory-intensive workloads; the resulting
LRU Insertion Policy (LIP) outperforms LRU with respect to thrashing. The OS
buffer management technique [73] tries to reduce thrashing in LLCs by
improving the management of OS buffer caches; since it has to operate at the
OS level, it may pull down system performance. The Evicted-Address Filter
mechanism [75] keeps track of recently evicted blocks to predict the reuse of
incoming cache blocks, thereby mitigating the effects of thrashing, but at the
cost of implementation complexity.
Cache pollution happens when the cache is predominantly filled with stale
data that the processor will never reference in the future. Inefficient cache
replacement algorithms can lead to cache pollution, and cache line reuse
prediction techniques [77,78,79] are required to avoid or mitigate it. The
software-only pollute-buffer approach [45] works on eliminating LLC polluters
dynamically using a special buffer controlled at the operating system level,
but as the algorithm executes it may burden the OS, degrading performance. A
pollution-alleviative L2 cache [80] proposes a method to avoid the cache
pollution that might arise from wrong-path execution in the case of
misprediction, and claims to have improved the IPC and cache hit rate. The
scalable shared cache [81] proposes a method for managing thrashing
applications in the shared LLC of embedded multi-core architectures.
2.7. Prevalence of Multi-Threaded Applications
Concurrent multi-threaded applications have become prevalent in today’s
CMPs. Work has been done on how effectively a cache can be shared by multiple
threads running in parallel [82]. Co-operative cache partitioning [83] works
by efficiently allocating cache resources to threads concurrently executing on
multiple cores. The Thread Aware Dynamic Insertion Policy (TADIP) [84] takes
the memory requirements of individual applications into consideration when
making replacement decisions. The Adaptive Timekeeping Replacement (ATR)
policy [85] is designed to be applied primarily at the LLC. It assigns a
‘lifetime’ to every cache line; unlike other replacement strategies such as
LRU, it takes process priority levels and application memory access patterns
into consideration and adjusts the cache line lifetime accordingly to improve
performance. But the conditions that can lead to thrashing have not been
considered in either method [84,85]. Pseudo-LIFO [86] suggests a method for
the early eviction of dead blocks and retention of live blocks, which improves
cache utilization and average CPI for multi-programmed workloads.
2.8. Summary
This chapter has presented prior work on various cache replacement
algorithms, including the use of cache partitioning and bypassing, together
with the additional overhead involved in their implementation. An exhaustive
study of prior work on cache thrashing and cache pollution has helped in the
design of dynamic cache replacement schemes that can be applied in
high-performance processors. The next chapter discusses a counter-based cache
replacement technique for the shared last level cache in multi-core
processors.
3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE
Data access patterns of workloads vary across applications. A replacement
algorithm can aid in improving system performance only when it works well for
the majority of applications in existence. In short, if a replacement
algorithm could make decisions based on the access history and the current
access pattern of the workloads, the number of hits received can increase
considerably [87,88,89].
The cache replacement policy in use acts as a main deciding factor of
system performance and efficiency in chip multi-core processors. Conventional
replacement approaches work well for applications that have a well-defined
and predictable access pattern for their workloads. But when the workload of
an application is dynamic and constantly changes pattern, traditional
replacement techniques may fail to improve the cache metrics. Hence this
chapter discusses a novel cache replacement policy targeted at applications
with dynamically varying workloads. The Context Based Data Pattern
Exploitation Technique assigns a counter to every block of the L2 cache. It
then closely observes the data access patterns of various threads and modifies
the counter values appropriately to maximize overall performance. Experimental
results obtained using the PARSEC benchmarks show an average improvement of 8%
in overall hits at the L2 cache level when compared to the conventional LRU
algorithm.
3.1. Introduction
Memory management plays an essential role in determining the performance of
CMPs. Cache memory is indispensable in CMPs for enhancing data retrieval time
and achieving high performance. The level 1 (L1) cache needs to be small owing
to hardware limitations, but since it is directly integrated into the chip
that carries the processor, accessing it typically takes only a few processor
cycles compared to the access time of the next higher levels of memory.
Accessing the L2 cache may not be as fast as accessing the L1 cache, but since
it is larger and is shared by all the available cores, managing it effectively
becomes more important. This chapter puts forth a dynamic replacement approach
called CB-DPET that is targeted mainly at the shared L2 cache.
3.1.1. Different Types of Application Workloads
The following are the types of workload patterns [90] exhibited by many
applications. Let 'x' and 'y' denote sequences of references, where 'xi' denotes
the address of the i-th cache block. 'a', 'p' and 'n' are integers, where 'a' denotes
the number of times a sequence repeats itself.
Thrashing Workload:
[… (x1, x2, ..., xn)^a …], a > 0, n > cache size
These workloads have a working set size that is greater than the size of the
available shared cache. They may not perform well under LRU replacement
policy. Sample data pattern for a thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 80, n = 11 and a = any value > 0.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘10’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘16’, ‘80’ and so on.
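The thrashing behaviour can be reproduced with a short simulation. The sketch below (an illustration, not part of the proposed design) replays a working set of 11 distinct blocks through an 8-block LRU set; because each block is evicted just before it is reused, repeated passes over the pattern never hit.

```python
from collections import OrderedDict

def lru_hits(sequence, capacity):
    """Count LRU hits for one fully associative set of 'capacity' blocks."""
    cache, hits = OrderedDict(), 0          # least recently used first
    for item in sequence:
        if item in cache:
            hits += 1
            cache.move_to_end(item)         # promote to most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict the LRU block
            cache[item] = True
    return hits

pattern = [8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80]   # n = 11 > cache size 8
assert lru_hits(pattern * 3, 8) == 0               # every pass misses entirely
```

With a = 3 repetitions of the pattern, all 33 accesses miss, illustrating why LRU performs poorly on thrashing workloads.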
Regular Workload:
[… (x1, x2, ..., xn-1, xn, xn-1, ..., x2, x1)^a …], n > 0, a > 0
Applications possessing this type of workload benefit hugely from the cache
usage. The LRU policy suits them perfectly, since the data elements follow a
stack-based pattern. Sample data pattern for a regular workload for a cache which can hold 8
blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 4, 20, 11 . . .
Here, cache_size = 8 blocks, x1 = 8, x7 = 4, x8 = 5, n = 8 and a = any value > 0.
After the cache gets filled up with the eighth element '5', the remaining elements
'4', '20' and '11' result in back-to-back cache hits.
Random Workload:
[… (x1, x2, ..., xn) …], n → ∞
Random workloads possess poor cache re-use. They generally tend to thrash
under any replacement policy. Sample data pattern for a random workload for a
cache which can hold 8 blocks at a time can be as follows
. . . 1, 9, 2, 8, 0, 12, 22, 99, 101, 50, 3 . . .
Recurring Thrashing Workload:
[… (x1, x2, ..., xn)^a … (y1, y2, ..., yp) … (x1, x2, ..., xn)^a …], a > 0, p > 0, n > cache size
Sample data pattern for a recurring thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 7, 17, 46, 10, 16, 80, 90, 99, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 46, y1= 10, y5= 99, n = 11 and a = 1.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘7’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘17’, ‘46’ and so on. By
the time the initial pattern recurs towards the end, it would have been replaced
resulting in cache misses towards the end.
Recurring Non-Thrashing Workload:
[… (x1, x2, ..., xn)^a … (y1, y2, ..., yp) … (x1, x2, ..., xn)^a …], a > 0, p > 0, n <= cache size
It is similar to the recurring thrashing workload, but the recurring working set fits
within the cache (n <= cache size).
Sample data pattern for a recurring non-thrashing workload for a cache which can
hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x8 = 5, y1= 10, y3= 80, n = 8 and a = 1.
On first scan, elements ‘8’ till ‘5’ will be fetched and loaded into the cache. When
the next element ‘10’ arrives, the cache will be full and hence replacement has to
be made. Same is the case for remaining data items like ‘16’ and ‘80’. Again the
initial pattern recurs towards the end. If LRU is applied here, it will result in 3 cache
misses towards the end as it would have replaced data items ‘8’, ‘9’ and ‘2’.
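A quick simulation (illustrative only) confirms this behaviour: under LRU the recurring prefix {8, 9, 2} has already been evicted by the intervening burst, so the last three accesses all miss.

```python
from collections import OrderedDict

def lru_trace(sequence, capacity):
    """Return 'h'/'m' per access for an LRU-managed set of 'capacity' blocks."""
    cache, outcome = OrderedDict(), []
    for item in sequence:
        if item in cache:
            cache.move_to_end(item)          # hit: becomes most recently used
            outcome.append('h')
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[item] = True
            outcome.append('m')
    return outcome

seq = [8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2]
assert lru_trace(seq, 8)[-3:] == ['m', 'm', 'm']   # recurring {8, 9, 2} misses
```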
In the recurring non-thrashing pattern, if 'n' is large, LRU can result in sub-
optimal performance. Data items x1 to xn, which recur after a short burst of
random data (y1 to yp), may result in cache misses. The problem here is that even
if a data item has received many cache hits in the past, if it is not referenced by the
processor for a brief period of time, LRU evicts the item from the cache. When
the processor looks for that data item again it results in a miss. This stresses the
need for a replacement technique that adapts itself to the changing workload and
makes replacement decisions appropriately.
The contributions of this chapter include:
A novel counter based replacement algorithm called CB-DPET which tries
to maximize the cache performance by closely monitoring and adapting
itself to the pattern that occurs in the sequence.
Association of a 3-bit counter (which will be referred to as the PET counter
from here onwards) with every cache block in the cache set.
A set of rules to adjust the counter value whenever an insertion, promotion
or a replacement happens such that the overall hit percentage is
maximized.
Extension of the method to suit multi-threaded applications and also to
avoid cache thrashing.
PET counter boundary determination such that there will not be any stale
data in the cache for longer periods of time, i.e. the counters of non-accessed
blocks 'mature' gradually and such blocks get removed at some point of time to
pave the way for new incoming data sets.
3.2. Basic Structure of CB-DPET
This section and the following sub-sections elaborate on the basic concepts
that make up CB-DPET, namely DPET and DPET-V. The mapping method used is
set-associative mapping [1]. A counter called the PET counter is associated with
every block in the cache set. Conceptually, this counter holds the 'age' of each
cache block. The maximum value that the counter can hold is set to '4', which
implies that only 3 bits are required to implement it at the hardware level, thereby
not increasing the complexity. The minimum value that the counter can hold is '0'.
The maximum upper bound that a 3-bit counter can hold is actually '7' (2^3 - 1),
but the value of '4' has been chosen. This is because any value chosen past '6'
can result in stale data polluting the cache for longer periods of time. Similarly, if a
value below '3' had been chosen, the performance would be more or less similar to
that of NRU. Thus the upper bound was taken to be '4'. The algorithm which
implements the task of assigning the counter values based on the access pattern
is referred to as the 'Age Manipulation Algorithm' (AMA).
3.3. DPET
A block with a lower PET counter value can be regarded as ‘younger’ when
compared to a block having a higher PET value. Initially the counter values of all
the blocks are set to ‘4’. Conceptually every block is regarded as ‘oldest’. So in
this sense, the block which is considered to be the ‘oldest’ (PET counter value 4)
is chosen as a replacement candidate when the cache gets filled up. If the oldest
block cannot be found, the counter value of every block is incremented
continuously till the oldest cache block is found. This block is then chosen for
replacement. Once the new block has been inserted into this slot, AMA
decrements the counter associated with it by '1' (it is set to '3'). This is done
because the block can no longer be regarded as 'oldest', as it has been accessed
recently. Also, it cannot be given a very low value, as the AMA is not sure about its
future access pattern. Hence it is inserted with a value of '3'. When a data hit
occurs, that block is promoted to the ‘youngest’ level irrespective of which level it
was in previously. In this way, data items which repeat themselves in a workload
are given more time to stay in the cache.
To put this in more concrete terms,
A. DPET associates a counter, which iterates from ‘0’ to ‘4’, with every cache
block.
B. Initially the counter values of all the blocks are set to ‘4’. Conceptually every
block is regarded as ‘oldest’.
C. Now when a new data item comes in, the oldest block needs to be replaced.
Since contention can arise in this case, AMA chooses the first ‘oldest’ block
from the bottom of the cache as a tie-breaking mechanism.
D. Once the insertion is made, AMA decrements the counter associated with it by
'1' (it is set to '3'). This is done because the block can no longer be regarded
as 'oldest', as it has been accessed recently. Also, it cannot be given a very
low value, as the AMA is not sure about its future access pattern. Hence it is
inserted with the value of '3'.
E. In the case where all the blocks are filled up (without any hits in between),
there will be a situation where all the blocks will have a PET counter value of
'3'. Now when a new data item arrives, there is no block with a counter value
of '4'. As a result, AMA increments all the counter values by '1' and then
performs the check again. If a replacement candidate is found this time, AMA
executes steps C and D again. Otherwise it again increments the values. This
is carried out recursively until a suitable replacement candidate is found.
F. Now consider the situation where a hit occurs for a particular data item in the
cache. This might be the start of a specific data pattern. So AMA gives high
priority to this data item and restores its associated PET counter value to '0',
terming it the 'youngest' block.
On the other hand, if a miss is encountered, the situation becomes quite similar
to what AMA dealt with in the initial steps, i.e. it scans all the blocks from the
bottom to find the ‘oldest’ block. If found it proceeds to replace it. Otherwise the
counter values are incremented and once again the same search is performed
and this process goes on till a replacement candidate is found.
3.4. DPET-V : A Thrash-Proof Version
This method is a variant of DPET (in the following sections, it will be referred to
as DPET-V). Sometimes, for workloads with a working set greater than the
available cache size, there is a possibility that there will not be any hits at all. The
entire time will be spent only in replacement. This condition is often referred to as
thrashing. The performance of the cache can be improved if some fraction of a
randomly selected working set is retained in the cache for some more time, which
will result in an increase in cache hits. Hence, very rarely, AMA decrements the PET
counter value by ‘2’ (instead of ‘1’) after an insertion is made. So the new counter
value post insertion will be ‘2’ instead of ‘3’. Conceptually it allows more time for
the data item to stay in the cache. This technique is referred to as DPET-V. The
probability of inserting with ‘2’ is chosen approximately as 5/100, i.e. out of 100
blocks, the PET counter value of (approximately) 5 blocks will be set to ‘2’ while
the others will be set to ‘3’ during insertion.
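The probabilistic insertion rule of DPET-V can be sketched as follows; the exact random source is an implementation detail not fixed by the scheme, and the function name here is illustrative:

```python
import random

def dpetv_insert_value(p_low=0.05):
    """DPET-V insertion: roughly 5 of every 100 insertions use PET value '2',
    letting those blocks 'mature' one step later and so stay in the cache
    longer; the remaining insertions use '3', exactly as in DPET."""
    return 2 if random.random() < p_low else 3

random.seed(0)
values = [dpetv_insert_value() for _ in range(10000)]
low = values.count(2)      # close to 5% of all insertions
```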
Fig 3.1 Data sequence which will result in thrashing for DPET
Algorithm 1: Implementation for Cache Miss, Hit and Insert Operations
Init:
  Initialize all blocks with PET counter value '4'.
Input:
  A cache set instance
Output:
  removing_index /** Block id of the victim **/
Miss:
  /** Invoke the FindVictim method; the set is passed as parameter **/
  FindVictim(set):
    while 1 do /** Infinite loop **/
      removing_index = -1;
      for i = associativity-1 to 0 do /** Scanning from the bottom **/
        if set.blk[i].pet == 4 then
          removing_index = i; /** Block found. Exit loop **/
          break;
      if removing_index == -1 then
        for i = associativity-1 to 0 do
          set.blk[i].pet = set.blk[i].pet + 1; /** Age all blocks by 1 and search again **/
      else
        break; /** Break from the main while loop **/
    return removing_index;
Hit:
  set.blk.pet = 0;
Insert:
  set.blk.pet = 3;
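Algorithm 1 can be rendered as runnable Python for a single cache set. This is an illustrative sketch, not the thesis's hardware implementation; the class and field names are assumptions.

```python
class CacheSet:
    """One cache set managed by DPET: a PET counter (0-4) per block."""
    def __init__(self, associativity):
        self.blk_tag = [None] * associativity
        self.blk_pet = [4] * associativity       # every block starts 'oldest'

    def find_victim(self):
        # Scan from the bottom for a block whose counter has matured to '4';
        # if none exists, age every block by 1 and scan again. This always
        # halts, since every counter is within four increments of maturing.
        while True:
            for i in range(len(self.blk_pet) - 1, -1, -1):
                if self.blk_pet[i] == 4:
                    return i
            self.blk_pet = [p + 1 for p in self.blk_pet]

    def access(self, tag):
        """Return True on a hit, False on a miss (with replacement)."""
        if tag in self.blk_tag:
            self.blk_pet[self.blk_tag.index(tag)] = 0   # hit: 'youngest'
            return True
        victim = self.find_victim()
        self.blk_tag[victim] = tag
        self.blk_pet[victim] = 3                        # insert one step young
        return False
```

For example, on an empty 4-way set the first insertion lands in the bottom-most block with a counter of '3', and a subsequent access to the same tag is a hit that resets that counter to '0'.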
Correctness of the Algorithm
Invariant:
At any iteration i, the PET counter value of set.blk[i] holds either the value '4' or
a value in the range '0' to '3'.
Initialization:
When i = 0, set.blk[i].pet is '4' initially. After the block has been selected as
victim and the new item has been inserted, set.blk[i].pet becomes '3'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].pet contains
either '0' (if a cache hit has happened), or
'3' (if the block has been newly inserted), or
'1', '2' or '4' (if a replacement needs to be made but a block with PET counter
'4' has not yet been found). Hence the invariant holds for i = n. The same applies
when i = n+1, as the individual iterations are mutually exclusive. Thus the
invariant holds for i = n + 1.
Termination:
The internal infinite loop will terminate when a replacement candidate is found.
A replacement candidate will be found in every run, since the counter value of
each block is proven to lie in the range '0' to '4'. A candidate is found as soon as
a counter value reaches '4'. Since all the blocks are incremented by '1' until a
value of '4' is found, the boundary condition "if set.blk[i].pet == '4'" is bound to
be satisfied, terminating the loop after a finite number of iterations. The invariant
can also be seen to hold for all the cache blocks present when the loop
terminates.
It is observed that if DPET is applied over the working set shown in Fig 3.1 there
will not be any hits at all (Fig 3.2(i)). After the first burst {8,9,2,6,5,0}, the PET
counter value of all blocks is set to ‘3’. When the next burst {11,12,1,3,4,15}
arrives, none of them results in a hit. Now when the third burst arrives, again all
the old blocks are replaced with new ones. Fig 3.2(ii) shows the action of DPET-
V over the same data pattern. According to DPET-V, AMA has randomly selected
two blocks (containing data items {5} and {2}) whose counter values have been
set to ‘2’ instead of ‘3’ while inserting the data burst 1. Now when burst 2 arrives,
those two blocks are left untouched as their PET counter values have not matured
to '4' yet. Finally, when the third burst of data {2,9,5,8,6} arrives, it results in two
more hits (for {5} and {2}) compared to DPET. Hence, by giving more time for
those two blocks to stay in cache, the hit rate has been improved. Fig 3.3(i), 3.3(ii),
3.3(iii) and 3.3(iv) shows the working of CB-DPET in the form of a flow diagram.
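The example can be replayed with a small simulation. To keep the run deterministic, the sketch below fixes DPET-V's random choice so that exactly the blocks holding {5} and {2} are inserted with a PET value of '2'; everything else follows the DPET rules. Names and structure are illustrative assumptions.

```python
def dpet_hits(sequence, associativity, low_pet_items=()):
    """Count hits for one set under DPET; items in low_pet_items are inserted
    with PET '2' (DPET-V's occasional younger insertion, made deterministic
    here so the example can be checked)."""
    tags, pet = [None] * associativity, [4] * associativity
    hits = 0
    for item in sequence:
        if item in tags:
            pet[tags.index(item)] = 0                  # hit: youngest
            hits += 1
            continue
        while 4 not in pet:                            # age until one matures
            pet = [p + 1 for p in pet]
        victim = max(i for i, p in enumerate(pet) if p == 4)  # bottom-most '4'
        tags[victim] = item
        pet[victim] = 2 if item in low_pet_items else 3
    return hits

bursts = [8, 9, 2, 6, 5, 0,  11, 12, 1, 3, 4, 15,  2, 9, 5, 8, 6]
assert dpet_hits(bursts, 6) == 0                        # plain DPET: no hits
assert dpet_hits(bursts, 6, low_pet_items={5, 2}) == 2  # DPET-V keeps {5},{2}
```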
Fig 3.2 (i) Contents of the cache at the end of each burst for DPET
(ii) Contents of the cache at the end of each burst for DPET-V
Fig 3.3 (i) CB-DPET flow Diagram
Fig 3.3 (ii) Victim Selection flow
Fig 3.3 (iii) Insert flow - DPET
Fig 3.3 (iv) Insert flow - DPET-V
3.5. Policy Selection Technique
A few test sets in the cache are allocated to choose between DPET and DPET-
V. Among those test sets, DPET is applied over a few sets and DPET-V is applied
over the remaining sets and they are closely observed. Based on the performance
obtained here, either DPET or DPET-V is applied across the remaining sets in the
cache (apart from the test sets). A separate global register (which from here
onwards will be referred to as Decision Making register or DM register) is
maintained to keep track of the number of misses encountered in the test sets for
both the techniques. Initially the value of this register is set to zero.
Fig 3.4 Policy selection using DM register
Fig 3.5 CB-DPET Policy selection flow
For every miss encountered in a test set which has DPET running on it, the
register value is incremented by one, and for every miss which DPET-V produces
in its test sets, AMA decrements the register value by one. Effectively, if at any
point of time the DM register value is positive, then it means that DPET has been
issuing many misses. Similarly if the register value goes negative, it indicates that
DPET-V has resulted in more misses. Based on this value, AMA makes the apt
replacement decision at run time. This entire algorithm, which alternates
dynamically between DPET and DPET-V, is termed Context Based DPET. Fig 3.4
depicts the policy selection process and Fig 3.5 shows the policy selection flow.
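The DM-register arbitration described above can be sketched as follows. The class and method names are illustrative assumptions; a real design would use a small saturating hardware counter rather than an unbounded integer.

```python
class PolicySelector:
    """Set-dueling sketch: a signed DM register arbitrates between DPET and
    DPET-V for all non-test (follower) sets."""
    def __init__(self, dpet_test_sets, dpetv_test_sets):
        self.dpet_sets = set(dpet_test_sets)
        self.dpetv_sets = set(dpetv_test_sets)
        self.dm = 0                              # DM register starts at zero

    def record_miss(self, set_index):
        if set_index in self.dpet_sets:
            self.dm += 1                         # a DPET test set missed
        elif set_index in self.dpetv_sets:
            self.dm -= 1                         # a DPET-V test set missed

    def policy(self, set_index):
        if set_index in self.dpet_sets:
            return 'DPET'
        if set_index in self.dpetv_sets:
            return 'DPET-V'
        # Positive DM: DPET has been missing more, so followers use DPET-V;
        # zero or negative (including the very first access) defaults to DPET.
        return 'DPET-V' if self.dm > 0 else 'DPET'
```

For example, two misses in DPET test sets against one in a DPET-V test set leave the register positive, so follower sets switch to DPET-V.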
3.6. Thread Based Decision Making
The concept of CB-DPET is extended to suit multi-threaded applications with a
slight modification. For every thread, a DM register and a few test sets are allocated.
Before the arrival of the actual workload, in its allocated test sets each thread is
assigned to run DPET for one half and DPET-V for the remaining half of the test
sets. The sets running DPET (or DPET-V) need not be consecutive. These sets
are selected in a random manner. They can occur intermittently i.e. a set assigned
to run DPET can immediately follow a set that is assigned to run DPET-V or vice
versa. On the whole, the total number of test sets allocated for each thread is
equally split between DPET and DPET-V. Now when the actual workload arrives,
the thread starts to access the cache sets. Two possibilities arise here.
Case 1:
The thread can access one of its own test sets. Here, if there is a data miss, the
thread updates its DM register appropriately, depending on whether DPET or
DPET-V was assigned to that set, as discussed in the previous section.
Case 2:
The thread gets to access any of the other sets apart from its own test sets. Here,
either DPET or DPET-V is applied depending on the thread’s DM register value.
If this access turns out to be the very first cache access of that thread, then its DM
register will have a value of ‘0’, in which case the policy is chosen as DPET.
3.7. Block Status Based Decision Making
When more than one thread tries to access a data item, it can be regarded as
shared data. Compared to data that is accessed only by one thread (private data),
shared data has to be placed in the cache for a longer duration. This is because
multiple threads try to access those data at different points of time and if a miss is
encountered, the resulting overhead would be relatively high. Taking this into
consideration, it is possible to find out whether each block has shared or private
data (using the id of the thread that accesses it) and decide on the counter values
appropriately. To serve this purpose, there is a ‘blockstatus’ register associated
with every cache block. When a thread accesses that block, its thread id is stored
into the register. Now when another thread tries to access the same block, the
value of the register is set to some number that does not fall within the valid
thread-id range. So if the register holds that number, the corresponding block is
regarded as shared. Otherwise it is treated as private.
3.8. Modification in the Proposed Policies
In case of DPET, when a block is found to be shared, insertion is made with a
counter value of ‘2’ instead of ‘3’. Similarly in DPET-V, for shared data, insertion
is always made with a PET counter value of ‘2’. If a hit is encountered in either of
the methods, the PET counter value is set to ‘0’ (as discussed earlier) only if the
block is shared. For private blocks, the counter value is set to ‘1’ when there is a
hit.
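The block-status bookkeeping and the modified counter assignments can be summarised in a small sketch. The SHARED sentinel stands in for the "number outside the thread-id range" described above; all names here are illustrative assumptions.

```python
SHARED = -1   # sentinel outside the valid thread-id range (assumption)

def update_blockstatus(status, thread_id):
    """First accessor's id is recorded; a different accessor marks SHARED."""
    if status is None or status == thread_id:
        return thread_id
    return SHARED

def pet_on_insert(status):
    # Shared blocks are inserted 'younger' (PET '2') so they survive longer;
    # private blocks follow the ordinary DPET insertion value of '3'.
    return 2 if status == SHARED else 3

def pet_on_hit(status):
    # Only shared blocks are promoted all the way to 'youngest' on a hit.
    return 0 if status == SHARED else 1

status = update_blockstatus(None, 7)      # private: owned by thread 7
status = update_blockstatus(status, 3)    # a second thread makes it shared
assert status == SHARED
assert pet_on_insert(status) == 2 and pet_on_hit(status) == 0
```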
Fig 3.6 Data sequence containing a recurring pattern {1,5,7}
3.9. Comparison between LRU, NRU and DPET
3.9.1. Performance of LRU
The behaviour of LRU for the data sequence in Fig 3.6 is shown in Fig 3.7. It is
assumed that there are five blocks in the cache set under consideration. The
workload sequence (divided into bursts) is specified in Fig. 3.7. Below every
data, ‘m’ is used to indicate a miss in the cache and ‘h’ is used to indicate a hit.
Alphabet ‘N’ in the cache denotes Null or Invalid data. The cache set can be
viewed as a queue with insertions taking place from the top of the queue and
deletion taking place from the bottom. Additionally, if a hit occurs, that particular
data is promoted to the top of the queue, pushing other elements down by one
step. Hence at any point of time, the element that has been accessed recently is
kept at the top and the ‘least recently used’ element is found at the bottom, which
becomes the immediate candidate for replacement. In the above example, initially
the stream {1, 5, 7} results in misses and the data is fetched from the main memory
and placed in the cache.
Fig 3.7 Behaviour of LRU for the data sequence
Now the data item {7} hits and is brought to the top of the queue (in this case it is
already at the top). Data items {1} and {5} result in two more consecutive hits,
bringing {5} to the top. The ensuing elements {9, 6} result in misses and are
brought into the cache one after another. Among the burst {5, 2, 1}, the data item
{2} alone misses. Then {0} too results in a miss and is brought from the main
memory. In the stream {4, 0, 8}, the data item {0} hits. Finally, the burst {1, 5, 7}
re-occurs towards the end, where {5, 7} result in two misses, as indicated in
Fig. 3.7. The overall number of misses encountered here amounts to eleven.
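The walk-through can be checked with a short simulation. The access sequence below is reconstructed from the description above (an assumption, since the figure itself is not reproduced here); a 5-block LRU set indeed records eleven misses.

```python
from collections import OrderedDict

def lru_miss_count(sequence, capacity):
    """Count LRU misses for one fully associative set of 'capacity' blocks."""
    cache, misses = OrderedDict(), 0        # least recently used first
    for item in sequence:
        if item in cache:
            cache.move_to_end(item)         # hit: promoted to the top
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict from the bottom
            cache[item] = True
    return misses

seq = [1, 5, 7, 7, 1, 5, 9, 6, 5, 2, 1, 0, 4, 0, 8, 1, 5, 7]
assert lru_miss_count(seq, 5) == 11
```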
3.9.2. Performance of NRU
The working of NRU on the given data set is shown in Fig. 3.8. NRU associates
a single bit counter with every cache line which is initialized to ‘1’. When there is
a hit, the value is changed to ‘0’. A ‘1’ indicates that the data block has remained
unreferenced in the cache for a longer period of time compared to other blocks.
Thus the block from the top which has a counter value of ‘1’ is selected as a victim
for replacement. Scanning from the top helps in breaking the tie. If such a block
is not found, the counter values of all the blocks are set to ‘1’ and a search is
performed again. In Fig. 3.8, counter values which get affected in every burst are
underlined for more clarity. As observed from the figure, NRU results in one
miss more than the overall misses encountered with LRU. Towards the end, the
burst {5,7} results in two misses. This is primarily because NRU tags the stream
{5,7} as 'not recently used', as it had not seen them for some time, and thus
chooses them as replacement victims. In the long run, as the data set size
increases in a similar fashion, the number of misses will increase considerably.
One feature that is common to both algorithms is that they follow the
same technique irrespective of the pattern that the data set follows. This might
minimize the hit rate in some cases. In the case of LRU, if an item is accessed by
the processor, it is immediately taken to the top of the queue, thereby pushing the
remaining set of elements one step below, irrespective of whatever pattern the
data set might follow thereafter. Thus, even if any element in this group that gets
pushed down repeats itself after some point of time, it might have been marked
as 'least recently used' and evicted from the cache. NRU is also not powerful
enough, as it has only a single-bit counter, which does not allocate enough time
for data items which might be accessed in the near future.
Fig 3.8 Behaviour of NRU for the data sequence
3.9.3. Performance of DPET
Fig. 3.9 shows the contents of the cache at various points when DPET is
applied, along with the burst of input data which is processed at that moment. The
PET counter value associated with every block is shown, as a sub-script to the
actual data item and the counter value of the block which is directly affected at
that instant is underlined for more clarity. The input data set has the letters ‘m’ and
‘h’ under every data item to indicate a miss or a hit respectively. The behaviour is
studied with a cache set consisting of five blocks. Initially AMA assigns a value of
'4' to the PET counters associated with all the blocks. As seen from the figure, the burst
{1,5,7,7} arrives initially. Data items {1,5,7} are placed in the cache in a bottom up
fashion and their PET values are decremented by ‘1’. Now when the data item {7}
arrives again, it hits in the cache and thus the counter value associated with {7} is
made ‘0’. Data items {1,5} arrive individually and result in two hits. The
corresponding PET values are set to ‘0’ as indicated in the figure. The next burst
contains elements {9,6}, both resulting in cache misses. They are fetched from
the main memory and stored in the remaining two blocks of the cache set and
their counter values are decremented appropriately. The fifth data burst comprises
the data items {5,2,1}. Among these, {5} is already present in the cache and hence
its counter value is set to ‘0’. Data item {2} is not present in cache. So as with
DPET, AMA searches for a replacement candidate bottom up, which has its
counter value set to ‘4’. Since such a block could not be found, the PET values of
all the blocks are incremented by ‘1’ and the search is performed again. Now the
PET value of the block containing {9} would have matured to ‘4’ and it is replaced
by {2} and its counter value is decremented to ‘3’. Data item {1} hits in the cache.
The overall number of misses is reduced compared to LRU and NRU.
Fig 3.9 Behaviour of DPET for the data sequence
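Replaying the same reconstructed sequence (an assumption, as the figure itself is not reproduced here) through the PET-counter rules confirms the reduction: DPET records nine misses against LRU's eleven on a 5-block set.

```python
def dpet_miss_count(sequence, associativity):
    """Replay DPET over one cache set and count misses (illustrative sketch)."""
    tags, pet = [None] * associativity, [4] * associativity
    misses = 0
    for item in sequence:
        if item in tags:
            pet[tags.index(item)] = 0              # hit: youngest
            continue
        misses += 1
        while 4 not in pet:                        # age all blocks until
            pet = [p + 1 for p in pet]             # one matures to '4'
        victim = max(i for i, p in enumerate(pet) if p == 4)   # bottom-most '4'
        tags[victim], pet[victim] = item, 3        # insert with PET '3'
    return misses

seq = [1, 5, 7, 7, 1, 5, 9, 6, 5, 2, 1, 0, 4, 0, 8, 1, 5, 7]
assert dpet_miss_count(seq, 5) == 9                # vs. eleven misses under LRU
```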
3.10. Results
To evaluate the effectiveness of the replacement policy on real workloads, a
performance analysis has been done with the Gem5 simulator using PARSEC
benchmark programs, discussed in Chapter 6, as input sets. The results are
shown in this section.
3.10.1. CB-DPET Performance
The effectiveness of this method is visible in the following figures, as it has
surpassed LRU. Fig. 3.10 shows the overall number of hits obtained for each
workload with respect to CB-DPET and LRU at the L2 cache. The overall number
of hits obtained in the ROI is taken. These numbers have been scaled down by
appropriate factors to keep them around the same range. It is observed from the
figure that CB-DPET has resulted in an improvement in the number of hits
compared to LRU for a majority of the workloads, with fluidanimate showing high
performance with respect to the baseline policy.
Fig 3.10 Overall number of hits recorded in L2 cache
Fig. 3.11 shows the hit rate obtained for individual benchmarks. Hit rate is given
by the following expression.
Hit Rate = (Overall number of hits in ROI) / (Overall number of accesses in ROI)
Six of the seven benchmarks have reported improvements in the hit rate.
Fluidanimate shows a maximum improvement of 8%, whereas swaptions and vips
have exhibited hit rates on par with LRU.
Fig. 3.12 shows the number of hits obtained in the individual parallel phases. As
discussed earlier, the ROI can be further divided into two phases denoted by PP1
(Parallel Phase 1) and PP2 (Parallel Phase 2). These phases need not be equal
in complexity, number of instructions, etc. There can be cases where PP1 might
be more complex with a huge instruction set compared to PP2 or vice versa. The
number of hits recorded in these phases is shown in the graph in Fig. 3.12. CB-
DPET has shown an increase in hits in most of the phases of majority of the
benchmarks.
Fig 3.11 L2 cache hit rate
Miss latency at a memory level refers to the time spent by the processor in
fetching data from the next level of memory (which is the primary memory in this
case) following a miss at that level. Miss latency at the L2 cache for individual
cores is shown in Fig. 3.13. It is usually measured in cycles. To present it in a
more readable format, it has been converted to seconds using the following
formula:
Miss latency in secs = (Miss latency in cycles) / (CPU Clock Frequency)
The impact of miss latency is an important factor to be taken into consideration. If
the miss latency is high, the overhead involved in fetching the data will be high,
thereby resulting in a performance bottleneck.
Fig 3.12 Hits recorded in individual phases of ROI
Fig 3.13 Miss latency at L2 cache for individual cores
Overall miss latency at L2 is also shown in Fig. 3.13. Note that this is with respect
to the ROI of the benchmarks. Considerable reduction in latency is observed
across the majority of the benchmarks. Vips has shown the maximum reduction in
overall miss latency. Except for ferret and swaptions, all the other workloads have
shown a reduction in the overall miss latency.
Percentage increase in Hits = ((Hits in CB-DPET − Hits in LRU) / Hits in LRU) × 100
Fig. 3.14 shows the percentage increase in hits obtained from CB-DPET
compared to LRU. The figure shows that there is an increase in the hit percentage
for all the benchmarks except ferret. Swaptions exhibits the highest percentage
increase of 13.5%, whereas dedup has shown a 2% increase. An approximate 8%
improvement over LRU is obtained as the average hit percentage when CB-DPET
is applied at the L2 cache level. This is because CB-DPET adapts to the changing
pattern of the incoming workload and retains the frequently used data in the
cache, thus maximizing the number of hits. Fig. 3.15 shows the overall number of
misses recorded by each benchmark. A decrease in the number of misses is
observed across a majority of the benchmarks compared to LRU.
Fig 3.14 Percentage increase in L2 cache hits
Fig 3.15 Overall number of misses at L2 Cache
3.11. Summary
When it comes to concurrent multi-threaded applications in a CMP
environment, LLC plays a crucial role in the overall system efficiency.
Conventional replacement strategies may not be effective in many instances as
they do not adapt dynamically to the data requirements. Numerous research
efforts [45,46,79,90] are being made to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. Both LRU and NRU schemes tend
to perform poorly on data sets possessing irregular patterns because they do not
store any extra information regarding the recent access pattern of the data items.
They make decisions purely based on the current access only [15,45,51,79]. This
chapter hence proposes a scheme called CB-DPET which tries to address the
shortcomings arising from the conventional replacement methods and improve
the performance.
The conclusions of this chapter are as follows:
CB-DPET takes a more systematic and access-history based decision
compared to LRU. A counter associated with every cache block tracks the
behavior of the application's access pattern, based on which the replacement,
insertion and promotion decisions are made.
DPET-V is a thrash-proof version of DPET. Here some randomly selected data
items are allowed to stay in the cache for a longer time by adjusting their counter
values appropriately.
In multi-threaded applications, a DM register and some test sets are allocated
for every thread. Based on the number of misses in the test sets, a best policy
is chosen for the remaining sets.
These algorithms are made thread-aware by using the thread id, which identifies
the thread that is trying to access a cache block. When multiple threads try to
access the same block, it is tagged as shared and allowed to stay in the cache
longer than other blocks by appropriately adjusting its counter value.
Evaluation results on the PARSEC benchmarks have shown that CB-DPET
achieves an average improvement of up to 8% in the overall hits when compared
to LRU at the L2 cache level.
4. LOGICAL CACHE PARTITIONING TECHNIQUE
It is imperative for any level of cache memory in a multi-core architecture to have
a well-defined, dynamic replacement algorithm in place to ensure consistently
high performance [92,93,94,95]. At the L2 cache level, the most prevalently used
LRU replacement policy does not adapt dynamically to changes in the workload.
As a result, it can lead to sub-optimal performance for certain applications whose
workloads exhibit frequently fluctuating patterns. Hence many works have strived
to improve performance at the shared L2 cache level [91,98,99].
To overcome this limitation of the conventional LRU approach, this chapter
proposes a novel counter-based replacement technique which logically partitions
the cache elements into four zones based on their ‘likeliness’ to be referenced by
the processor in the near future. Experimental results obtained by using the
PARSEC benchmarks have shown almost 9% improvement in the overall number
of hits and 3% improvement in the average cache occupancy percentage when
compared to LRU algorithm.
4.1. Introduction
Cache hits play a vital role in boosting the system performance. When there are
more hits, the number of transactions with the next levels of memories is reduced
thereby expediting the overall data access time. Many replacement techniques
use the cache miss metric to determine the replacement victim but in this chapter,
choosing the victim is based on the number of hits received by the data items in
the cache. Based on the hits received, the cache blocks are logically partitioned
into separate zones. These virtual zones are searched in a pre-determined order
to select the replacement candidate.
Contributions of this chapter include:
A novel counter based replacement technique that logically partitions the
cache into four different zones namely – Most Likely to be Referenced (MLR),
Likely to be Referenced (LR), Less Likely to be Referenced (LLR), Never
Likely to be Referenced (NLR) in the decreasing order of their likeliness
factor.
Replacement, insertion and promotion of data elements within these zones
in such a manner that the overall hit rate is maximized.
Association of a 3-bit counter with every cache line to categorize the
elements into different zones. On a cache hit, the corresponding element
is promoted from one zone to another zone.
Selection of replacement candidates from the zones in the ascending order
of their 'likeliness factor', i.e. the first search space for the victim would be
the never likely to be referenced zone, followed by the subsequent zones
till the most likely to be referenced zone is reached.
Periodic zone demotion of elements also occurs to make sure that stale
data does not pollute the cache.
4.2. Logical Cache Partitioning Method
LCP uses a 3-bit counter to dynamically shuffle the cache elements and logically
partition them into four different zones based on their likeliness to be referenced by
the processor in the near future. This counter will be referred to as LCP (Logical
Cache Partitioning) counter in subsequent sections. The lower and upper bounds for
the counter are set to ‘0’ and ‘7’ respectively. The prediction about the likeliness of
reference of the data items is made with the help of the hits encountered in the cache.
As the number of hits for a particular element increases, it is moved up the zone
list till it reaches the MLR region.
Only if it stays unreferenced for a considerable amount of time is it evicted from
the cache. As their names imply, the zones are arranged in the decreasing order
of their likeliness factor. Table 4.1 shows the counter values and their
corresponding zones. Like any replacement policy, this method consists of three
phases: Replacement, Insertion and Promotion, each of which is explained in the
subsequent sub-sections.
Table 4.1 LCP zone categorization

Counter Value Range   Zone
6-7                   MLR
3-5                   LR
1-2                   LLR
0                     NLR
4.3. Replacement Policy
When a miss is encountered in the cache, the data item is fetched from the
secondary memory and brought into the cache thereby replacing one of the
elements which was already present in the cache. This element is often referred to
as the victim. The process of selecting a victim needs to be efficient in order to
improve the hit rate. In this method, the victim is selected from the zones in the
increasing order of their likeliness factor. Any cache line which comes under the
NLR zone is considered first for replacement. If no such line is found, the search is
performed again to check whether any element falls in the LLR region, and so on till MLR
is reached. If two or more lines possess the same counter value that is considered
for replacement during that iteration, the line that is encountered first is chosen as
the victim.
Once the replacement is made, the LCP counters of all the other data elements are
decremented by '1'. This is done to carry out gradual zone demotion, as discussed
earlier, to flush out unused data items from the cache.
4.4. Insertion Policy
This phase is encountered as soon as the replacement candidate is found. The
new incoming data item is inserted into the corresponding cache set and its LCP
counter value is set to ‘2’ (LLR zone).
4.5. Promotion Policy
A hit on any data item in the cache triggers the promotion phase. The LCP value
associated with the cache line is set to the maximum value of its immediate
upper zone. For example, if it was earlier in the LLR zone (LCP value '1' or '2'),
its LCP value is set to '5', i.e. the element now falls within the LR zone. By
moving elements between zones according to the workload pattern, LCP adapts
more dynamically than LRU.
Flow: the processor looks for a particular block in the cache, and the LCP
algorithm starts searching for the block. On a HIT, the block's LCP counter value
is set to the maximum value of the next higher zone. On a MISS, a replacement
victim is found in the cache and the processor brings the data item from the next
level of memory.
Fig 4.1(i) LCP flow diagram
4.6. Boundary Condition
It is essential that the LCP counter value does not overshoot its specified range.
Thus whenever it is modified, a boundary condition check is carried out to ensure
that the value does not go below ‘0’ or beyond ‘7’. Fig. 4.1(i), 4.2(ii) shows the
working of LCP.
Flow: the LCP algorithm starts searching for a replacement candidate in the
cache from zone X, where X denotes NLR initially. It searches for the first block
whose LCP counter value falls within the range of zone X. If such a block is
found, it is chosen as the replacement victim; the new data item is inserted there
and its LCP counter value is set to '2' (LLR zone). If not, X is incremented to
point to the next higher zone and the search is repeated.
Fig 4.1 (ii) Victim Selection flow for LCP
Algorithm 2: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with LCP counter value ‘0’.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. Set is passed as parameter **/
FindVictim (set): i = 0
removing_index = -1
while i < 8 do
    /** Scanning the set **/
    for j = 0 to associativity do
        /** Search from the NLR zone. LCP counter = 0 **/
        if set.blks[j].lcp == i then
            /** Block Found. Exit Loop **/
            removing_index = j;
            break;
    if removing_index != -1 then
        break;
    i = i + 1;
for j = 0 to associativity do
    /** Gradual Zone Demotion **/
    if set.blks[j].lcp > 0 then
        set.blks[j].lcp = set.blks[j].lcp − 1;
return removing_index
Hit:
/** Boundary Condition Check **/
if set.blk.lcp ! = 7 then
/** NLR Zone. Promote to LLR **/
if set.blk.lcp == 0 then
set.blk.lcp = 2;
/** LLR Zone. Promote to LR **/
else if set.blk.lcp == 1 or set.blk.lcp == 2 then
set.blk.lcp = 5;
/** LR Zone. Promote to MLR **/
else
set.blk.lcp = 7
Insert:
/** LLR Zone **/
set.blk.lcp = 2;
Correctness of Algorithm
Invariant: At any iteration i, the LCP counter value of set.blk[i] holds a value in
the range [0,7].
Initialization:
When i = 0, set.blk[i].lcp will be ‘0’ initially. After being selected as victim and
after the new item has been inserted, set.blk[i].lcp will be set to ‘2’. Hence
invariant holds good at initialization.
Maintenance:
At i = n, set.blk[i].lcp will contain
either zero (Cache block yet to be occupied for the first time) or
set.blk[i].lcp will have ‘2’ (If it has been newly inserted) or
set.blk[i].lcp will have ‘1’ or ‘3’ or ‘4’ or ‘6’ (As a part of gradual zone demotion)
set.blk[i].lcp will hold ‘5’ or ‘7’ (On a cache hit from previous zone).
“if blk.lcp ! = 7” condition ensures that the lcp counter value does not overshoot
its range. Hence invariant holds good for i = n.
Same case when i = n+1 as the individual iterations are mutually exclusive. Thus
invariant holds good for i = n + 1.
Termination:
Increment of the loop control variable 'i' ensures that the loop terminates within a
finite number of runs, which is '8' in this case. The condition "if set.blks[j].lcp > 0"
ensures that the lcp counter value of the block does not go below '0'. The
invariant can also be found to hold good for all the cache blocks present
when the loop terminates.
Fig 4.2 Working of LRU and LCP for the data set
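Algorithm 2 can be transcribed into a short runnable model. The following Python sketch is illustrative only (the thesis implements the policy inside the Gem5 cache model); the class and method names are assumptions:

```python
class LCPSet:
    """One cache set under the LCP policy; blocks hold only their 3-bit counter."""

    def __init__(self, associativity):
        self.lcp = [0] * associativity  # all blocks start in the NLR zone

    def find_victim(self):
        """Scan counter values from the NLR zone upward; on eviction,
        demote every other block and insert the new item in the LLR zone."""
        removing_index = -1
        for value in range(8):                 # LCP counter values 0..7
            for j, c in enumerate(self.lcp):
                if c == value:
                    removing_index = j         # first match in lowest zone wins
                    break
            if removing_index != -1:
                break
        for i in range(len(self.lcp)):         # gradual zone demotion
            if i != removing_index and self.lcp[i] > 0:
                self.lcp[i] -= 1
        self.lcp[removing_index] = 2           # insert: LLR zone
        return removing_index

    def hit(self, j):
        """Promote block j to the top value of its immediate upper zone."""
        c = self.lcp[j]
        if c == 0:
            self.lcp[j] = 2    # NLR -> LLR
        elif c <= 2:
            self.lcp[j] = 5    # LLR -> LR
        else:
            self.lcp[j] = 7    # LR/MLR -> MLR top (boundary check)
```

For example, in an empty 4-way set the first victim is block 0; after a hit its counter jumps from the LLR zone to '5' in the LR zone.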
4.7. Comparison with LRU
The following data set is considered.
. . . 2 9 1 7 6 1 7 5 9 1 0 7 5 4 8 3 6 1 7 . . .
It is observed that data items 1 and 7 occur frequently. Fig. 4.2 shows the
working of LRU and LCP on this data pattern. Assume that the cache can hold 4
blocks at a time. Incoming data items at that point of time are shown in the leftmost
column.
In the rightmost column, 'm' indicates a miss and 'h' indicates a hit. Initially the
cache contains invalid blocks. Counter values associated with all the blocks are
set to -1. After applying both the techniques, LCP has resulted in 3 hits more than
LRU. Frequently occurring data items 1 and 7 have resulted in hits towards the
end unlike LRU. This is primarily because LRU follows the same approach
irrespective of the workload pattern and tags 1 and 7 as ‘least recently used’.
4.8. Results
Full system simulation using Gem5 simulator with PARSEC benchmarks
(discussed in Chapter 6) as input sets shows improvement in many cache metrics.
To measure the performance of LCP, metrics like hit percentage increase, number
of replacements made at L2, average cache occupancy and miss rate are used.
Results are shown for two, four and eight core systems.
4.8.1. LCP Performance
Subsequently, the term LCP will be used to represent this method in the graphs.
Percentage increase in the overall number of hits at L2 is shown in Fig. 4.3.
Fig 4.3 Percentage increase in overall number of hits at L2 cache
Fig 4.4 Difference in number of replacements made at L2 cache
Fig. 4.4 shows the total number of replacements made at L2. Fig. 4.5 shows the
average L2 cache occupancy percentage for all the workloads. Fig. 4.6 shows the
miss rate recorded by the individual cores and also the overall miss rate at L2.
Fig 4.5 Average cache occupancy percentage
Miss rate is defined as the ratio of number of misses to the total number of
accesses. As observed from Fig. 4.6, there is a reduction in the core-wise miss
rate and overall miss rate across the majority of the benchmarks when compared
to LRU.
Apart from two workloads, all the others have shown significant improvement in
hits. Percentage increase in hits varies from a minimum of 0.1% (dedup) to
maximum of up to 11% (ferret) as shown in Fig. 4.3.
Number of replacements reflects the efficiency of any replacement algorithm. A
maximum of almost 20% reduction in the number of replacements is observed
across the given workloads compared to LRU from Fig. 4.4. Cache occupancy
refers to the amount of cache that is being effectively utilized to improve the
performance for any workload. Fig. 4.5 indicates cache utilization is higher for
majority of the benchmarks when LCP is applied compared to LRU. Vips utilizes
the cache space to the maximum possible extent.
Fig 4.6 Miss rate recorded by individual cores
4.9. Summary
This chapter discussed a dynamic and structured replacement
technique that can be adopted at the LLC. Key points pertaining to this
replacement policy are as follows:
Elements of the cache are logically partitioned into four zones based on their
likeliness to be referenced by the processor with the help of a 3-bit LCP counter
associated with every cache line. The minimum value that the counter can hold
is ‘0’ and the maximum value is ‘7’.
Replacement candidates are chosen from the zones in the increasing order of
their likeliness factor starting from the NLR zone. Initially all the counter values
are set to ‘-1’. Conceptually all the blocks contain invalid data.
For every hit, the corresponding element is moved up by one zone by adjusting
the LCP counter value. If it has reached the top most zone (MLR) the counter
value is left untouched.
For every miss, the counter value is decremented by '1' to prevent stale data
items from polluting the cache as the algorithm executes. When a new data item
arrives, its LCP value is set to '2'. A boundary condition check needs to be applied
whenever the LCP counter value is modified to make sure that it does not
overshoot its designated range.
Experimental results obtained by applying the method on PARSEC
benchmarks have shown a maximum improvement of 9% in the overall number
of hits and 3% improvement in the average cache occupancy percentage when
compared to the conventional LRU approach.
5. SHARING AND HIT BASED PRIORITIZING TECHNIQUE
The efficiency of cache memory can be enhanced by applying different
replacement algorithms. An optimal replacement algorithm must exhibit good
performance with minimal overhead. Traditional replacement methods that are
currently being deployed across multi-core architecture platforms try to classify
elements purely based on the number of hits they receive during their stay in the
cache. In multi-threaded applications, data can be shared by multiple threads
(which might run on the same core or across different cores). Those data items
need to be given more priority than private data because a miss on them may
stall the functioning of multiple threads, resulting in a performance bottleneck.
Since the traditional replacement approaches do not possess this additional
capability, they might lead to inadequate performance for most current
multi-threaded applications.
To address this limitation, this chapter proposes a Sharing and Hit based
Prioritizing (SHP) replacement technique that takes the sharing status of the data
elements into consideration while making replacement decisions. Evaluation
results obtained using multi-threaded workloads derived from the PARSEC
benchmark suite show an average improvement of 9% in the overall hit rate when
compared to the LRU algorithm.
5.1. Introduction
In multi-threaded applications, when threads run across different cores, the
activity in the L2 cache tends to shoot up as multiple threads try to store and
retrieve shared data. At this point there are many challenges that need to be addressed.
Cache coherence has to be maintained, the replacement decisions made need to
be judicious and the available cache space must be utilized efficiently. When there
is a miss on a data item which is shared by multiple threads, the execution of many
threads gets stalled. This is because when the first thread, which had encountered
a miss, is attempting to fetch the data from the next level of memory, any
subsequent thread which tries to access the same data will result in miss.
Hence it becomes imperative to handle shared data with more care than other
data items. Traditional replacement algorithms like LRU, MRU, etc. do not possess
this capability. They classify elements based on when they will be required by the
processor but do not check the status of the cache block, i.e. whether it is shared
or private at any point of time. Also, when thousands of threads are running in
parallel, simply tagging a block as 'shared' or 'private' may not be very useful. In
those cases, additional information about the shared status can prove handy. This
chapter provides an overview of a novel counter-based
cache replacement technique that associates a ‘sharing degree’ with every cache
block. This degree specifies a set of ranges which includes – not shared (or private),
lightly shared, heavily shared and very heavily shared. This information combined
along with the number of hits received by the element during its stay in the cache is
used to produce a priority for that element based on which judicious replacement
decisions are made.
Contributions of this chapter include:
A novel counter based replacement algorithm that makes replacement
decisions based on the sharing nature of the cache blocks
Association of a 'sharing degree' with every cache line, which indicates the
extent to which the element is shared depending on the number of threads
that try to access that element.
Four degrees of sharing namely – not shared (or private), lightly shared,
heavily shared and very heavily shared.
A replacement priority for every cache line which is arrived upon by
combining the sharing degree along with the number of hits received by
the element based on which the replacement decisions are made.
5.2. Sharing and Hit Based Prioritizing Method
Similar to the other two replacement algorithms, every cache block is associated
with a counter. This 2-bit counter is called the Sharing Degree Counter or SD
counter. Table 5.1 shows all four possible values that the counter can hold
and their individual descriptions. Depending on the number of threads that access
a cache block, it is classified into any one of the sharing categories as shown in
the table. To collect the sharing status of the cache blocks, it is essential to have
an efficient data structure in place to track the number of threads that access
the data item. For this purpose there is a filter referred to as the Thread
Tracker filter (TT filter). It is a flexible, dynamic, software-based array that is
created and associated with every cache block at run time.
When a thread tries to access a data item, a search is conducted in the TT filter
to check if the thread id is already present in it. If not, then the id is stored in the
corresponding cache block’s TT filter. Size of the filter expands as and when a
thread id gets added to it. Once a cache block is about to be evicted, the memory
allocated for the associated TT filter is freed. At any point of time, with the help of
thread ids that are found in the TT filter, the sharing degree counter is populated
for every cache block.
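Assuming the sharing degree is derived purely from the TT filter length, the mapping of Table 5.1 can be sketched as follows (the function name is illustrative):

```python
def sharing_degree(num_threads: int) -> int:
    """Map the number of distinct accessing threads (TT filter length)
    to the 2-bit sharing degree counter value, per Table 5.1."""
    if num_threads <= 1:
        return 0   # private / not shared
    if num_threads <= 3:
        return 1   # lightly shared
    if num_threads <= 7:
        return 2   # heavily shared
    return 3       # very heavily shared
```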
5.3. Replacement Priority
The sharing status of the blocks alone cannot be used to make replacement
decisions. Number of hits garnered by the element is also an important factor to
consider when performing a replacement. Hence a priority for every individual
cache block has been arrived at, that not only gives more weightage to sharing
nature but also attaches importance to the number of cache hits garnered by the
element. The priority computation equation is as follows:
Priority = (Sharing Degree + 1) × (Hits Count / 2) … … … (4)
Sharing degree indicates the sharing degree counter value for the particular
cache block at that particular point of time. A software based hit counter is
associated with every cache block that provides an estimate of the hits received
by the element during its stay in the cache. This counter is refreshed every time a
new element enters the block. The upper bound of the counter is fixed to a specific
value. If that value is reached, the counter is considered saturated and any hits
received beyond that will not be taken into account. Experimental results have
shown that in a unified cache memory, instructions account for about 50% of the
contents on average and the remaining 50% forms the data. For the calculation
of priority, the analysis rests mainly upon the data part, as it is accessed much
more frequently than instructions. Also, due to locality of reference, several
threads can access the data several times. To reflect the former, the hit count
value is divided by 2; to reflect the latter, the hit count value is scaled by a factor
of the sharing degree counter value. Whenever either of these counters changes,
the priority of the element needs to be recomputed.
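The priority computation of equation (4) reduces to a one-line function (a sketch; both counters are modelled as plain integers):

```python
def shp_priority(sharing_degree: int, hit_count: int) -> float:
    """Priority = (Sharing Degree + 1) * Hits Count / 2 -- equation (4)."""
    return (sharing_degree + 1) * hit_count / 2
```

For instance, a private block (degree 0) with 4 hits gets priority 2.0, while a very heavily shared block (degree 3) with the same 4 hits gets 8.0, illustrating the extra weight given to sharing.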
Cache replacement policy consists of deletion, insertion and promotion
operations. Each phase of the algorithm is explained in detail in the subsequent
sections. The mapping method used is set associative mapping.
5.4. Replacement Policy
When the cache becomes full, replacement has to be made to pave way for new
incoming data items. SHP makes replacement decisions based on the computed
priority values. Elements are evicted in the increasing order of their priority. The
element with the least priority in the list is chosen as the victim. If more than one
element has the same least priority, then the one which is encountered first while
scanning the cache is taken as the victim as a tie-breaking mechanism. It is also
essential to ensure that stale data do not pollute the cache for longer periods of
time.
From this aspect the hit counter can be regarded as an ageing counter. Every
time a victim is found, the hit counter of all the other elements in the cache is
decremented and their corresponding priorities are re-computed. If any element
remains unreferenced for a long period of time, its priority will gradually decrease
and the element will eventually be flushed out of the cache.
Algorithm 3: Implementation for Cache Miss, Hit and Insert Operations
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. Set id is passed as parameter **/
FindVictim(set)
FindVictim(set):
min = ∞
for i = 0 to associativity do
/** Choose the minimum priority element **/
if set.blk[i].priority < min then
min = set.blk[i].priority;
removing_index = i;
return removing_index;
Hit:
set.blk.hit_counter = set.blk.hit_counter + 1;
/** Scan the TT Filter **/
thread_found = 0;
for i = 0 to length(TT_Filter) do
    if TT_Filter[i] == accessing_thread_id then
        thread_found = 1;
        break;
/** If thread id not found in TT Filter, append it **/
if thread_found != 1 then
    append accessing_thread_id to TT_Filter;
set set.blk.sharing_degree counter value [0 to 3] according to TT_Filter length
set.blk.priority = (set.blk.sharing_degree + 1) * (set.blk.hit_counter) / 2;
Insert:
/** Private block **/
set.blk.sharing_degree = 0;
refresh blk.hit_counter;
Correctness of Algorithm
Invariant: At any iteration i, the sharing degree counter value of set.blk[i] holds a
value in the range [0,3].
Initialization:
When i = 0, set.blk[i]. sharing_degree will be set to ‘0’ initially. Hence invariant
holds good at initialization.
Maintenance:
At i = n, set.blk[i].sharing_degree will contain
either zero (Private) or
set.blk[i].sharing_degree will have ‘1’ (Lightly shared) or
set.blk[i].sharing_degree will have ‘2’ (Heavily Shared) or
set.blk[i].sharing_degree will hold ‘3’ (Very Heavily Shared).
Hence invariant holds good for i = n.
Same case when i = n+1 as the individual iterations are mutually exclusive. Thus
invariant holds good for i = n + 1.
Termination:
The priority computation loop is limited by the associativity of the cache which
will always be finite. TT filter iteration is limited by the size of the TT filter which
will be equal to the total number of threads executing. It will also be finite. The
invariant can also be found to hold good for all the cache blocks present
when the loops terminate.
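Algorithm 3, together with the ageing step of Section 5.4, can be exercised with a small Python model. This is an illustrative sketch (the names are assumptions; the real policy lives inside the simulator):

```python
class SHPBlock:
    """One cache block under SHP: TT filter, hit counter and priority."""

    def __init__(self):
        self.tt_filter = []       # distinct thread ids that accessed this block
        self.hit_counter = 0
        self.priority = 0.0

    def sharing_degree(self):
        n = len(self.tt_filter)   # Table 5.1 ranges
        return 0 if n <= 1 else 1 if n <= 3 else 2 if n <= 7 else 3

    def recompute_priority(self):
        # Equation (4): (Sharing Degree + 1) * Hits Count / 2
        self.priority = (self.sharing_degree() + 1) * self.hit_counter / 2

    def on_hit(self, thread_id):
        self.hit_counter += 1
        if thread_id not in self.tt_filter:   # TT filter scan + append
            self.tt_filter.append(thread_id)
        self.recompute_priority()


def find_victim(blocks):
    """Evict the first minimum-priority block; age the survivors (Section 5.4)."""
    removing_index = min(range(len(blocks)), key=lambda i: blocks[i].priority)
    for i, blk in enumerate(blocks):
        if i != removing_index and blk.hit_counter > 0:
            blk.hit_counter -= 1              # ageing counter decrement
            blk.recompute_priority()
    blocks[removing_index] = SHPBlock()       # insert: private, refreshed counter
    return removing_index
```

Note that `min` over indices naturally implements the tie-breaking rule: among equal priorities, the block encountered first is chosen.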
Flow: the processor looks for a particular block in the cache, and the SHP
algorithm starts searching for the block. On a HIT, the hit counter is incremented,
the TT filter and the sharing degree counter are updated if needed, and the
priority is re-computed. On a MISS, a replacement victim is found in the cache
and the processor brings the data item from the next level of memory.
Fig 5.1 (i) SHP flow diagram
Flow: the SHP algorithm starts searching for a replacement candidate in the
cache. It searches for the first block that has the minimum priority value and
chooses it as the replacement victim. The new data item is inserted there, its
sharing degree counter value is set to '0' (private), and its priority is
re-computed.
Fig 5.1 (ii) Victim Selection flow for SHP
Table 5.1 Sharing degree values and their descriptions

Number of Accessing Threads   Sharing Degree Counter Value   Nature of Sharing
1                             0                              Private/Not Shared
2-3                           1                              Lightly Shared
4-7                           2                              Heavily Shared
8-10                          3                              Very Heavily Shared
5.5. Insertion Policy
After evicting the victim, the new data item needs to be inserted into the cache.
Sharing degree counter is set to ‘0’ to indicate that the incoming block is currently
not shared by any other threads. The corresponding hit counter is refreshed.
5.6. Promotion Policy
When a cache hit happens, the hit counter of the corresponding block is
incremented. Also the accessing thread id is compared against the ids which are
already present in the TT filter and if it is not present there, it is added into the TT
filter. Sharing degree counter is adjusted accordingly and the priority is re-
computed. Fig. 5.1(i), 5.1(ii) shows the basic flow of SHP technique.
5.7. Results
To evaluate the performance of SHP, simulation has been carried out with the
GEM5 simulator using PARSEC workloads (discussed in Chapter 6), and the
results are compared with LRU. Most of the applications benefit from this
replacement policy. The results are shown in this section.
5.8. SHP Performance
It is observed from the following figures, that the performance of the proposed
replacement technique is better than LRU for multithreaded workloads. The
overall number of hits obtained at LLC cache for SHP method compared to LRU
is shown in the graph in Fig. 5.2. In Fig. 5.2, blackscholes has shown the
maximum improvement. On average, a 9% improvement in the overall
number of hits is observed across the given benchmarks.
Fig 5.2 Overall number of hits at L2
Fig. 5.3 shows the overall number of replacements made at L2 for SHP and
LRU. The more replacements made, the greater the overhead involved. So it is
always desirable to keep this parameter as low as possible. With this method, the
average number of replacements made at L2 has decreased compared to LRU.
Compared to the other benchmarks, blackscholes has exhibited the least number
of replacements (1.6% lesser than that of LRU) and swaptions has shown a
0.55% decrease when compared to LRU.
Fig. 5.4 shows the L2 miss rate for different workloads. Miss rate is computed
from the overall number of misses and the overall number of accesses. To obtain
a high performance, LLC must provide a lower miss rate. Majority of the
benchmarks have shown improvement in this metric when compared to LRU with
dedup showing superior performance.
Fig 5.3 Percentage decrease in number of replacements made at L2
Fig 5.4 Overall miss rate and core-wise miss rate at L2 cache
5.9. Summary
Shared data plays a critical role in determining the performance of cache
memory systems, especially in a multi-threaded environment. The conventional
LRU approach does not attach importance to such data, and hence this chapter
proposed a novel counter-based prioritizing algorithm.
Every cache block is associated with a 2-bit sharing degree counter which
ranges from 0 to 3.
A dynamic software based TT filter is associated with every block to keep track
of the threads that are accessing that block.
A hit counter is used to keep track of the hits received by the data item.
Values of the sharing degree and the hit counters are used to compute a priority
for each cache block.
This priority is then used to make judicious replacement decisions.
Evaluation results have shown an average improvement of up to 9% in the
overall number of hits when compared to the traditional LRU approach.
6. EXPERIMENTAL METHODOLOGY
This section outlines the simulation infrastructure and benchmarks used to
evaluate the dynamic replacement policies. To evaluate the proposed techniques,
Gem5 [102] simulator is used and PARSEC benchmark suites [103,104,105] are
used as input sets.
6.1. Simulation Environment
For the simulation studies, Gem5, a modular, discrete-event-driven, fully
object-oriented, open-source computer-system simulator platform, has been
chosen. It can simulate a complete system with devices and an operating system
in full system mode (FS mode), or user-space-only programs, where system
services are provided directly by the simulator, in syscall emulation mode (SE
mode). The FS mode is used in this work. Gem5 is also capable of simulating a
variety of Instruction Set Architectures (ISAs).
Table 6.1 Summary of the key characteristics of the PARSEC benchmarks

Program        Application Domain   Working Set
Blackscholes   Financial Analysis   Small
Canneal        Computer Vision      Medium
Dedup          Enterprise Storage   Unbounded
Ferret         Similarity Search    Unbounded
Fluidanimate   Animation            Large
Swaptions      Financial Analysis   Medium
Vips           Media Processing     Medium
x264           Media Processing     Medium
6.2. PARSEC Benchmark
The Princeton Application Repository for Shared-Memory Computers
(PARSEC) is a benchmark suite that comprises numerous large scale commercial
multi-threaded workloads for CMPs. For simulation purposes, seven workloads
have been selected from the PARSEC suite. Every workload is divided into three
phases: an initial sequential phase, a parallel phase (which is further split into two
phases), and a final sequential phase.
This work focuses on the performance obtained in the parallel phase, which is
also known as the Region of Interest (ROI). All the statistics collected to
measure cache performance correspond to the ROI of the workloads. The basic
characteristics of the chosen benchmarks are shown in Table 6.1, and Table 6.2
highlights the cache configuration parameters used for simulation.
Table 6.2 Cache architecture configuration parameters

Parameter                                  L1 Cache           L2 Cache
Total Cache Size                           32 KB i-cache,     2 MB
                                           64 KB d-cache
Associativity                              2-way              8-way
Miss Information/Status Handling
Registers (MSHR)                           10                 20
Cache Block Size                           64 B               64 B
Latency                                    1 ns               10 ns
Replacement Algorithm                      LRU                CB-DPET
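The set count used in the storage-overhead calculations of Section 6.4 follows directly from these parameters; a minimal sketch of the arithmetic, using the 2 MB, 8-way, 64-byte-block L2 configuration from Table 6.2:

```python
# Derive the L2 geometry from Table 6.2 (2 MB, 8-way set associative, 64 B blocks).
cache_size_bytes = 2 * 1024 * 1024   # 2 MB L2 cache
block_size_bytes = 64                # cache block (line) size
associativity = 8                    # 8-way set associative

blocks = cache_size_bytes // block_size_bytes   # total number of cache blocks
sets = blocks // associativity                  # sets = blocks / ways

print(blocks)  # 32768
print(sets)    # 4096
```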
6.3. Evaluation Of Many-Core Systems
Evaluation using the PARSEC benchmarks for all three methods on two-, four- and
eight-core systems with a 2 MB LLC has yielded better results than LRU in terms
of parameters such as the number of hits, hit rate, throughput and IPC speedup.
Throughput is measured from the total number of instructions executed per clock
cycle (Instructions Per Cycle, or IPC). In multi-threaded applications,
throughput is the sum of the individual thread IPCs, as shown below.

Throughput = Σ (i = 1 to n) IPC_i

IPC Speedup of Algorithm k over LRU = IPC_k / IPC_LRU

where 'n' refers to the total number of threads and k refers to any one of the
methods CB-DPET, LCP or SHP.
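The two formulas above translate directly into code; a minimal sketch, where the per-thread IPC values are purely illustrative and not simulation results:

```python
def throughput(thread_ipcs):
    """Throughput of a multi-threaded run = sum of the individual thread IPCs."""
    return sum(thread_ipcs)

def ipc_speedup(ipc_method, ipc_lru):
    """IPC speedup of a replacement method k over the LRU baseline."""
    return ipc_method / ipc_lru

# Illustrative per-thread IPCs for a hypothetical 4-thread workload.
ipcs_method = [0.42, 0.38, 0.40, 0.41]
ipcs_lru    = [0.40, 0.36, 0.38, 0.39]

print(throughput(ipcs_method))                                  # ≈ 1.61
print(ipc_speedup(throughput(ipcs_method), throughput(ipcs_lru)))  # ≈ 1.05
```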
Evaluation results using the PARSEC benchmarks for two cores have shown average
IPC speedup improvements of up to 1.06, 1.05 and 1.08 times that of LRU for the
CB-DPET, LCP and SHP techniques respectively (Fig. 6.1, 6.2 and 6.3).
Fig 6.1 Performance comparison for 2 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)
Fig 6.2 Performance comparison for 4 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)

Fig 6.3 Performance comparison for 8 cores (average IPC speedup over LRU, per benchmark, for CB-DPET, LCP and SHP)
Fig 6.4 CB-DPET - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)

Fig 6.5 LCP - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)
Fig 6.6 SHP - Percentage increase in hits (percentage increase in the number of hits over LRU, per benchmark, for 2-, 4- and 8-core systems)
In the 2-core scenario for CB-DPET, canneal has shown the highest IPC speedup
improvement (1.2 times that of LRU), and dedup has recorded the least, at 1.02
times that of LRU. Vips and swaptions have shown the maximum IPC speedup among
the workloads for the LCP and SHP methods in the 2-core system. In the 4-core
scenario, fluidanimate has shown the maximum IPC speedup, at 1.075 and 1.07
times that of LRU for the CB-DPET and LCP methods respectively. Blackscholes
has shown an IPC speedup of 1.07 times that of LRU, the maximum for the SHP
method in the 4-core environment. For 8 cores, ferret has recorded the maximum
IPC speedup for the CB-DPET and LCP methods, whereas swaptions has shown the
maximum IPC speedup, of almost 1.1 times that of LRU, for the SHP method.
An improvement in overall hits of up to 8% has been observed using CB-DPET when
compared to LRU at the L2 cache level (Fig. 6.4). LCP and SHP have each shown
an average improvement of up to 9% in the overall number of hits (Fig. 6.5 and
Fig. 6.6 respectively).
6.4. Storage Overhead
6.4.1. CB-DPET
The PET counter requires frequent lookup, so it is kept as part of the cache
hardware to expedite access. Consider an 8-way associative, 2 MB cache with a
64-byte block size, which has 4096 sets. The additional storage required is
calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
                    = 4096 × 8 × 3 = 98,304 bits, or 12 KB

The DM register is allocated on a per-thread basis, and its size depends on the
application at run time, so it is kept as a software-based register rather than
a hardware one. Whether the block status register is implemented in hardware or
software is an implementation-time decision, since the range of values it can
hold is not yet fixed. Assuming that it can take only '0' (say, shared) and '1'
(not shared), and that the thread-id range does not start from '0', the
additional storage required for the block status register, if implemented in
hardware, is approximately 4 KB. The total additional storage required is
therefore approximately 16 KB, which amounts to a 0.78% overall increase in
area for a 2 MB L2 cache.
6.4.2. LCP
Similar to PET, LCP counter also requires frequent lookup so keeping it as a
part of the cache hardware to expedite the access time. Storage overhead is
similar to that of PET counter which is approximately 12KB since LCP is also a 3-
bit counter. As no other registers and counters are required, LCP results in 0.58%
overall increase in area for a 2MB L2 Cache.
6.4.3. SHP
SHP requires a 2-bit Sharing Degree counter which requires frequent look up
and hence keeping it as a part of the hardware. Additional storage required for
Sharing Degree counter is calculated as follows
Additional Overhead = no. of sets ∗ no. of blocks per set ∗ no. of bits per block = ∗ ∗ = 𝑏𝑖 𝑜 𝐾𝐵
Thread Tracker filter which is used to keep track of the threads that accesses the
blocks can be implemented as a part of software which gives the flexibility of
dynamically allocating and freeing the memory as and when required. Hence SHP
results in 0.39% overall increase in area for a 2MB L2 Cache.
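The three per-method overhead figures can be cross-checked with a short calculation; a minimal sketch of the arithmetic used above (counter widths as stated: PET and LCP are 3-bit, SHP's sharing degree is 2-bit, plus the assumed ~4 KB block status register for CB-DPET):

```python
# Storage overhead of the per-block counters for a 2 MB, 8-way, 64 B-block L2
# cache (4096 sets, 8 blocks per set).
SETS, WAYS = 4096, 8
CACHE_KB = 2 * 1024  # 2 MB L2 expressed in KB

def counter_overhead_kb(bits_per_block):
    total_bits = SETS * WAYS * bits_per_block
    return total_bits / 8 / 1024  # convert bits -> KB

cbdpet_kb = counter_overhead_kb(3) + 4   # 12 KB PET counters + ~4 KB block status
lcp_kb    = counter_overhead_kb(3)       # 12 KB LCP counters
shp_kb    = counter_overhead_kb(2)       # 8 KB sharing-degree counters

for name, kb in [("CB-DPET", cbdpet_kb), ("LCP", lcp_kb), ("SHP", shp_kb)]:
    print(f"{name}: {kb:.0f} KB ({100 * kb / CACHE_KB:.2f}% of 2 MB)")
```

This reproduces 16 KB (0.78%), 12 KB and 8 KB (0.39%); note that 12 KB of 2 MB is 0.586%, which the text above truncates to 0.58%.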
7. CONCLUSION AND FUTURE WORK
7.1. Conclusion
When it comes to concurrent multi-threaded applications in a CMP environment,
the LLC plays a crucial role in overall system efficiency. Conventional
replacement strategies like LRU, NRU, etc. may not be effective in many
instances, as they do not adapt dynamically to the data requirements.
Considerable research is being conducted to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. In this
dissertation, three novel counter-based replacement strategies have been
proposed.
A novel replacement algorithm called CB-DPET has been proposed which
takes a more systematic and access-history based decision compared to LRU.
It is further split into DPET and DPET-V. DPET associates a counter with every
cache block to keep track of its recent access information. Based on this
counter’s values, the algorithm makes replacement, insertion and promotion
decisions.
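The per-block counter mechanism described above can be sketched as follows. This is a hypothetical illustration, not the thesis's exact policy: the counter is assumed to be 3-bit (as with the PET counter of Section 6.4), and the insertion and promotion values are assumptions.

```python
# Hypothetical sketch of a counter-based replacement decision in the spirit of
# DPET: each block carries a small saturating counter tracking recent accesses;
# the block with the lowest counter value in the set is chosen as the victim.
PET_MAX = 7  # 3-bit saturating counter (assumed width)

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.pet = 1  # assumed insertion value for a newly filled block

def on_hit(block):
    # Promotion on reuse: increment the counter, saturating at the maximum.
    block.pet = min(block.pet + 1, PET_MAX)

def choose_victim(cache_set):
    # Replacement: evict the block with the least recent-access evidence.
    return min(cache_set, key=lambda b: b.pet)

# Usage: block B is reused twice, A never, so A becomes the victim.
s = [Block("A"), Block("B")]
on_hit(s[1]); on_hit(s[1])
print(choose_victim(s).tag)  # A
```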
A dynamic and a structured replacement technique called LCP has been
proposed to be adopted across the LLC. Here the elements of the cache are
logically partitioned into four zones based on their likeliness to be referenced
by the processor with the help of a 3-bit LCP counter associated with every
cache line. Replacement candidates are chosen from the zones in the
increasing order of their likeliness factor starting from the NLR zone. For every
hit, the corresponding element is moved up by one zone by adjusting the LCP
counter value. If it has reached the top most zone (MLR) the counter value is
left untouched. For every miss the counter value is decremented by ‘1’ to
prevent stale data items from polluting the cache.
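The LCP zone mechanics described above can be sketched in a few lines. The mapping of 3-bit counter values to the four zones (two counter values per zone) and the intermediate zone names are assumptions for illustration; the thesis names only NLR and MLR.

```python
# Hedged sketch of LCP: a 3-bit counter per cache line partitions the set into
# four logical zones, from NLR (least likely to be referenced) to MLR (most
# likely). Hits promote a line one zone; misses decrement its counter by 1.
LCP_MAX = 7                            # 3-bit counter, range 0..7
ZONES = ["NLR", "LLR", "LR", "MLR"]    # middle zone names are assumptions

def zone(lcp):
    return ZONES[lcp // 2]             # assumed: two counter values per zone

def on_hit(lcp):
    # Move the line up one zone, unless it is already in the topmost (MLR) zone.
    return lcp if zone(lcp) == "MLR" else min(lcp + 2, LCP_MAX)

def on_miss(lcp):
    # Decrement by 1 so stale lines drift toward NLR and become eviction candidates.
    return max(lcp - 1, 0)

def choose_victim(set_lcps):
    # Search zones in increasing order of likeliness, starting from NLR.
    return min(range(len(set_lcps)), key=lambda i: set_lcps[i])

print(zone(on_hit(3)))  # a hit on a line in the LLR zone promotes it to LR
```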
A novel counter based prioritizing algorithm called SHP has been proposed to
be applied across the LLC. Here every cache block is associated with a 2-bit
sharing degree counter which iterates from 0 to 3. A dynamic software based
TT filter is associated with every block to keep track of the threads that are
accessing that block. A hit counter is used to keep track of the hits received by
the data item. Values of the sharing degree and the hit counters are used to
compute a priority for each cache block. This priority is then used to make
judicious replacement decisions.
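The SHP priority computation can be sketched as below. The thesis states only that the sharing degree and hit counter together determine the priority; the weighted-sum combining function here is an assumption for illustration.

```python
# Hedged sketch of SHP: each block carries a 2-bit sharing degree (0..3) and a
# hit counter; both feed a replacement priority. The weights are assumptions --
# the exact combining function is not specified in this summary.
def priority(sharing_degree, hits, w_share=2, w_hits=1):
    assert 0 <= sharing_degree <= 3    # 2-bit sharing degree counter
    return w_share * sharing_degree + w_hits * hits

def choose_victim(blocks):
    # Evict the block with the lowest priority: least shared and least reused.
    return min(blocks, key=lambda b: priority(b["deg"], b["hits"]))

blocks = [
    {"tag": "A", "deg": 3, "hits": 5},  # heavily shared and frequently hit
    {"tag": "B", "deg": 0, "hits": 1},  # private, barely reused
]
print(choose_victim(blocks)["tag"])  # B
```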
Table 7.1 Percentage increase in overall hits at L2 Cache compared to LRU
Method 2-Core 4-Core 8-Core
CB-DPET 8% 4% 3%
LCP 6% 9% 5%
SHP 7% 9% 5%
Table 7.1 provides the percentage improvement obtained in the overall number of
hits when compared to LRU for all three methods. It is observed that up to a 9%
improvement in overall hits has been obtained at the L2 cache level (for LCP),
along with an IPC speedup of up to 1.08 times that of LRU (for SHP).
Simulation studies show that all three proposed counter-based replacement
techniques have outperformed LRU with minimal storage overhead.
7.2. Scope for Further work
Dynamic replacement strategies have proven to work well with versatile set of
application workloads that currently exist. Increasing the dynamicity without
inducing much hardware overhead is crucial. This research has experimented
novel replacement strategies on the shared L2 cache while the modifications that
may be required to run the same algorithms in the relatively smaller private L1
cache is an interesting area to explore.
Appending additional bits per cache block in CB-DPET can help in storing more
information about the access pattern of the data item but storage overhead needs
to be taken into consideration. In LCP, optimization of the victim search process
could be done to expedite the replacement candidate selection. Multi-threading
could be a better solution here. Instead of having one thread searching the cache
set in multiple passes there can be multiple threads searching in different zones
but inter-thread communication may have to be explored to make sure the process
stops once victim is found. In sharing and hit-based prioritizing algorithm,
replacement priority of cache blocks is computed from the sharing degree and
number of hits received. In order to boost the accuracy, various other parameters
like miss penalty incurred while replacing that block etc can also be considered
while computing priority.
TECHNICAL BIOGRAPHY
Muthukumar S (RRN: 1184211) was born on 15th April 1965, in Erode,
Tamilnadu. He received the BE degree in Electronics and Communication
Engineering and the ME degree in Applied Electronics from Bharathiar University,
Coimbatore in the year 1989 and 1992 respectively. He is currently pursuing his
Ph.D. Degree in the Department of Electronics and Communication Engineering
of B.S. Abdur Rahman University, Chennai, India. He has more than 22 years
of teaching experience in various institutions. He is working as a Professor in the
Department of Electronics and Communication Engineering at Sri Venkateswara
College of Engineering, Chennai, India. His current research interests include
Multi-Core Processor Architectures, High Performance Computing and
Embedded Systems.
E-mail ID : [email protected]
Contact number : 9941437738