Journal of International Academic Research for Multidisciplinary
ISSN 2320-5083
A Scholarly, Peer Reviewed, Monthly, Open Access, Online Research Journal
Impact Factor – 1.393
VOLUME 1 ISSUE 11 DECEMBER 2013
A GLOBAL SOCIETY FOR MULTIDISCIPLINARY RESEARCH
www.jiarm.com
A GREEN PUBLISHING HOUSE
Editorial Board
Dr. Kari Jabbour, Ph.D Curriculum Developer, American College of Technology, Missouri, USA.
Er.Chandramohan, M.S System Specialist - OGP ABB Australia Pvt. Ltd., Australia.
Dr. S.K. Singh Chief Scientist Advanced Materials Technology Department Institute of Minerals & Materials Technology Bhubaneswar, India
Dr. Jake M. Laguador Director, Research and Statistics Center, Lyceum of the Philippines University, Philippines.
Prof. Dr. Sharath Babu, LLM Ph.D Dean. Faculty of Law, Karnatak University Dharwad, Karnataka, India
Dr.S.M Kadri, MBBS, MPH/ICHD, FFP Fellow, Public Health Foundation of India Epidemiologist Division of Epidemiology and Public Health, Kashmir, India
Dr.Bhumika Talwar, BDS Research Officer State Institute of Health & Family Welfare Jaipur, India
Dr. Tej Pratap Mall Ph.D Head, Postgraduate Department of Botany, Kisan P.G. College, Bahraich, India.
Dr. Arup Kanti Konar, Ph.D Associate Professor of Economics Achhruram, Memorial College, SKB University, Jhalda,Purulia, West Bengal. India
Dr. S.Raja Ph.D Research Associate, Madras Research Center of CMFR , Indian Council of Agricultural Research, Chennai, India
Dr. Vijay Pithadia, Ph.D, Director - Sri Aurobindo Institute of Management Rajkot, India.
Er. R. Bhuvanewari Devi, M.Tech, MCIHT, Highway Engineer, Infrastructure, Ramboll, Abu Dhabi, UAE
Sanda Maican, Ph.D, Senior Researcher, Department of Ecology, Taxonomy and Nature Conservation, Institute of Biology of the Romanian Academy, Bucharest, Romania
Dr. Reynalda B. Garcia, Professor, Graduate School & College of Education, Arts and Sciences, Lyceum of the Philippines University, Philippines
Dr. Damarla Bala Venkata Ramana, Senior Scientist, Central Research Institute for Dryland Agriculture (CRIDA), Hyderabad, A.P, India
Prof. Dr. S.V. Kshirsagar, M.B.B.S, M.S, Head, Department of Anatomy, Bidar Institute of Medical Sciences, Karnataka, India
Dr. Asifa Nazir, M.B.B.S, MD, Assistant Professor, Dept. of Microbiology, Government Medical College, Srinagar, India
Dr. Amita Puri, Ph.D, Officiating Principal, Army Institute of Education, New Delhi, India
Dr. Shobana Nelasco, Ph.D, Associate Professor, Fellow of Indian Council of Social Science Research (On Deputation), Department of Economics, Bharathidasan University, Trichirappalli, India
M. Suresh Kumar, Ph.D, Assistant Manager, Godrej Security Solution, India
Dr. T. Chandrasekarayya, Ph.D, Assistant Professor, Dept. of Population Studies & Social Work, S.V. University, Tirupati, India
JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: 2320-5083, Volume 1, Issue 11, December 2013
605 www.jiarm.com
ANALYSIS OF CACHE OBLIVIOUS ALGORITHMS
KORDE P.S* *Dept. of Computer Science, Shri Shivaji College, Parbhani (M.S.) India
ABSTRACT
We propose cache-oblivious algorithms for better memory utilization. Designing a
cache-oblivious algorithm requires a deep understanding of program structure and operation,
and familiarity with the memory architecture. To make the performance benefits of
cache-oblivious structures available to the average programmer, it is necessary to design
techniques that reduce the cache miss rate and achieve significant performance improvements
on real programs. We have developed such techniques and compared their performance in detail
with their cache-aware counterparts, varying both the parameter values of the cache-oblivious
algorithms and the hardware platforms.
KEYWORDS: Cache oblivious; Cache miss; Cache hit; Multiplication; Transposition; Sorting; B-tree
1. INTRODUCTION
Cache memory is a small, high-speed memory which plays a very important role in a
computer system. It is utilized to bridge the speed gap between the CPU and main memory,
as it has a smaller access time than main memory. At present there are three types
of cache:
1. Fully Associative
2. Direct Mapped
3. Set Associative
In a fully associative cache there is no restriction on the mapping from memory to cache,
but the tag search this requires is expensive, so it is feasible only for small caches. In a
direct mapped cache, each memory block maps to exactly one cache line, so no expensive
search is needed. In a set associative cache, a predefined set of lines is used for mapping
from memory to cache. Of this list of cache types, the direct mapped cache is the fastest.
With advances in computer hardware it is possible to build high-speed CPUs, but CPU
performance remains restricted by the performance of cache memory accesses. Various
policies are available for effective utilization of cache memory: Least Recently Used
(LRU), Most Recently Used (MRU), Pseudo-LRU (PLRU), Segmented LRU, Set Associative
Direct Map, Least Frequently Used (LFU) and Multi Queue (MQ).
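To make the mapping schemes concrete, the address decomposition used by a direct-mapped cache can be sketched as follows. This is an illustrative sketch: the line size and line count are hypothetical example values, not parameters taken from the text.

```cpp
#include <cstdint>

// Sketch: decompose a byte address for a direct-mapped cache.
// Hypothetical parameters: 64-byte lines, 256 lines (a 16 KiB cache).
constexpr uint64_t kLineSize = 64;   // block (line) size in bytes
constexpr uint64_t kNumLines = 256;  // number of cache lines

// Offset of the byte inside its cache line.
inline uint64_t lineOffset(uint64_t addr) { return addr % kLineSize; }

// Index of the unique line this address maps to (direct mapping:
// every memory block has exactly one possible cache line).
inline uint64_t lineIndex(uint64_t addr) { return (addr / kLineSize) % kNumLines; }

// Tag stored alongside the line to identify which block occupies it.
inline uint64_t lineTag(uint64_t addr) { return addr / (kLineSize * kNumLines); }
```

Two addresses exactly `kLineSize * kNumLines` bytes apart map to the same line index, which is why a direct-mapped cache needs no search: only the stored tag must be compared.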
2. Utilization of Cache Memory Algorithms
Program transformations are among the most valuable compiler techniques for improving data
locality. However, restructuring compilers have a hard time coping with data dependences. A
typical solution is to focus on program parts where the dependences are simple enough to
enable cache-oblivious restructuring; for more complex problems, only the question of
whether a given transformation is legal is addressed.
There are two types of processing algorithms:
a) Utilization of cache-oblivious algorithms.
b) Utilization of cache-conscious algorithms.
Cache Oblivious Algorithms
A cache-oblivious algorithm is designed to perform well, without modification, on multiple
machines with different cache sizes, or on a memory hierarchy whose levels of cache
have different sizes. The idea of cache-oblivious algorithms was conceived by Charles E.
Leiserson [12] as early as 1996 and first published by Harald Prokop [12] in his master's thesis
at the Massachusetts Institute of Technology in 1999. Optimal cache-oblivious algorithms
are known for the Cooley-Tukey FFT, matrix multiplication, sorting, matrix
transposition and several other problems. The goal of cache-oblivious algorithms is to
reduce the amount of tuning that is required. Cache-oblivious algorithms are analyzed using
an idealized model of the cache, known as the cache-oblivious (ideal-cache) model.
Cache Conscious Algorithms
Microprocessor performance has improved by about 60% each year for almost two decades. Yet
over the same period, memory access time has decreased by less than 10% per year [3]. These
trends appear likely to persist barring an unforeseen technology breakthrough, because the
primary driving force behind memory technology is storage capacity rather than fast access
times.
The unfortunate, but inevitable, consequence is an ever-increasing processor-memory
performance gap. Memory caches are the ubiquitous hardware solution to this problem:
small, fast memories that store recently accessed data items and attempt to
intercept and satisfy data requests without accessing main memory. In the beginning a
single level of cache sufficed, but the growing performance gap (now two orders of
magnitude) requires two levels of cache today, and three in the near future.
3. Cache Oblivious Algorithms
We first discuss the structure of basic cache-oblivious algorithms and their optimization
techniques.
3.1 Cache oblivious model
The basic notation of cache oblivious model is simple. Modern Computer has multi-
level storage hierarchies. In which data which is being manipulated by the CPU moves back
and forth between various levels of storage.
‐ In register on the CPU
‐ In a Specialized Cache Memory(Often called level-1 or level-2 cache)
‐ In Computers main memory
‐ On disk
The CPU has different sized caches. The size of cache is important specifically the block size
of the cache is important; the block size is unit of transfer and replacement.
We demonstrate standard cache-oblivious algorithms: cache-oblivious matrix
transposition, cache-oblivious matrix multiplication and cache-oblivious dynamic
programming. These have good cache performance in the cache-aware sense as well. We first
tested matrix transposition, matrix multiplication and dynamic programming and measured
their cache miss ratios.
3.1.1 Matrix Transposition
Matrix transposition is a fundamental operation in linear algebra and in fast Fourier
transforms and has applications in numerical analysis, image processing and graphics. The
simplest hack for transposing an N × N square matrix in C++ could be:

for (i = 0; i < N; i++)
    for (j = i + 1; j < N; j++)
        swap(A[i][j], A[j][i]);
In C++, matrices are stored in "row-major" order, i.e. the rightmost dimension varies
the fastest. In the above case, the number of cache misses the code incurs is Θ(N²),
whereas the optimal cache-oblivious matrix transposition makes Θ(1 + N²/B) cache misses [12].
Before we go into the divide-and-conquer based algorithm for matrix transposition that is
cache oblivious, let us look at some experimental results.
To gauge general cache-oblivious performance, we consider 3×3, 4×4 and 5×5 matrices
and find their cache miss and cache hit ratios; graphically, the behaviour is plotted
against matrix length in fig 1.
Fig 1: Cache-oblivious matrix transposition
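The divide-and-conquer transposition referred to above can be sketched as follows. This is a hypothetical illustration (the function names and base-case tile size are ours, not the paper's code): the recursion repeatedly halves the larger dimension of the current submatrix, so the working tiles eventually fit in any cache without the algorithm knowing the cache size.

```cpp
#include <vector>

// Out-of-place divide-and-conquer transpose, B = A^T, for an N x N
// matrix in row-major order. In the ideal-cache model this pattern
// incurs O(1 + N^2/B) misses without knowing the cache parameters.
void transposeRec(const double* a, double* b, int n,
                  int r0, int r1, int c0, int c1) {
    if ((r1 - r0) * (c1 - c0) <= 16) {        // small base-case tile
        for (int i = r0; i < r1; ++i)
            for (int j = c0; j < c1; ++j)
                b[j * n + i] = a[i * n + j];  // B[j][i] = A[i][j]
        return;
    }
    if (r1 - r0 >= c1 - c0) {                 // split the taller extent
        int rm = (r0 + r1) / 2;
        transposeRec(a, b, n, r0, rm, c0, c1);
        transposeRec(a, b, n, rm, r1, c0, c1);
    } else {                                  // split the wider extent
        int cm = (c0 + c1) / 2;
        transposeRec(a, b, n, r0, r1, c0, cm);
        transposeRec(a, b, n, r0, r1, cm, c1);
    }
}

std::vector<double> transpose(const std::vector<double>& a, int n) {
    std::vector<double> b(n * n);
    transposeRec(a.data(), b.data(), n, 0, n, 0, n);
    return b;
}
```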
3.1.2 Matrix Multiplication
Matrix multiplication is one of the most studied computational problems: we are given two
matrices of size m × n and n × p, and we want to compute the product matrix of size m × p.
In this section we take m = n = p = N, although the results can easily be extended to the
case when the dimensions are not equal. Thus, we are given two N×N matrices x = (x_{i,j})
and y = (y_{i,j}), and we wish to compute their product z, i.e. there are N² outputs,
where the (i, j) output is

z_{i,j} = Σ_{k=1}^{N} x_{i,k} · y_{k,j}
The algorithm breaks the three matrices x, y, z into four submatrices of size N/2 × N/2,
rewriting the equation z = x·y as

( r  s )   ( a  b )   ( e  f )
( t  u ) = ( c  d ) × ( g  h )

where r = ae + bg, s = af + bh, t = ce + dg and u = cf + dh.
As another general measure of cache-oblivious performance, we consider 3×3, 4×4 and
5×5 matrix multiplications; the processing is shown in fig 2.
Fig 2: Cache-oblivious matrix multiplication
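The 2×2 block recursion above generalizes to recursing on the largest of the three dimensions. The following is a hypothetical C++ sketch (our names, not the paper's code); in the ideal-cache model this pattern incurs O(N³/(L·√M)) misses without knowing L or M.

```cpp
#include <vector>

// Recursive cache-oblivious multiplication z += x * y for N x N
// row-major matrices, splitting the largest of the i, j, k extents.
void mulRec(const double* x, const double* y, double* z, int n,
            int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
    if (di == 1 && dj == 1 && dk == 1) {
        z[i0 * n + j0] += x[i0 * n + k0] * y[k0 * n + j0];  // scalar base case
    } else if (di >= dj && di >= dk) {
        int m = (i0 + i1) / 2;                              // halve rows of x/z
        mulRec(x, y, z, n, i0, m, j0, j1, k0, k1);
        mulRec(x, y, z, n, m, i1, j0, j1, k0, k1);
    } else if (dj >= dk) {
        int m = (j0 + j1) / 2;                              // halve columns of y/z
        mulRec(x, y, z, n, i0, i1, j0, m, k0, k1);
        mulRec(x, y, z, n, i0, i1, m, j1, k0, k1);
    } else {
        int m = (k0 + k1) / 2;                              // halve the inner dimension
        mulRec(x, y, z, n, i0, i1, j0, j1, k0, m);
        mulRec(x, y, z, n, i0, i1, j0, j1, m, k1);
    }
}

std::vector<double> multiply(const std::vector<double>& x,
                             const std::vector<double>& y, int n) {
    std::vector<double> z(n * n, 0.0);
    mulRec(x.data(), y.data(), z.data(), n, 0, n, 0, n, 0, n);
    return z;
}
```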
3.1.3 Dynamic Programming
Dynamic programming is a method for solving complex problems by breaking them
into simpler subproblems. It is applicable to problems exhibiting overlapping
subproblems, which are only slightly smaller, and optimal substructure; it takes less time
than naive methods.
Cache-oblivious implementations of the classic dynamic programming algorithms,
using value-transferring and value-storing methods, continue to run in O(m + n)
space and perform O(mn/BM) block transfers. Experiments show these algorithms to be
faster than widely used alternatives.
We again consider 3×3, 4×4 and 5×5 matrices and find the cache misses and cache hits;
graphically, the miss count rises with length, as in fig 3.
Fig 3: Cache-oblivious dynamic programming
For 2×2 and 3×3 matrices the cache miss ratio grows roughly from n to n + 1, but as the
size and length of the array increase, the cache miss ratio also rises, as shown in fig 3.
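As an illustration of a dynamic program running in O(m + n) extra space, consider the longest-common-subsequence length computed with only two rows of the (m+1) × (n+1) table. This is a standard technique offered as our own example, not the paper's algorithm; the paper's O(mn/BM)-transfer variants additionally recurse on the table.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest common subsequence of a and b, keeping only
// the previous and current DP rows (O(m+n) extra space).
int lcsLength(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (a[i - 1] == b[j - 1])
                         ? prev[j - 1] + 1                 // extend a match
                         : std::max(prev[j], cur[j - 1]);  // drop one symbol
        }
        std::swap(prev, cur);  // row i becomes "previous" for row i+1
    }
    return prev[b.size()];
}
```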
4. Analysis of Cache Oblivious Algorithms
We designed the following cache-oblivious algorithms with a new approach:
1) Cache Oblivious Matrix Multiplication
2) Cache Oblivious Matrix Transposition
3) Cache Oblivious Sorting
4) Cache Oblivious B-Tree
4.1 Cache Oblivious Matrix Multiplication
We have to perform two steps
1) Initialize single dimension array a, b, c and transfer element of matrix a as row-major
order and matrix b as column-major order in single dimension array.
2) Compute the array a, array b and store the result in array c as in Algorithm 1.
Algorithm 1: Multiplication of a 3-by-3 matrix
1. Set x = 0
2. Repeat step 3 for i = 1 to 3
   a) If (x > 9) then set x = 0
      [end of if]
3. Repeat for j = i to 9
   a) Repeat for k = i to 9
      i) Set c[x] = c[x] + a[j] * b[k]
      ii) Set x = x + 1
      iii) Set k = k + 1
      [end of k loop]
   b) Set j = j + 2
   [end of j loop]
   [end of i loop]
The multiplication scheme presented can easily be extended to multiplication of 5-by-5,
7-by-7 matrices and so on. It can be used for any matrix multiplication as long as the
matrix dimensions are compatible; the same approach applies throughout. In the case of a
large matrix, the matrix can be divided into submatrices.
We tested cache-oblivious performance on 3×3, 4×4 and 5×5 matrices and found the cache
miss and cache hit ratios shown in fig 4.
Fig 4: Cache-oblivious matrix multiplication
Implementation
We implement the matrix multiplication scheme developed in the previous section. The
algorithm uses a counter variable to convert a two-dimensional matrix into a single
dimension; an additional one-dimensional array is required to store the elements. In a
programming language such as C or C++, with the matrices a and b and the counter, we may
implement the scheme directly.
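A C++ sketch of this scheme for a general n × n case (the function name is our own): matrix a is flattened in row-major order and matrix b in column-major order, so every output element is a dot product of two contiguous runs of length n, and both arrays are scanned sequentially.

```cpp
#include <vector>

// Multiply using one-dimensional arrays: a holds an n x n matrix in
// row-major order, b holds an n x n matrix in column-major order, and
// c is filled with the row-major product.
std::vector<int> multiplyFlat(const std::vector<int>& a,  // row-major
                              const std::vector<int>& b,  // column-major
                              int n) {
    std::vector<int> c(n * n, 0);
    int x = 0;  // running output index, as in Algorithm 1
    for (int i = 0; i < n; ++i)        // row i of a: a[i*n .. i*n+n-1]
        for (int j = 0; j < n; ++j) {  // column j of b: b[j*n .. j*n+n-1]
            for (int k = 0; k < n; ++k)
                c[x] += a[i * n + k] * b[j * n + k];
            ++x;
        }
    return c;
}
```

For example, the matrix (5 6; 7 8) stored column-major is {5, 7, 6, 8}.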
4.2 Cache Oblivious Matrix Transposition
We reformulate the matrix transposition algorithm into the following form.
In Algorithm 2 we use a one-dimensional array and rely on the execution order of the main
loop. The loop is executed twice in the same order: the first pass copies the values of the
matrix in column-major order, and the second pass transfers the values back. The processing
of the matrix elements is shown in fig 5.
Fig 5: Operation of elements of matrix
We simply rearrange the matrix elements using a single one-dimensional array. This gives
very good temporal locality in the access pattern of the matrix elements, so the key
requirements for good cache performance are satisfied.
4.3 Cache Oblivious Sorting
In this technique we use an ordering of the minimum elements of the array; our sequential
sorting totally avoids the exchange (transposition) process.
Algorithm 2: Matrix Transposition
1. Set counter = 0
2. Repeat for i = 1 to n
   a) Repeat for j = 1 to n
      i) Set b[counter] = a[j][i]
      ii) Set counter = counter + 1
      [end of j loop]
   [end of i loop]
3. Set counter = 0
4. Repeat for i = 1 to n
   a) Repeat for j = 1 to n
      i) Set b[counter] = a[j][i]
      ii) Set counter = counter + 1
      [end of j loop]
   [end of i loop]
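A C++ sketch of Algorithm 2 (hypothetical names): the first pass copies the row-major matrix into a buffer in column-major order; the second pass writes the buffer back with a purely sequential scan, so the matrix ends up transposed without any element swaps.

```cpp
#include <vector>

// Transpose an n x n matrix held row-major in a one-dimensional array,
// via a column-major copy-out followed by a sequential copy-back.
std::vector<int> transposeFlat(std::vector<int> a, int n) {
    std::vector<int> b(n * n);
    int counter = 0;
    for (int i = 0; i < n; ++i)           // pass 1: column-major copy-out
        for (int j = 0; j < n; ++j)
            b[counter++] = a[j * n + i];
    for (int k = 0; k < n * n; ++k)       // pass 2: sequential copy-back
        a[k] = b[k];
    return a;
}
```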
Fig 6: Cache Oblivious Processing Sorting
First we rearrange the algorithm in the form shown in fig 6 and fig 7.
Fig 7: Array 2
In this process we use two arrays of the same length: at every phase we find the minimum
value and store it in the second array. This gives better locality of element access and
can benefit from the presence of cache memory.
In bubble sort we make passes from left to right over the permutation to be sorted
and repeatedly move the currently largest element right by exchanging it with its right
adjacent element, if that one is smaller. We make at most n-1 passes, since after moving
all but one element into the correct place, the single remaining element must also be in
its correct place. The total number of exchanges is obviously at most n², so we only need
to consider the lower bound. Let B be a bubble sort algorithm. For a permutation π of the
elements 1, 2, ..., n, the total number of exchanges equals the number of inversions of π:

E(π) = |{ (i, j) : i < j and π(i) > π(j) }|
In our sorting we move the minimum number from array 1 and store it into array 2, and this
process repeats n times. This method eliminates the processes of swapping and exchanging
elements.
Sorting requires Θ((N/B) log_{M/B}(N/B)) block transfers, and permuting an array requires
Θ(min{N, (N/B) log_{M/B}(N/B)}) block transfers [1]. These lower bounds also hold in the
cache-oblivious model, and they immediately give a lower bound of Ω((N/B) log_{M/B}(N/B))
block transfers for cache-oblivious sorting. The upper bound, however, cannot be carried
over to the cache-oblivious setting, since the corresponding algorithms make explicit use
of B and M.
In sequential processing we transfer the minimum value to array 2 during processing, and
we reformulate the method as Algorithm 3.
Algorithm 3: Sequential Sorting
1. Repeat steps 2, 3, 4 for i = 1 to n
2. Repeat for j = 1 to n
   a) If (a[j] <= min) and (a[j] <> 0) then
      i) Set min = a[j]
      ii) Set loc = j
      [end of if]
   [end of j loop]
3. Set b[i] = min
4. Set min = max value
   [end of i loop]
5. Exit
In this process we reformulate the bubble sort method using two arrays, array 1 and
array 2. In array 1 we find the location of the minimum value, replace it with a null
value, and copy the value into array 2; the process continues up to the last element. In
every phase we must reset min to the maximum value; this is compulsory for every outer
loop. We make at most n passes, since after moving all but one element into the correct
place, the single remaining element must also be in its correct place. The total number of
exchanges is obviously at most n², so we only need to consider the lower bound. Using the
same technique we may design Algorithm 4, cache-oblivious sequential sorting in
descending order.
Implementation: Algorithm 3 is the simple scheme developed in the previous sections. The
algorithm takes n phases, min 1, min 2, ..., min n, to complete the sorting process.
Programming languages such as C, C++ or Java can use these schemes directly.
4.4 Cache Oblivious B-tree
A binary tree T is in memory. The algorithm performs a DFS (preorder) traversal of T,
applying an operation PROCESS to each of its nodes. An array STACK is used to temporarily
hold the addresses of nodes, as in Algorithm 4.
Algorithm 4: DFS Traversal with a Stack
1. Set TOP = 1, STACK[1] = NULL and PTR = ROOT
2. Repeat steps 3 to 5 while PTR <> NULL
3. Apply PROCESS to INFO[PTR]
4. [Right child?] If RIGHT[PTR] <> NULL then [push on STACK]
      Set TOP = TOP + 1 and STACK[TOP] = RIGHT[PTR]
   [end of if]
5. [Left child?] If LEFT[PTR] <> NULL then
      Set PTR = LEFT[PTR]
   Else [pop from STACK]
      Set PTR = STACK[TOP] and TOP = TOP - 1
   [end of if]
   [end of step 2 loop]
6. Exit
Consider the worst-case number of memory transfers during a top-down traversal of a
root-to-leaf path using the above layout schemes, assuming each block stores B nodes. With
the BFS layout, the topmost [log(B+1)] levels of the tree are contained in at most two
blocks, whereas each of the following blocks read contains only one node from the path; the
total number of memory transfers is therefore Θ(log(N/B)). For the DFS and in-order
layouts we get the worst-case bound on the path to the rightmost leaf, since the first
[log(N+1)] - [log B] nodes have distance at least B in memory, whereas the last
[log(B+1)] nodes are stored in at most two blocks. Prokop observed that in the van Emde
Boas layout there are at most O(log_B N) memory transfers; only the van Emde Boas layout
achieves the asymptotically optimal bound of the B-tree. In our Sequential Accessing
Process we change the sequence in which binary trees are stored in memory.
Linked Representation of a Binary Tree
A binary tree T is represented in memory by three parallel arrays INFO, LEFT and RIGHT and
a pointer variable ROOT, as follows. Each node N of T corresponds to a location K such
that, as in fig 8 and fig 9:
1) INFO[K] contains the data at node N.
2) LEFT[K] contains the location of the left child of node N.
3) RIGHT[K] contains the location of the right child of node N.
Fig 8: A Binary Tree T
Fig 9: Memory Representation of Sequential Processing
In this process we divide the binary tree T into a left subtree and a right subtree, linked
with each other for sequential processing. When the four memory layouts (DFS, in-order,
BFS and van Emde Boas) are processed, Algorithm 5 converts the traversal to sequential
form. The cache-oblivious B-tree executes in linking order, left subtree followed by right
subtree; the sequential access method processes the linking order of the data structure.
We perform two steps:
1) Execute in sequential order.
2) Use the same approach for all four layouts.
Algorithm 5: DFS Preorder
1. Set PTR = ROOT
2. While (PTR <> NULL)
   a) Apply PROCESS to INFO[PTR]
   b) Set SAVE = PTR
   c) Set PTR = LEFT[PTR]
   [end of loop]
3. Set PTR = RIGHT[SAVE]
4. While (PTR <> NULL)
   a) Apply PROCESS to INFO[PTR]
   b) Set PTR = RIGHT[PTR]
   [end of loop]
5. Exit
The cache-oblivious B-tree traversal uses two loop phases, each repeating until PTR becomes
NULL. Programming languages such as C, C++ or Java can use these schemes directly.
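Algorithm 4's stack-based preorder traversal can be sketched in C++ over the INFO/LEFT/RIGHT parallel-array representation described above (1-based indices, with 0 playing the role of NULL; the function name is our own):

```cpp
#include <vector>

// Preorder (DFS) traversal of a binary tree held in parallel arrays,
// pushing right children on an explicit stack as in Algorithm 4.
// Index 0 acts as the NULL sentinel.
std::vector<int> preorder(const std::vector<int>& info,
                          const std::vector<int>& left,
                          const std::vector<int>& right, int root) {
    std::vector<int> out, stack;
    stack.push_back(0);                    // STACK[1] = NULL sentinel
    int ptr = root;
    while (ptr != 0) {
        out.push_back(info[ptr]);          // PROCESS(INFO[PTR])
        if (right[ptr] != 0)
            stack.push_back(right[ptr]);   // save the right child for later
        if (left[ptr] != 0) {
            ptr = left[ptr];               // descend left first
        } else {
            ptr = stack.back();            // pop: resume at a saved right child
            stack.pop_back();
        }
    }
    return out;
}
```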
5. Conclusions
To summarize, we have studied cache-oblivious algorithms for matrix multiplication, matrix
transposition, dynamic programming, sorting and binary search trees. The divide-and-conquer
design paradigm is used heavily in both parallel and external-memory algorithms.
Cache-oblivious algorithms make heavy use of this paradigm, and a lot of seemingly simple
algorithms based on it are already cache oblivious: for instance, Strassen's matrix
multiplication, quicksort, merge sort, closest pair, convex hulls and median selection are
all cache-oblivious algorithms, though not all of them are optimal in this model. This
means that they may be cache oblivious but can be modified to make fewer cache misses than
they do in their current form. In the section on matrix multiplication, we saw that
Strassen's matrix multiplication algorithm is already optimal in the cache-oblivious sense.
A divide-and-conquer algorithm splits the problem instance into several subproblems such
that each subproblem can be solved independently. Since the algorithm recurses on the
subproblems, at some point the subproblems fit inside a memory of size M, and subsequent
recursion fits a subproblem into a block of size B. For instance, one can analyze the
average number of cache misses in randomized sorting algorithms; such an algorithm is
quite cache friendly, if not cache optimal. In the cache-oblivious setting, we presented
the optimization of cache memory through an implementation of optimal cache-oblivious
matrix multiplication. Our algorithm uses only two types of recursive blocks; all the
elements are accessed sequentially and there are no jumps at all. The number of cache
misses is of the order of Θ(N³/(L√M)).
Our matrix transposition has excellent spatial and temporal locality features. Using the
ideal cache model, the number of cache misses is of the order of Θ(1 + N²/L). This is
asymptotically optimal for any algorithm based on recursive blocking.
We have presented a sorting algorithm with better locality features. Using the ideal cache
model, sorting incurs Θ((N/B) log_{M/B}(N/B)) block transfers, and permuting an array
requires Θ(min{N, (N/B) log_{M/B}(N/B)}) block transfers; this is asymptotically optimal
for any sorting algorithm in this model. Our method totally avoids the need for swapping
and for address arithmetic. While this fact is not fully exploited on standard computers,
it may be a considerable advantage for hardware implementations of sorting techniques.
The basic idea of the data structure is to maintain a binary tree of height log n + O(1)
using existing methods. In the sequential-access cache-oblivious binary process we also
investigate the practicality of cache obliviousness in the area of trees by comparing
different methods for laying out a tree in memory. One further observation is that the
space saving and reduction in size caused by the implicit layout are notable.
6. References
1. A. Aggarwal and J. Vitter, "The Input/Output Complexity of Sorting and Related Problems," Comm. ACM, vol. 31, pp. 1116-1127, 1988.
2. Bingsheng He and Qiong Luo, "Cache-Oblivious Nested Loop Joins," in CIKM '06, ACM, 2006.
3. M. A. Bender, Z. Duan, J. Iacono, and J. Wu, "A locality-preserving cache-oblivious dynamic dictionary," in Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 29-38, 2002.
4. D. Coppersmith and S. Winograd, "Matrix multiplication via arithmetic progressions," Journal of Symbolic Computation, 9:251-280, 1990.
5. D. E. Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd ed., Addison Wesley Longman, Boston, 1997.
6. D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd ed., Morgan Kaufmann, San Francisco, 1998.
7. E. D. Demaine, "Cache-oblivious algorithms and data structures," in Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, 2002.
8. E. D. Demaine, "Cache-Oblivious Algorithms and Data Structures," Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27-July 1, 2002, Springer.
9. Joon-Sang Park, "Optimizing Graph Algorithms for Improved Cache Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 9, Sept. 2004.
10. G. S. Brodal and R. Fagerberg, "Cache oblivious distribution sweeping," in Proc. 29th International Colloquium on Automata, Languages, and Programming, vol. 2380 of Lecture Notes in Computer Science, pp. 426-438, Springer-Verlag, Berlin, 2002.
11. F. G. Gustavson, M. Mehl, M. Pögl, and C. Zenger, "A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves," SIAM Journal of Scientific Computing, 1999.
12. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-Oblivious Algorithms," in Proc. 40th Annual Symposium on Foundations of Computer Science, Oct. 1999.
13. M. Bader and C. Zenger, Institut für Informatik der TU München, Boltzmannstr. 3, 85748 Garching, Germany, preprint submitted to Elsevier Science, 30 December 2004 (http://www.springerlink.com/content/p088315m7t778650).
14. M. Sniedovich, Dynamic Programming, Marcel Dekker, Inc., New York, NY, USA, 1992.
15. M. A. Furis, "Cache Miss Analysis of Walsh-Hadamard Transform Algorithms," master's thesis; M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. López-Ortiz, "The cost of cache-oblivious searching," in FOCS 2003, pp. 271-282.
16. M. A. Bender, E. Demaine, and M. Farach-Colton, "Cache-oblivious B-trees," in Proc. 41st Annual Symposium on Foundations of Computer Science, pp. 399-409, IEEE Computer Society Press, 2000.
17. N. Rahman, R. Cole, and R. Raman, "Optimized predecessor data structures for internal memory," in WAE 2001, 5th International Workshop on Algorithm Engineering, vol. 2141 of LNCS, pp. 67-78, Springer, 2001.
18. Piyush Kumar, "Cache Oblivious Algorithms," Department of Computer Science, University of New York, NSF (CCR-973220, CCR-0098172).
19. L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro, "Cache-oblivious priority queue and graph algorithm applications," in Proc. 34th Annual ACM Symposium on Theory of Computing (STOC '02), ACM Press, pp. 268-276, 2002.
20. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
21. Kazushige Goto and Robert van de Geijn, "On Reducing TLB Misses in Matrix Multiplication," TOMS, under revision (preprint on http://www.csutexas.edu/users/flame/pubs.html).
22. S. Chatterjee and S. Sen, DARPA DABT6398-1-0001, NSF grants DA07-2637, Department of Computer Science, Carolina, NC 27599-3175.
23. Steven Huss-Lederman, Elaine M. Jacobson, and Jeremy R. Johnson, "Implementation of Strassen's Algorithm," 089719-854-1/1996, IEEE Xplore.
24. T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed., MIT Press.
25. Tobias Johnson, "Cache-Oblivious of an Array's Pair," May 7, 2007.
26. R. Chowdhury and V. Ramachandran, "Cache Oblivious Dynamic Programming," NSF Grant CISE Research Infrastructure, NSF Grant CCF-0514871.
27. R. Chowdhury and V. Ramachandran, "Cache Oblivious Dynamic Programming for Bioinformatics," IEEE Bioinformatics, vol. 7, no. 3, July-Sept. 2010.
28. Z. Kasheff, "Cache-oblivious dynamic search trees," MS thesis, Massachusetts Institute of Technology, Cambridge, MA, 2004.