Transcript of High Performance Embedded Computing © 2007 Elsevier, Lecture 11: Memory Optimizations

Page 1:

High Performance Embedded Computing

© 2007 Elsevier

Lecture 11: Memory Optimizations

Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte

Based on slides and textbook from Wayne Wolf

Page 2:

Topics

List scheduling
Loop transformations
Global optimizations
Buffers, data transfers, and storage management
Cache and scratch-pad optimizations
Main memory optimizations

Page 3:

List Scheduling

Given a data-flow graph (DFG), how do you assign (schedule) operations to particular slots?

Schedule based on a priority function:
Compute the longest path from each node to a leaf as the priority.
Schedule nodes with the longest path first.
Keep track of readiness (based on result latency).

Goal: maximize ILP; fill as many issue slots as possible with useful work (minimize NOPs).
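A minimal single-issue sketch of this heuristic in C. The six-node DFG, latencies, and names are made up for illustration (they are not the example on the next slide):

#include <stdio.h>

#define NNODES 6
#define NONE  (-1)

/* edge[u][v] = 1 means operation v consumes u's result; lat[u] is u's
 * result latency. The DFG below is hypothetical. */
static int edge[NNODES][NNODES];
static int lat[NNODES];
static int prio[NNODES];   /* longest path to a leaf */
static int avail[NNODES];  /* cycle when u's result is ready, or NONE */

static int longest_path(int u) {
    int best = 0;
    for (int v = 0; v < NNODES; v++)
        if (edge[u][v]) {
            int p = lat[u] + longest_path(v);
            if (p > best) best = p;
        }
    return best;
}

static int is_ready(int u, int cycle) {
    if (avail[u] != NONE) return 0;            /* already scheduled */
    for (int p = 0; p < NNODES; p++)           /* all inputs available? */
        if (edge[p][u] && (avail[p] == NONE || avail[p] > cycle))
            return 0;
    return 1;
}

int main(void) {
    /* hypothetical DFG: 0->2, 1->2, 2->4, 3->4, 4->5 */
    edge[0][2] = edge[1][2] = 1;
    edge[2][4] = edge[3][4] = 1;
    edge[4][5] = 1;
    for (int u = 0; u < NNODES; u++) { lat[u] = 1; avail[u] = NONE; }

    for (int u = 0; u < NNODES; u++)
        prio[u] = longest_path(u);             /* priority function */

    /* each cycle, issue the highest-priority ready op (single issue) */
    for (int cycle = 0, left = NNODES; left > 0; cycle++) {
        int pick = NONE;
        for (int u = 0; u < NNODES; u++)
            if (is_ready(u, cycle) && (pick == NONE || prio[u] > prio[pick]))
                pick = u;
        if (pick == NONE) { printf("cycle %d: NOP\n", cycle); continue; }
        avail[pick] = cycle + lat[pick];       /* result ready after latency */
        printf("cycle %d: op %d (prio %d)\n", cycle, pick, prio[pick]);
        left--;
    }
    return 0;
}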

Page 4:

List Scheduling Example

Heuristic: no guarantee of optimality. For parallel pipelines, also account for structural hazards.

Each step picks the highest-priority ready node from the list, shown as node(priority):

Pick 3: 3(4), 1(4), 2(3), 4(2), 5(0), 6(0)
Pick 1: 1(4), 2(3), 4(2), 5(0), 6(0)
Pick 2: 2(3), 4(2), 5(0), 6(0)
Pick 4: 4(2), 5(0), 6(0)
Pick 6 (5 not ready): 5(0), 6(0)
Pick 5: 5(0)

Page 5:

Local vs. Global Scheduling

Single-entry, single-exit (e.g., basic block) scope: limited opportunity, since a basic block typically holds only 4-5 instructions.

Expand the scope:
Unroll loop bodies, inline small functions.
Construct superblocks and hyperblocks, which are single-entry, multiple-exit sequences of blocks.
Allow code motion across control-flow divergence: such motion is speculative, so consider safety (exceptions) and architectural state. Predication is useful to nullify wrong-path instructions.

Page 6:

Memory-oriented optimizations

Memory is a key bottleneck in many embedded systems.
Memory usage can be optimized at any level of the memory hierarchy.
Optimizations can target data or instructions.
Global memory analysis can be particularly useful: it is important to size buffers between subsystems to avoid buffer overflow and wasted memory.

Page 7:

Loop transformations

Data dependencies may be within or between loop iterations. Ideal loops are fully parallelizable.
A loop nest has loops enclosed by other loops. A perfect loop nest has no statements, such as conditionals, outside the innermost loop body.

Page 8:

Types of loop transformations

Loop permutation changes the order of loops.
Index rewriting changes the form of the loop indexes.
Loop unrolling copies the loop body.
Loop splitting creates separate loops for operations in the loop body.
Loop fusion (loop merging) combines loop bodies.
Loop padding adds data elements to an array to change how the array maps into memory.

Page 9:

Polytope model

Commonly used to represent data dependencies in loop nests.
Loop transformations can be modeled as matrix operations; each column represents iteration bounds.

[Figure: iteration space (polytope) over loop indices i and j]
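For a two-deep nest, loop interchange corresponds to multiplying the index vector by a permutation matrix (a standard polytope-model example, added here for concreteness):

\[
\begin{pmatrix} j \\ i \end{pmatrix} =
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
\]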

Page 10:

Loop permutation

Changes the order of loop indices.
Can help reduce the time needed to access matrix elements: 2-D arrays in C are stored in row-major order, so access the data row by row.
Example: matrix-vector multiplication (sketched below).
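A minimal C sketch (the array names and size N are hypothetical). Before permutation the inner loop walks down a column of a, touching a new row-major line on every iteration; after permutation the accesses to a are sequential:

/* before: column-order traversal, strided accesses to a */
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        y[i] += a[i][j] * x[j];

/* after loop permutation: row-order traversal, sequential accesses */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        y[i] += a[i][j] * x[j];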

Page 11:

Loop fusion

Combines loop bodies.

Original loops:

for (i = 0; i < N; i++)
    x[i] = a[i] * b[i];
for (i = 0; i < N; i++)
    y[i] = a[i] * c[i];

After loop fusion:

for (i = 0; i < N; i++) {
    x[i] = a[i] * b[i];
    y[i] = a[i] * c[i];
}

How might this help improve performance?

Page 12:

Buffer management [Pan01]

In embedded systems, buffers are often used to communicate between subsystems.
Excessive dynamic memory management wastes cycles and energy with no functional improvement.
Many embedded programs use arrays that are statically allocated.
Several loop transformations have been developed to make buffer management more efficient.

Before:

for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j)
        b[i][j] = 0;
for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j)
        for (k = 0; k < L; ++k)
            b[i][j] += a[i][j + k];

After (the writes to b[i][j] are now closer to the reads that use them):

for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j) {
        b[i][j] = 0;
        for (k = 0; k < L; ++k)
            b[i][j] += a[i][j + k];
    }

Page 13:

Buffer management [Pan01]

int a_buf[L];
int b_buf;

for (i = 0; i < N; ++i) {
    /* fill a_buf with the first L - 1 elements of row i; the element
       for the current j is loaded inside the j loop */
    for (k = 0; k < L - 1; ++k)
        a_buf[k] = a[i][k];
    for (j = 0; j < N - L; ++j) {
        b_buf = 0;
        a_buf[(j + L - 1) % L] = a[i][j + L - 1];  /* slide the window */
        for (k = 0; k < L; ++k)
            b_buf += a_buf[(j + k) % L];
        b[i][j] = b_buf;
    }
}

Loop analysis can help to make data reuse more explicit.

Buffers are declared in the program but don't need to exist in the final implementation.

Page 14:

Cache optimizations [Pan97]

Strategies:
Move data to reduce the number of conflicts.
Move data to take advantage of prefetching.

Need:
A load map.
Information on access frequencies.

Page 15:

Cache conflicts

Assume a direct-mapped cache of size C = 2^m words with a cache line size of M words.
Memory address A maps to cache line k = (A mod C) / M (integer division).
If the arrays are allocated contiguously and N is a multiple of C, then a[i], b[i], and c[i] all map to the same cache line.
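A concrete instance (the sizes are hypothetical, and assume the compiler places the arrays back to back): with C = 1024 and N = 4096, a[i], b[i], and c[i] share a line index, so each access below evicts the line the next access needs:

int a[4096], b[4096], c[4096];   /* N = 4096, a multiple of C = 1024 */

for (i = 0; i < 4096; i++)
    c[i] = a[i] + b[i];          /* three conflicting accesses per iteration */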

Page 16:

Reducing cache conflicts

Could increase the cache size. Why might this be a bad idea?
Instead, add L dummy words between adjacent arrays.
Let f(x) denote the cache line to which the program variable x is mapped.
For L = M, with a[i] starting at address 0, we have the mapping sketched below.
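A reconstruction consistent with the setup above (an assumption: arrays of N words each, N a multiple of C, laid out as a, M pad words, b, M pad words, c):

\[
f(a[i]) = \left\lfloor \frac{i \bmod C}{M} \right\rfloor, \qquad
f(b[i]) = \bigl(f(a[i]) + 1\bigr) \bmod \frac{C}{M}, \qquad
f(c[i]) = \bigl(f(a[i]) + 2\bigr) \bmod \frac{C}{M}
\]

so the three arrays now fall on distinct cache lines.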

Page 17:

Scalar variable placement

Place scalar variables to improve locality and reduce cache conflicts.
Build a closeness graph that indicates the desirability of keeping sets of variables close in memory.
Since M adjacent words are read on a single miss, group variables into M-word clusters.
Build a cluster interference graph, which indicates which clusters map to the same cache line.
Use the interference graph to optimize placement: try to avoid interference.

Page 18:

Constructing the closeness graph

Generate an access sequence:
Create a node for each memory access in the code.
A directed edge between nodes indicates successive accesses.
Loops are weighted with the number of loop iterations.

Use the access sequence to construct the closeness graph:
Connect nodes within distance M of each other; e.g., x to b is distance 4 (counting both x and b).
Link weights give the number of times control flows between the nodes.
Construction requires O(Mn^2) time for n nodes.

Page 19:

Group variables into M-word clusters

Determine which variables to place on the same line: put variables that will frequently be accessed close together in time on the same line. Why?
Form clusters to maximize the total weight of the edges inside all the clusters; a greedy sketch follows.
The greedy AssignClusters algorithm has complexity O(Mn^2). In the previous example, M = 3 and n = 9.
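A minimal greedy sketch of cluster formation over a precomputed closeness graph (an illustrative stand-in, not the exact AssignClusters algorithm of [Pan97]; all names are hypothetical):

#define NVARS 9        /* n = 9, as in the example */
#define M     3        /* words per cache line */

static int weight[NVARS][NVARS];  /* closeness-graph edge weights */
static int cluster[NVARS];        /* cluster id of each variable */
static int csize[NVARS];          /* current size of each cluster */

void assign_clusters(void) {
    for (int v = 0; v < NVARS; v++) { cluster[v] = v; csize[v] = 1; }
    for (;;) {
        /* heaviest closeness edge joining two clusters that would
           still fit in one M-word line after merging */
        int bu = -1, bv = -1, best = 0;
        for (int u = 0; u < NVARS; u++)
            for (int v = u + 1; v < NVARS; v++)
                if (cluster[u] != cluster[v] &&
                    csize[cluster[u]] + csize[cluster[v]] <= M &&
                    weight[u][v] > best) {
                    best = weight[u][v]; bu = u; bv = v;
                }
        if (bu < 0) break;        /* no feasible, profitable merge left */
        int from = cluster[bv], to = cluster[bu];
        csize[to] += csize[from];
        for (int v = 0; v < NVARS; v++)  /* merge the two clusters */
            if (cluster[v] == from) cluster[v] = to;
    }
}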

Page 20:

Build the cluster interference graph

Identify clusters that should not map to the same cache line:
Convert the variable access sequence into a cluster access sequence.
An edge weight corresponds to the number of times accesses alternate between the two clusters along the execution path.
Cluster pairs with high weights should not be mapped to the same line.

Page 21:

Assigning memory locations to clusters

Find an assignment of clusters in a CIG to memory locations such that MemAssignCost(CIG) is minimized.

Page 22:

Array placement

Focus on arrays accessed in innermost loops. Why?
Arrays are placed to avoid conflicting accesses with other arrays.
Don't worry about clustering, but still construct the interference graph; its edges are dictated by the array bounds.

Page 23:

Avoid conflicting memory locations

Given addresses X and Y, and a cache with k lines each holding M words, the condition for X and Y to map to the same cache line is given below.
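A standard formulation of that condition for a direct-mapped cache (reconstructed from the definitions above, not copied from the slide):

\[
\left\lfloor \frac{X}{M} \right\rfloor \bmod k \;=\; \left\lfloor \frac{Y}{M} \right\rfloor \bmod k
\]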

Page 24:

Array assignment algorithm

[Pan97] © 1997 IEEE

Page 25:

Results from data placement [Pan97]


Data cache hit rates improve by 48% on average.
This yields an average speedup of 34%.
Results are for a 256-byte cache, measured on kernels.

Page 26:

On-Chip vs. Off-Chip Data Placement

[Pan00] explores how to partition data between on-chip and off-chip memory to optimize performance.

Page 27:

On-Chip vs. Off-Chip Data Placement

Allocate static variables at compile time:
Map all scalar values/constants to the scratch pad.
Map all arrays too large for the scratch pad into DRAM.
Only arrays with intersecting lifetimes will have conflicts.

Calculate several parameters:
VAC(u): variable access count, the number of times u is accessed.
IAC(u): interference access count, the number of times other variables are accessed during u's lifetime.
IF(u): total interference count = VAC(u) + IAC(u).

Page 28:

On-Chip vs. Off-Chip Data Placement

Also need to calculate LCF(u), the loop conflict factor, where:
p is the number of loops that access u
k(u) is the number of accesses to u within a given loop
K(v) is the number of accesses to variables other than u within that loop

And TCF(u): the total conflict factor. One formulation of LCF consistent with these definitions follows.
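A hedged reconstruction (an assumption based on the definitions above, not quoted from [Pan00]), with subscript i denoting counts within the i-th loop:

\[
LCF(u) = \sum_{i=1}^{p} \Bigl( k_i(u) + \sum_{v \neq u} K_i(v) \Bigr)
\]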

Page 29:

Scratch pad allocation formulation

AD(c): access density.
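A natural definition given the quantities above (an assumption, not quoted from [Pan00]): access density is the total conflict factor per unit of storage,

\[
AD(c) = \frac{TCF(c)}{\mathrm{size}(c)}
\]

so small, heavily conflicting variables are the most attractive scratch-pad candidates.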

Page 30:

Scratch pad allocation algorithm

[Pan00] © 2000 ACM Press

Page 31:

Scratch pad allocation performance

[Pan00] © 2000 ACM Press

Page 32:

Main memory-oriented optimizations

Memory chips provide several useful modes:
Burst mode accesses sequential locations: provide a start address and a length, reducing the number of addresses sent and increasing the transfer rate.
Paged mode allows only part of the address to be transmitted: the address is split into a page number and an offset, and the page number is stored in a register to quickly access values in the same page.

Access times therefore depend on the address(es) being accessed.
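A sketch of why access order matters under paged mode (PAGE_WORDS, the page size in words, is a hypothetical constant):

/* consecutive elements usually fall in the same DRAM page, so the
   page (row) address is sent once and then reused */
for (i = 0; i < N; i++)
    sum += a[i];

/* striding by a whole page forces a new page per access, paying the
   page-open latency every time */
for (i = 0; i < N; i += PAGE_WORDS)
    sum += a[i];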

Page 33:

Banked memories


Banked memories allow multiple memory banks to be accessed in parallel.
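One common arrangement, sketched below (low-order interleaving is an illustrative assumption, not stated on the slide):

/* consecutive word addresses map to consecutive banks, so a
   sequential sweep keeps all NBANKS banks busy in parallel */
#define NBANKS 4
#define BANK_OF(word_addr) ((word_addr) % NBANKS)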