
    A NOR Emulation Strategy over NAND Flash Memory

Jian-Hong Lin, Yuan-Hao Chang, Jen-Wei Hsieh, Tei-Wei Kuo, and Cheng-Chih Yang

Graduate Institute of Networking and Multimedia, and
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan 106, R.O.C.
{r94944003, d93944006, ktw}@csie.ntu.edu.tw

Department of Computer Science and Information Engineering
National Chiayi University, Chiayi, Taiwan 60004, R.O.C.
[email protected]

Product Development Firmware Engineering Group, Genesys Logic, Inc.
Taipei, Taiwan 231, R.O.C.
[email protected]

Abstract

This work is motivated by a strong market demand for the replacement of NOR flash memory with NAND flash memory to cut down the cost of many embedded-system designs, such as mobile phones. Different from LRU-related caching or buffering studies, we are interested in prediction-based prefetching based on given execution traces of applications. An implementation strategy is proposed for the storage of the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented based on information extracted from application executions to reduce the performance gap between NAND flash memory and NOR flash memory in reads. With the behavior of a target application extracted from a set of collected traces, we show that data accesses over the emulated NOR flash memory are served effectively by SRAM that acts as a cache for NAND flash memory.

Keywords: NAND, NOR, flash memory, data caching

1. INTRODUCTION

While flash memory remains one of the most popular storage media in embedded systems because of its non-volatility, shock resistance, small size, and low energy consumption, its application has grown much beyond its original design. Based on that original design, NOR flash memory (referred to as NOR for short) is meant to store the binary code of programs, because NOR supports XIP (eXecute-In-Place) and high performance in read operations, while NAND flash memory (referred to as NAND for short) is used as data storage, because NAND has a lower price and higher performance in write/erase operations, compared to NOR [22, 23, 27, 30]. In recent years, the price of NAND has come down much faster than that of NOR; e.g., the price of 8Gbit NAND was lower than one-fifth of that of 8Gbit NOR in the first quarter of 2007 [14]. In order to ultimately reduce the hardware cost, using NAND to replace NOR (motivated by a strong market demand) has become a new trend in embedded-system designs, especially in mobile phones and arcade games, even though read performance is an essential and critical issue in program execution. These observations motivate the objective of this work: to fill the read performance gap between NAND and NOR with limited overhead.

(This work was supported by the National Science Council of Taiwan, R.O.C., under Grants NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.)

The management of flash memory is carried out either by software on a host system (as a raw medium) or by hardware circuits/firmware inside its devices. In the past decade, there have been many research and implementation designs for managing flash-memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 16, 31, 33, 34]. Some exploited efficient management schemes for large-scale storage systems and different architecture designs, e.g., [8, 9, 16, 31, 33, 34]. To improve the performance of hard disks, Intel proposed Robson, which uses NAND as a non-volatile cache of hard disks [1, 4, 7, 32], and Microsoft proposed fast booting mechanisms (called ReadyBoost and ReadyDrive) in Windows Vista to enhance system performance by caching data in flash memory. To use flash memory to replace hard disks and to improve the performance of flash memory, some proposed using battery-backed SRAM to cache recently used data from flash memory with a copy-on-write scheme [12, 13, 18, 32]. For the new research trend of using NAND to store both code and data (in order to replace NOR), C. Park et al. developed a cost-efficient memory architecture that integrates NAND flash memory into the existing memory hierarchy for XIP, and also proposed an application-specific demand-paging mechanism for NAND flash memory with compiler assistance [20, 21]. However, the proposed mechanisms need the source code of applications and specific compilers to lay out the compiled code at fixed memory locations. The imposed constraints make the mechanisms hard to implement.


Although researchers have proposed many excellent methods for improving the performance of flash memory, very few of them work on improving the performance of program execution when NAND is used to store binary code. Only some research proposed caching mechanisms, e.g., [15, 17, 19, 25, 26], to improve the performance of program execution (where code is stored in NAND), without considering the access patterns. Without prefetching data according to the access patterns of the installed programs, the cache miss rate cannot be effectively reduced, and therefore programs cannot be executed efficiently; this matters because most embedded systems only perform some specific functions, especially arcade-game stations.

In this paper, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation. The prediction mechanism collects the access patterns of program execution to construct a prediction graph by adopting the working-set concept [10, 11]. According to the prediction graph, the prediction mechanism prefetches data (/code) into the SRAM cache so as to reduce the cache miss rate. As a result, the performance of program execution is improved, and the read performance gap between NAND and NOR is filled effectively. A series of experiments is then conducted based on realistic traces, which were collected from three different types of popular games: Age of Empire II (AOE II), The Typing of the Death (TTD), and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR by 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and the percentage of redundant prefetched data was lower than 10% in most cases.

The rest of this paper is organized as follows: Section 2 describes the characteristics of flash memory and the research motivation. In Section 3, an efficient prediction mechanism is proposed. Section 4 summarizes the experimental results on read performance, cache miss rate, and extra overheads. Section 5 is the conclusion.

2. Flash-Memory Characteristics and Research Motivation

There are two types of flash memory: NAND and NOR. Each NAND flash memory chip consists of many blocks, and each block consists of a fixed number of pages. A block is the smallest unit for erase operations, while reads and writes are done in pages. A page contains a user area and a spare area, where the user area is for the data storage of a logical block, and the spare area stores ECC and other house-keeping information (i.e., the LBA). Because flash memory is write-once, we do not overwrite data on each update. Instead, data are written to free space, and the old versions of data are invalidated (or considered dead). This update strategy is called out-place update. In other words, any existing data on flash memory cannot be over-written (updated) unless its corresponding block is erased. The pages that store live data and dead data are called valid pages and invalid pages, respectively.

Depending on the design, blocks have different bounds on the number of erases over a block. For example, the typical erase bounds of SLC and MLC2 NAND flash memory are 10,000 and 1,000, respectively.1 Each page of small-block (/large-block) SLC NAND can store 512B (/2KB) of data, and there are 32 (/64) pages per block. The spare area of a small-block (/large-block) SLC NAND page is 16B (/64B). On the other hand, each page of MLC2 NAND can store 2KB, and there are 128 pages per block. Different from NAND flash memory, a byte is the unit for reads and writes over NOR flash memory.

1 There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one bit of information, while each cell of MLCn flash memory contains n bits of information.
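To make the geometry concrete, the following C sketch encodes the large-block SLC NAND layout described above; the macro and type names are ours, for illustration only:

    #include <stdint.h>

    /* Geometry of large-block SLC NAND as described above (names are
     * illustrative). A page pairs a 2KB user area with a 64B spare area. */
    #define PAGE_USER_BYTES   2048u  /* user area: data of a logical block */
    #define PAGE_SPARE_BYTES    64u  /* spare area: ECC, LBA, house-keeping */
    #define PAGES_PER_BLOCK     64u  /* a block is the smallest erase unit */

    typedef struct {
        uint8_t user[PAGE_USER_BYTES];
        uint8_t spare[PAGE_SPARE_BYTES];
    } nand_page_t;

    typedef struct {
        nand_page_t pages[PAGES_PER_BLOCK];  /* erased only as a whole */
    } nand_block_t;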

                                      SLC NOR [28]    SLC NAND [24]
                                                      (large-block, 2KB-page)
    Price (US$/GB) [14]               34.65           6.79
    Read (random access of 8 bits)    40ns            25µs
    Write (random access of 8 bits)   14µs            300µs
    Read (sequential access)          23.842 MB/sec   15.33 MB/sec
    Write (sequential access)         0.068 MB/sec    4.57 MB/sec
    Erase                             0.217 MB/sec    6.25 MB/sec

Table 1. The Typical Characteristics of NOR and NAND.

Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory. (The Host Interface connects over the address and data buses to the Control Logic, which contains the Converter and the Prefetch Procedure; the Control Logic mediates between the SRAM cache and the NAND flash memory, converting between byte accesses on the host side and 512-byte accesses on the NAND side.)

NAND has been widely adopted in the implementation of storage systems because of its advantages in cost and write throughput (for block-oriented access), compared to NOR. 1GB of NOR typically costs US$34.65 in the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will get even wider in the near future. However, due to the high read performance of NOR, as shown in Table 1, and its eXecute-In-Place (XIP) characteristics, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players (PMP). The characteristics of NAND and NOR are summarized in Table 1.

This research is motivated by a strong market demand for the replacement of NOR with NAND in many embedded-system designs. In order to fill the performance gap between NAND and NOR, SRAM is a natural choice for data caching for performance improvement, such as in the simple but effective hardware architecture adopted by OneNAND [15, 25, 26] (please see Figure 1). However, the most critical technical problem behind a successful replacement of NOR with NAND lies in the prediction scheme and its implementation design. Such an observation underlies the objective of this research: the design and implementation of an effective prediction mechanism for applications, with flash-memory characteristics taken into consideration. Because of the stringent resource constraints of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.

    3. An Efficient Prediction Mechanism

    3.1 Overview

In order to fill the performance gap between NAND and NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for communication with the host system via the address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM. The Control Logic should have an intelligent prediction mechanism implemented to improve the system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or the arcade games of amusement-park machines. The design and implementation should also respect the resource constraints of a controller in SRAM capacity and computing power.

There are two major components in the Control Logic: the Converter and the Prefetch Procedure. The Converter emulates NOR access over NAND with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one and four LBAs, respectively [29]. The Prefetch Procedure tries to prefetch data from NAND into SRAM so that the hit rate of NOR accesses over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the access patterns extracted from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we shall define a prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure will then be presented in Section 3.3.
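As an illustration of the Converter's translation step, here is a minimal C sketch assuming 512B LBAs and 2KB large-block pages (four LBAs per page, as noted above); the names are hypothetical:

    #include <stdint.h>

    #define LBA_BYTES       512u   /* one LBA covers a 512B sector */
    #define LBAS_PER_PAGE     4u   /* a 2KB large-block NAND page holds 4 LBAs */

    /* NAND-side coordinates of a NOR-style byte address. */
    typedef struct {
        uint32_t lba;          /* logical block address of the sector */
        uint32_t page;         /* NAND page containing that LBA */
        uint32_t byte_offset;  /* offset of the byte inside the page */
    } nand_addr_t;

    static nand_addr_t convert(uint32_t nor_byte_addr)
    {
        nand_addr_t a;
        a.lba         = nor_byte_addr / LBA_BYTES;
        a.page        = a.lba / LBAS_PER_PAGE;
        a.byte_offset = nor_byte_addr % (LBA_BYTES * LBAS_PER_PAGE);
        return a;
    }

For a byte read, the Converter would fetch page a.page into the SRAM cache (if it is not already there) and return the byte at a.byte_offset.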

    3.2 A Prediction Graph and Implementation

The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBAs, where some LBAs are for instructions and the others are for data. As an application runs multiple times, a virtually complete picture of the possible access patterns of the application execution might appear, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA, where each LBA corresponds to one node in the graph. A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND could be prefetched in an on-time fashion, and there were enough SRAM space for caching, then all data accesses could be done over SRAM.

Figure 2. An example of a prediction graph. (Nodes labeled with LBAs are connected according to the observed access orders; branch nodes, which have more than one successor, are shaded.)

Figure 3. The Storage of a Prediction Graph. (a) Prediction Information: the spare area of a regular node's page stores its subsequent LBA, while the spare area of a branch node's page stores the starting entry address of its branch-table record. (b) A Branch Table: a starting entry holding the branch count (here 3), followed by the subsequent LBAs addr(b1), addr(b2), and addr(b3).

The technical problems are how to save the prediction graph over flash memory with minimized overheads, and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent LBA information of each regular node in the spare area of the corresponding page. This is because the spare area of a page has unused space in current implementations and the specification, and the reading of a page usually comes with the reading of its data and spare areas simultaneously. In this way, accessing the subsequent LBA information of a regular node comes at no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We propose to maintain a branch table to save the subsequent LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBAs of the branch node, and the subsequent LBAs are stored in the entries following the starting entry (please see Figure 3(b)). The branch table can be saved on flash memory. During run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.
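One possible in-memory rendering of this storage scheme is sketched below in C; the paper does not fix field widths or names, so these are assumptions:

    #include <stdint.h>

    /* Prediction info kept in a page's spare area (alongside ECC/LBA).
     * A regular node stores its single subsequent LBA; a branch node
     * stores the starting entry address of its branch-table record. */
    typedef struct {
        uint8_t  is_branch;  /* 0: regular node, 1: branch node */
        uint32_t next;       /* subsequent LBA, or branch-table entry address */
    } spare_pred_t;

    /* A branch-table record: the starting entry holds the branch count n,
     * and the n entries that follow hold the subsequent LBAs. The table
     * lives on flash and is loaded into SRAM, possibly piecewise. */
    typedef uint32_t branch_entry_t;  /* entry 0: count; entries 1..n: LBAs */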

    3.3 A Prefetch Procedure

Figure 4. A Snapshot of the Cache. (A cyclic buffer of page frames between the current index, which points to the page currently accessed, for LBA 2, and the next index, which points to the page just prefetched, for LBA 6.)

The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to efficiently look up a selected page in the cache, we propose to adopt a cyclic buffer in the cache management, and let two indices, current and next, denote the pages currently accessed and prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2: the page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 has just been prefetched (please see Figure 4).
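As a minimal C sketch (our own rendering, not from the paper), the empty and full tests of this cyclic buffer look as follows, with SIZE and the two indices named after the text:

    #define SIZE 8  /* number of page buffers in the SRAM cache (example value) */

    static unsigned current_idx;  /* page currently being accessed */
    static unsigned next_idx;     /* page most recently prefetched */

    /* current = next: no page is prefetched ahead; the caching buffer is empty. */
    static int cache_empty(void) { return current_idx == next_idx; }

    /* current = (next + 1) mod SIZE: prefetching has wrapped around to the slot
     * just behind current; the buffer is full (a Stop Condition). */
    static int cache_full(void) { return current_idx == (next_idx + 1) % SIZE; }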

The prefetch procedure is done in a greedy way. Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to its subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next LBA links on an equal basis, in an alternating way; that is, the prefetch procedure follows each LBA link in turn. For example, the pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3 is prefetched, as shown in Figure 4. The next pages to be prefetched are the pages corresponding to Nodes 1 and 6. In order to properly manage the prefetching cost, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure should also stop temporarily. Take the prediction graph shown in Figure 2 as an example: the prefetch procedure should not prefetch the pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) will determine which branch the application execution will follow. The prefetch procedure should then start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBAs of the target page, if the target page corresponds to a branch node). The above prefetch procedure resumes whenever it has stopped temporarily because of any Stop Condition. Note that all pages cached in the SRAM cache between current and next stay in the cache after the target page (in the following of a branch) is accessed. This is because some of the cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next. In such a case, data are accessed from NAND and loaded into the SRAM cache in an on-demand fashion.

The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, stop and startbch, are used to track the prefetching state: stop and startbch denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, stop and startbch are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, then the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition) or a branch node is reached for the first time. First, next is checked: if it would point to the same page as current does, then the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure advances next, i.e., the location of the next free cache buffer (Step 7). The LBA to prefetch is obtained by looking up the latest prefetched LBA (Step 8), and then the page of that LBA is prefetched (Step 9). After the prefetching of a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch table entries (Step 12) and saves the subsequent LBA of each branch of the branch node (Steps 12-17). Because the prefetched page corresponds to a branch node, the procedure should start prefetching pages by following each branch in an alternating way (Steps 20-36). The loop stops when the cache is full (Step 20), when every next LBA link of the branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idxbch has not yet reached the next branch node (Step 21), the next LBA following that link shall be prefetched (Steps 26-28). Pages are prefetched by following all possible next LBA links on an equal basis, in an alternating way (Step 30).

Note that stop should be set to FALSE when the cache is no longer full or when next and current do not point to the same page.2 Moreover, stop and startbch should both be reset to FALSE when current passes a branch node and meets the target page, or when a cache miss occurs (i.e., current = next). Once stop is set to FALSE, the prefetch procedure is invoked. When startbch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop between Steps 2 and 19. Otherwise, the prefetch procedure continues its previous prefetching job by following the next LBA links of the visited branch node in an alternating way (Steps 20-36).

2 Performance enhancement is possible by deploying more complicated condition settings and actions.


Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idxbch, Nbch, lbabch[], and startbch
Output: null
1:  if stop = TRUE then return;
2:  while startbch = FALSE and (next + 1) mod SIZE ≠ current do
3:    if ChkNxLBA(lba) = cache(current) then
4:      stop ← TRUE;
5:      return;
6:    end
7:    next ← (next + 1) mod SIZE;
8:    lba ← GetNxLBA(lba);
9:    Read(next, lba);
10:   startbch ← IsBchStart();
11:   if startbch = TRUE then
12:     LdBchTable(GetNxLBA(lba));
13:     idxbch ← 0;
14:     Nbch ← GetBchNum();
15:     for i = 0; i
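To make the control flow concrete, the following C sketch mirrors the part of the first loop shown above (Steps 1-14); the helpers keep the pseudocode's own names but are declared here as hypothetical prototypes rather than a real API:

    #include <stdint.h>

    #define SIZE 8  /* page buffers in the SRAM cache */

    /* State mirroring Algorithm 1's inputs (illustrative globals). */
    static int      stop, start_bch;
    static unsigned current_idx, next_idx;
    static uint32_t lba, idx_bch, N_bch;

    /* Hypothetical prototypes mirroring the pseudocode's helpers; their
     * bodies would consult the prediction graph, the branch table, and
     * the NAND driver. */
    uint32_t ChkNxLBA(uint32_t lba);             /* peek at the subsequent LBA */
    uint32_t GetNxLBA(uint32_t lba);             /* fetch the subsequent LBA */
    uint32_t cache_lba(unsigned slot);           /* LBA cached in a buffer slot */
    void     Read(unsigned slot, uint32_t lba);  /* NAND page -> SRAM slot */
    int      IsBchStart(void);                   /* just reached a branch node? */
    void     LdBchTable(uint32_t entry);         /* load a branch-table record */
    uint32_t GetBchNum(void);                    /* number of branches */

    void prefetch_regular(void)
    {
        if (stop) return;                                   /* Step 1 */
        while (!start_bch && (next_idx + 1) % SIZE != current_idx) {
            if (ChkNxLBA(lba) == cache_lba(current_idx)) {  /* Steps 3-6: next  */
                stop = 1;                                   /* would catch up   */
                return;                                     /* to current: stop */
            }
            next_idx = (next_idx + 1) % SIZE;               /* Step 7 */
            lba = GetNxLBA(lba);                            /* Step 8 */
            Read(next_idx, lba);                            /* Step 9 */
            start_bch = IsBchStart();                       /* Step 10 */
            if (start_bch) {                                /* Steps 11-14 */
                LdBchTable(GetNxLBA(lba));
                idx_bch = 0;
                N_bch = GetBchNum();
                /* Steps 15 onward would record each branch's subsequent LBA,
                 * then alternate among the branches (Steps 20-36 in the text). */
            }
        }
    }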


so as to prevent the size of the branch tables from growing too large to fit in SRAM.

As for the relationship between the number of traces and the average number of branches per branch node in the prediction graph, the more traces were collected for each game, the higher the average number of branches per branch node. As shown in Figure 6, the average number of branches per branch node was less than four, and it grew slowly as traces were added, except for AOE II, because of the randomness of data access in the real-time strategy game AOE II.

Figure 6. Increment of the average branch number.

    4.2 Read Performance and Cache Miss Rate

Figure 7 shows the read performance of each game with different cache sizes, where the prediction graph was derived from ten traces of each game. Our approach achieved the best performance for Raiden, due to its regular access patterns, while the worst performance was observed for AOE II. When the cache size was 2KB, the average read performance of AOE II, TTD, and Raiden with the prefetch mechanism was 27.74 MB/s, 68.68 MB/s, and 94.98 MB/s, respectively. We must point out that all of them were better than the read performance of NOR (23.84 MB/s). Note that the lower the cache miss rate in prefetching was, the higher the read performance. To resolve a cache miss, the data access had to be redirected to NAND so that the missed data could be loaded from NAND into the cache. It was also shown that a 4KB cache was sufficient for the games under consideration, because the read performance became saturated once the cache size reached 4KB.

Figure 7. The read performance with different cache sizes (10 traces).

Figure 8 shows the read performance of the proposed approach for the three games with respect to different numbers of traces, where the cache size was 4KB. The read performance of each game was better than that of NOR even when only two traces were used to generate a prediction graph. For example, the improvement ratios over NOR were 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read performance of Raiden showed almost no further improvement because its cache miss rate was almost zero. For AOE II, the read performance improved slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces, because the last two traces were, in fact, collected while players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and other existing products in Table 3. It shows that the read performance of some specific applications with regular access patterns is even better than that of OneNAND. On the other hand, without our prediction mechanism (i.e., the worst case of a 100% miss rate), requested data have to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and NOR is too large.

Figure 8. The read performance with different numbers of traces (4KB cache).

                 AOE II   TTD     Raiden   Worst case   NOR     OneNAND [25]
    Read (MB/s)  29.57    75.24   94.44    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and a 4KB cache in our approach).

Figure 9 shows the cache miss rates. The miss rate was lower when more traces were used to construct the prediction graph. In the figure, when ten traces were used to generate the prediction graph, the cache miss rate of Raiden was almost zero and that of TTD was lower than 5%, but that of AOE II could not be reduced effectively because of its unpredictable access patterns. Compared with the read performance shown in Figure 8, the read performance of a game was higher when its cache miss rate was lower.

Figure 9. Cache miss rate with different numbers of traces (4KB cache).

    4.3 Main-memory Requirements

The major memory overhead of the prediction mechanism is maintaining the branch table. The more traces were used to create the prediction graph, the bigger the branch table was, because more access patterns were learned. As shown in Figure 10, the table sizes of AOE II, TTD, and Raiden were only 39.83KB, 35.14KB, and 0.43KB, respectively, when ten traces were used for each game. For most embedded systems, the branch table of each game was still small enough to be stored in RAM. However, in this experiment, branch tables were stored in NAND flash memory and loaded into SRAM on demand. Figure 10 shows that the table size of Raiden remained low as the number of traces increased, but the table sizes of AOE II and TTD kept growing because ten traces still could not cover all the access patterns of AOE II and TTD. However, as shown in Figure 9, the cache miss rate of TTD was already very low, so new traces were not needed to improve its cache hit ratio, while the cache miss rate of AOE II still could not be lowered even if more traces were collected.

Figure 10. The size of the branch table.

    4.4 Cache Pollution Rate

The cache pollution rate is the rate of data that are prefetched but not referenced during the program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching activities for unnecessary data might delay the prefetching of useful data. In addition, unnecessary data transfer leads to extra power consumption, which is critical to the designs of embedded systems. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate is defined as follows:

    Cache pollution rate = 1 - N_SRAM2host / N_flash2SRAM
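For instance, with made-up numbers: if the host consumes 1.8MB of data from SRAM while 2.0MB were prefetched from NAND into SRAM, the cache pollution rate is 1 - 1.8/2.0 = 10%; the remaining 0.2MB were transferred without ever being referenced.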

As shown in Figure 11, the cache pollution rate increased as the number of traces for each game increased. That was because more traces led to a larger number of branches per branch node, and only one of the LBA links that follow a given branch node was actually referenced by the program. In summary, there was a trade-off between the prefetching accuracy and the prefetching overhead, even though the cache pollution rates were still lower than 10% in most cases.

Figure 11. The cache pollution rate (4KB cache).

    5. Conclusions

This paper addresses the issue of the replacement of NOR with NAND, motivated by a strong market demand. Different from the on-demand cache mechanisms proposed in previous work, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation to improve the performance of programs stored in NAND. The binary code of programs is prefetched from NAND into the SRAM cache precisely and efficiently according to the prediction graph, which is constructed from the collected access patterns of program executions. A series of experiments was conducted based on realistic traces collected from three different types of popular games: AOE II, TTD, and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR in most cases, that the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and that the percentage of redundant prefetched data was lower than 10% in most cases.

For future research, we shall extend the proposed mechanism to adjust the prediction graph on-line, so as to make the prediction mechanism adaptive to spatial and temporal changes in program executions. We shall also explore the predictability of data prefetching for programs that have high randomness in their access patterns.

    References

[1] Flash Cache Memory Puts Robson in the Middle. Intel.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND Flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives, http://www.microsoft.com/whdc/device/storage/hybrid.mspx. Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187-196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862-868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323-333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191-198, 1972.
[12] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and J. Tauber. Storage Alternatives for Mobile Computers. In Proceedings of the USENIX Operating System Design and Implementation, pages 25-37, 1994.
[13] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the Power-Hungry Disk. In Proceedings of the 1994 Winter USENIX Conference, pages 292-306, 1994.
[14] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, March 2007.
[15] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND Flash eXecute-In-Place. CODES+ISSS, October 2006.
[16] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155-164, Jan 1995.
[17] J.-H. Lee, G.-H. Park, and S.-D. Kim. A New NAND-Type Flash Memory Package with Smart Buffer System for Spatial and Temporal Localities. Journal of Systems Architecture, 51:111-123, 2004.
[18] B. Marsh, F. Douglis, and P. Krishnan. Flash Memory File Caching for Mobile Computers. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 451-460, 1994.
[19] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-Aware Demand Paging on NAND Flash-Based Embedded Storages. ISLPED, August 2004.
[20] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-Assisted Demand Paging for Embedded Systems with Flash Memory. EMSOFT, September 2004.
[21] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems. ICCD, 2003.
[22] Z. Paz. Alternatives to Using NAND Flash White Paper. Technical report, M-Systems, August 2003.
[23] R. A. Quinnell. Meet Different Needs with NAND and NOR. Technical report, TOSHIBA, September 2005.
[24] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash Memory Data Sheet, 2003.
[25] Samsung Electronics. OneNAND Features and Performance, November 2005.
[26] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit OneNAND Flash Memory Data Sheet, September 2006.
[27] M. Santarini. NAND versus NOR. Technical report, EDN, October 2005.
[28] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST Flash Memory Data Sheet, 2005.
[29] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND Flash Memory, 2005.
[30] A. Tal. Two Technologies Compared: NOR vs. NAND White Paper. Technical report, M-Systems, July 2003.
[31] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management for the Flash Translation Layer in Embedded Systems. In IEEE/ACM 2006 International Conference on Computer-Aided Design (ICCAD), November 2006.
[32] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86-97, 1994.
[33] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), pages 146-156, Apr 2003.
[34] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192-197, February 2004.
