
    A NOR Emulation Strategy over NAND Flash Memory

Jian-Hong Lin, Yuan-Hao Chang, Jen-Wei Hsieh, Tei-Wei Kuo, and Cheng-Chih Yang

Graduate Institute of Networking and Multimedia, and
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan 106, R.O.C.
{r94944003, d93944006, ktw}@csie.ntu.edu.tw

Department of Computer Science and Information Engineering
National Chiayi University, Chiayi, Taiwan 60004, R.O.C.
[email protected]

Product Development Firmware Engineering Group, Genesys Logic, Inc.
Taipei, Taiwan 231, R.O.C.
[email protected]

Abstract

This work is motivated by a strong market demand for the replacement of NOR flash memory with NAND flash memory to cut down the cost of many embedded-system designs, such as mobile phones. Different from LRU-related caching or buffering studies, we are interested in prediction-based prefetching based on given execution traces of applications. An implementation strategy is proposed for the storage of the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented based on information extracted from application executions to reduce the performance gap between NAND flash memory and NOR flash memory in reads. With the behavior of a target application extracted from a set of collected traces, we show that data accesses over the emulated NOR flash memory are served effectively by SRAM that acts as a cache for NAND flash memory.

Keywords: NAND, NOR, flash memory, data caching

1. INTRODUCTION

While flash memory remains one of the most popular storage media in embedded systems because of its non-volatility, shock resistance, small size, and low energy consumption, its application has grown much beyond its original design. Based on that original design, NOR flash memory (referred to as NOR for short) is meant to store the binary code of programs, because NOR supports XIP (eXecute-In-Place) and high performance in read operations, while NAND flash memory (referred to as NAND for short) is used as data storage, because NAND has a lower price and higher performance in write/erase operations, compared to NOR [22, 23, 27, 30]. In recent years, the price of NAND has come down much faster than that of NOR; e.g., the price of 8Gbit NAND was lower than one-fifth of that of 8Gbit NOR in the first quarter of 2007 [14]. In order to ultimately reduce the hardware cost, using NAND to replace NOR (motivated by a strong market demand) has become a new trend in embedded-system designs, especially in mobile phones and arcade games, even though read performance is an essential and critical issue in program execution. These observations motivate the objective of this work: to fill the read performance gap between NAND and NOR with limited overhead.

(This work was supported by the National Science Council of Taiwan, R.O.C., under Grants NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.)

The management of flash memory is carried out either by software on a host system (as a raw medium) or by hardware circuits/firmware inside its devices. In the past decade, there have been many research and implementation designs for managing flash-memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 16, 31, 33, 34]. Some exploited efficient management schemes for large-scale storage systems and different architecture designs, e.g., [8, 9, 16, 31, 33, 34]. To improve the performance of hard disks, Intel proposed Robson, which uses NAND as a non-volatile cache of hard disks [1, 4, 7, 32], and Microsoft proposed fast booting mechanisms (called ReadyBoost and ReadyDrive) in Windows Vista to enhance system performance by caching data in flash memory. To use flash memory to replace hard disks and to improve the performance of flash memory, some proposed using battery-backed SRAM to cache recently used data from flash memory with a copy-on-write scheme [12, 13, 18, 32]. For the new research trend of using NAND to store both code and data (in order to replace NOR), C. Park et al. developed a cost-efficient memory architecture that integrates NAND flash memory into the existing memory hierarchy for XIP, and also proposed an application-specific demand-paging mechanism for NAND flash memory with compiler assistance [20, 21]. However, the proposed mechanisms need the source code of applications and specific compilers to lay out the compiled code at fixed memory locations. The imposed constraints make the mechanisms hard to implement.


Although researchers have proposed many excellent methods for improving the performance of flash memory, very few of them work on improving the performance of program execution when NAND is used to store binary code. Only some research proposed caching mechanisms, e.g., [15, 17, 19, 25, 26], to improve the performance of program execution (where code is stored in NAND), without considering the access patterns. Without prefetching data according to the access patterns of the installed programs, the cache miss rate cannot be effectively reduced, and therefore programs cannot be executed efficiently; this matters because most embedded systems only perform some specific functions, especially arcade-game stations.

In this paper, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation. The prediction mechanism collects the access patterns of program execution to construct a prediction graph by adopting the working-set concept [10, 11]. According to the prediction graph, the prediction mechanism prefetches data (/code) into the SRAM cache so as to reduce the cache miss rate. As a result, the performance of program execution is improved, and the read performance gap between NAND and NOR is filled effectively. A series of experiments is then conducted based on realistic traces, which were collected from three different types of popular games: Age of Empire II (AOE II), The Typing of the Death (TTD), and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR by 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and the percentage of redundant prefetched data was lower than 10% in most cases.

The rest of this paper is organized as follows: Section 2 describes the characteristics of flash memory and the research motivation. In Section 3, an efficient prediction mechanism is proposed. Section 4 summarizes the experimental results on read performance, cache miss rate, and extra overheads. Section 5 is the conclusion.

2. Flash-Memory Characteristics and Research Motivation

There are two types of flash memory: NAND and NOR. Each NAND flash memory chip consists of many blocks, and each block consists of a fixed number of pages. A block is the smallest unit for erase operations, while reads and writes are done in pages. A page contains a user area and a spare area, where the user area is for the data storage of a logical block, and the spare area stores ECC and other house-keeping information (i.e., the LBA). Because flash memory is write-once, we do not overwrite data on each update. Instead, data are written to free space, and the old versions of data are invalidated (or considered dead). This update strategy is called out-place update. In other words, any existing data on flash memory cannot be over-written (updated) unless its corresponding block is erased. The pages that store live data and dead data are called valid pages and invalid pages, respectively.

Depending on the design, blocks have different bounds on the number of erases over a block. For example, the typical erase bounds of SLC and MLC2 NAND flash memory are 10,000 and 1,000, respectively.1 Each page of small-block (/large-block) SLC NAND can store 512B (/2KB) of data, and there are 32 (/64) pages per block. The spare area of a small-block (/large-block) SLC NAND page is 16B (/64B). On the other hand, each page of MLC2 NAND can store 2KB, and there are 128 pages per block. Different from NAND flash memory, a byte is the unit for reads and writes over NOR flash memory.

1 There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one bit of information, while each cell of MLCn flash memory contains n bits of information.
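To make the geometry concrete, the following C sketch encodes the large-block SLC NAND layout described above; the macro and type names are ours, for illustration only:

    #include <stdint.h>

    /* Geometry of large-block SLC NAND as described above (names are
     * illustrative). A page pairs a 2KB user area with a 64B spare area. */
    #define PAGE_USER_BYTES   2048u  /* user area: data of a logical block */
    #define PAGE_SPARE_BYTES    64u  /* spare area: ECC, LBA, house-keeping */
    #define PAGES_PER_BLOCK     64u  /* a block is the smallest erase unit */

    typedef struct {
        uint8_t user[PAGE_USER_BYTES];
        uint8_t spare[PAGE_SPARE_BYTES];
    } nand_page_t;

    typedef struct {
        nand_page_t pages[PAGES_PER_BLOCK];  /* erased only as a whole */
    } nand_block_t;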

                                      SLC NOR [28]    SLC NAND [24]
                                                      (large-block, 2KB-page)
    Price (US$/GB) [14]               34.65           6.79
    Read (random access of 8 bits)    40ns            25µs
    Write (random access of 8 bits)   14µs            300µs
    Read (sequential access)          23.842 MB/sec   15.33 MB/sec
    Write (sequential access)         0.068 MB/sec    4.57 MB/sec
    Erase                             0.217 MB/sec    6.25 MB/sec

Table 1. The Typical Characteristics of NOR and NAND.

Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory. (The Host Interface connects over the address and data buses to the Control Logic, which contains the Converter and the Prefetch Procedure; the Control Logic mediates between the SRAM cache and the NAND flash memory, converting between byte accesses on the host side and 512-byte accesses on the NAND side.)

NAND has been widely adopted in the implementation of storage systems because of its advantages in cost and write throughput (for block-oriented access), compared to NOR. 1GB of NOR typically costs US$34.65 in the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will get even wider in the near future. However, due to the high read performance of NOR, as shown in Table 1, and its eXecute-In-Place (XIP) characteristics, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players (PMP). The characteristics of NAND and NOR are summarized in Table 1.

This research is motivated by a strong market demand for the replacement of NOR with NAND in many embedded-system designs. In order to fill the performance gap between NAND and NOR, SRAM is a natural choice for data caching for performance improvement, such as in the simple but effective hardware architecture adopted by OneNAND [15, 25, 26] (please see Figure 1). However, the most critical technical problem behind a successful replacement of NOR with NAND lies in the prediction scheme and its implementation design. Such an observation underlies the objective of this research: the design and implementation of an effective prediction mechanism for applications, with flash-memory characteristics taken into consideration. Because of the stringent resource constraints of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.

    3. An Efficient Prediction Mechanism

    3.1 Overview

In order to fill the performance gap between NAND and NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for communication with the host system via the address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM. The Control Logic should have an intelligent prediction mechanism implemented to improve the system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or the arcade games of amusement-park machines. The design and implementation should also respect the resource constraints of a controller in SRAM capacity and computing power.

There are two major components in the Control Logic: the Converter and the Prefetch Procedure. The Converter emulates NOR access over NAND with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one and four LBAs, respectively [29]. The Prefetch Procedure tries to prefetch data from NAND into SRAM so that the hit rate of NOR accesses over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the access patterns extracted from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we shall define a prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure will then be presented in Section 3.3.
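As an illustration of the Converter's translation step, here is a minimal C sketch assuming 512B LBAs and 2KB large-block pages (four LBAs per page, as noted above); the names are hypothetical:

    #include <stdint.h>

    #define LBA_BYTES       512u   /* one LBA covers a 512B sector */
    #define LBAS_PER_PAGE     4u   /* a 2KB large-block NAND page holds 4 LBAs */

    /* NAND-side coordinates of a NOR-style byte address. */
    typedef struct {
        uint32_t lba;          /* logical block address of the sector */
        uint32_t page;         /* NAND page containing that LBA */
        uint32_t byte_offset;  /* offset of the byte inside the page */
    } nand_addr_t;

    static nand_addr_t convert(uint32_t nor_byte_addr)
    {
        nand_addr_t a;
        a.lba         = nor_byte_addr / LBA_BYTES;
        a.page        = a.lba / LBAS_PER_PAGE;
        a.byte_offset = nor_byte_addr % (LBA_BYTES * LBAS_PER_PAGE);
        return a;
    }

For a byte read, the Converter would fetch page a.page into the SRAM cache (if it is not already there) and return the byte at a.byte_offset.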

    3.2 A Prediction Graph and Implementation

The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBAs, where some LBAs are for instructions and the others are for data. As an application runs multiple times, a virtually complete picture of the possible access patterns of the application execution might appear, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA, where each LBA corresponds to one node in the graph. A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND could be prefetched in an on-time fashion, and there were enough SRAM space for caching, then all data accesses could be done over SRAM.

Figure 2. An example of a prediction graph. (Nodes labeled with LBAs are connected according to the observed access orders; branch nodes, which have more than one successor, are shaded.)

Figure 3. The Storage of a Prediction Graph. (a) Prediction Information: the spare area of a regular node's page stores its subsequent LBA, while the spare area of a branch node's page stores the starting entry address of its branch-table record. (b) A Branch Table: a starting entry holding the branch count (here 3), followed by the subsequent LBAs addr(b1), addr(b2), and addr(b3).

The technical problems are how to save the prediction graph over flash memory with minimized overheads, and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent LBA information of each regular node in the spare area of the corresponding page. This is because the spare area of a page has unused space in current implementations and the specification, and the reading of a page usually comes with the reading of its data and spare areas simultaneously. In this way, accessing the subsequent LBA information of a regular node comes at no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We propose to maintain a branch table to save the subsequent LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBAs of the branch node, and the subsequent LBAs are stored in the entries following the starting entry (please see Figure 3(b)). The branch table can be saved on flash memory. During run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.
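One possible in-memory rendering of this storage scheme is sketched below in C; the paper does not fix field widths or names, so these are assumptions:

    #include <stdint.h>

    /* Prediction info kept in a page's spare area (alongside ECC/LBA).
     * A regular node stores its single subsequent LBA; a branch node
     * stores the starting entry address of its branch-table record. */
    typedef struct {
        uint8_t  is_branch;  /* 0: regular node, 1: branch node */
        uint32_t next;       /* subsequent LBA, or branch-table entry address */
    } spare_pred_t;

    /* A branch-table record: the starting entry holds the branch count n,
     * and the n entries that follow hold the subsequent LBAs. The table
     * lives on flash and is loaded into SRAM, possibly piecewise. */
    typedef uint32_t branch_entry_t;  /* entry 0: count; entries 1..n: LBAs */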

    3.3 A Prefetch Procedure

Figure 4. A Snapshot of the Cache. (A cyclic buffer of page frames between the current index, which points to the page currently accessed, for LBA 2, and the next index, which points to the page just prefetched, for LBA 6.)

The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to efficiently look up a selected page in the cache, we propose to adopt a cyclic buffer in the cache management, and let two indices, current and next, denote the pages currently accessed and prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2: the page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 has just been prefetched (please see Figure 4).
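As a minimal C sketch (our own rendering, not from the paper), the empty and full tests of this cyclic buffer look as follows, with SIZE and the two indices named after the text:

    #define SIZE 8  /* number of page buffers in the SRAM cache (example value) */

    static unsigned current_idx;  /* page currently being accessed */
    static unsigned next_idx;     /* page most recently prefetched */

    /* current = next: no page is prefetched ahead; the caching buffer is empty. */
    static int cache_empty(void) { return current_idx == next_idx; }

    /* current = (next + 1) mod SIZE: prefetching has wrapped around to the slot
     * just behind current; the buffer is full (a Stop Condition). */
    static int cache_full(void) { return current_idx == (next_idx + 1) % SIZE; }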

The prefetch procedure is done in a greedy way. Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to its subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next LBA links on an equal basis, in an alternating way; that is, the prefetch procedure follows each LBA link in turn. For example, the pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3 is prefetched, as shown in Figure 4. The next pages to be prefetched are the pages corresponding to Nodes 1 and 6. In order to properly manage the prefetching cost, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure should also stop temporarily. Take the prediction graph shown in Figure 2 as an example: the prefetch procedure should not prefetch the pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) will determine which branch the application execution will follow. The prefetch procedure should then start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBAs of the target page, if the target page corresponds to a branch node). The above prefetch procedure resumes whenever it has stopped temporarily because of any Stop Condition. Note that all pages cached in the SRAM cache between current and next stay in the cache after the target page (in the following of a branch) is accessed. This is because some of the cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next. In such a case, data are accessed from NAND and loaded into the SRAM cache in an on-demand fashion.

The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, stop and startbch, are used to track the prefetching state: stop and startbch denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, stop and startbch are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, then the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition) or a branch node is reached for the first time. First, next is checked: if it would point to the same page as current does, then the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure advances next, i.e., the location of the next free cache buffer (Step 7). The LBA to prefetch is obtained by looking up the latest prefetched LBA (Step 8), and then the page of that LBA is prefetched (Step 9). After the prefetching of a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch table entries (Step 12) and saves the subsequent LBA of each branch of the branch node (Steps 12-17). Because the prefetched page corresponds to a branch node, the procedure should start prefetching pages by following each branch in an alternating way (Steps 20-36). The loop stops when the cache is full (Step 20), when every next LBA link of the branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idxbch has not yet reached the next branch node (Step 21), the next LBA following that link shall be prefetched (Steps 26-28). Pages are prefetched by following all possible next LBA links on an equal basis, in an alternating way (Step 30).

Note that stop should be set to FALSE when the cache is no longer full or when next and current do not point to the same page.2 Moreover, stop and startbch should both be reset to FALSE when current passes a branch node and meets the target page, or when a cache miss occurs (i.e., current = next). Once stop is set to FALSE, the prefetch procedure is invoked. When startbch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop between Steps 2 and 19. Otherwise, the prefetch procedure continues its previous prefetching job by following the next LBA links of the visited branch node in an alternating way (Steps 20-36).

2 Performance enhancement is possible by deploying more complicated condition settings and actions.


Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idxbch, Nbch, lbabch[], and startbch
Output: null
1:  if stop = TRUE then return;
2:  while startbch = FALSE and (next + 1) mod SIZE ≠ current do
3:    if ChkNxLBA(lba) = cache(current) then
4:      stop ← TRUE;
5:      return;
6:    end
7:    next ← (next + 1) mod SIZE;
8:    lba ← GetNxLBA(lba);
9:    Read(next, lba);
10:   startbch ← IsBchStart();
11:   if startbch = TRUE then
12:     LdBchTable(GetNxLBA(lba));
13:     idxbch ← 0;
14:     Nbch ← GetBchNum();
15:     for i = 0; i
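To make the control flow concrete, the following C sketch mirrors the part of the first loop shown above (Steps 1-14); the helpers keep the pseudocode's own names but are declared here as hypothetical prototypes rather than a real API:

    #include <stdint.h>

    #define SIZE 8  /* page buffers in the SRAM cache */

    /* State mirroring Algorithm 1's inputs (illustrative globals). */
    static int      stop, start_bch;
    static unsigned current_idx, next_idx;
    static uint32_t lba, idx_bch, N_bch;

    /* Hypothetical prototypes mirroring the pseudocode's helpers; their
     * bodies would consult the prediction graph, the branch table, and
     * the NAND driver. */
    uint32_t ChkNxLBA(uint32_t lba);             /* peek at the subsequent LBA */
    uint32_t GetNxLBA(uint32_t lba);             /* fetch the subsequent LBA */
    uint32_t cache_lba(unsigned slot);           /* LBA cached in a buffer slot */
    void     Read(unsigned slot, uint32_t lba);  /* NAND page -> SRAM slot */
    int      IsBchStart(void);                   /* just reached a branch node? */
    void     LdBchTable(uint32_t entry);         /* load a branch-table record */
    uint32_t GetBchNum(void);                    /* number of branches */

    void prefetch_regular(void)
    {
        if (stop) return;                                   /* Step 1 */
        while (!start_bch && (next_idx + 1) % SIZE != current_idx) {
            if (ChkNxLBA(lba) == cache_lba(current_idx)) {  /* Steps 3-6: next  */
                stop = 1;                                   /* would catch up   */
                return;                                     /* to current: stop */
            }
            next_idx = (next_idx + 1) % SIZE;               /* Step 7 */
            lba = GetNxLBA(lba);                            /* Step 8 */
            Read(next_idx, lba);                            /* Step 9 */
            start_bch = IsBchStart();                       /* Step 10 */
            if (start_bch) {                                /* Steps 11-14 */
                LdBchTable(GetNxLBA(lba));
                idx_bch = 0;
                N_bch = GetBchNum();
                /* Steps 15 onward would record each branch's subsequent LBA,
                 * then alternate among the branches (Steps 20-36 in the text). */
            }
        }
    }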


so as to prevent the size of the branch tables from growing too large to fit in SRAM.

As for the relationship between the number of traces and the average number of branches per branch node in the prediction graph, the more traces were collected for each game, the higher the average number of branches per branch node. As shown in Figure 6, the average number of branches per branch node was less than four, and it grew slowly as traces were added, except for AOE II, because of the randomness of data access in the real-time strategy game AOE II.

Figure 6. Increment of the average branch number.

    4.2 Read Performance and Cache Miss Rate

Figure 7 shows the read performance of each game with different cache sizes, where the prediction graph was derived from ten traces of each game. Our approach achieved the best performance for Raiden, due to its regular access patterns, while the worst performance was observed for AOE II. When the cache size was 2KB, the average read performance of AOE II, TTD, and Raiden with the prefetch mechanism was 27.74 MB/s, 68.68 MB/s, and 94.98 MB/s, respectively. We must point out that all of them were better than the read performance of NOR (23.84 MB/s). Note that the lower the cache miss rate in prefetching was, the higher the read performance. To resolve a cache miss, the data access had to be redirected to NAND so that the missed data could be loaded from NAND into the cache. It was also shown that a 4KB cache was sufficient for the games under consideration, because the read performance became saturated once the cache size reached 4KB.

Figure 7. The read performance with different cache sizes (10 traces).

Figure 8 shows the read performance of the proposed approach for the three games with respect to different numbers of traces, where the cache size was 4KB. The read performance of each game was better than that of NOR even when only two traces were used to generate a prediction graph. For example, the improvement ratios over NOR were 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read performance of Raiden showed almost no further improvement because its cache miss rate was almost zero. For AOE II, the read performance improved slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces, because the last two traces were, in fact, collected while players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and other existing products in Table 3. It shows that the read performance of some specific applications with regular access patterns is even better than that of OneNAND. On the other hand, without our prediction mechanism (i.e., the worst case of a 100% miss rate), requested data have to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and NOR is too large.

Figure 8. The read performance with different numbers of traces (4KB cache).

                 AOE II   TTD     Raiden   Worst case   NOR     OneNAND [25]
    Read (MB/s)  29.57    75.24   94.44    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and a 4KB cache in our approach).

Figure 9 shows the cache miss rates. The miss rate was lower when more traces were used to construct the prediction graph. In the figure, when ten traces were used to generate the prediction graph, the cache miss rate of Raiden was almost zero and that of TTD was lower than 5%, but that of AOE II could not be reduced effectively because of its unpredictable access patterns. Compared with the read performance shown in Figure 8, the read performance of a game was higher when its cache miss rate was lower.

Figure 9. Cache miss rate with different numbers of traces (4KB cache).

    4.3 Main-memory Requirements

The major memory overhead of the prediction mechanism is maintaining the branch table. The more traces were used to create the prediction graph, the bigger the branch table was, because more access patterns were learned. As shown in Figure 10, the table sizes of AOE II, TTD, and Raiden were only 39.83KB, 35.14KB, and 0.43KB, respectively, when ten traces were used for each game. For most embedded systems, the branch table of each game was still small enough to be stored in RAM. However, in this experiment, branch tables were stored in NAND flash memory and loaded into SRAM on demand. Figure 10 shows that the table size of Raiden remained low as the number of traces increased, but the table sizes of AOE II and TTD kept growing because ten traces still could not cover all the access patterns of AOE II and TTD. However, as shown in Figure 9, the cache miss rate of TTD was already very low, so new traces were not needed to improve its cache hit ratio, while the cache miss rate of AOE II still could not be lowered even if more traces were collected.

Figure 10. The size of the branch table.

    4.4 Cache Pollution Rate

The cache pollution rate is the rate of data that are prefetched but not referenced during the program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching activities for unnecessary data might delay the prefetching of useful data. In addition, unnecessary data transfer leads to extra power consumption, which is critical to the designs of embedded systems. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate is defined as follows:

    Cache pollution rate = 1 - N_SRAM2host / N_flash2SRAM
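For instance, with made-up numbers: if the host consumes 1.8MB of data from SRAM while 2.0MB were prefetched from NAND into SRAM, the cache pollution rate is 1 - 1.8/2.0 = 10%; the remaining 0.2MB were transferred without ever being referenced.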

As shown in Figure 11, the cache pollution rate increased as the number of traces for each game increased. That was because more traces led to a larger number of branches per branch node, and only one of the LBA links that follow a given branch node was actually referenced by the program. In summary, there was a trade-off between the prefetching accuracy and the prefetching overhead, even though the cache pollution rates were still lower than 10% in most cases.

Figure 11. The cache pollution rate (4KB cache).

    5. Conclusions

This paper addresses the issue of the replacement of NOR with NAND, motivated by a strong market demand. Different from the on-demand cache mechanisms proposed in previous work, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation to improve the performance of programs stored in NAND. The binary code of programs is prefetched from NAND into the SRAM cache precisely and efficiently according to the prediction graph, which is constructed from the collected access patterns of program executions. A series of experiments was conducted based on realistic traces collected from three different types of popular games: AOE II, TTD, and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR in most cases, that the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and that the percentage of redundant prefetched data was lower than 10% in most cases.

For future research, we shall extend the proposed mechanism to adjust the prediction graph on-line, so as to make the prediction mechanism adaptive to spatial and temporal changes in program executions. We shall also explore the predictability of data prefetching for programs that have high randomness in their access patterns.

    References

[1] Flash Cache Memory Puts Robson in the Middle. Intel.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND Flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives, http://www.microsoft.com/whdc/device/storage/hybrid.mspx. Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187-196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862-868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323-333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191-198, 1972.
[12] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and J. Tauber. Storage Alternatives for Mobile Computers. In Proceedings of the USENIX Operating System Design and Implementation, pages 25-37, 1994.
[13] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the Power-Hungry Disk. In Proceedings of the 1994 Winter USENIX Conference, pages 292-306, 1994.
[14] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, March 2007.
[15] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND Flash eXecute-In-Place. CODES+ISSS, October 2006.
[16] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155-164, Jan 1995.
[17] J.-H. Lee, G.-H. Park, and S.-D. Kim. A New NAND-Type Flash Memory Package with Smart Buffer System for Spatial and Temporal Localities. Journal of Systems Architecture, 51:111-123, 2004.
[18] B. Marsh, F. Douglis, and P. Krishnan. Flash Memory File Caching for Mobile Computers. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 451-460, 1994.
[19] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-Aware Demand Paging on NAND Flash-Based Embedded Storages. ISLPED, August 2004.
[20] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-Assisted Demand Paging for Embedded Systems with Flash Memory. EMSOFT, September 2004.
[21] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems. ICCD, 2003.
[22] Z. Paz. Alternatives to Using NAND Flash White Paper. Technical report, M-Systems, August 2003.
[23] R. A. Quinnell. Meet Different Needs with NAND and NOR. Technical report, TOSHIBA, September 2005.
[24] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash Memory Data Sheet, 2003.
[25] Samsung Electronics. OneNAND Features and Performance, November 2005.
[26] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit OneNAND Flash Memory Data Sheet, September 2006.
[27] M. Santarini. NAND versus NOR. Technical report, EDN, October 2005.
[28] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST Flash Memory Data Sheet, 2005.
[29] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND Flash Memory, 2005.
[30] A. Tal. Two Technologies Compared: NOR vs. NAND White Paper. Technical report, M-Systems, July 2003.
[31] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management for the Flash Translation Layer in Embedded Systems. In IEEE/ACM 2006 International Conference on Computer-Aided Design (ICCAD), November 2006.
[32] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86-97, 1994.
[33] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), pages 146-156, Apr 2003.
[34] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192-197, February 2004.
