8/14/2019 A_NOR_Emulation_Strategy_over_NAND_Flash_Memory.pdf
A NOR Emulation Strategy over NAND Flash Memory
Jian-Hong Lin Yuan-Hao Chang Jen-Wei Hsieh Tei-Wei Kuo Cheng-Chih Yang
Graduate Institute of Networking and
Multimedia
Department of Computer Science and
Information Engineering
National Taiwan University, Taipei,
Taiwan 106, R.O.C.
{r94944003, d93944006, ktw}@csie.ntu.edu.tw
Department of Computer Science and
Information Engineering
National Chiayi University, Chiayi,
Taiwan 60004, R.O.C.
Product Development Firmware
Engineering Group, Genesys Logic,
Inc., Taipei, Taiwan 231, R.O.C.
Abstract
This work is motivated by a strong market demand in the re-
placement of NOR flash memory with NAND flash memory to
cut down the cost in many embedded-system designs, such as
mobile phones. Different from LRU-related caching or buffer-
ing studies, we are interested in prediction-based prefetching
based on given execution traces of applications. An implementation strategy is proposed for the storage of the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented, based on information extracted from application executions, to reduce the read-performance gap between NAND flash memory and NOR flash memory. With the behavior of a target application extracted from a set of collected traces, we show that data accesses to the emulated NOR flash memory are served effectively from the SRAM that serves as a cache for NAND flash memory.
Keywords: NAND, NOR, flash memory, data caching
1. INTRODUCTION
While flash memory remains one of the most popular storage media in embedded systems because of its non-volatility, shock resistance, small size, and low energy consumption, its application has grown much beyond its original design. Based on
its original design, NOR flash memory (referred to as NOR for
short) is designed to store binary code of programs because
NOR supports XIP (eXecute-In-Place) and high performance
in read operations, while NAND flash memory (referred to as
NAND for short) is used as a data storage because NAND has
lower price and higher performance in write/erase operations,
compared to NOR [22, 23, 27, 30]. In recent years, the price of NAND has dropped much faster than that of NOR; e.g., the price of 8Gbit NAND was lower than one-fifth of that of 8Gbit NOR
Supported by the National Science Council of Taiwan, R.O.C., under Grants NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.
in the first quarter of year 2007 [14]. In order to reduce the
hardware cost ultimately, using NAND to replace NOR (mo-
tivated by a strong market demand) becomes a new trend in
embedded-system designs, especially on mobile phones and
arcade games, even though read performance is an essential
and critical issue in program execution. These observations motivate the objective of this work: the exploration of how to fill the read-performance gap between NAND and NOR with limited overhead.
The management of flash memory is carried out by either
software on a host system (as a raw medium) or hardware cir-
cuits/firmware inside its devices. In the past decade, there were
a lot of research and implementation designs to manage flash-
memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 16, 31, 33, 34].
Some exploited efficient management schemes for large-scale
storage systems and different architecture designs, e.g., [8, 9,
16, 31, 33, 34]. To improve the performance of hard disks, Intel
proposed Robson to use NAND as a non-volatile cache of hard
disks [1, 4, 7, 32] and Microsoft proposed a fast booting mech-
anism (called ReadyBoost and ReadyDrive) in Windows Vista
to enhance system performance by caching data in flash mem-
ory. To use flash memory to replace hard disks and to improve
the performance of flash memory, some proposed to use battery-backed SRAM to cache recently used data from flash memory
with a copy-on-write scheme [12, 13, 18, 32]. For the new re-
search trend of using NAND to store both code and data (in
order to replace NOR), C. Park et al. developed a cost-efficient
memory architecture integrating NAND flash memory into ex-
isting memory hierarchy for XIP and also proposed an applica-
tion specific demand paging mechanism in NAND flash mem-
ory with compiler assistance [20, 21]. However, the proposed
mechanisms need the source code of applications and specific
compilers to lay out the compiled code at fixed memory locations. The imposed constraints make the mechanisms hard to
implement.
Although researchers have proposed many excellent meth-
ods in improving the performance of flash memory, very few
of them work on improving the performance of program ex-
ecution when using NAND to store binary code. Some research proposed caching mechanisms, e.g., [15, 17, 19, 25, 26], to improve the performance of program execution (where code is stored in NAND) without considering the access patterns. Because most embedded systems only perform some specific functions (especially arcade-game stations), without prefetching data according to the access patterns of the installed programs, the cache miss rate cannot be effectively reduced, and programs therefore cannot be executed efficiently.
In this paper, we propose an efficient prediction mechanism
with limited memory-space requirements and an efficient im-
plementation. The prediction mechanism collects the access
patterns of program execution to construct a prediction graph
by adopting the working-set concept [10, 11]. According to
the prediction graph, the prediction mechanism prefetches data
(/code) to the SRAM cache, so as to reduce the cache miss rate. Therefore, the performance of the program execution is
improved and the read performance gap between NAND and
NOR is filled up effectively. A series of experiments is then
conducted based on some realistic traces, which are collected
from three different types of popular games Age of Empire II
(AOE II), The Typing of the Death (TTD), and Raiden.
We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR by 24%, 216%, and 298% in AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and the percentage of redundant prefetched data was lower than 10% in most cases.
The rest of this paper is organized as follows: Section 2 de-
scribes the characteristics of flash memory and research moti-
vation. In Section 3, an efficient prediction mechanism is pro-
posed. Section 4 summarizes the experimental results on read
performance, cache miss rate, and extra overheads. Section 5
is the conclusion.
2. Flash-Memory Characteristics and
Research Motivation
There are two types of flash memory: NAND and NOR. Each
NAND flash memory chip consists of many blocks, and each block is of a fixed number of pages. A block is the smallest
unit for erase operations, while reads and writes are done in
pages. A page contains a user area and a spare area, where
the user area is for the data storage of a logical block, and the
spare area stores ECC and other house-keeping information
(i.e., LBA). Because flash memory is write-once, we do not
overwrite data on each update. Instead, data are written to
free space, and the old versions of data are invalidated (or
considered as dead). The update strategy is called out-place
update. In other words, any existing data on flash memory
could not be over-written (updated) unless its corresponding
block is erased. The pages that store live data and dead data are called valid pages and invalid pages, respectively.
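The out-place update rule described above can be illustrated with a small sketch. This is purely illustrative Python (the class and field names are hypothetical, not from any real controller): an update never overwrites in place; the new version goes to a free page and the old page becomes invalid.

```python
# Illustrative sketch of out-place update: writes go to free pages,
# old versions are invalidated ("dead"), and erase reclaims whole blocks.

PAGES_PER_BLOCK = 32  # small-block SLC layout from the text

class Nand:
    def __init__(self, n_blocks):
        # each page is None (free), ("live", lba), or ("dead", lba)
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(n_blocks)]
        self.map = {}   # lba -> (block, page) of the live copy
        self.free = [(b, p) for b in range(n_blocks)
                            for p in range(PAGES_PER_BLOCK)]

    def write(self, lba):
        if lba in self.map:                  # invalidate the old version
            b, p = self.map[lba]
            self.blocks[b][p] = ("dead", lba)
        b, p = self.free.pop(0)              # out-place: take a free page
        self.blocks[b][p] = ("live", lba)
        self.map[lba] = (b, p)

    def erase(self, b):                      # erase works on whole blocks
        self.blocks[b] = [None] * PAGES_PER_BLOCK
        self.free.extend((b, p) for p in range(PAGES_PER_BLOCK))

nand = Nand(2)
nand.write(7)
nand.write(7)   # update: the first copy turns dead, the new one is live
```

After the second write, the first page holds the dead copy and the mapping points at the new live page, matching the valid/invalid page terminology above.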
Depending on the designs, blocks have different bounds
on the number of erases over a block. For example, the typical erase bounds of SLC and MLC2 NAND flash memory are 10,000 and 1,000, respectively.¹ Each page of small-block (/large-block) SLC NAND can store 512B (/2KB) of data, and there are 32 (/64) pages per block. The spare area of a small-block (/large-block) SLC NAND page is 16B (/64B). On the other hand, each page of MLC2 NAND can store 2KB,
and there are 128 pages per block. Different from NAND flash
memory, a byte is the unit for reads and writes over NOR flash
memory.
                                   SLC NOR [28]    SLC NAND [24] (large-block, 2KB page)
Price (US$/GB) [14]                34.65           6.79
Read (random access of 8 bits)     40 ns           25 μs
Write (random access of 8 bits)    14 μs           300 μs
Read (sequential access)           23.842 MB/s     15.33 MB/s
Write (sequential access)          0.068 MB/s      4.57 MB/s
Erase                              0.217 MB/s      6.25 MB/s
Table 1. The Typical Characteristics of NOR and NAND.
[Figure: the Control Logic, containing a Converter and a Prefetch Procedure, sits between the Host Interface and the NAND flash memory with an SRAM cache; the host side uses byte-addressed address/data buses, while NAND is accessed in 512-byte units.]
Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory.
NAND has been widely adopted in the implementation of
storage systems because of its advantages in cost and write
throughput (for block-oriented access), compared to NOR.
1GB of NOR typically costs US$34.65 in the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will get even wider in the near future. However, due to the high performance of NOR in reads,
as shown in Table 1, and its eXecute-In-Place (XIP) char-
acteristics, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players
(PMP). The characteristics of NAND and NOR are summa-
rized in Table 1.
This research is motivated by a strong market demand in
the replacement of NOR with NAND in many embedded-
system designs. In order to fill up the performance gap be-
tween NAND and NOR, SRAM is a natural choice for data caching in performance improvement, such as that in the simple but effective hardware architecture adopted by OneNAND
1 There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one-bit information, while each cell of MLCn flash memory contains n-bit information.
[15, 25, 26]. (Please see Figure 1.) However, the most criti-
cal technical problem behind the success in the replacement
of NOR with NAND is the prediction scheme and its implementation design. Such an observation underlies the objective of this research: the design and implementation of an effective prediction mechanism for applications, with the considerations of flash-memory characteristics. Because of the stringent resource support of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.
3. An Efficient Prediction Mechanism
3.1 Overview
In order to fill up the performance gap between NAND and
NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for the communication with the host system via the address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM. The Control Logic should have an intelligent prediction mechanism implemented to improve the system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. The design and implementation should also consider the resource constraints of a controller in SRAM capacity and computing power.
There are two major components in the Control Logic: The Converter emulates NOR access over NAND with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one/four LBAs, respectively [29]. The Prefetch Procedure tries to prefetch data from NAND to SRAM so that the hit rate of the NOR access over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the extracted access patterns from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we shall define a prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure will then be presented in Section 3.3.
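The Converter's translation step can be sketched as follows. This is an illustrative Python fragment under the 512-byte-sector convention stated above; the function names are assumptions for the sketch, not the paper's interface.

```python
# Illustrative sketch of the Converter's address translation: a NOR byte
# address becomes an LBA plus a byte offset, and an LBA maps into a NAND
# page that holds one (512B page) or four (2KB page) sectors.

SECTOR = 512  # bytes per LBA (sector)

def byte_to_lba(byte_addr):
    """NOR byte address -> (LBA, offset within the 512B sector)."""
    return byte_addr // SECTOR, byte_addr % SECTOR

def lba_to_page(lba, page_size=2048):
    """LBA -> (NAND page number, sector slot within the page)."""
    sectors_per_page = page_size // SECTOR
    return lba // sectors_per_page, lba % sectors_per_page

lba, off = byte_to_lba(0x1234)   # byte address 4660
page, slot = lba_to_page(lba)
```

With a large-block (2KB-page) device, byte address 0x1234 falls into LBA 9 at offset 52, which is the second sector of NAND page 2, consistent with the one/four LBAs-per-page note above.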
3.2 A Prediction Graph and Implementation
The access pattern of an application execution over NOR (or
NAND) consists of a sequence of LBAs, where some LBAs
are for instructions, and the others are for data. As an application runs multiple times, a virtually complete picture of the possible access patterns of an application execution might appear, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA, where each LBA corresponds to one node in the graph.

[Figure: a directed graph over nodes 0–13, in which the shaded nodes have more than one successor.]
Figure 2. An example of a prediction graph.

A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND could be prefetched in an on-time fashion, and there is enough SRAM space for caching, then all data accesses can be done over SRAM.
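The construction of such a graph from collected traces can be sketched as follows (an illustrative helper, not the paper's implementation): each LBA is mapped to the set of LBAs observed to follow it, and any LBA with more than one distinct successor becomes a branch node.

```python
# Illustrative sketch: merge several collected LBA traces into successor
# lists; an LBA with more than one distinct successor is a branch node.

def merge_traces(traces):
    successors = {}
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            lst = successors.setdefault(cur, [])
            if nxt not in lst:
                lst.append(nxt)
    return successors

def branch_nodes(successors):
    return sorted(lba for lba, nxt in successors.items() if len(nxt) > 1)

# two runs of the same application that diverge after LBA 3
graph = merge_traces([[0, 2, 3, 4, 6], [0, 2, 3, 5, 1]])
```

Here the two runs agree up to LBA 3 and then diverge, so node 3 is the only branch node, mirroring the shaded nodes of Figure 2.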
[Figure: (a) a regular-node page stores data plus prediction information in its spare area, while a branch-node page stores a branch-table address; (b) a branch-table entry holds the branch count (e.g., 3) followed by addr(b1), addr(b2), and addr(b3).]
Figure 3. The Storage of a Prediction Graph.
The technical problems are how to save the prediction graph over flash memory with minimized overheads and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent-LBA information of each regular node in the spare area of the corresponding page. It is because the spare area of a page in current implementations and the specification has unused space, and the reading of a page usually comes with the reading of its data and spare areas simultaneously. In such a way, the accessing of the subsequent-LBA information of a regular node comes at no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We propose to maintain a branch table to save the subsequent-LBA information of all
branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBAs of the branch node, and the subsequent LBAs are stored in the entries following the starting entry (Please see Figure 3(b)). The branch table can be saved on flash memory. During run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.
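The storage layout above can be sketched as follows. This is illustrative Python; the spare-area and branch-table encodings are simplified assumptions, not the on-flash format.

```python
# Illustrative sketch of Section 3.2's storage scheme: a regular node keeps
# its single successor LBA in the page's spare area, while a branch node
# keeps an index into a branch table whose first entry is the branch count
# followed by the successor LBAs.

def build_prediction_store(successors):
    """successors: dict mapping an LBA to its list of successor LBAs."""
    spare = {}          # lba -> ("next", lba) or ("branch", table_index)
    branch_table = []   # flat: [count, lba1, ..., count, lba1, ...]
    for lba, nxt in successors.items():
        if len(nxt) == 1:
            spare[lba] = ("next", nxt[0])     # read for free with the page
        else:
            spare[lba] = ("branch", len(branch_table))
            branch_table.append(len(nxt))     # starting entry: branch count
            branch_table.extend(nxt)          # then the subsequent LBAs
    return spare, branch_table

# node 3 is a branch node with successors 4 and 5 (cf. Figure 2)
spare, table = build_prediction_store({2: [3], 3: [4, 5], 4: [6]})
```

The branch table here matches Figure 3(b): its starting entry is the branch count, followed by the subsequent LBAs, and the branch-node page's spare area only stores the table index.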
3.3 A Prefetch Procedure
[Figure: a cyclic cache buffer holding LBAs 2, 3, 4, 5, 11, ..., 6, ...; current points at the page of Node 2 and next at the page of Node 6.]
Figure 4. A Snapshot of the Cache.
The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to efficiently look up a selected page in the cache, we propose to adopt a cyclic buffer in the cache management, and let two indices, current and next, denote the pages currently accessed and prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2. The page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 is just prefetched (Please see Figure 4).
The prefetch procedure is done in a greedy way: Let P1
be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to the subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next-LBA links on an equal basis and in an alternating way. That is, the prefetch procedure can follow each LBA link in turn. For example, pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3 is prefetched, as shown in Figure 4. The next pages to be prefetched are the pages corresponding to Nodes 1 and 6. In order to properly manage the prefetching cost, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure should also stop temporarily.
Take the prediction graph shown in Figure 2 as an example. The prefetch procedure should not prefetch pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) will determine which branch the application execution will follow. The prefetch procedure should start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBAs of the target page if the target page corresponds to a branch node). The above prefetch procedure shall repeat again if it stops temporarily because of any Stop Condition. Note that all pages cached in the SRAM cache between current and next stay in the cache after the target page (in the following of a branch) is accessed. It is because some of the cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next. In such a case, data are accessed from NAND and loaded onto the SRAM cache in an on-demand fashion.
The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, i.e., stop and startbch, are used to track the prefetching state: stop and startbch denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, stop and startbch are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, then the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition), or we reach a branch node for the first time. First, next is checked: if it would point to the same page as current does, then the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure increases next, i.e., the location of the next free cache buffer (Step 7). The LBA is obtained by looking up the latest prefetched LBA (Step 8), and then the page of the LBA is prefetched (Step 9). After the prefetching of a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch-table entries (Step 12) and saves the subsequent LBA of each branch of the branch node (Steps 12-17). Because the prefetched page corresponds to a branch node, the procedure should start prefetching pages by following each branch in an alternating way (Steps 20-36). The loop will stop when the cache is full (Step 20), when every next-LBA link of a branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idxbch has not yet reached the next branch node (Step 21), the next LBA following the link shall be prefetched (Steps 26-28). Pages are prefetched by following all possible next-LBA links on an equal basis and in an alternating way (Step 30).
Note that stop should be set to FALSE when the cache is no longer full or when next and current do not point to the same page². Moreover, stop and startbch should both be reset to FALSE when current passes a branch node and meets the target page or when a cache miss occurs (i.e., current = next). Once stop is set to FALSE, the prefetch procedure is invoked. When startbch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop between Steps 2 and 19. Otherwise, the prefetch procedure will continue its previous prefetching job by following the next-LBA links of a visited branch node in an alternating way (Steps 20-36).
2 Performance enhancement is possible by deploying more complicated condition settings and actions.
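The cyclic-buffer bookkeeping and the greedy chain-following described above can be sketched as follows. This is a deliberately simplified single-link illustration in Python, not the full Algorithm 1 (the alternating multi-branch loop is omitted), and the names are hypothetical.

```python
# Simplified sketch of the cyclic prefetch buffer: `current` indexes the
# page being accessed and `nxt` the page last prefetched. The buffer is
# empty when current == nxt and full when current == (nxt + 1) % SIZE.
# The greedy loop follows a single successor chain and stops at a branch
# node or when the buffer fills (two of the Stop Conditions).

SIZE = 4  # number of page buffers in the SRAM cache

def prefetch_chain(lba, spare, current, nxt, cache):
    """spare maps an LBA to ("next", lba) or ("branch", ...), as in Sec. 3.2."""
    while (nxt + 1) % SIZE != current:        # Stop Condition: buffer full
        kind, info = spare.get(lba, (None, None))
        if kind != "next":                    # Stop Condition: branch node
            break
        lba = info
        nxt = (nxt + 1) % SIZE
        cache[nxt] = lba                      # corresponds to Read(next, lba)
    return nxt

cache = [None] * SIZE
# starting from LBA 2, whose successor 3 is a branch node
end = prefetch_chain(2, {2: ("next", 3), 3: ("branch", 0)}, 0, 0, cache)
```

The chain prefetches LBA 3 into the next free buffer and then stops because node 3 is a branch node, leaving the alternating branch-following to the second loop of the algorithm.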
Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idxbch, Nbch, lbabch[], and startbch
Output: null
 1: if stop = TRUE then return;
 2: while startbch = FALSE and (next + 1) mod SIZE ≠ current do
 3:     if ChkNxLBA(lba) = cache(current) then
 4:         stop ← TRUE;
 5:         return;
 6:     end
 7:     next ← (next + 1) mod SIZE;
 8:     lba ← GetNxLBA(lba);
 9:     Read(next, lba);
10:     startbch ← IsBchStart();
11:     if startbch = TRUE then
12:         LdBchTable(GetNxLBA(lba));
13:         idxbch ← 0;
14:         Nbch ← GetBchNum();
15:         for i = 0; i < Nbch; i = i + 1 do
so as to prevent the size of the branch tables from growing too large to fit in SRAM.
As for the relationship between the number of traces and the average number of branches of each branch node in the prediction graph, the more traces were collected for each game, the higher the average number of branches per branch node. As shown in Figure 6, the average number of branches of each branch node was less than four, and it grew slowly, except for AOE II, because of the randomness of data access in the real-time strategy game AOE II.
Figure 6. Increment of the average branch number.
4.2 Read Performance and Cache Miss Rate
Figure 7 shows the read performance of each game with different cache sizes, where the prediction graph was derived based on ten traces of each game. Our approach achieved the best performance for Raiden, due to its regular access patterns, but the worst performance was observed for AOE II. When the cache size was 2KB, the average read performance of AOE II, TTD, and Raiden with the prefetch mechanism was 27.74 MB/s, 68.68 MB/s, and 94.98 MB/s, respectively. We must point out that all of them were better than the read performance of NOR (23.84 MB/s). Note that the lower the cache miss rate in prefetching was, the higher the read performance was. To resolve a cache miss, data accesses to NOR had to be redirected to NAND so that missed data could be loaded from NAND to the cache. It was also shown that a 4KB cache was sufficient for the games under consideration, because the read performance became saturated when the cache size was no less than 4KB.
Figure 8 shows the read performance of the proposed ap-
proach for the three games with respect to different numbers of
traces, where the cache size was 4KB. The read performance
of each game was better than that of NOR even when only
two traces were used to generate a prediction graph. For exam-
ple, the improvement ratios over AOE II, TTD, and Raiden
were 24%, 216%, and 298%, respectively, when the num-
ber of traces for each game was 10 and the size of the cache was 4KB. When there were more than two traces, the read
performance of Raiden had almost no improvement because the cache miss rate was almost zero.
Figure 7. The read performance with different cache sizes (10 traces).
For AOE II, the read performance was improved slowly when the number of col-
was highly random. The increasing in the number of collected
traces for the prediction graph could not reduce the cache miss
rate significantly. For TTD, good improvement was observed
with the inclusion of two more traces. It was because the last
two traces were, in fact, collected during the advance of players
in the game by clearing more stages. Furthermore, we summa-
rize the read performance of the proposed scheme and other ex-
isting products in Table 3. It shows that the read performance
of some specific applications with regular access patterns is
even better than that of OneNAND. On the other hand, without our prediction mechanism (i.e., the worst case of a 100% miss rate), requested data have to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read-performance gap between the emulated NOR and NOR is too large.
Figure 8. The read performance with different numbers of
traces (4KB cache)
Figure 9 shows the cache miss rate. The miss rate was lower when more traces were used to construct the prediction graph. In the figure, when ten traces were used to
              AOE II   TTD     Raiden   Worst case   NOR     OneNAND [25]
Read (MB/s)   29.57    75.24   94.44    8.76         23.84   68
Table 3. Comparison of the read performance (10 traces and 4KB cache in our approach).
generate the prediction graph, the cache miss rate of Raiden was almost zero and that of TTD was lower than 5%, but that of AOE II could not be reduced effectively because of its unpredictable access patterns. Compared to the read performance shown in Figure 8, the read performance of a game was higher when its cache miss rate was lower.
Figure 9. Cache miss rate with different numbers of traces (with a 4KB cache).
4.3 Main-memory Requirements
The major memory overhead of the prediction mechanism was the maintenance of the branch table. The more traces were used to create the prediction graph, the bigger the branch table was. That was because more access patterns were learned. As shown in Figure 10, the table sizes of AOE II, TTD, and Raiden were only 39.83KB, 35.14KB, and 0.43KB, respectively, when ten traces were used to construct the prediction graph of each game. In most embedded systems, the branch table of each game was still small enough to be stored in RAM. However, in this experiment, branch tables were stored in NAND flash memory and loaded into SRAM on demand. Figure 10 shows that the table size of Raiden stayed low when the number of traces increased, but the table sizes of AOE II and TTD kept growing, because ten traces still couldn't cover all the access patterns of AOE II and TTD. However, as shown in Figure 9, the cache miss rate of TTD was already very low and did not require new traces to improve the cache hit ratio, while the cache miss rate of AOE II still could not be lowered even if more traces were collected.
Figure 10. The size of the branch table.

4.4 Cache Pollution Rate
Cache pollution rate is the rate of data that are prefetched but not referenced during the program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching activities of unnecessary data might delay the prefetching of useful data. In addition, unnecessary data transfer leads to extra power consumption, which is critical to the designs of embedded systems. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate was defined as follows:

Cache pollution rate = 1 - (N_SRAM2host / N_flash2SRAM)
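The definition above can be exercised with a tiny sketch (the numbers are illustrative, not the paper's measurements): prefetching 1000 units from NAND to SRAM of which the host consumes only 900 gives a 10% pollution rate.

```python
# The cache pollution rate, computed directly from the definition: the
# fraction of data moved from NAND to SRAM that the host never consumed.

def cache_pollution_rate(n_sram_to_host, n_flash_to_sram):
    return 1 - n_sram_to_host / n_flash_to_sram

rate = cache_pollution_rate(900, 1000)  # 10% of prefetched data unused
```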
As shown in Figure 11, the cache pollution rate increased
as the number of traces for each game increased. That was
because more traces led to a larger number of branches per
branch node, and only one of the LBA links that follow a given branch node was actually referenced by the program.
In summary, there was a trade-off between the prefetching
accuracy and the prefetching overhead, even though the cache
pollution rates were still lower than 10% in most cases.
Figure 11. The cache pollution rate (4KB cache).
5. Conclusions
This paper addresses the issue of the replacement of NOR with NAND, motivated by a strong market demand. Different from
on-demand cache mechanisms proposed in previous work,
we propose an efficient prediction mechanism with limited
memory-space requirement and an efficient implementation to
improve the performance of programs stored in NAND. Binary
code of programs is prefetched from NAND to SRAM cache
precisely and efficiently according to the prediction graph that
is constructed by the collected access patterns of program ex-ecution. A series of experiments is conducted based on re-
alistic traces collected from three different types of popular
games AOE II, TTD, and Raiden. We show that the
average read performance of NAND with the proposed pre-
diction mechanism could be better than that of NOR in most
cases, the cache miss rate was 35.27%, 4.21%, and 0.06% for
AOE II, TTD, and Raiden, respectively, and the percentage of
redundant prefetched data was lower than 10% in most cases.
Fur future research, we shall further extend the proposed
mechanism to adjust the prediction graph on-line to make the
prediction mechanism adaptive to any special and temporal
changes of program executions. We shall also explore the pred-icability of data prefetching for programs that have high ran-
domness in terms of access patterns.
References
[1] Flash Cache Memory Puts Robson in the Middle. Intel.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical
report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel
Corporation.
[5] Flash-memory Translation Layer for NAND flash (NFTL). M-
Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification,
http://developer.intel.com/. Technical report, Intel Corporation,
Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives,
http://www.microsoft.com/whdc/device/storage/hybrid.mspx.
Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture
for Flash Memory Storage Systems of Embedded Systems. In
IEEE Real-Time and Embedded Technology and Applications
Symposium, pages 187–196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for
Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862–868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior.
Communications of the ACM, 11(5):323–333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set
Model. Communications of the ACM, 15(3):191–198, 1972.
[12] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and
J. Tauber. Storage Alternatives for Mobile Computers. In
Proceedings of the USENIX Operating System Design and
Implementation, pages 25–37, 1994.
[13] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the power-
hungry disk. In Proceedings of the 1994 Winter USENIX
Conference, pages 292–306, 1994.
[14] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, March 2007.
[15] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and
N. Chang. Demand Paging for OneNAND™ Flash eXecute-
In-Place. CODES+ISSS, October 2006.
[16] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory
Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155–164, Jan 1995.
[17] J.-H. Lee, G.-H. Park, and S.-D. Kim. A new NAND-type flash memory package with smart buffer system for spatial and temporal localities. Journal of Systems Architecture, 51:111–123, 2004.
[18] B. Marsh, F. Douglis, and P. Krishnan. Flash Memory File
Caching for Mobile Computers. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 451–460, 1994.
[19] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-aware
demand paging on nand flash-based embedded storages. ISLPED,
August 2004.
[20] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-
assisted demand paging for embedded systems with flash memory.
EMSOFT, September 2004.
[21] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-efficient
memory architecture design of nand flash memory embedded
systems. ICCD, 2003.
[22] Z. Paz. Alternatives to Using NAND Flash White Paper.
Technical report, M-Systems, August 2003.
[23] R. A. Quinnell. Meet Different Needs with NAND and NOR.
Technical report, TOSHIBA, September 2005.
[24] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash
Memory Data Sheet, 2003.
[25] Samsung Electronics. OneNAND Features and Performance, 11
2005.
[26] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit
OneNAND Flash Memory Data Sheet, 09 2006.
[27] M. Santarini. NAND versus NOR. Technical report, EDN,
October 2005.
[28] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST
Flash Memory Data Sheet, 2005.
[29] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND
Flash Memory, 2005.
[30] A. Tal. Two Technologies Compared: NOR vs. NAND White
Paper. Technical report, M-Systems, July 2003.
[31] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management
for the Flash Translation Layer in Embedded Systems. In
IEEE/ACM 2006 International Conference on Computer-Aided
Design (ICCAD), November 2006.
[32] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International
Conference on Architectural Support for Programming Languages
and Operating Systems, pages 86–97, 1994.
[33] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt,
and W. Litwin. Reliability Mechanisms for Very Large Storage
Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard
Conference on Mass Storage Systems and Technologies (MSS'03), pages 146–156, Apr 2003.
[34] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer
for SmartMedia Card Systems. IEEE Transactions on Consumer
Electronics, 50(1):192–197, February 2004.