FAST: Quick Application Launch on Solid-State Drives...FAST: Quick Application Launch on Solid-State...
Transcript of FAST: Quick Application Launch on Solid-State Drives...FAST: Quick Application Launch on Solid-State...
FAST: Quick Application Launch on Solid-State Drives
Yongsoo Joo†, Junhee Ryu‡, Sangsoo Park†, and Kang G. Shin†∗
†Ewha Womans University, 11-1 Daehyun-dong Seodaemun-gu, Seoul 120-750, Korea
‡Seoul National University, 599 Kwanak-Gu Kwanak Rd., Seoul 151-744, Korea∗ University of Michigan, 2260 Hayward St., Ann Arbor, MI 48109, USA
Abstract
Application launch performance is of great importance
to system platform developers and vendors as it greatly
affects the degree of users’ satisfaction. The single most
effective way to improve application launch performance
is to replace a hard disk drive (HDD) with a solid state
drive (SSD), which has recently become affordable and
popular. A natural question is then whether or not to
replace the traditional HDD-aware application launchers
with a new SSD-aware optimizer.
We address this question by analyzing the inefficiency
of the HDD-aware application launchers on SSDs and
then proposing a new SSD-aware application prefetching
scheme, called the Fast Application STarter (FAST). The
key idea of FAST is to overlap the computation (CPU)
time with the SSD access (I/O) time during an applica-
tion launch. FAST is composed of a set of user-level
components and system debugging tools provided by the
Linux OS (operating system). In addition, FAST uses a
system-call wrapper to automatically detect application
launches. Hence, FAST can be easily deployed in any
recent Linux versions without kernel recompilation. We
implemented FAST on a desktop PC with a SSD running
Linux 2.6.32 OS and evaluated it by launching a set of
widely-used applications, demonstrating an average of
28% reduction of application launch time as compared
to PC without a prefetcher.
1 Introduction
Application launch performance is one of the impor-
tant metrics for the design or selection of a desktop or
a laptop PC as it critically affects the user-perceived
performance. Unfortunately, application launch perfor-
mance has not kept up with the remarkable progress of
CPU performance that has thus far evolved according to
Moore’s law. As frequently-used or popular applications
get “heavier” (by adding new functions) with each new
release, their launch takes longer even if a new, power-
ful machine equipped with high-speed multi-core CPUs
and several GBs of main memory is used. This undesir-
able trend is known to stem from the poor random access
performance of hard disk drives (HDDs). When an ap-
plication stored in a HDD is launched, up to thousands
of block requests are sent to the HDD, and a significant
portion of its launch time is spent on moving the disk
head to proper track and sector positions, i.e., seek and
rotational latencies. Unfortunately, the HDD seek and
rotational latencies have not been improved much over
the last few decades, especially compared to the CPU
speed improvement. In spite of the various optimizations
proposed to improve the HDD performance in launch-
ing applications, users must often wait tens of seconds
for the completion of launching frequently-used applica-
tions, such as Windows Outlook.
A quick and easy solution to eliminate the HDD’s seek
and rotational latencies during an application launch is to
replace the HDD with a solid state drive (SSD). A SSD
consists of a number of NAND flash memory modules,
and does not use any mechanical parts, unlike disk heads
and arms of a conventional HDD. While the HDD ac-
cess latency—which is the sum of seek and rotational
latencies—ranges up to a few tens of milliseconds (ms),
depending on the seek distance, the SSD shows a rather
uniform access latency of about a few hundred micro-
seconds (us). Replacing a HDD with a SSD is, there-
fore, the single most effective way to improve applica-
tion launch performance.
Until recently, using SSDs as the secondary storage of
desktops or laptops has not been an option for most users
due to the high cost-per-bit of NAND flash memories.
However, the rapid advance of semiconductor technol-
ogy has continuously driven the SSD price down, and at
the end of 2009, the price of an 80 GB SSD has fallen be-
low 300 US dollars. Furthermore, SSDs can be installed
in existing systems without additional hardware or soft-
ware support because they are usually equipped with the
1
same interface as HDDs, and OSes see a SSD as a block
device just like a HDD. Thus, end-users begin to use a
SSD as their system disk to install the OS image and ap-
plications.
Although a SSD can significantly reduce the applica-
tion launch time, it does not give users ultimate satisfac-
tion for all applications. For example, using a SSD re-
duces the launch time of a heavy application from tens of
seconds to several seconds. However, users will soon be-
come used to the SSD launch performance, and will then
want the launch time to be reduced further, just as they
see from light applications. Furthermore, users will keep
on adding functions to applications, making them heav-
ier with each release and their launch time greater. Ac-
cording to a recent report [24], the growth of software is
rapid and limited only by the ability of hardware. These
call for the need to further improve application launch
performance on SSDs.
Unfortunately, most previous optimizers for applica-
tion launch performance are intended for HDDs and have
not accounted for the SSD characteristics. Furthermore,
some of themmay rather be detrimental to SSDs. For ex-
ample, running a disk defragmentation tool on a SSD is
not beneficial at all because changing the physical loca-
tion of data in the SSD does not affect its access latency.
Rather, it generates unnecessary write and erase opera-
tions, thus shortening the SSD’s lifetime.
In view of these, the first step toward SSD-aware op-
timization may be to simply disable the traditional op-
timizers designed for HDDs. For example, Windows 7
disables many functions, such as disk defragmentation,
application prefetch, Superfetch, and Readyboost when
it detects a SSD being used as a system disk [27]. Let’s
consider another example. Linux is equipped with four
disk I/O schedulers: NOOP, anticipatory, deadline, and
completely fair queueing. The NOOP scheduler almost
does nothing to improve HDD access performance, thus
providing the worst performance on a HDD. Surpris-
ingly, it has been reported that NOOP shows better per-
formance than the other three sophisticated schedulers on
a SSD [11].
To the best of our knowledge, this is the first attempt
to focus entirely on improving application launch perfor-
mance on SSDs. Specifically, we propose a new appli-
cation prefetching method, called the Fast Application
STarter (FAST), to improve application launch time on
SSDs. The key idea of FAST is to overlap the compu-
tation (CPU) time with the SSD access (I/O) time dur-
ing each application launch. To achieve this, we monitor
the sequence of block requests in each application, and
launch the application simultaneously with a prefetcher
that generates I/O requests according to the a priorimon-
itored application’s I/O request sequence. FAST consists
of a set of user-level components, a system-call wrap-
per, and system debugging tools provided by the Linux
OS. FAST can be easily deployed in most recent Linux
versions without kernel recompilation. We have imple-
mented and evaluated FAST on a desktop PC with a SSD
running Linux 2.6.32, demonstrating an average of 28%
reduction of application launch time as compared to PC
without a prefetcher.
This paper makes the following contributions:
• Qualitative and quantitative evaluation of the ineffi-
ciency of traditional HDD-aware application launch
optimizers on SSDs;
• Development of a new SSD-aware application
prefetching scheme, called FAST; and
• Implementation and evaluation of FAST, demon-
strating its superiority and deployability.
While FAST can be also applied to HDDs, its per-
formance improvements are only limited to high I/O re-
quirements of application launches on HDDs. We ob-
served that existing application prefetchers outperformed
FAST on HDDs by effectively optimizing disk head
movements, which will be discussed further in Section 5.
The paper is organized as follows. In Section 2, we re-
view other related efforts and discuss their performance
in optimizing application launch on SSDs. Section 3
describes the key idea of FAST and presents an upper
bound for its performance. Section 4 details the imple-
mentation of FAST on the Linux OS, while Section 5
evaluates its performance using various real-world appli-
cations. Section 6 discusses the applicability of FAST to
smartphones and Section 7 compares FAST with tradi-
tional I/O prefetching techniques. We conclude the paper
with Section 8.
2 Background
2.1 Application Launch Optimization
Application-level optimization. Application developers
are usually advised to optimize their applications for fast
startup. For example, they may be advised to postpone
loading non-critical functions or libraries so as to make
applications respond as fast as possible [2, 30]. They
are also advised to reduce the number of symbol reloca-
tions while loading libraries, and to use dynamic library
loading. There have been numerous case studies—based
on in-depth analyses and manual optimizations—of vari-
ous target applications/platforms, such as Linux desktop
suite platform [8], a digital TV [17], and a digital still
camera [33]. However, such an approach requires the
experts’ manual optimizations for each and every appli-
cation. Hence, it is economically infeasible for general-
purpose systems with many (dynamic) application pro-
grams.
2
Snapshot technique. A snapshot boot technique has
also been suggested for fast startup of embedded systems
[19], which is different from the traditional hibernate
shutdown function in that a snapshot of the main mem-
ory after booting an OS is captured only once, and used
repeatedly for every subsequent booting of the system.
However, applying this approach for application launch
is not practical for the following reasons. First, the page
cache in main memory is shared by all applications, and
separating only the portion of the cache content that is
related to a certain application is not possible without
extensive modification of the page cache. Furthermore,
once an application is updated, its snapshot should be in-
validated immediately, which incurs runtime overhead.
Prediction-based prefetch. Modern desktops are
equipped with large (up to several GBs) main memory,
and often have abundant free space available in the main
memory. Prediction-based prefetching, such as Super-
fetch [28] and Preload [12], loads an application’s code
blocks in the free space even if the user does not ex-
plicitly express his intent to execute that particular ap-
plication. These techniques monitor and analyze the
users’ access patterns to predict which applications to be
launched in future. Consequently, the improvement of
launch performance depends strongly on prediction ac-
curacy.
Sorted prefetch. The Windows OS is equipped with
an application prefetcher [36] that prefetches appli-
cation code blocks in a sorted order of their logical
block addresses (LBAs) to minimize disk head move-
ments. A similar idea has also been implemented for
Linux OS [15, 25]. We call these approaches sorted
prefetch. It monitors HDD activities to maintain a list
of blocks accessed during the launch of each application.
Upon detection of an application launch, the application
prefetcher immediately pauses its execution and begins
to fetch the blocks in the list in an order sorted by their
LBAs. The application launch is resumed after fetching
all the blocks, and hence, no page miss occurs during the
launch.
Application defragmentation. The block list informa-
tion can also be used in a different way to further reduce
the seek distance during an application launch. Modern
OSes commonly support a HDD defragmentation tool
that reorganizes the HDD layout so as to place each file in
a contiguous disk space. In contrast, the defragmentation
tool can relocate the blocks in the list of each application
by their access order [36], which helps reduce the total
HDD seek distance during the launch.
Data pinning on flash caches. Recently, flash cache has
been introduced to exploit the advantage of SSDs at a
cost comparable to HDDs. A flash cache can be inte-
grated into traditional HDDs, which is called a hybrid
HDD [37]. Also, a PCI card-type flash cache is available
[26], which is connected to the mother board of a desk-
top or laptop PC. As neither seek nor rotational latency is
incurred while accessing data in the flash cache, we can
accelerate application launch by storing the code blocks
of frequently-used applications, which is called a pinned
set. Due to the small capacity of flash cache, how to
determine the optimal pinned set subject to the capacity
constraint is a key to making performance improvement,
and a few results of addressing this problem have been
reported [16, 18, 22]. We expect that FAST can be in-
tegrated with the flash cache for further improvement of
performance, but leave it as part of our future work.
2.2 SSD Performance Optimization
SSDs have become affordable and begun to be deployed
in desktop and laptop PCs, but their performance char-
acteristics have not yet been understood well. So, re-
searchers conducted in-depth analyses of their perfor-
mance characteristics, and suggested ways to improve
their runtime performance. Extensive experiments have
been carried out to understand the performance dynam-
ics of commercially-available SSDs under various work-
loads, without knowledge of their internal implementa-
tions [7]. Also, SSD design space has been explored
and some guidelines to improve the SSD performance
have been suggested [10]. A new write buffer manage-
ment scheme has also been suggested to improve the ran-
dom write performance of SSDs [20]. Traditional I/O
schedulers optimized for HDDs have been revisited in
order to evaluate their performance on SSDs, and then
a new I/O scheduler optimized for SSDs has been pro-
posed [11, 21].
2.3 Launch Optimization on SSDs
As discussed in Section 2.1, various approaches have
been developed and deployed to improve the applica-
tion launch performance on HDDs. On one hand, many
of them are effective on SSDs as well, and orthogo-
nal to FAST. For example, application-level optimiza-
tion and prediction-based prefetch can be used together
with FAST to further improve application launch perfor-
mance.
On the other hand, some of them exploit the HDD
characteristics to reduce the seek and rotational delay
during an application launch, such as the sorted prefetch
and the application defragmentation. Such methods are
ineffective for SSDs because the internal structure of a
SSD is very different from that of a HDD. A SSD typi-
cally consists of multiple NAND flash memory modules,
and does not have any mechanical moving part. Hence,
unlike a HDD, the access latency of a SSD is irrelevant to
the LBA distance between the last and the current block
3
0
0
0
0
s1 s2 s3 s4
c1 c2 c3 c4
s1 s2 s3 s4c1 c2 c3 c4
Application
Prefetcher
s1 s2s3 s4
c1 c2 c3 c4Application
Prefetcher
Application
Time
Time
Time
Time
Time
(a) Cold start scenario
(c) Proposed prefetching ( )
c1 c2 c3 c4Application
Time(b) Warm start scenario
tcpu > tssd
(d) Proposed prefetching ( )tcpu < tssd
tlaunch
tlaunch
tlaunch
tlaunch
Figure 1: Various application launch scenarios (n= 4).
requests. Thus, prefetching the application code blocks
according to the sorted order of their LBAs or changing
their physical locations will not make any significant per-
formance improvement on SSDs. As the sorted prefetch
has the most similar structure to FAST, we will quanti-
tatively compare its performance with FAST in Section
5.
3 Application Prefetching on SSDs
This section illustrates the main idea of FASTwith exam-
ples and derives a lower bound of the application launch
time achievable with FAST.
3.1 Cold and Warm Starts
We focus on the performance improvement in case of
a cold start, or the first launch of an application upon
system bootup, representing the worst-case application
launch performance. Figure 1(a) shows an example cold
start scenario, where si is the i-th block request gener-
ated during the launch and n the total number of block
requests. After si is completed, the CPU proceeds with
the launch process until another page miss takes place.
Let ci denote this computation.
The opposite extreme is a warm start in which all the
code blocks necessary for launch have been found in the
page cache, and thus, no block request is generated, as
shown in Figure 1(b). This occurs when the application
is launched again shortly after its closure. The warm start
represents an upper-bound of the application launch per-
formance improvement achievable with optimization of
the secondary storage.
Let the time spent for si and ci be denoted by t(si) andt(ci), respectively. Then, the computation (CPU) time,
tcpu, is expressed as
tcpu =n
∑i=1
t(ci), (1)
and the SSD access (I/O) time, tssd , is expressed as
tssd =n
∑i=1
t(si). (2)
3.2 The Proposed Application Prefetcher
The rationale behind FAST is that the I/O request se-
quence generated during an application launch does not
change over repeated launches of the application in case
of cold-start. The key idea of FAST is to overlap the SSD
access (I/O) time with the computation (CPU) time by
running the application prefetcher concurrently with the
application itself. The application prefetcher replays the
I/O request sequence of the original application, which
we call an application launch sequence. An application
launch sequence S can be expressed as (s1, . . . ,sn).Figure 1(c) illustrates how FAST works, where tcpu >
tssd is assumed. At the beginning, the target applica-
tion and the prefetcher start simultaneously, and compete
with each other to send their first block request to the
SSD. However, the SSD always receives the same block
request s1 regardless of which process gets the bus grant
first. After s1 is fetched, the application can proceed with
its launch by the time t(c1), while the prefetcher keeps
issuing the subsequent block requests to the SSD. After
completing c1, the application accesses the code block
corresponding to s2, but no page miss occurs for s2 be-
cause it has already been fetched by the prefetcher. It is
the same for the remaining block requests, and thus, the
resulting application launch time tlaunch becomes
tlaunch = t(s1)+ tcpu. (3)
Figure 1(d) shows another possible scenario where tcpu <
tssd . In this case, the prefetcher cannot complete fetching
s2 before the application finishes computation c1. How-
ever, s2 can be fetched by t(c1) earlier than that of the
cold start, and this improvement is accumulated for all
of the remaining block requests, resulting in tlaunch:
tlaunch = tssd + t(cn). (4)
Note that n ranges up to a few thousands for typical ap-
plications, and thus, t(s1)≪ tcpu and t(cn)≪ tssd . Con-
sequently, Eqs. (3) and (4) can be combined into a single
equation as:
tlaunch ≈max(tssd , tcpu), (5)
4
0
s1 s2 s3 s4
c1 c2 c3Application
PrefetcherTime
Time
c5 c6 c7 c8c4
s5 . . .s8 tactualtexpected
t(c8) tactual− texpected
Figure 2: A worst-case example (tcpu = tssd).
Disk I/O profiler
Application launch
sequence
LBA-to-inode map
Raw block request
sequences
Application launch sequence
extractor
LBA-to-inode reverse mapper
Application prefetcher generator
Target application
Application prefetcher
Application launch manager
System call profiler
< Launch time processes > < Idle time processes >
Figure 3: The proposed application prefetching.
which represents a lower bound of the application launch
time achievable with FAST.
However, FAST may not achieve application launch
performance close to Eq. (5) when there is a significant
variation of I/O intensiveness, especially if the beginning
of the launch process is more I/O intensive than the other.
Figure 2 illustrates an extreme example of such a case,
where the first half of this example is SSD-bound and the
second half is CPU-bound. In this example, tcpu is equal
to tssd , and thus the expected launch time texpected is given
to be tssd + t(c8), according to Eq. (4). However, the ac-
tual launch time tactual is much larger than texpected . The
CPU usage in the first half of the launch time is kept quite
low despite the fact that there are lots of remaining CPU
computations (i.e., c5, . . . ,c8) due to the dependency be-
tween si and ci. We will provide a detailed analysis for
this case using real applications in Section 5.
4 Implementation
We chose the Linux OS to demonstrate the feasibility and
the superior performance of FAST. The implementation
of FAST consists of a set of components: an application
launch manager, a system-call profiler, a disk I/O pro-
filer, an application launch sequence extractor, a LBA-
to-inode reverse mapper, and an application prefetcher
generator. Figure 3 shows how these components inter-
act with each other. In what follows, we detail the imple-
mentation of each of these components.
4.1 Application Launch Sequence
4.1.1 Disk I/O Profiler
The disk I/O profiler is used to track the block re-
quests generated during an application launch. We used
Blktrace [3], a built-in Linux kernel I/O-tracing tool
that monitors the details of I/O behavior for the evalua-
tion of I/O performance. Blktrace can profile various
I/O events: inserting an item into the block layer, merg-
ing the item with a previous request in the queue, remap-
ping onto another device, issuing a request to the device
driver, and a completion signal from the device. From
these events, we collect the trace of device-completion
events, each of which consists of a device number, a
LBA, the I/O size, and completion time.
4.1.2 Application Launch Sequence Extractor
Ideally, the application launch sequence should include
all of the block requests that are generated every time the
application is launched in the cold start scenario, with-
out including any block requests that are not relevant to
the application launch. We observed that the raw block
request sequence captured by Blktrace does not vary
from one launch to another, i.e., deterministic for mul-
tiple launches of the same application. However, we
observed that other processes (e.g., OS and application
daemons) sometimes generate their own I/O requests si-
multaneously with the application launch. To handle this
case, the application launch sequence extractor collects
two or more raw block request sequences to extract a
common sequence, which is then used as a launch se-
quence of the corresponding application. The imple-
mentation of the application launch sequence extractor
is simple: it searches for and removes any block requests
appearing in some of the input sequences. This proce-
dure makes all the input sequences the same, so we use
any of them as an application launch sequence.
4.2 LBA-to-Inode Map
4.2.1 LBA-to-Inode Reverse Mapper
Our goal is to create an application prefetcher that gen-
erates exactly the same block request sequence as the
obtained application launch sequence, where each block
request is represented as a tuple of starting LBA and
size. Since the application prefetcher is implemented as
a user-level program, every disk access should be made
via system calls with a file name and an offset in that file.
Hence, we must obtain the file name and the offset of
each block request in an application launch sequence.
Most file systems, including EXT3, do not support
such a reverse mapping from LBA to file name and off-
set. However, for a given file name, we can easily find
5
the LBA of all of the blocks that belong to the file and
their relative offset in the file. Hence, we can build a
LBA-to-inode map by gathering this information for ev-
ery file. However, building such a map of the entire file
system is time-consuming and impractical because a file
system, in general, contains tens of thousands of files and
their block locations on the disk change very often.
Therefore, we build a separate LBA-to-inode map
for each application, which can significantly reduce the
overhead of creating a LBA-to-inode map because (1)
the number of applications and the number of files used
in launching each application are very small compared
to the number of files in the entire file system; and (2)
most of them are shared libraries and application code
blocks, so their block locations remain unchanged unless
they are updated or disk defragmentation is performed.
We implement the LBA-to-inode reverse mapper that
receives a list of file names as input and creates a LBA-
to-inode map as output. A LBA-to-inode map is built
using a red-black tree in order to reduce the search time.
Each node in the red-black tree has the LBA of a block as
its key, and a block type as its data by default. According
to the block type, different types of data are added to
the node. A block type includes a super block, a group
descriptor, an inode block bitmap, a data block bitmap,
an inode table, and a data block. For example, a node for
a data block has a block type, a device number, an inode
number, an offset, and a size. Also, for a data block, a
table is created to keep the mapping information between
an inode number and its file name.
4.2.2 System-Call Profiler
The system-call profiler obtains a full list of file names
that are accessed during an application launch,1 and
passes it to the LBA-to-inode reverse mapper. We used
strace for the system-call profiler, which is a debugging
tool in Linux. We can specify the argument of strace
so that it may monitor only the system calls that have a
file name as their argument. As many of these system
calls are rarely called during an application launch, we
monitor only the following system calls that frequently
occur during application launches: open(), creat(),
execve(), stat(), stat64(), lstat(), lstat64(),
access(), truncate(), truncate64(), statfs(),
statfs64(), readlink(), and unlink().
4.3 Application Prefetcher
4.3.1 Application Prefetcher Generator
The application prefetcher is a user-level program that
replays the disk access requests made by a target appli-
1Files mounted on pseudo file systems such as procfs and sysfs
are not processed because they never generate any disk I/O request.
Table 1: System calls to replay access of blocks in an
application launch sequence
Block type System call
Inode table open()
Data block: a directory opendir() and readdir()
Data block: a regular file read() or posix_fadvise()
Data block: a symbolic
link file
readlink()
cation. We implemented the application prefetcher gen-
erator to automatically create an application prefetcher
for each target application. It performs the following op-
erations.
1. Read si one-by-one from S of the target application.
2. Convert si into its associated data items stored in the
LBA-to-inode map, e.g.,
(dev,LBA,size)→(datablk,filename,offset,size) or
(dev,LBA,size)→(inode,start_inode,end_inode).
3. Depending on the type of block, generate an appro-
priate system call using the converted disk access
information.
4. Repeat Steps 1–3 until processing all si.
Table 1 shows the kind of system calls used for each
block type. There are two system calls that can be
used to replay the disk access for data blocks of a reg-
ular file. If we use read(), data is first moved from
the SSD to the page cache, and then copying takes
place from the page cache to the user buffer. The sec-
ond step is unnecessary for our purpose, as the process
that actually manipulates the data is not the application
prefetcher but the target application. Hence, we chose
posix fadvise() that performs only the first step, from
which we can avoid the overhead of read(). We use
the POSIX FADV WILLNEED parameter, which informs
the OS that the specified data will be used in the near
future. When to issue the corresponding disk access af-
ter posix fadvise() is called depends on the OS im-
plementation. We confirmed that the current version of
Linux we used issues a block request immediately after
receiving the information through posix fadvise(),
thus meeting our need. A symbolic-linked file name is
stored in data block pointers in an inode entry when the
length of the file name is less than or equal to 60 bytes
(c.f., the space of data block pointers is 60 bytes, 4*12
for direct, 4 for single indirect, another 4 for double in-
direct, and last 4 for triple indirect data block pointer).
If the length of linked file name is more than 60 bytes,
the name is stored in the data blocks pointed to by data
block pointers in the inode entry. We use readlink() to
replay the data block access of symbolic-link file names
that are longer than 60 bytes.
6
int main(void) {
...readlink("/etc/fonts/conf.d/90-ttf-arphic-uming-emb
olden.conf", linkbuf, 256);int fd423;fd423 = open("/etc/fonts/conf.d/90-ttf-arphic-uming
-embolden.conf", O_RDONLY);posix_fadvise(fd423, 0, 4096, POSIX_FADV_WILLNEED);
posix_fadvise(fd351, 286720, 114688, POSIX_FADV_WILLNEED);
int fd424;fd424 = open("/usr/share/fontconfig/conf.avail/90-tt
f-arphic-uming-embolden.conf", O_RDONLY);
posix_fadvise(fd424, 0, 4096, POSIX_FADV_WILLNEED);int fd425;
fd425 = open("/root/.gnupg/trustdb.gpg", O_RDONLY);posix_fadvise(fd425, 0, 4096, POSIX_FADV_WILLNEED);dirp = opendir("/var/cache/");
if(dirp)while(readdir(dirp));...
return 0;}
Figure 4: An example application prefetcher.
Figure 4 is an example of automatically-generated ap-
plication prefetcher. Unlike the target application, the
application prefetcher successively fetches all the blocks
as soon as possible to minimize the time between adja-
cent block requests.
4.3.2 Implicitly-Prefetched Blocks
In the EXT3 file system, the inode of a file includes
pointers of up to 12 data blocks, so these blocks can
be found immediately after accessing the inode. If the
file size exceeds 12 blocks, indirect, double indirect, and
triple indirect pointer blocks are used to store the point-
ers to the data blocks. Therefore, requests for indirect
pointer blocks may occur in the cold start scenario when
the application is accessing files larger than 12 blocks.
We cannot explicitly load those indirect pointer blocks in
the application prefetcher because there is no such sys-
tem call. However, the posix fadvise() call for a data
block will first make a request for the indirect blockwhen
needed, so it can be fetched in a timely manner by run-
ning the application prefetcher.
The following types of block request are not listed in
Table 1: a superblock, a group descriptor, an inode entry
bitmap, a data block bitmap. We found that requests to
these types of blocks seldom occur during an application
launch, so we did not consider their prefetching.
4.4 Application Launch Manager
The role of the application launch manager is to detect
the launch of an application and to take an appropri-
ate action. We can detect the beginning of an applica-
tion launch by monitoring execve() system call, which
is implemented using a system-call wrapper. There are
three phases with which the application launch manager
Table 2: Variables and parameters used by the applica-
tion launch manager
Type Description
ninit A counter to record the number of application
launches done in the initial launch phase
npre f A counter to record the number of launches
done in the application prefetch phase after the
last check of the miss ratio of the application
prefetcher
Nrawseq The number of raw block request sequences that
are to be captured at the launch profiling phase
Nchk The period to check the miss ratio of the applica-
tion prefetcher
Rmiss A threshold value for the prefetcher miss ratio that
is used to determine if an update of the application
or shared libraries has taken place
Tidle A threshold value for the idle time period that is
used to determine if an application launch is com-
pleted
Ttimeout The maximum amount of time allowed for the
disk I/O profiler to capture block requests
deals: a launch profiling phase, a prefetcher generation
phase, and an application prefetch phase. The applica-
tion launch manager uses a set of variables and param-
eters for each application to decide when to change its
phase. These are summarized in Table 2.
Here we describe the operations performed in each
phase:
(1) Launch profiling. If no application prefetcher is
found for that application, the application launch man-
ager regards the current launch as the first launch of this
application, and enters the initial launch phase. In this
phase, the application launch manager performs the fol-
lowing operations in addition to the launch of the target
application:
1. Increase ninit of the current application by 1.
2. If ninit = 1, run the system call profiler.
3. Flush the page cache, dentries (directory entries),
and inodes in the main memory to ensure a cold start
scenario, which is done by the following command:
echo 3 > /proc/sys/vm/drop_caches
4. Run the disk I/O profiler. Terminate the disk I/O
profiler when any of the following conditions are
met: (1) if no block request occurs during the last
Tidle seconds or (2) the elapsed time since the start
of the disk I/O profiler exceeds Ttimeout seconds.
5. If ninit = Nrawseq, enter the prefetcher generation
phase after the current launch is completed.
(2) Prefetcher generation. Once application launch
profiling is done, it is ready to generate an application
7
prefetcher using the information obtained from the first
phase. This can be performed either immediately after
the application launch is completed, or when the system
is idle. The following operations are performed:
1. Run the application launch sequence extractor.
2. Run the LBA-to-inode reverse mapper.
3. Run the application prefetcher generator.
4. Reset the values of ninit and npre f to 0.
(3) Application prefetch. If the application prefetcher
for the current application is found, the application
launch manager runs the prefetcher simultaneously with
the target application. It also periodically checks the miss
ratio of the prefetcher to determine if there has been any
update of the application or shared libraries. Specifically,
the following operations are performed:
1. Increase npre f of the current application by 1.
2. If npre f = Nchk, reset the value of npre f to 0 and run
the disk I/O profiler. Its termination conditions are
the same as those in the first phase.
3. Run the application prefetcher simultaneously with
the target application.
4. If a raw block request sequence is captured, use it to
calculate the miss ratio of the application prefetcher.
If it exceeds Rmiss, delete the application prefetcher.
The miss ratio is defined as the ratio of the number of
block requests not issued by the prefetcher to the total
number of block requests in the application launch se-
quence.
5 Performance Evaluation
5.1 Experimental Setup
Experimental platform. We used a desktop PC
equipped with an Intel i7-860 2.8 GHz CPU, 4GB of
PC12800 DDR3 SDRAM and an Intel 80GB SSD (X25-
M G2Mainstream). We installed a Fedora 12 with Linux
kernel 2.6.32 on the desktop, in which we set NOOP
as the default I/O scheduler. For benchmark applica-
tions, we chose frequently used user-interactive appli-
cations, for which application launch performance mat-
ters much. Such an application typically uses graphical
user interfaces and requires user interaction immediately
after completing its launch. Applications like gcc and
gzip are not included in our set of benchmarks as launch
performance is not an issue for them. Our benchmark
set consists of the following Linux applications: Acro-
bat reader, Designer-qt4, Eclipse, F-Spot, Firefox, Gimp,
Gnome, Houdini, Kdevdesigner, Kdevelop, Konqueror,
Labview, Matlab, OpenOffice, Skype, Thunderbird, and
XilinxISE. In addition to these, we used Wine [1], which
is an implementation of the Windows API running on the
Linux OS, to test Access, Excel, Powerpoint, Visio, and
Word—typical Windows applications.
Test scenarios. For each benchmark application, we
measured its launch time for the following scenarios.
• Cold start: The application is launched immediately
after flushing the page cache, using the method de-
scribed in Section 4.4. The resulting launch time is
denoted by tcold .
• Warm start: We first run the application prefetcher
only to load all the blocks in the application launch
sequence to the page cache, and then launch the
application. Let twarm denote the resulting launch
time.
• Sorted prefetch: To evaluate the performance of the
sorted prefetch [15, 25, 36] on SSDs, we modify the
application prefetcher to fetch the block requests in
the application launch sequence in the sorted order
of their LBAs. After flushing the page cache, we
first run the modified application prefetcher, then
immediately run the application. Let tsorted denote
the resulting launch time.
• FAST: We flush the page cache, and then run
the application simultaneously with the application
prefetcher. The resulting launch time is denoted by
tFAST .
• Prefetcher only: We flush the page cache and run
the application prefetcher. The completion time
of the application prefetcher is denoted by tssd . It
is used to calculate a lower bound of the appli-
cation launch time tbound = max(tssd , tcpu), wheretcpu = twarm is assumed.
Launch-time measurement. We start an application
launch by clicking an icon or inputting a command, and
can accurately measure the launch start time by moni-
toring when execve() is called. Although it is difficult
to clearly define the completion of a launch, a reasonable
definition is the first moment the application becomes re-
sponsive to the user [2]. However, it is difficult to accu-
rately and automatically measure that moment. So, as
an alternative, we measured the completion time of the
last block request in an application launch sequence us-
ing Blktrace, assuming that the launch will be com-
pleted very soon after issuing the last block request. For
the warm start scenario, we executed posix fadvise()
with POSIX FADV DONTNEED parameter to evict the last
block request from the page cache. For the sorted
prefetch and the FAST scenarios, we modified the ap-
plication prefetcher so that it skips prefetching of the last
block request.
8
T Q [ 2 Z R \ S ] TN
N
TNNNNN
QNNNNN
[NNNNN
2NNNNN
Y=;'#%*&9*-/5=)*'4&"F*%#@=#$)*$#@=#/"#$
!554-"()-&/*4(=/"I
*$#@=#/"#*$-^#*_'4&"F$`
Figure 5: The size of application launch sequences.
5.2 Experimental Results
Application launch sequence generation. We captured
10 raw block request sequences during the cold start
launch of each application. We ran the application launch
sequence extractor with a various number of input block
request sequences, and observed the size of the result-
ing application launch sequences. Figure 5 shows that
for all the applications we tested, there is no significant
reduction of the application launch sequence size while
increasing the number of inputs from 2 to 10. Hence, we
set the value of Nrawseq in Table 2 to 2 in this paper. We
used the size of the first captured input sequence as the
number of inputs one in Figure 5 (the application launch
sequence extractor requires at least two input sequences).
For some applications, there are noticeable differences in
size between the number of inputs one and two. This is
because the first raw input request sequence includes a
set of bursty I/O requests generated by OS and user dae-
mons that are irrelevant to the application launch. Fig-
ure 5 shows that such I/O requests can be effectively
excluded from the resulting application launch sequence
using just two input request sequences.
The second and third columns of Table 3 summarize
the total number of block requests and accessed blocks of
the thus-obtained application launch sequences, respec-
tively. The last column shows the total number of files
used during the launch of each application.
Testing of the application prefetcher. Application
prefetchers are automatically generated for the bench-
mark applications using the application launch sequences
in Table 3. In order to see if the application prefetch-
ers fetch all the blocks used by an application, we
first flushed the page cache, and launched each applica-
tion immediately after running the application prefetcher.
During the application launch, we captured all the block
requests generated using Blktrace, and counted the
number of missed block requests. The average number of
missed block requests was 1.6% of the number of block
requests in the application launch sequence, but varied
among repeated launches, e.g., from 0% to 6.1% in the
experiments we performed.
Table 3: Collected launch sequences (Nrawseq = 2)
Application # of block # of fetched # of used
requests blocks files
Access 1296 106 992 555
Acrobat reader 960 73 784 178
Designer-qt4 2400 138 608 410
Eclipse 4163 155 216 787
Excel 1610 169 112 583
F-Spot 1180 49 968 304
Firefox 1566 60 944 433
Gimp 1939 66 928 799
Gnome 4739 228 872 538
Houdini 4836 290 320 724
Kdevdesigner 1537 44 904 467
Kdevelop 1970 63 104 372
Konqueror 1780 62 216 296
Labview 2927 154 768 354
Matlab 6125 267 312 742
OpenOffice 1425 104 600 308
Powerpoint 1405 120 808 576
Skype 892 41 560 197
Thunderbird 1533 64 784 429
Visio 1769 168 832 662
Word 1715 181 496 613
Xilinx ISE 4718 328 768 351
By examining themissed block requests, we could cat-
egorize them into three types: (1) files opened by OS
daemons and user daemons at boot time; (2) journaling
data or swap partition accesses; and (3) files dynamically
created or renamed at every launch (e.g., tmpfile()).
The first type occurs because we force the page cache to
be flushed in the experiment. In reality, they are highly
likely to reside in the page cache, and thus, this type of
misses will not be a problem. The second type is irrel-
evant to the application, and observed even during idle
time. The third type occurs more or less often, depend-
ing on the application. FAST does not prefetch this type
of block requests as they change at every launch.
Experiments for the test scenarios. We measured the
launch time of the benchmark applications for each test
scenario listed in Section 5.1. Figure 6 shows that the
average launch time reduction of FAST is 28% over the
cold start scenario. The performance of FAST varies
considerably among applications, ranging from 16% to
46% reduction of launch time. In particular, FAST shows
performance very close to tbound for some applications,
such as Eclipse, Gnome, and Houdini. On the other hand,
the gap between tbound and tFAST is relatively larger for
such applications as Acrobat reader, Firefox, OpenOf-
fice, and Labview.
Launch time behavior. We conducted experiments to
see if the application prefetcher works well as expected
when it is simultaneously run with the application. We
9
+''>))
+'82=7658>74>8
G>)?H@>8IJ6%
K'3?,)>
KL'>3
MIN,26
M?8>*2L
O?:,
O@2:>
P2<4?@?
Q4>R4>)?H@>8
Q4>R>32,
Q2@B<>828
S7=R?>(
T7637=
U,>@U**?'>
V2(>8,2?@6
NAW,>
XC<@4>8=?84
Y?)?2
9284
Z?3?@L[NK
+R>87H>
#\#]
$#\#]
%#\#]
.#\#]
&#\#]
"##\#]
"$#\#] 1.6s 0.8s 1.9s 4.8s 2.1s 1.1s 0.9s 2.3s 2.6s 5.6s 1.8s 1.6s 1.2s 2.7s 5.1s 0.9s 1.9s 1.0s 1.0s 3.7s 2.6s 6.6s 93%
72%
63%
27%
tcold
tsorted
tFAST
twarm
tssd
tbound
Figure 6: Measured application launch time (normalized to tcold).
tcoldtwarm tFAST tsorted
tcoldtwarm tFAST tsorted
SSDCPU
SSDCPU
SSDCPU
SSDCPU
SSDCPU
SSDCPU
SSDCPU
SSDCPU
Application: Eclipse
Application: Firefox
0 5
0 1
(sec)
(sec)
Coldstart
Warmstart
FAST
Sortedprefetch
Coldstart
Warmstart
FAST
Sortedprefetch
Low CPU usage
(a)
(b)
(c)
1 2 3 4
Figure 7: Usage of CPU and SSD (sampling rate = 1 KHz).
chose Firefox because it shows a large gap between
tbound and tFAST . We monitored the generated block re-
quests during the launch of Firefox with the application
prefetcher, and observed that the first 12 of the entire
1566 block requests were issued by Firefox, which took
about 15 ms. As the application prefetcher itself should
be launched as well, FAST cannot prefetch these block
requests until finishing its launch. However, we ob-
served that all the remaining block requests were issued
by FAST, meaning that they are successfully prefetched
before the CPU needs them.
CPU and SSD usage patterns. We performed another
experiment to observe the CPU and SSD usage patterns
in each test scenario. We chose two applications, Eclipse
and Firefox, representing the two groups of applications
of which tFAST is close to and far from tbound , respec-
tively. We modified the OS kernel to sample the number
of CPU cores having runnable processes and to count the
number of cores in the I/O wait state. Figure 7 shows
the CPU and SSD usage of the two applications, where
the entire CPU is regarded as busy if at least one of its
cores is active. Similarly, the SSD is assumed busy if
there are one or more cores in the I/O wait state. In the
cold start scenario, there is almost no overlap between
CPU computation and SSD access for both applications.
In the warm start scenario, the CPU stays fully active
until the launch is completed as there is no wait. One ex-
ception we observed is the time period marked with Cir-
cle (a), during which the CPU seems to be in the event-
waiting state. FAST is shown to be successful in overlap-
ping CPU computation with SSD access as we intended.
However, CPU usage is observed to be low at the begin-
ning of launch for both applications, which can be ex-
plained with the example in Figure 2. As Eclipse shows
a shorter such time period (Circle (b)) than Firefox (Cir-
cle (c)), tFAST can reach closer to tbound . In the case of
Firefox, however, the ratio of tcpu to tssd is close to 1:1,
allowing FAST to achieve more reduction of launch time
for Firefox than for Eclipse.
Performance of sorted prefetch. Figure 6 shows that
the sorted prefetch reduces the application launch time
by an average of 7%, which is less efficient than FAST,
but non-negligible. One reason for this improvement is
the difference in I/O burstiness between the cold start
and the sorted prefetch. Most SSDs (including the one
we used) support the native command queueing (NCQ)
10
Z TN TZ QN
N
2
S
TQ
"
B
$
9
Launch c
om
ple
tion
tim
e (
sec)
Number of applications
tcold
twarm
tFAST
tsorted
Figure 8: Simultaneous launch of multiple applications.
feature, which allows up to 31 block requests to be sent
to a SSD controller. Using this information, the SSD
controller can read as many NAND flash chips as possi-
ble, effectively increasing read throughput. The average
queue depth in the cold start scenario is close to 1, mean-
ing that for most of time there is only one outstanding
request in case of SSD. In contrast, in the sorted prefetch
scenario, the queue depth will likely grow larger than
1 because the prefetcher may successively issue asyn-
chronous I/O requests using posix fadvise(), at small
inter-issue intervals.
On the other hand, we could not find a clear evidence
that sorting block requests in their LBA order is advan-
tageous in case of SSD. Rather, the execution time of
the sorted prefetcher was slightly longer than its unsorted
version for most of the applications we tested. Also, the
sorted prefetch shows worse performance than the cold
start for Excel, Powerpoint, Skype, and Word. Although
these observations were consistent over repeated tests, a
further investigation is necessary to understand such a
behavior.
Simultaneous launch of applications. We performed
experiments to see how well FAST can scale up for
launching multiple applications. We launched multiple
applications starting from the top of Table 3, adding five
at a time, and measured the launch completion time of
all launched applications2. Figure 8 shows that FAST
could reduce the launch completion time for all the tests,
whereas the sorted prefetch does not scale beyond 10 ap-
plications. Note that the FAST improvement decreased
from 20% to 7% as the number of applications increased
from 5 to 20.
Runtime and space overhead. We analyzed the run-
time overhead of FAST for seven possible combinations
of running processes, and summarized the results in Ta-
ble 4. Cases 2 and 3 belong to the launch profiling phase,
which was described in Section 4.4. During this phase,
Case 2 occurs only once, and Case 3 occurs Nrawseq
times. Case 4 corresponds to the prefetcher generation
phase (the right side of Figure 3), and shows a relatively
long runtime. However, we can hide it from users by run-
ning it in background. Also, since we primarily focused
on functionality in the current implementation, there is
2Except for Gnome that cannot be launched with other applications,
and Houdini whose license had expired.
Table 4: Runtime overhead (application: Firefox)
Running processes Runtime (sec)
1. Application only (cold start scenario) 0.86
2. strace + blktrace + application 1.21
3. blktrace + application 0.88
4. Prefetcher generation 5.01
5. Prefetcher + application 0.56
6. Prefetcher + blktrace + application 0.59
7. Miss ratio calculation 0.90
room for further optimization. Cases 5, 6, and 7 belong
to the application prefetch phase, and repeatedly occur
until the application prefetcher is invalidated. Cases 6
and 7 occur only when npre f reaches Nchk, and Case 7
can be run in background.
FAST creates temporary files such as system call log
files and I/O traces, but these can be deleted after FAST
completes creating application prefetchers. However, the
generated prefetchers occupy disk space as far as ap-
plication prefetching is used. In addition, application
launch sequences are stored to check the miss ratio of
the corresponding application prefetcher. In our exper-
iment, the total size of the application prefetchers and
application launch sequences for all 22 applications was
7.2 MB.
FAST applicability. While previous examples clearly
demonstrated the benefits of FAST for a wide range of
applications, FAST does not guarantee improvements for
all cases. One such a scenario is when a target appli-
cation is too small to offset the overhead of loading the
prefetcher. We tested FASTwith the Linux utility uname,
which displays the name of the OS. It generated 3 I/O re-
quests whose total size was 32 KB. The measured tcoldwas 2.2 ms, and tFAST was 2.3 ms, 5% longer than the
cold start time.
Another possible scenario is when the target applica-
tion experiences a major update. In this scenario, FAST
may fetch data that will not be used by the newly up-
dated application until it detects the application update
and enters a new launch profiling phase. We modified
the application prefetcher so that it fetches the same size
of data from the same file but from another offset that
is not used by the application. We tested the modi-
fied prefetcher with Firefox. Even in this case, FAST
reduced application launch time by 4%, because FAST
could still prefetch some of the metadata used by the ap-
plication. Assuming most of the file names are changed
after the update, we ran Firefox with the prefetcher for
Gimp, which fetches a similar number of blocks as Fire-
fox. In this experiment, the measured application launch
time was 7% longer than the cold start time, but the per-
formance degradation was not drastic due to the internal
parallelism of the SSD we used (10 channels).
11
Configuring application launch manager. The appli-
cation launch manager has a set of parameters to be con-
figured, as shown in Table 2. If Nrawseq is set too large,
users will experience the cold-start performance during
the initialization phase. If it is set too small, unnecessary
blocks may be included in the application prefetcher.
Figure 5 shows that setting it between 2 and 4 is a good
choice. The proper value of Nchk will depend on the run-
time overhead of Blktrace; if FAST is placed in the OS
kernel, the miss ratio of the application prefetchermay be
checked upon every launch (Nchk = 1) without noticeable
overhead. Also, setting Rmiss to 0.1 is reasonable, but it
needs to be adjusted after gaining enough experience in
using FAST. To find the proper value of Tidle, we investi-
gated the SSD’s maximum idle time during the cold-start
of applications, and found it to range from 24 ms (Thun-
derbird) to 826 ms (Xilinx ISE). Hence, setting Tidle to 2
seconds is proper in practice. As the maximum cold-start
launch time is observed to be less than 10 seconds, 30
seconds may be reasonable for Ttimeout . All these values
may need to be adjusted, depending on the underlying
OS and applications.
Running FAST on HDDs. To see how FAST works on
a HDD, we replaced the SSD with a Seagate 3.5” 1 TB
HDD (ST31000528AS) and measured the launch time of
the same set of benchmark applications. Although FAST
worked well as expected by hiding most of CPU com-
putation from the application launch, the average launch
time reduction was only 16%. It is because the applica-
tion launch on a HDD is mostly I/O bound; in the cold
start scenario, we observed that about 85% of the appli-
cation launch time was spent on accessing the HDD. In
contrast, the sorted prefetch was shown to be more ef-
fective; it could reduce the application launch time by an
average of 40% by optimizing disk head movements.
We performed another experiment by modifying the
sorted prefetch so that the prefetcher starts simultane-
ously with the original application, like FAST. However,
the resulting launch time reduction was only 19%, which
is worse than that of the unmodified sorted prefetch. The
performance degradation is due to the I/O contention be-
tween the prefetcher and the application.
6 Applicability of FAST to Smartphones
The similarity between modern smartphones and PCs
with SSDs in terms of the internal structure and the us-
age pattern, as summarized below, makes smartphones a
good candidate to which we can apply FAST:
• Unlike other mobile embedded systems, smart-
phones run different applications at different times,
making application launch performance matter
more;
!55T
!55Q
!55[
!552
!55Z
!55R
!55\
!55S
!55]
!55TN
!55TT
!55TQ
!55T[
!55T2
N
Z
TN
TZU&4+*$)(%)
K(%;*$)(%)
Applic
ation
launch t
ime (
sec)
Figure 9: Measured application launch time on iPhone 4
(CPU: 1 GHz, SDRAM: 512 MB, NAND flash: 32 GB).
• Smartphones use NAND flash as their secondary
storage, of which the performance characteristics
are basically the same as the SSD; and
• Smartphones often use slightly customized (if not
the same) OSes and file systems that are designed
for PCs, reducing the effort to port FAST to smart-
phones.
Furthermore, a smartphone has the characteristics that
enhance the benefit of using FAST as follows:
• Users tend to launch and quit applications more fre-
quently on smartphones than on PCs;
• Due to relatively smaller main memory of a smart-
phone, users will experience cold start performance
more frequently; and
• Its relatively slower CPU and flash storage speed
may increase the absolute reduction of application
launch time by applying FAST.
Although we have not yet implemented FAST on a
smartphone, we could measure the launch time of some
smartphone applications by simply using a stopwatch.
We randomly chose 14 applications installed on the
iPhone 4 to compare their cold and warm start times, of
which the results are plotted in Figure 9. The average
cold start time of the smartphone applications is 6.1 sec-
onds, which is more than twice of the average cold start
time of the PC applications (2.4 seconds) shown in Fig-
ure 6. Figure 9 also shows that the average warm start
time is 63% of the cold start time (almost the same ra-
tio as in Figure 6), implying that we can achieve similar
benefits from applying FAST to smartphones.
7 Comparison of FAST with Traditional
Prefetching
FAST is a special type of prefetching optimized for appli-
cation launch, whereas most of the traditional prefetch-
ing schemes focus on runtime performance improve-
ment. We compare FAST with the traditional prefetching
algorithms by answering the following three questions
that are inspired by previous work [32].
12
7.1 What to Prefetch
FAST prefetches the blocks appeared in the application
launch sequence. While many prediction-based prefetch-
ing schemes [9, 23, 39] suffer from the low hit ratio of
the prefetched data, FAST can achieve near 100% hit
ratio. This is because the application launch sequence
changes little over repeated launches of an application,
as observed by previous work [4, 18, 34].
Sequential pattern detection schemes like readahead
[13, 31] can achieve a fairly good hit ratio when acti-
vated, but they are applicable only when such a pattern
is detected. By contrast, FAST guarantees stable perfor-
mance improvement for every application launch.
One way to enhance the prefetch hit ratio for a com-
plicated disk I/O pattern is to analyze the application
source code to extract its access pattern. Using the thus-
obtained pattern, prefetching can be done by either in-
serting prefetch codes into the application source code
[29, 38] or converting the source code into a computa-
tion thread and a prefetch thread [40]. However, such
an approach does not work well for application launch
optimization because many of the block requests gener-
ated during an application launch are not from the ap-
plication itself but from other sources, such as loading
shared libraries, which cannot be analyzed by examin-
ing the application source code. Furthermore, both re-
quire modification of the source code, which is usually
not available for most commercial applications. Even
if the source code is available, modifying and recompil-
ing every application would be very tedious and incon-
venient. In contrast, FAST does not require application
source code and is thus applicable for any commercial
application.
Another relevant approach [6] is to deploy a shadow
process that speculatively executes the copy of the orig-
inal application to get hints for the future I/O requests.
It does not require any source modification, but con-
sumes non-negligibleCPU andmemory resources for the
shadow process. Although it is acceptable when CPU
is otherwise stalled waiting for the I/O completion, em-
ploying such a shadow process in FAST may degrade ap-
plication launch performance as there is not enough CPU
idle period as shown in Figure 7.
7.2 When to Prefetch
FAST is not activated until an application is launched,
which is as conservative as demand paging. Thus, un-
like prediction-based application prefetching schemes
[12, 28], there is no cache-pollution problem or addi-
tional disk I/O activity during idle period. However, once
activated, FAST aggressively performs prefetching: it
keeps on fetching subsequent blocks in the application
launch sequence asynchronously even in the absence of
page misses. As the prefetched blocks are mostly (if not
all) used by the application, the performance improve-
ment of FAST is comparable to that of the prediction-
based schemes when their prediction is accurate.
7.3 What to Replace
FAST does not modify the replacement algorithm of
page cache in main memory, so the default page replace-
ment algorithm is used to determine which page to evict
in order to secure free space for the prefetched blocks.
In general, prefetching may significantly affect the
performance of page replacement. Thus, previous work
[5, 14, 35] emphasized the need for integrated prefetch-
ing and caching. However, FAST differs from the tradi-
tional prefetching schemes since it prefetches only those
blocks that will be referenced before the application
launch completes (e.g., in next few seconds). If the page
cache in the main memory is large enough to store all
the blocks in the application launch sequence, which is
commonly the case, FAST will have minimal effect on
the optimality of the page replacement algorithm.
8 Conclusion
We proposed a new I/O prefetching technique called
FAST for the reduction of application launch time on
SSDs. We implemented and evaluated FAST on the
Linux OS, demonstrating its deployability and perfor-
mance superiority. While the HDD-aware application
launcher showed only 7% of launch time reduction on
SSDs, FAST achieved a 28% reduction with no addi-
tional overhead, demonstrating the need for, and the
utility of, a new SSD-aware optimizer. FAST with a
well-designed entry-level SSD can provide end-users the
fastest application launch performance. It also incurs
fairly low implementation overhead and has excellent
portability, facilitating its wide deployment in various
platforms.
Acknowledgments
We deeply appreciate Prof. Heonshik Shin for his sup-
port and providing research facility. We also thank our
shepherd Arkady Kanevsky, and the anonymous review-
ers for their invaluable comments that improved this pa-
per. This research was supported by WCU (World Class
University) program through National Research Founda-
tion of Korea funded by the Ministry of Education, Sci-
ence and Technology (R33-10085), and RP-Grant 2010
of Ewha Womans University. Sangsoo Park is the corre-
sponding author (email: [email protected]).
13
References
[1] Wine User Guide. http://www.winehq.org/docs/wineusr-guide/
index, Last accessed on: 17 November 2010.
[2] APPLE INC. Launch Time Performance Guidelines. http://
developer.apple.com/documentation/Performance/Conceptual/
LaunchTime/LaunchTime.pdf, 2006.
[3] AXBOE, J. Block IO Tracing. http://www.kernel.org/git/?p=
linux/kernel/git/axboe/blktrace.git;a=blob;f=README, 2006.
[4] BHADKAMKAR, M., GUERRA, J., USECHE, L., BURNETT, S.,
LIPTAK, J., RANGASWAMI, R., AND HRISTIDIS, V. BORG:
Block-reORGanization for self-optimizing storage systems. In
Proc. FAST (2009), pp. 183–196.
[5] CAO, P., FELTEN, E. W., KARLIN, A. R., AND LI, K. A study
of integrated prefetching and caching strategies. In Proc. SIG-
METRICS (1995), pp. 188–197.
[6] CHANG, F., AND GIBSON, G. A. Automatic I/O hint generation
through speculative execution. In Proc. OSDI (1999), pp. 1–14.
[7] CHEN, F., KOUFATY, D. A., AND ZHANG, X. Understanding
intrinsic characteristics and system implications of flash memory
based solid state drives. In Proc. SIGMETRICS (2009), pp. 181–
192.
[8] COLITTI, L. Analyzing and improving GNOME startup time. In
Proc. SANE (2006), pp. 1–11.
[9] CUREWITZ, K. M., KRISHNAN, P., AND VITTER, J. S. Practi-
cal prefetching via data compression. SIGMODRec. 22, 2 (1993),
257–266.
[10] DIRIK, C., AND JACOB, B. The performance of PC solid-state
disks (SSDs) as a function of bandwidth, concurrency, device
architecture, and system organization. In Proc. ISCA (2009),
pp. 279–289.
[11] DUNN, M., AND REDDY, A. L. N. A new I/O scheduler for
solid state devices. Tech. Rep. TAMU-ECE-2009-02, Depart-
ment of Electrical and Computer Engineering, Texas A&M Uni-
versity, 2009.
[12] ESFAHBOD, B. Preload—An adaptive prefetching daemon. Mas-
ter’s thesis, Graduate Department of Computer Science, Univer-
sity of Toronto, Canada, 2006.
[13] FENGGUANG, W., HONGSHENG, X., AND CHENFENG, X. On
the design of a new Linux readahead framework. SIGOPS Oper.
Syst. Rev. 42, 5 (2008), 75–84.
[14] GILL, B. S., AND MODHA, D. S. SARC: Sequential prefetching
in adaptive replacement cache. In Proc. USENIX (2005), pp. 293–
308.
[15] HUBERT, B. On faster application startup times: Cache stuff-
ing, seek profiling, adaptive preloading. In Proc. OLS (2005),
pp. 245–248.
[16] INTEL. Intel Turbo Memory with User Pinning. Intel,
http://www.intel.com/design/flash/nand/turbomemory/index.htm,
Last accessed on: 17 November 2010.
[17] JO, H., KIM, H., JEONG, J., LEE, J., AND MAENG, S. Optimiz-
ing the startup time of embedded systems: A case study of digital
TV. IEEE Trans. Consumer Electron. 55, 4 (2009), 2242–2247.
[18] JOO, Y., CHO, Y., LEE, K., AND CHANG, N. Improving ap-
plication launch times with hybrid disks. In Proc. CODES+ISSS
(2009), pp. 373–382.
[19] KAMINAGA, H. Improving Linux startup time using software
resume. In Proc. OLS (2006), pp. 25–34.
[20] KIM, H., AND AHN, S. BPLRU: A buffer management scheme
for improving random writes in flash storage. In Proc. FAST
(2008), pp. 1–14.
[21] KIM, J., OH, Y., KIM, E., CHOI, J., LEE, D., AND NOH, S. H.
Disk schedulers for solid state drivers. In Proc. EMSOFT (2009),
pp. 295–304.
[22] KIM, Y.-J., LEE, S.-J., ZHANG, K., AND KIM, J. I/O perfor-
mance optimization techniques for hybrid hard disk-based mobile
consumer devices. IEEE Trans. Consumer Electron. 53, 4 (2007),
1469–1476.
[23] KOTZ, D., AND ELLIS, C. S. Practical prefetching techniques
for parallel file systems. In Proc. PDIS (1991), pp. 182–189.
[24] LARUS, J. Spending Moore’s dividend. Commun. ACM 52, 5
(2009), 62–69.
[25] LICHOTA, K. Prefetch: Linux solution for prefetching necessary
data during application and system startup. http://code.google.
com/p/prefetch/, 2007.
[26] MATTHEWS, J., TRIKA, S., HENSGEN, D., COULSON, R.,
AND GRIMSRUD, K. Intel R©Turbo Memory: Nonvolatile disk
caches in the storage hierarchy of mainstream computer systems.
ACM Trans. Storage 4, 2 (2008), 1–24.
[27] MICROSOFT. Support and Q&A for Solid-State Drives. Mi-
crosoft, http://blogs.msdn.com/e7/archive/2009/05/05/support-
and-q-a-for-solid-state-drives-and.aspx, 2009.
[28] MICROSOFT. Windows PC Accelerators. http://www.microsoft.
com/whdc/system/sysperf/perfaccel.mspx, Last accessed on: 17
November 2010.
[29] MOWRY, T. C., DEMKE, A. K., AND KRIEGER, O. Automatic
compiler-inserted I/O prefetching for out-of-core applications. In
Proc. OSDI (1996), pp. 3–17.
[30] NEELAKANTH NADGIR. Reducing Application Startup Time
in the Solaris 8 OS. http://developers.sun.com/solaris/articles/
reducing app.html, 2002.
[31] PAI, R., PULAVARTY, B., AND CAO, M. Linux 2.6 perfor-
mance improvement through readahead optimization. In Proc.
OLS (2004), pp. 105–116.
[32] PAPATHANASIOU, A. E., AND SCOTT, M. L. Energy efficient
prefetching and caching. In Proc. USENIX (2004), pp. 22–22.
[33] PARK, C., KIM, K., JANG, Y., AND HYUN, K. Linux bootup
time reduction for digital still camera. In Proc. OLS (2006),
pp. 239–248.
[34] PARUSH, N., PELLEG, D., BEN-YEHUDA, M., AND TA-SHMA,
P. Out-of-band detection of boot-sequence termination events. In
Proc. ICAC (2009), pp. 71–72.
[35] PATTERSON, R. H., GIBSON, G. A., GINTING, E., STODOL-
SKY, D., AND ZELENKA, J. Informed prefetching and caching.
In Proc. SOSP (1995), pp. 79–95.
[36] RUSSINOVICH, M. E., AND SOLOMON, D. Microsoft Windows
Internals, 4th ed. Microsoft Press, 2004, pp. 458–462.
[37] SAMSUNG SEMICONDUCTOR. Samsung Hybrid Hard Drive.
http://www.samsung.com/global/business/semiconductor/support/
brochures/downloads/hdd/hd datasheet 200708.pdf, 2007.
[38] VANDEBOGART, S., FROST, C., AND KOHLER, E. Reducing
seek overhead with application-directed prefetching. In Proc.
USENIX (2009).
[39] VELLANKI, V., AND CHERVENAK, A. L. A cost-benefit scheme
for high performance predictive prefetching. In Proc. SC (1999).
[40] YANG, C.-K., MITRA, T., AND CHIUEH, T.-C. A decoupled
architecture for application-specific file prefetching. In Proc.
USENIX (2002), pp. 157–170.
14