
On Performance Stability in LSM-based Storage Systems

Chen Luo, University of California, Irvine

[email protected]

Michael J. Carey, University of California, Irvine

[email protected]

ABSTRACT

The Log-Structured Merge-Tree (LSM-tree) has been widely adopted for use in modern NoSQL systems for its superior write performance. Despite the popularity of LSM-trees, they have been criticized for suffering from write stalls and large performance variances due to the inherent mismatch between their fast in-memory writes and slow background I/O operations. In this paper, we use a simple yet effective two-phase experimental approach to evaluate write stalls for various LSM-tree designs. We further identify and explore the design choices of LSM merge schedulers to minimize write stalls given an I/O bandwidth budget. We have conducted extensive experiments in the context of the Apache AsterixDB system and we present the results here.

PVLDB Reference Format:
Chen Luo, Michael J. Carey. On Performance Stability in LSM-based Storage Systems. PVLDB, 13(4): 449-462, 2019.
DOI: https://doi.org/10.14778/3372716.3372719

1. INTRODUCTION

In recent years, the Log-Structured Merge-Tree (LSM-tree) [45, 41] has been widely used in modern key-value stores and NoSQL systems [2, 4, 5, 7, 14]. Different from traditional index structures, such as B+-trees, which apply updates in-place, an LSM-tree always buffers writes into memory. When memory is full, writes are flushed to disk and subsequently merged using sequential I/Os. To improve efficiency and minimize blocking, flushes and merges are often performed asynchronously in the background.

Despite their popularity, LSM-trees have been criticized for suffering from write stalls and large performance variances [3, 51, 57]. To illustrate this problem, we conducted a micro-experiment on RocksDB [7], a state-of-the-art LSM-based key-value store, to evaluate its write throughput on SSDs using the YCSB benchmark [24]. The instantaneous write throughput over time is depicted in Figure 1. As one can see, the write throughput of RocksDB periodically slows down after the first 300 seconds, which is when the system has to wait for background merges to catch up. Write stalls can significantly impact percentile write latencies and must be minimized to improve the end-user experience or to meet strict service-level agreements [35].

Figure 1: Instantaneous write throughput of RocksDB: writes are periodically stalled to wait for lagging merges

In this paper, we study the impact of write stalls and how to minimize write stalls for various LSM-tree designs. It should first be noted that some write stalls are inevitable. Due to the inherent mismatch between fast in-memory writes and slower background I/O operations, in-memory writes must be slowed down or stopped if background flushes or merges cannot catch up. Without such a flow control mechanism, the system will eventually run out of memory (due to slow flushes) or disk space (due to slow merges). Thus, it is not a surprise that an LSM-tree can exhibit large write stalls if one measures its maximum write throughput by writing data as quickly as possible, as we did in Figure 1.

This inevitability of write stalls does not necessarily limit the applicability of LSM-trees since in practice writes do not arrive as quickly as possible, but rather are controlled by the expected data arrival rate. The data arrival rate directly impacts the write stall behavior and resulting write latencies of an LSM-tree. If the data arrival rate is relatively low, then write stalls are unlikely to happen. However, it is also desirable to maximize the supported data arrival rate so that the system's resources can be fully utilized. Moreover, the expected data arrival rate is subject to an important constraint: it must be smaller than the processing capacity of the target LSM-tree. Otherwise, the LSM-tree will never be able to process writes as they arrive, causing infinite write latencies. Thus, to evaluate the write stalls of an LSM-tree, the first step is to choose a proper data arrival rate.

As the first contribution, we propose a simple yet effective approach to evaluate the write stalls of various LSM-tree designs by answering the following question: If we set the data arrival rate close to (e.g., 95% of) the maximum write throughput of an LSM-tree, will that cause write stalls? In other words, can a given LSM-tree design provide both a high write throughput and a low write latency? Briefly, the proposed approach consists of two phases: a testing phase and a running phase. During the testing phase, we experimentally measure the maximum write throughput of an LSM-tree by simply writing as much data as possible. During the running phase, we then set the data arrival rate close to the measured maximum write throughput as the limiting data arrival rate to evaluate its write stall behavior based on write latencies. If write stalls happen, the measured write throughput is not sustainable since it cannot be used in the long term due to the large latencies. However, if write stalls do not happen, then write stalls are no longer a problem since the given LSM-tree can provide a high write throughput with small performance variance.

Although this approach seems straightforward at first glance, two challenges must be addressed. First, how can we accurately measure the maximum sustainable write rate of an LSM-tree experimentally? Second, how can we best schedule LSM I/O operations so as to minimize write stalls at runtime? In the remainder of this paper, we will see that the merge scheduler of an LSM-tree can have a large impact on write stalls. As the second contribution, we identify and explore the design choices for LSM merge schedulers and present a new merge scheduler to address these two challenges.

As the paper's final contribution, we have implemented the proposed techniques and various LSM-tree designs inside Apache AsterixDB [14]. This enabled us to carry out extensive experiments to evaluate the write stalls of LSM-trees and the effectiveness of the proposed techniques using our two-phase evaluation approach. We argue that with proper tuning and configuration, LSM-trees can achieve both a high write throughput and small performance variance.

The remainder of this paper is organized as follows: Section 2 provides background on LSM-trees and discusses related work. Section 3 describes the general experimental setup used throughout this paper. Section 4 identifies the scheduling choices for LSM-trees and experimentally evaluates bLSM's merge scheduler [51]. Sections 5 and 6 present our techniques for minimizing write stalls for various LSM-tree designs. Section 7 summarizes the important lessons and insights from our evaluation. Finally, Section 8 concludes the paper. An extended version of this paper [42] further extends our evaluation to the size-tiered merge policy used in practical systems and to LSM-based secondary indexes.

2. BACKGROUND

2.1 Log-Structured Merge Trees

The LSM-tree [45] is a persistent index structure optimized for write-intensive workloads. In an LSM-tree, writes are first buffered into a memory component. An insert or update simply adds a new entry with the same key, while a delete adds an anti-matter entry indicating that a key has been deleted. When the memory component is full, it is flushed to disk to form a new disk component, within which entries are ordered by keys. Once flushed, LSM disk components are immutable.

A query over an LSM-tree has to reconcile the entries with identical keys from multiple components, as entries from newer components override those from older components. A point lookup query simply searches all components from newest to oldest until the first match is found. A range query searches all components simultaneously using a priority queue to perform reconciliation. To speed up point lookups, a common optimization is to build Bloom filters [19] over the sets of keys stored in disk components. If a Bloom filter reports that a key does not exist, then that disk component can be excluded from searching. As disk components accumulate, query performance tends to degrade since more components must be examined. To counter this, smaller disk components are gradually merged into larger ones. This is implemented by scanning old disk components to create a new disk component with unique entries. The decision of what disk components to merge is made by a pre-defined merge policy, which is discussed below.

Figure 2: LSM-tree Merge Policies. (a) Leveling merge policy: 1 component per level, each level T times larger; (b) Tiering merge policy: T components per level.

Merge Policy. Two types of LSM merge policies are commonly used in practice [41], both of which organize components into "levels". The leveling merge policy (Figure 2a) maintains one component per level, and a component at Level i+1 will be T times larger than that of Level i. As a result, the component at Level i will be merged multiple times with the component from Level i−1 until it fills up and is then merged into Level i+1. In contrast, the tiering merge policy (Figure 2b) maintains multiple components per level. When a Level i becomes full with T components, these T components are merged together into a new component at Level i+1. In both merge policies, T is called the size ratio, as it controls the maximum capacity of each level. We will refer to both of these merge policies as full merges since components are merged entirely.
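To make the size-ratio rule concrete, here is a minimal sketch (with hypothetical Java class and method names, not the code of any particular system) of when each policy would schedule a merge at Level i:

```java
import java.util.List;

// Minimal sketch of the size-ratio rule for full merges (hypothetical names).
final class MergePolicySketch {
    final long memoryComponentBytes; // M
    final int sizeRatio;             // T

    MergePolicySketch(long memoryComponentBytes, int sizeRatio) {
        this.memoryComponentBytes = memoryComponentBytes;
        this.sizeRatio = sizeRatio;
    }

    // The maximum capacity of Level i (i >= 1) is roughly M * T^i.
    long levelCapacityBytes(int level) {
        return memoryComponentBytes * (long) Math.pow(sizeRatio, level);
    }

    // Tiering: merge the T components of Level i into one component at Level i+1
    // once T components have accumulated at Level i.
    boolean tieringShouldMerge(List<Long> componentSizesAtLevel) {
        return componentSizesAtLevel.size() >= sizeRatio;
    }

    // Leveling: merge the single component of Level i into Level i+1
    // once Level i exceeds its capacity.
    boolean levelingShouldMerge(long componentSizeAtLevel, int level) {
        return componentSizeAtLevel >= levelCapacityBytes(level);
    }
}
```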

In general, the leveling merge policy optimizes for query performance by minimizing the number of components, but at the cost of write performance. This design also maximizes space efficiency, which measures the amount of space used for storing obsolete entries, by having most of the entries at the largest level. In contrast, the tiering merge policy is more write-optimized by reducing the merge frequency, but this leads to lower query performance and space utilization.

Partitioning. Partitioning is a commonly used optimization in modern LSM-based key-value stores that is often implemented together with the leveling merge policy, as pioneered by LevelDB [5]. In this optimization, a large LSM disk component is range-partitioned into multiple (often fixed-size) files. This bounds the processing time and the temporary space of each merge. An example of a partitioned LSM-tree with the leveling merge policy is shown in Figure 3, where each file is labeled with its key range. Note that partitioning starts from Level 1, as components in Level 0 are directly flushed from memory. To merge a file from Level i to Level i+1, all of its overlapping files at Level i+1 are selected and these files are merged to form new files at Level i+1. For example in Figure 3, the file labeled 0-50 at Level 1 will be merged with the files labeled 0-20 and 22-52 at Level 2, which produces new files labeled 0-15, 17-30, and 32-52 at Level 2. To select which file to merge next, LevelDB uses a round-robin policy.

Figure 3: Partitioned LSM-tree with Leveling Merge Policy
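For illustration, here is a minimal sketch of the overlap-selection step just described (hypothetical names; real systems track additional file metadata). The example in main() reproduces the Figure 3 scenario.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of overlap selection for partitioned (LevelDB-style) merges.
final class PartitionedMergeSketch {
    static final class FileRange {
        final long minKey, maxKey;
        FileRange(long minKey, long maxKey) { this.minKey = minKey; this.maxKey = maxKey; }
        boolean overlaps(FileRange other) {
            return this.minKey <= other.maxKey && other.minKey <= this.maxKey;
        }
    }

    // To merge `victim` from Level i into Level i+1, pick every file at
    // Level i+1 whose key range overlaps the victim's key range.
    static List<FileRange> selectOverlapping(FileRange victim, List<FileRange> nextLevel) {
        List<FileRange> inputs = new ArrayList<>();
        for (FileRange f : nextLevel) {
            if (victim.overlaps(f)) {
                inputs.add(f);
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        // Figure 3: file 0-50 at Level 1 overlaps files 0-20 and 22-52 at Level 2.
        FileRange victim = new FileRange(0, 50);
        List<FileRange> level2 = List.of(new FileRange(0, 20), new FileRange(22, 52),
                                         new FileRange(53, 75), new FileRange(80, 95));
        System.out.println(selectOverlapping(victim, level2).size()); // prints 2
    }
}
```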

Both full merges and partitioned merges are widely used in existing systems. Full merges are used in AsterixDB [1], Cassandra [2], HBase [4], ScyllaDB [8], Tarantool [9], and WiredTiger (MongoDB) [11]. Partitioned merges are used in LevelDB [5], RocksDB [7], and X-Engine [33].

Write Stalls in LSM-trees. Since in-memory writes are inherently faster than background I/O operations, writing to memory components sometimes must be stalled to ensure the stability of an LSM-tree, which, however, will negatively impact write latencies. This is often referred to as the write stall problem. If the incoming write speed is faster than the flush speed, writes will be stalled when all memory components are full. Similarly, if there are too many disk components, new writes should be stalled as well. In general, merges are the major source of write stalls since writes are flushed once but merged multiple times. Moreover, flush stalls can be avoided by giving higher I/O priority to flushes. In this paper, we thus focus on write stalls caused by merges.

2.2 Apache AsterixDB

Apache AsterixDB [1, 14, 22] is a parallel, semi-structured Big Data Management System (BDMS) that aims to manage large amounts of data efficiently. It supports a feed-based framework for efficient data ingestion [31, 55]. The records of a dataset in AsterixDB are hash-partitioned based on their primary keys across multiple nodes of a shared-nothing cluster; thus, a range query has to search all nodes. Each partition of a dataset uses a primary LSM-based B+-tree index to store the data records, while local secondary indexes, including LSM-based B+-trees, R-trees, and inverted indexes, can be built to expedite query processing. AsterixDB internally uses a variation of the tiering merge policy to manage disk components, similar to the one used in existing systems [4, 7]. Instead of organizing components into levels explicitly as in Figure 2b, AsterixDB's variation simply schedules merges based on the sizes of disk components. In this work, we do not focus on the LSM-tree implementation in AsterixDB but instead use AsterixDB as a common testbed to evaluate various LSM-tree designs.

2.3 Related Work

LSM-trees. Recently, a large number of improvements of the original LSM-tree [45] have been proposed. [41] surveys these improvements, ranging from improving write performance [18, 27, 28, 37, 39, 44, 47, 56], optimizing memory management [13, 17, 52, 58], supporting automatic tuning of LSM-trees [25, 26, 38], and optimizing LSM-based secondary indexes [40, 46], to extending the applicability of LSM-trees [43, 49]. However, all of these efforts have largely ignored performance variances and write stalls of LSM-trees.

Figure 4: bLSM's Spring-and-Gear Merge Scheduler

Several LSM-tree implementations seek to bound the write processing latency to alleviate the negative impact of write stalls [5, 7, 36, 57]. bLSM [51] proposes a spring-and-gear merge scheduler to avoid write stalls. As shown in Figure 4, bLSM has one memory component, C_0, and two disk components, C_1 and C_2. The memory component C_0 is continuously flushed and merged with C_1. When C_1 becomes full, a new C_1 component is created while the old C_1, which now becomes C'_1, will be merged with C_2. bLSM ensures that for each Level i, the progress of merging C'_i into C_{i+1} (denoted as out_i) will be roughly identical to the progress of the formation of a new C_i (denoted as in_i). This eventually limits the write rate for the memory component (in_0) and avoids blocking writes. However, we will see later that simply bounding the maximum write processing latency alone is insufficient, because a large variance in the processing rate can still cause large queuing delays for subsequent writes.
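As a rough illustration of the spring-and-gear idea, the sketch below scales the in-memory write rate by how far each level's fill progress (in_i) has run ahead of the merge progress draining it (out_i); the names and the exact scaling rule are hypothetical and are not taken from bLSM's code.

```java
// Rough sketch of spring-and-gear style throttling (hypothetical names; not bLSM's code).
final class SpringAndGearSketch {
    // Fraction [0,1] of the new C_i that has been formed ("in_i").
    final double[] inProgress;
    // Fraction [0,1] of the merge of C'_i into C_{i+1} that has completed ("out_i").
    final double[] outProgress;

    SpringAndGearSketch(int levels) {
        inProgress = new double[levels];
        outProgress = new double[levels];
    }

    // The memory component's write rate is scaled down when any level's
    // fill progress runs ahead of the merge progress that must absorb it.
    double allowedWriteFraction() {
        double allowed = 1.0;
        for (int i = 0; i < inProgress.length; i++) {
            double lag = inProgress[i] - outProgress[i]; // > 0 means merges are behind
            if (lag > 0) {
                allowed = Math.min(allowed, Math.max(0.0, 1.0 - lag));
            }
        }
        return allowed; // multiply the nominal in-memory write rate by this factor
    }
}
```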

Performance Stability. Performance stability has long been recognized as a critical performance metric. The TPC-C benchmark [10] measures not only absolute throughput, but also specifies the acceptable upper bounds for the percentile latencies. Huang et al. [35] applied VProfiler [34] to identify major sources of variance in database transactions. Various techniques have been proposed to optimize the variance of query processing [15, 16, 20, 23, 48, 54]. Cao et al. [21] found that variance is common in storage stacks and heavily depends on configurations and workloads. Dean and Barroso [29] discussed several engineering techniques to reduce performance variances at Google. Different from these efforts, in this work we focus on the performance variances of LSM-trees due to their inherent out-of-place update design.

3. EXPERIMENTAL METHODOLOGY

For ease of presentation, we will mix our techniques with a detailed performance analysis for each LSM-tree design. We now describe the general experimental setup and methodology for all experiments to follow.

3.1 Experimental Setup

All experiments were run on a single node with an 8-core Intel i7-7567U 3.5GHz CPU, 16 GB of memory, a 500GB SSD, and a 1TB 7200 rpm hard disk. We used the SSD for LSM storage and configured the hard disk for transaction logging due to its sufficiently high sequential throughput. We allocated 10GB of memory for the AsterixDB instance. Within that allocation, the buffer cache size was set at 2GB. Each LSM memory component had a 128MB budget, and each LSM-tree had two memory components to minimize stalls during flushes. Each disk component had a Bloom filter with a false positive rate setting of 1%. The data page size was set at 4KB to align with the SSD page size.

It is important to note that not all sources of performance variance can be eliminated [35]. For example, writing a key-value pair with a 1MB value inherently requires more work than writing one that only has 1KB. Moreover, short time periods with quickly occurring writes (workload bursts) will be much more likely to cause write stalls than a long period of slow writes, even though their long-term write rate may be the same. In this paper, we will focus on the avoidable variance [35] caused by the internal implementation of LSM-trees instead of variances in the workloads.

To evaluate the internal variances of LSM-trees, we adopt YCSB [24] as the basis for our experimental workload. Instead of using the pre-defined YCSB workloads, we designed our own workloads to better study the performance stability of LSM-trees. Each experiment first loads an LSM-tree with 100 million records, in random key order, where each record has size 1KB. It then runs for 2 hours to update the previously loaded LSM-tree. This ensures that the measured write throughput of an LSM-tree is stable over time. Unless otherwise noted, we used one writer thread for writing data to the LSM memory components. We evaluated two update workloads, where the updated keys follow either a uniform or Zipf distribution. The specific workload setups will be discussed in the subsequent sections.

We applied two commonly used I/O optimizations when implementing LSM-trees, namely I/O throttling and periodic disk forces. In all experiments, we throttled the SSD write speed of all LSM flush and merge operations to 100MB/s. This was implemented by using a rate limiter to inject artificial sleeps into SSD writes. This mechanism bounds the negative impact of the SSD writes on query performance and allows us to more fairly compare the performance differences of various LSM merge schedulers. We further had each flush or merge operation force its SSD writes after each 16MB of data. This helps to limit the OS I/O queue length, reducing the negative impact of SSD writes on queries. We have verified that disabling this optimization does not impact the performance trends of writes; however, large forces at the end of each flush and merge operation, which are required for durability, can significantly interfere with queries.
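The throttling and periodic forcing described above can be sketched as follows (hypothetical names, not AsterixDB's actual implementation); the limiter injects just enough sleep before each SSD write to keep the long-run rate at the configured budget.

```java
// Minimal sketch of write throttling with periodic forces (hypothetical names).
final class ThrottledWriterSketch {
    private final double bytesPerNano;   // e.g., 100MB/s expressed in bytes per nanosecond
    private final long forceEveryBytes;  // e.g., 16 * 1024 * 1024
    private long nextAllowedNanos = System.nanoTime();
    private long unforcedBytes = 0;

    ThrottledWriterSketch(long bytesPerSecond, long forceEveryBytes) {
        this.bytesPerNano = bytesPerSecond / 1e9;
        this.forceEveryBytes = forceEveryBytes;
    }

    // Called before each SSD write issued by a flush or merge: sleeps just long
    // enough that the long-run write rate stays at or below the budget.
    void acquire(int bytes) throws InterruptedException {
        long now = System.nanoTime();
        if (nextAllowedNanos < now) {
            nextAllowedNanos = now;                          // no backlog; start charging from now
        }
        long waitNanos = nextAllowedNanos - now;
        nextAllowedNanos += (long) (bytes / bytesPerNano);   // time this write "costs"
        if (waitNanos > 0) {
            Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
        }
        unforcedBytes += bytes;
    }

    // Returns true when the caller should force (fsync) the file it is writing,
    // i.e., after every forceEveryBytes bytes of data.
    boolean shouldForce() {
        if (unforcedBytes >= forceEveryBytes) {
            unforcedBytes -= forceEveryBytes;
            return true;
        }
        return false;
    }
}
```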

3.2 Performance Metrics

To quantify the impact of write stalls, we will not only present the write throughput of LSM-trees but also their write latencies. However, there are different models for measuring write latencies. Throughout the paper, we will use arrival rate to denote the rate at which writes are submitted by clients, processing rate to denote the rate at which writes can be processed by an LSM-tree, and write throughput to denote the number of writes processed by an LSM-tree per unit of time. The difference between the write throughput and the arrival/processing rates is discussed further below.

The bLSM paper [51], as well as most of the existing LSM research, used the experimental setup depicted in Figure 5a to write as much data as possible and measure the latency of each write. In this closed system setup [32], the processing rate essentially controls the arrival rate, which in turn equals the write throughput. Although this model is sufficient for measuring the maximum write throughput of LSM-trees, it is not suitable for characterizing their write latencies, for several reasons. First, writing to memory is inherently faster than background I/Os, so an LSM-tree will always have to stall writes in order to wait for lagged flushes and merges. Moreover, under this model, a client cannot submit its next write until its current write is completed. Thus, when the LSM-tree is stalled, only a small number of ongoing writes will actually experience a large latency since the remaining writes have not been submitted yet.¹

Figure 5: Models for Measuring Write Latency. (a) Closed system: clients submit writes back-to-back, so arrival rate = processing rate = write throughput. (b) Open system: writes arrive at an external arrival rate and are queued until the LSM-tree can process them.

In practice, a DBMS generally cannot control how quickly writes are submitted by external clients, nor will their writes always arrive as fast as possible. Instead, the arrival rate is usually independent from the processing rate, and when the system is not able to process writes as fast as they arrive, the newly arriving writes must be temporarily queued. In such an open system (Figure 5b), the measured write latency includes both the queuing latency and the processing latency. Moreover, an important constraint is that the arrival rate must be smaller than the processing rate since otherwise the queue length will be unbounded. Thus, the (overall) write throughput is actually determined by the arrival rate.

A simple example will illustrate the important difference between these two models. Suppose that 5 clients are used to generate an intended arrival rate of 1000 writes/s and that the LSM-tree stalls for 1 second. Under the closed system model (Figure 5a), only 5 delayed writes will experience a write latency of 1s since the remaining (intended) 995 writes simply will not occur. However, under the open system model (Figure 5b), all 1000 writes will be queued and their average latency will be at least 0.5s.
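Spelling out the arithmetic behind this example (a back-of-the-envelope bound, not a formal queuing analysis): a write arriving t seconds into the 1-second stall must wait at least (1 − t) seconds before processing can resume, so the average wait over the 1000 queued writes is at least

  (1 / 1 s) · ∫₀¹ (1 − t) dt = 0.5 s,

and the true average is somewhat higher because the backlog still has to drain after the stall ends.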

To evaluate write latencies in an open system, one must first set the arrival rate properly since the write latency heavily depends on the arrival rate. It is also important to maximize the arrival rate to maximize the system's utilization. For these reasons, we propose a two-phase evaluation approach with a testing phase and a running phase. During the testing phase, we use the closed system model (Figure 5a) to measure the maximum write throughput of an LSM-tree, which is also its processing rate. When measuring the maximum write throughput, we excluded the initial 20-minute period (out of 2 hours) of the testing phase since the initially loaded LSM-tree has a relatively small number of disk components at first. During the running phase, we use the open system model (Figure 5b) to evaluate the write latencies under a constant arrival rate set at 95% of the measured maximum write throughput. Based on queuing theory [32], the queuing time approaches infinity when the utilization, which is the ratio between the arrival rate and the processing rate, approaches 100%. We thus empirically determine a high utilization load (95%) while leaving some room for the system to absorb variance. If the running phase then reports large write latencies, the maximum write throughput as determined in the testing phase is not sustainable; we must improve the implementation of the LSM-tree or reduce the expected arrival rate to reduce the latencies. In contrast, if the measured write latency is small, then the given LSM-tree can provide a high write throughput with a small performance variance.

¹ The original release of the YCSB benchmark [24] mistakenly used this model; this was corrected later in 2015 [12].
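A minimal sketch of how such a two-phase driver could be wired up (hypothetical interface and method names; the actual experiments drive AsterixDB with the YCSB-derived workloads described above):

```java
import java.util.List;
import java.util.concurrent.locks.LockSupport;

// Sketch of the two-phase evaluation driver (hypothetical interface and names).
final class TwoPhaseEvaluationSketch {
    interface LsmStore { void write(long key, byte[] value); } // stand-in for the system under test

    // Testing phase (closed system): write back-to-back; the measured rate
    // is the store's processing rate, i.e., its maximum write throughput.
    static double measureMaxThroughput(LsmStore store, long durationMillis) {
        long ops = 0;
        long end = System.currentTimeMillis() + durationMillis;
        while (System.currentTimeMillis() < end) {
            store.write(ops++, new byte[1024]); // 1KB records, as in the workload setup
        }
        return ops * 1000.0 / durationMillis;   // records per second
    }

    // Running phase (open system): writes arrive at a constant rate set to 95% of
    // the measured maximum; latency is measured from each write's intended arrival
    // time, so it includes queuing time as well as processing time.
    static void runAtConstantArrivalRate(LsmStore store, double maxThroughput,
                                         long durationMillis, List<Long> latenciesNanos) {
        long interArrivalNanos = (long) (1e9 / (0.95 * maxThroughput));
        long nextArrival = System.nanoTime();
        long end = System.currentTimeMillis() + durationMillis;
        long key = 0;
        while (System.currentTimeMillis() < end) {
            store.write(key++, new byte[1024]);
            latenciesNanos.add(System.nanoTime() - nextArrival); // queuing + processing
            nextArrival += interArrivalNanos;                    // next intended arrival
            long sleepNanos = nextArrival - System.nanoTime();
            if (sleepNanos > 0) {
                LockSupport.parkNanos(sleepNanos);               // wait for the next arrival
            }
        }
    }
}
```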

4. LSM MERGE SCHEDULER

Different from a merge policy, which decides which components to merge, a merge scheduler is responsible for executing the merge operations created by the merge policy. In this section, we discuss the design choices for a merge scheduler and evaluate bLSM's spring-and-gear merge scheduler.

4.1 Scheduling Choices

The write cost of an LSM-tree, which is the number of I/Os per write, is determined by the LSM-tree design itself and the workload characteristics, but not by how merges are executed [25]. Thus, a merge scheduler will have little impact on the overall write throughput of an LSM-tree as long as the allocated I/O bandwidth budget can be fully utilized. However, different scheduling choices can significantly impact the write stalls of an LSM-tree, and merge schedulers must be carefully designed to minimize write stalls. We have identified the following design choices for a merge scheduler.

Component Constraint: A merge scheduler usually specifies an upper-bound constraint on the total number of components allowed to accumulate before incoming writes to the LSM memory components should be stalled. We call this the component constraint. For example, bLSM [51] allows at most two disk components per level, while other systems like HBase [4] or Cassandra [2] specify the total number of disk components across all levels.

Interaction with Writes: There exist different strategies to enforce a given component constraint. One strategy is to simply stop processing writes once the component constraint is violated. Alternatively, the processing of writes can be degraded gracefully based on the merge pressure [51].

Degree of Concurrency: In general, an LSM-tree can often create multiple merge operations in the same time period. A merge scheduler should decide how these merge operations should be scheduled. Allowing concurrent merges will enable merges at multiple levels to proceed concurrently, but they will also compete for CPU and I/O resources, which can negatively impact query performance [13]. As two examples, bLSM [51] allows one merge operation per level, while LevelDB [5] uses just one single background thread to execute all merges one by one.

I/O Bandwidth Allocation: Given multiple concurrent merge operations, the merge scheduler should further decide how to allocate the available I/O bandwidth among these merge operations. A commonly used heuristic is to allocate I/O bandwidth "fairly" (evenly) to all active merge operations. Alternatively, bLSM [51] allocates I/O bandwidth based on the relative progress of the merge operations to ensure that merges at each level all make steady progress.
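As a concrete illustration of the fair heuristic (a sketch with hypothetical names, not any specific system's code), the I/O budget is simply re-divided whenever the set of active merges changes:

```java
import java.util.List;

// Sketch of even ("fair") I/O bandwidth allocation across active merges (hypothetical names).
final class FairAllocationSketch {
    static final class MergeOp {
        long rateLimitBytesPerSec; // the throttle applied to this merge's writes
    }

    // Re-run whenever a merge starts or finishes so every active merge
    // proceeds at budget / k, where k is the number of active merges.
    static void reallocate(List<MergeOp> activeMerges, long ioBudgetBytesPerSec) {
        if (activeMerges.isEmpty()) {
            return;
        }
        long share = ioBudgetBytesPerSec / activeMerges.size();
        for (MergeOp op : activeMerges) {
            op.rateLimitBytesPerSec = share;
        }
    }
}
```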

4.2 Evaluation of bLSM

Due to the implementation complexity of bLSM and its dependency on a particular storage system, Stasis [50], we chose to directly evaluate the released version of bLSM [6]. bLSM uses the leveling merge policy with two on-disk levels. We set its memory component size to 1GB and its size ratio to 10 so that the experimental dataset with 100 million records can fit into the last level. We used 8 write threads to maximize the write throughput of bLSM.

Testing Phase. During the testing phase, we measured the maximum write throughput of bLSM by writing as much data as possible using both the uniform and Zipf update workloads. The instantaneous write throughput of bLSM under these two workloads is shown in Figure 6a. For readability, the write throughput is averaged over 30-second windows. (Unless otherwise noted, the same aggregation applies to all later experiments as well.)

Even though bLSM's merge scheduler prevents writes from being stalled, the instantaneous write throughput still exhibits a large variance with regular temporary peaks. Recall that bLSM uses the merge progress at each level to control its in-memory write speed. After the component C_1 is full and becomes C'_1, the original C_1 will be empty and will have much shorter merge times. This temporarily increases the in-memory write speed of bLSM, which then quickly drops as C_1 grows larger and larger. Moreover, the Zipf update workload increases the write throughput only because updated entries can be reclaimed earlier; the overall performance variance trends are still the same.

Running Phase. Based on the maximum write throughput measured in the testing phase, we then used a constant data arrival process (95% of the maximum) in the running phase to evaluate bLSM's behavior. Figure 6b shows the instantaneous write throughput of bLSM under the uniform and Zipf update workloads. bLSM maintains a sustained write throughput during the initial period of the experiment, but later has to slow down its in-memory write rate periodically due to background merge pressure. Figure 6c further shows the resulting percentile write and processing latencies. The processing latency measures only the time for the LSM-tree to process a write, while the write latency includes both the write's queuing time and processing time. By slowing down the in-memory write rate, bLSM indeed bounds the processing latency. However, the write latency is much larger because writes must be queued when they cannot be processed immediately. This suggests that simply bounding the maximum processing latency is far from sufficient; it is important to minimize the variance in an LSM-tree's processing rate to minimize write latencies.

5. FULL MERGES

In this section, we explore the scheduling choices of LSM-trees with full merges and then evaluate the impact of merge scheduling on write stalls using our two-phase approach.

5.1 Merge Scheduling for Full Merges

We first introduce some useful notation for use throughout our analysis in Table 1. To simplify the analysis, we will ignore the I/O cost of flushes since merges consume most of the I/O bandwidth.

Table 1: List of notation used in this paper

Term  Definition                            Unit
T     size ratio of the merge policy        -
L     the number of levels in an LSM-tree   -
M     memory component size                 entries
B     I/O bandwidth                         entries/s
µ     write arrival rate                    entries/s
W     write throughput of an LSM-tree       entries/s

Figure 6: Two-Phase Evaluation of bLSM. (a) Testing Phase: Instantaneous Write Throughput (Maximum); (b) Running Phase: Instantaneous Write Throughput (95% Load); (c) Running Phase: Percentile Write Latencies (95% Load)

5.1.1 Component Constraint

To provide acceptable query performance and space utilization, the total number of disk components of an LSM-tree must be bounded. We call this upper bound the component constraint, and it can be enforced either locally or globally. A local constraint specifies the maximum number of disk components per level. For example, bLSM [51] uses a local constraint to allow at most two components per level. A global constraint instead specifies the maximum number of disk components across all levels. Here we argue that global component constraints will better minimize write stalls. In addition to external factors, such as deletes or shifts in write patterns, the merge time at each level inherently varies for leveling since the size of the component at Level i varies from 0 to (T − 1)·M·T^(i−1). Because of this, bLSM cannot provide a high yet stable write throughput over time. Global component constraints will better absorb this variance and minimize the write stalls.

It remains a question how to determine the maximum number of disk components for the component constraint. In general, tolerating more disk components will increase the LSM-tree's ability to reduce write stalls and absorb write bursts, but it will decrease query performance and space utilization. Given the negative impact of stalls on write latencies, one solution is to tolerate a sufficient number of disk components to avoid write stalls while the worst-case query performance and space utilization are still bounded. For example, one conservative constraint would be to tolerate twice the expected number of disk components, e.g., 2·L components for leveling and 2·T·L components for tiering.
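For example, the conservative global constraint above can be computed and checked as follows (a sketch; the names are hypothetical):

```java
// Sketch of a global component constraint (hypothetical names).
final class ComponentConstraintSketch {
    enum Policy { LEVELING, TIERING }

    // Conservative bound: twice the expected number of disk components,
    // i.e., 2*L for leveling and 2*T*L for tiering.
    static int maxDiskComponents(Policy policy, int numLevels, int sizeRatio) {
        int expected = (policy == Policy.LEVELING) ? numLevels : sizeRatio * numLevels;
        return 2 * expected;
    }

    // Writes to the memory components are stalled only while the total number
    // of disk components (across all levels) exceeds the global bound.
    static boolean shouldStallWrites(int totalDiskComponents, int maxDiskComponents) {
        return totalDiskComponents >= maxDiskComponents;
    }
}
```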

5.1.2 Interaction with Writes

When the component constraint is violated, the processing of writes by an LSM-tree has to be slowed down or stopped. Existing LSM-tree implementations [5, 7, 51] prefer to gracefully slow down the in-memory write rate by adding delays to some writes. This approach reduces the maximum processing latency, as large pauses are broken down into many smaller ones, but the overall processing rate of an LSM-tree, which depends on the I/O cost of each write, is not affected. Moreover, this approach will result in an even larger queuing latency. There may be additional considerations for gracefully slowing down writes, but we argue that processing writes as quickly as possible minimizes the overall write latency, as stated by the following theorem. See [42] for detailed proofs of all theorems.

Theorem 1. Given any data arrival process and any LSM-tree, processing writes as quickly as possible minimizes the latency of each write.

Proof Sketch. Consider two merge schedulers S and S' that differ only in that S may add arbitrary delays to writes while S' processes writes as quickly as possible. For each write request r, r must be completed by S' no later than by S, because the LSM-tree has the same processing rate under both schedulers but S adds delays to some writes.

It should be noted that Theorem 1 only considers write latencies. By processing writes as quickly as possible, disk components can stack up more quickly (up to the component constraint), which may negatively impact query performance. Thus, a better approach may be to increase the write processing rate, e.g., by changing the structure of the LSM-tree. We leave the exploration of this direction as future work.

5.1.3 Degree of Concurrency

A merge policy can often create multiple merge operations simultaneously. For full merges, we can show that a single-threaded scheduler that executes one merge at a time is not sufficient for minimizing write stalls. Consider a merge operation at Level i. For leveling, the merge time varies from 0 to M·T^i / B, because the size of the component at Level i varies from 0 to (T − 1)·M·T^(i−1). For tiering, each component has size M·T^(i−1), and merging T components thus takes time M·T^i / B. Suppose the arrival rate is µ. Without concurrent merges, there would be (µ/M) · (M·T^i / B) = µ·T^i / B newly flushed components added while this merge operation is being executed, assuming that flushes can still proceed.

Our two-phase evaluation approach chooses the maximum write throughput of an LSM-tree as the arrival rate µ. For leveling, the maximum write throughput is approximately W_level = 2·B / (T·L), as each entry is merged T/2 times per level. For tiering, the maximum write throughput is approximately W_tier = B / L, as each entry is merged only once per level. By substituting W_level and W_tier for µ, one needs to tolerate at least 2·T^(i−1) / L flushed components for leveling and T^i / L flushed components for tiering to avoid write stalls. Since the term T^i grows exponentially, a large number of flushed components will have to be tolerated when a large disk component is being merged. Consider the leveling merge policy with a size ratio of 10. To merge a disk component at Level 5, approximately 2·10^4 / 5 = 4000 flushed components would need to be tolerated, which is highly unacceptable.
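To make the exponential growth tangible, the following small calculation applies the formulas above and reproduces the Level-5 example (a sketch; L denotes the total number of levels):

```java
// Number of flushed components that must be tolerated while a Level-i merge runs,
// per the analysis above: 2*T^(i-1)/L for leveling and T^i/L for tiering.
final class ConcurrencyBoundSketch {
    static double toleratedFlushesLeveling(int sizeRatio, int level, int numLevels) {
        return 2.0 * Math.pow(sizeRatio, level - 1) / numLevels;
    }

    static double toleratedFlushesTiering(int sizeRatio, int level, int numLevels) {
        return Math.pow(sizeRatio, level) / numLevels;
    }

    public static void main(String[] args) {
        // Leveling, T = 10, L = 5: a merge at Level 5 requires tolerating ~4000 flushed components.
        System.out.println(toleratedFlushesLeveling(10, 5, 5)); // prints 4000.0
    }
}
```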

Clearly, concurrent merges must be performed to minimize write stalls. When a large merge is being processed, smaller merges can still be completed to reduce the number of components. By the definition of the tiering and leveling merge policies, there can be at most one active merge operation per level. Thus, given an LSM-tree with L levels, at most L merge operations can be scheduled concurrently.

5.1.4 I/O Bandwidth Allocation

Given multiple active merge operations, the merge scheduler must further decide how to allocate I/O bandwidth to these operations. A heuristic used by existing systems [2, 4, 7] is to allocate I/O bandwidth fairly (evenly) to all ongoing merges. We call this the fair scheduler. The fair scheduler ensures that all merges at different levels can proceed, thus eliminating potential starvation. Recall that write stalls occur when an LSM-tree has too many disk components, thus violating the component constraint. It is unclear whether or not the fair scheduler can minimize write stalls by minimizing the number of disk components over time.

Recall that both the leveling and tiering merge policies always merge the same number of disk components at once. We propose a novel greedy scheduler that always allocates the full I/O bandwidth to the merge operation with the smallest remaining number of bytes. The greedy scheduler has a useful property: it minimizes the number of disk components over time for a given set of merge operations.

Theorem 2. Given any set of merge operations that process the same number of disk components and any I/O bandwidth budget, the greedy scheduler minimizes the number of disk components at any time instant.

Proof Sketch. Consider an arbitrary scheduler S and the greedy scheduler S'. Given N merge operations, we can show that S' always completes the i-th (1 ≤ i ≤ N) merge operation no later than S. This can be done by noting that S' always processes the smallest merge operation first.

Theorem 2 only considers a set of statically created merge operations. This conclusion may not hold in general because sometimes completing a large merge may enable the merge policy to create smaller merges, which can then reduce the number of disk components more quickly. Because of this, there actually exists no merge scheduler that can always minimize the number of disk components over time, as stated by the following theorem. However, as we will see in our later evaluation, the greedy scheduler is still a very effective heuristic for minimizing write stalls.

Theorem 3. Given any I/O bandwidth budget, no merge scheduler can minimize the number of disk components at any time instant for any data arrival process and any LSM-tree, for a deterministic merge policy where all merge operations process the same number of disk components.

Proof Sketch. Consider an LSM-tree that has created a small merge M_S and a large merge M_L. Completing M_L allows the LSM-tree to create a new small merge M'_S that is smaller than M_S. Consider two merge schedulers S_1 and S_2, where S_1 first processes M_S and then M_L, and S_2 first processes M_L and then M'_S. It can be shown that S_1 has the earliest completion time for the first merge and S_2 has the earliest completion time for the second merge, but no merge scheduler can outperform both S_1 and S_2.

5.1.5 Putting Everything Together

Based on the discussion of each scheduling choice, we now summarize the proposed greedy scheduler. The greedy scheduler enforces a global component constraint with a sufficient number of disk components, e.g., twice the expected number of components of an LSM-tree, to minimize write stalls while ensuring the stability of the LSM-tree. It processes writes as quickly as possible and only stops the processing of writes when the component constraint is violated. The greedy scheduler performs concurrent merges but allocates the full I/O bandwidth to the merge operation with the smallest remaining bytes. Whenever a merge operation is created or completed, the greedy scheduler is notified to find the smallest merge to execute next. Thus, a large merge can be interrupted by a newly created smaller merge. However, in general one cannot exactly know which merge operation requires the least amount of I/O bandwidth until the new component is fully produced. To handle this, the smallest merge operation can be approximated by using the number of remaining pages of the merging components.
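A minimal sketch of the greedy selection logic just summarized (hypothetical names; as noted above, the remaining bytes of a merge are approximated from the remaining pages of its input components):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of greedy merge scheduling (hypothetical names): the pending or running
// merge with the fewest remaining bytes receives the full I/O bandwidth.
final class GreedySchedulerSketch {
    static final class MergeOp {
        long remainingBytes; // approximated from the remaining pages of the input components
        boolean hasBandwidth;
        MergeOp(long remainingBytes) { this.remainingBytes = remainingBytes; }
    }

    private final PriorityQueue<MergeOp> waiting =
            new PriorityQueue<>(Comparator.comparingLong((MergeOp op) -> op.remainingBytes));
    private MergeOp current; // the merge currently holding the full I/O budget

    // Called whenever the merge policy creates a new merge operation.
    synchronized void onMergeCreated(MergeOp created) {
        waiting.add(created);
        rescheduleIfNeeded();
    }

    // Called whenever the currently running merge completes.
    synchronized void onCurrentMergeCompleted() {
        current = null;
        rescheduleIfNeeded();
    }

    private void rescheduleIfNeeded() {
        MergeOp smallest = waiting.peek();
        if (smallest == null) {
            return;
        }
        if (current == null || smallest.remainingBytes < current.remainingBytes) {
            if (current != null) {
                // A newly created smaller merge preempts the current one,
                // which goes back to waiting for bandwidth.
                current.hasBandwidth = false;
                waiting.add(current);
            }
            current = waiting.poll();
            current.hasBandwidth = true; // this merge now gets the full I/O budget
        }
    }
}
```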

Under the greedy scheduler, larger merges may be starved at times since they receive lower priority. This has a few implications. First, during normal user workloads, such starvation can only occur if the arrival rate is temporarily faster than the processing rate of an LSM-tree. Given the negative impact of write stalls on write latencies, it can actually be beneficial to temporarily delay large merges so that the system can better absorb write bursts. Second, the greedy scheduler should not be used in the testing phase because it would report a higher but unsustainable write throughput due to such starved large merges.

Finally, our discussions of the greedy scheduler as well as the single-threaded scheduler are based on an important assumption that a single merge operation is able to fully utilize the available I/O bandwidth budget. Otherwise, multiple merges must be executed at the same time. It is straightforward to extend the greedy scheduler to execute the smallest k merge operations, where k is the degree of concurrency needed to fully utilize the I/O bandwidth budget.

5.2 Experimental Evaluation

We now experimentally evaluate the write stalls of LSM-trees using our two-phase approach. We discuss the specific experimental setup followed by the detailed evaluation, including the impact of merge schedulers on write stalls, the benefit of enforcing the component constraint globally and of processing writes as quickly as possible, and the impact of merge scheduling on query performance.

5.2.1 Experimental Setup

All experiments in this section were performed using AsterixDB with the general setup described in Section 3. Unless otherwise noted, the size ratio of leveling was set at 10, which is a commonly used configuration in practice [5, 7]. For the experimental dataset with 100 million unique records, this results in a three-level LSM-tree, where the last level is nearly full. For tiering, the size ratio was set at 3, which leads to better write performance than leveling without sacrificing too much on query performance. This ratio results in an eight-level LSM-tree.

We evaluated the single-threaded scheduler (Section 5.1.3), the fair scheduler (Section 5.1.4), and the proposed greedy scheduler (Section 5.1.5). The single-threaded scheduler only executes one merge at a time using a single thread. Both the fair and greedy schedulers are concurrent schedulers that execute each merge using a separate thread. The difference is that the fair scheduler allocates the I/O bandwidth to all ongoing merges evenly, while the greedy scheduler always allocates the full I/O bandwidth to the smallest merge. To minimize flush stalls, a flush operation is always executed in a separate thread and receives higher I/O priority. Unless otherwise noted, all three schedulers enforce global component constraints and process writes as quickly as possible. The maximum number of disk components is set at twice the expected number of disk components for each merge policy. Each experiment was performed under both the uniform and Zipf update workloads. Since the Zipf update workload had little impact on the overall performance trends, except that it led to higher write throughput, its results are omitted here for brevity.

Figure 7: Testing Phase: Instantaneous Write Throughput. (a) Tiering Merge Policy; (b) Leveling Merge Policy

5.2.2 Testing Phase

During the testing phase, we measured the maximum write throughput of an LSM-tree by writing as much data as possible. In general, alternative merge schedulers have little impact on the maximum write throughput since the I/O bandwidth budget is fixed, but their measured write throughput may be different due to the finite experimental period.

Figures 7a and 7b show the instantaneous write throughput of LSM-trees using different merge schedulers for tiering and leveling. Under both merge policies, the single-threaded scheduler regularly exhibits long pauses, making its write throughput vary over time. The fair scheduler exhibits a relatively stable write throughput over time since all merge levels can proceed at the same rate. With leveling, its write throughput still varies slightly over time since the component size at each level varies. The greedy scheduler appears to achieve a higher write throughput than the fair scheduler by starving large merges. However, this higher write throughput eventually drops when no small merges can be scheduled. For example, the write throughput with tiering drops slightly at 1100s and 4000s, and there is a long pause from 6000s to 7000s with leveling. This result confirms that the fair scheduler is more suitable for testing the maximum write throughput of an LSM-tree, as merges at all levels can proceed at the same rate. In contrast, the single-threaded scheduler incurs many long pauses, causing a large variance in the measured write throughput. The greedy scheduler provides a higher write throughput by starving large merges, which would be undesirable at runtime.

5.2.3 Running Phase

Turning to the running phase, we used a constant data arrival process, configured based on 95% of the maximum write throughput measured by the fair scheduler, to evaluate the write stalls of LSM-trees.

LSM-trees can provide a stable write throughput. We first evaluated whether LSM-trees with different merge schedulers can support a high write throughput with low write latencies. For each experiment, we measured the instantaneous write throughput and the number of disk components over time, as well as percentile write latencies.

The results for tiering are shown in Figure 8. Both the fair and greedy schedulers are able to provide stable write throughputs, and the total number of disk components never reaches the configured threshold. The greedy scheduler also minimizes the number of disk components over time. The single-threaded scheduler, however, causes a large number of write stalls due to the blocking of large merges, which confirms our previous analysis. Because of this, the single-threaded scheduler incurs large percentile write latencies. In contrast, both the fair and greedy schedulers provide small write latencies because of their stable write throughput. Figure 9 shows the corresponding results for leveling. The single-threaded scheduler again performs poorly, causing a lot of stalls and thus large write latencies. Due to the inherent variance of merge times, the fair scheduler alone cannot provide a stable write throughput; this results in relatively large write latencies. In contrast, the greedy scheduler avoids write stalls by always minimizing the number of components, which results in small write latencies.

This experiment confirms that LSM-trees can achieve a stable write throughput with a relatively small performance variance. Moreover, the write stalls of an LSM-tree heavily depend on the design of the merge scheduler.

Impact of Size Ratio. To verify our findings on LSM-trees with different shapes, we further carried out a set of experiments by varying the size ratio from 2 to 10 for both tiering and leveling. For leveling, we applied the dynamic level size optimization [30] so that the largest level remains almost full, by slightly modifying the size ratio between Levels 0 and 1. This optimization maximizes space utilization without impacting write or query performance.

During the testing phase, we measured the maximum writethroughput for each LSM-tree configuration using the fairscheduler, which is shown in Figure 10a. In general, alarger size ratio increases write throughput for tiering butdecreases write throughput for leveling because it decreasesthe merge frequency of tiering but increases that of level-ing. During the running phase, we evaluated the 99% per-centile write latency for each LSM-tree configuration usingconstant data arrivals, which is shown in Figure 10b. Withtiering, both the fair and greedy schedulers are able to pro-vide a stable write throughput with small write latencies.With leveling, the fair scheduler causes large write latencieswhen the size ratio becomes larger, as we have seen before.In contrast, the greedy scheduler is always able to providea stable write throughput along with small write latencies.This again confirms that LSM-trees, despite their size ratios,

456

Page 9: On Performance Stability in LSM-based Storage Systems · LSM disk component is range-partitioned into multiple (of-ten xed-size) les. This bounds the processing time and the temporary

0 1800 3600 5400 7200Elapsed Time (s)

0

20

40W

rite

Thro

ughp

ut (k

ops/

s) (a) Instantaneous Write Throughput

single schedulerfair schedulergreedy scheduler

0 1800 3600 5400 7200Elapsed Time (s)

0

20

40

Num

of D

isk

Com

pone

nts (b) Number of Disk Components

single schedulerfair schedulergreedy scheduler

50 90 95 99 99.9 99.99Percentile (%)

10 2

101

Writ

e La

tenc

y (s

)

(c) Percentile Write Latencies

single schedulerfair schedulergreedy scheduler

Figure 8: Running Phase of Tiering Merge Policy (95% Load)

Figure 9: Running Phase of Leveling Merge Policy (95% Load). (a) Instantaneous Write Throughput; (b) Number of Disk Components; (c) Percentile Write Latencies. Each panel compares the single, fair, and greedy schedulers.

Figure 10: Impact of Size Ratio on Write Stalls. (a) Testing Phase: Maximum Write Throughput (leveling vs. tiering); (b) Running Phase: 99th Percentile Write Latency (leveling and tiering, each with the fair and greedy schedulers).

Figure 11: Impact of Enforcing Component Constraints on Percentile Write Latencies. (a) Tiering Merge Policy; (b) Leveling Merge Policy. Each panel compares local and global constraints under the fair and greedy schedulers.


Benefit of Global Component Constraints. We next evaluated the benefit of global component constraints in terms of minimizing write stalls. We additionally included a variation of the fair and greedy schedulers that enforces local component constraints, that is, 2 components per level for leveling and 2 · T components per level for tiering.
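The difference between the two variants boils down to where the component count is bounded. The stall check can be sketched as follows, using a simplified class of our own and illustrative limit values; the actual AsterixDB implementation differs:

```python
class ComponentConstraint:
    """Decide whether incoming writes must stall, based on either a local
    per-level bound (e.g., 2 components per level for leveling, 2*T for
    tiering) or a single global bound on the total number of disk components."""
    def __init__(self, mode, per_level_limit=None, global_limit=None):
        self.mode = mode                  # "local" or "global"
        self.per_level_limit = per_level_limit
        self.global_limit = global_limit

    def writes_must_stall(self, components_per_level):
        if self.mode == "global":
            return sum(components_per_level) > self.global_limit
        return any(n > self.per_level_limit for n in components_per_level)

# Leveling example: a transient pile-up at Level 0 violates the local bound,
# but a global budget can still absorb it.
local_check = ComponentConstraint("local", per_level_limit=2)
global_check = ComponentConstraint("global", global_limit=8)
print(local_check.writes_must_stall([3, 1, 1, 1]))   # True
print(global_check.writes_must_stall([3, 1, 1, 1]))  # False
```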

The resulting write latencies are shown in Figure 11. In general, local component constraints have little impact on tiering since its merge time per level is relatively stable. However, the resulting write latencies for leveling become much larger due to the inherent variance of its merge times. Moreover, local component constraints have a larger negative impact on the greedy scheduler. The greedy scheduler prefers small merges, which may not be able to complete due to possible violations of the constraint at the next level. This in turn causes longer stalls and thus larger percentile write latencies. In contrast, global component constraints better absorb these variances, reducing the write latencies.

Figure 12: Running Phase with Burst Data Arrivals. (a) Instantaneous Write Throughput; (b) Percentile Write Latencies. Each panel compares the "Limit" and "No Limit" variations.


Benefits of Processing Writes As Quickly As Possible. We further evaluated the benefit of processing writes as quickly as possible. We used the leveling merge policy with a bursty data arrival process that alternates between a normal arrival rate of 2000 records/s for 25 minutes and a high arrival rate of 8000 records/s for 5 minutes. We evaluated two variations of the greedy scheduler. The first variation processes writes as quickly as possible (denoted as "No Limit"), as we did before. The second variation enforces a maximum in-memory write rate of 4000 records/s (denoted as "Limit") to avoid write stalls.
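The two variations differ only in whether a cap is applied on top of the bursty arrival process. The generator below is a minimal sketch of this setup, written by us for illustration; it is not the actual benchmark driver:

```python
def bursty_arrival_rate(elapsed_s, normal_rate=2000, burst_rate=8000,
                        normal_s=25 * 60, burst_s=5 * 60):
    """Alternate between a normal and a high (burst) arrival rate, in records/s."""
    position = elapsed_s % (normal_s + burst_s)
    return normal_rate if position < normal_s else burst_rate

def target_ingest_rate(elapsed_s, limit=None):
    """'No Limit' ingests at the full arrival rate; 'Limit' caps the in-memory
    write rate (e.g., 4000 records/s), so excess writes queue up instead."""
    rate = bursty_arrival_rate(elapsed_s)
    return rate if limit is None else min(rate, limit)

print(target_ingest_rate(10 * 60))              # normal period: 2000 records/s
print(target_ingest_rate(27 * 60, limit=4000))  # burst period, capped at 4000 records/s
```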

The instantaneous write throughput and the percentile write latencies of the two variations are shown in Figures 12a and 12b, respectively. As Figure 12a shows, delaying writes avoids write stalls and the resulting write throughput is more stable over time. However, this causes larger write latencies (Figure 12b) since delayed writes must be queued. In contrast, writing as quickly as possible causes occasional write stalls but still minimizes the overall write latencies. This confirms our previous analysis that processing writes as quickly as possible minimizes write latencies.


Figure 13: Instantaneous Query Throughput of Tiering Merge Policy. (a) Point Lookup; (b) Short Range Query; (c) Long Range Query. Each panel compares the fair and greedy schedulers.

Figure 14: Instantaneous Query Throughput of Leveling Merge Policy. (a) Point Lookup; (b) Short Range Query; (c) Long Range Query. Each panel compares the fair and greedy schedulers.


Impact on Query Performance. Finally, since the point of having data is to query it, we evaluated the impact of the fair and greedy schedulers on concurrent query performance. We evaluated three types of queries, namely point lookups, short scans, and long scans. A point lookup accesses 1 record given a primary key. A short scan query accesses 100 records and a long scan query accesses 1 million records. In each experiment, we executed one type of query concurrently with updates arriving at a constant rate, as before. To maximize query performance while ensuring that LSM flush and merge operations receive enough I/O bandwidth, we used 8 query threads for point lookups and short scans and 4 query threads for long scans.

The instantaneous query throughput under tiering and leveling is depicted in Figures 13 and 14, respectively. Due to space limitations, we omit the average query latency, which can be computed by dividing the number of query threads by the query throughput. As the results show, leveling has similar point lookup throughput to tiering because Bloom filters are able to filter out most unnecessary I/Os, but it has much better range query throughput than tiering. The greedy scheduler always improves query performance by minimizing the number of components. Among the three types of queries, point lookups and short scans benefit more from the greedy scheduler since these two types of queries are more sensitive to the number of disk components. In contrast, long scans incur most of their I/O cost at the largest level. Moreover, tiering benefits more from the greedy scheduler than leveling because tiering has more disk components.

Note that with leveling, there is a drop in query throughput under the fair scheduler at around 5400 s, even though there is little difference in the number of disk components between the fair and greedy schedulers. This drop is caused by write stalls during that period, as was seen in the instantaneous write throughput of Figure 9a. After the LSM-tree recovers from write stalls, it attempts to write as much data as possible to catch up, which negatively impacts query performance.

In [42], we additionally evaluated the impact of forcing SSD writes regularly on query performance. Even though this optimization has some small negative impact on query throughput, it significantly reduces the percentile latencies of small queries, e.g., point lookups and small scans, by 5x to 10x because the large forces at the end of merges are instead broken down into many smaller ones.

6. PARTITIONED MERGES

We now examine the write stall behavior of partitioned LSM-trees using our two-phase approach. In a partitioned LSM-tree, a large disk component is range-partitioned into multiple small files, and each merge operation only processes a small number of files with overlapping ranges. Since merges always happen immediately once a level is full, a single-threaded scheduler could be sufficient to minimize write stalls. In the remainder of this section, we evaluate LevelDB's single-threaded scheduler.

6.1 LevelDB's Merge Scheduler

LevelDB's merge scheduler is single-threaded. It computes a score for each level and selects the level with the largest score to merge. Specifically, the score for Level 0 is computed as the total number of flushed components divided by the minimum number of flushed components to merge. For a partitioned level (1 and above), its score is defined as the total size of all files at this level divided by the configured maximum size. A merge operation is scheduled if the largest score is at least 1, which means that the selected level is full. If a partitioned level is chosen to merge, LevelDB selects the next file to merge in a round-robin way.
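In pseudocode form, the level-selection logic can be sketched roughly as follows; this is a simplification with variable names of our own choosing, not LevelDB's actual implementation:

```python
def pick_level_to_merge(level0_count, partitioned_level_sizes, max_level_sizes,
                        l0_merge_trigger=4):
    """Score each level and return the level with the largest score, or None
    if no level is full. Level 0 is scored by component count; the partitioned
    levels are scored by their total file size relative to the size target."""
    scores = {0: level0_count / l0_merge_trigger}
    for level, size in enumerate(partitioned_level_sizes, start=1):
        scores[level] = size / max_level_sizes[level - 1]
    best = max(scores, key=scores.get)
    return best if scores[best] >= 1.0 else None

# Example: Level 0 holds 5 flushed components (score 1.25) while Levels 1-3
# are all below their size targets, so Level 0 is merged next.
print(pick_level_to_merge(5, [900 << 20, 10 << 30, 90 << 30],
                          [1280 << 20, 12800 << 20, 128000 << 20]))
```

If the selected level is partitioned, the next file to merge within it is then chosen in a round-robin fashion, as described above.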


Figure 15: Instantaneous Write Throughput under Two-Phase Evaluation of Partitioned LSM-tree. (a) Testing Phase; (b) Running Phase. Each panel compares the round-robin and choose-best selection strategies.


LevelDB only restricts the number of flushed components at Level 0. By default, the minimum number of flushed components to merge is 4. The processing of writes will be slowed down or stopped if the number of flushed components reaches 8 or 12, respectively. Since we have already shown in Section 5.1.2 that processing writes as quickly as possible reduces write latencies, we will only use the stop threshold (12) in our evaluation.

Experimental Evaluation. We have implemented LevelDB's partitioned leveling merge policy and its merge scheduler inside AsterixDB for evaluation. Similar to LevelDB, the minimum number of flushed components to merge was set at 4 and the stop threshold was set at 12 components. Unless otherwise noted, the maximum size of each file was set at 64MB. The memory component size was set at 128MB and the base size of Level 1 was set at 1280MB. The size ratio was set at 10. For the experimental dataset with 100 million records, this results in a 4-level LSM-tree where the largest level is nearly full. To minimize write stalls caused by flushes, we used two memory components and a separate flush thread. We further evaluated the impact of two widely used merge selection strategies on write stalls. The round-robin strategy chooses the next file to merge in a round-robin way. The choose-best strategy [53] chooses the file with the fewest overlapping files at the next level.
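The two selection strategies can be contrasted with a small sketch, assuming each file is described by its key range; the helper functions below are ours, for illustration only:

```python
def overlaps(file_a, file_b):
    """Key ranges are (min_key, max_key) tuples."""
    return file_a[0] <= file_b[1] and file_b[0] <= file_a[1]

def pick_round_robin(files, last_merged_index):
    """Round-robin: take the file after the one merged last time."""
    return (last_merged_index + 1) % len(files)

def pick_choose_best(files, next_level_files):
    """Choose-best [53]: take the file with the fewest overlapping files at the
    next level, which tends to minimize the cost of the resulting merge."""
    def overlap_count(f):
        return sum(1 for g in next_level_files if overlaps(f, g))
    return min(range(len(files)), key=lambda i: overlap_count(files[i]))

level_1 = [(0, 10), (11, 20), (21, 30)]
level_2 = [(0, 5), (6, 12), (13, 28)]
print(pick_round_robin(level_1, last_merged_index=0))  # 1
print(pick_choose_best(level_1, level_2))              # 2: (21, 30) overlaps only one file below
```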

We used our two-phase approach to evaluate this partitioned LSM-tree design. The instantaneous write throughput during the testing phase is shown in Figure 15a, where the write throughput of both strategies decreases over time due to more frequent stalls. Moreover, under the uniform update workload, the alternative selection strategies have little impact on the overall write throughput, as reported in [38]. During the running phase, we used a constant arrival process to evaluate write stalls. The instantaneous write throughput of both strategies is shown in Figure 15b. As the result shows, in both cases write stalls start to occur after time 6000 s. This suggests that the measured write throughput during the testing phase is not sustainable.

6.2 Measuring Sustainable Write Throughput

One problem with LevelDB's score-based merge scheduler is that it merges as many components at Level 0 as possible at once. To see this, suppose that the minimum number of mergeable components at Level 0 is T0 and that the maximum number of components at Level 0 is T'0. During the testing phase, where writes pile up as quickly as possible, the merge scheduler tends to merge the maximum possible number of components T'0 instead of just T0 at once. Because of this, the LSM-tree will eventually transition from the expected shape (Figure 16a) to the actual shape (Figure 16b), where T is the size ratio of the partitioned levels. Note that the largest level is not affected since its size is determined by the number of unique entries, which is relatively stable. Even though this elastic design dynamically increases the processing rate as needed, it has the following problems.

Figure 16: Problem of Score-Based Merge Scheduler. (a) Expected LSM-tree, with level sizes growing by the size ratio T; (b) Actual LSM-tree, where merging T'0 components at Level 0 at once inflates the sizes of the intermediate levels by a factor of T'0/T0.


Unsustainable Write Throughput. The measured maximum write throughput is based on merging T'0 flushed components at Level 0 at once. However, this is likely to cause write stalls during the running phase since flushes cannot further proceed.

Suboptimal Trade-Offs. The LSM-tree structure in Figure 16b is no longer making optimal performance trade-offs since the size ratios between its adjacent levels are not the same anymore [45]. By adjusting the sizes of intermediate levels so that adjacent levels have the same size ratio, one can improve both write throughput and space utilization without affecting query performance.

Low Space Utilization. One motivation for industrial systems to adopt partitioned LSM-trees is their higher space utilization [30]. However, the LSM-tree depicted in Figure 16b violates this performance guarantee because the ratio of wasted space increases from 1/T to (T'0/T0) · (1/T).
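For concreteness, with the LevelDB-style defaults used in our setup (size ratio T = 10, minimum merge size T0 = 4, and stop threshold T'0 = 12), the wasted-space ratio grows from 1/T = 0.1 to (T'0/T0) · (1/T) = (12/4) · (1/10) = 0.3, i.e., it triples.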

Because of these problems, the measured maximum write throughput cannot be sustained in the long term. We propose a simple solution to address these problems. During the testing phase, we always merge exactly T0 components at Level 0. This ensures that merge preferences will be given equally to all levels so that the LSM-tree will stay in the expected shape (Figure 16a). Then, during the running phase, the LSM-tree can elastically merge more components at Level 0 as needed to absorb write bursts.
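A minimal sketch of the adjusted Level-0 selection under this solution, assuming the flushed components are ordered oldest first (our own illustrative code, not the exact patch used in our implementation):

```python
def level0_merge_inputs(flushed_components, t0=4, t0_max=12, testing_phase=True):
    """During the testing phase, merge exactly T0 components at Level 0 so the
    tree keeps its expected shape; during the running phase, allow up to T'0
    components to be merged at once to absorb write bursts."""
    if len(flushed_components) < t0:
        return []                            # Level 0 is not full yet
    limit = t0 if testing_phase else t0_max
    return flushed_components[:limit]        # oldest components first

# Testing phase: only the 4 oldest of 6 flushed components are merged.
print(level0_merge_inputs(list("abcdef"), testing_phase=True))
# Running phase: all 6 may be merged at once to catch up after a burst.
print(level0_merge_inputs(list("abcdef"), testing_phase=False))
```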

To verify the effectiveness of the proposed solution, we repeated the previous experiments on the partitioned LSM-tree. During the testing phase, the LSM-tree always merged 4 components at Level 0 at once. The measured instantaneous write throughput is shown in Figure 17a, which is 30% lower than that of the previous experiment. During the running phase, we used a constant arrival process based on this lower write throughput. The resulting instantaneous write throughput is shown in Figure 17b, where the LSM-tree successfully maintains a sustainable write throughput without any write stalls, which in turn results in low write latencies (not shown in the figure). This confirms that LevelDB's single-threaded scheduler is sufficient to minimize write stalls, given that a single merge thread can fully utilize the I/O bandwidth budget.

After fixing the unsustainable write throughput problem of LevelDB, we further evaluated the impact of partition size on the write stalls of partitioned LSM-trees.


Figure 17: Instantaneous Write Throughput under Two-Phase Evaluation of Partitioned LSM-tree with the Proposed Solution. (a) Testing Phase; (b) Running Phase. Each panel compares the round-robin and choose-best selection strategies.

Figure 18: Impact of Partition Size on Write Stalls. (a) Testing Phase: Maximum Write Throughput; (b) Running Phase: 99th Percentile Write Latency. Partition sizes range from 8MB to 32GB; each panel compares the round-robin and choose-best strategies.

In this experiment, we varied the size of each partitioned file from 8MB to 32GB so that partitioned merges effectively turn into full merges. The maximum write throughput during the testing phase and the 99th percentile write latencies during the running phase are shown in Figures 18a and 18b, respectively. Even though the partition size has little impact on the overall write throughput, a large partition size can cause large write latencies since we have shown in Section 5 that a single-threaded scheduler is insufficient to minimize write stalls for full merges. Most implementations of partitioned LSM-trees today already choose a small partition size to bound the temporary space occupied by merges. We see here that one more reason to do so is to minimize write stalls under a single-threaded scheduler.

7. LESSONS AND INSIGHTS

Having studied and evaluated the write stall problem for various LSM-tree designs, here we summarize the lessons and insights observed from our evaluation.

The LSM-tree's write throughput must be measured properly. The out-of-place update nature of LSM-trees has introduced the write stall problem. Throughout our evaluation, we have seen cases where one can obtain a higher but unsustainable write throughput. For example, the greedy scheduler would report a higher write throughput by starving large merges, and LevelDB's merge scheduler would report a higher but unsustainable write throughput by dynamically adjusting the shape of the LSM-tree. Based on our findings, we argue that in addition to the testing phase, used by existing LSM research, an extra running phase must be performed to evaluate the usability of the measured maximum write throughput. Moreover, the write latency must be measured properly due to queuing. One solution is to use the proposed two-phase evaluation approach to evaluate the resulting write latencies under high utilization, where the arrival rate is close to the processing rate.
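Conceptually, the two-phase methodology reduces to the following driver skeleton; this is a sketch only, in which the two callbacks and the 95% utilization level stand in for the actual benchmark components:

```python
def two_phase_evaluation(run_closed_loop, run_constant_arrivals,
                         duration_s=7200, utilization=0.95):
    """Phase 1 (testing): a closed-loop workload measures the maximum write
    throughput. Phase 2 (running): an open-loop, constant arrival process at
    high utilization replays writes against a freshly loaded system and
    records per-write latencies, from which percentiles are reported."""
    max_throughput = run_closed_loop(duration_s)            # records/s
    arrival_rate = utilization * max_throughput
    latencies = sorted(run_constant_arrivals(arrival_rate, duration_s))
    return {p: latencies[int(p / 100 * (len(latencies) - 1))]
            for p in (50, 90, 95, 99, 99.9, 99.99)}

# Toy usage with stand-in callbacks that would normally drive the system under test.
report = two_phase_evaluation(
    run_closed_loop=lambda duration: 10000,
    run_constant_arrivals=lambda rate, duration: [0.001] * 1000 + [0.5] * 10)
print(report[99])
```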

Merge scheduling is critical to minimizing write stalls. Throughout our evaluation of various LSM-tree designs, including bLSM [51], full merges, and partitioned merges, we have seen that merge scheduling has a critical impact on write stalls. Comparing these LSM-tree designs in general depends on many factors and is beyond the scope of this paper; here we have focused on how to minimize write stalls for each LSM-tree design.

bLSM [51], an instance of full merges, introduces a sophisticated spring-and-gear merge scheduler to bound the processing latency of LSM-trees. However, we found that bLSM still has large variances in its processing rate, leading to large write latencies under high arrival rates. Among the three evaluated schedulers, namely single-threaded, fair, and greedy, the single-threaded scheduler should not be used in practical systems due to the long stalls caused by large merges. The fair scheduler should be used when measuring the maximum throughput because it provides fairness to all merges. The greedy scheduler should be used at runtime since it better minimizes the number of disk components, both reducing write stalls and improving query performance. Moreover, as an important design choice, global component constraints better minimize write stalls.

Partitioned merges simplify merge scheduling by breaking large merges into many smaller ones. However, we found a new problem: the measured maximum write throughput of LevelDB is unsustainable because it dynamically adjusts the size ratios under write-intensive workloads. After fixing this problem, a single-threaded scheduler with a small partition size, as used by LevelDB, is sufficient for delivering low write latencies under high utilization. However, fixing this problem reduced the maximum write throughput of LevelDB by roughly one-third in our evaluation.

For both full and partitioned merges, processing writes as quickly as possible better minimizes write latencies. Finally, with proper merge scheduling, all LSM-tree designs can indeed minimize write stalls, delivering low write latencies under high utilization.

8. CONCLUSION

In this paper, we have studied and evaluated the write stall problem for various LSM-tree designs. We first proposed a two-phase approach to use in evaluating the impact of write stalls on percentile write latencies using a combination of closed and open system testing models. We then identified and explored the design choices for LSM merge schedulers. For full merges, we proposed a greedy scheduler that minimizes write stalls. For partitioned merges, we found that a single-threaded scheduler is sufficient to provide a stable write throughput but that the maximum write throughput must be measured properly. Based on these findings, we have shown that performance variance must be considered together with write throughput to ensure the actual usability of the measured throughput.

Acknowledgments. We thank Neal Young, Dongxu Zhao, and anonymous reviewers for their helpful feedback on the theorems in this paper. This work has been supported by NSF awards CNS-1305430, IIS-1447720, IIS-1838248, and CNS-1925610, along with industrial support from Amazon, Google, and Microsoft and support from the Donald Bren Foundation (via a Bren Chair).


9. REFERENCES

[1] AsterixDB. https://asterixdb.apache.org/.

[2] Cassandra. http://cassandra.apache.org/.

[3] Compaction stalls: something to make better in RocksDB. http://smalldatum.blogspot.com/2017/01/compaction-stalls-something-to-make.html.

[4] HBase. https://hbase.apache.org/.

[5] LevelDB. http://leveldb.org/.

[6] Read- and latency-optimized log structured merge tree. https://github.com/sears/bLSM/.

[7] RocksDB. http://rocksdb.org/.

[8] ScyllaDB. https://www.scylladb.com/.

[9] Tarantool. https://www.tarantool.io/.

[10] TPC-C. http://www.tpc.org/tpcc/.

[11] WiredTiger. http://www.wiredtiger.com/.

[12] YCSB change log. https://github.com/brianfrankcooper/YCSB/blob/master/core/CHANGES.md.

[13] M. Y. Ahmad and B. Kemme. Compaction management in distributed key-value datastores. PVLDB, 8(8):850–861, 2015.

[14] S. Alsubaiee et al. AsterixDB: A scalable, open source BDMS. PVLDB, 7(14):1905–1916, 2014.

[15] M. Armbrust et al. PIQL: Success-tolerant query processing in the cloud. PVLDB, 5(3):181–192, 2011.

[16] M. Armbrust et al. Generalized scale independence through incremental precomputation. In ACM SIGMOD, pages 625–636, 2013.

[17] O. Balmau et al. FloDB: Unlocking memory in persistent key-value stores. In European Conference on Computer Systems (EuroSys), pages 80–94, 2017.

[18] O. Balmau et al. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In USENIX Annual Technical Conference (ATC), pages 363–375, 2017.

[19] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. CACM, 13(7):422–426, July 1970.

[20] G. Candea et al. A scalable, predictable join operator for highly concurrent data warehouses. PVLDB, 2(1):277–288, 2009.

[21] Z. Cao et al. On the performance variation in modern storage stacks. In USENIX Conference on File and Storage Technologies (FAST), pages 329–343, 2017.

[22] M. J. Carey. AsterixDB mid-flight: A case study in building systems in academia. In ICDE, pages 1–12, 2019.

[23] S. Chaudhuri et al. Variance aware optimization of parameterized queries. In ACM SIGMOD, pages 531–542, 2010.

[24] B. F. Cooper et al. Benchmarking cloud serving systems with YCSB. In ACM SoCC, pages 143–154, 2010.

[25] N. Dayan et al. Monkey: Optimal navigable key-value store. In ACM SIGMOD, pages 79–94, 2017.

[26] N. Dayan et al. Optimal Bloom filters and adaptive merging for LSM-trees. ACM TODS, 43(4):16:1–16:48, Dec. 2018.

[27] N. Dayan and S. Idreos. Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging. In ACM SIGMOD, pages 505–520, 2018.

[28] N. Dayan and S. Idreos. The log-structured merge-bush & the wacky continuum. In ACM SIGMOD, pages 449–466, 2019.

[29] J. Dean and L. A. Barroso. The tail at scale. CACM, 56:74–80, 2013.

[30] S. Dong et al. Optimizing space amplification in RocksDB. In CIDR, volume 3, page 3, 2017.

[31] R. Grover and M. J. Carey. Data ingestion in AsterixDB. In EDBT, pages 605–616, 2015.

[32] M. Harchol-Balter. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press, 2013.

[33] G. Huang et al. X-Engine: An optimized storage engine for large-scale E-commerce transaction processing. In ACM SIGMOD, pages 651–665, 2019.

[34] J. Huang et al. Statistical analysis of latency through semantic profiling. In European Conference on Computer Systems (EuroSys), pages 64–79, 2017.

[35] J. Huang et al. A top-down approach to achieving performance predictability in database systems. In ACM SIGMOD, pages 745–758, 2017.

[36] Y. Li et al. Tree indexing on solid state drives. PVLDB, 3(1-2):1195–1206, 2010.

[37] Y. Li et al. Enabling efficient updates in KV storage via hashing: Design and performance evaluation. ACM Transactions on Storage (TOS), 15(3):20, 2019.

[38] H. Lim et al. Towards accurate and fast evaluation of multi-stage log-structured designs. In USENIX Conference on File and Storage Technologies (FAST), pages 149–166, 2016.

[39] L. Lu et al. WiscKey: Separating keys from values in SSD-conscious storage. In USENIX Conference on File and Storage Technologies (FAST), pages 133–148, 2016.

[40] C. Luo and M. J. Carey. Efficient data ingestion and query processing for LSM-based storage systems. PVLDB, 12(5):531–543, 2019.

[41] C. Luo and M. J. Carey. LSM-based storage techniques: a survey. The VLDB Journal, 2019.

[42] C. Luo and M. J. Carey. On performance stability in LSM-based storage systems (extended version). CoRR, abs/1906.09667, 2019.

[43] C. Luo et al. Umzi: Unified multi-zone indexing for large-scale HTAP. In EDBT, pages 1–12, 2019.

[44] F. Mei et al. SifrDB: A unified solution for write-optimized key-value stores in large datacenter. In ACM SoCC, pages 477–489, 2018.

[45] P. O'Neil et al. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351–385, 1996.

[46] M. A. Qader et al. A comparative study of secondary indexing techniques in LSM-based NoSQL databases. In ACM SIGMOD, pages 551–566, 2018.

[47] P. Raju et al. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In ACM SOSP, pages 497–514, 2017.

[48] V. Raman et al. Constant-time query processing. In ICDE, pages 60–69, 2008.

[49] K. Ren et al. SlimDB: A space-efficient key-value storage engine for semi-sorted data. PVLDB, 10(13):2037–2048, 2017.

[50] R. Sears and E. Brewer. Stasis: Flexible transactional storage. In Symposium on Operating Systems Design and Implementation (OSDI), pages 29–44, 2006.

[51] R. Sears and R. Ramakrishnan. bLSM: A general purpose log structured merge tree. In ACM SIGMOD, pages 217–228, 2012.

[52] D. Teng et al. LSbM-tree: Re-enabling buffer caching in data management for mixed reads and writes. In IEEE International Conference on Distributed Computing Systems (ICDCS), pages 68–79, 2017.

[53] R. Thonangi and J. Yang. On log-structured merge for solid-state drives. In ICDE, pages 683–694, 2017.

[54] P. Unterbrunner et al. Predictable performance for unpredictable workloads. PVLDB, 2(1):706–717, 2009.

[55] X. Wang and M. J. Carey. An IDEA: An ingestion framework for data enrichment in AsterixDB. PVLDB, 12(11):1485–1498, 2019.

[56] T. Yao et al. A light-weight compaction tree to reduce I/O amplification toward efficient key-value stores. In International Conference on Massive Storage Systems and Technology (MSST), 2017.

[57] C. Yunpeng et al. LDC: a lower-level driven compaction method to optimize SSD-oriented key-value stores. In ICDE, pages 722–733, 2019.

[58] Y. Zhang et al. ElasticBF: Fine-grained and elastic bloom filter towards efficient read for LSM-tree-based KV stores. In USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2018.
