PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1 , Rohan Kadekodi 1 , Vijay Chidambaram 1,2 , Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

Page 1: PebblesDB: Building Key-Value Stores using Fragmented Log ... › ~vijay › papers › pebblesdb-sosp17-slides.pdfPebblesDB: Building Key-Value Stores using Fragmented Log Structured

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

Pandian Raju¹, Rohan Kadekodi¹, Vijay Chidambaram¹˒², Ittai Abraham²

¹ The University of Texas at Austin, ² VMware Research

Pages 2-6:

What is a key-value store?

• Store any arbitrary value for a given key
• Insertions: put(key, value)
• Point lookups: get(key)
• Range queries: get_range(key1, key2)

Keys   Values
123    {"name": "John Doe", "age": 25}
124    {"name": "Ross Gel", "age": 28}
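The three operations on this slide can be sketched as a toy in-memory store. This is only an illustration: the class name `ToyKVStore` is hypothetical, and treating `get_range` as inclusive on both ends is an assumption, since the slide does not say.

```python
import bisect


class ToyKVStore:
    """In-memory key-value store that keeps keys sorted for range queries."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        # Insert the key into the sorted list only the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        # Point lookup; returns None when the key is absent.
        return self._values.get(key)

    def get_range(self, key1, key2):
        # ASSUMPTION: the range is inclusive, key1 <= key <= key2.
        lo = bisect.bisect_left(self._keys, key1)
        hi = bisect.bisect_right(self._keys, key2)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = ToyKVStore()
store.put(123, {"name": "John Doe", "age": 25})
store.put(124, {"name": "Ross Gel", "age": 28})
```

Keeping the keys sorted is what makes range queries cheap; a plain hash map would serve `put` and `get` but not `get_range`.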

Page 7:

Key-Value Stores - widely used

• Google's BigTable powers Search, Analytics, Maps and Gmail
• Facebook's RocksDB is used as the storage engine in production systems of many companies

Pages 8-9:

Write-optimized data structures

• Log Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
• Provides high write throughput with good read throughput, but suffers from high write amplification
• Write amplification: ratio of the total amount of write I/O to the amount of user data

Example: a client writes 10 GB of user data to the KV store. If the total write I/O is 200 GB, write amplification = 20.
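The definition of write amplification is a simple ratio, and the slide's example can be checked directly. The function name below is hypothetical and only serves the illustration.

```python
def write_amplification(total_write_io_gb, user_data_gb):
    """Ratio of total write I/O to the amount of user data written."""
    if user_data_gb <= 0:
        raise ValueError("user_data_gb must be positive")
    return total_write_io_gb / user_data_gb


# Slide example: 10 GB of user data causes 200 GB of write I/O.
slide_example = write_amplification(200, 10)
```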

Pages 10-11:

Write amplification in LSM based KV stores

• Inserted 500M key-value pairs
• Key: 16 bytes, Value: 128 bytes
• Total user data: ~45 GB

[Bar chart: Write I/O (GB), y-axis from 0 to 2100]

Store       Write I/O (GB)   Write amplification
RocksDB     1868             42x
LevelDB     1222             27x
PebblesDB   756              17x
User data   45               -

Page 12:

Why is write amplification bad?

• Reduces the write throughput
• Flash devices wear out after a limited number of write cycles (an Intel SSD DC P4600 can last ~5 years assuming ~5 TB written per day)

With RocksDB, writing ~500 GB of user data per day would wear out such an SSD in about 1.25 years

Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html
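The lifetime figure can be reproduced with back-of-the-envelope arithmetic. In this sketch the function name is hypothetical, and the write amplification of 40 is an assumption chosen because it is consistent with the slide's 1.25 years; the slide does not state which value it used.

```python
def ssd_lifetime_years(rated_years, rated_tb_per_day, user_tb_per_day, write_amp):
    """Estimate SSD lifetime when the device absorbs user writes times write amplification."""
    device_tb_per_day = user_tb_per_day * write_amp
    return rated_years * rated_tb_per_day / device_tb_per_day


# Rating from the slide: ~5 years at ~5 TB written per day.
# ASSUMPTION: write amplification of 40, picked to match the slide's 1.25 years.
lifetime = ssd_lifetime_years(rated_years=5, rated_tb_per_day=5,
                              user_tb_per_day=0.5, write_amp=40)
```

Under the same assumptions, halving the write amplification would double the estimated lifetime, which is why the metric matters for flash storage.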

Page 13:

PebblesDB

Built using a new data structure, the Fragmented Log-Structured Merge Tree

High performance write-optimized key-value store

Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB

Gets the highest write throughput and the least write amplification as a backend store for MongoDB

Pages 14-15:

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Pages 16-22:

Log Structured Merge Tree (LSM)

• Data is stored both in memory and on storage
• Writes are put directly into memory: Write(key, value)
• In-memory data is periodically written as files to storage (sequential I/O)
• Files on storage are logically arranged in different levels (Level 0, Level 1, ..., Level n)
• Compaction pushes data to higher numbered levels
• Files are sorted and have non-overlapping key ranges, so a key is located using binary search
  Example level: 1…12 | 15…19 | 25…75 | 79…99
• Level 0 can have files with overlapping (but sorted) key ranges; there is a limit on the number of level 0 files
  Example level 0: 2…57 | 23…78

[Diagram: an in-memory component in Memory; File 1, File 2 and Levels 0 to n on Storage]
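Because files within a level are sorted and non-overlapping, a lookup needs to open at most one file per level. A minimal sketch of that binary search follows, using the key ranges from the slide; the file contents and function names are made up for illustration.

```python
import bisect

# Each file is (smallest_key, largest_key, {key: value}); the ranges mirror the slide,
# the stored keys and values are hypothetical.
level = [
    (1, 12, {5: "a"}),
    (15, 19, {17: "b"}),
    (25, 75, {40: "c"}),
    (79, 99, {80: "d"}),
]


def find_file(level, key):
    """Binary search for the single file whose key range could contain key."""
    smallest_keys = [f[0] for f in level]
    idx = bisect.bisect_right(smallest_keys, key) - 1
    if idx < 0:
        return None
    smallest, largest, _ = level[idx]
    return level[idx] if smallest <= key <= largest else None


def get_from_level(level, key):
    """Look the key up in at most one file of a sorted, non-overlapping level."""
    f = find_file(level, key)
    return None if f is None else f[2].get(key)
```

Level 0 cannot use this shortcut: its files may overlap, so every level 0 file has to be checked.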

Pages 23-34:

Write amplification: Illustration

(Files are immutable; files within a level are sorted and non-overlapping.)

1. Max files in level 0 is configured to be 2.
   In-memory: 58…68
   Level 0: 2…37 | 23…48
   Level 1: 1…12 | 15…25 | 39…62 | 77…95
   Level 1 re-write counter: 1
2. The in-memory data 58…68 is flushed. Level 0 has 3 files (> 2), which triggers a compaction.
3. The set of overlapping files between levels 0 and 1 is selected.
4. Compacting level 0 with level 1: the selected files are merged into 1…68 and rewritten as 1…23 | 24…46 | 47…68.
   Level 1 re-write counter: 2
5. Level 0 is compacted.
   Level 1: 1…23 | 24…46 | 47…68 | 77…95
6. Data is flushed as level 0 files after some write operations.
   Level 0: 10…33 | 17…53 | 1…121
7. Compacting level 0 with level 1: everything is merged into 1…121 and rewritten as 1…30 | 31…60 | 62…90 | 92…121.
   Level 1 re-write counter: 3
8. Existing data is re-written to the same level (1) 3 times.
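The first compaction in this illustration can be replayed with a toy model. This is a sketch under simplifying assumptions: each file is a sorted list holding only its two endpoint keys, `keys_per_file` is a hypothetical output file size, and the function name is invented. It is not LevelDB's actual compaction code.

```python
def compact(level0, level1, keys_per_file):
    """Merge level 0 into level 1 the way a classic LSM does.

    Files are sorted lists of keys. Returns the new level 1 and the number of
    keys that already lived in level 1 and were written there again.
    """
    lo = min(f[0] for f in level0)
    hi = max(f[-1] for f in level0)
    # Level 1 files whose range intersects [lo, hi] must be rewritten.
    touched = [f for f in level1 if f[0] <= hi and lo <= f[-1]]
    untouched = [f for f in level1 if not (f[0] <= hi and lo <= f[-1])]
    merged = sorted({k for f in level0 + touched for k in f})
    rewritten = [merged[i:i + keys_per_file]
                 for i in range(0, len(merged), keys_per_file)]
    new_level1 = sorted(untouched + rewritten, key=lambda f: f[0])
    return new_level1, sum(len(f) for f in touched)


# Key ranges from the slides; only the endpoints are stored in this toy.
level0_files = [[2, 37], [23, 48], [58, 68]]
level1_before = [[1, 12], [15, 25], [39, 62], [77, 95]]
new_level1, rewritten_again = compact(level0_files, level1_before, keys_per_file=4)
```

Three of the four level 1 files are rewritten even though their contents did not change, which is the effect the re-write counter on the slides tracks.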

Page 35:

Root cause of write amplification

Rewriting data to the same level multiple times

To maintain sorted non-overlapping files in each level

Page 36:

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Page 37:

Naïve approach to reduce write amplification

• Just append the file to the end of the next level
• Many (possibly all) overlapping files within a level
• Affects the read performance

Level i: 1…89 | 6…91 | 5…65 | 9…99 | 1…102 | 1…27 | 18…95
(all files have overlapping key ranges)

Page 38:

Partially sorted levels

• Hybrid between all non-overlapping files and all overlapping files
• Inspired by the skip list data structure
• Concrete boundaries (guards) to group together overlapping files

Level i, guards 13, 35, 70:
1…12 | guard 13: 13…34, 18…31 | guard 35: 40…47, 42…65, 45…56 | guard 70: 72…87
(files of the same color can have overlapping key ranges)

Page 39:

Fragmented Log-Structured Merge Tree

Novel modification of the LSM data structure

Uses guards to maintain partially sorted levels

Writes data only once per level in most cases

Pages 40-41:

FLSM structure

• Note how files are logically grouped within guards
• Guards get more fine grained deeper into the tree

Memory:  In-memory
Level 0: 2…37 | 23…48
Level 1 (guards 15, 70): 1…12 | guard 15: 15…59 | guard 70: 77…87, 82…95
Level 2 (guards 15, 40, 70, 95): 2…8 | guard 15: 15…23, 16…32 | guard 40: 45…65 | guard 70: 70…90 | guard 95: 96…99

Pages 42-48:

How does FLSM reduce write amplification?

1. Max files in level 0 is configured to be 2. Level 0 holds 2…37 and 23…48, and the in-memory data is 30…68.
2. Compacting level 0: the level 0 files are merged into 2…68 and fragmented at the level 1 guard 15 into 2…14 and 15…68.
3. Fragmented files are just appended to the next level.
   Level 1: 1…12, 2…14 | guard 15: 15…59, 15…68 | guard 70: 77…87, 82…95
4. Guard 15 in level 1 is to be compacted.
5. Files are combined, sorted and fragmented: 15…59 and 15…68 are merged into 15…68 and fragmented at the level 2 guard 40 into 15…39 and 40…68.
6. Fragmented files are just appended to the next level.
   Level 2: 2…8 | guard 15: 15…23, 16…32, 15…39 | guard 40: 45…65, 40…68 | guard 70: 70…90 | guard 95: 96…99
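The "fragment and append" step can be sketched as follows. This is a toy model, not PebblesDB's code: files are sorted key lists holding only endpoint keys, a level is a dictionary from guard index to its files, and the function names are hypothetical.

```python
import bisect


def fragment(sorted_keys, guards):
    """Split a sorted key list at the guard keys.

    Returns {guard_index: keys}, where index 0 holds keys below the first
    guard and index i holds keys in [guards[i-1], guards[i]).
    """
    pieces = {}
    for key in sorted_keys:
        idx = bisect.bisect_right(guards, key)
        pieces.setdefault(idx, []).append(key)
    return pieces


def compact_into(level_files, next_level, next_guards):
    """Merge the given files, fragment at the next level's guards, and append.

    next_level maps guard_index -> list of files; existing files are left
    untouched, so nothing already in the next level is rewritten.
    """
    merged = sorted({k for f in level_files for k in f})
    for idx, piece in fragment(merged, next_guards).items():
        next_level.setdefault(idx, []).append(piece)
    return next_level


# Level 0 files from the slides (only endpoint keys are stored in this toy).
level0 = [[2, 37], [23, 48], [30, 68]]
# Level 1 has guards 15 and 70; files are grouped by guard index.
level1 = {0: [[1, 12]], 1: [[15, 59]], 2: [[77, 87], [82, 95]]}
level1 = compact_into(level0, level1, next_guards=[15, 70])
```

The files that were already in level 1 are never read or rewritten here; the new fragments simply join the matching guard, which is where the savings in write I/O come from.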

Page 49:

How does FLSM reduce write amplification?

FLSM doesn't re-write data to the same level in most cases

How does FLSM maintain read performance?

FLSM maintains partially sorted levels to efficiently reduce the search space

Pages 50-53:

Selecting Guards

• Guards are chosen randomly and dynamically
• Dependent on the distribution of data

[Diagram: guards placed along a keyspace from 1 to 1e+9]
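The slides do not spell out the selection procedure. One possible sketch, loosely modeled on how skip lists promote elements, is to hash each inserted key and treat it as a guard when the hash has enough trailing zero bits. Everything here is an assumption made for illustration: the use of SHA-1, the `top_bits` and `bits_step` parameters, and the function names.

```python
import hashlib


def trailing_zero_bits(key):
    """Number of trailing zero bits in a SHA-1 hash of the key."""
    digest = int.from_bytes(hashlib.sha1(str(key).encode()).digest(), "big")
    if digest == 0:
        return 160
    return (digest & -digest).bit_length() - 1


def guard_levels(key, num_levels, top_bits=10, bits_step=2):
    """Return the levels (1-based) at which this key would act as a guard.

    ASSUMPTION: level i requires top_bits - (i - 1) * bits_step trailing zero
    bits, so deeper levels accept more keys and get finer-grained guards.
    """
    zeros = trailing_zero_bits(key)
    return [lvl for lvl in range(1, num_levels + 1)
            if zeros >= max(top_bits - (lvl - 1) * bits_step, 0)]
```

Because guards are drawn from the keys that are actually inserted, dense regions of the keyspace naturally receive more guards, and a key that is a guard at one level is also a guard at every deeper level in this scheme.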

Page 54:

Operations: Write

Example: Put(1, "abc"), issued as Write(key, value)

[Diagram: the FLSM structure from page 40]

Pages 55-59:

Operations: Get

Example: Get(23) on the FLSM structure from page 40

1. Search level by level starting from memory
2. Level 0: all level 0 files need to be searched (2…37 and 23…48)
3. Level 1: the file under guard 15 is searched (15…59)
4. Level 2: both the files under guard 15 are searched (15…23 and 16…32)
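The lookup path walked through above can be sketched in a few lines. The key ranges follow the slides, while the stored values, the data layout and the rule that later files in a list are newer are assumptions made for this example.

```python
import bisect

memtable = {}
level0 = [{2: "v2", 37: "v37"}, {23: "new", 48: "v48"}]
# Each deeper level: (guards, {guard_index: [files]}); index 0 = keys below first guard.
levels = [
    ([15, 70], {0: [{1: "a", 12: "b"}],
                1: [{15: "c", 59: "d"}],
                2: [{77: "e", 87: "f"}, {82: "g", 95: "h"}]}),
    ([15, 40, 70, 95], {0: [{2: "old2", 8: "i"}],
                        1: [{15: "j", 23: "old"}, {16: "k", 32: "l"}],
                        2: [{45: "m", 65: "n"}],
                        3: [{70: "o", 90: "p"}],
                        4: [{96: "q", 99: "r"}]}),
]


def get(key):
    """Search memory, then every level 0 file, then one guard per deeper level."""
    if key in memtable:
        return memtable[key]
    for f in reversed(level0):  # ASSUMPTION: later files in the list are newer
        if key in f:
            return f[key]
    for guards, files_by_guard in levels:
        idx = bisect.bisect_right(guards, key)  # guard whose range holds the key
        for f in reversed(files_by_guard.get(idx, [])):
            if key in f:
                return f[key]
    return None
```

The binary search over guards narrows each level to a single guard, but every file inside that guard may still need to be checked, which is the read-side cost discussed later in the deck.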

Pages 60-61:

High write throughput in FLSM

• Compaction from memory to level 0 is stalled
• Writes to memory are also stalled

If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction

FLSM has faster compaction because of less I/O, and hence higher write throughput

[Diagram: Write(key, value) arriving at Memory; in-memory data 2…98 and 23…48; level 0 files 1…37 and 18…48 on Storage]

Page 62:

Challenges in FLSM

• Every read/range query operation needs to examine multiple files per level
• For example, if every guard has 5 files, read latency is increased by 5x (assuming no cache hits)

Trade-off between write I/O and read performance

Page 63:

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Page 64:

PebblesDB

• Built by modifying HyperLevelDB (±9100 LOC) to use FLSM
• HyperLevelDB is built over LevelDB to provide improved parallelism and compaction
• API compatible with LevelDB, but not with RocksDB

Pages 65-69:

Optimizations in PebblesDB

• Challenge (get/range query): multiple files in a guard
• Get() performance is improved using a file level bloom filter
• Range query performance is improved using parallel threads and better compaction

Bloom filter: "Is key 25 present?" is answered with "Definitely not" or "Possibly yes"

Level 1 (guards 15, 70): 1…12 | 15…39 | 77…97 | 82…95
Each file has its own bloom filter, maintained in memory

PebblesDB reads the same number of files as any LSM based store
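A file level bloom filter lets a lookup skip files that definitely do not contain the key. The sketch below is a generic bloom filter, not PebblesDB's implementation: the sizes, the SHA-256 based hashing and all names are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely not"; True means "possibly yes".
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(guard_files, key):
    """Return only the files in a guard whose bloom filter says 'possibly yes'."""
    return [name for name, bloom in guard_files if bloom.might_contain(key)]
```

With one filter per file kept in memory, a guard holding several files usually costs only one file read per lookup, at the price of some memory and the occasional false positive.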

Page 70:

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Pages 71-72:

Evaluation

• Micro-benchmarks
• Low memory
• Small dataset
• Crash recovery
• CPU and memory usage
• Aged file system
• Real world workloads - YCSB
• NoSQL applications

Pages 73-78:

Real world workloads - YCSB

• Yahoo! Cloud Serving Benchmark - Industry standard macro-benchmark
• Insertions: 50M, Operations: 10M, key size: 16 bytes and value size: 1 KB

Workloads:
Load A - 100% writes
Run A - 50% reads, 50% writes
Run B - 95% reads, 5% writes
Run C - 100% reads
Run D - 95% reads (latest), 5% writes
Load E - 100% writes
Run E - 95% range queries, 5% writes
Run F - 50% reads, 50% read-modify-writes

[Bar chart: throughput ratio with respect to HyperLevelDB, y-axis from 0 to 2.5, one group of bars per workload plus Total IO]

Absolute values labeled on the chart, in the order they appear (the transcript does not say which store they belong to):

Workload   Label
Load A     35.08 Kops/s
Run A      25.8 Kops/s
Run B      33.98 Kops/s
Run C      22.41 Kops/s
Run D      57.87 Kops/s
Load E     34.06 Kops/s
Run E      5.8 Kops/s
Run F      32.09 Kops/s
Total IO   952.93 GB

Page 79: PebblesDB: Building Key-Value Stores using Fragmented Log ... › ~vijay › papers › pebblesdb-sosp17-slides.pdfPebblesDB: Building Key-Value Stores using Fragmented Log Structured


NoSQL stores - MongoDB

• YCSB on MongoDB, a widely used key-value store
• Inserted 20M key-value pairs with 1 KB value size and 10M operations

[Figure: bar chart, throughput ratio wrt WiredTiger (y-axis, 0 to 2.5) per YCSB workload, plus total IO. Value labels on the chart, in x-axis order:]
• Load A - 100% writes: 20.73 Kops/s
• Run A - 50% reads, 50% writes: 9.95 Kops/s
• Run B - 95% reads, 5% writes: 15.52 Kops/s
• Run C - 100% reads: 19.69 Kops/s
• Run D - 95% reads (latest), 5% writes: 23.53 Kops/s
• Load E - 100% writes: 20.68 Kops/s
• Run E - 95% range queries, 5% writes: 0.65 Kops/s
• Run F - 50% reads, 50% read-modify-writes: 9.78 Kops/s
• Total IO: 426.33 GB

PebblesDB combines low write IO of WiredTiger with high performance of RocksDB

Outline

• Log-Structured Merge Tree (LSM)
• Fragmented Log-Structured Merge Tree (FLSM)
• Building PebblesDB using FLSM
• Evaluation
• Conclusion

Conclusion

• PebblesDB: key-value store built on Fragmented Log-Structured Merge Trees
• Increases write throughput and reduces write IO at the same time
• Obtains 6X the write throughput of RocksDB

• As key-value stores become more widely used, there have been several attempts to optimize them
• PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

https://github.com/utsaslab/pebblesdb


Backup slides


Operations: Seek

• Seek(target): Returns the smallest key in the database which is >= target
• Used for range queries (for example, return all entries between 5 and 18)

Example levels:
• Level 0 – 1, 2, 100, 1000
• Level 1 – 1, 5, 10, 2000
• Level 2 – 5, 300, 500

Operations shown on these levels: Get(1) and Seek(200)
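The difference between Get and Seek on the example levels can be sketched as follows. This is only an illustrative sketch: each level is modelled as one sorted Python list, and the function names `get` and `seek` are hypothetical helpers, not PebblesDB or LevelDB APIs.

```python
from bisect import bisect_left

# Sorted keys per level, copied from the example levels on the slide.
LEVELS = [
    [1, 2, 100, 1000],   # Level 0
    [1, 5, 10, 2000],    # Level 1
    [5, 300, 500],       # Level 2
]

def get(levels, key):
    """Return the index of the first level that holds key, or None.

    A point lookup can stop at the first level where the key is found.
    """
    for index, keys in enumerate(levels):
        pos = bisect_left(keys, key)
        if pos < len(keys) and keys[pos] == key:
            return index
    return None

def seek(levels, target):
    """Return the smallest key >= target across all levels, or None.

    Every level contributes a candidate, so every level must be examined.
    """
    candidates = []
    for keys in levels:
        pos = bisect_left(keys, target)
        if pos < len(keys):
            candidates.append(keys[pos])
    return min(candidates) if candidates else None

print(get(LEVELS, 1))     # 0
print(seek(LEVELS, 200))  # 300
```

For Seek(200) the per-level candidates are 1000, 2000 and 300, so the answer comes from the last level even though the first level already offered a key >= 200.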

Operations: Seek

FLSM structure

[Figure: FLSM layout used for the Seek(23) example; | separates guards within a level]
• Memory (in-memory): memtable
• Level 0 (storage, no guards): sstables 2…37, 23…48
• Level 1 (storage, guards 15 and 70): sstables 1…12 | 15…59 | 77…87, 82…95
• Level 2 (storage, guards 15, 40, 70 and 95): sstables 2…8 | 15…23, 16…32 | 45…65 | 70…90 | 96…99

All levels and memtable need to be searched
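A minimal sketch of how Seek(23) could pick the sstables to examine in each level of the figure. The grouping of sstables under guards is an assumption read off the key ranges in the figure, and `files_for_seek` is a hypothetical helper, not a PebblesDB function. Sstables are represented only by their (min, max) key range.

```python
from bisect import bisect_right

# level name -> (sorted guard keys, sstable groups).
# groups[0] holds keys below the first guard; groups[i] belongs to guards[i - 1].
LEVELS = {
    "Level 0": ([], [[(2, 37), (23, 48)]]),
    "Level 1": ([15, 70], [[(1, 12)], [(15, 59)], [(77, 87), (82, 95)]]),
    "Level 2": (
        [15, 40, 70, 95],
        [[(2, 8)], [(15, 23), (16, 32)], [(45, 65)], [(70, 90)], [(96, 99)]],
    ),
}

def files_for_seek(guards, groups, target):
    """Return the sstables of the one guard whose key range covers target."""
    slot = bisect_right(guards, target)  # number of guard keys <= target
    return groups[slot]

for name, (guards, groups) in LEVELS.items():
    print(name, files_for_seek(guards, groups, 23))
```

Within a guard the sstables may overlap (for example 15…23 and 16…32), so all of them are candidates for the seek. The sketch ignores the case where no key >= target exists in the chosen guard and the seek has to move on to the next guard.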

Optimizations in PebblesDB

• Challenge with reads: Multiple sstable reads per level
• Optimized using sstable level bloom filters
• Bloom filter: determine if an element is in a set

[Figure: a bloom filter asked "Is key 25 present?" answers either "Definitely not" or "Possibly yes"]

[Figure: Level 1 with guards 15 and 70 and sstables 1…12, 15…39, 77…97, 82…95, each with its own bloom filter maintained in-memory. For Get(97), one filter under guard 70 answers False and the other answers True]

PebblesDB reads at most one file per guard with high probability
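The "definitely not / possibly yes" behaviour can be illustrated with a small bloom filter. This is a generic textbook sketch, not the filter implementation used in PebblesDB; the class name, the sizes and the sstable contents below are all hypothetical.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely not"; True means "possibly yes".
        return all(self.bits[pos] for pos in self._positions(key))

# Hypothetical contents for the two sstables under guard 70.
sstables = {"77...97": [77, 85, 97], "82...95": [82, 90, 95]}
filters = {}
for name, keys in sstables.items():
    filters[name] = BloomFilter()
    for key in keys:
        filters[name].add(key)

# Get(97): only sstables whose filter says "possibly yes" need to be read.
to_read = [name for name, f in filters.items() if f.might_contain(97)]
print(to_read)
```

The sstable that really holds key 97 is always in `to_read`; the other one is skipped unless its filter returns a false positive, which is why the slide qualifies the claim with "with high probability".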

Optimizations in PebblesDB

• Challenge with seeks: Multiple sstable reads per level
• Parallel seeks: Parallel threads to seek() on files in a guard

[Figure: Level 1 with guards 15 and 70 and sstables 1…12, 15…39, 77…97, 82…95. For Seek(85), Thread 1 and Thread 2 each seek in one of the two sstables under guard 70]

• Seek based compaction: Triggers compaction for a level during a seek-heavy workload
• Reduce the average number of sstables per guard
• Reduce the number of active levels

Seek based compaction increases write I/O, as a trade-off to improve seek performance
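A minimal sketch of the parallel-seek idea: one worker per sstable in the guard, with the smallest result winning. The sstable contents are hypothetical (chosen to fit the ranges 77…97 and 82…95 from the figure), and `parallel_seek` is an illustrative helper rather than PebblesDB code.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def seek_in_file(keys, target):
    """Return the smallest key >= target in one sorted sstable, or None."""
    pos = bisect_left(keys, target)
    return keys[pos] if pos < len(keys) else None

def parallel_seek(files, target):
    """Seek in every sstable of a guard concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(files))) as pool:
        results = list(pool.map(lambda keys: seek_in_file(keys, target), files))
    hits = [key for key in results if key is not None]
    return min(hits) if hits else None

# Hypothetical contents of the two sstables under guard 70.
guard_70 = [[77, 80, 91, 97], [82, 88, 95]]
print(parallel_seek(guard_70, 85))  # 88
```

In this in-memory sketch the threads add no speed; the benefit in a real store would come from overlapping the storage reads of the individual sstables.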

Tuning PebblesDB

• PebblesDB characteristics like the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations all depend on one parameter, maxFilesPerGuard (default 2 in PebblesDB)
• Setting this to a very high value favors write throughput
• Setting this to a very low value favors read throughput
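The trade-off controlled by maxFilesPerGuard can be illustrated with a toy model of a single guard. The trigger rule used here (compact and empty the guard as soon as it holds more files than the limit) is an assumption made for illustration only; it is not taken from the PebblesDB source, and `simulate` is a hypothetical helper.

```python
def simulate(num_flushes, max_files_per_guard):
    """Toy model of one guard receiving num_flushes new sstables.

    Returns (compactions triggered, worst-case files a read must check).
    Assumption: the guard is compacted, and its files move to the next
    level, as soon as it holds more than max_files_per_guard files.
    """
    files = 0
    compactions = 0
    worst_read = 0
    for _ in range(num_flushes):
        files += 1
        if files > max_files_per_guard:
            compactions += 1
            files = 0
        worst_read = max(worst_read, files)
    return compactions, worst_read

for limit in (1, 2, 4):
    print(limit, simulate(12, limit))
# 1 (6, 1)
# 2 (4, 2)
# 4 (2, 4)
```

In this toy model a higher limit means fewer compactions (less rewriting, which favors writes) but more sstables that a read or seek may have to check within the guard.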

Horizontal compaction

• Files compacted within the same level for the last two levels in PebblesDB
• Some optimizations to prevent huge increase in write IO

Experimental setup

• Intel Xeon 2.8 GHz processor
• 16 GB RAM
• Running Ubuntu 16.04 LTS with the Linux 4.4 kernel
• Software RAID0 over 2 Intel 750 SSDs (1.2 TB each)
• Datasets in experiments 3x bigger than DRAM size

Write amplification

• Inserted different number of keys with key size 16 bytes and value size 128 bytes

[Figure: bar chart, write IO ratio wrt PebblesDB (y-axis, 0 to 5) against number of keys inserted (x-axis). Value labels on the chart, in x-axis order:]
• 10M keys: 7.2 GB
• 100M keys: 100.7 GB
• 500M keys: 756 GB


Micro-benchmarks

• Used db_bench tool that ships with LevelDB
• Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB
• Number of read/seek operations: 10M

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 3) per benchmark. Value labels on the chart, in x-axis order:]
• Seq-Writes: 239.05 Kops/s
• Random-Writes: 11.72 Kops/s
• Reads: 6.89 Kops/s
• Range-Queries: 7.5 Kops/s
• Deletes: 126.2 Kops/s

Multi threaded micro-benchmarks

• Writes – 4 threads each writing 10M
• Reads – 4 threads each reading 10M
• Mixed – 2 threads writing and 2 threads reading (each 10M)

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 2.5) per benchmark. Value labels on the chart, in x-axis order:]
• Writes: 44.4 Kops/s
• Reads: 40.2 Kops/s
• Mixed: 38.8 Kops/s

Small cached dataset

• Insert 1M key-value pairs with 16 bytes key and 1 KB value
• Total data set (~1 GB) fits within memory
• PebblesDB-1: with maximum one file per guard

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 2.5) per benchmark. Value labels on the chart, in x-axis order:]
• Writes: 45.25 Kops/s
• Reads: 205.76 Kops/s
• Range-Queries: 205.34 Kops/s

Small key-value pairs

• Inserted 300M key-value pairs
• Key 16 bytes and 128 bytes value

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 3.5) per benchmark. Value labels on the chart, in x-axis order:]
• Writes: 44.48 Kops/s
• Reads: 6.34 Kops/s
• Range-Queries: 6.31 Kops/s

Aged FS and KV store

• File system aging: Fill up 89% of the file system
• KV store aging: Insert 50M, delete 20M and update 20M key-value pairs in random order

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 2.5) per benchmark. Value labels on the chart, in x-axis order:]
• Writes: 17.37 Kops/s
• Reads: 5.65 Kops/s
• Range-Queries: 6.29 Kops/s

Low memory micro-benchmark

• 100M key-value pairs with 1 KB values (~65 GB data set)
• DRAM was limited to 4 GB

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 2.5) per benchmark. Value labels on the chart, in x-axis order:]
• Writes: 27.78 Kops/s
• Reads: 2.86 Kops/s
• Range-Queries: 4.37 Kops/s

Impact of empty guards

• Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes
• Incrementally inserted new 20M keys after deleting the older keys
• Around 9000 empty guards at the start of the last iteration
• Read latency did not reduce with the increase in empty guards

NoSQL stores - HyperDex

• HyperDex – distributed key-value store from Cornell
• Inserted 20M key-value pairs with 1 KB value size and 10M operations

[Figure: bar chart, throughput ratio wrt HyperLevelDB (y-axis, 0 to 1.8) per YCSB workload, plus total IO. Value labels on the chart, in x-axis order:]
• Load A - 100% writes: 22.08 Kops/s
• Run A - 50% reads, 50% writes: 21.85 Kops/s
• Run B - 95% reads, 5% writes: 31.17 Kops/s
• Run C - 100% reads: 32.75 Kops/s
• Run D - 95% reads (latest), 5% writes: 38.02 Kops/s
• Load E - 100% writes: 7.62 Kops/s
• Run E - 95% range queries, 5% writes: 0.37 Kops/s
• Run F - 50% reads, 50% read-modify-writes: 19.11 Kops/s
• Total IO: 1349.5 GB

CPU usage

• Median CPU usage by inserting 30M keys and reading 10M keys
• PebblesDB: ~171%
• Other key-value stores: 98-110%
• Due to aggressive compaction, more CPU operations due to merging multiple files in a guard

Memory usage

• 100M records (16 bytes key, 1 KB value) – 106 GB data set
• 300 MB memory space
• 0.3% of data set size

• Worst case: 100M records (16 bytes key, 16 bytes value) ~3.2 GB
• 9% of data set size
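The two percentages can be checked with a short calculation. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and assumes the worst case uses the same 300 MB of memory as the first case; neither assumption is stated on the slide, and `overhead_percent` is a hypothetical helper.

```python
MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes

def overhead_percent(memory_bytes, dataset_bytes):
    """Memory use expressed as a percentage of the data set size."""
    return 100.0 * memory_bytes / dataset_bytes

memory = 300 * MB  # assumed to be the same in both cases

large_values = overhead_percent(memory, 106 * GB)  # 16 B key, 1 KB value
small_values = overhead_percent(memory, 3.2 * GB)  # 16 B key, 16 B value

print(f"{large_values:.2f}%")  # 0.28%
print(f"{small_values:.1f}%")  # 9.4%
```

Both results agree with the rounded figures on the slide (0.3% and 9%).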

Bloom filter calculation cost

• 1.2 sec per GB of sstable
• 3200 files – 52 GB – 62 seconds

Impact of different optimizations

• Sstable level bloom filters improve read performance by 63%
• PebblesDB without optimizations for seek – 66%

Thank you! Questions?