Implementing a key/value store
![Page 1: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/1.jpg)
50 AVENUE DES CHAMPS-ÉLYSÉES 75008 PARIS > FRANCE > WWW.OCTO.COM
Implementing a Key / Value Store
BluckDB
BJC - BOF - 15/12/16
github.com/BenJoyenConseil/bluckdb
@BenJoyeConseil
![Page 2: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/2.jpg)
Why ...
◉ Understand the mechanisms of modern databases
◉ Explore algorithms and data structures
◉ Do some “low-level” work
◉ Learn Go
◉ Not show up unprepared for the HBase training
Reinventing the wheel?
![Page 3: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/3.jpg)
>01 Positioning the kv store
![Page 4: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/4.jpg)
Often presented like this
Storage engine!
![Page 5: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/5.jpg)
◉ LevelDB (LSM-Tree)
◉ RocksDB (LSM-Tree)
◉ WiredTiger (LSM-Tree)
◉ ForestDB (HB+Trie)
◉ InnoDB (B+Tree)
◉ BoltDB (B+Tree)
◉ Kyoto Cabinet (Hashtable)
◉ BluckDB (Hashtable)
◉ ...
Examples of KV stores used as a storage engine: the database server (MongoDB / MySQL / Riak / Lucene / ...) runs on top of a KV storage engine (LevelDB ... InnoDB), which in turn sits on the file system.
![Page 6: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/6.jpg)
CockroachDB uses RocksDB as its storage engine
![Page 7: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/7.jpg)
Quotable quote
All models are wrong but some are useful
— George Box
![Page 8: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/8.jpg)
![Page 9: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/9.jpg)
>02 Deep Dive
![Page 10: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/10.jpg)
The design
1. Data storage abstraction
2. Data structure (index)
3. Memory management (page / block management, free space)
4. String / byte slice
5. Iterator / Cursor
6. Lock management
7. Comparator
Top 7 components in a kv store (ref: Topito article)
![Page 11: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/11.jpg)
The interface
type KVStore interface {
    Get(k string) string
    Put(k, v string)
    Delete(k string)
}
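Before any persistence exists, the interface can be satisfied by a plain map. A minimal sketch for reference (not part of BluckDB; `MapStore` is an illustrative name):

```go
package main

import "fmt"

// KVStore is the interface from the slide.
type KVStore interface {
	Get(k string) string
	Put(k, v string)
	Delete(k string)
}

// MapStore: the simplest possible implementation, an in-memory map.
type MapStore struct {
	m map[string]string
}

func NewMapStore() *MapStore { return &MapStore{m: make(map[string]string)} }

func (s *MapStore) Get(k string) string { return s.m[k] }
func (s *MapStore) Put(k, v string)     { s.m[k] = v }
func (s *MapStore) Delete(k string)     { delete(s.m, k) }

func main() {
	var store KVStore = NewMapStore()
	store.Put("key", "value")
	fmt.Println(store.Get("key")) // prints "value"
	store.Delete("key")
	fmt.Println(store.Get("key") == "") // prints "true"
}
```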
![Page 12: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/12.jpg)
First implem’
◉ Simple hashmap (separate chaining)
◉ Persistent store:
> Put -> append to file
> Get -> foreach line, split(‘:’)
MVP
Diagram: a fixed number of bucket files; hash(k) % 3 (static hashing) picks the bucket, and each Put appends to the end of that bucket file.
![Page 13: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/13.jpg)
First implem’
◉ Benchmarks persistent store
BenchmarkPutNaiveDiskKVStore-4   200000      6250 ns/op -> 6.2 µs
BenchmarkGetNaiveDiskKVStore-4       30  44017416 ns/op -> 44 ms
◉ Benchmarks in-memory hashmap
BenchmarkPutHashMap-4           1000000      1385 ns/op -> 1.3 µs
BenchmarkGetHashMap-4           2000000       711 ns/op -> 0.7 µs
MVP
![Page 14: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/14.jpg)
There’s clearly a trade-off between reads and writes, and it’s the mixing of
the two that causes all of the interesting challenges
— Adrian Colyer
![Page 15: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/15.jpg)
>03 OK, shall we do a real design now?
Hashtables are arguably the single most important data structure known
to mankind.
— Steve Yegge
![Page 16: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/16.jpg)
The design
1. Data storage abstraction -> SSD Page 4k + Record
2. Data structure (index) -> Hashtable (extendible hash)
3. Memory management (page / block management, free space) -> mmap + custom
4. String / byte slice -> Go string native conv to []byte slice
5. Iterator / Cursor -> Pattern iterator
6. Lock management -> handled externally
7. Comparator -> multi-level comparator (key length > hash > byte)
Top 7 components in a kv store (ref: Topito article)
![Page 17: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/17.jpg)
Record layout
type Record interface {
    key() []byte
    val() []byte
    valLen() uint16 // min 0
    keyLen() uint16 // max 65535
}

type ByteRecord []byte

// a record is just a view over a slice of the underlying bytes
r := ByteRecord(byteArray[204 : 249])

overhead: 4 bytes
![Page 18: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/18.jpg)
Record layout
func (r ByteRecord) Write(key, val string) {
    lenKey, lenVal := len(key), len(val)
    total := lenKey + lenVal
    copy(r[:], key)
    copy(r[lenKey:], val)
    binary.LittleEndian.PutUint16(r[total:], uint16(lenVal))
    binary.LittleEndian.PutUint16(r[total+RECORD_HEADER_SIZE:], uint16(lenKey))
}

Serialization: ... | k e y | v a l u e | 0x05 0x00 (valLen) | 0x03 0x00 (keyLen) | ...
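Putting the two record slides together, a self-contained round trip over this layout looks like the sketch below. The read-side methods (`keyLen`, `valLen`, `key`, `val`) are assumptions inferred from the trailer layout (they assume the slice ends exactly at the trailer), and `RECORD_HEADER_SIZE` is taken as the size of one `uint16` length field:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const RECORD_HEADER_SIZE = 2 // one little-endian uint16 length field

type ByteRecord []byte

// Write serializes key then val, followed by a 4-byte trailer:
// valLen (uint16), then keyLen (uint16), both little-endian.
func (r ByteRecord) Write(key, val string) {
	lenKey, lenVal := len(key), len(val)
	total := lenKey + lenVal
	copy(r[:], key)
	copy(r[lenKey:], val)
	binary.LittleEndian.PutUint16(r[total:], uint16(lenVal))
	binary.LittleEndian.PutUint16(r[total+RECORD_HEADER_SIZE:], uint16(lenKey))
}

// The trailer is read from the end of the slice.
func (r ByteRecord) keyLen() uint16 {
	return binary.LittleEndian.Uint16(r[len(r)-RECORD_HEADER_SIZE:])
}

func (r ByteRecord) valLen() uint16 {
	return binary.LittleEndian.Uint16(r[len(r)-2*RECORD_HEADER_SIZE:])
}

func (r ByteRecord) key() []byte { return r[:r.keyLen()] }
func (r ByteRecord) val() []byte { return r[r.keyLen() : r.keyLen()+r.valLen()] }

func main() {
	buf := make([]byte, len("key")+len("value")+2*RECORD_HEADER_SIZE)
	r := ByteRecord(buf)
	r.Write("key", "value")
	fmt.Printf("% x\n", buf) // prints "6b 65 79 76 61 6c 75 65 05 00 03 00"
	fmt.Println(string(r.key()), string(r.val())) // prints "key value"
}
```

The printed bytes match the slide's layout: the raw key, the raw value, then `0x05 0x00` (valLen) and `0x03 0x00` (keyLen).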
![Page 19: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/19.jpg)
Page layout
type Page []byte

const (
    PAGE_SIZE               = 4096
    PAGE_USE_OFFSET         = 4094
    PAGE_LOCAL_DEPTH_OFFSET = 4092
)

func (p Page) Use() int {
    return int(binary.LittleEndian.Uint16(p[PAGE_USE_OFFSET:]))
}
...

Layout: Record1 | Record2 | Record3 | Record1 v2 | ... free space ... | LD | USE
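A sketch of appending a record into such a page: records fill the page from offset 0, and the last four bytes hold the local depth (LD) and the used byte count (USE) as little-endian uint16s. `setUse` is an assumed helper, not shown on the slide:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	PAGE_SIZE               = 4096
	PAGE_USE_OFFSET         = 4094
	PAGE_LOCAL_DEPTH_OFFSET = 4092
)

type Page []byte

// Use returns how many bytes of the page are occupied by records.
func (p Page) Use() int {
	return int(binary.LittleEndian.Uint16(p[PAGE_USE_OFFSET:]))
}

// LocalDepth returns the page's ld, used by extendible hashing.
func (p Page) LocalDepth() int {
	return int(binary.LittleEndian.Uint16(p[PAGE_LOCAL_DEPTH_OFFSET:]))
}

// setUse updates the USE counter in the trailer (assumed helper).
func (p Page) setUse(n int) {
	binary.LittleEndian.PutUint16(p[PAGE_USE_OFFSET:], uint16(n))
}

func main() {
	p := Page(make([]byte, PAGE_SIZE))
	// a 12-byte record serialized as on the previous slide
	rec := []byte("keyvalue\x05\x00\x03\x00")
	copy(p[p.Use():], rec) // append at the first free byte
	p.setUse(p.Use() + len(rec))
	fmt.Println(p.Use()) // prints 12
}
```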
![Page 20: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/20.jpg)
Extendible Hashing algorithm
dynamic hashing
◉ A hash function that produces values over a large range (typically an int32)
◉ A prefix of the hash result is used to compute the index into the address table
◉ Several entries of the address table can point to the same page
![Page 21: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/21.jpg)
Extendible Hashing algorithm
Diagram (GD=2): the address table has four entries; ..00 and ..01 both point to p0, ..10 points to p1, ..11 points to p2.
page 1 (ld=1): ...0100 key, value | ...1101 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value
![Page 22: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/22.jpg)
Extendible Hashing algorithm
After the split
Diagram (GD=2): ..00 -> p0, ..01 -> p4, ..10 -> p1, ..11 -> p2.
page 1 (ld=2): ...0100 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value
page 4 (ld=2): ...1101 key, value
![Page 23: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/23.jpg)
Extendible Hashing algorithm
After the expand
Diagram (GD=3): the address table doubles to eight entries; .000 -> p0, .001 -> p4, .010 -> p1, .011 -> p2, .100 -> p0, .101 -> p4, .110 -> p1, .111 -> p2. All pages keep ld=2.
page 1 (ld=2): ...0100 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value
page 4 (ld=2): ...1101 key, value
![Page 24: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/24.jpg)
Directory layout
func (dir *Directory) extendibleHash(k util.Hashable) int {
    return k.Hash() & ((1 << dir.Gd) - 1)
}

func (dir *Directory) getPage(k string) Page {
    hash := dir.extendibleHash(util.Key(k))
    id := dir.Table[hash]
    offset := id * PAGE_SIZE
    return Page(dir.data[offset : offset+PAGE_SIZE])
}

dir := &Directory{
    Table:      []int{0, 1, 3, 2, 0, 1, 3, 2},
    Gd:         2,
    LastPageId: 3,
    data:       []byte{...}, // Mmap
}
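A minimal in-memory illustration of this addressing, using the post-split state from the earlier diagram (`..00 -> p0, ..01 -> p4, ..10 -> p1, ..11 -> p2`). The hash inputs and types are simplified assumptions; the real directory mmaps its pages:

```go
package main

import "fmt"

// Directory maps the low Gd bits of a hash to a page id; several
// table entries may share one page id.
type Directory struct {
	Table []int // entry index -> page id
	Gd    uint  // global depth
}

// extendibleHash keeps only the Gd low-order bits of the hash.
func (dir *Directory) extendibleHash(h int) int {
	return h & ((1 << dir.Gd) - 1)
}

func main() {
	dir := &Directory{Table: []int{0, 4, 1, 2}, Gd: 2}

	// The four keys from the diagram land on their expected pages.
	for _, h := range []int{0b0100, 0b1101, 0b0110, 0b1011} {
		entry := dir.extendibleHash(h)
		fmt.Printf("hash ...%04b -> entry %02b -> page %d\n",
			h, entry, dir.Table[entry])
	}
}
```

Doubling the directory ("expand") only duplicates the table entries; no page moves, which is why `GD` can grow cheaply while every page keeps its `ld`.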
![Page 25: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/25.jpg)
Design: component overview
◉ Bluck server: http.Listen, Bluckstore
◉ Core (KVStore): Get(k), Put(k,v), Delete(k), PageGC(), ld()
◉ Memory management (DirectoryMmap): open(), close(), persistMeta()
◉ Lock management: RWMutex
◉ Data storage: File, Record.Write(), Iterator
![Page 26: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/26.jpg)
>04 Shall we talk about perf?
Memory maps are the best thing known to mankind after hash tables.
— Emmanuel Goossaert
![Page 27: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/27.jpg)
Memory mapping
◉ Mmap file -> []byte
> Zero copy, no user-space to kernel-space transitions
> No buffer to manage for flushing
> No block cache
◉ Pre-allocating pages to speed up mmap
Optimizations
![Page 28: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/28.jpg)
Timeline of the changes
Benchmarks (Go), in ns/op:

| Change | Get | Put |
| --- | --- | --- |
| Mmap | 2344 | 5452 |
| Update | 2881 | 10532 |
| Iterator | 8796 | 11102 |
| ByteRecord | 1819 | 9982 |
| GOB serde | 1874 | 3206 |
| Reverse | 1406 | 1529 |
| Flush Meta | 1398 | 2786 |
| Pre-allocation | 1408 | 1359 |
![Page 29: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/29.jpg)
The features that make everything BLOW UP!!!
◉ Update (in place?)
◉ Metadata (consistency)
◉ Delete (shift? scan?)
◉ Concurrency & Isolation
◉ Big records
![Page 30: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/30.jpg)
$ go tool pprof
Before:
Benchmark => iterations: 200,000   8796 ns/op   3376 B/op   106 allocs/op
(pprof) top10
1590ms of 1890ms total (84.13%)
Showing top 10 nodes out of 77 (cum >= 50ms)
      flat  flat%   sum%        cum   cum%
     500ms 26.46% 26.46%      920ms 48.68%  runtime.mallocgc
     260ms 13.76% 40.21%      260ms 13.76%  runtime.heapBitsSetType
     140ms  7.41% 47.62%     1630ms 86.24%  github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get
     140ms  7.41% 55.03%     1430ms 75.66%  runtime.convT2I
     120ms  6.35% 61.38%      120ms  6.35%  runtime.memclr
     120ms  6.35% 67.72%      120ms  6.35%  runtime.memmove
(cpu profile)
![Page 31: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/31.jpg)
$ go tool pprof
After:
Benchmark => iterations: 2,000,000   756 ns/op   16 B/op   1 allocs/op
(pprof) top10
2.33s of 2.36s total (98.73%)
Dropped 9 nodes (cum <= 0.01s)
Showing top 10 nodes out of 12 (cum >= 2.36s)
      flat  flat%   sum%        cum   cum%
     2.21s 93.64% 93.64%      2.33s 98.73%  github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get
     0.06s  2.54% 96.19%      0.11s  4.66%  runtime.mallocgc
     0.03s  1.27% 97.46%      0.03s  1.27%  runtime.heapBitsSetType
     0.02s  0.85% 98.31%      0.02s  0.85%  runtime.scanobject
     0.01s  0.42% 98.73%      0.12s  5.08%  runtime.newobject
(cpu profile)
![Page 32: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/32.jpg)
Recap of the trade-offs
◉ Mmap with page pre-allocation: deleting records does not free disk space
◉ Extendible hashing & hashtable: fast access, but persisting the metadata separately is very costly and risky (consistency-wise)
◉ Delete: marking kv pairs as “deleted” greatly increases write amplification, whereas shifting moves a lot of data
◉ Update: “in place” means handling defragmentation, whereas append-only means doing GC
◉ Concurrency: making a hashtable thread-safe complicates the code a lot (Mutex)
![Page 33: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/33.jpg)
Recap of the trade-offs
◉ The data structure (Hashtable, LSM-Tree, B+Tree, Trie, etc.) defines the trade-offs:
> Read latency
> Write latency
> Range scans
> Isolation level
> Consistency
> High availability
◉ Pick yours!!
![Page 34: Implementing a key/value store](https://reader033.fdocuments.net/reader033/viewer/2022051504/58f0ac291a28ab3e3d8b4569/html5/thumbnails/34.jpg)
BluckDB
github.com/BenJoyenConseil/bluckdb
@BenJoyeConseil
Fork it