Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground...

46

Transcript of Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground...

Page 1: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?
Page 2: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Scaling the Btrfs Free Space Cache

Omar SandovalVault 2016

Page 3: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Outline

Background

Design Background

Identifying the Problem

SolutionThe Free Space Tree

Results

Questions

Page 4: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Background

Page 5: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

What is Btrfs?

• General-purpose filesystem• Advanced features

• First-class snapshots• Checksumming of all data and metadata• Transparent compression• Integrated multi-device support (RAID)• Incremental backups with send/receive

• Actively developed

Page 6: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Contributors$ git log --since=”January 1, 2015” --pretty=format:”%an” --no-merges -- fs/btrfs

Filipe Manana Dan Carpenter Borislav Petkov Mike Snitzer

David Sterba Forrest Liu Andreas Gruenbacher Mel Gorman

Zhao Lei Yang Dongsheng Alex Lyakas Matthew Wilcox

Qu Wenruo Tsutomu Itoh Zach Brown Konstantin Khebnikov

Anand Jain Michal Hocko Yauhen Kharuzhy Jan Kara

Josef Bacik Kirill A. Shutemov Yaowei Bai Guenter Roeck

Omar Sandoval Jiri Kosina Vladimir Davydov Greg Kroah-Hartman

Chris Mason Daniel Dressler Tom Van Braeckel Geert Uytterhoeven

Zhaolei Wang Shilong Sudip Mukherjee Gabríel Arthúr Pétursson

Liu Bo Satoru Takeuchi Shilong Wang Dmitry Monakhov

Chandan Rajendra Naohiro Aota Shaohua Li Deepa Dinamani

Dongsheng Yang Luis de Bethencourt Shan Hai Davide Italiano

Alexandru Moise Kinglong Mee Sebastian Andrzej Siewior Dave Jones

Mark Fasheh Kent Overstreet Sasha Levin Darrick J. Wong

Eric Sandeen Justin Maggard Sam Tygier Colin Ian King

Jeff Mahoney Jens Axboe Robin Ruede Chengyu Song

Byongho Lee Holger Hoffstätte Rasmus Villemoes chandan r

Christoph Hellwig Gui Hecheng Quentin Casasnovas Ashish Samant

Al Viro Fabian Frederick Pranith Kumar Arnd Bergmann

chandan David Howells Oleg Nesterov Adam Buchbinder

Geliang Tang Christian Engelmayer NeilBrown

Page 7: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Btrfs at Facebook

• GlusterFS• Testing on other tiers

• Web servers• Build machines• More...

Page 8: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Design Background

Page 9: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Copy-on-Write B-trees

• B+-trees with no leaf chaining• Lock coupling• Relaxed balancing contraints, proactive balancing• Lazy reference-counting• Update in memory, write out in periodic commits• Due to O. Rodeh

Page 10: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Copy-on-Write B-tree Insertion

1, 4, 10

1, 2 5, 6, 7 10, 11

Page 11: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Copy-on-Write B-tree Insertion

1, 4, 10

1, 2 5, 6, 7 10, 11

1, 4, 10

10, 11, 19

Page 12: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Copy-on-Write B-tree Deletion

1, 4, 10

1, 2 5, 6, 7 10, 11

Page 13: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Copy-on-Write B-tree Deletion

1, 4, 10

1, 2 5, 6, 7 10, 11

1, 4, 10

5, 7

Page 14: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Btrfs B-trees

• Generic data structure• Three main structures:

• Block header: header of every block in the B-tree, contains checksum, flags, filesystem ID,generation, etc.

• Key : comprises 64-bit object ID, 8-bit type, and 64-bit offset• Item: comprises key, 32-bit offset, and 32-bit size

• Nodes contain only keys and block pointers to children• Leaves contain array of items and data

Page 15: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Nodes

struct btrfs_disk_keyobjectid=256 type=INODE_ITEM offset=0

Page 16: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Nodes

struct btrfs_key_ptrstruct btrfs_disk_key

256 INODE_ITEM 0blockptr=46976991232 generation=51107

Page 17: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Nodes

struct btrfs_node

struct btrfs_headerstruct btrfs_key_ptr

struct btrfs_disk_key256 INODE_ITEM 0

46976991232 51107

struct btrfs_key_ptrstruct btrfs_disk_key

261 DIR_ITEM 35889527346976319488 51107

...

Page 18: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Leaves

struct btrfs_itemstruct btrfs_disk_key

256 INODE_ITEM 0offset=16123 size=160

Page 19: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Leaves

struct btrfs_leaf

struct btrfs_headerstruct btrfs_item

struct btrfs_disk_key256 INODE_ITEM 0

16123 160

struct btrfs_itemstruct btrfs_disk_key

256 INODE_REF 25616111 12

... struct btrfs_inode_ref struct btrfs_inode_item

Page 20: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-tree Structure

struct btrfs_nodestruct btrfs_header struct btrfs_key_ptr struct btrfs_key_ptr ...

... ...

struct btrfs_nodestruct btrfs_header struct btrfs_key_ptr struct btrfs_key_ptr ...

struct btrfs_leafstruct btrfs_header struct btrfs_item struct btrfs_item ... data data

...

Page 21: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-trees Used in Btrfs

• Root tree• FS trees• Extent tree• Checksum tree• Chunk tree, device tree• Relocation tree, log tree, quota tree, UUID tree

• Free space tree: subject of this talk!

Page 22: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

B-trees Used in Btrfs

• Root tree• FS trees• Extent tree• Checksum tree• Chunk tree, device tree• Relocation tree, log tree, quota tree, UUID tree• Free space tree: subject of this talk!

Page 23: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Identifying the Problem

Page 24: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Symptoms

• Large (tens of terabytes), busy filesystems on GlusterFS• Stalls during commits, sometimes seconds long• Narrowed down the problem to write-out of the free space cache

Page 25: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Terminology

• Chunk: physical unit of allocation, 1 GBfor data, 256 MB for metadata

• Block group: logical unit of allocation,comprises one or more chunks

• Extent: range of filesystem used for aparticular purpose

Chunks

Disk 1 Disk 2

Block group

Extent

Figure: Basic terminology

Page 26: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Extent Tree

• Tracks allocated space with references• Three types of items:

• Block group items• Extent items• Metadata items

EXTENT_ITEM EXTENT_ITEM

BLOCK_GROUP_ITEM

Figure: Data block group

Page 27: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Extent Tree

• Tracks allocated space with references• Three types of items:

• Block group items• Extent items• Metadata items

METADATA_ITEM METADATA_ITEM METADATA_ITEM

BLOCK_GROUP_ITEM

Figure: Metadata block group

Page 28: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Free Space Cache

• Track free space in memory after loading a block group• Red-black tree• Hybrid of extents + bitmaps

• Loading directly from the extent tree is slow• Instead, dump the in-memory cache to disk

• Special free space inodes, one per block group• Compact, very fast to load• Has to be to written out in entirety

Page 29: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Delayed Refs

• Don’t want to update extent tree on diskfor every operation

• Instead, track updates in memory(red-black tree)

• Run at commit time as a batch

32

16 128

8 24

add 2 drop 1 add 1

Figure: Delayed refs tree

Page 30: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Diagnosis

• Large, busy filesystem can dirty many block groups• Have to write out the free space cache for each one• Run during the critical section of the commit

• Blocks other operations

Page 31: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Solution

Page 32: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

False Start

• Start writing out free space cache outside of critical section• Redo block groups that got dirtied again during that window• Still possible for a lot to get dirtied again

• Back to square one

Page 33: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

The Free Space Tree

Page 34: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Key Ideas

• Use another B-tree instead• Inverse of the extent tree

Page 35: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

On-Disk Format

• Per-block group free space info• Extent count• Flags

• Free space extents• Start• Length

• Free space bitmaps• Start• Length• Bitmap

FREE_SPACE_EXTENT FREE_SPACE_EXTENT

EXTENT_ITEM EXTENT_ITEM

Figure: Free space tree extents

Page 36: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

On-Disk Format

• Per-block group free space info• Extent count• Flags

• Free space extents• Start• Length

• Free space bitmaps• Start• Length• Bitmap

FREE_SPACE_BITMAP FREE_SPACE_BITMAP

EXTENT_ITEM EXTENT_ITEM

Figure: Free space tree bitmaps

Page 37: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Maintaining the Free Space Tree

• Piggy-back updates on delayed-refs• Get batching for free

• Convert between extents or bitmaps• Thresholds for whichever is more space-efficient• Buffer to avoid thrashing between formats

Page 38: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Results

Page 39: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Measuring Commit Overhead

• Use fallocate and unlink to dirty a lot of block groups• Sparse filesystem image on loop device to allow for exaggerated sizes (50TB of dirtiedspace)

• Measure time spent committing, esp. in critical section

Page 40: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Commit Overhead

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

Free space cache Free space tree No cache

Tim

e s

pent in

critical section p

er

tran

sactio

n (

ms)

Free space commit overhead

Page 41: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Commit Overhead

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Free space tree No cache

Tim

e s

pent in

critical section p

er

tran

sactio

n (

ms)

Free space commit overhead

Page 42: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Measuring Load Overhead

• Create a fragmented filesystem with fs_mark

• Unmount, remount• Run fs_mark again• Measure time spent loading space cache

Page 43: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Load Overhead

0

50

100

150

200

250

Free space cache Free space tree No cache

Tota

l tim

e s

pent cachin

g b

lock g

roup

s (

ms)

Free space loading overhead

Page 44: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Try It Out

mount -o space_cache=v2(Since Linux v4.5)

Page 45: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?

Questions

Page 46: Scaling the Btrfs Free Space Cache · Vault2016. Outline Background DesignBackground IdentifyingtheProblem Solution TheFreeSpaceTree Results Questions. Background. WhatisBtrfs?