Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits...

40
2019 Storage Developer Conference. © Carnegie Mellon University. All Rights Reserved. 1 Breaking the Metadata Bottleneck: the Exascale Filesystem DeltaFS as a LANL and Carnegie Mellon Collaboration Qing Zheng Carnegie Mellon University

Transcript of Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits...

Page 1: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

2019 Storage Developer Conference. © Carnegie Mellon University. All Rights Reserved. 1

Breaking the Metadata Bottleneck: the Exascale Filesystem DeltaFS as a LANL and Carnegie Mellon Collaboration

Qing ZhengCarnegie Mellon University

Page 2: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Qing ZhengChuck Cranor, Greg Ganger, Garth Gibson, George Amvrosiadis

Bradley Settlemyer†, Gary Grider†

Carnegie Mellon University†Los Alamos National Laboratory

The Exascale Filesystem DeltaFS as aLANL and CMU Collaboration

Breaking�the�Metadata�Bottleneck:

Page 3: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Everyone Loves Fast Storage

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

DeltaFS: 20,000x faster than FS today

3Image from http://esp.igpp.ucla.edu illustrating earth’s magnetic field under the influence of the solar wind.

Page 4: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Everyone Loves Fast Storage

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

DeltaFS: 20,000x faster than FS todayHow long does it take to

insert 2 trillion particle filesinto a fs directory?

57days

2mins

OR4

Page 5: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Existing FS uses Dedicated Resources

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Figure shows CMU’s NASD (OSD) design (now ANSI T10), root of many today’s distributed filesystem designs.

File

syst

em

Clie

nts

Salable Object Storage

MDSMetadata Server (MDS)

5

Page 6: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

MDS often a Bottleneck

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

File

syst

em

Clie

nts

Salable Object Storage

MDS

It could take FOREVER to finish all metadata ops

6

Page 7: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

MDS often a Bottleneck

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

File

syst

em

Clie

nts

Salable Object Storage

MDS

It could take FOREVER to finish all metadata ops

57days

7

Page 8: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Common Ways for Stronger MDS

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

File

syst

em

Clie

nts

Salable Object Storage

MDSA) Better Representation

B) Better Namespace Partitioning

C) Deeper Layering

8

Page 9: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

We Could Build Something Like This

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

File

syst

em

Clie

nts

Salable Object Storage

MDSLSM

Met

adat

a Ca

che

“MDS 2.0”

Namespace spread across 2 servers

MDSLSM

LSM-Trees for high write throughput

A caching tier for fast reads

9

Page 10: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

File

syst

em

Clie

nts

Salable Object Storage

MDSLSM

Met

adat

a Ca

che

MDS 2.0

Namespace spread across 2 servers

MDSLSM

LSM-Tree for fast write throughput

A caching tier for fast reads

Need 800 servers if each can do 10 million file creates/s.

Might work but would be

EXTREMELY INEFFICIENTin delivering 1 trillion file creates in 2 mins

10

Page 11: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Budget is Fixed for Each Machine

More MDS nodes means less compute nodesMDS not busy all the time

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Compute Nodes(e.g., 10K)

Storage Nodes(e.g., 100)

MDS(e.g., 4 )

8009K

11

Page 12: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Budget is Fixed for Each Machine

We blame the bar that separates the nodesA waste: unable to use MDS nodes to run jobs

A much bigger waste: unable to utilize compute nodes to process metadata

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Compute Nodes(e.g., 10K)

Storage Nodes(e.g., 100)

MDS(e.g., 4 )

8009K

12

Page 13: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

A BOLD idea: having filesystems run directly on job nodes (DeltaFS)

Page 14: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Shared Object Storage

Today: A Dedicated MDS Per Machine

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job1

Job2

MDSImg

A shared namespace

Persistent state

14

Page 15: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Better: Dynamically Instantiating MDS for Jobs

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Shared Object Storage

Job1

MDS1

Job2

MDS2Img2

Img1MDS

No dedicated MDS

15

Page 16: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Immediate Benefits from No Dedicated MDSSimplified cluster designNo need to pool resources for MDS during cluster planning

No false sharingMy cache entries do not get invalidated by someone else’s activities

Highly agile scalabilityLarger jobs can devote more resources to MDS

Better resource utilizationWould-be idle CPU cycles can be utilized to process metadata

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 16

Page 17: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Does this really work for my applications?

Page 18: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Three Types of Interaction

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

No sharingDifferent jobs access different

sets of files

Concurrent sharingMultiple jobs read & write a

same set of files

Sequential sharingOne job’s output is another job’s

input

Works trivially today: 1 dedicated MDS, 1 global namespace

But a global namespace is not always required for existing jobs to work

18

Page 19: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job1

MDS1

Job2MDS2

Img2

Img1

UnderlyingStorage

Unrelated Jobs Do not Have to See Each Other’s Data

19

Page 20: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Concurrent Sharing? Connect to the LeaderOne use case: user monitoring such as “ls -l” & “tail -F”

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Underlying Storage

Job1MDS1Img1

Job3

Job2 LeaderFollower

Follower

20

Page 21: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Need Another Job’s Data? Just Mount it & Carry on

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job2

Job1

Img

Underlying Storage

MOUNT

UNM

OUNT

21

Page 22: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Mount Many If Necessary

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job3

Job1

Img1

Underlying Storage

Img2

Job2

MOUNT 1

MOUNT 2

22

Page 23: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

A namespace is as good as a global namespace if a job sees all related data

Page 24: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Re-imagining filesystems for future

Page 25: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Machine-Oriented v.s. Job-Oriented

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

A component of a machine

Always ON, centralizedUses a fixed set of dedicated nodesLong-standingAccessible from every node of a machineA shared FS image per machine

Runs background activities (e.g., reorganizing indexes for fast reads)One piece of code

A component of a running job

Dynamically instantiated by jobsHighly agile: scales with job allocationsTransient: lives within a jobPrivate: accessed only by a jobNo false sharing: one per job

No jitters: all background FS work is scheduled by jobsSoftware-defined: code optimized for the work at hand

25

Page 26: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Machine-Oriented v.s. Job-Oriented

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

A component of a machine

Always ON, centralizedUses a fixed set of dedicated nodesLong-standingAccessible from every node of a machineA shared FS image per machine

Runs background activities (e.g., reorganizing indexes for fast reads)One piece of code

A component of a running job

Dynamically instantiated by jobsHighly agile: scales with job allocationsTransient: lives within a jobPrivate: accessed only by a jobNo false sharing: one per job

No jitters: all background FS work is scheduled by jobsSoftware-defined: code optimized for the work at hand

26

Page 27: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Each job can be viewed as a process groupA group of processes self-found their MDS service

Decoupling MDS from the Machine

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job

Proc

ess

1

Proc

ess

2

Proc

ess

n

MD

S…

Job

Proc

ess

1

Proc

ess

2

Proc

ess

n

MDS MDSMDS

One option: MDS runs as a separate job processDecoupled from the machine

Another option: MDS runs as library within processesAgain, decoupled from the machine

27

Page 28: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

When a job ends, its FS “service” goes with itData stays in the underlying storage

Transient Service, Persistent Data

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Job

Proc

ess

1

Proc

ess

2

Proc

ess

n

MDS

Job

Proc

ess

1

Proc

ess

2

Proc

ess

n

MDS MDSMDS

28

Underlying Storage

Page 29: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Each Job Acts as a Function

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Creates a new FS image as output

Takes one or more FS images as input

λ1

2

No side effect329

Page 30: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Keeping input immutable so that they can be shared in a scalable way

Job

Log-Structured: Each Job Appends Changes to a Log

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Newer changesMetadata log Total “Δ” of state

Input Output

Each FS image essentially a pointer to a logical log

Delta, Diff

30

chg1

chg3

chg4

chg2

Page 31: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Turtles All the Way Down Reading from an FS image is searching through a DAG of “Δ”s

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 31

1 32

DA

CB

JobJob

D

A

C

B

DCAB

Job

Resolve conflicts using a job-specified ordering

Merging & flattening for fast readsvia log compaction

Order matters

Page 32: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Log compaction reduces search depth & reclaims spaceOften time-consuming

User Pays for Speed (by Scheduling Log Compactions)

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved.

Traditional: done by a dedicated MDSJitters or wasted work

Better: explicitly scheduled by appsPredictable high performance

32

Job1

Job2

Job1Job2

I’ll handle everything:“meh” speed for everyone

MDSOptimized for bulk

insertion

MDS1Job3

Log compaction

MDS2

Fast reads

MDS3

Page 33: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

How does my job find its input data?

Page 34: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

It’s All about Mapping Names to Data User specifies names; a mechanism handles the mapping

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 34

LANL’s Cray-1 (left) and Trinity computer (right), https://www.lanl.gov/asci/platforms/index.php

20151976The good old days: a job control system does

the mappingToday: a global filesystem namespace does

the mapping

Page 35: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

A New Kind of Mapper: Filesystem Image RegistryWorks like github.com, jobs “git-clone” their input datasets

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 35

Job 2Job 1

One FS image registry (can be many)

Publish1Access3

Get info by name

2

Publication & collection may be automated by workflow engines

Unde

rlying

St

orag

e

Img

A manifest, not the image itself

Page 36: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Which Registry did I Use? Ask a Catalog Service Is it github.com or bitbucket.org?

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 36

A catalog server (again, can be many) Indexes Subscribe

Job

1Search2

Get info3

Subscribe

Related talk: LANL’s catalog service GUFI by Dominic MannoSession 63, 2pm Wed, Lafayette room

Page 37: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Sounds Good. Remind me Why Perf. is Better…1. More CPUsAble to use more resources to do FS work2. More EfficientNo false sharing, less synchronization, better caching3. Software-DefinedSmart clients, simple storage

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 37

Page 38: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

Example: Making a Needle-in-a-Haystack Hero

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 38

Underlying Storage

12 billion file inserts/s

A job using 100K CPU cores w/ an embedded FS

Up-to 5000x faster queries than bulk scans

Under the hood: a) leveraged idle CPU cycles,b) deep writeback buffering, c) optimized storage layout

Page 39: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

ConclusionExisting FS clients sync too often with serversSynchronization of anything global should be avoided at extreme scales

Removing servers forces us to review what’s necessaryEnabling sequential sharing is where filesystems shrine

Need radically different models for shared storageA job-oriented filesystem scales better in many computing scenarios

2019 Storage Developer Conference © Carnegie Mellon University. All Rights Reserved. 39

Page 40: Qing Zheng Breaking the Metadata Bottleneck · Img1 MDS No dedicated MDS 15. Immediate Benefits from No Dedicated MDS Simplified cluster design No need to pool resources for MDS during

2019 Storage Developer Conference. © Carnegie Mellon University. All Rights Reserved. 40

Thank you.