Breaking the Metadata Bottleneck: the Exascale Filesystem DeltaFS as a LANL and CMU Collaboration

Qing Zheng, Chuck Cranor, Greg Ganger, Garth Gibson, George Amvrosiadis (Carnegie Mellon University)
Bradley Settlemyer, Gary Grider (Los Alamos National Laboratory)
Everyone Loves Fast Storage

DeltaFS: 20,000x faster than FS today.

[Image from http://esp.igpp.ucla.edu illustrating Earth's magnetic field under the influence of the solar wind.]

How long does it take to insert 2 trillion particle files into a fs directory? 57 days OR 2 mins.
Existing FS uses Dedicated Resources

[Figure: CMU's NASD (OSD) design (now ANSI T10), root of many of today's distributed filesystem designs: filesystem clients, a Metadata Server (MDS), and scalable object storage.]
MDS often a Bottleneck

[Figure: the same design; the MDS sits between filesystem clients and scalable object storage.]

It could take FOREVER to finish all metadata ops: 57 days.
Common Ways for a Stronger MDS

[Figure: the same design, with the MDS highlighted.]

A) Better Representation
B) Better Namespace Partitioning
C) Deeper Layering
We Could Build Something Like This

[Figure: "MDS 2.0": filesystem clients talk to a metadata cache backed by two LSM-based MDS nodes over scalable object storage.]

- Namespace spread across 2 servers
- LSM-Trees for high write throughput (see the sketch after this list)
- A caching tier for fast reads
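As a rough sketch of what such an LSM-based MDS node does on the write path (hypothetical code, not any production MDS): file creates land in a sorted in-memory memtable and are periodically sealed into immutable sorted runs, so inserts never rewrite existing data and write throughput stays high.

```cpp
// lsm_sketch.cc -- a minimal, hypothetical LSM-style metadata store.
// Writes go to an in-memory sorted memtable; when full, the memtable is
// sealed as an immutable sorted run. Reads search newest-first (a caching
// tier, as on the slide, would sit in front of the runs).
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct Inode { uint64_t ino; };  // stand-in for real file metadata

class LsmMetadataStore {
 public:
  explicit LsmMetadataStore(size_t memtable_limit) : limit_(memtable_limit) {}

  void Create(const std::string& path, Inode ino) {
    memtable_[path] = ino;               // insert; never rewrites old runs
    if (memtable_.size() >= limit_) Flush();
  }

  std::optional<Inode> Lookup(const std::string& path) const {
    if (auto it = memtable_.find(path); it != memtable_.end()) return it->second;
    for (auto run = runs_.rbegin(); run != runs_.rend(); ++run)  // newest first
      if (auto it = run->find(path); it != run->end()) return it->second;
    return std::nullopt;
  }

 private:
  void Flush() {                         // seal memtable as an immutable run
    runs_.push_back(std::move(memtable_));
    memtable_.clear();
  }
  size_t limit_;
  std::map<std::string, Inode> memtable_;           // mutable, sorted
  std::vector<std::map<std::string, Inode>> runs_;  // immutable, newest last
};

int main() {
  LsmMetadataStore mds(/*memtable_limit=*/2);
  mds.Create("/particles/p0", {1});
  mds.Create("/particles/p1", {2});      // triggers a flush
  mds.Create("/particles/p2", {3});
  if (auto ino = mds.Lookup("/particles/p0"))
    std::cout << "p0 -> inode " << ino->ino << "\n";
}
```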
Need 800 servers if each can do 10 million file creates/s. It might work, but would be EXTREMELY INEFFICIENT at delivering 1 trillion file creates in 2 mins.
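A quick back-of-the-envelope check of that server count, using only the numbers stated above:

\[
\frac{10^{12}\ \text{creates}}{120\ \text{s}} \approx 8.3\times10^{9}\ \text{creates/s},
\qquad
\frac{8.3\times10^{9}\ \text{creates/s}}{10^{7}\ \text{creates/s per server}} \approx 830 \approx 800\ \text{servers}.
\]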
Budget is Fixed for Each Machine

More MDS nodes means fewer compute nodes, and the MDS is not busy all the time.

[Figure: compute nodes (e.g., 10K), storage nodes (e.g., 100), MDS (e.g., 4); growing the MDS to 800 nodes would shrink compute to roughly 9K.]
Budget is Fixed for Each Machine

We blame the bar that separates the nodes:
- A waste: unable to use MDS nodes to run jobs
- A much bigger waste: unable to utilize compute nodes to process metadata
A BOLD idea: having filesystems run directly on job nodes (DeltaFS).

Today: A Dedicated MDS Per Machine

[Figure: Job1 and Job2 both go through one MDS and one FS image (Img) on shared object storage: a shared namespace, persistent state.]
Better: Dynamically Instantiating MDS for Jobs

[Figure: Job1 runs MDS1 over Img1; Job2 runs MDS2 over Img2; both on shared object storage. No dedicated MDS.]
Immediate Benefits from No Dedicated MDS

- Simplified cluster design: no need to pool resources for MDS during cluster planning
- No false sharing: my cache entries do not get invalidated by someone else's activities
- Highly agile scalability: larger jobs can devote more resources to MDS
- Better resource utilization: would-be idle CPU cycles can be utilized to process metadata
Does this really work for my applications?
Three Types of Interaction

- No sharing: different jobs access different sets of files
- Concurrent sharing: multiple jobs read & write the same set of files
- Sequential sharing: one job's output is another job's input

Works trivially today: 1 dedicated MDS, 1 global namespace. But a global namespace is not always required for existing jobs to work.
Unrelated Jobs Do Not Have to See Each Other's Data

[Figure: Job1 with MDS1 and Img1; Job2 with MDS2 and Img2; both over underlying storage.]
Concurrent Sharing? Connect to the Leader

One use case: user monitoring such as "ls -l" & "tail -F".

[Figure: Job1 runs MDS1 over Img1 as the leader; Job2 and Job3 attach as followers, all over underlying storage.]
Need Another Job's Data? Just Mount It & Carry On

[Figure: Job2 mounts, and later unmounts, Job1's image (Img) from underlying storage.]
Mount Many If Necessary

[Figure: Job3 mounts both Img1 (from Job1) and Img2 (from Job2) from underlying storage: MOUNT 1, MOUNT 2.]
A namespace is as good as a global namespace if a job sees all related data
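As a sketch of what "mount and carry on" could look like from a job's perspective (a hypothetical interface, not the actual DeltaFS API), a job assembles its own private namespace by attaching whichever published images it needs:

```cpp
// mount_sketch.cc -- hypothetical job-side namespace assembly.
// A job builds a private namespace by mounting the (immutable) output
// images of earlier jobs; nothing here is the real DeltaFS API.
#include <iostream>
#include <map>
#include <string>

struct FsImage { std::string log_uri; };  // pointer to a logical metadata log

class JobNamespace {
 public:
  void Mount(const std::string& at, FsImage img) { mounts_[at] = std::move(img); }
  void Unmount(const std::string& at) { mounts_.erase(at); }

  void Show() const {
    for (const auto& [at, img] : mounts_)
      std::cout << at << " -> " << img.log_uri << "\n";
  }

 private:
  std::map<std::string, FsImage> mounts_;  // mount point -> image
};

int main() {
  JobNamespace ns;  // private to this job; no global namespace needed
  ns.Mount("/in/sim", {"objstore://imgs/job1-output"});
  ns.Mount("/in/mesh", {"objstore://imgs/job2-output"});
  ns.Show();        // the job now "sees all related data"
  ns.Unmount("/in/mesh");
}
```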
Re-imagining filesystems for the future
Machine-Oriented vs. Job-Oriented

Machine-oriented: a component of a machine
- Always ON, centralized
- Uses a fixed set of dedicated nodes
- Long-standing
- Accessible from every node of a machine
- A shared FS image per machine
- Runs background activities (e.g., reorganizing indexes for fast reads)
- One piece of code

Job-oriented: a component of a running job
- Dynamically instantiated by jobs
- Highly agile: scales with job allocations
- Transient: lives within a job
- Private: accessed only by a job
- No false sharing: one per job
- No jitters: all background FS work is scheduled by jobs
- Software-defined: code optimized for the work at hand
Decoupling MDS from the Machine

Each job can be viewed as a process group; the group's processes find their MDS service on their own.

[Figure: two jobs, each a group of processes 1 through n; in one job the MDS runs as its own process, in the other every process embeds an MDS.]

- One option: MDS runs as a separate job process. Decoupled from the machine.
- Another option: MDS runs as a library within processes. Again, decoupled from the machine.
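As a rough illustration of the library option (purely a sketch; this is not the actual DeltaFS embedding or API), metadata operations become local calls into an MDS object that lives and dies with the job process:

```cpp
// embedded_mds_sketch.cc -- hypothetical "MDS as a library" inside a process.
#include <iostream>
#include <map>
#include <string>

class EmbeddedMds {          // lives and dies with the process that links it
 public:
  void Mkdir(const std::string& p) { ns_[p] = "dir"; }
  void Creat(const std::string& p) { ns_[p] = "file"; }
  size_t NumEntries() const { return ns_.size(); }
 private:
  std::map<std::string, std::string> ns_;  // this job's private namespace
};

int main() {
  EmbeddedMds mds;           // instantiated by the job, not by the machine
  mds.Mkdir("/out");
  mds.Creat("/out/particle.0");
  std::cout << mds.NumEntries() << " entries, zero RPCs\n";
}  // job ends -> the "service" ends; persisted state would remain in storage
```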
Transient Service, Persistent Data

When a job ends, its FS "service" goes with it; data stays in the underlying storage.

[Figure: two jobs with their per-process MDS instances; the jobs come and go while the underlying storage persists.]
Each Job Acts as a Function

[Figure: a job as a λ over underlying storage.]

1. Takes one or more FS images as input
2. Creates a new FS image as output
3. No side effects

Inputs are kept immutable so that they can be shared in a scalable way.
Log-Structured: Each Job Appends Changes to a Log

[Figure: a job takes an input image, appends newer changes (chg1, chg2, chg3, chg4) to a metadata log, and emits an output image; the log is the total "Δ" (delta, diff) of state.]

Each FS image is essentially a pointer to a logical log.
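To make the model of the last two slides concrete, here is a toy sketch (hypothetical types, not DeltaFS internals): an FS image is a handle to an immutable chain of deltas, and a job acts as a function that takes input images and produces a new image by appending one more delta, never mutating its inputs.

```cpp
// delta_log_sketch.cc -- FS images as pointers into an immutable log of deltas.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Delta {                                  // one job's batch of changes
  std::map<std::string, std::string> changes;   // path -> metadata
  std::shared_ptr<const Delta> parent;          // older state (may be null)
};
using FsImage = std::shared_ptr<const Delta>;   // an image = a log pointer

// A job is a function: input image in, new image out, no side effects.
FsImage RunJob(FsImage input, std::map<std::string, std::string> changes) {
  return std::make_shared<Delta>(Delta{std::move(changes), std::move(input)});
}

int main() {
  FsImage img1 = RunJob(nullptr, {{"/a", "v1"}});  // job 1's output
  FsImage img2 = RunJob(img1, {{"/b", "v1"}});     // job 2 builds on it
  // img1 is untouched and can be shared by other jobs concurrently.
  for (auto d = img2; d; d = d->parent)            // newest delta first
    for (const auto& [p, v] : d->changes) std::cout << p << " = " << v << "\n";
}
```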
Turtles All the Way Down

Reading from an FS image is searching through a DAG of "Δ"s.

[Figure: jobs combine deltas A, B, C, and D; different jobs may merge them in different orders.]

- Order matters: resolve conflicts using a job-specified ordering
- Merging & flattening for fast reads via log compaction
User Pays for Speed (by Scheduling Log Compactions)

Log compaction reduces search depth & reclaims space, but is often time-consuming.

- Traditional: done by a dedicated MDS; jitters or wasted work ("I'll handle everything: 'meh' speed for everyone")
- Better: explicitly scheduled by apps; predictable high performance (e.g., one MDS optimized for bulk insertion, another running log compaction, a third serving fast reads), as sketched below
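In miniature, "merging & flattening" might look like the following sketch (hypothetical code): walk the deltas oldest-first in the job-specified order, let newer changes overwrite older ones, and emit a single flat image that is cheap to search and lets the old chain be reclaimed.

```cpp
// compaction_sketch.cc -- flattening a chain of deltas (hypothetical).
// Newer deltas win, per a job-specified ordering; the result is one flat
// run that is cheap to search and frees the old chain for reclamation.
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Delta = std::map<std::string, std::string>;  // path -> metadata

// `ordered` lists deltas oldest-first, as chosen by the job.
Delta Compact(const std::vector<Delta>& ordered) {
  Delta flat;
  for (const Delta& d : ordered)
    for (const auto& [path, meta] : d)
      flat[path] = meta;                 // later (newer) deltas overwrite
  return flat;                           // search depth drops from N to 1
}

int main() {
  std::vector<Delta> chain = {
      {{"/a", "v1"}, {"/b", "v1"}},      // oldest
      {{"/a", "v2"}},                    // conflicting update to /a
      {{"/c", "v1"}},                    // newest
  };
  for (const auto& [p, v] : Compact(chain))
    std::cout << p << " = " << v << "\n";  // /a=v2, /b=v1, /c=v1
}
```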
How does my job find its input data?
It's All about Mapping Names to Data

User specifies names; a mechanism handles the mapping.

[Image: LANL's Cray-1 (left, 1976) and Trinity (right, 2015), https://www.lanl.gov/asci/platforms/index.php]

- The good old days: a job control system does the mapping
- Today: a global filesystem namespace does the mapping
A New Kind of Mapper: Filesystem Image Registry

Works like github.com: jobs "git-clone" their input datasets.

[Figure: Job 1 publishes to an FS image registry (step 1); Job 2 gets info by name (step 2) and accesses the image in underlying storage (step 3). The registry holds a manifest, not the image itself; there can be many registries. Publication & collection may be automated by workflow engines.]
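A sketch of the registry idea (all names hypothetical): what a job publishes, and what a later job fetches by name, is a small manifest describing where the image's log lives, not the image data itself.

```cpp
// registry_sketch.cc -- a hypothetical FS image registry, github-style.
// Jobs publish a small manifest under a name; later jobs look the name up
// and then read the image directly from underlying storage.
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct Manifest {               // describes an image; NOT the image itself
  std::string log_uri;          // where the metadata log lives
  std::string root;             // entry point into the namespace
};

class ImageRegistry {           // one of possibly many registries
 public:
  void Publish(const std::string& name, Manifest m) {
    byname_[name] = std::move(m);
  }
  std::optional<Manifest> Get(const std::string& name) const {
    if (auto it = byname_.find(name); it != byname_.end()) return it->second;
    return std::nullopt;
  }
 private:
  std::map<std::string, Manifest> byname_;
};

int main() {
  ImageRegistry reg;
  reg.Publish("sim/run42", {"objstore://logs/run42", "/"});  // step 1: publish
  if (auto m = reg.Get("sim/run42"))                         // step 2: by name
    std::cout << "read image at " << m->log_uri << "\n";     // step 3: access
}
```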
Which Registry did I Use? Ask a Catalog Service

Is it github.com or bitbucket.org?

[Figure: registries subscribe to a catalog server (again, there can be many), which indexes them; a job searches the catalog and gets registry info.]

Related talk: LANL's catalog service GUFI by Dominic Manno, Session 63, 2pm Wed, Lafayette room.
Sounds Good. Remind Me Why Perf. is Better…

1. More CPUs: able to use more resources to do FS work
2. More efficient: no false sharing, less synchronization, better caching
3. Software-defined: smart clients, simple storage
Example: Making a Needle-in-a-Haystack Hero

A job using 100K CPU cores with an embedded FS over underlying storage:
- 12 billion file inserts/s
- Up to 5,000x faster queries than bulk scans

Under the hood: a) leveraged idle CPU cycles, b) deep writeback buffering, c) optimized storage layout.
Conclusion

- Existing FS clients sync too often with servers: synchronization of anything global should be avoided at extreme scales
- Removing servers forces us to review what's necessary: enabling sequential sharing is where filesystems shine
- Need radically different models for shared storage: a job-oriented filesystem scales better in many computing scenarios
2019 Storage Developer Conference. © Carnegie Mellon University. All Rights Reserved.
Thank you.