Hopsfs 10x HDFS performance
HopsFS: 10X your HDFS with NDB
Jim Dowling Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io @hopshadoop
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Marketing 101: Celebrity Endorsements
Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product.
*Turing Award Winner 2014, Father of Distributed Systems
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS*)
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
Hadoop in Context
Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
Storage: HDFS, MapR, S3, Colossus, WAS
Resource Management: YARN, Mesos, Borg
Metadata: Hive, Parquet, Authorization, Search
HDFS v2
HDFS Client (ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.)
Active NameNode / Standby NameNode
DataNodes (up to ~5K)
Journal Nodes: asynchronous replication of the EditLog
Zookeeper: agreement on the Active NameNode
Snapshots (fsimage) cut the EditLog
The NameNode is the Bottleneck for Hadoop
Max Pause Times for NameNode Heap Sizes*
Chart: max pause-times (ms, log scale 10–10,000) vs. JVM heap size (50, 75, 100, 150 GB), unoptimized vs. optimized.
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
Chart: size (GB, 0–1000) vs. year (2016–2020), comparing the projected max NameNode JVM heap size with the size of RAM in a COTS $7,000 rack server.
Externalizing the NameNode State
• Problem: the NameNode heap is not scaling up, despite falling RAM prices.
• Solution: move the metadata off the JVM heap.
• Move it where? An in-memory storage system that can be efficiently queried and managed, preferably open source.
• Our choice: MySQL Cluster (NDB).
HopsFS Architecture
Diagram: HDFS Clients send metadata operations to a set of NameNodes (one elected Leader); the NameNodes store all metadata in NDB; DataNodes store the blocks.
Pluggable DBs: Data Abstraction Layer (DAL)
NameNode (Apache v2) → DAL API (Apache v2) → NDB-DAL-Impl (GPL v2) or Other DB (other license).
Jars: hops-2.5.0.jar, dal-ndb-2.5.0-7.5.3.jar
The Global Lock in the NameNode
HDFS NameNode Internals
Client issues RPCs: mkdir, getBlockLocations, createFile, …
• Listener (NIO thread) accepts connections into a ConnectionList.
• Readers (ipc.server.read.threadpool.size, default 1) parse RPCs into the Call Queue.
• Handlers (dfs.namenode.service.handler.count, default 10) execute operations on the namespace and in-memory EditLog, serialized by the FSNameSystem lock.
• Mutations are appended to the EditLog buffer and flushed to the Journal Nodes (EditLog1–3); handlers wait on the ackIds.
• Responder (NIO thread) returns completed RPCs to clients.
HopsFS NameNode Internals
Client issues RPCs: mkdir, getBlockLocations, createFile, …
The Listener, Readers (ipc.server.read.threadpool.size, default 1), Call Queue, Handlers (dfs.namenode.service.handler.count, default 10), and Responder are unchanged, but the FSNameSystem lock and EditLog are gone: handlers read and write the metadata tables (inodes, block_infos, replicas, leases, …) in NDB through the DAL API and DAL-Impl. Getting this right is the HARD PART.
Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
• Acquire FS locks in an agreed order, following the FS hierarchy.
• Block-level operations follow the same agreed order.
• No cycles => freedom from deadlock.
• Pessimistic concurrency control ensures progress.
Diagram: client operations (mv, read) on /user/jim/myFile and a DataNode block_report all lock at the NameNode in the same order.
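The ordering argument above can be sketched in a few lines: if every operation acquires its locks in one global total order (here, lexicographic path order, which respects the FS hierarchy), no lock-wait cycle can form. This is only an illustration with local mutexes; HopsFS takes row locks in NDB, and all names here are hypothetical.

```python
import threading

# Hypothetical sketch: acquire per-inode locks in a global total order
# (lexicographic path order), so lock-wait cycles, and hence deadlocks,
# cannot form.
_locks = {}
_registry_guard = threading.Lock()

def _lock_for(path):
    # Lazily create one lock per path, guarded against races.
    with _registry_guard:
        return _locks.setdefault(path, threading.Lock())

def acquire_in_order(paths):
    """Lock every path in sorted order; return the order for release."""
    held = []
    for p in sorted(set(paths)):
        _lock_for(p).acquire()
        held.append(p)
    return held

def release(held):
    # Release in reverse acquisition order.
    for p in reversed(held):
        _lock_for(p).release()

held = acquire_in_order(["/user/jim/myFile", "/user", "/user/jim"])
# Locks are taken as /user, /user/jim, /user/jim/myFile regardless of input order.
release(held)
```

Because every transaction climbs the hierarchy in the same direction, a mv and a block-level operation on the same file can never wait on each other in a cycle.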
Per-Transaction Cache
• Reusing the HDFS codebase resulted in too many round trips to the database per transaction.
• We cache intermediate transaction results at the NameNodes (i.e., a snapshot).
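A minimal sketch of the snapshot idea, with hypothetical names: the first read of a row goes to the database, and later reads inside the same transaction are served from the snapshot, cutting round trips.

```python
# Hypothetical sketch of a per-transaction snapshot cache.
class TxSnapshot:
    def __init__(self, db_read):
        self.db_read = db_read      # function key -> row (stands in for an NDB read)
        self.cache = {}             # snapshot of rows already fetched
        self.round_trips = 0

    def read(self, key):
        # Only the first read of a key costs a database round trip.
        if key not in self.cache:
            self.round_trips += 1
            self.cache[key] = self.db_read(key)
        return self.cache[key]

rows = {"inode:/user/jim": {"id": 2}}
tx = TxSnapshot(rows.get)
tx.read("inode:/user/jim")
tx.read("inode:/user/jim")          # served from the snapshot
assert tx.round_trips == 1
```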
Sometimes, Transactions Just Ain’t Enough
• Large subtree operations (delete, mv, set-quota) can’t always be executed in a single transaction.
• Instead, a 4-phase protocol:
- Isolation and consistency
- Aggressive batching
- Transparent failure handling
- Failed ops retried on a new NameNode
- Lease timeout for failed clients
Leader Election using NDB
• A Leader coordinates replication and lease management.
• NDB serves as shared memory for leader election among the NameNodes.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
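A toy sketch of the shared-memory election idea (the protocol in the DAIS 2015 paper is more involved; the table, timeout, and smallest-id rule here are illustrative assumptions): NameNodes heartbeat into a shared table, members with stale heartbeats are considered dead, and the live member with the smallest id acts as leader.

```python
import time

# Hypothetical sketch: a dict stands in for the shared NDB table.
HEARTBEAT_TIMEOUT = 2.0
_table = {}  # nn_id -> last heartbeat timestamp

def heartbeat(nn_id, now=None):
    _table[nn_id] = time.time() if now is None else now

def leader(now=None):
    # The live NameNode with the smallest id is the leader.
    now = time.time() if now is None else now
    live = [nn for nn, ts in _table.items() if now - ts < HEARTBEAT_TIMEOUT]
    return min(live) if live else None

heartbeat(1, now=0.0)
heartbeat(2, now=0.0)
assert leader(now=1.0) == 1          # smallest live id leads
heartbeat(2, now=3.0)                # NameNode 1 stops heartbeating
assert leader(now=4.0) == 2          # leadership fails over
```

Since every NameNode can already read and write NDB transactionally, reusing it for election avoids running a separate coordination service such as Zookeeper.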
Path Component Caching
• The most common operation in HDFS is resolving pathnames to inodes: 67% of operations in Spotify’s Hadoop workload.
• We cache recently resolved inodes at the NameNodes so that we can resolve them with a single batched primary-key lookup.
- We validate cache entries as part of transactions.
- On a hit for all inodes in a path, the cache converts O(N) round trips to the database into O(1).
Path Component Caching
• Resolving a path of length N costs O(N) round trips.
• With the cache, a hit costs O(1) round trips.
Without the cache, resolving /user/jim/myFile issues getInode(0, “user”), getInode(1, “jim”), getInode(2, “myFile”) against NDB.
With the cache, getInodes(“/user/jim/myFile”) hits the NameNode cache, followed by a single validateInodes([(0, “user”), (1, ”jim”), (2, ”myFile”)]) call to NDB.
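The two resolution paths on this slide can be sketched as follows; the class and its counters are hypothetical, standing in for the NameNode cache and NDB lookups.

```python
# Hypothetical sketch of the path-component cache: a cold lookup costs
# one round trip per path component; a warm lookup costs one batched
# primary-key validation, i.e. O(N) drops to O(1) round trips.
class PathCache:
    def __init__(self, inodes):
        self.inodes = inodes        # (parent_id, name) -> inode_id ("NDB")
        self.cache = {}             # path -> [(parent_id, name, inode_id)]
        self.round_trips = 0

    def resolve(self, path):
        if path in self.cache:      # warm: single batched validation
            self.round_trips += 1
            return [i for (_, _, i) in self.cache[path]]
        parent, entries = 0, []     # cold: one round trip per component
        for name in path.strip("/").split("/"):
            self.round_trips += 1
            inode = self.inodes[(parent, name)]
            entries.append((parent, name, inode))
            parent = inode
        self.cache[path] = entries
        return [i for (_, _, i) in entries]

db = {(0, "user"): 1, (1, "jim"): 2, (2, "myFile"): 3}
pc = PathCache(db)
assert pc.resolve("/user/jim/myFile") == [1, 2, 3]
assert pc.round_trips == 3          # cold: O(N) round trips
pc.resolve("/user/jim/myFile")
assert pc.round_trips == 4          # warm: one batched validation
```

The validation step matters: cached entries may be stale, so the batch lookup re-checks them inside the transaction rather than trusting the cache blindly.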
Hotspots
• Mikael saw 1-2 maxed-out LDM threads.
• Partitioning by parent inodeId meant fantastic performance for ‘ls’ (partition-pruned index scans), but at high load hotspots appeared at the top of the directory hierarchy (/, /Users, /Projects, /NSA, /MyProj, /Dataset1, /Dataset2).
• Current solution:
- Cache the root inode at the NameNodes.
- Use a pseudo-random partition key for top-level directories, but keep partitioning by parent inodeId at lower levels.
- At least 4x throughput increase!
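The hybrid partitioning rule can be sketched like this. The depth cutoff, hash choice, and partition count are illustrative assumptions, not HopsFS's actual values:

```python
import hashlib

# Hypothetical sketch of the hotspot fix: inodes near the root get a
# pseudo-random partition key so load spreads across NDB partitions,
# while deeper inodes keep parent-inodeId partitioning so 'ls' remains
# a partition-pruned index scan.
TOP_LEVELS = 2
NUM_PARTITIONS = 8

def partition_key(path, parent_inode_id):
    depth = len(path.strip("/").split("/"))
    if depth <= TOP_LEVELS:
        # Spread top-level directories pseudo-randomly by path hash.
        h = hashlib.md5(path.encode()).digest()
        return int.from_bytes(h[:4], "big") % NUM_PARTITIONS
    # Deeper inodes: all children of a directory share a partition.
    return parent_inode_id % NUM_PARTITIONS

# Children of a deep directory land in one partition (cheap 'ls'):
assert partition_key("/Projects/MyProj/Dataset1/a", 42) == \
       partition_key("/Projects/MyProj/Dataset1/b", 42)
```

The trade-off: listing a top-level directory loses partition pruning, but those directories are few and small, while their read load is enormous, so spreading them wins.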
Scalable Block Reporting
• On 100PB+ clusters, internal maintenance protocol traffic makes up much of the network traffic.
• Block Reporting:
- The Leader load-balances block reports from the DataNodes across the NameNodes.
- NameNodes work-steal (SafeBlocks → Blocks in NDB) when exiting safe mode.
HopsFS Performance
HopsFS Metadata Scaleout
Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
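A back-of-envelope for the comparison, using the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); both constants are rough assumptions, not measurements from the slide:

```python
# Rough capacity estimate for a 100 GB Apache Hadoop NameNode heap.
HEAP_BYTES = 100 * 10**9            # 100 GB JVM heap (slide's assumption)
BYTES_PER_OBJECT = 150              # oft-cited heap cost per namespace object
objects = HEAP_BYTES // BYTES_PER_OBJECT
files = objects // 2                # assume ~1 file + 1 block object per file
print(f"~{files/1e6:.0f}M files")   # ~333M files under these assumptions
```

HopsFS sidesteps this ceiling entirely: metadata capacity scales with the aggregate memory of the NDB datanodes rather than a single JVM heap.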
Spotify Workload
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking on c3.8xlarge instances
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking on c3.8xlarge instances
HopsFS Throughput (Spotify Workload - AM)
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.
Per Operation HopsFS Throughput
NDB Performance Lessons
• NDB is quite stable!
• ClusterJ is (nearly) good enough:
- sun.misc.Cleaner has trouble keeping up at high throughput, causing OOM for ByteBuffers.
- Transaction hint behavior is not respected.
- DTO creation time is affected by Java reflection.
- Nice features would be: projections, batched scan operation support, an Event API.
• The Event API and an asynchronous API are needed for performance in Hops-YARN.
Heterogeneous Storage in HopsFS
• Storage types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead; triple replication on spinning disks
- SSD: 3X overhead; triple replication on SSDs
- EC-RAID5: 1.4X overhead, with low reconstruction overhead!
Erasure Coding
HDFS File (Sealed)
RS(6,3): d0 d1 d2 d3 d4 d5 | p0 p1 p2, overhead (6+3)/6 = 1.5X
RS(12,4): d0 … d11 | p0 p1 p2 p3, overhead (12+4)/12 ≈ 1.33X
Global/Local Reconstruction with EC-RAID5
Diagram: each block group (Block0 … Block13) holds d0 d1 d2 d3 d4 p0 on a single host (host0 … host10) as ZFS RAID-Z, giving local reconstruction; Reed-Solomon parity blocks across hosts give global reconstruction.
LR(5,1).RS(10,2): (10+2+2)/10 = 1.4X overhead
LR(5,1).RS(10,4): (10+2+4)/10 = 1.6X overhead
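The storage-overhead figures on the two erasure-coding slides all follow one formula, overhead = (data + parity) / data; a quick check with exact rational arithmetic:

```python
from fractions import Fraction

def overhead(data_blocks, parity_blocks):
    """Storage overhead of an erasure code: (data + parity) / data."""
    return Fraction(data_blocks + parity_blocks, data_blocks)

assert float(overhead(6, 3)) == 1.5               # RS(6,3)
assert abs(float(overhead(12, 4)) - 1.33) < 0.01  # RS(12,4)
# LR(5,1).RS(10,2): 2 global RS parities + 2 local RAID-Z parities.
assert float(overhead(10, 2 + 2)) == 1.4
# LR(5,1).RS(10,4): 4 global RS parities + 2 local RAID-Z parities.
assert float(overhead(10, 4 + 2)) == 1.6
```

This is why EC-RAID5 at 1.4X is attractive next to triple replication's 3X: roughly half the raw storage for the same data, with local parities keeping most reconstructions on a single host.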
ePipe: Indexing HopsFS’ Namespace
Polyglot persistence: the distributed database (NDB) is the single source of truth; the NDB Event API streams metadata changes to Elasticsearch for free-text search.
Foreign keys ensure the integrity of extended metadata (MetaDataDesigner, MetaDataEntry).
Hops-YARN
YARN Architecture
Diagram: YARN Client, ResourceMgr and Standby ResourceMgr, NodeManagers, Zookeeper Nodes.
1. Master-slave replication of the RM state
2. Agreement on the Active ResourceMgr
ResourceManager – Monolithic but Modular
Services: ApplicationMasterService, ResourceTrackerService, Scheduler, ClientService, AdminService, Security, Cluster State.
In Hops-YARN, the HopsResourceTracker and HopsScheduler persist and read the cluster state in NDB (the slide shows ~2k ops/s and ~10k ops/s paths), synchronized via the ClusterJ Event API; YARN Clients, App Masters, and NodeManagers talk to the ResourceManager as before.
Hops-YARN Architecture
Diagram: the YARN Client talks to the Scheduler (one of several ResourceMgrs); NodeManagers report to the Resource Trackers; all state lives in NDB; leader election replaces a failed Scheduler.
Hopsworks
Hopsworks – Project-Based Multi-Tenancy
• A project is a collection of:
- Users with roles
- HDFS DataSets
- Kafka topics
- Notebooks, jobs
• Per-project quotas:
- Storage in HDFS
- CPU in YARN
- Uber-style pricing
• Sharing across projects: datasets/topics.
Diagram: a project contains datasets 1…N in HDFS and topics 1…N in Kafka.
Hopsworks – Dynamic Roles
Diagram: Alice authenticates to Glassfish, which securely impersonates her per-project identities (e.g., NSA__Alice, Users__Alice) to HopsFS, HopsYARN, and Kafka using X.509 certificates.
SICS ICE (www.hops.site): a 2 MW datacenter research and test environment.
Purpose: increase knowledge; strengthen universities, companies, and researchers.
R&D institute: 5 lab modules, 3,000-4,000 servers, 2,000-3,000 square meters.
Karamel/Chef for Automated Installation
Targets: Google Compute Engine, bare metal.
Summary
• HopsFS is the world’s fastest, most scalable HDFS implementation.
• Powered by NDB, the world’s fastest database.
• Thanks to Mikael, Craig, Frazer, Bernt and others.
• Still room for improvement…
www.hops.io
Hops [Hadoop For Humans]
Join us! http://github.com/hopshadoop