Petabyte Scale Out Rocks
CEPH as Replacement for OpenStack's Swift and Co.
Dr. Udo Seidel
Manager of Linux Strategy @ Amadeus
Agenda
Background
Architecture of CEPH
CEPH Storage
OpenStack
Dream Team: CEPH and OpenStack
Summary
Background
Me :-)
• Teacher of mathematics and physics
• PhD in experimental physics
• Started with Linux in 1996
• Linux/UNIX trainer
• Solution engineer in HPC and CAx environment
• Head of the Linux Strategy team @Amadeus
CEPH – What?
• So-called parallel distributed cluster file system
• Started as part of PhD studies at UCSC
• Public announcement in 2006 at 7th OSDI
• File system shipped with Linux kernel since 2.6.34
• Name derived from pet octopus – cephalopods
Shared File Systems – Short Intro
• Multiple servers access the same data
• Different approaches
‒ Network based, e.g. NFS, CIFS
‒ Clustered
‒ Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
‒ Distributed parallel, e.g. Lustre .. and CEPH
Architecture of CEPH
CEPH and Storage
• Distributed file system => distributed storage
• Does not use traditional disks or RAID arrays
• Does use so-called OSDs
‒ Object based Storage Devices
‒ Intelligent disks
Object Based Storage I
• Objects of quite general nature
‒ Files
‒ Partitions
• ID for each storage object
• Separation of metadata operation and storing file data
• HA not covered at all
• Object based Storage Devices
Object Based Storage II
• OSD software implementation
‒ Usually an additional layer between computer and storage
‒ Presents object-based file system to the computer
‒ Use a “normal” file system to store data on the storage
‒ Delivered as part of CEPH
• File systems: Lustre, EXOFS
CEPH – The Full Architecture I
• 4 Components
‒ Object-based Storage Devices
‒ Any computer
‒ Form a cluster (redundancy and load balancing)
‒ Metadata Servers
‒ Any computer
‒ Form a cluster (redundancy and load balancing)
‒ Cluster Monitors
‒ Any computer
‒ Clients ;-)
CEPH – The Full Architecture II
OSD Cluster
MDS Cluster / CEPH Clients
Meta Data Operation
Metadata I/O
Data I/O
CEPH
VFS
POSIX
User
Linux kernel
User-space
CEPH Storage
CEPH and OSD
• Userland implementation
• Any computer can act as OSD
• Uses XFS or BTRFS as native file system
‒ Since 2009
‒ Before that: self-developed EBOFS
‒ Provides functions of OSD-2 standard
‒ Copy-on-write
‒ Snapshots
• No redundancy on disk or even computer level
CEPH and OSD – File Systems
• XFS strongly recommended
• BTRFS planned as default
‒ Non-default configuration for mkfs
• EXT4 possible
‒ XATTR size is the key issue -> EXT4 less recommended
‒ LevelDB used to bridge existing gaps
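In the filestore era this LevelDB workaround was enabled in the OSD section of ceph.conf; a hedged sketch (option spelling as in the pre-Hammer documentation):

```ini
[osd]
; EXT4 caps extended attributes at the inode/block size, which is
; too small for CEPH's object metadata. Spilling the overflow into
; LevelDB (the "omap") bridges that gap.
filestore xattr use omap = true
```

On XFS or BTRFS the native XATTR support is large enough and this setting is not needed.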
OSD Failure Approach
• Any OSD expected to fail
• New OSD dynamically added/integrated
• Data distributed and replicated
• Redistribution of data after change in OSD landscape
Data Distribution
• File striped
• File pieces mapped to object IDs
• Assignment of so-called placement group to object ID
‒ Via hash function
‒ Placement group (PG): logical container of storage objects
• Calculation of list of OSDs out of PG
‒ CRUSH algorithm
CRUSH I
• Controlled Replication Under Scalable Hashing (CRUSH)
• Considers several pieces of information
‒ Cluster setup/design
‒ Actual cluster landscape/map
‒ Placement rules
• Pseudo-random -> quasi-statistical distribution
• Cannot cope with hot spots
• Clients, MDS and OSD can calculate object location
CRUSH II
File → Objects → PGs → OSDs
(ino, ono) → oid
hash(oid) → pgid
CRUSH(pgid) → (osd1, osd2)
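The mapping chain above can be sketched in Python. The hash and the CRUSH step are toy stand-ins (real CEPH uses rjenkins hashing and the full CRUSH algorithm over the cluster map), and stripe size, PG count, and cluster size are assumed values, but the structure — file → objects → placement group → OSD list — is the same:

```python
import hashlib

STRIPE_SIZE = 4 * 2**20   # 4 MiB stripe units (assumed)
PG_COUNT = 128            # placement groups in the pool (assumed)
OSDS = list(range(6))     # toy cluster: OSDs 0..5

def file_to_objects(ino, size):
    """(ino, ono) -> oid: one object per stripe unit of the file."""
    n_objects = (size + STRIPE_SIZE - 1) // STRIPE_SIZE
    return [f"{ino:x}.{ono:08x}" for ono in range(n_objects)]

def oid_to_pg(oid):
    """hash(oid) -> pgid: stable hash of the object name, mod PG count."""
    h = int(hashlib.md5(oid.encode()).hexdigest(), 16)
    return h % PG_COUNT

def pg_to_osds(pgid, replicas=2):
    """CRUSH(pgid) -> (osd1, osd2): toy stand-in for CRUSH.

    Real CRUSH walks the cluster map and placement rules; here we
    just pick `replicas` distinct OSDs deterministically from pgid."""
    start = pgid % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

# A 10 MiB file becomes three 4 MiB objects, each independently placed.
objects = file_to_objects(ino=0x10000000000, size=10 * 2**20)
placement = {oid: pg_to_osds(oid_to_pg(oid)) for oid in objects}
```

Because every step is a pure function of names and the cluster map, any party can compute placement locally with no lookup table — which is why clients, MDS, and OSD can all calculate object locations.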
Data Replication
• N-way replication
‒ N OSDs per placement group
‒ OSDs in different failure domains
‒ First non-failed OSD in PG -> primary
• Read and write to primary only
‒ Writes forwarded by primary to replica OSDs
‒ Final write commit after all writes on replica OSDs
• Replication traffic within OSD network
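The primary-copy write path described above can be sketched as a small in-memory simulation (class and function names are illustrative; real OSDs are network daemons):

```python
class OSD:
    """Toy OSD: stores objects in a dict."""
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, oid, data):
        self.store[oid] = data
        return True  # acknowledgement

def replicated_write(oid, data, pg_osds):
    """N-way replication: the client talks to the primary only.

    The first (non-failed) OSD in the PG's list is the primary; it
    forwards the write to the replicas and the final commit happens
    only after every replica has acknowledged."""
    primary, replicas = pg_osds[0], pg_osds[1:]
    primary.write(oid, data)
    acks = [r.write(oid, data) for r in replicas]  # replication traffic
    return all(acks)  # commit to the client after all replica writes

pg_osds = [OSD(i) for i in (3, 1, 4)]
committed = replicated_write("obj1", b"hello", pg_osds)
```

Keeping reads and writes on the primary gives a single serialization point per object, while the replica traffic stays inside the OSD network.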
CEPH clustermap I
• Objects: computers and containers
• Container: bucket for computers or containers
• Each object has ID and weight
• Maps physical conditions
‒ Rack location
‒ Fire cells
CEPH clustermap II
• Reflects data rules
‒ Number of copies
‒ Placement of copies
• Updated version sent to OSDs
‒ OSDs distribute cluster map within OSD cluster
‒ OSD re-calculates PG membership via CRUSH
‒ Data responsibilities
‒ Order: primary or replica
‒ New I/O accepted after information sync
CEPH - RADOS
• Reliable Autonomic Distributed Object Storage (RADOS)
• Direct access to OSD cluster via librados
‒ Supported: C, C++, Java, Python, Ruby, PHP
• Drop/skip of POSIX layer (CEPHFS) on top
• Visible to all CEPH members => shared storage
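The real Python binding (the `rados` module shipped with CEPH) needs a running cluster to do anything, so here is a hedged, purely in-memory stand-in that mirrors its call shape — connect, open an I/O context on a pool, write and read whole objects:

```python
class FakeRados:
    """In-memory stand-in for rados.Rados: same call sequence, no cluster."""
    def __init__(self, conffile=None):
        self.conffile = conffile
        self.pools = {}
        self.connected = False

    def connect(self):
        self.connected = True

    def open_ioctx(self, pool):
        assert self.connected, "connect() before opening an I/O context"
        return FakeIoctx(self.pools.setdefault(pool, {}))

class FakeIoctx:
    """Stand-in for a pool I/O context (rados.Ioctx)."""
    def __init__(self, store):
        self.store = store

    def write_full(self, oid, data):
        self.store[oid] = data

    def read(self, oid):
        return self.store[oid]

# Same shape as the real librados binding:
cluster = FakeRados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")
ioctx.write_full("greeting", b"hello rados")
```

Swapping `FakeRados` for `rados.Rados` against a live cluster keeps the same sequence of calls; the point is that applications address named objects in pools directly, with no POSIX layer in between.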
CEPH Block Device
• Aka RADOS Block Device (RBD)
• Second part of the kernel code (since 2.6.37)
• RADOS storage exposed as block device (/dev/rbd)
‒ qemu/KVM storage driver via librados/librbd
• Alternative to:
‒ Shared SAN/iSCSI for HA environments
‒ Storage HA solutions for qemu/KVM and Xen
• Shipped with SUSE® Cloud
RADOS – A Part of the Picture
/dev/rbd qemu/kvm
RBD protocol
CEPH-osd
CEPH Object Gateway (RGW)
• Aka RADOS Gateway
• RESTful API
‒ Amazon S3 -> s3 tools work
‒ OpenStack SWIFT interface
• Proxy HTTP to RADOS
• Tested with Apache and lighttpd
CEPH – Takeaways
• Scalable
• Flexible configuration
• No SPOF, but based on commodity hardware
• Accessible via different interfaces
OpenStack
What?
• Infrastructure as a Service (IaaS)
• 'Open source version' of AWS :-)
• New versions every 6 months
‒ Current (2013.2) is called Havana
‒ Previous (2013.1) is called Grizzly
• Managed by OpenStack Foundation
OpenStack Architecture
OpenStack Components
• Keystone - identity
• Glance - image
• Nova - compute
• Cinder - block storage
• Swift - object storage
• Quantum - network
• Horizon - dashboard
About Glance
• There since almost the beginning
• OpenStack's Image store
‒ Server
‒ Disk
• Several formats possible
‒ Raw
‒ Qcow2
‒ ...
• Shared/non-shared
About Cinder
• OpenStack's block storage
• Quite new
‒ Part of Nova before
‒ Separated since Folsom
• For compute instances
• Managed by OpenStack users
• Several storage back-ends possible
About Swift
• Since the beginning :-)
• Replacement for Amazon S3 -> cloud storage
‒ Scalable
‒ Redundant
• OpenStack's object store
‒ Proxy
‒ Object
‒ Container
‒ Account
‒ Auth
Dream Team: CEPH and OpenStack
Why CEPH in the first place?
• Full blown storage solution
‒ Support
‒ Operational model
‒ Ready for 'cloud'-type workloads
• Separation of duties
• One storage solution for different storage needs
Open source!
Integration
• Full integration with RADOS since Folsom
• Two parts
‒ Authentication
‒ Technical access
• Both parties must be aware of each other
‒ CEPH components installed in OpenStack environment
• Independent for each of the storage components
‒ Different RADOS pools
Integration – Authentication
• CEPH part
‒ Key rings (ceph auth)
‒ Configuration (ceph.conf)
‒ For cinder and glance
• OpenStack part
‒ Keystone
‒ Only for Swift
‒ Needs RadosGW
Integration – Access to RADOS/RBD I
• Via API/libraries => no CEPHFS needed
• Easy for Glance/Cinder
‒ Glance
‒ CEPH keyring configuration
‒ Update of ceph.conf
‒ Update of Glance API configuration
‒ Cinder
‒ CEPH keyring configuration
‒ Update of ceph.conf
‒ Update of Cinder API configuration
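For the OpenStack releases of that era, the Glance and Cinder updates above boil down to a few configuration lines; a hedged sketch (pool and user names are common conventions, not requirements):

```ini
; glance-api.conf – store images in RADOS
default_store = rbd
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance
rbd_store_pool = images

; cinder.conf – provision volumes as RBD images
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
```

Each service talks to its own RADOS pool with its own CEPH key, which is what keeps the storage components independent of each other.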
Integration – Access to RADOS/RBD II
• More work needed for Swift
• Again: CEPHFS not needed
‒ CEPH Object Gateway
‒ Webserver
‒ RGW software
‒ Keystone certificates
‒ Does not support v2 of Swift authentication
‒ Keystone authentication
‒ Endpoint list configuration -> points to RADOSGW
Integration – The Full Picture
Keystone + Swift → CEPH Object Gateway
Glance + Cinder + Nova → qemu/kvm → CEPH Block Device
CEPH Object Gateway + CEPH Block Device → RADOS
Integration – Pitfalls
• CEPH versions not in sync
• Authentication
• CEPH Object Gateway setup
Why CEPH – Reviewed
• Previous arguments still valid :-)
• Modular usage
• High integration
• Built-in high availability
• One way to handle the different storage entities
• No need for POSIX compatible interface
• Works even with other IaaS implementations
Summary
Takeaways
• Sophisticated storage engine
• Mature OSS distributed storage solution
‒ Other use cases
‒ Can be used elsewhere
• Full integration in OpenStack
Unpublished Work of SUSE. All Rights Reserved.This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General DisclaimerThis document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
Backup Slides
CEPHFS – File System Part of CEPH
• Replacement for NFS or other distributed file systems
• Storage just a part
CEPH.ko CEPH-fuse libCEPHFS
CEPH DFS protocol
CEPH-osd
CEPH-mds
CEPHFS – Client View
• The first kernel part of CEPH
• Unusual kernel implementation
‒ “light” code
‒ Almost no intelligence
• Communication channels
‒ To MDS for meta data operation
‒ To OSD to access file data
CEPHFS Caches
• Per design
‒ OSD: identical to access of BTRFS
‒ Client: own caching
• Concurrent write access
‒ Caches discarded
‒ Caching disabled -> synchronous I/O
• HPC extension of POSIX I/O
‒ O_LAZY
CEPH and Meta Data Server (MDS)
• MDS form a cluster
• Do not store any data themselves
‒ Data stored on OSD
‒ Journaled writes with cross MDS recovery
• Change to MDS landscape
‒ No data movement
‒ Only management information exchange
• Partitioning of name space
‒ Overlaps on purpose
Dynamic Subtree Partitioning
• Weighted subtrees per MDS
• “load” of MDS re-balanced
MDS1 MDS2 MDS3 MDS4 MDS5
Meta Data Management
• Small set of meta data
‒ No file allocation table
‒ Object names based on inode numbers
• MDS combines operations
‒ Single request for readdir() and stat()
‒ stat() information cached
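The combined readdir()+stat() request has a rough local analogy in Python's os.scandir: one directory read returns entries whose stat information is cached, so no per-file round trip is needed — which is the effect the MDS achieves over the network:

```python
import os
import tempfile

# Build a small directory tree to list.
tmp = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write(name)  # 5-byte file content

# One readdir-like call; each entry carries cached stat data,
# so entry.stat() usually needs no extra system call per file.
sizes = {}
with os.scandir(tmp) as entries:
    for entry in entries:
        sizes[entry.name] = entry.stat().st_size
```

Batching metadata this way is what keeps the MDS request count (and thus latency) low for directory-heavy workloads.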
CEPH Monitors
• Status information of CEPH components critical
• First contact point for new clients
• Monitors track changes of cluster landscape
‒ Update cluster map
‒ Propagate information to OSDs
CEPH Storage – All In One
CEPH.ko CEPH-fuse libCEPHFS /dev/rbd qemu/kvm
CEPH Object Gateway librados
CEPH DFS protocol RBD protocol
CEPH-osd
CEPH-mds