Replication, History, and Grafting in the Ori File System
Transcript of "Replication, History, and Grafting in the Ori File System"
Ali José Mashtizadeh, Andrea Bittau, Yifeng Frank Huang, David Mazières
Stanford University
Managed Storage
$1/GB/Year
$5-10/GB+
Local Storage $0.04/GB
What’s missing? Data management
•Availability – Data is always live.
•Accessibility – Data is globally accessible.
•Durability – Data is never lost. (History, Snapshots, Backup)
•Usability – Collaboration and version control are easy.
Ori File System
Goal: All the benefits of Managed Storage, implemented with hardware you already own.
Local Storage $0.04/GB
Two Main Usage Models
Personal storage Shared storage
Public Folders Public Folders
Managed storage limitations today
Bandwidth
- Limited by WAN bandwidth
Privacy
Storage cost
- $ per GB of managed solutions
Poor integration of replication, versioning & sharing
- Copying files across machines
- Apple Time Machine, Windows 8 File History
Applications implement their own versioning
- Emailing documents, Distributed version control
Idea: Leverage trends to do better
Fast LANsBig disks
Mobile storage
Disk vs WAN Throughput Growth
[Chart: growth (log scale) of Internet speed vs disk space, 1990-2013.
Transfer time: 14 hours vs 278 days, a 468x gap.]
Ori design principles
Store not just files but file history
- Take advantage of disk space
Replicate files and history widely
- Make replication easy and instantaneous
- No master replica (OK if any device fails)
- Uses LAN speed and disk space
Use history for sharing
Ori Provides
Replication
History
File Sharing with History (Grafting)
Recovery
Public Folders
History
SFSRO/Git-like Data Model
Content Addressable Storage
SHA-256 Hash
Globally unique namespace
Deduplication
[Diagram: object graph. A chain of commits ("Older Commit ... Commit")
points to trees; trees point to sub-trees and blobs; a shared blob is
referenced by two trees, and a large blob is split into fragment blobs.]
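The content-addressable model above is easy to sketch. The header encoding below is illustrative, not Ori's actual on-disk format; the key property is that an object's name is the SHA-256 hash of its content, so identical blobs deduplicate automatically and names form a globally unique namespace:

```python
import hashlib

def object_id(kind: str, payload: bytes) -> str:
    """Content address: hash the object's type and payload.
    (The header encoding here is illustrative, not Ori's format.)"""
    return hashlib.sha256(kind.encode() + b"\x00" + payload).hexdigest()

store = {}  # object ID -> object bytes

def put(kind: str, payload: bytes) -> str:
    oid = object_id(kind, payload)
    store[oid] = payload      # identical content dedups to one entry
    return oid

# Two files with the same bytes share one blob.
a = put("blob", b"hello world\n")
b = put("blob", b"hello world\n")
assert a == b and len(store) == 1
```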
Apply DVCS Techniques
Merge diverging replicas
Detect conflicts
- No magic bullets for all file types
- Make “merge base” available
- 3-way merge line-oriented files
Provide convenient tools
- History, snapshots, branches, …
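"Make the merge base available" means locating a common ancestor of two diverging heads in the commit graph. A sketch, assuming a plain parents map rather than Ori's real commit objects:

```python
from collections import deque

def merge_base(parents, a, b):
    """Return a common ancestor of commits a and b nearest to a.
    `parents` maps commit -> list of parent commits (illustrative)."""
    def ancestors(head):
        seen, q = {head}, deque([head])
        while q:
            for p in parents.get(q.popleft(), []):
                if p not in seen:
                    seen.add(p); q.append(p)
        return seen
    reachable_from_b = ancestors(b)
    # BFS out from a; the first commit also reachable from b is the base.
    q, seen = deque([a]), {a}
    while q:
        c = q.popleft()
        if c in reachable_from_b:
            return c
        for p in parents.get(c, []):
            if p not in seen:
                seen.add(p); q.append(p)
    return None  # disjoint histories

# Two replicas diverged from A1:
history = {"A1": [], "A2": ["A1"], "B2": ["A1"]}
assert merge_base(history, "A2", "B2") == "A1"
```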
Storage Layout
Objects are deduplicated, compressed, and stored
Log structured storage (files on your local file system)
Index used to look up object locations
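A toy model of this layout: compressed objects appended to a log (the packfile), with an index mapping each object ID to its location. Class and method names are illustrative, not Ori's API:

```python
import hashlib, zlib

class PackStore:
    """Log-structured object store sketch: objects are deduplicated,
    compressed, and appended to a pack (a bytearray standing in for a
    file on the local file system); an index maps object ID to
    (offset, length) for lookups."""
    def __init__(self):
        self.pack = bytearray()
        self.index = {}

    def put(self, payload: bytes) -> str:
        oid = hashlib.sha256(payload).hexdigest()
        if oid in self.index:          # dedup: already stored
            return oid
        blob = zlib.compress(payload)
        self.index[oid] = (len(self.pack), len(blob))
        self.pack += blob              # append-only log
        return oid

    def get(self, oid: str) -> bytes:
        off, length = self.index[oid]
        return zlib.decompress(bytes(self.pack[off:off + length]))

s = PackStore()
oid = s.put(b"file contents")
assert s.get(oid) == b"file contents"
```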
Replication
Simplify data management
Today
Backup, Centralized File Storage (Dropbox)
SCP/Rsync/AirDrop
Egalitarian Replication
Replication subsumes backup
Crash!
Recover with Replication
Background Fetch optimization makes replica creation feel instantaneous
Replication in Ori
Opportunistic replication (use the LAN)
- Bulk transport over SSH
Automatic device discovery and synchronization
- UDP multicast messages, sent at a 5 second interval
- Set a cluster name and symmetric key
- Protected by AES-CBC
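The announcement flow can be sketched as follows. Real orisync encrypts the UDP multicast payload with AES-CBC under the cluster's symmetric key; since the Python standard library has no AES, this sketch authenticates with an HMAC instead and omits the actual multicast send. Field names are illustrative:

```python
import hmac, hashlib, json

CLUSTER_KEY = b"shared-cluster-secret"   # symmetric key set per cluster

def make_announcement(cluster: str, host: str, head_commit: str) -> bytes:
    """Build an orisync-style announcement. Real orisync encrypts the
    payload with AES-CBC and multicasts it every 5 seconds; this sketch
    substitutes an HMAC for integrity (stdlib only)."""
    payload = json.dumps({"cluster": cluster, "host": host,
                          "head": head_commit}).encode()
    mac = hmac.new(CLUSTER_KEY, payload, hashlib.sha256).hexdigest()
    return payload + b"|" + mac.encode()

def verify_announcement(packet: bytes):
    payload, mac = packet.rsplit(b"|", 1)
    expect = hmac.new(CLUSTER_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac.decode(), expect):
        return None                      # not from our cluster: ignore
    return json.loads(payload)

pkt = make_announcement("home", "laptop", "ab12")
assert verify_announcement(pkt)["host"] == "laptop"
```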
Replicate Deltas
A delta consists of a collection of objects
Versioning makes replication easy!
[Diagram: commits with deltas (Δ) between successive versions; each delta
bundles the commit, tree, and blob objects new to that version.]
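Because objects are immutable and content-addressed, a delta reduces to a set difference over object IDs, as this sketch (with hypothetical helper names) shows:

```python
def compute_delta(source_objects, dest_object_ids):
    """Delta = objects present at the source but missing at the
    destination, identified purely by their content hashes."""
    return {oid: obj for oid, obj in source_objects.items()
            if oid not in dest_object_ids}

source = {"c2": b"commit 2", "t2": b"tree 2",
          "c1": b"commit 1", "b1": b"blob 1"}
dest = {"c1", "b1"}                      # destination already has these
delta = compute_delta(source, dest)
assert set(delta) == {"c2", "t2"}        # only the new objects ship
```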
Delta Protocol
Content Addressable Storage:
Objects are identical on disk and wire
- No rewriting of objects
Reference Counting:
Decompress metadata to update reference counts
- Decompression is faster than compression
Distributed Fetch
Fast LAN(Gbps)
WAN (Mbps)
Unrelated File System
Depends on content addressable storage
Trade off storage for bandwidth
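A sketch of the distributed-fetch policy, assuming peers can be queried by object ID (the dict-based peers here are illustrative stand-ins for LAN replicas):

```python
def fetch(oid, lan_peers, wan_source):
    """Distributed fetch sketch: because objects are content-addressed,
    any replica can supply them, even one replicating an unrelated file
    system that happens to share blobs. Try fast LAN peers first, then
    fall back to the slow WAN source."""
    for peer in lan_peers:
        if oid in peer:
            return peer[oid]             # served at LAN (Gbps) speed
    return wan_source[oid]               # slow path over the WAN (Mbps)

peer = {"abc": b"blob"}
src = {"abc": b"blob", "def": b"x"}
assert fetch("abc", [peer], src) == b"blob"   # hit on the LAN
assert fetch("def", [peer], src) == b"x"      # falls back to the WAN
```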
Grafting
File Sharing with History
Collaboration Today
Over Email, Cloud, Version Control
File Sharing with Versioning
We want the file system to manage versioning and sharing
Require no forethought in setting up version control
No more insane naming: Presentation_Alice_Final_Bob_2_Final.pptx
Grafting in Ori
[Diagram: Alice's commit history (A1, A2, A3) and Bob's (B1, B2, B3). Bob
grafts Alice's latest snapshot into his history, importing her commits as
A1*, A2*, A3*; the graft is recorded as a cross-repository link.]
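The graft mechanism sketched above can be illustrated with a toy model. The repository layout, field names, and `graft` helper below are hypothetical, not Ori's actual API; the point is that immutable, content-addressed objects can be copied between repositories while a cross-repository link records their origin:

```python
def graft(src_repo, src_commit, subtree, dst_repo, dst_path):
    """Graft sketch: copy the tree under `subtree` of `src_commit` into
    the destination, and record a cross-repository link naming the
    source repo, commit, and path. Repos are plain dicts here."""
    copied = src_repo["commits"][src_commit]["trees"][subtree]
    # Objects are immutable and content-addressed, so sharing is safe.
    dst_repo["trees"][dst_path] = dict(copied)
    dst_repo["grafts"][dst_path] = (src_repo["name"], src_commit, subtree)

# Bob grafts Alice's latest snapshot of /docs:
alice = {"name": "alice",
         "commits": {"A3": {"trees": {"/docs": {"f.txt": "b1"}}}}}
bob = {"trees": {}, "grafts": {}}
graft(alice, "A3", "/docs", bob, "/from-alice")
assert bob["grafts"]["/from-alice"] == ("alice", "A3", "/docs")
```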
Conflicts in Ori
Detects conflicts using history
Automatic merging when possible
Otherwise, provide files for a 3-way merge: file, file:conflict, file:base
Conflicts rarely occur in single user model
Conflicts more likely with Grafts
– merges are explicit
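A minimal sketch of the per-file merge decision described above, assuming whole-file comparison (Ori's real merge also attempts automatic 3-way merges of line-oriented files). The function name and return shape are illustrative:

```python
def merge_file(base, ours, theirs):
    """Per-file merge decision: fast-forward when only one side changed
    since the merge base; flag a conflict otherwise. On conflict Ori
    exposes three names (file, file:base, file:conflict) for a manual
    or tool-driven 3-way merge; this sketch returns them instead."""
    if ours == theirs:
        return {"file": ours}            # both sides agree
    if theirs == base:
        return {"file": ours}            # only we changed it
    if ours == base:
        return {"file": theirs}          # only they changed it
    return {"file": ours, "file:base": base, "file:conflict": theirs}

assert merge_file(b"v1", b"v1", b"v2") == {"file": b"v2"}
assert "file:conflict" in merge_file(b"v1", b"v2a", b"v2b")
```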
Mobile Devices
Sneakernets!
Today: Device space underutilized
iCloud, Google Drive, Office 365/SkyDrive
Data Carriers: Phone Storage Space
[Chart: phone storage capacity (GB, 0 to 140) from Oct 2006 through Dec 2014.]
Fast wireless networks
[Chart: per-stream bandwidth (Mbps, log scale, 1 to 10,000) from Oct 1995
through Dec 2014 for 802.11, 802.11b, 802.11g, 802.11n, 802.11ac, and
802.11ad; newer standards add 4-8 streams (MIMO).]
Sneakernets
Average commute in the US: 25 minutes
Carry 16 GB of storage
5.2 Gbps effective bandwidth
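The effective-bandwidth arithmetic is just bits carried divided by travel time. A quick check shows that 16 GB over a 25 minute commute yields about 0.085 Gbps, while the quoted 5.2 Gbps corresponds to carrying roughly 1 TB per commute:

```python
def effective_gbps(bytes_carried: float, seconds: float) -> float:
    """Sneakernet effective bandwidth: bits moved / travel time."""
    return bytes_carried * 8 / 1e9 / seconds

commute = 25 * 60                                   # 25 minutes in seconds
print(round(effective_gbps(16e9, commute), 3))      # 16 GB -> ~0.085 Gbps
print(round(effective_gbps(0.975e12, commute), 1))  # ~1 TB  -> 5.2 Gbps
```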
“ Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”
- Andrew S. Tanenbaum
Performance
File system benchmarks: Filebench
Network file system: Source code build
* Everything measured on an SSD, except the network benchmark
File system in User Space (FUSE)
Ori is built using FUSE
Baseline against the FUSE loopback
Compare: ext4, ori, loopback
[Diagram: the benchmark runs in user space against either ext4 directly or a
FUSE driver (orifs or the loopback) via the FUSE kernel module; all are
backed by an SSD.]
Architecture
[Diagram: orifs (the FUSE driver) keeps FS metadata in memory (directories,
fstat) plus a staging area caching file data only. It sits on libOri, which
provides object storage (packfiles of Blob, Tree, and Commit objects), an
index and metadata, and a connection manager with LocalStorage, HttpStorage,
SSHStorage, and other backends. Everything is stored on ext4.]
Filebench: Synthetic Workloads
[Chart: operations/s (normalized) for ext4, ori, and loopback on the
fileserver, webserver, varmail, webproxy, and networkfs workloads. Higher is
better; * marks the network benchmark noted above.]
Ori vs NFS: Remote compile
[Charts: total time (s), lower is better.
LAN (1 Gbps): NFSv3 20.45 s, NFSv4 19.45 s, Ori 11.33 s, Ori w/BF 16.04 s.
WAN (2/20 Mbps, 17 ms): NFSv3 54.85 s, NFSv4 44.07 s, Ori 15.3 s,
Ori w/BF 19.34 s. Annotations on the Ori w/BF bars: 40% longer, 23% longer.]
BF = On-demand Background Fetch
Related Work
Network File Systems – AFP, CIFS, LBFS, NFS, Shark, …
Distributed File Systems – AFS, …
Disconnected File Systems – Coda, Ficus, JetFile, Intermezzo, …
Archival File Systems – Elephant, Plan 9, WAFL, Wayback, ZFS, …
Version Control – Git, Mercurial, …
Application Solutions – Bayou, Dropbox, …
Lessons Learned
Hardware and use cases have evolved
File systems need to catch up!
Replication is no longer just for data-centers
Keeping file history should be the default
Mobile devices create an opportunity for better solutions
- Fast LAN, Large Storage, Sneakernets
Future Work
Application Support for Merging on Ori
API complications
Merges can surprise applications and users
Event notification?
Integrating Grafting and Orisync
Authentication
Questions? Visit: http://ori.scs.stanford.edu/
Available for OS X, Linux, and FreeBSD
See paper for details on additional features
Backup Slides
Mobile Device Battery Life
Use 802.11 (or USB) – Better for battery life
Some platforms have:
- Periodic callbacks (opportunistically optimize battery life)
- Geofencing callbacks (wake up when arriving at a location)
Bonnie: IO Benchmark
[Chart: operations per second (0 to 300,000) for ext4, ori, and loopback on
16K read, 16K write, and 16K rewrite. Higher is better.]
Distributed Fetch - Performance
[Chart: time (s) for a remote pull of the Python 3.2.3 source over an
Internet link with 110 ms latency and 290/530 KB up/down, when a nearby LAN
peer holds either Python 2.7.3 or 3.2.3. Distributed pull: 7.75 s; partially
distributed pull: 132.05 s; remote pull: 170.79 s.]
Ori vs NFS (times in seconds)

            NFSv3           NFSv4           Ori             Ori on-demand
            LAN     WAN     LAN     WAN     LAN     WAN     LAN     WAN
Replicate   -       -       -       -       0.49    2.93    -       -
Configure   8.14    21.52   7.25    15.54   0.66    0.66    1.01    1.33
Build       12.32   33.33   12.20   28.54   9.50    9.55    11.45   12.77
Snapshot    -       -       -       -       0.19    0.19    2.72    3.37
Push        -       -       -       -       0.49    1.58    0.85    1.89
Total       20.45   54.85   19.45   44.07   11.33   15.30   16.04   19.34