Distributed Systems Tutorial 9 – Windows Azure Storage written by Alex Libov Based on SOSP 2011...
Distributed Systems
Tutorial 9 – Windows Azure Storage
written by Alex Libov, based on the SOSP 2011 presentation
winter semester, 2011-2012
Windows Azure Storage (WAS)

• A scalable cloud storage system, in production since November 2008
• Used inside Microsoft for applications such as social networking search; serving video, music, and game content; managing medical records; and more
• Thousands of customers outside Microsoft – anyone can sign up over the Internet to use the system
WAS Abstractions

• Blobs – File system in the cloud
• Tables – Massively scalable structured storage
• Queues – Reliable storage and delivery of messages
• A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs
Design Goals

• Highly available with strong consistency
  • Provide access to data in the face of failures/partitioning
• Durability
  • Replicate data several times within and across data centers
• Scalability
  • Need to scale to exabytes and beyond
  • Provide a global namespace to access data around the world
  • Automatically load balance data to meet peak traffic demands
Global Partitioned Namespace

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

• <service> can be blob, table, or queue
• AccountName is the customer-selected account name for accessing storage. The account name specifies the data center where the data is stored. An application may use multiple AccountNames to store its data across different locations.
• PartitionName locates the data once a request reaches the storage cluster
• When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition
• The system supports atomic transactions across objects with the same PartitionName value
• The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account
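To make the URL structure concrete, here is a minimal sketch of splitting a WAS-style URL into its four naming components. The helper name is hypothetical; in the real service these components are resolved by DNS and the front-end layer, not by client-side parsing.

```python
from urllib.parse import urlparse

def parse_was_url(url):
    """Hypothetical helper: split a WAS-style URL into
    (AccountName, service, PartitionName, ObjectName)."""
    parsed = urlparse(url)
    # Hostname has the form AccountName.<service>.core.windows.net
    host_parts = parsed.hostname.split(".")
    account, service = host_parts[0], host_parts[1]
    path_parts = [p for p in parsed.path.split("/") if p]
    partition = path_parts[0] if path_parts else None
    # ObjectName is optional: for some data types the PartitionName
    # alone identifies the object within the account.
    obj = path_parts[1] if len(path_parts) > 1 else None
    return account, service, partition, obj

print(parse_was_url("https://myacct.blob.core.windows.net/photos/sunset.jpg"))
# → ('myacct', 'blob', 'photos', 'sunset.jpg')
```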
Storage Stamps

• A storage stamp is a cluster of N racks of storage nodes
• Each rack is built out as a separate fault domain with redundant networking and power
• Clusters typically range from 10 to 20 racks, with 18 disk-heavy storage nodes per rack
• The first-generation storage stamps hold approximately 2 PB of raw storage each; the next-generation stamps hold up to 30 PB each
High Level Architecture

[Diagram: a Storage Location Service directs data access through a load balancer (LB) to one of several storage stamps. Within each stamp, Front-Ends sit above the Partition Layer, which sits above the Stream Layer (intra-stamp replication); stamps replicate to one another via inter-stamp (geo) replication.]

Access blob storage via the URL: http://<account>.blob.core.windows.net/
Storage Stamp Architecture – Stream Layer

• Append-only distributed file system
• All data from the Partition Layer is stored into files (extents) in the Stream Layer
• An extent is replicated 3 times across different fault and upgrade domains, with random selection for where to place replicas
• All stored data is checksummed, and the checksum is verified on every client read
• Re-replicate on disk/node/rack failure or checksum mismatch

[Diagram: Stream Layer (distributed file system) – Stream Masters (M) coordinated via Paxos, above a pool of Extent Nodes (EN).]
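The checksum-on-every-read rule can be sketched as follows. This is a minimal illustration, assuming CRC32 as the checksum (the slide does not name the algorithm) and in-memory blocks:

```python
import zlib

def store_block(data: bytes) -> dict:
    # The checksum is computed at append time and stored with the block.
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(block: dict) -> bytes:
    # Every client read re-verifies the checksum; on a mismatch the stream
    # layer would re-replicate from another replica (not modeled here).
    if zlib.crc32(block["data"]) != block["crc"]:
        raise IOError("checksum mismatch: re-replicate from another replica")
    return block["data"]

blk = store_block(b"hello")
assert read_block(blk) == b"hello"
```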
Storage Stamp Architecture – Partition Layer

• Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
• Stores and reads objects to/from extents in the Stream Layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning

[Diagram: Partition Layer – a Partition Master and a Lock Service above several Partition Servers.]
Storage Stamp Architecture – Front-End Layer

• Stateless servers
• Authentication + authorization
• Request routing
Storage Stamp Architecture

[Diagram: an incoming write request reaches a Front-End (FE) server, which routes it to a Partition Server; the Partition Master and Lock Service coordinate the Partition Layer; the Partition Server writes to the Stream Layer (Stream Masters replicated via Paxos, above the Extent Nodes), and the ack flows back up through the FE.]
Partition Layer – Scalable Object Index

• 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
  • Need to efficiently enumerate, query, get, and update them
• Traffic pattern can be highly dynamic
  • Hot objects, peak load, traffic bursts, etc.
• Need a scalable index for the objects that can:
  • Spread the index across 100s of servers
  • Dynamically load balance
  • Dynamically change which servers serve each part of the index, based on load
Scalable Object Index via Partitioning

• The Partition Layer maintains an internal Object Index Table for each data abstraction:
  • Blob Index: contains all blob objects for all accounts in a stamp
  • Table Entity Index: contains all table entities for all accounts in a stamp
  • Queue Message Index: contains all messages for all accounts in a stamp
• Scalability is provided for each Object Index:
  • Monitor the load to each part of the index to determine hot spots
  • The index is dynamically split into thousands of Index RangePartitions based on load
  • Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load
Partition Layer – Index Range Partitioning

• Split the index into RangePartitions based on load
• Split at PartitionKey boundaries
• The Partition Map tracks Index RangePartition assignment to partition servers
• The Front-End caches the Partition Map to route user requests
• Each part of the index is assigned to only one Partition Server at a time

[Diagram: within a storage stamp, the Blob Index – keyed by (AccountName, ContainerName, BlobName), running from "aaaa/aaaa/aaaaa" to "zzzz/zzzz/zzzzz" – is split into three RangePartitions: A–H (ending at "harry/pictures/sunrise"), H'–R ("harry/pictures/sunset" through "richard/videos/soccer"), and R'–Z ("richard/videos/tennis" onward). The Partition Master assigns them to partition servers PS1, PS2, and PS3; a Front-End server caches the Partition Map (A–H: PS1, H'–R: PS2, R'–Z: PS3).]
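The Front-End's use of the cached Partition Map can be sketched as a sorted-range lookup. The range bounds and the key granularity below are simplified for illustration (real partition keys combine account, container, and blob name, and the map is maintained by the Partition Master):

```python
import bisect

# Illustrative cached Partition Map: sorted (upper PartitionKey bound, server)
# pairs mirroring the slide's A-H: PS1, H'-R: PS2, R'-Z: PS3 split.
PARTITION_MAP = [("h", "PS1"), ("r", "PS2"), ("zzzzz", "PS3")]

def route(partition_key: str) -> str:
    """Front-End routing: pick the first range whose upper bound covers the key."""
    bounds = [upper for upper, _ in PARTITION_MAP]
    i = bisect.bisect_left(bounds, partition_key.lower())
    return PARTITION_MAP[min(i, len(PARTITION_MAP) - 1)][1]

print(route("aaaa"))     # → PS1 (A-H range)
print(route("richard"))  # → PS3 (R'-Z range)
```

Keeping each part of the index on exactly one Partition Server at a time is what lets a single server answer reads and writes for its range with strong consistency.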
Partition Layer – RangePartition

• A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data
• A RangePartition consists of its own set of streams in the Stream Layer, and the streams belong solely to that RangePartition:
  • Metadata Stream – the root stream for a RangePartition. The PM assigns a partition to a PS by providing the name of the RangePartition's metadata stream.
  • Commit Log Stream – a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for it
  • Row Data Stream – stores the checkpoint row data and index for the RangePartition
Stream Layer

• Append-only distributed file system
• Streams are very large files
  • Has a file-system-like directory namespace
• Stream operations:
  • Open, Close, Delete streams
  • Rename streams
  • Concatenate streams together
  • Append for writing
  • Random reads
![Page 17: Distributed Systems Tutorial 9 – Windows Azure Storage written by Alex Libov Based on SOSP 2011 presentation winter semester, 2011-2012.](https://reader030.fdocuments.net/reader030/viewer/2022033106/56649d405503460f94a197b1/html5/thumbnails/17.jpg)
Extent E2 Extent E3
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Stream Layer Concepts
Block Min unit of
write/read Checksum Up to N bytes (e.g.
4MB)
Extent Unit of replication Sequence of
blocks Size limit (e.g.
1GB) Sealed/unsealed
Stream Hierarchical
namespace Ordered list of
pointers to extents Append/
Concatenate
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Blo
ck
Extent E4
Stream //foo/myfile.dataPtr E1
Ptr E2
Ptr E3
Ptr E4
sealed sealed sealed unsealedExtent E1
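The block/extent/stream hierarchy above can be captured in a few lines. This is an in-memory sketch only; the names and the 4 MB limit come from the slide's illustrative figures, not a real API:

```python
from dataclasses import dataclass, field
from typing import List

MAX_BLOCK = 4 * 2**20  # e.g. 4 MB, the slide's illustrative block limit

@dataclass
class Extent:
    """Unit of replication: a sequence of blocks; append-only until sealed."""
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False

    def append_block(self, data: bytes):
        if self.sealed:
            raise ValueError("extent is sealed; appends go to a new extent")
        if len(data) > MAX_BLOCK:
            raise ValueError("block exceeds maximum block size")
        self.blocks.append(data)

@dataclass
class Stream:
    """An ordered list of pointers to extents; only the last may be unsealed."""
    name: str
    extents: List[Extent] = field(default_factory=list)

# Mirror the diagram: sealed extents followed by one unsealed tail extent.
s = Stream("//foo/myfile.data", [Extent(sealed=True), Extent()])
s.extents[-1].append_block(b"record-1")
```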
Creating an Extent

[Diagram: the Partition Layer sends "Create Stream/Extent" to the Stream Master (SM, a small cluster replicated via Paxos); the SM allocates an extent replica set across the Extent Nodes – EN1 as primary, EN2 and EN3 as secondaries (Secondary A and B) – and returns the assignment (EN1 primary; EN2, EN3 secondary) to the Partition Layer.]
Replication Flow

[Diagram: the Partition Layer sends an Append directly to the primary EN1; EN1 forwards it to the secondaries EN2 and EN3 (Secondary A and B); once all replicas have written it, the ack flows back to the Partition Layer. The SM (replicated via Paxos) is not on the data path.]
Providing Bit-wise Identical Replicas

• Goals:
  • Want all replicas of an extent to be bit-wise the same, up to a committed length
  • Want to store pointers from the partition layer index to an extent+offset
  • Want to be able to read from any replica
• Replication flow:
  • All appends to an extent go to the primary
  • The primary orders all incoming appends and picks the offset for the append in the extent
  • The primary then forwards the offset and data to the secondaries
  • The primary performs in-order acks back to clients for extent appends
  • The primary returns the offset of the append in the extent
  • An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also been completely written
    • This represents the committed length of the extent
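The primary's ordering role and the committed-length invariant can be sketched as follows. This is a simplified model with in-memory replicas and synchronous forwarding; the real protocol is asynchronous and handles failures (next slide):

```python
class PrimaryExtent:
    """Sketch of a primary EN ordering appends for one extent."""

    def __init__(self, secondaries):
        self.data = bytearray()
        self.secondaries = secondaries  # bytearrays standing in for secondary ENs
        self.committed_length = 0

    def append(self, payload: bytes) -> int:
        offset = len(self.data)          # primary picks the offset (total order)
        self.data += payload
        for replica in self.secondaries:
            # In-order forwarding: every prior offset is fully written first.
            assert len(replica) == offset
            replica += payload           # forward offset + data to the secondary
        # All replicas have written this offset and all prior ones, so the
        # committed length advances and the append can be acked.
        self.committed_length = offset + len(payload)
        return offset

p = PrimaryExtent([bytearray(), bytearray()])
p.append(b"abc")
p.append(b"de")
print(p.committed_length)  # → 5, identical on primary and both secondaries
```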
Dealing with Write Failures

Failure during append:
1. Ack from primary lost when going back to the partition layer
   • A retry from the partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/unreachable Extent Node (EN)
   • The append will not be acked back to the partition layer
   • Seal the failed extent
   • Allocate a new extent and append immediately

[Diagram: stream //foo/myfile.dat holds pointers to extents E1–E4; after a failed append to E4, E4 is sealed and a new extent E5 is allocated and appended to the stream's extent list.]
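The seal-and-reallocate reaction in case 2 can be sketched with small stand-in classes (the stub names are hypothetical; in the real system the SM allocates the replacement replica set):

```python
class ExtentStub:
    """Minimal stand-in for an extent replica set; healthy=False models
    an unreachable Extent Node in the set."""

    def __init__(self, healthy=True):
        self.blocks, self.sealed, self.healthy = [], False, healthy

    def append(self, payload):
        if not self.healthy:
            raise ConnectionError("extent node unreachable")
        self.blocks.append(payload)
        return len(self.blocks) - 1

def append_with_failover(extents, payload):
    """extents: the stream's ordered extent list; the last extent takes appends."""
    try:
        return extents[-1].append(payload)
    except ConnectionError:
        extents[-1].sealed = True          # seal the failed extent
        extents.append(ExtentStub())       # allocate a new extent (E5 on the slide)
        return extents[-1].append(payload)  # ...and append immediately
```

Because sealing is immediate, a single unreachable EN only delays one append rather than blocking the stream.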
Extent Sealing (Scenario 1)

[Diagram: while an append is in flight, the Stream Master (SM, replicated via Paxos) decides to seal the extent. It asks the reachable replicas – primary EN1 and Secondary A (EN2) – for their current length; both report 120, so the SM seals the extent at commit length 120. Secondary B (EN3) is unreachable during sealing.]
Extent Sealing (Scenario 1, continued)

[Diagram: the replica that was unreachable during sealing later syncs with the SM, learns the extent was sealed at 120, and seals its own copy at 120 to match.]
Extent Sealing (Scenario 2)

[Diagram: during a failed append, the SM asks the reachable replicas for their current length; one reports 120 but another reports only 100 (the last append never reached it). The SM seals the extent at 100 – the smallest commit length among the reachable replicas.]
Extent Sealing (Scenario 2, continued)

[Diagram: a replica that missed the sealing decision later syncs with the SM, learns the extent was sealed at 100, and seals its copy at 100, discarding any data beyond the sealed commit length.]
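The SM's sealing decision in the two scenarios above reduces to one rule: ask the reachable replicas for their current length and seal at the smallest value. This is safe because an append is only acked once every replica has written it, so nothing acked to the partition layer can lie beyond the smallest length. A minimal sketch:

```python
def seal_extent(replica_lengths, reachable):
    """replica_lengths: dict EN -> current length; reachable: set of ENs
    the SM can talk to at sealing time."""
    lengths = [replica_lengths[en] for en in reachable]
    # Scenario 1: all reachable replicas agree. Scenario 2: smallest wins.
    return min(lengths)

# Scenario 1: every reachable replica reports 120 -> sealed at 120.
print(seal_extent({"EN1": 120, "EN2": 120, "EN3": 120}, {"EN1", "EN2"}))  # → 120
# Scenario 2: one reachable replica only reached 100 -> sealed at 100.
print(seal_extent({"EN1": 120, "EN2": 100, "EN3": 120}, {"EN1", "EN2"}))  # → 100
```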
Providing Consistency for Data Streams

[Diagram: a network partition leaves the Partition Server able to talk to EN3 while the SM cannot.]

• For data streams (row and blob data streams), the Partition Layer only reads from offsets returned from successful appends
  • Such offsets are committed on all replicas
• The offset is therefore valid on any replica, so it is safe to read from EN3
Providing Consistency for Log Streams

[Diagram: the same network partition – the Partition Server can talk to EN3, the SM cannot. EN1 and EN2 hold sealed replicas; EN3's replica is unsealed. The PS checks the commit lengths, the extent is sealed, and the PS uses EN1 and EN2 for loading.]

• Logs (the commit and metadata log streams) are read on partition load
• Check the commit length first
• Only read from an unsealed replica if all replicas have the same commit length
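The log-stream read rule above can be sketched as a replica-selection check (the dict shape is illustrative, not a real API):

```python
def choose_log_replicas(replicas):
    """replicas: list of dicts like {"en": "EN1", "sealed": True, "commit": 100}.
    Returns the replicas that are safe to load the log from."""
    commits = {r["commit"] for r in replicas}
    if len(commits) == 1:
        # All replicas agree on the commit length: any replica is safe,
        # even an unsealed one.
        return [r["en"] for r in replicas]
    # Commit lengths disagree: only sealed replicas are safe to read.
    return [r["en"] for r in replicas if r["sealed"]]

print(choose_log_replicas([
    {"en": "EN1", "sealed": True,  "commit": 100},
    {"en": "EN2", "sealed": True,  "commit": 100},
    {"en": "EN3", "sealed": False, "commit": 120},
]))  # → ['EN1', 'EN2'] – use EN1, EN2 for loading, as in the diagram
```

Unlike data streams, logs may contain offsets that were written but never acked, which is why an unsealed replica with a divergent commit length cannot be trusted on partition load.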
Summary

• Highly available cloud storage with strong consistency
• Scalable data abstractions to build your applications:
  • Blobs – Files and large objects
  • Tables – Massively scalable structured storage
  • Queues – Reliable delivery of messages

More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf