Computer Architecture and Memory Systems Lab.
Transcript of a presentation by the Computer Architecture and Memory Systems Lab.
Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and
Large-Scale SSD Array Systems
Sungjoon Koh, Jie Zhang, Miryeong Kwon, Jungyeon Yoon, David Donofrio, Nam Sung Kim and Myoungsoo Jung
Take-away

• Motivation: Distributed systems are starting to adopt erasure coding as a fault tolerance mechanism instead of replication, due to replication's high storage overhead.

• Goal: Understand the system characteristics of online erasure coding by analyzing them and comparing them with those of replication.

• Observations on Online Erasure Coding:
1) Up to 13× I/O performance degradation compared to replication.
2) 50% CPU usage and many context switches.
3) Up to 700× more I/O than the total request volume.
4) Up to 500× more network traffic among the storage nodes than the total request amount.

• Summary of Our Work:
• We quantitatively observe and measure the various overheads imposed by online erasure coding on a distributed system consisting of 52 SSDs.
• We collect block-level traces from all-flash-array-based storage clusters, which can be downloaded freely.
Overall Results

[Figure: 4KB random read/write results for RS(10,4), normalized to 3-replication (3-Rep.), across six metrics: throughput, latency, CPU utilization, relative context switches, private network overhead, and I/O amplification. Bar labels include 57.7, 37.8, 12.6, 10.7, 10.4, 8.7, 7.6, 1.9, 1.5, 0.67, 0.14, and values near 0.]
Introduction

Demands on scalable, high-performance distributed storage systems.
Employing SSDs in HPC & DC Systems

Compared with HDDs, SSDs offer higher bandwidth, shorter latency, and lower power consumption.
Storage System Failures

• Typically, storage systems experience regular failures:
1) Storage failures.
   Although SSDs have higher reliability than HDDs, daily failures cannot be ignored.
   ex) Facebook reports that up to 3% of HDDs fail each day. (Ref. M. Sathiamoorthy et al., "XORing elephants: Novel erasure codes for big data," in PVLDB, 2013.)
2) Network switch errors, power outages, and soft/hard errors.

"So we need a fault tolerance mechanism."
Fault Tolerance Mechanisms in Distributed Systems

Replication: the traditional fault tolerance mechanism. [Figure: Data → Replica → Replica]

• A simple and effective way to make a system resilient.
× High storage overhead (3×).
× Especially for SSDs, replication
  1) is expensive because of SSDs' high cost per GB, and
  2) degrades performance due to SSD-specific characteristics, e.g., garbage collection and wear-out.

We need an alternative method to reduce the storage overhead.
Erasure coding: an alternative to replication. [Figure: Data Chunks → Encode → Coding Chunks]

• Lower storage overhead than replication.
× High reconstruction cost (a well-known problem).
  ex) A Facebook cluster with EC increases network traffic by more than 100TB in a day.
Many studies try to reduce the reconstruction cost.

We observed significant overheads imposed during I/O services in a distributed system employing erasure codes.
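The storage-overhead gap can be quantified with the usual (k + m) / k ratio; a minimal sketch, using the RS parameters evaluated later in this presentation:

```python
def storage_overhead(k, m):
    """Raw bytes stored per user byte under RS(k, m): (k + m) / k."""
    return (k + m) / k

# 3-replication stores every byte three times; erasure coding stores far less.
overheads = {
    "3-Rep.": 3.0,
    "RS(6,3)": storage_overhead(6, 3),    # 1.5x
    "RS(10,4)": storage_overhead(10, 4),  # 1.4x
}
```

This is exactly why erasure coding is attractive for SSD arrays, where cost per GB is high.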
Background

• Reed-Solomon: the erasure coding algorithm.
• Ceph: the distributed system used in this research.
  - Architecture
  - Data path
  - Storage stack
Reed-Solomon

• The most famous erasure coding algorithm.
• Divides data into k equal data chunks and generates m coding chunks.
• Encoding: multiplication of a generator matrix by the data chunks as a vector.
• Stripe: the k data chunks.
• Reed-Solomon with k data chunks and m coding chunks is written "RS(k, m)".
• Can recover from up to m failures.

[Figure: generator matrix × data chunks (D0–D3) → data chunks plus coding chunks (C0–C2), i.e., RS(4,3), which tolerates 3 failures.]
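The "generator matrix × data vector" encoding can be made concrete with a toy Reed-Solomon code over the prime field GF(7). Real systems (including Ceph's erasure-code plugins) work over GF(2^8); the prime field merely keeps the arithmetic readable, and the requirement that every k×k submatrix of the generator be invertible is only verified for this small example:

```python
P = 7  # prime field modulus; real deployments use GF(2^8)

def gen_matrix(k, m):
    # Systematic generator: identity rows (data chunks) stacked on
    # Vandermonde rows [1, x, x^2, ...] for x = 1..m (coding chunks).
    ident = [[int(i == j) for j in range(k)] for i in range(k)]
    vand = [[pow(x, j, P) for j in range(k)] for x in range(1, m + 1)]
    return ident + vand

def encode(data, m):
    """RS(k, m): k data symbols -> k + m chunks (data first, then coding)."""
    G = gen_matrix(len(data), m)
    return [sum(g * d for g, d in zip(row, data)) % P for row in G]

def decode(surviving, k, m):
    """Recover the k data symbols from any k surviving chunks.
    `surviving` maps chunk index (row of G) -> symbol."""
    G = gen_matrix(k, m)
    rows = sorted(surviving)[:k]
    A = [G[r][:] for r in rows]      # k x k submatrix of the generator
    b = [surviving[r] for r in rows]
    # Gaussian elimination mod P (assumes the submatrix is invertible,
    # which holds for this small example).
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col] % P)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        inv = pow(A[col][col], P - 2, P)  # Fermat inverse
        A[col] = [a * inv % P for a in A[col]]
        b[col] = b[col] * inv % P
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * c) % P for a, c in zip(A[r], A[col])]
                b[r] = (b[r] - f * b[col]) % P
    return b

chunks = encode([3, 5], m=2)              # RS(2, 2): 4 chunks total
ok = {i: c for i, c in enumerate(chunks) if i not in {0, 1}}
print(decode(ok, k=2, m=2))               # -> [3, 5], despite losing both data chunks
```

Losing any m = 2 of the 4 chunks still leaves an invertible k×k system, which is exactly the "can recover from up to m failures" property above.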
Ceph Architecture

[Figure: client nodes (App / libRBD / libRADOS) connect over the public network to storage nodes (Node 0 … Node n); each node runs a Monitor and several OSDs over its storage, and the nodes are interconnected by a private network.]

• Client nodes are connected to the storage nodes through the "public network".
• Storage nodes are connected to each other through the "private network".
• Each storage node consists of several object storage device daemons (OSDs) and monitors.
• OSDs handle read/write services.
• Monitors manage the access permissions and the status of multiple OSDs.
Data Path

1. A file/block is handled as an object (libRBD).
2. The object is assigned to a placement group (PG), which consists of several OSDs, according to the result of a hash function (libRADOS: oid → HASH() → pgid, within a pool).
3. The CRUSH algorithm determines the primary OSD in the PG.
4. The object is sent to the primary OSD.
5. The primary OSD sends the object to the other OSDs (secondary, tertiary, …) in the form of replicas/chunks, depending on the fault tolerance mechanism.
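The placement steps above can be sketched as follows. This is not Ceph's actual implementation (Ceph hashes with rjenkins and walks the full CRUSH map); a stable hash stands in for both HASH() and CRUSH(), and the pool name, PG count, and 40-OSD cluster size are made-up parameters:

```python
import hashlib

def place_object(oid, pool, pg_count, cluster_osds=40, replicas=3):
    """Simplified sketch of the oid -> pgid -> OSDs placement flow."""
    # Step 2: object id -> placement group id via a hash of (pool, oid).
    h = int(hashlib.sha256(f"{pool}/{oid}".encode()).hexdigest(), 16)
    pgid = h % pg_count
    # Step 3: deterministically map the PG to `replicas` distinct OSDs;
    # the first acts as the primary, the rest as secondary, tertiary, ...
    g = int(hashlib.sha256(f"pg-{pgid}".encode()).hexdigest(), 16)
    osds = [(g + i) % cluster_osds for i in range(replicas)]
    return pgid, osds[0], osds[1:]

pgid, primary, others = place_object("block-0001", "rbd-pool", pg_count=128)
```

The key property mirrored here is that placement is computed, not looked up: any client can independently derive the same primary OSD for an object without consulting a central directory.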
Storage Stack

[Figure: an object arrives at the primary OSD through the Client Messenger, passes through the Dispatcher, PrimaryLogPG, and the PG Backend (Replicated or EC), and the resulting chunks/replicas are forwarded through the Cluster Messenger to the other OSDs' stacks and finally down to the SSDs.]

"Implemented in user space." The OSD stack runs in user space; only the final device writes cross into kernel space.
Analysis Overview

[Figure: Ceph storage cluster testbed. A client (24 cores, 256GB DDR4 DRAM) connects over a 10Gb public network to storage nodes, each running a Monitor and six OSDs (OSD1–OSD6) over SSDs (6TB); the storage nodes are linked by a 10Gb private network.]

1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from the disks.
4) Private network traffic.
Object Management in Erasure Coding
Observe that erasure coding uses a different object management scheme than replication.
- To manage data/coding chunks.
- Two phases: object initialization & object update.
i) Object initialization.
<4MB Object>
𝑘KB Write
Dummy Data
Generate object with coding chunks
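The initialization step above can be sketched in a few lines. This is a minimal illustration, not Ceph's implementation: XOR parity stands in for the real Galois-field Reed-Solomon math, and the m identical parity copies are only a placeholder for m distinct coding chunks.

```python
def initialize_object(write_buf, offset, k=6, m=3, object_size=4 * 1024 * 1024):
    """First write on a pristine image: the whole object is materialized.
    The small write lands inside a dummy-filled (zeroed) object, the object
    is split into k data chunks, and m coding chunks are generated.
    XOR parity is a stand-in for real Reed-Solomon encoding."""
    obj = bytearray(object_size)                      # dummy (zero) data
    obj[offset:offset + len(write_buf)] = write_buf   # embed the small write
    size = object_size // k
    data = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = bytearray(size)
    for chunk in data:                                # placeholder parity
        for i, b in enumerate(chunk):
            parity[i] ^= b
    coding = [bytes(parity) for _ in range(m)]        # real RS: m distinct chunks
    return [bytes(d) for d in data], coding
```

Even a k-KB write thus produces a full 4MB object's worth of chunk writes, which is the root of the write amplification discussed later.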
![Page 77: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/77.jpg)
Object Management in Erasure Coding
ii) Object update.
Data Chunk 2
To be updated
![Page 78: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/78.jpg)
Object Management in Erasure Coding
ii) Object update.
Data Chunk 1
Data Chunk 2
Data Chunk 3
Data Chunk 4
Data Chunk 5
Data Chunk 0
i) Read whole stripe
To be updated
![Page 79: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/79.jpg)
Object Management in Erasure Coding
ii) Object update.
Data Chunk 1
Data Chunk 2
Data Chunk 3
Data Chunk 4
Data Chunk 5
Data Chunk 0
To be updated
ii) Generate coding chunks
Coding Chunk 0
Coding Chunk 1
Coding Chunk 2
![Page 80: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/80.jpg)
Object Management in Erasure Coding
ii) Object update.
Data Chunk 1
Data Chunk 2
Data Chunk 3
Data Chunk 4
Data Chunk 5
Data Chunk 0
To be updated
Coding Chunk 0
Coding Chunk 1
Coding Chunk 2
iii) Write to storage
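The three update steps above (read whole stripe, regenerate coding chunks, write to storage) can be sketched as follows. This is an illustrative model, not Ceph's code; XOR again stands in for Reed-Solomon parity.

```python
def update_object(data_chunks, index, new_chunk, m=3):
    """Sketch of the slide's update path:
    i)   read the whole stripe,
    ii)  regenerate the m coding chunks,
    iii) write the updated data chunk and all coding chunks back."""
    stripe = list(data_chunks)            # i) read whole stripe
    stripe[index] = new_chunk             #    apply the update in memory
    parity = bytearray(len(new_chunk))    # ii) recompute placeholder parity
    for chunk in stripe:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    coding = [bytes(parity) for _ in range(m)]
    to_write = [stripe[index]] + coding   # iii) chunks written back to storage
    return stripe, coding, to_write
```

Note that updating one data chunk still forces reads of the whole stripe and writes of every coding chunk, so small overwrites are disproportionately expensive.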
![Page 81: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/81.jpg)
Workload Description
3-replication, RS(6,3), RS(10,4)
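The three configurations trade storage overhead for fault tolerance differently; a quick sketch of the storage multipliers motivating erasure coding:

```python
def replication_overhead(copies):
    """Storage multiplier for n-way replication: every byte stored n times."""
    return float(copies)

def rs_overhead(k, m):
    """Storage multiplier for Reed-Solomon RS(k, m): each stripe holds
    k data chunks plus m coding chunks, so the overhead is (k + m) / k."""
    return (k + m) / k

print(replication_overhead(3))  # 3.0 -> 3-replication triples the data
print(rs_overhead(6, 3))        # 1.5 -> RS(6,3) tolerates 3 failures
print(rs_overhead(10, 4))       # 1.4 -> RS(10,4) tolerates 4 failures
```

This storage saving is the motivation cited earlier; the rest of the analysis shows what it costs in I/O, CPU, and network terms.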
![Page 82: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/82.jpg)
Workload Description
Micro benchmark: Flexible I/O (FIO)
Request Size (KB) | 1, 2, 4, 8, 16, 32, 64, 128
Access Type       | Sequential         | Random
Pre-write         | X     | O          | X     | O
Operation Type    | Write | Read Write | Write | Read Write
3-replication, RS(6,3), RS(10,4)
![Page 83: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/83.jpg)
Workload Description
Micro benchmark: Flexible I/O (FIO)
Request Size (KB) | 1, 2, 4, 8, 16, 32, 64, 128
Access Type       | Sequential         | Random
Pre-write         | X     | O          | X     | O
Operation Type    | Write | Read Write | Write | Read Write
3-replication, RS(6,3), RS(10,4)
![Page 84: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/84.jpg)
Workload Description
Micro benchmark: Flexible I/O (FIO)
Request Size (KB) | 1, 2, 4, 8, 16, 32, 64, 128
Access Type       | Sequential         | Random
Pre-write         | X     | O          | X     | O
Operation Type    | Write | Read Write | Write | Read Write
3-replication, RS(6,3), RS(10,4)
![Page 85: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/85.jpg)
Workload Description
Micro benchmark: Flexible I/O (FIO)
Request Size (KB) | 1, 2, 4, 8, 16, 32, 64, 128
Access Type       | Sequential         | Random
Pre-write         | X     | O          | X     | O
Operation Type    | Write | Read Write | Write | Read Write
3-replication, RS(6,3), RS(10,4)
![Page 86: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/86.jpg)
[Testbed diagram: a client (24 cores, 256GB DDR4 DRAM) connects over a 10Gb public network to a Ceph storage cluster of four nodes; each node runs six OSDs (OSD1–OSD6) backed by 6TB of SSDs, three of the nodes also run monitors (Mon), and the nodes share a 10Gb private network.]
Analysis Overview
1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from disks.
4) Private network traffic.
![Page 87: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/87.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential writes, 3-Rep. vs. RS(6,3) vs. RS(10,4), with an inset for 1–4KB.]
![Page 88: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/88.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Significant performance degradation in Reed-Solomon.
![Page 89: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/89.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) vs. request size for sequential writes; annotation "×11.3 (MAX)" on the throughput chart.]
- Significant performance degradation in Reed-Solomon.
- Throughput: 11.3× worse in RS (max)
![Page 90: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/90.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) vs. request size for sequential writes; annotations "×11.3 (MAX)" on throughput and "×12.9 (MAX)" on latency.]
- Significant performance degradation in Reed-Solomon.
- Throughput: 11.3× worse in RS (max)
- Latency: 12.9× longer in RS (max)
![Page 91: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/91.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) for 4KB, 8KB, and 16KB sequential writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Significant performance degradation in Reed-Solomon.
- The degradation at request sizes of 4–16KB is unacceptably large.
![Page 92: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/92.jpg)
Performance Comparison (Sequential Write)
[Charts: throughput (MB/s) and latency (msec) for 4KB, 8KB, and 16KB sequential writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Significant performance degradation in Reed-Solomon.
- The degradation at request sizes of 4–16KB is unacceptably large.
- Encoding computation, data management, and additional network traffic cause the degradation in erasure coding.
![Page 93: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/93.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4), with an inset for 1–4KB.]
![Page 94: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/94.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Performance degradation in Reed-Solomon.
![Page 95: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/95.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size for sequential reads; annotation "×3.4" on the throughput chart at 4KB.]
- Performance degradation in Reed-Solomon.
- Throughput: 3.4× worse in RS (4KB)
![Page 96: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/96.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size for sequential reads; annotations "×3.4" on both charts at 4KB.]
- Performance degradation in Reed-Solomon.
- Throughput: 3.4× worse in RS (4KB)
- Latency: 3.4× longer in RS (4KB)
![Page 97: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/97.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Performance degradation in Reed-Solomon.
- Even though there was no failure, performance degradation occurred.
![Page 98: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/98.jpg)
Performance Comparison (Sequential Read)
[Charts: throughput (MB/s) and latency (msec) vs. request size (1KB–128KB) for sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Performance degradation in Reed-Solomon.
- Even though there was no failure, performance degradation occurred.
- Caused by RS-concatenation, which generates extra data transfers.
![Page 99: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/99.jpg)
Performance Comparison (Sequential Read)
- Performance degradation in Reed-Solomon.
- Even though there was no failure, performance degradation occurred.
- Caused by RS-concatenation, which generates extra data transfers.
[Diagram: data chunks 0–5 being gathered by "RS-Concatenation".]
![Page 100: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/100.jpg)
Performance Comparison (Sequential Read)
- Performance degradation in Reed-Solomon.
- Even though there was no failure, performance degradation occurred.
- Caused by RS-concatenation, which generates extra data transfers.
[Diagram: data chunks 0–5 being gathered by "RS-Concatenation".]
![Page 101: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/101.jpg)
Performance Comparison (Sequential Read)
- Performance degradation in Reed-Solomon.
- Even though there was no failure, performance degradation occurred.
- Caused by RS-concatenation, which generates extra data transfers.
[Diagram: data chunks 0–5 grouped into a stripe by "RS-Concatenation".]
![Page 102: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/102.jpg)
[Testbed diagram of the client and the Ceph storage cluster, repeated from the Analysis Overview slide.]
Analysis Overview
1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from disks.
4) Private network traffic.
![Page 103: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/103.jpg)
Computing and Software Overheads (CPU Utilization)
<Random Read><Random Write>
![Page 104: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/104.jpg)
Computing and Software Overheads (CPU Utilization)
<Random Read><Random Write>
- RS requires much more CPU cycles than replication.
![Page 105: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/105.jpg)
Computing and Software Overheads (CPU Utilization)
<Random Read><Random Write>
- RS requires much more CPU cycles than replication.
- User mode CPU utilizations account for 70~75% of total CPU cycles.
![Page 106: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/106.jpg)
Computing and Software Overheads (CPU Utilization)
<Random Read><Random Write>
- RS requires much more CPU cycles than replication.
- User mode CPU utilizations account for 70~75% of total CPU cycles.
- Uncommon in RAID systems.
![Page 107: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/107.jpg)
Computing and Software Overheads (CPU Utilization)
<Random Read><Random Write>
- RS requires much more CPU cycles than replication.
- User mode CPU utilizations account for 70~75% of total CPU cycles.
- Uncommon in RAID systems
- Implemented at the user level (e.g., OSD daemon, PG backend, fault-tolerance modules).
![Page 108: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/108.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4), with insets for 32–128KB.]
Relative number of context switches = (number of context switches) / (total amount of requests (MB))
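The normalization above is straightforward; a sketch with an illustrative (not measured) example:

```python
def relative_context_switches(num_switches, total_request_mb):
    """Context switches per MB of requested I/O, as defined on the slide."""
    return num_switches / total_request_mb

# Illustrative numbers: 500,000 switches while serving 100 MB of requests.
print(relative_context_switches(500_000, 100))  # 5000.0 switches/MB
```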
![Page 109: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/109.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
Relative number of context switches = (number of context switches) / (total amount of requests (MB))
![Page 110: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/110.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Relative number of context switches = (number of context switches) / (total amount of requests (MB))
- Many more context switches occur in RS than in replication.
![Page 111: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/111.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Relative number of context switches = (number of context switches) / (total amount of requests (MB))
- Many more context switches occur in RS than in replication.
i) Read: data transfers through OSDs and computation during RS-concatenation.
![Page 112: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/112.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Relative number of context switches = (number of context switches) / (total amount of requests (MB))
- Many more context switches occur in RS than in replication.
i) Read: data transfers through OSDs and computation during RS-concatenation.
ii) Write:
![Page 113: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/113.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Relative number of context switches = (number of context switches) / (total amount of requests (MB))
- Many more context switches occur in RS than in replication.
i) Read: data transfers through OSDs and computation during RS-concatenation.
ii) Write:
1) Initializing an object incurs many writes and a significant amount of computation.
![Page 114: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/114.jpg)
Computing and software overheads (Context Switch)
<Random Read><Random Write>
[Charts: relative context switches (#/MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
- Relative number of context switches = (number of context switches) / (total amount of requests (MB))
- Many more context switches occur in RS than in replication.
i) Read: data transfers through OSDs and computation during RS-concatenation.
ii) Write:
1) Initializing an object incurs many writes and a significant amount of computation.
2) Updating an object introduces many transfers among OSDs through user-level modules.
![Page 115: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/115.jpg)
[Testbed diagram of the client and the Ceph storage cluster, repeated from the Analysis Overview slide.]
Analysis Overview
1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from disks.
4) Private network traffic.
![Page 116: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/116.jpg)
I/O Amplification (Random Write)
I/O amplification = (read/write amount from storage (MB)) / (total amount of requests (MB))
<Write Amplification> <Read Amplification>
[Charts: write and read amplification vs. request size (1KB–128KB) for random writes, 3-Rep. vs. RS(6,3) vs. RS(10,4), with insets for 32–128KB.]
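The amplification metric itself is a simple ratio; a sketch with illustrative (not measured) numbers:

```python
def io_amplification(storage_mb, request_mb):
    """I/O amplification: traffic the disks actually served divided by
    the volume the client requested, as defined on the slide."""
    return storage_mb / request_mb

# Illustrative only: a 1 MB burst of small random writes that forces
# ~700 MB of chunk reads/writes has an amplification of 700x.
print(io_amplification(700, 1))  # 700.0
```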
![Page 117: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/117.jpg)
I/O Amplification (Random Write)
i) I/O amplification = (read/write amount from storage (MB)) / (total amount of requests (MB))
<Write Amplification> <Read Amplification>
[Charts: write and read amplification vs. request size (1KB–128KB) for random writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 118: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/118.jpg)
I/O Amplification (Random Write)
i) I/O amplification = (read/write amount from storage (MB)) / (total amount of requests (MB))
ii) Erasure coding causes write amplification of up to 700× the total request volume.
<Write Amplification> <Read Amplification>
[Charts: write and read amplification vs. request size (1KB–128KB) for random writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 119: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/119.jpg)
I/O Amplification (Random Write)
i) I/O amplification = (read/write amount from storage (MB)) / (total amount of requests (MB))
ii) Erasure coding causes write amplification of up to 700× the total request volume.
iii) Why is the write amplification of random writes so large?
<Write Amplification> <Read Amplification>
[Charts: write and read amplification vs. request size (1KB–128KB) for random writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 120: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/120.jpg)
I/O Amplification (Random Write)
i) I/O amplification = (read/write amount from storage (MB)) / (total amount of requests (MB))
ii) Erasure coding causes write amplification of up to 700× the total request volume.
iii) Why is the write amplification of random writes so large?
<Write Amplification> <Read Amplification>
[Charts: write and read amplification vs. request size (1KB–128KB) for random writes, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 121: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/121.jpg)
I/O Amplification (Read)
<Sequential Read><Random Read>
<Random Read> <Sequential Read>
[Charts: read amplification (read from storage (MB) / total request (MB)) vs. request size (1KB–128KB) for random and sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 122: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/122.jpg)
I/O Amplification (Read)
- Read amplification caused by RS-concatenation.
<Random Read> <Sequential Read>
[Charts: read amplification (read from storage (MB) / total request (MB)) vs. request size (1KB–128KB) for random and sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 123: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/123.jpg)
I/O Amplification (Read)
- Read amplification caused by RS-concatenation.
i) Random read: mostly reads different stripes, causing heavy read amplification.
<Random Read> <Sequential Read>
[Charts: read amplification vs. request size (1KB–128KB) for random and sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 124: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/124.jpg)
I/O Amplification (Read)
- Read amplification caused by RS-concatenation.
i) Random read: mostly reads different stripes, causing heavy read amplification.
ii) Sequential read: consecutive I/O requests read data from the same stripe, so there is almost no read amplification.
<Random Read> <Sequential Read>
[Charts: read amplification vs. request size (1KB–128KB) for random and sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
![Page 125: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/125.jpg)
I/O Amplification (Read)
- Read amplification caused by RS-concatenation.
i) Random read: mostly reads different stripes, causing heavy read amplification.
ii) Sequential read: consecutive I/O requests read data from the same stripe, so there is almost no read amplification.
<Random Read> <Sequential Read>
[Charts: read amplification vs. request size (1KB–128KB) for random and sequential reads, 3-Rep. vs. RS(6,3) vs. RS(10,4).]
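The random-vs-sequential contrast above can be made concrete with a toy model of RS-concatenation: every request fetches its whole stripe, but consecutive requests within one stripe reuse the already-fetched data. The request and stripe sizes below are illustrative assumptions, not the paper's measured geometry.

```python
def read_amplification(offsets, req_size, stripe_size):
    """Toy model of RS-concatenation: serving a request means reading the
    entire stripe containing it, but the stripe fetched for the previous
    request is reused when the next request falls in the same stripe."""
    fetched, last_stripe = 0, None
    for off in offsets:
        stripe = off // stripe_size
        if stripe != last_stripe:        # crossing a stripe boundary
            fetched += stripe_size       # triggers a full-stripe read
            last_stripe = stripe
    return fetched / (len(offsets) * req_size)

REQ, STRIPE = 4 * 1024, 64 * 1024
sequential = [i * REQ for i in range(16)]     # all inside one stripe
scattered = [i * STRIPE for i in range(16)]   # one request per stripe
print(read_amplification(sequential, REQ, STRIPE))  # 1.0  -> no amplification
print(read_amplification(scattered, REQ, STRIPE))   # 16.0 -> heavy amplification
```

Scattered 4KB reads pay for a full stripe each time, while a sequential scan amortizes each stripe across many requests, matching the measured contrast between the two panels.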
![Page 126: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/126.jpg)
[Testbed diagram of the client and the Ceph storage cluster, repeated from the Analysis Overview slide.]
Analysis Overview
1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from disks.
4) Private network traffic.
![Page 127: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/127.jpg)
Network Traffics Among the Storage Nodes
<Random Write> <Random Read>
[Charts: private network traffic (MB) / total request (MB) vs. request size (1KB–128KB) for random writes and random reads, 3-Rep. vs. RS(6,3) vs. RS(10,4), with insets for 32–128KB.]
- Shows a trend similar to the I/O amplification results.
- Erasure coding:
i) Write: initializing & updating objects causes heavy network traffic.
ii) Read: RS-concatenation causes heavy network traffic.
- Replication exhibits only the minimal data transfers needed for necessary communication
(e.g., OSD interaction: monitoring the status of each OSD).
![Page 128: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/128.jpg)
Conclusion
- We studied the overheads imposed by erasure coding on a distributed SSD array system.
- In contrast to the common expectation of erasure codes, we observed that they exhibit heavy network traffic and more I/O amplification than replication.
- Erasure coding also requires many more CPU cycles and context switches than replication due to its user-level implementation.
![Page 129: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/129.jpg)
Q&A
![Page 130: Computer Architecture and MEmory systems Lab.](https://reader033.fdocuments.net/reader033/viewer/2022051906/628487e3f0cd344a6922800a/html5/thumbnails/130.jpg)
Object Management in Erasure Coding
[Time-series charts: throughput (MB/s), RX/TX network traffic, context switches (#/sec), and CPU utilization (%, user/system) for writes on a pristine image (object initialization, 0–150s) and overwrites (object update, 0–80s).]
Time-series analysis of CPU utilization, context switches, and private network throughput observed for random writes on a pristine image and for random overwrites.