Design of an Exact Data Deduplication Cluster


Transcript of Design of an Exact Data Deduplication Cluster (storageconference.us/2012/Presentations/R16.Deduplication.DesignOfAnExact.pdf)

Page 1
Page 2

Design of an Exact Data Deduplication Cluster

Jürgen Kaiser (JGU), Dirk Meister (JGU), Andre Brinkmann (JGU), Sascha Effert (christmann)

Page 3

• Deduplication (short)
• System Overview
• Fault Tolerance
• Inter-node Communication
• Evaluation
• Conclusion

Outline

Presenter
Presentation Notes
As short as possible
Page 4

• Storage savings approach
  – Remove coarse-grained redundancy

• Process overview (sketched below)

1. Split data into chunks
2. Fingerprint chunks with a cryptographic hash
3. Check if the fingerprint is already stored in the index (Chunk Index)
4. Store new chunk data (Storage)
5. Store the block-chunk mapping (Block Index)

Deduplication (short)
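A minimal Python sketch of this write path, assuming fixed-size chunking and in-memory dicts for the two indexes; the prototype's actual structures are persistent and partitioned, so this is only an illustration of the five steps:

```python
# Minimal sketch of the deduplication write path described above.
# The fixed-size chunking and dict-based indexes are illustrative
# assumptions, not the prototype's actual data structures.
import hashlib

CHUNK_SIZE = 8 * 1024          # the talk targets 8-16 KB chunks

chunk_index = {}               # fingerprint -> container location
container_storage = []         # stored chunk payloads
block_index = {}               # block id -> list of fingerprints

def write_block(block_id: int, data: bytes) -> None:
    fingerprints = []
    # 1. Split data into chunks (fixed-size here for simplicity)
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        # 2. Fingerprint the chunk with a cryptographic hash
        fp = hashlib.sha1(chunk).hexdigest()
        # 3. Check whether the fingerprint is already in the chunk index
        if fp not in chunk_index:
            # 4. Store new chunk data in the container storage
            container_storage.append(chunk)
            chunk_index[fp] = len(container_storage) - 1
        fingerprints.append(fp)
    # 5. Store the block-to-chunk mapping in the block index
    block_index[block_id] = fingerprints

write_block(0, b"example payload " * 4096)  # duplicate chunks stored once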

Page 5

• Single-node deduplication systems
  – Limited scaling

• Just a Bunch of Deduplication Systems
  – Lots of independent deduplication systems
  – Complex load balancing, migration, and management
  – Cross data set sharing

• Clustering is promising for scaling deduplication

Clustered Deduplication

Presenter
Presentation Notes
Point: not so long. Point: a bit shorter
Page 6

• Term: “Exact deduplication”
  – Detect all duplicate chunks
  – Same deduplication as a single-node system
    • But larger and faster

• Goal
  – Scalable, exact deduplication
  – Small chunk sizes (8 – 16 KB)

• Contribution
  – Architecture combining these properties
  – Evaluation using a prototype
  – Exploring limitations

Exact Deduplication

Presenter
Presentation Notes
More compact. I don't need to explain what already exists
Page 7

System Design Overview

Presenter
Presentation Notes
Here: explain what the Chunk Index and Block Index are
Page 8

• Shared Nothing (Direct Attached)
  – Replication/erasure coding
  – Chunk data over network

• Shared Storage (SAN)
  – Complex locking schemes
  – Scaling issues

• One partition is only accessed by a single node
  – No short-term locking
  – Sharing is used for
    • Fault tolerance
    • Load balancing

Storage Organization

Presenter
Presentation Notes
Important, but too long
Page 9

• Chunk Index Partitions
• Block Index Partitions
• Container Partitions
• Container Metadata Partitions

• Partition types are handled differently
  – Fault tolerance
  – Load balancing
  – Data assignment

Partition Types

Page 10

• Contains parts of the distributed chunk index

• Chunks assigned by SHA-1 prefix (see the sketch below)

• Performance sensitive
  – SSD storage
  – 100,000s of requests/s during writes
  – Multiple requests issued concurrently
  – Index not updated directly
    • Write-ahead log
    • Dirty uncommitted state in memory

Chunk Index Partition
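A hedged sketch of the prefix-based assignment; the partition count and prefix width are illustrative assumptions, only the idea of routing by fingerprint prefix comes from the talk:

```python
# Sketch: assign a chunk to a chunk-index partition by its SHA-1
# fingerprint prefix. NUM_PARTITIONS and the 8-byte prefix width
# are assumptions for illustration.
import hashlib

NUM_PARTITIONS = 64  # assumed; the talk does not fix this number

def partition_for(fingerprint: bytes) -> int:
    # Interpret the first bytes of the fingerprint as an integer;
    # a uniform cryptographic hash keeps this assignment balanced.
    prefix = int.from_bytes(fingerprint[:8], "big")
    return prefix % NUM_PARTITIONS

fp = hashlib.sha1(b"some chunk data").digest()
print(partition_for(fp))
```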

Presenter
Presentation Notes
Take the "10s of partitions" out, for all types
Page 11

• Contains the block index

• An iSCSI target is assigned to a partition
• Only the node the partition is assigned to exports the iSCSI target
  – An extension like clustered iSCSI could be implemented

• High locality of access
  – Modest performance requirements
  – Currently on SSD storage

Block Index Partition

Page 12

• Stored on HDD storage
• Stores parts of the container storage
• Chunk data is stored in containers

• Assignment
  – Prefer local partitions
  – One local partition for writing
  – Chunk data never transferred over the network

• Container Metadata Partition
  – Metadata and write-ahead log stored on SSD

Container Partition

Page 13

• Storage reliability
  – RAID 6 (or double mirroring)

• Node crash (basic idea; sketched below):
  – Failure detection using e.g. keep-alive messages
  – Partition remapping by the cluster leader
    • New leader election if the cluster leader fails
    • Based on the ZooKeeper system
  – Recovery strategy depends on the partition type

• Load Balancing
  – Strongly related to fault tolerance
  – Details in the paper

Fault Tolerance
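A rough sketch of the basic node-crash idea above, assuming a leader-side partition map and keep-alive timestamps; the timeout value and the placement policy are placeholders, not the prototype's:

```python
# Sketch of node-crash handling: the cluster leader tracks keep-alive
# timestamps and remaps the partitions of a node it considers failed.
# KEEPALIVE_TIMEOUT and the placement policy are assumptions.
import time

KEEPALIVE_TIMEOUT = 5.0   # seconds, assumed

last_seen = {}            # node id -> last keep-alive timestamp
partition_owner = {}      # partition id -> node id

def on_keepalive(node: str) -> None:
    last_seen[node] = time.monotonic()

def leader_check(now: float) -> None:
    """Run periodically on the cluster leader (elected via ZooKeeper)."""
    alive = {n for n, t in last_seen.items() if now - t < KEEPALIVE_TIMEOUT}
    for partition, owner in partition_owner.items():
        if owner not in alive and alive:
            # Remap the partition to a surviving node; the actual
            # recovery action then depends on the partition type.
            new_owner = min(alive)  # placeholder placement policy
            partition_owner[partition] = new_owner
```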

Presenter
Presentation Notes
Storage reliability: quite good
Page 14

• Chunk Index
  – Needs to be up very quickly → no log replay
  – “False negatives”
  – State cleaned up during background log replay

• Block Index
  – Outdated results are not acceptable
  – Replay of the write-ahead log to recover the latest state (sketched below)
  – Downtime of around a minute

• Container Storage
  – Only small non-committed state → no issue

Recovery Strategy
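A minimal sketch of the block-index replay step, assuming a JSON-lines log record format; the actual log format is not given in the talk:

```python
# Sketch: rebuild the latest block-index state after a crash by
# replaying the write-ahead log. The record format is an assumption.
import json

def replay_block_index_log(log_path: str) -> dict:
    block_index = {}
    with open(log_path) as log:
        for line in log:
            # e.g. {"block": 7, "chunks": ["fp1", "fp2", ...]}
            record = json.loads(line)
            # Later records overwrite earlier ones -> latest state wins
            block_index[record["block"]] = record["chunks"]
    return block_index
```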

Presenter
Presentation Notes
Phrase "log replay" differently
Page 15

• Main communication pattern (see the sketch below)
  – For all chunks in a write request:
    • Ask the responsible node
    • Requests to the same node are aggregated
    • Each message is small (20 – 100 bytes)

Inter-node Communication
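A small sketch of the aggregation step, assuming prefix-based routing as in the chunk-index sketch; the wire format of the messages is not modeled:

```python
# Sketch: chunk-lookup requests that target the same responsible node
# are batched into one message. Routing by fingerprint prefix is an
# assumption consistent with the chunk-index assignment above.
from collections import defaultdict

NUM_NODES = 24  # matches the evaluation cluster's deduplication nodes

def node_for(fingerprint: bytes) -> int:
    return int.from_bytes(fingerprint[:8], "big") % NUM_NODES

def batch_lookups(fingerprints):
    """Group fingerprints by responsible node -> one request per node."""
    batches = defaultdict(list)
    for fp in fingerprints:
        batches[node_for(fp)].append(fp)
    return batches  # each batch becomes a single small network message
```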

Page 16

• Prototype implementation
• 60 node cluster
  – 24 as deduplication nodes
  – 24 for workload generation (clients)
  – 12 for shared storage simulation (6 SSD, 6 SAN)
  – Gigabit Ethernet interconnect
  – IPoIB iSCSI to the storage nodes

• Load generation
  – Based on pattern/probability distributions from traces
  – “1. Generation”: empty deduplication system
  – “2. Generation”: later backup runs

Evaluation

Page 17

[Chart: throughput for 1, 2, 4, 8, 16, and 24 nodes; y-axis 0 – 6000 MB/s; series: 1. Generation (Total), 2. Generation (Total), 1. Generation (per Node), 2. Generation (per Node)]

Prototype throughput (in MB/s)

Page 18

• Throughput relies on exchanging 100,000s of messages/s

• More nodes
  – More write requests
  – Higher chunk lookup “fan-out”
  – Linearly more capacity to process messages
  – Sub-linear scaling

• But only for small chunk sizes and small clusters
  – No additional fan-out in larger clusters
  – Scaling becomes more linear as the cluster grows

Communication Limit

Page 19

Expected number of messages
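One plausible model for this expectation, assuming chunks are assigned uniformly to nodes (the paper's exact derivation may differ): a write request with c chunks spread over n nodes contacts the expected number of distinct nodes given by the classic occupied-bins formula, which approaches c as n grows, so the per-request fan-out saturates and scaling becomes more linear:

```python
# Hedged model: expected number of distinct nodes contacted per write
# request, assuming uniform chunk-to-node assignment. As n_nodes grows
# this approaches chunks_per_request, i.e. the fan-out saturates.
def expected_messages(n_nodes: int, chunks_per_request: int) -> float:
    return n_nodes * (1.0 - (1.0 - 1.0 / n_nodes) ** chunks_per_request)

for n in (2, 4, 8, 16, 24):
    print(n, round(expected_messages(n, 128), 1))
```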

Page 20

Communication Limits: Results

[Chart: estimated throughput in GB/s vs. number of nodes (2 – 8), one curve per chunk size: 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB; y-axis 0 – 80 GB/s]

• Measured message counts based on the communication pattern
• Estimated throughput based on message rates
• Performance excluding storage, chunking, …
  – Only communication

Page 21

• Exact, inline deduplication cluster
  – Architecture
  – Prototype

• Exact deduplication clusters with small chunk sizes are possible

• However, message exchange limits scalability

Conclusion

Page 22

THANK YOU. Questions?

Page 23

Architecture

Page 24

• Chunk Index
  – Load balancing not necessary
  – SHA1 prefix ensures good distribution

• Block Index
  – Load imbalance due to skewed access
  – Move partitions between nodes to balance load
  – Move volume mapping as a last resort

• Container Storage
  – Load imbalance due to skewed read access
  – Move partitions between nodes to balance load

Load Balancing

Page 25

• Current limit: the nodes' ability to receive and process messages

• The network switch can become a bottleneck
  – High-performance switches are surprisingly capable
  – Probably only in larger clusters
  – No slowdown from the network/switch observed
    • Up to 24 nodes

Network Limit?

Page 26

Distributed Chunk Index

Page 27

• Provide SCSI target interface
• Process incoming SCSI requests
• Chunking and fingerprinting

• Contain parts of the indexes
  – Chunk Index
  – Block Index

• Can directly access parts of the stored chunk data
  – Container Storage

Deduplication Nodes