Windows Server 2012: Data deduplication


Transcript of Windows Server 2012: Data deduplication

Page 1: Windows Server 2012: Data deduplication

Windows Server 2012 R2 Data Deduplication: Transforming "Big Data Veracity" Management to Notable Content Storage Savings Presenter: Nora Daniels

Senior Technical Instructor

Page 2: Windows Server 2012: Data deduplication

Nora Daniels, [email protected]

Senior Technical Instructor

MCT; Microsoft Certified Solutions Associate (x2); MCSE 2003; MCSA 2003; MCSE 2000; MCSA 2000; MCITP: Server 2008, SQL Server 2008, SQL Server 2005, Vista Enterprise; MCTS: Server 2008, Virtualization, Windows 7, SQL Server 2008, SQL 2005, Vista; MCDBA; MCDST; MCP; A+ PC Service Technician; Network+ Network Technician; Security+ Security Technician; ITIL v3 Foundation; VCP-DCV

Instructor

Page 3: Windows Server 2012: Data deduplication

Data Deduplication in Windows Server 2012 R2 is another of the OS's features that contributes to its must-have status, because it is one of several strategies that can be used to cut data storage demands. The savings in storage costs that data deduplication delivers will probably, by itself, justify the cost of upgrading Windows-based file servers. In this session we discuss the concept of deduplication and how effective it can be at optimizing storage, reducing the amount of disk space consumed by 50% to 90% when applied to the right data.

Page 4: Windows Server 2012: Data deduplication

Agenda

• What is Windows Data Deduplication

• Data Deduplication Characteristics

• Data Deduplication Candidates

• Data Deduplication Deployment Strategies

Page 5: Windows Server 2012: Data deduplication

Windows Server Data Deduplication

Transparently removes duplication, without changing access semantics

Post-processing approach:
• An optimization job identifies files and applies a chunking algorithm
• Chunks are inserted into a chunk store and selectively compressed
• Original files are replaced with reparse points
• The primary data stream is removed

Efficiently store fewer bits.

(Diagram: File1 and File2 sharing chunks in the chunk store.)
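The post-processing steps above can be sketched as a toy Python model. This is illustrative only: it uses fixed-size chunks, an in-memory dict for the chunk store, and a dict standing in for reparse points, whereas the real feature uses variable-size chunking inside an NTFS filter driver.

```python
import hashlib
import zlib

# Toy model of the optimization job: a shared chunk store plus
# per-file "reparse points" listing the chunks needed to rehydrate.
chunk_store = {}     # chunk hash -> compressed chunk bytes
reparse_points = {}  # filename -> ordered list of chunk hashes

def optimize(filename, data, chunk_size=64 * 1024):
    """Chunk a file, insert unique chunks into the store (compressed),
    and replace the file body with a reparse point."""
    hashes = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in chunk_store:       # duplicate chunks are stored once
            chunk_store[h] = zlib.compress(chunk)
        hashes.append(h)
    reparse_points[filename] = hashes  # primary data stream "removed"

def rehydrate(filename):
    """Serve a file back transparently from its reparse point."""
    return b"".join(zlib.decompress(chunk_store[h])
                    for h in reparse_points[filename])

# Two files with identical content deduplicate to one set of chunks.
payload = b"A" * 200_000
optimize("File1", payload)
optimize("File2", payload)
assert rehydrate("File1") == rehydrate("File2") == payload
```

Both files resolve through the same stored chunks, which is why "File1" and "File2" in the diagram can share most of their on-disk footprint.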

Page 6: Windows Server 2012: Data deduplication

Windows Server Deduplication: new scenario support

Windows Server 2012
• Debut of the next-generation design (not based on Single Instance Storage, SIS)
• Supports closed files

Windows Server 2012 R2
• New support for the VDI scenario
• Dedup of open VHD files for VDI
• Supports Cluster Shared Volumes (CSV)

TechEd 2014 EU
• VDI-at-scale reference deployment
• New virtualized backup scenario
• Requires the November 2014 update rollup

Page 7: Windows Server 2012: Data deduplication

Some design principles…

No customer data loss
• Protect against hardware data corruption.
• Identify and repair corruptions.

Deduplication of active storage
• Not a deduplication server.
• Transparent to the primary server workload.
• Designed to back off if the server needs resources.

Easy optimization policies
• Can exclude specific file types or select folders to skip.
• Process files in the background or choose a more aggressive schedule.
• Default setting is to process files older than 5 days.

Simple deployment experience
• Role can be turned on in Server Manager and enabled on new volumes.
• Can be easily enabled on existing data volumes.


Page 8: Windows Server 2012: Data deduplication

80% capacity savings, plus smaller/faster archive, backup/restore, and migration.

Page 9: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (1)
• 1) Transparent and easy to use: Deduplication can be easily installed and enabled on selected data volumes in a few seconds.

• Applications and end users will not know that the data has been transformed on the disk; when a user requests a file, it will be transparently served up right away.

• The file system as a whole supports all of the NTFS semantics that you would expect.

Page 10: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (2)
• Some files are not processed by deduplication, such as files encrypted using the Encrypting File System (EFS), files smaller than 32 KB, or files that have extended attributes (EAs). In these cases, interaction with the files is entirely through NTFS and the deduplication filter driver does not get involved.

• If a file has an alternate data stream, only the primary data stream will be deduplicated and the alternate stream will be left on the disk.

Page 11: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (3)
• 2) Designed for primary data: The feature can be installed on your primary data volumes without interfering with the server's primary objective.

• Hot data (files that are being written to) will be passed over by deduplication until the file reaches a certain age.

• This way you can get optimal performance for active files and great savings on the rest of the files.


Page 12: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (4)
• 2*) Designed for primary data: Files that meet the deduplication criteria are referred to as "in-policy" files.

• Post-processing: Deduplication is not in the write path when new files come along. New files are written directly to the NTFS volume and are evaluated by a file groveler on a regular schedule. The background processing mode checks for files that are eligible for deduplication every hour, and you can add additional schedules if you need them.

• File Age: Deduplication has a setting called MinimumFileAgeDays that controls how old a file should be before processing the file. The default setting is 5 days. This setting is configurable by the user and can be set to “0” to process files regardless of how old they are.

• File Type and File Location Exclusions: You can tell the system not to process files of a specific type, like PNG files that already have great compression or compressed CAB files that may not benefit from deduplication. You can also tell the system not to process a certain folder.
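The age, type, and folder rules above combine into a simple "in-policy" test. The sketch below is a minimal illustration of that logic; the names (`min_age_days`, `excluded_extensions`, `excluded_folders`) are hypothetical stand-ins, not the actual configuration API.

```python
import os

def is_in_policy(path, age_days, min_age_days=5,
                 excluded_extensions=(".png", ".cab"),
                 excluded_folders=(r"D:\Scratch",)):
    """Illustrative in-policy test: a file qualifies when it is old
    enough, not an excluded type, and not under a skipped folder.
    Setting min_age_days=0 processes files regardless of age."""
    if age_days < min_age_days:   # hot data is passed over
        return False
    if os.path.splitext(path)[1].lower() in excluded_extensions:
        return False
    if any(path.startswith(folder) for folder in excluded_folders):
        return False
    return True

assert is_in_policy(r"D:\Shares\report.docx", age_days=10)
assert not is_in_policy(r"D:\Shares\photo.png", age_days=10)   # excluded type
assert not is_in_policy(r"D:\Shares\new.docx", age_days=1)     # too young
```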

Page 13: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (5)
• 3) Focused on using low resources: The feature was built to automatically yield system resources to the primary server's workload and back off until resources are available again. Most people agree that their servers have a job to do and the storage is just facilitating their data requirements.
• The chunk store's hash index is designed to use low resources and reduce read/write disk IOPS so that it can scale to large datasets and deliver high insert/lookup performance. The index footprint is extremely low, at about 6 bytes of RAM per chunk, and it uses temporary partitioning to support very high scale.
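At roughly 6 bytes of RAM per chunk and a 64 KB average chunk size, the index cost of a volume can be estimated with back-of-the-envelope arithmetic on the figures quoted above (a sketch, not a measurement of the real index):

```python
def index_ram_bytes(volume_bytes, avg_chunk_bytes=64 * 1024,
                    bytes_per_chunk_entry=6):
    """Estimate hash-index RAM for a fully chunked volume:
    (volume size / average chunk size) * bytes per index entry."""
    chunks = volume_bytes // avg_chunk_bytes
    return chunks * bytes_per_chunk_entry

# A 2 TB volume of unique data holds ~33.5 million 64 KB chunks,
# so the index needs on the order of 192 MB of RAM.
two_tb = 2 * 1024**4
print(index_ram_bytes(two_tb) / 1024**2)  # -> 192.0 (MB)
```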

• Deduplication jobs will verify that there is enough memory to do the work; if there is not, the job will stop and try again at the next scheduled interval.

• Administrators can schedule and run any of the deduplication jobs during off-peak hours or during idle time.

Page 14: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (6)
• 4) Portability: A volume that is under deduplication control is an atomic unit. You can back up the volume and restore it to another server.
• You can rip it out of one Windows Server 2012 server and move it to another. Everything that is required to access your data is located on the drive. All of the deduplication settings are maintained on the volume and will be picked up by the deduplication filter when the volume is mounted.

• The only things that are not retained on the volume are the schedule settings, which are part of the Task Scheduler engine.

• If you move the volume to a server that is not running the Data Deduplication feature, you will only be able to access the files that have not been deduplicated.

Page 15: Windows Server 2012: Data deduplication

Data Deduplication Characteristics (7)
• 5) Sub-file chunking: Deduplication segments files into variable-sized chunks (32–128 KB) using a new algorithm developed in conjunction with Microsoft Research.

• The chunking module splits a file into a sequence of chunks in a content dependent manner. The system uses a Rabin fingerprint-based sliding window hash on the data stream to identify chunk boundaries.

• The chunks have an average size of 64 KB; they are compressed and placed into a chunk store located in a hidden folder at the root of the volume, System Volume Information (the "SVI folder").

• The normal file is replaced by a small reparse point, which has a pointer to a map of all the data streams and chunks required to “rehydrate” the file and serve it up when it is requested.
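As a toy illustration of content-dependent boundaries, the sketch below cuts a buffer wherever a hash of the last few bytes has its low bits clear. This is a simple stand-in for the Rabin fingerprint named above, with sizes shrunk far below the real 32–128 KB range; it does not reflect the actual filter-driver implementation.

```python
import hashlib
import random

def chunk_data(data, window=16, mask=0xFF, min_size=512, max_size=4096):
    """Toy content-defined chunking: declare a boundary where a hash of
    the last `window` bytes has its low bits zero. Boundaries depend only
    on local content, so an insertion early in a file shifts nearby cut
    points but later ones re-synchronize."""
    boundaries, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < min_size:
            continue                     # enforce the minimum chunk size
        h = 0
        for b in data[i - window + 1:i + 1]:
            h = (h * 131 + b) & 0xFFFFFFFF
        if (h & mask) == 0 or i + 1 - start >= max_size:
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))     # final (possibly short) chunk
    return boundaries

def chunk_hashes(data):
    """Hash each chunk, as a stand-in for chunk-store lookups."""
    out, start = set(), 0
    for end in chunk_data(data):
        out.add(hashlib.sha256(data[start:end]).hexdigest())
        start = end
    return out

# Two "files" share a 20,000-byte tail at different offsets; because
# boundaries are content-defined, many chunks of the shared tail match.
rng = random.Random(42)
common = bytes(rng.randrange(256) for _ in range(20_000))
shared = chunk_hashes(b"\x01" * 1000 + common) & chunk_hashes(b"\x02" * 3000 + common)
print(len(shared), "chunks shared despite different offsets")
```

This offset-insensitivity is the reason sub-file chunking beats fixed-size blocking: duplicate regions still deduplicate even when files differ near the front.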

Page 16: Windows Server 2012: Data deduplication

What about the data access impact? (1)
• Deduplication creates fragmentation for the files on your disk, as chunks may end up spread apart. This increases seek time, because the disk heads must move around more to gather all the required data.

• As each file is processed, the filter driver works to keep the sequence of unique chunks together, preserving on-disk locality, so it isn’t a completely random distribution.

• Deduplication also has a cache to avoid going to disk for repeat chunks.

• The file-system has another layer of caching that is leveraged for file access.

• If multiple users are accessing similar files at the same time, the access pattern will enable deduplication to speed things up for all of the users.

Page 17: Windows Server 2012: Data deduplication

Final analysis on data access impact (2)
• There are no noticeable differences for opening an Office document. Users will never know that the underlying volume is running deduplication.

• When copying a single large file, we see end-to-end copy times that can be 1.5 times what it takes on a non-deduplicated volume.

• When copying multiple large files at the same time we have seen gains due to caching that can cause the copy time to be faster by up to 30%.

• Under our file-server load simulator (the File Server Capacity Tool), set to simulate 5000 users simultaneously accessing the system, we see only about a 10% reduction in the number of users that can be supported over SMB 3.0.

• Data can be optimized at 20-35 MB/Sec within a single job, which comes out to about 100GB/hour for a single 2TB volume using a single CPU core and 1GB of free RAM. Multiple volumes can be processed in parallel if additional CPU, memory and disk resources are available.
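The throughput figures above are easy to sanity-check with simple arithmetic on the numbers quoted (a unit conversion, not a measurement):

```python
def gb_per_hour(mb_per_sec):
    """Convert a sustained MB/s rate into GB/hour (1 GB = 1024 MB)."""
    return mb_per_sec * 3600 / 1024

# 20-35 MB/s per job works out to roughly 70-123 GB/hour, so the
# quoted "about 100 GB/hour" sits in the middle of that range.
print(round(gb_per_hour(20)), round(gb_per_hour(35)))  # -> 70 123
```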

Page 18: Windows Server 2012: Data deduplication

Reliability and Risk Mitigations (1)
• Even with RAID and redundancy implemented in your system, data corruption risks exist due to various disk anomalies, controller errors, firmware bugs, or even environmental factors like radiation or disk vibrations.

• Deduplication raises the impact of a single chunk corruption since a popular chunk can be referenced by a large number of files.

• Imagine that a chunk referenced by 1000 files is lost due to a sector error; you would instantly suffer a 1000-file loss.

Page 19: Windows Server 2012: Data deduplication

Reliability and Risk Mitigations (2)
• Backup support: We have support for fully optimized backup using the in-box Windows Server Backup tool.

• We have several major vendors working on adding support for optimized backup and un-optimized backup.

• We have a selective file restore API to enable backup applications to pull files out of an optimized backup.

Page 20: Windows Server 2012: Data deduplication

Reliability and Risk Mitigations (3)
• Reporting and detection: Any time the deduplication filter notices a corruption, it logs it in the event log so it can be scrubbed.
• Checksum validation is done on all data and metadata when it is read and written.

• Deduplication will recognize when data that is being accessed has been corrupted, reducing silent corruptions.

Page 21: Windows Server 2012: Data deduplication

Reliability and Risk Mitigations (4)
• Redundancy: Extra copies of critical metadata are created automatically.
• Very popular data chunks receive an entire duplicate copy once they are referenced 100 times.
• We call this area "the hotspot", a collection of the most popular chunks.
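The hotspot rule above can be sketched as a tiny policy function. This is illustrative only: the source states that a duplicate copy is made at 100 references, and the assumption here that copies keep scaling linearly (one more per further 100 references) is mine, not the documented behavior.

```python
def copies_required(reference_count, threshold=100):
    """Illustrative hotspot policy: the original copy, plus one extra
    copy per `threshold` referencing files (linear scaling assumed)."""
    return 1 + reference_count // threshold

assert copies_required(1) == 1      # ordinary chunk: single copy
assert copies_required(100) == 2    # popular chunk: one duplicate
assert copies_required(1000) == 11  # under the assumed linear scaling
```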

Page 22: Windows Server 2012: Data deduplication

Reliability and Risk Mitigations (5)
• Repair: A weekly scrubbing job inspects the event log for logged corruptions and fixes the data chunks from alternate copies if they exist.
• There is also an optional deep-scrub job available that walks through the entire data set looking for corruptions and tries to fix them. When using a Storage Spaces disk pool that is mirrored, deduplication will reach over to the other side of the mirror and grab the good version. Otherwise, the data will have to be recovered from a backup.

• Deduplication will continually scan incoming chunks it encounters looking for the ones that can be used to fix a corruption.

Page 23: Windows Server 2012: Data deduplication

Data Dedup Video Demo

Around 10:00–14:30

Page 24: Windows Server 2012: Data deduplication

Data Deduplication Candidates

File shares or servers that host user documents, software deployment binaries, or virtual hard disk files tend to have plenty of duplication, and they will yield high savings from deduplication.

Page 25: Windows Server 2012: Data deduplication

Great Candidates• Folder redirection servers

• Virtualization depot or provisioning library

• Software deployment shares

• SQL Server and Exchange Server backup volumes

• VDI VHDs (supported only on Windows Server 2012 R2)

Page 26: Windows Server 2012: Data deduplication

Candidates?

Should be evaluated based on content:

• Line-of-business servers

• Static content providers

• Web servers

• High-performance computing (HPC)

Not good candidates for deduplication

• Hyper-V hosts

• WSUS

• Servers running SQL Server or Exchange Server

• Files approaching or larger than 1 TB in size

• VDI VHDs on Windows Server 2012

Page 27: Windows Server 2012: Data deduplication

Deployment: Deduplication Data Evaluation Tool
• To aid in the evaluation of datasets, we created a portable evaluation tool. When the feature is installed, DDPEval.exe is installed to the \Windows\System32\ directory.

• This tool can be copied and run on Windows 7 or later systems to determine the expected savings that you would get if deduplication was enabled on a particular volume. DDPEval.exe supports local drives and also mapped or unmapped remote shares.

• You can run it against a remote share on your Windows NAS, or an EMC / NetApp NAS and compare the savings.

Page 28: Windows Server 2012: Data deduplication

Video Demo

27:13/31:00

Page 29: Windows Server 2012: Data deduplication

What Is Data Deduplication?
• Data deduplication identifies and removes duplication within data without compromising its integrity or fidelity, with the ultimate goal of storing more data in less space.

• When you enable data deduplication on a volume, a low-priority background task runs that:

1. Segments data into small, variable-sized chunks

2. Identifies duplicate chunks

3. Replaces redundant copies with a reference

4. Compresses chunks

• You should consider using deduplication for the following areas:

• File Shares
• Software Deployment Shares
• VHD Libraries

Page 30: Windows Server 2012: Data deduplication

To recap: Data Deduplication in Windows Server 2012 R2 is another of the OS's features that contributes to its must-have status, because it is one of several strategies that can be used to cut data storage demands. The savings in storage costs that data deduplication delivers will probably, by itself, justify the cost of upgrading Windows-based file servers. In this session we discussed the concept of deduplication and how effective it can be at optimizing storage, reducing the amount of disk space consumed by 50% to 90% when applied to the right data.

Page 31: Windows Server 2012: Data deduplication

Get Trained!

Courses:

• 20410: Installing and Configuring Windows Server 2012

• 20411: Administering Windows Server 2012

• 20412: Configuring Advanced Windows Server 2012 Services

• 20417: Upgrading Your Skills to MCSA Windows Server 2012

• 10971: Storage and High Availability with Windows Server