Why Scale-Out Big Data Apps Need A New Scale-Out Storage

Modern storage for modern business

Rob Whiteley, VP, Marketing, Hedvig

April 9, 2015


Page 2: Agenda

• Big data pressures on storage infrastructure

• The rise of elastic software-defined storage (SDS)

• 6 SDS capabilities for big data

• 3 case studies of SDS for big data

Copyright 2015 Hedvig Inc. Confidential.

Page 3: Big data pressures on storage infrastructure

Page 4: Big data requires flexible infrastructure

[Figure labels, in slide order: Big data, Time-to-market, Flexible infrastructure; Business executives, Developers, IT infrastructure & DevOps]

Page 5: According to Forrester . . .

• 10X faster growth of enterprise data than storage budgets

• 58% of orgs take days, weeks, or months to provision storage

• 14% of orgs have “cloud-like” provisioning capabilities

Source: Forrester Technology Adoption Profile: Meet Evolving Business Demands With Software-Defined Storage, March 2015. Visit hedviginc.com for full research report.

Page 6: Three truths and a lie about storage & big data

• Software-defined storage is the right direction.

• Hyperconverged provides the best economics.

• Big data apps are repeating the sins of the 90s.

• Hyperscale helps virtualize Hadoop and NoSQL.

Page 7: The rise of elastic software-defined storage (SDS)

Page 8: A big data inflection point in storage

[Figure: "The big data software storage inflection point". Axis labels: Price/performance, Storage capabilities. Stages: Traditional, Scale-up, Scale-out, Elastic.]

• Before: Hardware-defined. After: Software-defined.

• Before: Scale-out. After: Elastic.

• Before: High-availability + RAID. After: Distributed + Replication.

• Before: Hyperconverged. After: Hyperconverged + Hyperscale.

Page 9: Three legs to the big data requirements stool

• Storage flavors

• Storage features

• Deployment flexibility

[Figure labels: Hyperconverged, Software-defined storage, Monolithic arrays, Virtual SANs]

Page 10

[Figure: two deployment models. Hyperscale: a Hadoop/NoSQL cluster of Linux, hypervisor, and Windows hosts alongside a separate storage cluster. Hyperconverged: Linux, hypervisor, and Windows hosts forming a combined Hadoop/NoSQL + storage cluster. Legend: storage client, storage node.]

Page 11: How SDS provides elastic storage for big data

[Figure: big data hosts attached over iSCSI, NFS, and object interfaces to a storage cluster spanning DC1, DC2, and Cloud3. Legend: each node is an x86 or ARM server.]

1 Admin provisions virtual volumes and scripts or applies storage policies

2 Virtual volume presents block, file, & object storage to big data hosts

3 Storage client captures guest I/O and communicates to underlying cluster

4 Cluster distributes and replicates data, applies compression & dedupe

5 Cluster autotiers & balances to optimize data locality & availability

Page 12: 6 SDS capabilities for big data

Page 13: 6 big data friendly SDS capabilities

1. I/O sequentialization

2. Tunable replication

3. DR replication

4. Disk failures and rebuilds

5. Data efficiency methods

6. Flash caching & flash pinning

Page 14: 1. Random I/O to sequential writes

1 Application writes data in random blocks, and gets immediate ack from cluster.

2 Storage cluster sequentializes incoming blocks (in RAM+SSD) into larger chunks.

3 Storage cluster writes larger sequentialized data chunks to underlying disks in an auto-balanced and auto-distributed manner according to policy.

[Figure labels: Big data node, Storage client, Storage node]
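The three steps above can be sketched as a small write buffer. This is a minimal illustration, not Hedvig's implementation: the `WriteCoalescer` class, the `chunk_blocks` threshold, and the choice to order buffered blocks by offset are all assumptions made for the example (a log-structured design might instead append blocks in arrival order).

```python
from dataclasses import dataclass, field


@dataclass
class WriteCoalescer:
    """Buffers random block writes and flushes them as larger ordered chunks.

    Hypothetical sketch: names and the flush threshold are illustrative only.
    """

    chunk_blocks: int = 4                         # assumed flush threshold, in blocks
    pending: dict = field(default_factory=dict)   # offset -> payload (stands in for RAM+SSD)
    flushed: list = field(default_factory=list)   # chunks handed to "disk", in flush order

    def write(self, offset: int, payload: bytes) -> str:
        # Step 1: accept the block in whatever order it arrives and ack at once.
        self.pending[offset] = payload
        if len(self.pending) >= self.chunk_blocks:
            self.flush()
        return "ack"

    def flush(self) -> None:
        # Steps 2-3: order the buffered blocks by offset and emit one larger chunk.
        if not self.pending:
            return
        chunk = [(off, self.pending[off]) for off in sorted(self.pending)]
        self.flushed.append(chunk)
        self.pending.clear()


coalescer = WriteCoalescer(chunk_blocks=4)
for off in (7, 2, 9, 4):                          # blocks arrive in random order
    coalescer.write(off, b"x")
offsets = [off for off, _ in coalescer.flushed[0]]
```

After the fourth write the buffer flushes once, and `offsets` holds the block offsets in ascending order, so the disk sees a single larger write instead of four scattered ones.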

Page 15: Example: Single write operation

Example Policy: 3 COPIES; AGNOSTIC

1 Application sends write to any storage cluster node (round-robin).

2 Cluster node writes first aggregated blocks locally. Second copy written to first responding cluster node.

3 Ack sent back to big data node after majority quorum of acks (2 acks in case of 3 copies). CHECKSUMMED!

4 Third copy is written semi-synchronously. Could also be synchronous if all servers are equidistant.

[Figure labels: Big data node, Hedvig Controller software, Hedvig Cluster software, SSD/Flash, SAS/SATA]
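The quorum rule in step 3 is simple arithmetic and can be checked with a few lines. The function names below are hypothetical; only the rule itself (a majority of the copies must acknowledge, which is 2 of 3) comes from the slide.

```python
def majority_quorum(copies: int) -> int:
    """Smallest number of acks that forms a strict majority of the copies."""
    return copies // 2 + 1


def write_acknowledged(acks_received: int, copies: int) -> bool:
    """True once enough replicas have acked to acknowledge the application."""
    return acks_received >= majority_quorum(copies)


quorum_for_three = majority_quorum(3)   # the "3 COPIES" policy from the slide
```

With three copies the application is acknowledged after the second replica responds, which is why the third copy can be written semi-synchronously without delaying the write.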

Page 16: 2. Granular replication of data

Chunks are distributed across all servers and containers in the storage cluster.

[Figure labels: Big data node, Hedvig Controller software, Hedvig Cluster software, Granular data chunks, Storage containers, Disk Platter]
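One common way to spread chunks across every server is hash-based placement. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in; the slide does not say which placement scheme the product uses, and the node names, chunk IDs, and `place_chunk` function are illustrative assumptions.

```python
import hashlib


def place_chunk(chunk_id: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct nodes for a chunk using rendezvous hashing.

    Assumption: this is a generic technique, not the vendor's algorithm.
    """
    def score(node: str) -> int:
        # Each (node, chunk) pair gets a deterministic pseudo-random weight.
        digest = hashlib.sha256(f"{node}:{chunk_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring nodes hold the replicas for this chunk.
    return sorted(nodes, key=score, reverse=True)[:copies]


nodes = [f"node-{i}" for i in range(6)]            # hypothetical six-server cluster
placement = {f"chunk-{i}": place_chunk(f"chunk-{i}", nodes) for i in range(100)}
```

Because the score depends on both the node and the chunk, different chunks land on different node subsets, and every replica of a chunk sits on a distinct server.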

Page 17: 3. DR Policy: 3xDC-aware with 3 copies

DR Policy: Datacenter-aware (One copy per DC)

Data Copies: 3

Sync-Acknowledgements: 2

[Figure labels: Data Center A, Data Center B, Data Center C, Active, Active, Hedvig Controller software, Hedvig Cluster software]
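The policy on this slide can be modeled as a small record plus a placement rule. This is a sketch under stated assumptions: `DrPolicy`, `place_one_per_dc`, and `survives_dc_loss` are hypothetical names, and picking the first listed node in each data center is a deliberate simplification. Only the numbers (3 copies, 2 sync acks, one copy per DC) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPolicy:
    copies: int = 3       # "Data Copies: 3"
    sync_acks: int = 2    # "Sync-Acknowledgements: 2"


def place_one_per_dc(nodes_by_dc: dict[str, list[str]], policy: DrPolicy) -> dict[str, str]:
    """Choose one node in each of `policy.copies` data centers."""
    if len(nodes_by_dc) < policy.copies:
        raise ValueError("need at least one data center per copy")
    chosen = {}
    for dc in sorted(nodes_by_dc)[: policy.copies]:
        if not nodes_by_dc[dc]:
            raise ValueError(f"data center {dc} has no nodes")
        chosen[dc] = nodes_by_dc[dc][0]   # simplification: first node listed
    return chosen


def survives_dc_loss(policy: DrPolicy) -> bool:
    """With one copy per DC, losing a DC leaves copies - 1 replicas to ack writes."""
    return policy.copies - 1 >= policy.sync_acks


policy = DrPolicy()
replicas = place_one_per_dc({"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}, policy)
```

With 3 copies and 2 synchronous acknowledgements, writes can still reach their ack count after any single data center is lost, which is the property the slide's policy is built around.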

Page 18: 4. Disk failures and rebuilds

• Disks managed in protection groups.

• Disk rebuilds initiated automatically upon disk failure across entire cluster.

• No spare disks needed.

• Quick wide-stripe rebuilds allow for largest disks.

• Average 4TB disk rebuild time is under 20 minutes.

• Easily support 6TB, 8TB and 10TB drives.
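The rebuild claim above implies a minimum aggregate throughput, which is worth working out. The calculation below assumes the full 4 TB must be rebuilt and uses decimal terabytes; the 100 MB/s per-disk rebuild rate is a hypothetical figure chosen only to show why the work has to be striped across many disks.

```python
import math

TB = 10**12   # decimal terabyte, in bytes
MB = 10**6


def required_throughput(capacity_bytes: int, seconds: int) -> float:
    """Aggregate bytes per second needed to rebuild the capacity in the given time."""
    return capacity_bytes / seconds


rate = required_throughput(4 * TB, 20 * 60)       # 4 TB in 20 minutes
per_disk_rate = 100 * MB                          # assumed rebuild rate per participating disk
disks_needed = math.ceil(rate / per_disk_rate)    # disks that must share the rebuild
```

Here `rate` comes to roughly 3.3 GB/s, more than any single spinning disk can absorb, so under the assumed per-disk rate about 34 disks would have to participate. That is consistent with the slide's point that rebuilds are wide-striped across the cluster rather than directed at one spare.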

Page 19: 5. Thin provisioning, deduplication and compression

• Thin provisioning for every virtual volume

• Inline compression and deduplication

• Global, system-wide deduplication – all attached storage nodes participate

• 60-75% data reduction – dedupe rates vary based on data type

• Dedupe cache can reside on Controller SSD/flash in application server

• Eliminate all duplicate I/O from network, dramatically lower latency and increase IOPS!

• Clone non-deduped volume with dedupe

[Figure labels: Big data node + storage client, with client-side SSD/flash dedupe read cache and dedupe map; Storage node, with cluster SSD/flash read+write cache]
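Inline deduplication is usually built on content fingerprints: a block is stored once, and later writes of identical data only add a reference. The sketch below is a generic illustration, not the product's design; `DedupeStore`, the SHA-256 fingerprint, and the 4 KiB blocks are assumptions.

```python
import hashlib


class DedupeStore:
    """Stores each unique block once, keyed by its SHA-256 fingerprint (illustrative only)."""

    def __init__(self) -> None:
        self.blocks = {}        # fingerprint -> block payload (physical storage)
        self.volume_map = []    # logical block order, as fingerprints (the dedupe map)

    def write(self, block: bytes) -> bool:
        """Record a logical write; return True only if new physical data was stored."""
        fingerprint = hashlib.sha256(block).hexdigest()
        self.volume_map.append(fingerprint)
        if fingerprint in self.blocks:
            return False        # duplicate: nothing new needs to be stored or sent
        self.blocks[fingerprint] = block
        return True

    def reduction(self) -> float:
        """Fraction of logical bytes saved by deduplication."""
        logical = sum(len(self.blocks[fp]) for fp in self.volume_map)
        physical = sum(len(b) for b in self.blocks.values())
        return 1 - physical / logical if logical else 0.0


store = DedupeStore()
results = [store.write(b) for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096)]
```

In this toy run four logical blocks collapse to two physical ones, a 50% reduction. Real rates depend on the data, as the slide notes. Checking the fingerprint on the client side is also what lets duplicate writes be dropped before they cross the network.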

Page 20: 6. 3 ways Hedvig uses SSD and PCIe flash

• Client-side read and dedupe cache on big data node

• Primary storage as dedicated volume on storage nodes + flash pinning

• Read/write cache on storage nodes

[Figure labels: Big data node + storage client, Storage node]
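A flash read cache with pinning can be illustrated with a least-recently-used cache in which some keys are exempt from eviction. This is a loose model: the slide does not define how pinning works, so treating it as "never evict these entries" is an assumption, as are the `PinnedLruCache` name and its methods.

```python
from collections import OrderedDict


class PinnedLruCache:
    """LRU cache in which pinned keys are never evicted (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, oldest first
        self.pinned = set()            # keys that must stay on "flash"

    def pin(self, key: str) -> None:
        self.pinned.add(key)

    def get(self, key: str):
        if key not in self.entries:
            return None                # cache miss: caller would read from disk
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key: str, value: bytes) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            # Evict the least recently used entry that is not pinned.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break                  # everything is pinned; tolerate the overflow
            del self.entries[victim]


cache = PinnedLruCache(capacity=2)
cache.pin("hot")
cache.put("hot", b"1")
cache.put("a", b"2")
cache.put("b", b"3")                   # over capacity: "a" is evicted, "hot" stays
```

The pinned entry survives even though it is the oldest, while ordinary entries cycle through the remaining capacity.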

Page 21: 3 case studies of SDS for big data

Page 22: Three case studies

Fortune 100 bank

Deploying Cassandra and MongoDB for developers with infrastructure self-provisioning for a DevOps model. Multiple NoSQL deployments leading to islands of (elastic) storage and inability to self-provision or plug into bank’s orchestration tools. Building elastic SDS cluster on commodity infrastructure to lower cost per bit and drive self-provisioning through RESTful APIs.

4th largest US law firm

Needs quick, reliable indexing of 100M active client docs in HP Autonomy. Needed a scale-out, flash-friendly solution to replace local SSDs, which are required to achieve sub-one-second index queries. Getting 6x performance with SDS versus a traditional hybrid array that included a flash tier; now has incremental commodity scalability.

Fortune 50 telecom

Seeks centralized, shared storage to virtualize 3 Hadoop distributions: Hortonworks, Cloudera, and MapR. Multiple Hadoop deployments leading to islands of (also elastic) storage and blocking IT’s virtualization-first policy. Virtualizing all three Hadoop distributions and deploying SDS as the “data lake” for scale-out storage and global dedupe across data sets.

Page 23: Thank you!