Session #2442: Flash-Optimized Apache Spark: Expanding In ... · R Scala SQL Python Java Spark SQL...

#ibmedge © 2016 IBM Corporation

Session #2442: Flash-Optimized Apache Spark: Expanding In-Memory Analytics into Flash

Bernie Wu, Levyx

Randy Swanberg, IBM

9/21/16


Please Note:

•  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.

•  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

•  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

•  The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

•  Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.


Agenda

•  Apache Spark

•  OpenPOWER
–  Spark on OpenPOWER
–  CAPI Flash Technology

•  Levyx
–  Technology overview
–  Capabilities and use-cases
–  Levyx on OpenPOWER with CAPI Flash

•  Summary / Questions / Follow-up


Apache Spark

Fast and general engine for large-scale data processing

[Figure: Spark stack. Spark Core API with R, Scala, SQL, Python and Java language bindings; libraries on top: Spark SQL, Streaming, MLlib, GraphX]


•  Unified Analytics Platform
–  Combine streaming, graph, machine learning and SQL analytics on a single platform
–  Simplified, multi-language programming model
–  Interactive and Batch

•  In-Memory Design
–  Pipelines multiple iterations on single copy of data in memory
–  Superior Performance
–  Natural Successor to MapReduce



OpenPOWER

OpenPOWER, a Catalyst for Open Innovation

Market Shifts
•  Moore’s law no longer satisfies performance gain
•  Numerous IT consumption models
•  Growing workload demands
•  Mature Open software ecosystem

OpenPOWER Strategy
•  Accelerated innovation through collaboration of partners
•  Amplified capabilities driving industry performance leadership
•  Vibrant ecosystem through open development

Industry Adoption, Open choice
•  Cloud Computing
•  Hyperscale & Large scale Datacenters
•  High Performance Computing & Analytics
•  Domestic IT Agendas

Performance of Spark on POWER: 7-Node S812LC 10-core vs. 7-Node E5-2690 v3 12-core

[Figure: three bar charts, each covering Machine Learning, SQL and Graph workloads]
•  1.7X System-to-System Advantage
•  2X Core-to-Core Advantage
•  1.5X Price Performance Advantage

OpenPOWER Technology: Coherent Accelerator Processor Interface (CAPI)

[Figure: POWER8 Processor with CAPP, connected over PCIe to an FPGA that holds the IBM Supplied POWER Service Layer and accelerator Functions 0 to n]

Typical I/O Model Flow: DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion

Flow with a Coherent Model: Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion

✓  Virtual addressing & data caching
✓  Easier programming model
✓  Enables applications not possible on I/O

CAPI Attached Flash Optimization
§  Attach IBM FlashSystem to POWER8 via CAPI
§  Read/write commands issued via APIs from applications to eliminate 97% of code path length
§  Saves 10+ cores per 1M IOPS

[Figure: conventional I/O path vs. CAPI flash path]
Conventional path: Application → Read/Write Syscall → FileSystem → LVM → Disk and Adapter DD, through strategy ( ) and iodone ( ). Submission: pin buffers, translate, map DMA, start I/O. Completion: interrupt, unmap, unpin, iodone scheduling.
CAPI path: Application → User Library (Posix Async I/O Style API: aio_read(), aio_write()) → Shared Memory Work Queue

20K instructions reduced to <2000
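As a rough illustration of the POSIX-async-I/O-style flow on this slide, the following Python sketch has an application submit read requests to an in-process work queue and collect completions later, instead of blocking in one system call per request. It is only a model: the names `FlashQueue`, `submit_read`, `wait` and `read_many` are hypothetical, a background thread stands in for the accelerator, and a `bytes` object stands in for the flash device. It is not the CAPI Flash library API.

```python
import queue
import threading


class FlashQueue:
    """Toy model of a shared work queue serviced by an accelerator thread."""

    def __init__(self, device: bytes):
        self._device = device                  # stand-in for the flash device
        self._work = queue.Queue()             # submission queue
        self._done = {}                        # request id -> result bytes
        self._cond = threading.Condition()
        self._next_id = 0
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def submit_read(self, offset: int, length: int) -> int:
        """Enqueue a read and return immediately with a request id."""
        with self._cond:
            req_id = self._next_id
            self._next_id += 1
        self._work.put((req_id, offset, length))
        return req_id

    def wait(self, req_id: int) -> bytes:
        """Block until the given request has completed, then return its data."""
        with self._cond:
            self._cond.wait_for(lambda: req_id in self._done)
            return self._done.pop(req_id)

    def close(self) -> None:
        self._work.put(None)                   # sentinel stops the worker
        self._worker.join()

    def _serve(self) -> None:
        while True:
            item = self._work.get()
            if item is None:
                return
            req_id, offset, length = item
            data = self._device[offset:offset + length]
            with self._cond:
                self._done[req_id] = data
                self._cond.notify_all()


def read_many(device: bytes, requests):
    """Submit all requests first, then gather results in submission order."""
    q = FlashQueue(device)
    ids = [q.submit_read(off, ln) for off, ln in requests]
    results = [q.wait(i) for i in ids]
    q.close()
    return results
```

For example, `read_many(b"0123456789", [(0, 3), (5, 2)])` returns `[b"012", b"56"]`. The point of the design is that submission and completion are decoupled, so one application thread can keep many requests in flight.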

CAPI Flash Configurations

External Flash Configuration: Power S822L / S812L with Flash System 900
•  Up to 56TB of extended memory with one POWER8 server + CAPI attach FLASH

Integrated Flash Configuration (NEW): Power S822L / S812L / S822 LC
•  Up to 8TB of super-fast storage tier on one POWER8 server

[Figure: IOPS per Hardware Thread for Conventional, CAPI - I and CAPI - E; y-axis 0 to 450,000]
[Figure: Latency (microseconds) for Conventional, CAPI - I and CAPI - E; y-axis 0 to 200]
[Figure: Average Relative IOPs per CPU Thread for Fibre Channel, NVMe, CAPI Fibre Channel and CAPI NVMe; data labels 0.6X, 1X, 2.6X, 3.7X; y-axis 0% to 400%]

CAPI Flash Solution Use Cases

Memory Expansion
•  Application constrained by single-system memory capacity. Typical growth is through additional compute nodes.
•  CAPI Flash APIs offer highly-efficient flash access, increased total capacity at better $ / throughput.

Data Cache
•  Application uses in-memory caches for data storage, and is typically constrained by ratios of memory to underlying storage.
•  CAPI Flash APIs offer access to much larger ephemeral or persistent data in Flash, freeing up RAM.

Fast Storage
•  Application is constrained by IO overhead and throughput of existing storage infrastructure.
•  CAPI Flash APIs offer extremely high IO per CPU thread with low latency.

Levyx Overview
•  Mission:
–  Provide software that cost-effectively maximizes performance and minimizes latency for Big Data and other database server platforms
•  Founded in 2013, headquartered in Irvine, CA
•  Reza Sadri, CEO
–  Entrepreneur, PhD CS, Database specialization
•  Tony Givargis, CTO
–  UC Irvine Professor, PhD CS, Embedded Systems
•  Series “A” led by OCA Ventures
•  Patent-Pending Indexing technology
•  Cloud, OEM, SI/SP partnerships

Levyx Key-Value Storage Layer Bridges Gap

Hardware-agnostic storage layer designed to optimize data-focused SW and latest HW

[Figure: Levyx layer sitting between Software and Hardware (NVMs, Flash SSDs, Multi-core Processors)]

Levyx Products
•  Helium-DB Storage Engine
–  World’s Fastest Key Value store for Big Data Analytics and Operational Databases
–  In-Memory Speeds or greater with Persistence
•  LevyxSpark: Apache Spark+Helium
–  Storage Optimized and Accelerated Open Source Spark for real-time/hi IO performance applications
–  Full Spark SQL query pushdown (join, group-by, filter, etc) and acceleration to machine code speeds
–  Node consolidation with combined memory-flash storage layer

Example Use Cases
•  Financial Services
–  Electronic Trading Workflow: Streaming analytics, compliance, risk-management, algorithmic/ML based trading
•  Cybersecurity
–  Logging and event management, correlation
–  User behavior analytics/ML
•  IOT
–  Edge and Datacenter real time and batch analytics/operational databases
•  E-commerce/Adtech
–  Real-time Bidding Analytics

© Copyright 2013-2016 Levyx Inc.

Helium: World’s Fastest Key Value Store
Pluggable DB Storage Engine

Helium: Database Engine for Big & Fast Data
•  Patent-pending Multi-core Optimization Tool
•  Ultra-low latency indexing engine for billions of objects
•  NVM/Flash Replaces DRAM
•  Enables Very Dense Nodes
•  World’s Fastest Key Value Store

Helium: Flash/Multi-Core Optimized
•  Leverage Multi-core/Multi-Channel Parallelism to boost performance/reduce latency.
•  Reduce layers of abstraction/overhead

Conventional stack: Application-analytics platform → Database → Database Storage Engine → OS File System → OS Volume Manager → OS Device Driver → Disk Controller → Disk Drive

Helium stack: Application-Analytics platform → Database → Levyx Helium → OS device driver → Flash controller F/W → Flash Chips

Helium Key Attributes
•  Compact RAM-based Index: 10’s Billions of Keys, PTB’s Data
•  Flash Optimized: tight 99th and 99.99th percentile latency
•  Lock-free architecture
•  Structured:
–  Full SQL Command Set: Sort, Join, Group-by, Filter, Aggregate, Projections, etc
•  Unstructured:
–  Get, Put, Delete, Point/Range Query, Point Update
•  ACID Compliance/Transactions Groups
•  In-line Dictionary Compression
•  Snapshot
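To make the combination of a compact RAM-based index and flash-resident data more concrete, here is a minimal Python sketch of a key-value store that keeps only key-to-location entries in memory and appends values to a log file. It illustrates the general idea behind Get, Put, Delete, range query and point update over such an index. It is an assumption about how a design of this kind can work, not Helium's implementation or API; `TinyLogKV` and its methods are hypothetical names, and a temporary file stands in for flash.

```python
import bisect
import os
import tempfile


class TinyLogKV:
    """Values live in an append-only file; RAM holds key -> (offset, length)."""

    def __init__(self, path: str):
        self._file = open(path, "w+b")
        self._index = {}        # key -> (offset, length) of the latest value
        self._keys = []         # sorted keys, used for range queries

    def put(self, key: bytes, value: bytes) -> None:
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)
        if key not in self._index:
            bisect.insort(self._keys, key)
        self._index[key] = (offset, len(value))   # update points at new copy

    def get(self, key: bytes):
        loc = self._index.get(key)
        if loc is None:
            return None
        offset, length = loc
        self._file.seek(offset)
        return self._file.read(length)

    def delete(self, key: bytes) -> None:
        if self._index.pop(key, None) is not None:
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, lo: bytes, hi: bytes):
        """Return (key, value) pairs with lo <= key < hi, in key order."""
        start = bisect.bisect_left(self._keys, lo)
        stop = bisect.bisect_left(self._keys, hi)
        return [(k, self.get(k)) for k in self._keys[start:stop]]

    def close(self) -> None:
        self._file.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        kv = TinyLogKV(os.path.join(tmp, "data.log"))
        kv.put(b"k3", b"three")
        kv.put(b"k1", b"one")
        kv.put(b"k2", b"two")
        kv.put(b"k1", b"uno")          # point update appends a new version
        kv.delete(b"k3")
        result = (kv.get(b"k1"), kv.get(b"k3"), kv.range(b"k1", b"k3"))
        kv.close()
        return result
```

In this sketch the memory footprint grows with the number of keys rather than with the size of the values, which is the property that lets the bulk of the data sit on flash. Space reclamation for overwritten and deleted values is deliberately left out.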

Helium: Programming Language/Platform/Wrapper Support
§  Portable Implementation with Architecture and OS-specific Dependencies Fully Isolated
–  Available on Unix/Linux/Windows/Mac platforms
§  Distributed in the Form of a Library
–  Fully documented key/value API
§  Bundled as a Server with Client API Support in Popular Languages
–  C, C++, Java, Node.js, REST, etc.
§  Wrappers for Popular KVS
–  RocksDB, Memcached
§  Platform for Integration with Other Technologies
–  Support for structured data (to improve Spark’s shuffle performance)
–  Columnar database integration with SparkSQL

Helium Accelerated Memcached

•  Faster: On a standard 90:10 (get:set) workload, Helium-Memcached delivers at least 10x better TPS, on cloud and on-prem.

•  Cheaper: A single Helium-Memcached node scales with cores/SSD, whereas stock memcached needs multiple nodes and large amounts of RAM.

•  Simpler: Plug and Play with existing Memcached applications. Rapid Automatic recovery from persisted SSD simplifies

Helium vs RocksDB vs Aerospike http://www.levyx.com/content/helium-demo



LevyxSpark = Helium + Apache Spark

Faster, Cheaper, Simpler….

Spark Integration Facilitates Immediate End User Deployment

[Figure: Helium Data Engine → LevyxSpark (Helium integrated w/ Spark) (99% open source) → End-customers]


Apache Spark-Levyx Integration
•  Spark connector between Helium and Spark
•  Spark RDD/DataFrame maps to Helium dataset
•  Pushdown of SQL queries from Spark Catalyst Optimizer to Helium layer
–  JIT “C” level compilation/execution
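The benefit of pushing a filter down from the query engine into an indexed storage layer can be illustrated with a small Python model. Everything here is hypothetical (`IndexedStore`, `scan`, `scan_range`, `compare`); it does not use Spark or Helium, and it only counts how many rows cross the boundary between storage and engine under each strategy.

```python
import bisect


class IndexedStore:
    """Rows sorted by integer key, so range predicates can use binary search."""

    def __init__(self, rows):
        self._rows = sorted(rows)                 # rows are (key, payload)
        self._keys = [k for k, _ in self._rows]
        self.rows_shipped = 0                     # rows handed to the engine

    def scan(self):
        """No pushdown: ship every row and let the engine filter."""
        self.rows_shipped += len(self._rows)
        return list(self._rows)

    def scan_range(self, lo, hi):
        """Pushdown: evaluate lo <= key < hi inside the store via the index."""
        start = bisect.bisect_left(self._keys, lo)
        stop = bisect.bisect_left(self._keys, hi)
        self.rows_shipped += stop - start
        return self._rows[start:stop]


def compare(n=1000, lo=100, hi=110):
    rows = [(i, f"row-{i}") for i in range(n)]

    plain = IndexedStore(rows)
    filtered = [r for r in plain.scan() if lo <= r[0] < hi]

    pushed = IndexedStore(rows)
    pushed_rows = pushed.scan_range(lo, hi)

    assert filtered == pushed_rows                # same answer either way
    return plain.rows_shipped, pushed.rows_shipped
```

With the defaults, `compare()` returns `(1000, 10)`: both strategies produce the same ten rows, but without pushdown the engine has to receive and filter all 1000.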


LevyxSpark Advantages

•  Faster
  •  Combined solution provides superior performance vs Native Apache Spark, especially in situations involving:
    –  Large datasets dealing with sorting, joins, group-by (heavy shuffling)
    –  Ideal for workloads involving small Random inserts, point queries
    –  Leveraging Index lookups vs filtering

•  Cheaper
  •  Up to 90% reduction in Nodes/lower cost Nodes for equivalent or greater “in-memory” capacity

•  Simpler
  •  Reduced network complexity
  •  No need to tier from Memory to Flash

LevyxSpark Reduces Nodes and Cost

Cyber Security Real Time Monitoring Use Case
•  Spark without Levyx (500 nodes), r3.8xlarge: $33,600 /day
•  Spark with Levyx (50 nodes), c3.8xlarge: $1,920 /day
•  15X Lower Cost!

“Often times technology vendors advertise scale-out as a way to reach high performance goals. It is a proven approach, but it is often used to mask single node inefficiencies. Without a solution where CPU, memory, network, and local storage are properly balanced, this is simply what we call “throwing hardware at the problem”. Hardware that, virtual or not, customers pay for.”
- Google Blog, 2015, in reference to Levyx and its groundbreaking technology
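The node and cost figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers printed on the slide (500 nodes at $33,600/day versus 50 nodes at $1,920/day); the function and key names are illustrative. Note that the daily totals as printed give a cost ratio of 17.5, somewhat higher than the slide's “15X” headline.

```python
def cost_summary(nodes_a, daily_a, nodes_b, daily_b):
    """Compare two cluster configurations by node count and daily cost."""
    return {
        "node_reduction": nodes_a / nodes_b,
        "cost_ratio": daily_a / daily_b,
        "hourly_per_node_a": daily_a / nodes_a / 24,
        "hourly_per_node_b": daily_b / nodes_b / 24,
    }


# Figures from the slide: Spark without Levyx vs. Spark with Levyx.
summary = cost_summary(500, 33_600, 50, 1_920)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The per-node figures work out to about $2.80 and $1.60 per node-hour, so the saving comes from both a 10x reduction in node count and a cheaper node type.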



OpenPower + LevyxSpark Even Faster, Cheaper, and Simpler

LevyxSpark and OpenPower: Ideal Dense, “Scale-in” Platform

•  Power 8
  •  Hi core count/Relatively low cost
  •  CAPI Hi-performance interface
    –  2 week porting effort
•  Goal: Native Spark (FC) vs LevyxSpark (CAPI)

•  Test Unit
  •  Power System S822L 2-socket POWER8 Server
  •  20 POWER8 cores, 160 logical CPUs (SMT8, 8 threads per core)
  •  256GB RAM
  •  Apache Spark 1.6
  •  FC and CAPI HBAs connected to IBM FlashSystem 840
  •  Ubuntu 16.04.01

Test Benchmarks

•  Sort: Integer, String, GenSort
  –  Read an input table from data ingestion drive
  –  Sort table based on integer column
  –  Write sorted table to flash subsystem

•  Iterative Join
  •  Read 16 tables from data ingestion drive
  •  Save final join result to flash subsystem
  •  For 10 iterations
    –  Change one of the inputs of the join graph
    –  Calculate new value of final join result
    –  Update the result on flash subsystem

•  Incremental Update to Sorted Table
  •  Read an input table from data ingestion drive as a baseline data set
  •  For 10 iterations
    –  Read another small table from data ingestion drive
    –  Add all elements of small table to baseline data set
    –  Sort baseline data based on first integer column
    –  Write sorted table to flash subsystem
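The incremental-update benchmark described above can be sketched in plain Python to show the shape of the workload. This is a toy model under stated assumptions, not the benchmark code used in the tests: tables are lists of `(int, str)` tuples, all function names are hypothetical, and the second variant (merging a sorted batch into already-sorted data) is just one way a store that keeps data ordered could avoid re-sorting the whole baseline on each iteration.

```python
import heapq
import random


def resort_each_iteration(baseline, batches):
    """Benchmark as described: append the small table, then re-sort everything."""
    data = list(baseline)
    for batch in batches:
        data.extend(batch)
        data.sort(key=lambda row: row[0])      # sort on first integer column
    return data


def merge_each_iteration(baseline, batches):
    """Alternative: keep data sorted and merge in each sorted small batch."""
    data = sorted(baseline, key=lambda row: row[0])
    for batch in batches:
        small = sorted(batch, key=lambda row: row[0])
        data = list(heapq.merge(data, small, key=lambda row: row[0]))
    return data


def make_tables(base_rows=1000, batch_rows=20, iterations=10, seed=0):
    """Build a random baseline table and a list of small update tables."""
    rng = random.Random(seed)

    def table(n):
        return [(rng.randrange(10**6), f"v{rng.randrange(10**6)}") for _ in range(n)]

    return table(base_rows), [table(batch_rows) for _ in range(iterations)]
```

Both functions yield the same key order. The first does a full sort of the growing data set on every iteration, while the second only sorts the small batch and performs a linear merge.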


Specification – Test Bench Summary

Benchmark            Data Set Size (GB)   Comment
Sort                 64, 128, 256, 512    Highlight advantage of LevyxSpark in analytical use cases
Iterative Join       128, 256, 512        Highlight advantage of LevyxSpark in data persisting
Incremental Update   128, 256, 512        Highlight advantage of LevyxSpark in transactional use cases

PERFORMANCE COMPARISON


Integer Sort Test Bench

[Figure: Execution Time, LevyxSpark vs. Spark, for input sizes 64, 128, 256 and 512; y-axis 0 to 6000]
[Figure: Average CPU(s) User %, LevyxSpark vs. Spark, for input sizes 64, 128, 256 and 512; y-axis 0% to 50%]

String Sort Test Bench

[Figure: Execution Time, LevyxSpark vs. Spark, by Input Size (64GB, 128GB, 256GB, 512GB); y-axis 0 to 8000]
[Figure: second chart (title not recoverable), LevyxSpark vs. Spark, by Input Size (64GB, 128GB, 256GB, 512GB); y-axis 0% to 60%]

GenSort Test Bench

[Figure: Execution Time, LevyxSpark vs. Spark, by Input Size (64GB, 128GB, 256GB); y-axis 0 to 2500]
[Figure: second chart (title and x-axis not recoverable), LevyxSpark vs. Spark; y-axis 0% to 60%]

Iterative Graph Test Bench

[Figure: Execution Time, LevyxSpark vs. Spark; y-axis 0 to 5000; x-axis not recoverable]
[Figure: second chart (title and x-axis not recoverable), LevyxSpark vs. Spark; y-axis 0% to 25%]
The annotation “Stock Spark Failed to Run” appears twice on this slide.

Incremental Update

[Figure: Execution Time, LevyxSpark vs. Spark; y-axis 0 to 7000; x-axis not recoverable]
[Figure: second chart (title and x-axis not recoverable), LevyxSpark vs. Spark; y-axis 0% to 60%]

Summary
•  LevyxSpark plus POWER8/CAPI integration is an ideal combination for Apache Spark IO-intensive workloads
•  Balanced scale-in platform: fewer nodes needed for a given workload
•  Cores freed up by CAPI integration allow more analytical/computational workloads
•  Larger datasets per node, with reduced shuffling/spills/crashes

[email protected] [email protected]

Questions?


Backup

Sort Benchmarks: CPU Idle Time Comparison

[Figure: CPU Idle Time % for LevyxSpark-CAPI, LevyxSpark-HBA and Spark across 64G/128G/256G/512G Int Sort, 64G/128G/256G/512G Str Sort and 64G/128G/256G GenSort; y-axis 0% to 80%]

Iterative Join and Incremental Update Benchmarks: CPU Idle Time Comparison

[Figure: CPU Idle Time (%) for LevyxSpark-CAPI, LevyxSpark-HBA and Spark across 128G/176G/256G Iterative Join and 64G/128G/256G Incremental Update; y-axis 0% to 90%]

Notices and Disclaimers


Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law

Notices and Disclaimers Con’t.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.


Thank You
