
Khadija Souissi

07. – 08. November 2016 @ IBM z Systems Mainframe Event 2016

On z Systems

© 2016 IBM Corporation 2

Acknowledgements

● Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.


Agenda

● What is Spark

● Spark Details

● The ecosystem for Spark on z/OS

● Use Cases

● References


Big Data Maturity


(Chart: value increases from Operations / Data Warehousing, through Line of Business and Analytics, to New Business Imperatives. Most businesses are at the first stage.)

Lower the Cost of Storage – Warehouse Modernization
§ Data lake
§ Data offload
§ ETL offload
§ Queryable archive and staging

Data-informed Decision Making
§ Full data set analysis (no more sampling)
§ Extract value from non-relational data
§ 360° view of all enterprise data
§ Exploratory analysis and discovery

Business Transformation
§ Create new business models
§ Risk-aware decision making
§ Fight fraud and counter threats
§ Optimize operations
§ Attract, grow, retain customers

What Spark is (and is not)


What Spark Is, What it Is Not

● An Apache Foundation open source project, not a product
● An in-memory compute engine that works with data, not a data store
● Enables highly sophisticated analysis on huge volumes of data at scale
● A unified environment for data scientists, developers and data engineers
● Radically simplifies the process of developing intelligent apps fueled by data


What Spark on z/OS Is Not

●A data cache for all data in DB2, IMS, IDAA, VSAM …

●Just a different SQL engine or query optimizer

●An effective mechanism to access a single data source for analytics

Why isn't it the same as query acceleration / the IBM DB2 Analytics Accelerator?

● Spark does not optimize SQL queries

● Spark is not a mechanism to store data; rather, it provides interfaces to uniformly access data and to apply analytics using a unified interface

● DB2 Analytics Accelerator interaction with applications is via DB2; Spark interaction with applications is via Spark interfaces (Streaming, MLlib, GraphX, SQL), driven through REST or Java

● Spark analytics can access data in DB2, DB2 Analytics Accelerator, VSAM, IMS, off platform, etc.


Apache Spark – a Compute Engine

Fast
• Leverages aggressively cached in-memory, distributed computing and JVM threads
• Faster than MapReduce for some workloads

Ease of use (for programmers)
• Written in Scala, an object-oriented, functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or in the cloud

General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics

(Chart: logistic regression runtimes in Hadoop and Spark, from http://spark.apache.org)


Over 750 contributors from over 250 companies including IBM, Google, Amazon, SAP, Databricks…

Spark History


Why does Spark matter to the business


Build models quickly. Iterate faster. Apply intelligence everywhere.

Who uses Spark

IBM's Involvement and Commitment

● IBM is one of the four founding members of the UC Berkeley AMPLab
  ● Works closely with AMPLab research on projects of mutual interest
● In June 2015, IBM announced:
  ● 3,500 researchers and developers to work on Spark-related projects at IBM labs worldwide
  ● Donation of IBM SystemML machine learning technology to the Spark open source ecosystem
  ● A Spark Technology Center established in San Francisco for the data science and developer community
● IBM supports the Spark community
  ● Code contributions
  ● Partnerships with AMPLab, Galvanize and Big Data University (MOOC)
  ● Education for data scientists and developers

●Visit the IBM Spark Technology Center at http://www.spark.tc

Spark Details


The Spark Stack, Architectural Overview


How to use Spark?


What is Spark used for?


Spark Streaming


Why Apache Spark

Existing approach: MapReduce for complex jobs, interactive query, and online event-hub processing involves lots of disk I/O.

New approach: keep more data in memory with a new distributed execution engine.

How Does Spark Process Your Data – RDDs

• Resilient Distributed Data (RDD): Spark's basic unit of data
• Immutable; modifications create new RDDs
• Caching and persistence (in memory, spilling to disk)
• Fault tolerance: if data in memory is lost, it can be recreated from its lineage
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes
• An RDD is physically distributed across the cluster, but manipulated as one logical entity:
  • Spark will automatically "distribute" any required processing to all partitions
  • and perform necessary redistributions and aggregations as well

(Figure: a distributed RDD "Names")

How Does Spark Process Your Data – Resilient Distributed Data Sets (RDDs)

• Spark's basic unit of data
• Immutable collection of elements that can be operated on in parallel across a cluster
• Fault tolerance
  – If data in memory is lost, it will be recreated from lineage
• Caching, persistence (memory, spilling, disk) and check-pointing
• Many database or file types can be supported
• An RDD is physically distributed across the cluster, but manipulated as one logical entity:
  – Spark will "distribute" any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well
  – Example: consider a distributed RDD "Names" made up of names

How Does Spark Process Your Data – Resilient Distributed Data Sets (RDDs)

• Key idea: write programs in terms of transformations on distributed datasets
• An RDD is Spark's basic unit of data
• Fault tolerance
  • If data in memory is lost, it will be recreated from lineage
• Caching, persistence (memory or disk)
• Many databases or file types can be supported
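The immutability and lineage ideas above can be sketched with a toy RDD class in plain Python (a stand-in for illustration only, not the Spark API): each "RDD" is immutable and only remembers how to recompute itself from its parent, so a lost dataset can always be rebuilt by replaying its lineage.

```python
# Toy model of RDD lineage: immutable datasets that remember how they
# were derived, so any of them can be recomputed from the original source.
class ToyRDD:
    def __init__(self, source_fn):
        # source_fn recomputes this dataset from scratch (its lineage)
        self._source_fn = source_fn

    def map(self, f):
        # Transformations never mutate; they return a NEW ToyRDD whose
        # lineage is "parent's lineage, then apply f".
        parent = self._source_fn
        return ToyRDD(lambda: [f(x) for x in parent()])

    def filter(self, pred):
        parent = self._source_fn
        return ToyRDD(lambda: [x for x in parent() if pred(x)])

    def collect(self):
        # An action: materialise the data by replaying the lineage.
        return self._source_fn()

names = ToyRDD(lambda: ["Ada", "Grace", "Linus"])
upper = names.map(str.upper)               # new RDD; "names" is untouched
short = upper.filter(lambda n: len(n) <= 4)

print(short.collect())   # ['ADA']
print(names.collect())   # original data unchanged: ['Ada', 'Grace', 'Linus']
```

A real Spark RDD additionally partitions the data across the cluster; the recovery idea is the same — only the lost partition's lineage is replayed.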

© 2016 IBM Corporation 22

Spark Programming Model

• Operations on RDDs (datasets):
  • Transformations
  • Actions
• Transformations use lazy evaluation
  • Executed only if an action requires it
• An application consists of a Directed Acyclic Graph (DAG)
  • Each action results in a separate batch job
  • Parallelism is determined by the number of RDD partitions
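The last point — parallelism follows the number of partitions — can be sketched in plain Python (a stand-in for the Spark scheduler, for illustration only): the data is split into partitions, and each partition is processed by its own worker.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    # Work applied independently to each partition (like one map task).
    return [x * x for x in part]

data = list(range(10))
parts = partition(data, 4)   # 4 partitions -> up to 4 parallel tasks
with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    per_partition = list(pool.map(process_partition, parts))

# Flatten the per-partition results back into one dataset
squared = [x for part in per_partition for x in part]
print(squared)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

With more partitions than workers, tasks simply queue up — exactly why the partition count caps the achievable parallelism.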

© 2016 IBM Corporation 23

Resilient Distributed Datasets (RDDs)

© 2016 IBM Corporation 24

Spark Lazy Execution

• At a high level, any Spark application creates RDDs out of some input...
• ...runs (lazy) transformations of these RDDs into some other form...
• ...and finally performs actions to collect or store data
• Execution (data access) only starts when an action is encountered.
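Lazy execution can be illustrated with Python generators (again a stand-in, not the Spark API): the "transformations" below only build a pipeline and never touch the data; nothing runs until the final "action" consumes the pipeline.

```python
log = []

def source():
    for x in [1, 2, 3]:
        log.append(f"read {x}")   # records when data is actually touched
        yield x

# "Transformations": build the pipeline lazily; no data is read yet
pipeline = (x * 10 for x in source())
pipeline = (x + 1 for x in pipeline)
assert log == []                  # nothing has executed so far

# "Action": consuming the pipeline finally triggers execution
result = list(pipeline)
print(result)   # [11, 21, 31]
print(log)      # ['read 1', 'read 2', 'read 3']
```

Spark behaves the same way at cluster scale: chaining `map` or `filter` is free, and only an action such as `collect()` or `saveAsTextFile()` schedules a job.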

© 2016 IBM Corporation 25

Common, Popular Methods to Access Data

Spark SQL
● Provides relational queries expressed in SQL, HiveQL and Scala
● Seamlessly mixes SQL queries with Spark programs
● Provides a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
● Standard connectivity through JDBC/ODBC
● Integration of Spark z/OS with Rocket Software provides unique functionality to access data across a wide variety of environments with very high performance and flexibility

SparkR
● SparkR is an R package that provides a light-weight front-end to use Apache Spark from R
● SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
● The goal is to make SparkR production ready
● Rocket Software has announced its intent to support R on z/OS

The Ecosystem for Spark on z Systems


Spark Processing of z Data

Access to z data:
• JDBC access to DB2 z/OS, IMS
• Rocket Mainframe Data Service for Apache Spark (MDSS)

Spark running on z Systems:
– z/OS
– Linux on z Systems

https://www.ibm.com/developerworks/java/jdk/spark/

Operating system                 | Hardware  | Bitness
z/OS 2.1                         | z Systems | 64
SUSE Linux Enterprise Server 12  | z Systems | 64
Red Hat Enterprise Linux 7.1     | z Systems | 64
Ubuntu 16.04 LTS                 | z Systems | 64

Spark Support on z Systems


IBM z/OS Platform for Apache Spark

(Architecture diagram: z/OS Spark analytic applications — customer-, IBM- or partner-provided — run on the Apache Spark Core (Spark Streaming, Spark SQL, MLlib, GraphX, RDDs) on Java for z/OS. Data sources shown: DB2 z/OS via JDBC type 2/type 4, IMS, VSAM, SMF, OLTP applications (CICS, IMS, WAS, ...) and analysis routines on platform; and, via the Mainframe Data Service for Apache Spark z/OS, external data sources such as the IBM DB2 Analytics Accelerator, Oracle, HDFS, JSON/XML, DB2 LUW, Tera..., Twitter, and more.)


IBM z/OS Platform for Apache Spark ...

● Almost any data source from any location can be processed in the Spark environment on z/OS
● Mainframe Data Service for Apache Spark (MDSS) is key to providing a single, optimized view of heterogeneous data sources
● MDSS can integrate off-platform data sources as well
● The large majority of cycles used by MDSS are zIIP-eligible
● It is possible to use Spark on z/OS without MDSS, but MDSS is recommended
● Integration in Online Transaction Processing (OLTP) is possible, but response time may be an issue
● Spark is the high-performance solution for processing big data
● IBM DB2 Analytics Accelerator optimization with DB2 for z/OS can be integrated


z Systems Analytics and Spark: Well-matched

§ Faster framework for in-memory analytics, both batch and real-time
§ Federated data-in-place analytics access while preserving consistent APIs

Why Spark on z Systems:
§ Optimized performance with zEDC for compression
§ SMT2 for better thread performance
§ SIMD for optimized floating-point performance, accelerating analytics processing for mathematical models via vector processing
§ Leverages IBM Java optimizations and enhancements
§ zIIP eligibility

(Diagram: the Spark stack — SQL, GraphX, Stream, MLlib, RDD cache — over file system, scheduler, network, serialization and compression layers.)


Why and when to use Spark on z/OS?

• The environment where Apache Spark z/OS makes sense:
  ● Running real-time or batch analytics over a variety of heterogeneous data sources
  ● Efficient real-time access to current and historical transactions
  ● Where the majority of data is z/OS resident
  ● When data contains sensitive information that should not be scattered across several distributed nodes to be held in memory for some unknown period of time
  ● When implementing common analytic interfaces that are shared with users on distributed platforms

• z/OS strengths are valuable in a Spark environment:
  ● Intra-SQL and intra-partition parallelism for optimal data access to:
    ● nearly all z/OS data environments
    ● distributed data sources


The Ecosystem for Spark on z/OS – Relevance to the New Consumers

Data Scientist
● The primary customer
● Creates the Spark application(s) that produce insights with business value
● Probably doesn't know or care where all of the Spark resources come from

Data Engineer
● Helps the data scientist assemble and clean the data and write applications
● Probably has better awareness of resource details, but is still primarily concerned with the problem to solve, not the platform

App Developer
● Close to the platform, probably a Z-based person
● Works with the IM specialist to associate a view of the data with the actual on-platform assets

(Slide nicknames for the roles: "the wrangler", "the builder", "the convincer")

Use Cases


Business Use Case Illustrated in a Showcase

Financial institution: offers both retail/consumer banking as well as investment services.

Business-critical information owned by the organization:
● Credit card info
● Trade transaction info
● Customer info
● Stock price history (public data)
● Social media data: Twitter

GOAL: produce right-time offers for cross-sell or upsell, tailored for individual customers based on customer profile, trade transaction history, credit card purchases, and social media sentiment and information.

Spark z/OS Client Insight Use Case


Spark z/OS Showcase: Reference Architecture

(Reference architecture diagram, components: DB2 z/OS, IMS and VSAM on z/OS holding credit card info, customer info and trade transaction info; CICS and IMS transaction systems; a Spark Job Server reached over JDBC and a REST API; open source visualization libraries and Cloudant on Linux on z.)


· The Spark application is agnostic to the data source and the number of sources
· MDSS is required on at least one system; MDSS agents are required on all systems. No IPL is required for installation
· Log stream recording mode is required for real-time interfaces

(Diagram: a Spark application using Spark SQL connects via JDBC to the Mainframe Data Service for Spark z/OS (MDSS); MDSS clients on LPAR 1 through LPAR n read SMF real-time data from log streams and dump data sets.)

Analyzing SMF Data with Apache Spark


Streaming SMF into Apache Spark

• Spark supports custom input streams – Custom Receiver support
• PoC built with a custom SMF data receiver using real-time SMF services
• SMF 30 records stored in a DStream in Spark – a time series of RDDs
• The DStream map() method translates raw SMF 30 records into formatted strings
• Simple example to show the data flow – formatted records printed to stdout

(Data flow diagram: the custom SMF receiver fills a byte[] DStream — one RDD of SMF 30 records per time interval — and a map() operation produces a String DStream of the corresponding formatted records.)
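The flow above can be sketched in plain Python (a stand-in for the DStream API, which is not required here): the stream is modelled as a sequence of per-interval record batches, and a map step formats each raw record while preserving the batching. The records and formatter are made-up placeholders, not real SMF 30 layouts.

```python
# Toy model of the DStream flow: batches of raw byte records arrive per
# time interval; a map() step turns each batch into formatted strings.
raw_batches = [
    [b"SMF30 job A", b"SMF30 job B"],   # records from time 0 to 1
    [b"SMF30 job C"],                   # records from time 1 to 2
]

def stream_map(batches, f):
    """Apply f to every record in every batch, preserving batching."""
    for batch in batches:
        yield [f(record) for record in batch]

def fmt(record):
    # Placeholder formatter standing in for real SMF 30 record parsing
    return f"Formatted: {record.decode('ascii')}"

for batch in stream_map(raw_batches, fmt):
    for line in batch:
        print(line)
# Formatted: SMF30 job A
# Formatted: SMF30 job B
# Formatted: SMF30 job C
```

In the real PoC, the batches are the RDDs that Spark Streaming creates per time interval, and `DStream.map()` plays the role of `stream_map`.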


SMF Analytics Example: SMF Streaming into Apache Spark

• Spark is the "analytics operating system"
• Supports input streams – Custom Receiver support
• PoC built with a custom SMF data receiver using real-time SMF services
• This can be an interesting way to learn and get comfortable with Spark technologies

(Component diagram: a simple Spark SMF application and SMF Java classes run in Spark on z/OS (PoC code, including the Spark custom SMF data receiver); Java JNI interfaces call the SMF read in-memory data service (new z/OS support), which builds on existing SMF buffering technology and the existing SMF EWTM in base z/OS.)


SMFReal-TimeStreaminginto

1.RunthecommandtostarttheSparkjob./bin/spark-submit--masterlocal/spark/spark15/lib/JavaSMFReceiver.jar IFASMF.INMEM

2.SeethestatusoftheSMFreal-timeresourceontheMVSconsole

3.LookattheSparkjoboutputinUnix

SY1 d smf,mIFA714I 15.21.05 SMF STATUS 082

IN MEMORY CONNECTIONS Resource: IFASMF.INMEM Con#: 0001 Connect Time: 2015.273 15:20:11ASID: 002F RECORDS: 00000007

SMF30.2 on SY1 @ 15:21:00 273.2015SMF30.2 on SY1 @ 15:21:00 273.2015 SMF30.2 on SY1 @ 15:21:00 273.2015...


New Redbook on Apache Spark Implementation on IBM z/OS


For more information, click here. Contact: Khadija Souissi, [email protected]

Want to see and know more about Spark on z Systems? Then be our guest for the Spark on z Systems event.

References

● Spark communities
  ● https://cwiki.apache.org/confluence/display/SPARK/Committers
  ● https://amplab.cs.berkeley.edu/software/
● Spark SQL Programming Guide
  ● http://spark.apache.org/docs/latest/sql-programming-guide.html
● IBM SystemML
  ● Open source: in June 2015, IBM announced it would open-source SystemML
  ● https://developer.ibm.com/open/systemml/
  ● SystemML has been accepted as an Apache Incubator project
  ● http://systemml.apache.org/
  ● Source code: https://github.com/apache/incubator-systemml
  ● Published paper: http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf
● Big Data University – Spark Fundamentals course
  ● http://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
● IBM paper on fueling the insight economy with Apache Spark
  ● www-01.ibm.com/marketing/iwm/dre/signup?source=ibm-analytics&S_PKG=ov37158&dynform=19170&S_TACT=M161001W
● A Deeper Understanding of Spark Internals
  ● https://www.youtube.com/watch?v=dmL0N3qfSc8
● IBM z/OS Platform for Apache Spark documentation
  ● http://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.azk/azk.htm


Thank you!