Apache Spark on z Systems – IBM
© 2016 IBM Corporation 2
Acknowledgements
● Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.
Agenda
● What is Spark
● Spark Details
● The ecosystem for Spark on z/OS
● Use Cases
● References
Big Data Maturity

Value increases across three stages:

Operations / Data Warehousing – Lower the Cost of Storage / Warehouse Modernization (most businesses are here)
§ Data lake
§ Data offload
§ ETL offload
§ Queryable archive and staging

Line of Business and Analytics – Data-informed Decision Making
§ Full dataset analysis (no more sampling)
§ Extract value from non-relational data
§ 360° view of all enterprise data
§ Exploratory analysis and discovery

New Business Imperatives – Business Transformation
§ Create new business models
§ Risk-aware decision making
§ Fight fraud and counter threats
§ Optimize operations
§ Attract, grow, retain customers
What Spark Is, What It Is Not

● An Apache Foundation open source project – not a product
● An in-memory compute engine that works with data – not a data store
● Enables highly sophisticated analysis on huge volumes of data at scale
● Unified environment for data scientists, developers and data engineers
● Radically simplifies the process of developing intelligent apps fueled by data
What Spark on z/OS Is Not
●A data cache for all data in DB2, IMS, IDAA, VSAM …
●Just a different SQL engine or query optimizer
●An effective mechanism to access a single data source for analytics
Why isn’t it the same as query acceleration / IBM DB2 Analytics Accelerator?
● Spark does not optimize SQL queries
● Spark is not a mechanism to store data; rather, it provides interfaces to uniformly access data and to apply analytics using a unified interface
● DB2 Analytics Accelerator interaction with applications is via DB2; Spark interaction with applications is via Spark interfaces (Streaming, MLlib, GraphX, SQL), driven through REST or Java
● Spark analytics can access data in DB2, DB2 Analytics Accelerator, VSAM, IMS, off platform, etc.
Apache Spark – a Compute Engine

Fast
• Leverages aggressively cached in-memory, distributed computing and JVM threads
• Faster than MapReduce for some workloads

Ease of use (for programmers)
• Written in Scala, an object-oriented, functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud

General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics

[Chart: logistic regression in Hadoop and Spark, from http://spark.apache.org]
Spark History
Over 750 contributors from over 250 companies including IBM, Google, Amazon, SAP, Databricks…
Who Uses Spark
Build models quickly. Iterate faster. Apply intelligence everywhere.
IBM’s Involvement and Commitment

● IBM is one of the four founding members of the UC Berkeley AMPLab
● Works closely with AMPLab research on projects of mutual interest
● June 2015, IBM announced:
  ● 3,500 researchers and developers to work on Spark-related projects at IBM labs worldwide
  ● Donation of IBM SystemML machine learning technology to the Spark open source ecosystem
  ● Spark Technology Center established in San Francisco for the data science and developer community
● IBM supports the Spark community:
  ● Code contributions
  ● Partnerships with AMPLab, Galvanize and Big Data University (MOOC)
  ● Education for data scientists and developers
● Visit the IBM Spark Technology Center at http://www.spark.tc
Why Apache Spark

Existing approach: MapReduce for complex jobs, interactive query, and online event-hub processing involves lots of disk I/O.
New approach: keep more data in-memory with a new distributed execution engine.
How Does Spark Process Your Data – RDDs

• Resilient Distributed Dataset (RDD):
  • Spark’s basic unit of data
  • Immutable; modifications create new RDDs
  • Caching, persistence (memory, spilling to disk)
• Fault tolerance
  • If data in memory is lost, it can be recreated from its lineage
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes
• An RDD is physically distributed across the cluster, but manipulated as one logical entity:
  • Spark will automatically “distribute” any required processing to all partitions
  • Perform necessary redistributions and aggregations as well
How Does Spark Process Your Data – Resilient Distributed Datasets (RDDs)

• Spark’s basic unit of data
• Immutable collection of elements that can be operated on in parallel across a cluster
• Fault tolerance
  – If data in memory is lost, it will be recreated from lineage
• Caching, persistence (memory, spilling, disk) and check-pointing
• Many database or file types can be supported
• An RDD is physically distributed across the cluster, but manipulated as one logical entity:
  – Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well.
  – Example: consider a distributed RDD “Names” made of names
How Does Spark Process Your Data – Resilient Distributed Datasets (RDDs)

• Key idea: write programs in terms of transformations on distributed datasets
• An RDD is Spark’s basic unit of data
• Fault tolerance
  • If data in memory is lost, it will be recreated from lineage
• Caching, persistence (memory or disk)
• Many databases or file types can be supported
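The partitioned-but-logical nature of an RDD described above can be sketched in plain Python (PySpark itself is not assumed installed here; the `names` data, the partitioning scheme and the uppercase transformation are all illustrative):

```python
# Plain-Python analogue of the "Names" RDD example (a sketch, not pyspark).
# An RDD behaves like a partitioned collection manipulated as one logical entity.

names = ["Ada", "Grace", "Edsger", "Alan", "Barbara", "Donald"]

# Physically, Spark would split the data into partitions across the cluster.
num_partitions = 3
partitions = [names[i::num_partitions] for i in range(num_partitions)]

# A "transformation" (map) is applied independently to each partition...
mapped = [[name.upper() for name in part] for part in partitions]

# ...and an aggregation gathers the results back into one logical view,
# much like collect() on an RDD.
collected = [name for part in mapped for name in part]

print(sorted(collected))  # ['ADA', 'ALAN', 'BARBARA', 'DONALD', 'EDSGER', 'GRACE']
```

In real Spark the partitions live on different cluster nodes and the per-partition work runs in parallel; here the split is only simulated to show the programming model.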
Spark Programming Model

• Operations on RDDs (datasets)
  • Transformations
  • Actions
• Transformations use lazy evaluation
  • Executed only if an action requires them
• An application consists of a Directed Acyclic Graph (DAG)
  • Each action results in a separate batch job
  • Parallelism is determined by the number of RDD partitions
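The last point, parallelism determined by the number of partitions, can be sketched in plain Python (PySpark is not assumed installed; the data, the four-way split and the square-and-sum workload are illustrative stand-ins for per-partition tasks):

```python
# Sketch: each partition becomes an independent task, so the partition
# count bounds the parallelism, as with RDD partitions in Spark.
from concurrent.futures import ThreadPoolExecutor

data = list(range(10))
num_partitions = 4  # in Spark this would come from rdd.getNumPartitions()
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def process_partition(part):
    # The per-partition work: a map (square) followed by a local reduce (sum).
    return sum(x * x for x in part)

# One task per partition; at most num_partitions tasks run concurrently.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# The final "action" aggregates the per-partition results.
print(sum(partial_sums))  # sum of squares of 0..9 = 285
```

More partitions mean more (smaller) tasks that can be scheduled in parallel, which is exactly why repartitioning is a common Spark tuning step.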
Spark Lazy Execution

• At a high level, any Spark application creates RDDs out of some input…
• …runs (lazy) transformations of these RDDs into some other form (shape)…
• …and finally performs actions to collect or store data
• Execution (data access) only starts when an action is encountered
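The same create-transform-act shape can be sketched with Python generators, which share Spark's laziness (PySpark is not assumed installed; the `numbers` source and the doubling/filter pipeline are invented for illustration):

```python
# Lazy-evaluation sketch: like Spark transformations, generator pipelines
# describe work but do nothing until an "action" consumes them.
log = []

def numbers():
    for i in range(5):
        log.append(f"read {i}")  # side effect records when data is actually touched
        yield i

# Two chained "transformations" -- nothing has executed yet.
doubled = (x * 2 for x in numbers())
big = (x for x in doubled if x >= 4)

print(log)          # [] -- no input has been read so far
result = list(big)  # the "action": triggers the whole pipeline
print(result)       # [4, 6, 8]
print(len(log))     # 5 -- the input was read only when the action ran
```

This is the property that lets Spark build the whole DAG first and then optimize and schedule it as one job per action.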
Common, Popular Methods to Access Data

Spark SQL
● Provides relational queries expressed in SQL, HiveQL and Scala
● Seamlessly mixes SQL queries with Spark programs
● Provides a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files
● Standard connectivity through JDBC/ODBC
● Integration of Spark z/OS with Rocket Software provides unique functionality to access data across a wide variety of environments with very high performance and flexibility

SparkR
● SparkR is an R package that provides a light-weight front-end to use Apache Spark from R
● SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
● Goal is to make SparkR production ready
● Rocket Software has announced intent to support R on z/OS
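The Spark SQL pattern above, loading semi-structured JSON and querying it with SQL, can be sketched without a Spark runtime using Python's stdlib `json` and `sqlite3` as stand-ins (this is explicitly not Spark SQL; the `purchases` records and query are invented for illustration):

```python
# Stand-in sketch for the Spark SQL pattern: register structured data
# derived from JSON, then query it with SQL. In Spark this would be
# spark.read.json(...) followed by spark.sql(...).
import json
import sqlite3

records = [json.loads(line) for line in [
    '{"name": "Alice", "amount": 120}',
    '{"name": "Bob",   "amount": 80}',
    '{"name": "Alice", "amount": 50}',
]]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (name TEXT, amount INTEGER)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [(r["name"], r["amount"]) for r in records])

rows = con.execute(
    "SELECT name, SUM(amount) FROM purchases GROUP BY name ORDER BY name"
).fetchall()
print(rows)  # [('Alice', 170), ('Bob', 80)]
```

The value Spark SQL adds over this single-process sketch is that the same SQL runs distributed across partitions and can mix freely with RDD/DataFrame code in the same program.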
Spark Support on z Systems

Access to z data:
• JDBC access to DB2 z/OS, IMS
• Rocket Mainframe Data Service for Apache Spark (MDSS)

Spark running on z Systems:
– z/OS
– Linux on z Systems

https://www.ibm.com/developerworks/java/jdk/spark/

Operating system                   Hardware    Bitness
z/OS 2.1                           z Systems   64
SUSE Linux Enterprise Server 12    z Systems   64
Red Hat Enterprise Linux 7.1       z Systems   64
Ubuntu 16.04 LTS                   z Systems   64
IBM z/OS Platform for Apache Spark

[Architecture diagram: Apache Spark Core on Java for z/OS, with Spark Streaming, Spark SQL, MLlib and GraphX operating on RDDs. z/OS Spark analytic applications (customer-provided, IBM-provided or partner-provided) and analysis routines run on top. Data reaches Spark through the Mainframe Data Service for Apache Spark z/OS from DB2 z/OS (Type 2 / Type 4 JDBC), IMS, VSAM, SMF and more, alongside OLTP applications (CICS, IMS, WAS…). External data sources include IBM DB2 Analytics Accelerator, Oracle, HDFS, JSON/XML, DB2 LUW, Tera…]
IBM z/OS Platform for Apache Spark…

● Almost any data source from any location can be processed in the Spark environment on z/OS
● Mainframe Data Service for Apache Spark (MDSS) is key to providing a single, optimized view of heterogeneous data sources
● MDSS can integrate off-platform data sources as well
● The large majority of cycles used by MDSS are zIIP-eligible
● It is possible to use Spark on z/OS without MDSS, but MDSS is recommended
● Integration in Online Transaction Processing (OLTP) is possible, but response time may be an issue
● Spark is the high-performance solution for processing big data
● IBM DB2 Analytics Accelerator optimization with DB2 for z/OS can be integrated
z Systems Analytics and Spark: Well Matched

Why Spark on z Systems:
§ Faster framework for in-memory analytics, both batch and real-time
§ Federated data-in-place analytics access while preserving consistent APIs
§ Optimized performance with zEDC for compression
§ SMT2 for better thread performance
§ SIMD for optimized floating-point performance; accelerates analytics processing for mathematical models via vector processing
§ Leverages IBM Java optimizations and enhancements
§ zIIP eligibility

[Diagram: the Spark stack – SQL, GraphX, Streaming and MLlib over the RDD cache, with file system, scheduler, network, serialization and compression layers underneath]
Why and When Use Spark on z/OS?

• The environment where Apache Spark z/OS makes sense:
  ● Running real-time or batch analytics over a variety of heterogeneous data sources
  ● Efficient real-time access to current and historical transactions
  ● Where a majority of data is z/OS resident
  ● When data contains sensitive information that should not be scattered across several distributed nodes to be held in memory for some unknown period of time
  ● When implementing common analytic interfaces that are shared with users on distributed platforms
• z/OS strengths are valuable in a Spark environment:
  ● Intra-SQL and intra-partition parallelism for optimal data access to:
    ● nearly all z/OS data environments
    ● distributed data sources
The Ecosystem for Spark on z/OS – Relevance to the New Consumers

Data Scientist
● The primary customer
● Creates the Spark application(s) that produce insights with business value
● Probably doesn’t know or care where all of the Spark resources come from

Data Engineer („the wrangler“)
● Helps the data scientist assemble and clean the data, and write applications
● Probably has better awareness of resource details, but is still primarily concerned with the problem to solve, not the platform

App Developer („the builder“, „the convincer“)
● Close to the platform, probably a Z-based person
● Works with the IM specialist to associate a view of the data with the actual on-platform assets
Business Use Case Illustrated in a Showcase

Spark z/OS Client Insight Use Case

Financial institution: offers both retail/consumer banking as well as investment services.

Business-critical information owned by the organization:
● Credit Card Info
● Trade Transaction Info
● Customer Info
● Stock Price History (public data)
● Social Media Data: Twitter

GOAL: produce right-time offers for cross-sell or upsell, tailored for individual customers based on: customer profile, trade transaction history, credit card purchases, and social media sentiment and information.
Spark z/OS Showcase: Reference Architecture

[Architecture diagram: a Spark Job Server on Linux on z connects over JDBC to DB2 z/OS, an IMS transaction system and VSAM (behind a CICS transaction system) on z/OS, and to Cloudant on Linux on z. The data includes Credit Card Info, Customer Info and Trade Transaction Info; results are exposed through a REST API to open source visualization libraries.]
Analyzing SMF Data with Spark

· The Spark application is agnostic to the data source and number of sources
· MDSS is required on at least one system; MDSS agents are required on all systems. No IPL is required for installation
· Logstream recording mode is required for real-time interfaces

[Deployment diagram: a Spark application using Spark SQL connects via JDBC to the Mainframe Data Service for Spark z/OS (MDSS). MDSS clients on LPAR 1 through LPAR n each provide SMF real-time access backed by logstreams, with dump data sets as an additional source.]
Streaming SMF into Spark

• Spark supports custom input streams – Custom Receiver support
• PoC built with a custom SMF data receiver using real-time SMF services
• SMF 30 records are stored in a DStream in Spark – a time series of RDDs
• The DStream map() method translates raw SMF 30 records into formatted strings
• Simple example to show the data flow – formatted records printed to stdout

[Data-flow diagram: a custom SMF receiver produces a byte[] DStream – one RDD per interval (SMF 30s from time 0 to 1, 1 to 2, 2 to 3, 3 to 4) – and a map() operation turns it into a String DStream of formatted SMF 30 records for the same intervals.]
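The DStream flow described above can be sketched in plain Python (no Spark installed; the record layout, the `format_record` helper and the batch contents are invented purely for illustration, real SMF 30 records look nothing like this):

```python
# Sketch of the DStream data flow: each batch interval yields one "RDD"
# of raw byte records, and a map() step turns each record into a
# formatted string, which is then printed to stdout.
raw_dstream = [                      # one list per batch interval
    [b"SMF30\x01", b"SMF30\x02"],    # records from time 0 to 1
    [b"SMF30\x03"],                  # records from time 1 to 2
]

def format_record(rec: bytes) -> str:
    # Hypothetical formatting: record type text plus a sequence byte.
    return f"{rec[:5].decode()} seq={rec[5]}"

# The map() operation applied to every RDD in the stream.
string_dstream = [[format_record(r) for r in batch] for batch in raw_dstream]

for batch in string_dstream:
    for line in batch:
        print(line)  # formatted records written to stdout
```

In Spark Streaming the receiver pushes each interval's records into the stream and `DStream.map()` registers the transformation once, applying it to every RDD as it arrives; the list-of-lists here only mimics that time series.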
SMF Analytics Example: SMF Streaming into Spark

• Spark is the “analytics operating system”
• Supports input streams – Custom Receiver support
• PoC built with a custom SMF data receiver using real-time SMF services
• This can be an interesting way to learn and get comfortable with Spark technologies

[Component diagram: existing SMF buffering technology and the SMF read in-memory data service (new z/OS support) feed, via Java JNI interfaces, a Spark custom SMF data receiver and SMF Java classes (PoC code) used by a simple Spark SMF application running in Spark on z/OS; existing SMF EWTM and base z/OS support sit underneath.]
SMF Real-Time Streaming into Spark

1. Run the command to start the Spark job:
   ./bin/spark-submit --master local /spark/spark15/lib/JavaSMFReceiver.jar IFASMF.INMEM

2. See the status of the SMF real-time resource on the MVS console:
   SY1 d smf,m
   IFA714I 15.21.05 SMF STATUS 082
   IN MEMORY CONNECTIONS
   Resource: IFASMF.INMEM  Con#: 0001  Connect Time: 2015.273 15:20:11  ASID: 002F  RECORDS: 00000007

3. Look at the Spark job output in Unix:
   SMF30.2 on SY1 @ 15:21:00 273.2015
   SMF30.2 on SY1 @ 15:21:00 273.2015
   SMF30.2 on SY1 @ 15:21:00 273.2015
   ...
For more information click here. Contact: Khadija Souissi, [email protected]

Want to see and know more about Spark on z Systems? Then be our guest for the Spark on z Systems event.
References

● Spark communities
  ● https://cwiki.apache.org/confluence/display/SPARK/Committers
  ● https://amplab.cs.berkeley.edu/software/
● Spark SQL Programming Guide
  ● http://spark.apache.org/docs/latest/sql-programming-guide.html
● IBM SystemML
  ● Open source: in June 2015, IBM announced the open-sourcing of SystemML
  ● https://developer.ibm.com/open/systemml/
  ● SystemML has been accepted as an Apache Incubator project
  ● http://systemml.apache.org/
  ● Source code: https://github.com/apache/incubator-systemml
  ● Published paper: http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf
● Big Data University – Spark Fundamentals course
  ● http://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
● IBM paper on fueling the insight economy with Apache Spark
  ● www-01.ibm.com/marketing/iwm/dre/signup?source=ibm-analytics&S_PKG=ov37158&dynform=19170&S_TACT=M161001W
● A Deeper Understanding of Spark Internals
  ● https://www.youtube.com/watch?v=dmL0N3qfSc8
● IBM z/OS Platform for Apache Spark documentation
  ● http://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.azk/azk.htm