Post on 21-Jan-2018
© 2017 IBM Corporation
Data: Spark and the Data Federation
Leif Pedersen
Executive IT Specialist,
z Analytics, Europe
Email: Leif.Pedersen@dk.ibm.com
Systems of Insight
Systems of Record
Systems of Engagement
Looks like “déjà vu”?
In the new insight economy, winners infuse analytics everywhere to drive better outcomes!
Create new business models (CEO)
Attract, grow, retain customers (CMO)
Transform financial & management processes (CFO)
Manage risk (CRO)
Prioritize IT investment for innovation (CIO, CDO)
Optimize operations (COO)
Fight fraud and counter threats (CSO)
Systems of Insight
Systems of Record
Systems of Engagement
All Data, New Dev Styles, New Analytics, More People
Business Value
Embrace all data
Run at the speed of business
Enable all analytics
IBM Analytics Point of View - Make DATA SIMPLE and ACCESSIBLE to ALL
DATA Professionals are leading THE Transformation!
The Evolution in the Approach to Getting Value from Data
Operations, Data Warehousing, Self-service Analytics, New Business Imperatives
Axes: Maturity (Low to High), Value (Low to High)
Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery
Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging
Lower the Cost of Storage
Ensure resiliency and availability
Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers
Value
We are here
Analytics evolution to support all Analytics Apps on all Data – The Mainframe Use Case
Applications, Data
SoE
SoI: BI Reporting, Data Warehouse / Data Marts
The Data Lake Evolution: HDFS, Map / Reduce, Spark
Historical data in DB2 for z/OS & IBM DB2 Analytics Accelerator
Other Data
SoR: Core Business supported by CICS, IMS, WAS
Operational Data stored in VSAM, IMS, DB2
z/OS: Rules, Score execution
The Predictive Analytics Evolution: Machine Learning, Score Creation
IT Operational Data
z Systems Analytics Areas complement existing Analytics Environments.
IBM DB2 Analytics Accelerator
In-transaction rules and score execution
Intraday capability for ad-hoc queries & predictive analytics
Availability of historical data (in raw format)
Accelerated reporting to fulfill internal and regulatory requirements
Ability to transform data before offload to DWH or reporting
Ability to create new models at any time
Quasi-real-time availability of data for analytics
Instant access to raw data for new report generation in hours instead of days
Load and merge of ANY non-DB2 z/OS data
Scoring, Rules
zData, zApps
Explore data to uncover hidden insights
• Opportunity to rethink business processes: analytics as an integral part of the process itself, rather than a separate activity performed after the fact
– Transform business processes, not just provide existing styles of analytics faster and without latency
• Enable business leaders to perform, in the context of operational processes, advanced and sophisticated real-time analysis of their business data
“Hybrid transaction/analytical processing will empower application leaders to innovate via greater situation awareness and improved business agility.”
Gartner Research Note G00259033, 28 January 2014: Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation
The integration of transactions and analytics is an emerging and important market segment
Analytics as part of the
flow of business
Insights on every
transaction
Hybrid Transaction/Analytical Processing (HTAP) - with DB2 Analytics Accelerator
OLAP
DB2 for z/OS Processing
IBM DB2 Analytics Accelerator
DB2 for z/OS CPU savings target
• Operational (in transaction) analytics
• (complex) OLTP
Accelerator focus
• Ad-hoc queries
• Complex queries scanning large amounts of data
• ETL acceleration/virtual transformation
Complex queries (more history)
OLTP Transactions
High concurrency
Hybrid Transactional & Analytical Processing
Standard reports
Data Warehouse and Data Lake
A Data Lake is…
+ An analytics sandbox for exploring data to gain insight
+ An enterprise-wide catalog to find data across the enterprise and to link from business term to technical metadata
+ An environment for enabling reuse of data transformations and queries
+ An environment where users can access vast amounts of raw data
+ An environment for developing and proving an analytics model and then moving it into production; experience in production may drive further experimentation in the data lake
A Data Lake is not…
- A data warehouse or data mart of all of the data in an enterprise
- A high-performance production environment
- A production reporting application
- A purpose-built system to solve a specific problem
• Fast Runtime Environment
– Interactive or batch processing
– Based on in-memory data processing
  • High performance for multi-step processes where Spark can pass the data directly without using disk storage.
– Parallel processing
• Interface to Data
– Accessing Hadoop-based HDFS data, Cassandra, HBase, …
– Accessing any traditional database using JDBC
• Interface for Applications
– Ease-of-use APIs supported by modern languages
– Stack of libraries including SQL, Machine Learning, GraphX, and Spark Streaming
– Over 80 high-level operators that make it easy to build parallel applications
– Many languages supported including Java, Scala, Python and R
  • Spark is actually written in Scala
Spark, a Transaction Manager for Analytics Applications
Spark is NOT a datastore, NOT a replacement for Hadoop!
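The multi-step, in-memory processing style described on this slide can be illustrated without a Spark installation. The sketch below is plain Python, not Spark API code: the `transactions` records and the `total_per_account` function are hypothetical names chosen for illustration. It chains a filter, a map and a reduce so that each step hands its output to the next in memory, which is the idea behind Spark's high-level operators (Spark additionally distributes the work across a cluster).

```python
from functools import reduce

# Hypothetical in-memory records standing in for rows read from a data source.
transactions = [
    {"account": "A1", "amount": 120.0},
    {"account": "B7", "amount": -15.5},
    {"account": "A1", "amount": 30.0},
    {"account": "C3", "amount": 250.0},
]


def total_per_account(rows):
    """Chain filter -> map -> reduce; no step writes intermediate files."""
    # Step 1: keep only credits (positive amounts).
    credits = filter(lambda r: r["amount"] > 0, rows)
    # Step 2: project each row to a (key, value) pair.
    pairs = map(lambda r: (r["account"], r["amount"]), credits)

    # Step 3: sum the values per key.
    def merge(acc, pair):
        key, value = pair
        acc[key] = acc.get(key, 0.0) + value
        return acc

    return reduce(merge, pairs, {})


totals = total_per_account(transactions)
print(totals)
```

Because `filter` and `map` are lazy in Python, no intermediate list is materialized between the steps; rows flow through the whole chain one at a time.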
1. Spark makes it easier to access and work with all data
2. Spark lets you develop line-of-business applications faster
3. Spark learns from data and delivers in real time
With Hadoop, you ask a question and get back a batch of data. With Spark, you may say, “continue to give me answers to this question”…and when new data comes, the user is smarter.
- Enables new data-based use cases
- All data: Internal/External, Structured/Unstructured
- Real-time insights, from all data sources
- Automates analytics with Machine Learning
- Clients that lead in data, lead their industry
Design, Development, Data Science
Why Spark matters to a business?
VSAM, Adabas, IMS, DB2 z/OS
z/OS
Key Business Transaction & Batch Systems
Spark Applications: IBM and Partners
Distributed: Teradata, HDFS
Apache Spark Core
Spark Stream, Spark SQL, MLlib, GraphX
RDD, DF
Optimized data access
IBM z/OS Platform for Apache Spark
and *many* more . . .
Spark can run on z/OS close to z/OS-based Applications & Data
Values:
Data-in-place analytics, without need to ETL or move data for analytic purposes
Optimized access and z/OS governed ‘in-memory’ capabilities for core business data
Unique capability to access almost all z/OS sources with Apache Spark SQL & many non-z data sources
Almost all zIIP eligible
Integration of analytics across core systems, social data, website information, etc.
and *many* more including SMF, OPERLOG, SYSLOGs, . . .
Examples of Spark Use Cases
• Client Insight Analytics over transactions & customer interactions
– Leverage data on z/OS (DB2, VSAM) & distributed (Oracle, SQL Server, HDFS) to enable real-time access from data science teams focused on client insight to develop patterns, models
• Data Distillation - Hybrid Architecture
– Run Spark z/OS to access, aggregate, filter and *distill* large volumes of data
– Make available smaller, aggregated analytic results for access by: customer insight solutions, data science environments
• 360 Degree View: Customers, Payments, Transactions
– Leverage Spark z/OS to get real-time or near real-time view of current status of payments, transactions, customers combining data from OLTP, distributed sources, & streaming
• IT Analytics
– Analyze real-time streamed SMF data, combined with archived SMF data and syslog data, visualize and interact with data science Jupyter Notebook to find patterns
Use Case Patterns
Distill the Data:
• Use Spark z/OS for data blending, cleansing, transform, etc. with data-in-place
• Store results in ‘Tidy’ Data Repository
• Refresh as needed
Explore the results:
• Data exploration, investigation leveraging ‘Tidy’ Repository
Values:
• Leverage most current business data for data science
• Efficiencies in reducing ETL
• Leverage common analytics ecosystem skills
• Integrate Spark on multiple platforms for optimal analytics infrastructure
Use Case #1: Hybrid Data Science
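The distill, store and refresh steps of this use case can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the IBM implementation: plain Python stands in for the Spark job, an in-memory SQLite database stands in for the ‘Tidy’ Data Repository, and the table name `tidy_customer`, the sample rows and the functions `distill` and `refresh` are all hypothetical.

```python
import sqlite3

# Hypothetical raw inputs standing in for data read in place.
customers = {"C1": "Retail", "C2": "Corporate"}
transactions = [
    ("C1", 40.0), ("C1", 60.0), ("C2", 500.0),
    ("C2", None),   # incomplete record, dropped during cleansing
    ("C9", 10.0),   # unknown customer, dropped during blending
]


def distill(customers, transactions):
    """Blend, cleanse and aggregate raw rows into one tidy row per customer."""
    tidy = {}
    for customer_id, amount in transactions:
        if amount is None or customer_id not in customers:
            continue  # skip rows that cannot be used
        count, total = tidy.get(customer_id, (0, 0.0))
        tidy[customer_id] = (count + 1, total + amount)
    return [(cid, customers[cid], n, total)
            for cid, (n, total) in sorted(tidy.items())]


def refresh(repo, rows):
    """Replace the tidy table contents with a freshly distilled result."""
    repo.execute(
        "CREATE TABLE IF NOT EXISTS tidy_customer "
        "(customer_id TEXT PRIMARY KEY, segment TEXT, "
        "tx_count INTEGER, tx_total REAL)"
    )
    repo.execute("DELETE FROM tidy_customer")
    repo.executemany("INSERT INTO tidy_customer VALUES (?, ?, ?, ?)", rows)
    repo.commit()


repo = sqlite3.connect(":memory:")
refresh(repo, distill(customers, transactions))
result = repo.execute(
    "SELECT customer_id, tx_total FROM tidy_customer ORDER BY customer_id"
).fetchall()
print(result)
```

Calling `refresh` again with newly distilled rows corresponds to the "Refresh as needed" step; the exploration side only ever queries the small tidy table, never the raw data.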
Use Case #2: Optimized Customer Insight
Customer
z/OS
Transaction Merchant
Spark Analytic Result Set
Call Center
Apache Spark Core
Spark Stream, Spark SQL, MLlib, GraphX
RDD, DF
Optimized Data Layer
IBM z/OS Platform for Apache Spark
Subset of Data: distilled, filtered, transformed
BI Dashboard Components
Data Cube
Analytical Engines
Web Portal
Analytics
API Gateway
APIs
Pre-Built Dashboards
Pre-Built Data Models
Pre-Built Analytical Models
Transform (if needed), & populate BBCI staging area / cache
Input & Output
Tidy Data
Values:
• Avoid costly and ineffective wholesale copy of data
• Frequent refresh of most relevant data elements to customer insights solution
• Faster time to implementation for business solution to deliver insights on churn, cross-sell, etc.
Customer Insight for Banking Solution
Use Case #3: Real-Time Application Event Analytics
Spark z/OS
Event Stream
• CICS Event triggers create an event stream that would be captured by Spark running on its own z/OS LPAR
– Spark configured for high availability to avoid impacting CICS
• Real-Time Analytics with Spark z/OS:
– Real time analytics to provide feedback into the Systems of Engagement or Monitoring Systems on types of banking services and frequency of consumption
– Real time monitoring of core business processes and applications
• Historical Analysis leverages IDAA:
– Batch Load of Events for historical, trending and reporting
Real Time Analytics, can include scoring
DB2 Analytics Accelerator Loader
Channel
System of Engagement
CICS Transactions
Monitor
Logstream
IBM DB2 Analytics Accelerator
Real-Time Consumption
Batch Load Overnight
Historical Analysis, Reporting
DB2 z/OS
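The real-time part of this use case, reporting which types of banking services are consumed and how often, amounts to counting events over a sliding time window. The sketch below shows that idea in plain Python rather than Spark Streaming; the class `ServiceFrequencyMonitor`, the 60-second window and the sample events are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque


class ServiceFrequencyMonitor:
    """Count service types seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, service) in arrival order
        self.counts = Counter()  # service -> occurrences inside the window

    def observe(self, timestamp, service):
        """Record one event and return the current per-service counts."""
        self.events.append((timestamp, service))
        self.counts[service] += 1
        # Expire events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_service = self.events.popleft()
            self.counts[old_service] -= 1
            if self.counts[old_service] == 0:
                del self.counts[old_service]
        return dict(self.counts)


monitor = ServiceFrequencyMonitor(window_seconds=60)
# Hypothetical event stream: (seconds since start, service type).
stream = [(0, "payment"), (10, "balance"), (30, "payment"), (65, "transfer")]
snapshot = {}
for ts, service in stream:
    snapshot = monitor.observe(ts, service)
print(snapshot)
```

Each returned snapshot is the kind of small, current result that could be fed back to a System of Engagement or a monitoring system, while the full event history is batch-loaded separately for historical analysis.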
Use Case #4: Surface Spark Results to JDBC / ODBC Applications
DB2 z/OS
z/OS
Apache Spark Core
Spark Stream, Spark SQL, MLlib, GraphX
DF, RDD
DF Store
• Persist specific Spark Result Sets
• Backed by VSAM
• Leverage z/OS SAF, Dataset mgmt
HDFS
JDBC / ODBC / REST, noSQL
Client accessing Spark RDDs, example: Cognos, Tableau, …
Optimized Data Layer
IMS, VSAM
Use Case #5: Analyzing SMF Data with Spark
• Spark application is agnostic to data source and number of sources
• MDSS required on at least one system, MDSS agents required on all systems. No IPL required for installation
• Logstream recording mode required for realtime interfaces
LPAR1, LPAR2, LPAR3, LPARn: each with SMF Realtime and Logstream
MDSS Client on LPAR1, LPAR2 and LPAR3
Spark Application using SparkSQL
JDBC
Optimized Data Integration Layer (MDSS)
Dump Data Sets
• Analyze real-time in-memory SMF data, combined with archived data
• Analyze data across multiple LPARs
• Augment with SYSLOG and other sources for richer analytic outcome
• Efficiencies in avoiding data movement
Use Cases for Real Time SMF Analytics
• Detect excessive memory consumption – SMF 30
– Monitor high water mark for real memory usage for jobs and send alerts if usage exceeds normal consumption
• Detect security violations in real-time – SMF 80
– Monitor volume of datasets/files accessed per user within a given time period and raise alerts for above normal access rates
• Real time monitoring of resource usage in cloud environments (CPU, Memory, Disk)
A list of supported SMF record types can be found in the Redbook “Apache Spark Implementation on IBM z/OS” - page 78
http://www.redbooks.ibm.com/abstracts/sg248325.html
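The first two monitoring use cases above reduce to simple threshold checks once the relevant values have been extracted from the SMF records. The sketch below assumes that extraction has already happened: the job names, user IDs, the three-sigma rule for "normal consumption" and the functions `memory_alerts` and `access_alerts` are illustrative assumptions, not part of any IBM product.

```python
from collections import Counter
from statistics import mean, pstdev


def memory_alerts(history, current, sigmas=3.0):
    """Flag jobs whose current high-water mark exceeds
    mean + sigmas * standard deviation of their past runs."""
    alerts = []
    for job, peak in current.items():
        past = history.get(job, [])
        if len(past) < 2:
            continue  # not enough history to define "normal"
        threshold = mean(past) + sigmas * pstdev(past)
        if peak > threshold:
            alerts.append(job)
    return sorted(alerts)


def access_alerts(access_events, max_per_window):
    """Flag users whose dataset accesses in one time window exceed the limit."""
    per_user = Counter(user for user, _dataset in access_events)
    return sorted(u for u, n in per_user.items() if n > max_per_window)


# Hypothetical memory high-water marks (MB) per job, past runs vs. current run.
history = {"PAYROLL": [100, 110, 90, 100], "BACKUP": [500, 520, 480, 500]}
current = {"PAYROLL": 400, "BACKUP": 510}
mem = memory_alerts(history, current)

# Hypothetical (user, dataset) access events observed in one time window.
events = [("alice", "DS.A"), ("bob", "DS.B"), ("bob", "DS.C"), ("bob", "DS.D")]
acc = access_alerts(events, max_per_window=2)
print(mem, acc)
```

In a streaming setting the same checks would run per micro-batch or per window, with the history and the per-user limits maintained from previously observed records.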
IBM Open Data Analytics for z/OS
Business Applications
Customer, Transaction, Merchant
Distributed
Apache Spark
Distilled Insight
Query
Acceleration
Leveraging IBM Z for Optimized Analytics
Federate analytics leveraging data in place for more current insights at scale,
optimized security, privacy and reduced costs
Data, Data Prep, ML Algo, Model, Deploy, Predict
Python
Distilled Insight, Analytic Result Sets
Govern, Manage, Algorithm Assist…
Monitor, Feedback
Pauseless GC
New SIMD instructions, 32 TB Memory, Pervasive Encryption
IBM Open Data Analytics for z/OS
IBM Machine Learning for z/OS
Optimized Data Integration Layer
IBM Open Data Analytics for z/OS: Offering Overview
What is in the Offering?
IBM Open Data Analytics for z/OS (IBM product):
• Apache Spark 2.1.1 enabled for z/OS
• Python 3.6.1
• All pre-requisite libraries
• Select Anaconda libraries (approx. 250 including pandas, dask, numpy, scikit-learn, matplotlib…)
• Optimized Data Integration Layer: optimized for Spark & Python db access to z/OS data
• Integration with WLM z/OS for resource management aligned with job priority
• Integration with security (SAF) interfaces
• Support & Service available from IBM for a fee
– Very aggressive pricing for zIIPs (cores) and memory for Open Data Analytics z/OS workload
Ecosystem
– GitHub zos-spark repository
• Jupyter Notebooks (Scala, Python Workbenches)
• Kernel gateway, Jupyter client, kernel toree
• Sample data & code snippets
– Rocket:
• Collaboration for Optimized Data Layer
• Industry vertical mappings, e.g. ISO8583-1, ACH, SMF, etc.
– Continuum:
• Access to z/OS channel on Anaconda cloud for updates / refreshes & package management
• Option to license private mirrored environment
• Services & Consulting for Python
Value: Increase Integration through Persisting Analytic Results for Enterprise Collaboration
VSAM
z/OS
DF Store:
• Specific Spark & Python Result Sets
• Backed by VSAM
• Leverage z/OS SAF, Dataset mgmt
Optimized Data Layer
Apache Spark Core
Spark Stream, Spark SQL, MLlib, GraphX
DF
Python 3.6.1 Core Packages:
• numpy
• scikit-learn
• dask
• pandas
• Matplotlib
• Etc.
IMS, DB2 z/OS
HDFS
JDBC / ODBC / REST, noSQL
Client accessing Spark RDDs, example: Cognos, Tableau, …
IBM Open Data Analytics for z/OS