© Cloudera, Inc. All rights reserved.

Analytics with Spark

Chen, Jianzhong

Cloudera


Agenda

• What Cloudera does for the Spark ecosystem

• Advanced Analytics with Spark


Spark Engineering in Cloudera

• Cloudera embraced Spark in early 2014

• Joint engineering with Intel to broaden the Spark ecosystem

• Hive-on-Spark

• Pig-on-Spark

• Spark-over-YARN

• Spark Streaming Reliability

• General Spark Optimization


Hive on Spark

• Technology

• Hive: “standard” SQL tool in Hadoop

• Spark: next-gen distributed processing framework

• Hive + Spark

• Performance

• Minimum feature gap

• Industry

• Many customers have invested heavily in Hive

• They want to leverage the Spark engine


Design Principles

• No or limited impact on Hive’s existing code path

• Maximize code reuse

• Minimum feature customization

• Low future maintenance cost


Class Hierarchy

• TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler

• Task: MapRedTask, TezTask, SparkTask

• Work: MapRedWork, TezWork, SparkWork

• A TaskCompiler generates a Task; a Task is described by a Work


Work – Metadata for Task

• MapReduceWork contains one MapWork and a possible ReduceWork

• SparkWork contains a graph of MapWorks and ReduceWorks

Query: select name, sum(value) as v from dec group by name order by v;

• MapReduce plan: MR Job 1 (MapWork1 → ReduceWork1), then MR Job 2 (MapWork2 → ReduceWork2)

• Spark plan: one Spark job (MapWork1 → ReduceWork1 → ReduceWork2)
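
The contrast between the MapReduce and Spark plans for this query can be sketched with a small graph structure. The `WorkGraph` class and its methods are hypothetical names for illustration, not Hive's actual SparkWork API; the sketch only shows that a graph can chain two reducers in one job, where MapReduce needs a second job with its own map stage.

```python
from collections import defaultdict

class WorkGraph:
    """Hypothetical stand-in for a graph of map and reduce work units."""

    def __init__(self):
        self.children = defaultdict(list)  # work name -> downstream work names
        self.nodes = []                    # work names in insertion order

    def connect(self, parent, child):
        for name in (parent, child):
            if name not in self.nodes:
                self.nodes.append(name)
        self.children[parent].append(child)

    def topological_order(self):
        """Return work units so that every parent precedes its children."""
        indegree = {n: 0 for n in self.nodes}
        for kids in self.children.values():
            for k in kids:
                indegree[k] += 1
        ready = [n for n in self.nodes if indegree[n] == 0]
        order = []
        while ready:
            node = ready.pop(0)
            order.append(node)
            for k in self.children[node]:
                indegree[k] -= 1
                if indegree[k] == 0:
                    ready.append(k)
        return order

# One Spark job: the group-by reducer feeds the order-by reducer directly.
spark_work = WorkGraph()
spark_work.connect("MapWork1", "ReduceWork1")     # group by name
spark_work.connect("ReduceWork1", "ReduceWork2")  # order by v

# The same query on MapReduce: two jobs, each one map plus one reduce.
mr_jobs = [("MapWork1", "ReduceWork1"), ("MapWork2", "ReduceWork2")]
```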


Spark Client and Spark Context

• Spark Client

• Talks to the Spark cluster

• Supports local, local-cluster, standalone, yarn-cluster, yarn-client

• Job submission, monitoring, error reporting, statistics, metrics, counters

• Spark Context

• Core of Spark client

• Heavyweight and not thread-safe

• Designed for a single-user application

• Does not work in a multi-session environment


Remote Spark Context

• Created and lives outside HiveServer2 (HS2)

• In yarn-cluster mode, the Spark context lives in the application master (AM)

• Otherwise, the Spark context lives in a separate process (other than HS2)

[Diagram: User 1 and User 2 connect to HiveServer2 as Session 1 and Session 2; each session has its own AM (RSC) in the YARN cluster, which spans Node 1, Node 2, and Node 3]
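
The one-context-per-session idea can be sketched as a small registry. `RemoteContextHandle` and `SessionManager` are hypothetical names, and the "remote" side is only simulated here; the sketch assumes nothing about how Hive actually talks to the remote process.

```python
import itertools

class RemoteContextHandle:
    """Hypothetical handle to a Spark context running in another process."""
    _ids = itertools.count(1)

    def __init__(self, session_id):
        self.session_id = session_id
        self.context_id = next(self._ids)  # stands in for a remote process or AM
        self.submitted = []

    def submit(self, job):
        # A real client would send the job over RPC; here we only record it.
        self.submitted.append(job)
        return f"context-{self.context_id}:{job}"

class SessionManager:
    """Keeps exactly one remote context per session, created on first use."""

    def __init__(self):
        self._contexts = {}

    def context_for(self, session_id):
        if session_id not in self._contexts:
            self._contexts[session_id] = RemoteContextHandle(session_id)
        return self._contexts[session_id]

    def close(self, session_id):
        # Dropping the handle models shutting the remote context down.
        self._contexts.pop(session_id, None)
```

Because no context is shared between sessions, the thread-safety limits of a single Spark context never come into play inside the server process.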


Data Processing via Spark

• Treat a table as a HadoopRDD (the input RDD)

• Apply the function that wraps MR’s map-side processing

• Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)

• Apply the function that wraps MR’s reduce-side processing
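
The four steps can be imitated in plain Python for the group-by/order-by query used in this deck. This is a sketch only: the lists stand in for RDDs, `group_by_key` and `sort_by_key` mimic what Spark's groupByKey and sortByKey do, and the table rows and function names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of table `dec`: (name, value)
table = [("a", 3), ("b", 1), ("a", 4), ("c", 2), ("b", 5)]

def map_side(row):
    """Stand-in for the wrapped map-side operators: emit (key, value)."""
    name, value = row
    return (name, value)

def group_by_key(pairs):
    """Shuffle stand-in: bring all values for one key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=itemgetter(0))]

def reduce_side(grouped):
    """Stand-in for the wrapped reduce-side operators: sum(value) per name."""
    key, values = grouped
    return (key, sum(values))

def sort_by_key(pairs):
    """Second shuffle stand-in: global ordering on the key."""
    return sorted(pairs, key=itemgetter(0))

mapped = [map_side(r) for r in table]
summed = [reduce_side(g) for g in group_by_key(mapped)]
# order by v: re-key on the aggregate, sort, then restore (name, v)
result = [(name, v) for v, name in sort_by_key([(v, name) for name, v in summed])]
```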


Spark Plan

• MapInput – encapsulates a table

• MapTran – map-side processing

• ShuffleTran – shuffling

• ReduceTran – reduce-side processing

Query: Select name, sum(value) as v from dec group by name order by v;
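
One way to picture how such a plan is assembled is to walk the work units in order and emit the matching plan pieces. The function below is a hypothetical sketch; it assumes every ReduceWork is preceded by a shuffle, which fits the query above but is not taken from Hive's source.

```python
def build_plan(works):
    """Translate an ordered list of work names into plan steps.

    Assumption for this sketch: every ReduceWork is preceded by a shuffle.
    """
    plan = []
    for work in works:
        if work.startswith("MapWork"):
            plan.append(("MapInput", work))
            plan.append(("MapTran", work))
        elif work.startswith("ReduceWork"):
            plan.append(("ShuffleTran", work))
            plan.append(("ReduceTran", work))
        else:
            raise ValueError(f"unknown work type: {work}")
    return plan

# group by name -> ReduceWork1, order by v -> ReduceWork2
plan = build_plan(["MapWork1", "ReduceWork1", "ReduceWork2"])
```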


Current Status

• All functionality in Hive is implemented

• First round of optimization is completed

• Map join, SMB

• Split generation and grouping

• CBO, vectorization

• More optimization and benchmarking coming

• Beta in CDH

• http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/

• http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf


Advanced Analytics with Spark

• Written by the Cloudera data science team

• The first book to bridge machine learning and the Hadoop ecosystem

• Focuses on use cases and examples rather than serving as a manual

• Aimed at data scientists solving real-world analysis problems

• Generally available in May 2015


Analyzing Big Data

• Build a model to detect credit card fraud using thousands of features and billions of transactions

• Intelligently recommend millions of products to millions of users

• Estimate financial risk through simulations of portfolios including millions of instruments

• Easily manipulate data from thousands of human genomes to detect genetic associations with disease


Challenges of Data Science

• Data preprocessing

• Varied, fast-arriving data from multiple sources requires a powerful data pipeline

• Iteration

• A fundamental part of data science

• Faster loading of data from disk helps considerably

• From lab to production

• Make data useful to non-data scientists

• Models become part of the production service and may need to be rebuilt periodically or even in real time


Value at Risk

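• VaR (value at risk, also rendered as "return at risk")

• The maximum potential loss to a cash position, portfolio, or institution that could result from changes in market risk factors such as interest rates and exchange rates, over a given holding period and at a given confidence level.

• For example, with a holding period of 1 day and a confidence level of 99%, a computed VaR of RMB 100,000 means there is a 99% probability that the bank's portfolio will not lose more than RMB 100,000 in one day.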

• Builds on the portfolio theory introduced by Harry Markowitz in 1952 (Nobel Prize in Economics, 1990)


Illustration for VaR


Methods for Calculating VaR

• Variance-Covariance

• Historical Simulation

• Monte Carlo Simulation
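
For comparison with the Monte Carlo approach covered next, the first two methods can be sketched in a few lines. These are common textbook forms, not taken from the slides: the historical version reads the loss off the empirical tail, and the variance-covariance version assumes normally distributed returns. The function names and the 95% default are illustrative choices.

```python
from statistics import NormalDist, mean, stdev

def historical_var(returns, confidence=0.95):
    """Loss at the cutoff of the worst (1 - confidence) share of past returns."""
    ordered = sorted(returns)
    count = max(1, int(len(ordered) * (1 - confidence)))
    return -ordered[count - 1]

def variance_covariance_var(returns, confidence=0.95):
    """Loss at the same quantile, assuming returns are normally distributed."""
    z = NormalDist().inv_cdf(1 - confidence)  # negative for confidence > 0.5
    return -(mean(returns) + z * stdev(returns))
```

Both functions report the loss as a positive number; other presentations report the (negative) return instead.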


Estimating through Monte Carlo Simulation

[Diagram: pipeline of Normalize Input, Modeling, Sampling, and Monte Carlo Simulation; the instrument is modeled on market factors such as bonds, oil, and the S&P]


Monte Carlo Simulation with Spark

• Normalize data

• Fill in missing values

• Transform the historical data into two-week returns
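
A minimal sketch of these two steps on a single price series follows. The slides do not say how gaps are filled, so forward-fill with a back-fill for a leading gap is an assumption here, as is treating two weeks as a window of 10 trading days.

```python
def fill_missing(prices):
    """Forward-fill gaps (None), then back-fill any gap at the start."""
    filled = list(prices)
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled

def windowed_returns(prices, window=10):
    """Relative change over each sliding window of `window` observations."""
    return [(prices[i + window] - prices[i]) / prices[i]
            for i in range(len(prices) - window)]
```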


Modeling

• Define factor features

• Use a regression model to compute factor weights
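
The regression step can be sketched as ordinary least squares solved through the normal equations. This is one plausible reading of "regression model", not necessarily the one used in the talk; the helper names are hypothetical and the solver is kept to the standard library for self-containment.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_factor_weights(factor_rows, instrument_returns):
    """Least squares fit; returns [intercept, w1, w2, ...] for one instrument."""
    X = [[1.0] + list(row) for row in factor_rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, instrument_returns)) for i in range(k)]
    return solve(xtx, xty)
```

Each row of `factor_rows` holds the factor returns for one time window, and `instrument_returns` holds the instrument's return over the same windows.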


Sampling

• Take the correlation information between the factors into account

• If S&P is down, the Dow is likely to be down as well
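
One common way to respect such correlations is to sample from a multivariate normal distribution using a Cholesky factor of the factor covariance matrix. The sketch below assumes that distribution, which the slides do not state; the function names are hypothetical.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L @ L^T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_factors(means, chol, rng):
    """Draw one correlated factor-return vector from N(means, chol @ chol^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in means]  # independent standard normals
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(means)]
```

With a strongly positive covariance between two factors, most draws move both factors in the same direction, which is the behavior the S&P and Dow example describes.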


Simulation

• Broadcast instruments to each node

• Parallelize trial computation across workers

• Compute the trial returns
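
The shape of this step can be sketched without a cluster. In the version below a thread pool stands in for Spark workers, so it illustrates the fan-out rather than real parallel speedup; on Spark the instruments would typically be shared with `SparkContext.broadcast` and the seeds distributed with `SparkContext.parallelize` followed by `flatMap`. The instrument weights, factor volatility, and function names are made up for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical instruments: [intercept, weight_factor1, weight_factor2].
INSTRUMENTS = [[0.0, 1.2, 0.3], [0.001, -0.4, 0.9]]

def trial_return(instruments, factor_returns):
    """Total portfolio return for one sampled set of factor returns."""
    total = 0.0
    for intercept, *weights in instruments:
        total += intercept + sum(w * f for w, f in zip(weights, factor_returns))
    return total

def run_trials(seed, num_trials, instruments=INSTRUMENTS):
    """One worker's share of trials, driven by its own seeded generator."""
    rng = random.Random(seed)
    returns = []
    for _ in range(num_trials):
        # Independent factors keep the sketch short; a fuller version would
        # draw correlated factor returns instead.
        factors = [rng.gauss(0.0, 0.02) for _ in range(2)]
        returns.append(trial_return(instruments, factors))
    return returns

def simulate(seeds, trials_per_seed):
    """Fan the seeds out to workers and flatten the per-worker results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = pool.map(lambda s: run_trials(s, trials_per_seed), seeds)
        return [r for chunk in chunks for r in chunk]
```

Giving each worker its own seed makes the run reproducible regardless of how the work is scheduled.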


Evaluate the risk value

• VaR

• The return at the cutoff of the worst trials (for example, the worst 5%)

• CVaR (Conditional Value at Risk)

• The average loss across the trials beyond the VaR cutoff
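
Given a list of simulated trial returns, both measures reduce to sorting and slicing. The sketch below assumes a 5% tail and reports values as returns, so losses appear as negative numbers; the function names are illustrative.

```python
def value_at_risk(trial_returns, tail=0.05):
    """Return at the cutoff of the worst `tail` fraction of trials."""
    ordered = sorted(trial_returns)
    cutoff = max(1, int(len(ordered) * tail))
    return ordered[cutoff - 1]

def conditional_value_at_risk(trial_returns, tail=0.05):
    """Average return over the worst `tail` fraction of trials."""
    ordered = sorted(trial_returns)
    cutoff = max(1, int(len(ordered) * tail))
    worst = ordered[:cutoff]
    return sum(worst) / len(worst)
```

CVaR is never better than VaR for the same tail, because it averages the cutoff value together with everything worse than it.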


Q&A


Thank you
