Analytics with Spark - Meetup (files.meetup.com/16395762/Analytics_with_Spark.pdf)

© Cloudera, Inc. All rights reserved.

Analytics with Spark

Chen, Jianzhong

Cloudera

Agenda

• What Cloudera does for Spark Ecosystem

• Advanced Analytics with Spark

Spark Engineering in Cloudera

• Cloudera embraced Spark in early 2014

• Engineering with Intel to broaden Spark ecosystem

• Hive-on-Spark

• Pig-on-Spark

• Spark-over-YARN

• Spark Streaming Reliability

• General Spark Optimization

Hive on Spark

• Technology

• Hive: “standard” SQL tool in Hadoop

• Spark: next-gen distributed processing framework

• Hive + Spark

• Performance

• Minimal feature gap

• Industry

• Many customers are heavily invested in Hive

• Want to leverage the Spark engine

Design Principles

• No or limited impact on Hive’s existing code path

• Maximize code reuse

• Minimal feature customization

• Low future maintenance cost

Class Hierarchy

• TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler

• Task: MapRedTask, TezTask, SparkTask

• Work: MapRedWork, TezWork, SparkWork

• A TaskCompiler generates a Task; a Task is described by a Work

Work – Metadata for Task

• MapRedWork contains one MapWork and an optional ReduceWork

• SparkWork contains a graph of MapWorks and ReduceWorks

Query: select name, sum(value) as v from dec group by name order by v;

• MapReduce: two jobs, MR Job 1 (MapWork1, ReduceWork1) followed by MR Job 2 (MapWork2, ReduceWork2)

• Spark: one Spark job (MapWork1, ReduceWork1, ReduceWork2)
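The contrast between the two plans can be sketched in a few lines. This is plain Python, not Hive's Java classes: `WorkGraph` and its methods are hypothetical names made up for illustration, and only the work names come from the slide.

```python
from collections import defaultdict, deque


class WorkGraph:
    """Hypothetical stand-in for a graph of work units (not Hive's SparkWork API)."""

    def __init__(self):
        self.nodes = []
        self.children = defaultdict(list)

    def add(self, name):
        if name not in self.nodes:
            self.nodes.append(name)

    def connect(self, parent, child):
        self.add(parent)
        self.add(child)
        self.children[parent].append(child)

    def execution_order(self):
        """Return the works in dependency order (Kahn's topological sort)."""
        indegree = {n: 0 for n in self.nodes}
        for kids in self.children.values():
            for k in kids:
                indegree[k] += 1
        ready = deque(n for n in self.nodes if indegree[n] == 0)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for k in self.children[n]:
                indegree[k] -= 1
                if indegree[k] == 0:
                    ready.append(k)
        return order


# One graph covers both "group by" and "order by":
# the second reduce consumes the output of the first directly.
spark_work = WorkGraph()
spark_work.connect("MapWork1", "ReduceWork1")     # group by name, sum(value)
spark_work.connect("ReduceWork1", "ReduceWork2")  # order by v

# The MapReduce plan needs two jobs, each with one map and at most one reduce.
mr_jobs = [("MapWork1", "ReduceWork1"), ("MapWork2", "ReduceWork2")]

print(spark_work.execution_order())
```

In the MapReduce plan, MapWork2 exists only to read the first job's output back in; the graph form drops that step.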

Spark Client and Spark Context

• Spark Client

• Talking to Spark cluster

• Supports local, local-cluster, standalone, yarn-cluster, and yarn-client modes

• Job submission, monitoring, error reporting, statistics, metrics, counters

• Spark Context

• Core of Spark client

• Heavyweight and not thread-safe

• Designed for a single-user application

• Doesn’t work in a multi-session environment

Remote Spark Context

• Created and lives outside HiveServer2

• In yarn-cluster mode, the Spark context lives in the application master (AM)

• Otherwise, the Spark context lives in a separate process (other than HS2)

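[Figure: User 1 and User 2 each connect to their own session (Session 1, Session 2) in HiveServer2; each session has a corresponding AM (RSC) in the YARN cluster, which spans Node 1, Node 2, and Node 3.]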

Data Processing via Spark

• Treat a table as a HadoopRDD (the input RDD)

• Apply the function that wraps MR’s map-side processing

• Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)

• Apply the function that wraps MR’s reduce-side processing
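The four steps above can be sketched on an in-memory list. This is plain Python standing in for RDD operations, not Hive or Spark code: the table rows are made-up sample data, and every function name is hypothetical. It follows the shape of the query `select name, sum(value) as v from dec group by name order by v`.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of table `dec`: (name, value).
table = [("b", 5), ("a", 1), ("b", 2), ("c", 4), ("a", 2)]


def map_side(row):
    """Map-side processing: emit (key, value) for the group by."""
    name, value = row
    return (name, value)


def shuffle_group_by_key(pairs):
    """Stand-in for groupByKey: bring all values of a key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_sum(grouped):
    """Reduce-side processing for sum(value); re-key by the sum for the order by."""
    name, values = grouped
    return (sum(values), name)


def shuffle_sort_by_key(pairs):
    """Stand-in for sortByKey."""
    return sorted(pairs, key=itemgetter(0))


mapped = [map_side(row) for row in table]
summed = [reduce_sum(group) for group in shuffle_group_by_key(mapped)]
result = [(name, v) for v, name in shuffle_sort_by_key(summed)]
print(result)
```

The second shuffle feeds directly from the first reduce step, with no intermediate write and re-read between them.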

Spark Plan

• MapInput – encapsulates a table

• MapTran – map-side processing

• ShuffleTran – shuffling

• ReduceTran – reduce-side processing

Query: Select name, sum(value) as v from dec group by name order by v;

Current Status

• All functionality in Hive is implemented

• First round of optimization is completed

• Map join, SMB

• Split generation and grouping

• CBO, vectorization

• More optimization and benchmarking coming

• Beta in CDH

• http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/

• http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf

Advanced Analytics with Spark

• Written by the Cloudera data science team

• First-ever book bridging ML with the Hadoop ecosystem

• Focuses on use cases and examples rather than serving as a manual

• Targeted at data scientists solving real-world analysis problems

• Generally available in May 2015

Analyzing Big Data

• Build a model to detect credit card fraud using thousands of features and billions of transactions

• Intelligently recommend millions of products to millions of users

• Estimate financial risk through simulations of portfolios including millions of instruments

• Easily manipulate data from thousands of human genomes to detect genetic associations with disease

Challenges of Data Science

• Data preprocessing

• Varied, fast-moving data from multiple sources requires a powerful data pipeline

• Iteration

• Fundamental part of data science

• Speeding up data loading from disk helps considerably

• From lab to production

• Make data useful to non-data scientists

• Models become part of the production service and may need to be rebuilt periodically or even in real time.

Value at Risk

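• VaR (Value at Risk)

• The maximum potential loss that a cash position, asset portfolio, or institution could suffer from changes in market risk factors such as interest rates and exchange rates, over a given holding period and at a given confidence level.

• For example, with a holding period of 1 day and a confidence level of 99%, a computed VaR of RMB 100,000 means there is a 99% probability that the bank's portfolio will not lose more than RMB 100,000 in 1 day.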

• Rooted in the portfolio theory that Harry Markowitz introduced in 1952 (Nobel Prize in Economics, 1990)
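The definition on this slide can also be written in symbols. The notation below is an assumption and does not appear in the slides: L is the loss over the holding period and alpha is the confidence level.

```latex
\mathrm{VaR}_{\alpha}(L) = \inf\{\, \ell \in \mathbb{R} : P(L \le \ell) \ge \alpha \,\}
```

In the slide's example, alpha is 0.99, the holding period is 1 day, and the resulting VaR is RMB 100,000.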

Illustration for VaR

Methods for Calculating VaR

• Variance-Covariance

• Historical Simulation

• Monte Carlo Simulation
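Of the three methods, variance-covariance is the shortest to illustrate. The sketch below is a simplified single-position version that assumes normally distributed returns. The function name, the sample returns, and the portfolio value are all hypothetical, and a real portfolio would use the full covariance matrix across positions.

```python
from statistics import NormalDist, mean, stdev


def parametric_var(returns, confidence=0.99, portfolio_value=1.0):
    """Variance-covariance VaR for a single position, assuming normal returns.

    Returns the loss (as a positive number) that should not be exceeded
    with probability `confidence` over the horizon of the input returns.
    """
    mu = mean(returns)
    sigma = stdev(returns)
    # Quantile of the standard normal in the loss tail (negative, about -2.33 at 99%).
    z = NormalDist().inv_cdf(1.0 - confidence)
    worst_return = mu + z * sigma
    return -worst_return * portfolio_value


# Hypothetical daily returns, for illustration only.
daily_returns = [0.01, -0.02, 0.005, 0.015, -0.01, 0.0, 0.02, -0.015]
print(parametric_var(daily_returns, confidence=0.99, portfolio_value=1_000_000))
```

Historical simulation replaces the normal quantile with an empirical quantile of past returns, and Monte Carlo simulation replaces it with a quantile of simulated returns.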

Estimating through Monte Carlo Simulation

[Figure: Monte Carlo pipeline with the stages Normalize Input, Modeling (instruments), Sampling (market factors, labeled Bonds, Oil, SNP), and Monte Carlo Simulation.]

Monte Carlo Simulation with Spark

• Normalize data

• Fill in missing values

• Transform the historical data into two-week returns
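A minimal sketch of these two steps, in plain Python on a list of prices. The slides do not say how missing values are filled, so forward-filling with the last known price is an assumption here, as is treating two weeks as 10 trading days. Both function names are hypothetical.

```python
def fill_missing(prices):
    """Replace None with the last known price (assumed strategy: forward fill).

    Leading gaps are filled with the first known price.
    """
    last = next(p for p in prices if p is not None)
    filled = []
    for p in prices:
        if p is not None:
            last = p
        filled.append(last)
    return filled


def window_returns(prices, window=10):
    """Relative return over each sliding window of `window` trading days."""
    return [
        (prices[i + window] - prices[i]) / prices[i]
        for i in range(len(prices) - window)
    ]


# Hypothetical price history with gaps; a short window keeps the example small.
raw = [None, 100.0, None, 102.0, 101.0]
clean = fill_missing(raw)
print(clean)
print(window_returns(clean, window=2))
```

With real data the window would stay at its default of 10, giving one overlapping two-week return per trading day.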

Modeling

• Define factor features

• Use a regression model to compute factor weights
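As a rough illustration of the regression step, the sketch below fits ordinary least squares in plain Python by solving the normal equations. The slides do not specify the feature set, so the features here are simply the raw factor returns. The function names and sample numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_factor_weights(factor_rows, instrument_returns):
    """Ordinary least squares; returns [intercept, weight_1, weight_2, ...]."""
    xs = [[1.0] + list(row) for row in factor_rows]  # prepend the intercept term
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, instrument_returns)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical two-factor history; the instrument follows 0.5 + 2*f1 - 1*f2 exactly.
factors = [(0.01, 0.02), (0.03, -0.01), (-0.02, 0.0), (0.0, 0.04), (0.02, 0.01)]
returns = [0.5 + 2.0 * f1 - 1.0 * f2 for f1, f2 in factors]
print(fit_factor_weights(factors, returns))
```

One such fit per instrument yields a weight vector that turns a sampled set of factor returns into an instrument return.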

Sampling

• Take the correlation information between the factors into account

• If S&P is down, the Dow is likely to be down as well
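One common way to respect correlations between factors is to sample from a multivariate normal distribution using the Cholesky factor of the covariance matrix. The slides do not name the distribution used, so that choice is an assumption, and the function names and the 0.9 correlation below are hypothetical.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular matrix `low` with low * low^T = cov.

    `cov` must be symmetric and positive definite.
    """
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def sample_factors(means, cov, rng):
    """Draw one correlated sample of factor returns from a multivariate normal."""
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in means]  # independent standard normals
    return [
        mu + sum(low[i][k] * z[k] for k in range(i + 1))
        for i, mu in enumerate(means)
    ]


# Two hypothetical index factors with correlation 0.9.
cov = [[1.0, 0.9], [0.9, 1.0]]
rng = random.Random(42)
print(sample_factors([0.0, 0.0], cov, rng))
```

Because the two factors share the first random draw, a sample in which one index is down will usually have the other down as well.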

Simulation

• Broadcast instruments to each node

• Parallelize trial computation across workers

• Compute the trial returns
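The structure of this step can be sketched without a cluster. The version below uses a Python thread pool only to show how the work is divided: every worker gets the same instrument weights, its own seed, and an equal share of the trials. The instrument weights, function names, and the independent-normal factor sampling are all assumptions made to keep the example short. In Spark, the matching pieces would be `SparkContext.broadcast` for the instruments and `parallelize` plus `flatMap` over the seeds.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical instruments: each is [intercept, weight for factor 1, weight for factor 2].
INSTRUMENTS = [[0.0, 1.0, 0.5], [0.001, -0.3, 0.8]]


def run_trials(seed, num_trials, instruments):
    """Run `num_trials` trials with a worker-specific seed.

    Returns the total portfolio return of each trial.
    """
    rng = random.Random(seed)
    num_factors = len(instruments[0]) - 1
    results = []
    for _ in range(num_trials):
        # Independent normal factors keep the sketch short; a fuller version
        # would draw correlated factor returns instead.
        sampled = [rng.gauss(0.0, 0.02) for _ in range(num_factors)]
        total = sum(
            w[0] + sum(wi * f for wi, f in zip(w[1:], sampled))
            for w in instruments
        )
        results.append(total)
    return results


def simulate(total_trials, parallelism, instruments, base_seed=0):
    """Split trials across workers, one seed per worker, and gather all results."""
    per_worker = total_trials // parallelism
    seeds = [base_seed + i for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        chunks = pool.map(lambda s: run_trials(s, per_worker, instruments), seeds)
        return [r for chunk in chunks for r in chunk]


trials = simulate(total_trials=1000, parallelism=4, instruments=INSTRUMENTS)
print(len(trials), min(trials), max(trials))
```

Giving each worker a distinct seed keeps the workers from producing identical trial sequences while leaving the whole run reproducible.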

Evaluate the risk value

• VaR

• Return the cutOff value

• CVaR (Conditional Value at Risk)

• The average of the losses beyond the VaR cutoff
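Given a list of simulated trial returns, both measures reduce to sorting and slicing. The sketch below uses a 5% tail by default, which is an assumption since the slides do not state the level. The function names and the sample returns are hypothetical.

```python
def value_at_risk(trial_returns, tail=0.05):
    """Return at the boundary of the worst `tail` fraction of trials."""
    ordered = sorted(trial_returns)
    cutoff = max(1, int(len(ordered) * tail))
    return ordered[cutoff - 1]


def conditional_value_at_risk(trial_returns, tail=0.05):
    """Average return over the worst `tail` fraction of trials."""
    ordered = sorted(trial_returns)
    cutoff = max(1, int(len(ordered) * tail))
    worst = ordered[:cutoff]
    return sum(worst) / len(worst)


# Hypothetical trial returns from -0.10 to 0.09 in steps of 0.01.
sample = [i / 100 for i in range(-10, 10)]
print(value_at_risk(sample, tail=0.1))
print(conditional_value_at_risk(sample, tail=0.1))
```

CVaR averages over the whole tail, so it is never better than VaR and is more sensitive to a few extreme losses.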

Q&A

Thank you