Scaling Analytics with Apache Spark


  • Location:

    QuantUniversity Meetup

    August 8th, 2016

    Boston MA

    Scaling Analytics with Apache Spark

    © 2016 QuantUniversity LLC.

    Presented By:

    Sri Krishnamurthy, CFA, CAP

    www.QuantUniversity.com

    sri@quantuniversity.com

  • Slides and Code will be available at: http://www.analyticscertificate.com/SparkWorkshop/

  • - Analytics Advisory services
    - Custom training programs
    - Architecture assessments, advice and audits

  • Sri Krishnamurthy, Founder and CEO

    Founder of QuantUniversity LLC. and www.analyticscertificate.com

    Advisory and consultancy for financial analytics; prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers (Shell, Firstfuel Software, etc.)

    Regular columnist for Wilmott Magazine; author of the forthcoming book Financial Modeling: A Case Study Approach, published by Wiley

    Chartered Financial Analyst and Certified Analytics Professional

    Teaches analytics in the Babson College MBA program and at Northeastern University, Boston

    http://www.analyticscertificate.com/

  • Quantitative Analytics and Big Data Analytics Onboarding

    Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

    Launching the Analytics Certificate Program in September

  • (MATLAB version also available)

  • Quantitative Analytics and Big Data Analytics Onboarding

    Apply at: www.analyticscertificate.com

    Program starting September 18th

    Module 1: Sep 18th, 25th, Oct 2nd, 9th

    Module 2: Oct 16th, 23rd, 30th, Nov 6th

    Module 3: Nov 13th, 20th, Dec 4th, 11th

    Capstone + Certification Ceremony Dec 18th

    http://www.analyticscertificate.com/

  • Events of Interest

    August 14th-20th: ARPM in New York, www.arpm.co; QuantUniversity presenting on Model Risk on August 14th

    August 18th-21st: Big Data Bootcamp, http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html

    September 1st: QuantUniversity Meetup (AnalyticsCertificate program open house)

    September 11th, 12th: Spark Workshop, Boston

    September 19th, 20th: Anomaly Detection Workshop, New York


  • Agenda

    1. A quick introduction to Apache Spark

    2. A sample Spark Program

    3. Clustering using Apache Spark

    4. Regression using Apache Spark

    5. Simulation using Apache Spark

  • Apache Spark: Soaring in Popularity

    Ref: Wall Street Journal, http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008

  • What is Spark?

    Apache Spark is a fast and general engine for large-scale data processing.

    Came out of U.C. Berkeley's AMP Lab

    "Lightning-fast cluster computing"

    https://spark.apache.org/

  • Why Spark?

    Speed

    Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

    Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

  • Why Spark?

    # Word count, made runnable: assumes `sc` is the SparkContext;
    # the chained calls are wrapped in parentheses so the snippet parses
    text_file = sc.textFile("hdfs://...")
    counts = (text_file.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))

    Word count in Spark's Python API
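
    Note that the snippet above only defines transformations; because Spark evaluates lazily (see the RDD discussion below), an action such as counts.collect() is needed to actually run the job.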

    Ease of Use

    Write applications quickly in Java, Scala, Python, or R.

    Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells.

    R support was recently added.

  • Why Spark?

    Generality: combine SQL, streaming, and complex analytics.

    Spark powers a stack of high-level tools including:

    1. Spark Streaming: processing real-time data streams
    2. Spark SQL and DataFrames: support for structured data and relational queries
    3. MLlib: built-in machine learning library
    4. GraphX: Spark's API for graph processing

    https://spark.apache.org/docs/latest/streaming-programming-guide.html
    https://spark.apache.org/docs/latest/sql-programming-guide.html
    https://spark.apache.org/docs/latest/mllib-guide.html
    https://spark.apache.org/docs/latest/graphx-programming-guide.html
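
    As a minimal sketch of combining DataFrames and SQL (the file name and columns are hypothetical; assumes a Spark 1.6-style SQLContext named sqlContext):

    df = sqlContext.read.json("prices.json")   # hypothetical input file
    df.registerTempTable("prices")             # expose the DataFrame to SQL
    sqlContext.sql("SELECT symbol, AVG(close) AS avg_close "
                   "FROM prices GROUP BY symbol").show()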

  • Why Spark?

    Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3.

    You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.

    Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

    https://spark.apache.org/docs/latest/spark-standalone.html
    https://spark.apache.org/docs/latest/ec2-scripts.html
    http://mesos.apache.org/
    http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
    http://cassandra.apache.org/
    http://hbase.apache.org/
    http://hive.apache.org/
    http://tachyon-project.org/

  • Key Features of Spark

    Handles batch, interactive, and real-time within a single framework

    Native integration with Java, Python, Scala, R

    Programming at a higher level of abstraction

    More general: map/reduce is just one set of supported constructs

  • Secret Sauce: RDDs, Transformations, Actions

  • How does it work?

    Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

    Transformations create a new dataset from an existing one. All transformations in Spark are lazy: they do not compute their results right away; instead, they remember the transformations applied to some base dataset.

    Actions return a value to the driver program after running a computation on the dataset.
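
    A minimal sketch of this model in PySpark (assumes an existing SparkContext named sc; the data is hypothetical):

    data = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD from a local list
    squares = data.map(lambda x: x * x)           # transformation: nothing runs yet
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
    print(evens.collect())                        # action: triggers the job -> [4, 16]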

  • How is Spark different?

    MapReduce: Hadoop

  • Problems with this MR model

    Difficult to code

  • Getting started

    http://spark.apache.org/docs/latest/index.html

    http://datascience.ibm.com/

    https://community.cloud.databricks.com


  • Quick Demo

    Test_Notebook.ipynb

  • Machine learning with Spark

  • Use case 1: Segmenting stocks

    If we have a basket of stocks and their price history, how do we segment them into different clusters?

    What metrics could we use to measure similarity?

    Can we evaluate the effect of changing the number of clusters?

    Do the results seem actionable?

  • K-means

    Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ ($k \le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

    $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

    where $\mu_i$ is the mean of the points in $S_i$.

    Animation: http://shabal.in/visuals/kmeans/2.html
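
    A minimal PySpark sketch of clustering stocks with MLlib's K-means (the input file, its format, and k=5 are assumptions; the demo notebook may differ):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # Hypothetical input: one line per stock, a comma-separated series of returns
    lines = sc.textFile("stock_returns.csv")
    vectors = lines.map(lambda line: array([float(x) for x in line.split(",")]))

    model = KMeans.train(vectors, k=5, maxIterations=20)  # k chosen arbitrarily
    labels = vectors.map(lambda v: model.predict(v))      # cluster id per stock
    print(model.computeCost(vectors))                     # WCSS, the objective above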

  • Demo

    Kmeans spark case.ipynb


  • Use-case 2: Regression

    Given historical weekly interest-rate data for AAA bond yields, 10-year Treasuries, 30-year Treasuries, and Federal funds rates, build a regression model that fits:

    Changes to AAA = f(Changes to 10-year rates, Changes to 30-year rates, Changes to FF rates)

  • Linear regression

    Linear regression investigates the linear relationship between variables and predicts one variable from one or more others. It can be formulated as:

    $Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i$

    where $Y$ and the $X_i$ are random variables, the $\beta_i$ are regression coefficients, and $\beta_0$ is a constant.

    In this model, the ordinary least squares estimator is usually used, minimizing the squared differences between the observed dependent variable and the values predicted from the independent variables.

  • Ordinary Least Squares Regression
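
    A minimal sketch of the use case with MLlib (LinearRegressionWithSGD is an SGD-based approximation to OLS; the file name and column order are assumptions):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # Hypothetical rows: d_aaa, d_10y, d_30y, d_ff (weekly changes)
    def parse(line):
        d_aaa, d_10y, d_30y, d_ff = [float(x) for x in line.split(",")]
        return LabeledPoint(d_aaa, [d_10y, d_30y, d_ff])

    points = sc.textFile("rates.csv").map(parse)
    model = LinearRegressionWithSGD.train(points, iterations=100, intercept=True)
    print(model.weights, model.intercept)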

  • Demo

    Regression.ipynb

  • Scaling Monte-Carlo simulations

  • Example:

    Portfolio Growth

    Given: INVESTMENT_INIT = 100000 # starting amount

    INVESTMENT_ANN = 10000 # yearly new investment

    TERM = 30 # number of years

    MKT_AVG_RETURN = 0.11 # average annual market return (11%)

    MKT_STD_DEV = 0.18 # annual standard deviation (18%)

    Run 10,000 Monte Carlo simulation paths and compute the expected value of the portfolio at the end of 30 years.

    Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
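
    A minimal PySpark sketch of this simulation, drawing one normally distributed return per year (an assumption-laden simplification of the referenced solution):

    import random

    def grow(seed):
        random.seed(seed)                          # one independent path per seed
        portfolio = 100000.0                       # INVESTMENT_INIT
        for _ in range(30):                        # TERM years
            r = random.normalvariate(0.11, 0.18)   # MKT_AVG_RETURN, MKT_STD_DEV
            portfolio = portfolio * (1 + r) + 10000  # add INVESTMENT_ANN each year
        return portfolio

    paths = sc.parallelize(range(10000)).map(grow)  # 10,000 simulation paths
    print(paths.mean())                             # expected terminal portfolio value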

  • HyperLogLog

    The count-distinct problem is the problem of finding the number of distinct elements in a data stream with repeated elements.

    HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.

    Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as HyperLogLog, use significantly less memory, at the cost of obtaining only an approximation of the cardinality.

    Ref: https://en.wikip
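
    Spark exposes a HyperLogLog-based estimator directly on RDDs; a minimal sketch (the data is hypothetical, and assumes an existing SparkContext sc):

    users = sc.parallelize(["alice", "bob", "alice", "carol", "bob"])
    approx = users.countApproxDistinct(relativeSD=0.05)  # HLL-based estimate
    exact = users.distinct().count()                     # exact, needs a full shuffle
    print(approx, exact)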