Scaling Analytics with Apache Spark

Transcript of Scaling Analytics with Apache Spark

  • Location:

    QuantUniversity Meetup

    August 8th 2016

    Boston MA

    Scaling Analytics with Apache Spark

    © 2016 QuantUniversity LLC.

    Presented By:

    Sri Krishnamurthy, CFA, CAP

  • 2

    Slides and Code will be available at:

  • - Analytics Advisory services
    - Custom training programs
    - Architecture assessments, advice and audits

  • Founder of QuantUniversity LLC., an advisory and consultancy for financial analytics

    Prior experience at MathWorks, Citigroup and Endeca, and 25+ financial services and energy customers (Shell, FirstFuel Software, etc.)

    Regular columnist for the Wilmott Magazine; author of the forthcoming book Financial Modeling: A Case Study Approach, published by Wiley

    Chartered Financial Analyst and Certified Analytics Professional

    Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

    Sri Krishnamurthy, Founder and CEO


  • 5

    Quantitative Analytics and Big Data Analytics Onboarding

    Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

    Launching the Analytics Certificate Program in September

  • (MATLAB version also available)

  • 7

    Quantitative Analytics and Big Data Analytics Onboarding

    Apply at:

    Program starting September 18th

    Module 1: Sep 18th, 25th, Oct 2nd, 9th

    Module 2: Oct 16th, 23rd, 30th, Nov 6th

    Module 3: Nov 13th, 20th, Dec 4th, Dec 11th

    Capstone + Certification Ceremony Dec 18th

  • 8

    August 14-20th : ARPM in New York

    QuantUniversity presenting on Model Risk on August 14th

    18th-21st: Big-data Bootcamp

    September 1st: QuantUniversity Meetup (Analytics Certificate program open house)

    11th, 12th : Spark Workshop, Boston

    19th, 20th : Anomaly Detection Workshop, New York

    Events of Interest

  • 9

  • Agenda

    1. A quick introduction to Apache Spark

    2. A sample Spark Program

    3. Clustering using Apache Spark

    4. Regression using Apache Spark

    5. Simulation using Apache Spark

  • Apache Spark : Soaring in Popularity

    Ref: Wall Street Journal

  • What is Spark ?

    Apache Spark is a fast and general engine for large-scale data processing.

    Came out of U.C. Berkeley's AMPLab

    Lightning-fast cluster computing

  • Why Spark ?


    Run programs up to 100x faster than Hadoop MapReduce

    in memory, or 10x faster on disk.

    Spark has an advanced DAG execution engine that

    supports cyclic data flow and in-memory computing.

  • Why Spark ?

    text_file = sc.textFile("hdfs://...")

    counts = text_file.flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

    Word count in Spark's Python API
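    To see what flatMap, map, and reduceByKey each compute, the same word count can be mimicked in plain Python. This is an illustrative stand-in, not Spark code, and the sample `lines` are invented:

    ```python
    lines = ["to be or not to be", "to do is to be"]

    # flatMap: split each line into words and flatten the result
    words = [w for line in lines for w in line.split()]

    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]

    # reduceByKey: sum the counts for each distinct word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n

    print(counts["to"], counts["be"])   # 4 3
    ```

    In Spark the same three steps run in parallel across partitions of the input file.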

    Ease of Use

    Write applications quickly in Java, Scala or Python

    Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.

    R support recently added

  • Why Spark ?

    Generality: Combine SQL, streaming, and complex analytics.

    Spark powers a stack of high-level tools including:

    1. Spark Streaming: processing real-time data streams
    2. Spark SQL and DataFrames: support for structured data and relational queries
    3. MLlib: built-in machine learning library
    4. GraphX: Spark's new API for graph processing


  • Why Spark?

    Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

    You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.

    Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

  • Key Features of Spark

    Handles batch, interactive, and real-time within a single framework

    Native integration with Java, Python, Scala, R

    Programming at a higher level of abstraction

    More general: map/reduce is just one set of supported constructs

  • Secret Sauce : RDD, Transformation, Action

  • How does it work?

    Resilient Distributed Datasets (RDD) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

    Transformations create a new dataset from an existing one. All transformations in Spark are lazy: they do not compute their results right away; instead, they remember the transformations applied to some base dataset.

    Actions return a value to the driver program after running a computation on the dataset.
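    The lazy-transformation idea can be imitated in plain Python with generator expressions: nothing is computed until an action-like call (here `sum`) forces the pipeline to run. A rough analogy, not Spark code:

    ```python
    data = range(1, 6)

    # "Transformations": build a lazy pipeline; no element is touched yet
    squared = (x * x for x in data)             # like rdd.map(...)
    evens = (x for x in squared if x % 2 == 0)  # like rdd.filter(...)

    # "Action": forces the whole pipeline to run and returns a value
    total = sum(evens)                          # like rdd.sum()
    print(total)   # 4 + 16 = 20
    ```

    In Spark, laziness lets the DAG scheduler see the whole chain of transformations before executing, so it can pipeline steps and recompute lost partitions for fault tolerance.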

  • How is Spark different?

    MapReduce (Hadoop)

  • Problems with this MR model

    Difficult to code

  • Getting started

  • Quick Demo


  • Machine learning with Spark

  • Machine learning with Spark

  • 26

    Machine learning with Spark

  • Use case 1 : Segmenting stocks

    If we have a basket of stocks and their price history, how do we segment them into different clusters?

    What metrics could we use to measure similarity?

    Can we evaluate the effect of changing the number of clusters?

    Do the results seem actionable?

  • K-means

    Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

    arg min over S of  Σ_{i=1..k} Σ_{x ∈ Si} ‖x − μi‖²

    where μi is the mean of points in Si.
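    As an illustration of the WCSS objective, here is a minimal plain-Python sketch of Lloyd's algorithm, the standard k-means iteration; the demo itself uses Spark MLlib, and the sample points below are invented:

    ```python
    import random

    def dist2(a, b):
        """Squared Euclidean distance between two points."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(pts):
        """Component-wise mean of a non-empty list of points."""
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    def kmeans(points, k, iters=20, seed=0):
        """Lloyd's algorithm: alternate assignment and centroid update.
        Returns (centroids, WCSS)."""
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
                clusters[nearest].append(p)
            centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        wcss = sum(min(dist2(p, c) for c in centroids) for p in points)
        return centroids, wcss

    # Two well-separated groups of 2-D points (invented data)
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, wcss = kmeans(points, k=2)
    print(wcss)   # small WCSS once the two groups are separated
    ```

    For segmenting stocks, each point would instead be a feature vector derived from the price history (e.g. return and volatility metrics).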

  • Demo

    Kmeans spark case.ipynb

    http://localhost:8888/notebooks/K-means/Kmeans spark case.ipynb

  • Use-case 2: Regression

    Given historical weekly interest data of AAA bond yields, 10-year treasuries, 30-year treasuries and Federal Funds rates, build a regression model that fits:

    Changes to AAA = function of (Changes to 10-year rates, Changes to 30-year rates, Changes to FF rates)

  • Linear regression

    Linear regression investigates the linear relationship between variables and predicts one variable based on one or more other variables. It can be formulated as:

    Y = β0 + βX

    where Y and X are random variables, β is the regression coefficient and β0 is a constant.

    In this model, the ordinary least squares estimator is usually used: it minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
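    A minimal sketch of the closed-form OLS estimator for the one-predictor case (the bond-yield use case has three predictors, for which the demo uses Spark's MLlib regression; the numbers below are invented):

    ```python
    def ols_fit(xs, ys):
        """Closed-form OLS for y = b0 + b1 * x (one predictor)."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        # Slope: covariance of x and y over variance of x
        b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
        # Intercept: line passes through the mean point
        b0 = my - b1 * mx
        return b0, b1

    # Invented data lying exactly on y = 1 + 2x
    b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
    print(b0, b1)   # 1.0 2.0
    ```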


  • Ordinary Least Squares Regression

  • Demo


  • Scaling Monte-Carlo simulations

  • Example:

    Portfolio Growth

    Given: INVESTMENT_INIT = 100000 # starting amount

    INVESTMENT_ANN = 10000 # yearly new investment

    TERM = 30 # number of years

    MKT_AVG_RETURN = 0.11 # percentage

    MKT_STD_DEV = 0.18 # standard deviation

    Run 10,000 Monte Carlo simulation paths and compute the expected value of the portfolio at the end of 30 years
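    A serial plain-Python sketch of the simulation, assuming annual returns are drawn i.i.d. from a normal distribution with the stated mean and standard deviation; on Spark the paths would be distributed across workers, e.g. via `sc.parallelize(...).map(...)`:

    ```python
    import random
    import statistics

    INVESTMENT_INIT = 100000    # starting amount
    INVESTMENT_ANN = 10000      # yearly new investment
    TERM = 30                   # number of years
    MKT_AVG_RETURN = 0.11       # mean annual return
    MKT_STD_DEV = 0.18          # std. deviation of annual return

    def one_path(rng):
        """Grow the portfolio for TERM years along one random path."""
        value = INVESTMENT_INIT
        for _ in range(TERM):
            annual_return = rng.gauss(MKT_AVG_RETURN, MKT_STD_DEV)
            value = value * (1 + annual_return) + INVESTMENT_ANN
        return value

    def simulate(n_paths, seed=42):
        """Average terminal value over n_paths Monte Carlo paths."""
        rng = random.Random(seed)
        return statistics.mean(one_path(rng) for _ in range(n_paths))

    expected_value = simulate(10000)
    print(f"{expected_value:,.0f}")
    ```

    The embarrassingly parallel structure (paths are independent) is what makes this workload a natural fit for Spark.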


  • 36

    The count-distinct problem is the problem of finding the number of distinct elements in a data stream with repeated elements.

    HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.

    Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality.
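    A minimal plain-Python sketch of the HyperLogLog estimator described above; the precision parameter, hash function (SHA-1) and test data are illustrative choices, not taken from the talk:

    ```python
    import hashlib
    import math

    P = 10                              # precision: 2**10 = 1024 registers
    M = 1 << P
    ALPHA = 0.7213 / (1 + 1.079 / M)    # bias-correction constant for large M

    registers = [0] * M

    def hll_add(item):
        """Hash the item; top P bits pick a register, the rest set its rank."""
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)

    def hll_count():
        """Harmonic-mean estimate of the number of distinct items seen."""
        est = ALPHA * M * M / sum(2.0 ** -r for r in registers)
        zeros = registers.count(0)
        if est <= 2.5 * M and zeros:              # small-range correction
            est = M * math.log(M / zeros)
        return est

    # 100,000 additions but only 20,000 distinct items
    for i in range(100000):
        hll_add(f"user-{i % 20000}")

    estimate = hll_count()
    print(round(estimate))   # close to 20,000
    ```

    Memory use is fixed at M small registers regardless of stream size, which is the whole point: the exact answer would require remembering every distinct element.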


    Ref: https://en.wikip