Interactive Visual Data Exploration with Spark in Databricks Cloud Hossein Falaki @mhfalaki


  • Interactive Visual Data Exploration with Spark in Databricks Cloud

    Hossein Falaki @mhfalaki

  • About Databricks

    Founded by creators of Apache Spark

    Offers Spark as a service in the cloud

    Dedicated to open source Spark
    > Largest organization contributing to Apache Spark
    > Drive the roadmap

  • Databricks Cloud

    Databricks Workspace
    > Notebooks
    > Dashboards
    > Job launcher

    Databricks Platform
    > Start clusters in seconds
    > Dynamically scale up & down

    Spark
    > Latest version
    > Configured / Optimized

  • Spark

    Fast & general distributed computing engine: batch, streaming, iterative

    Capable of handling petabytes of data

    Even faster by caching data in memory

    Versatile programming interfaces

  • Spark: Mixing SQL with Python/Scala

    # Query an existing table and get the results back as a SchemaRDD
    rdd = hiveContext.sql("select article, text from wikipedia")

    # Perform transformations
    words = rdd.flatMap(lambda r: r.text.split())

    # Collect a sample of the data on the driver machine
    sampled_words = words.sample(False, 0.001).collect()

  • Databricks Platform

    Start clusters in seconds

    Zero-cost management

    Dynamically scale up and down

  • Databricks Workspace

    Notebooks
    > SQL
    > Python
    > Scala

    Dashboards

    Job Launcher

  • Notebooks

    Supports Python, Scala, SQL

    Interactive commands and plots

    On-line collaboration

  • Dashboards

    WYSIWYG builder

    Interactive jobs

    One-click publishing

    Exporting from notebooks

  • Job Launcher

    Runs arbitrary Spark jobs programmatically


  • Expository vs. Exploratory

  • Large data

  • “Visualization is critical to data analysis.” (William S. Cleveland)

    But we often skip exploratory visualization with large data

  • Challenges

    1. Interactivity with large data is challenging

    2. Visual medium cannot accommodate as many pixels as data points

  • Solutions

    1. Interactivity
    > In-memory computation
    > High parallelism

  • Reducing interaction latency with Spark

    1. In-memory computation
    > Significantly reduces latency

    2. High parallelism
    > Get more executors with Mesos or YARN: a challenge in itself
    > Click a button to increase cluster size in Databricks Cloud
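    A minimal PySpark sketch of both levers, assuming the same "wikipedia" Hive table as the earlier snippet; the application name and partition count are illustrative, not from the slides.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="interactive-exploration")   # already provided in a notebook
    hiveContext = HiveContext(sc)

    articles = hiveContext.sql("select article, text from wikipedia")

    # High parallelism: spread the data over more partitions so more tasks
    # can work on each interactive query at once.
    # In-memory computation: cache the repartitioned data so repeated
    # queries read from memory instead of storage.
    articles = articles.repartition(200).cache()

    articles.count()   # first action materializes the cache; later ones are fast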

  • Versatile programming interface

    Data visualization is very much like programming.
    > Point and click doesn’t really cut it
    > Requires an API (grammar): ggplot, matplotlib, bokeh, etc.

    Spark has SQL, Scala, Python, Java, and (experimental) R APIs

    Libraries for distributed statistics and machine learning
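    As a sketch of the "API, not point and click" idea: compute on the cluster, pull back only a small sample, and draw it with matplotlib on the driver. The length histogram and column name are illustrative assumptions, reusing the hiveContext and "wikipedia" table from the sketch above.

    import matplotlib.pyplot as plt

    rows = hiveContext.sql("select text from wikipedia")

    # Compute article lengths in parallel, keep roughly 0.1% of them,
    # and bring only that small sample back to the driver.
    lengths = rows.map(lambda r: len(r.text)).sample(False, 0.001).collect()

    # Plot the collected sample locally.
    plt.hist(lengths, bins=50)
    plt.xlabel("article length (characters)")
    plt.ylabel("number of sampled articles")
    plt.show()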

  • Solutions

    1. Interactivity
    > In-memory computation
    > High parallelism

    2. Visual medium
    > In-browser collaborative notebooks
    > Summarizing, Sampling and Modeling

  • More data points than pixels

    Can we visualize 200GB of multidimensional data?

    Short answer: no

    Long answer:
    > Summarize & visualize
    > Sample & visualize
    > Model & visualize

  • Summarize and visualize

    Extensively used by BI tools
    > Aggregation
    > Pivoting

    Most data scientists’ nightly jobs summarize data
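    A hedged example of summarize-and-visualize with the same hiveContext: aggregate the table down to a few dozen rows on the cluster, then collect only the summary. Grouping articles into size buckets is an illustrative choice, not from the slides.

    # Count articles per size bucket (kilobytes of text); the heavy lifting
    # happens on the cluster and only the aggregated rows reach the driver.
    summary = hiveContext.sql("""
        select floor(length(text) / 1000) as size_kb, count(*) as n_articles
        from wikipedia
        group by floor(length(text) / 1000)
        order by size_kb
    """).collect()

    for row in summary:
        print(row.size_kb, row.n_articles)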

  • Sample and visualize

    Sometimes we need to visualize (feel) individual data points

    Sampling is extensively used in statistics

    Spark offers native support for:
    > Approximate and exact sampling
    > Approximate and exact stratified sampling

    Approximate sampling is faster and is good enough in most cases
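    A sketch of those sampling APIs on the RDD side (Spark 1.x names), assuming the words RDD built in the earlier snippet; keying words by their first letter is purely illustrative.

    # Approximate sample: the fraction is an expected value, so the result
    # size varies slightly between runs, but only one cheap pass is needed.
    approx_sample = words.sample(False, 0.001)

    # Exact-size sample, returned straight to the driver.
    exact_sample = words.takeSample(False, 10000)

    # Stratified sampling works on key/value RDDs with per-key fractions.
    pairs = words.map(lambda w: (w[:1].lower(), w))
    fractions = dict((k, 0.001) for k in pairs.countByKey())
    stratified = pairs.sampleByKey(False, fractions)
    # Exact stratified sampling (sampleByKeyExact) is available through the
    # Scala/Java API when per-stratum counts must match exactly.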

  • Model and visualize

    MLlib supports a large (and growing) set of distributed algorithms
    > Clustering: k-means
    > Classification and regression: linear models (LM), decision trees (DT), naive Bayes (NB)
    > Dimensionality reduction: SVD, PCA
    > Collaborative filtering: ALS
    > Correlation, hypothesis testing
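    A small MLlib sketch using k-means from the list above; the two hand-rolled features (character count and word count per article) are assumptions for illustration, again reusing the hiveContext from the first sketch.

    from pyspark.mllib.clustering import KMeans

    # Build a tiny feature vector per article: [characters, words].
    features = hiveContext.sql("select text from wikipedia") \
        .map(lambda r: [float(len(r.text)), float(len(r.text.split()))]) \
        .cache()

    # Fit five clusters in parallel on the cluster.
    model = KMeans.train(features, k=5, maxIterations=20)

    # Visualize the handful of cluster centers instead of every raw point.
    for center in model.clusterCenters:
        print(center)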

  • Demo

  • Summary

    With new big data tools we can resume interactive visual exploration of data

    Using Spark we can manipulate large data in seconds
    > Cache data in memory
    > Increase parallelism

    To visualize millions of data points we can
    > Summarize
    > Sample
    > Model

  • Links

    Databricks Cloud: databricks.com

    Apache Spark: spark.apache.org

    Matplotlib: matplotlib.org

    Python ggplot: ggplot.yhathq.com

    D3: d3js.org