Interactive Visual Data Exploration with Spark in ...Interactive Visual Data Exploration with Spark...
Embed Size (px)
Transcript of Interactive Visual Data Exploration with Spark in ...Interactive Visual Data Exploration with Spark...
-
Interactive Visual Data Exploration with Spark in Databricks Cloud
Hossein Falaki @mhfalaki
-
About Databricks
Founded by creators of Apache Spark !
!
Offers Spark as a service in the cloud !
!
Dedicated to open source Spark > Largest organization contributing to Apache Spark > Drive the roadmap
!
2
-
3
Databricks Cloud
Databricks Workspace
Databricks Platform > Start clusters in seconds > Dynamically scale up & down
> Notebooks > Dashboards > Job launcher
> Latest version > Configured / Optimized
-
Fast & General distributed computing engine: batch, streaming, iterative !
Capable of handling petabytes of data !
Even faster by caching data in-memory !
Versatile programming interfaces
4
Spark
-
Spark: Mixing SQL with Python/Scala
5
// Query an existing table and get results back as Schema RDD rdd = hiveContext.sql(“select article, text from wikipedia”) !
// Perform transformations words = rdd.flatMap(lambda r: r.text.split()) !
// Collect sample of data in driver machine sampled_words = words.sample(fraction = 0.001)
-
Databricks Platform
Start clusters in seconds !
Zero-cost management !
Dynamically scale up and down
6
-
Databricks Workspace
Notebooks > SQL > Python > Scala
Dashboards
Job Launcher
7
-
Notebooks
Supports Python, Scala, SQL !
Interactive commands and plots !
On-line collaboration
8
-
Dashboards
WYSIWYG Builder !
Interactive jobs !
On-click publishing !
Exporting from notebooks
9
-
Job Launcher
Runs arbitrary Spark jobs programmatically
10
-
11
Expository vs. Exploratory
-
12
Large data
-
13
“Visualization is critical to data analysis.”William S. Cleveland
But we often skip exploratory visualization with large data
-
Challenges
14
1. Interactivity
with large data is challenging
2. Visual medium
cannot accommodate as many pixels as data points
-
Solutions
15
In-memory computation !
High parallelism
1. Interactivity
-
Reducing interaction latency with Spark
1. In-memory computation
> Significantly reduces latency
2. High parallelism > Get more executors with Mesos or Yarn: a challenge in itself > Click a button to increase cluster size in Databricks Cloud
16
-
Versatile programming interface
!
Data visualization is very much like programming. > Point and click doesn’t really cut it > Requires an API (grammar): ggplot, matplotlib, bokeh, etc.
!
Spark has SQL, Scala, Python, Java and (experimental) R API !
Libraries for distributed statistics and machine learning
17
-
Solutions
18
2. Visual medium
In-memory computation !
High parallelism
In-browser collaborative notebooks !
Summarizing, Sampling and Modeling
1. Interactivity
-
More data points than pixels
19
Can we visualize 200GB of multidimensional data?
Long answer: > Summarize & visualize
> Sample & visualize
> Model & visualize
Short answer: no
-
Extensively used by BI tools > Aggregation > Pivoting !
Most data scientists’ nightly jobs summarize data
Summarize and visualize
20
-
Sample and visualize
Sometimes we need to visualize (feel) individual data points
Sampling is extensively used in statistics
Spark offers native support for:
> Approximate and exact sampling
> Approximate and exact stratified sampling
!
Approximate sampling is faster and is good enough in most cases
21
-
Model and visualize
MLLib supports a large (and growing) set of distributed algorithms
> Clustering: k-means
> Classification and regression: LM, DT, NB
> Dimensionality reduction: SVD, PCA
> Collaborative filtering: ALS
> Correlation, hypothesis testing
22
-
23
Demo
-
SummaryWith new big data tools we can resume interactive visual exploration of data
!
Using Spark we can manipulate large data in seconds > Cache data in memory > Increase parallelism
!
To visualize millions of data points we can > Summarize > Sample > Models
24
-
25
Databricks Cloud databricks.com
Apache Spark spark.apache.org
Matplotlib matplotlib.org
Python ggplot ggplot.yhathq.com
D3 d3js.org