Open Source Ingredients for Interactive Data Analysis in Spark


Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager, Big Data, Microsoft
@maxiluk

Agenda
- How it all fits together
- Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management

What is your top concern for big data projects? The #1 answer: the length of the development cycle.

Length of Development Cycle
- A universal metric to track and improve
- Affects productivity and project risk

Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging

Interactive Spark on Azure

[Architecture diagram: Jupyter notebooks (REST, via the Livy server), the command line (SSH), BI tools (ODBC, via the Thrift server), and IntelliJ IDEA each drive their own Spark application on YARN. The Thrift server runs in a dedicated Thrift queue alongside the default queue; storage is backed by local HDFS, Azure Blob Storage, and Azure Data Lake Store.]
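To make the Livy entry point concrete, here is a minimal sketch of driving an interactive session through Livy's REST API from Python. The endpoint URL is an assumption (Livy's default port is 8998; on HDInsight, requests instead go through the authenticating cluster gateway), and the polling intervals are arbitrary.

    import json
    import time

    import requests

    # Assumed endpoint: adjust host, port, and authentication for your cluster.
    LIVY = "http://localhost:8998"
    HEADERS = {"Content-Type": "application/json"}

    # Start an interactive PySpark session (a Spark application on YARN).
    session = requests.post(LIVY + "/sessions",
                            data=json.dumps({"kind": "pyspark"}),
                            headers=HEADERS).json()
    sid = session["id"]

    # Wait for YARN to start the session.
    while requests.get(LIVY + "/sessions/%d" % sid).json()["state"] != "idle":
        time.sleep(2)

    # Submit a statement, then poll until its result is available.
    stmt = requests.post(LIVY + "/sessions/%d/statements" % sid,
                         data=json.dumps({"code": "sc.parallelize(range(1000)).count()"}),
                         headers=HEADERS).json()
    result = requests.get(LIVY + "/sessions/%d/statements/%d" % (sid, stmt["id"])).json()
    while result["state"] != "available":
        time.sleep(1)
        result = requests.get(LIVY + "/sessions/%d/statements/%d" % (sid, stmt["id"])).json()
    print(result["output"])

    # Delete the session to release its YARN resources.
    requests.delete(LIVY + "/sessions/%d" % sid)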

Ingredients

Apache Spark
- Interactive compute engine
- Interactive on small datasets out of the box
- Interactive on large datasets on large clusters, with in-memory or SSD caching
- Built-in sampling (sketched below)
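To illustrate the built-in sampling, here is a minimal PySpark sketch; the Blob Storage path and the eventType column are hypothetical. Iterating against a small sample keeps the exploration loop interactive, and the same code reruns on the full dataset once the logic is right.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sampling-sketch").getOrCreate()

    # Hypothetical dataset on Azure Blob Storage.
    df = spark.read.parquet("wasb:///data/events.parquet")

    # Explore a 1% sample while iterating; drop the sampling step for the final run.
    sample = df.sample(withReplacement=False, fraction=0.01, seed=42)
    sample.groupBy("eventType").count().show()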

Upcoming in Spark 2.0
- Tungsten Phase 2 (5-10x speedup)
- Structured Streaming (sketched below)
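As a taste of the Structured Streaming API in Spark 2.0, here is a minimal sketch that treats JSON files landing in a directory as an unbounded table and maintains a running count per user. The input path and schema are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

    # Hypothetical input: JSON events dropped into a Blob Storage directory.
    schema = StructType([StructField("user", StringType()),
                         StructField("timestamp", LongType())])
    events = spark.readStream.schema(schema).json("wasb:///streaming/events")

    # A continuously updated aggregation over the unbounded input.
    counts = events.groupBy("user").count()

    # Print each updated result table to the console as new files arrive.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()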

Great momentum
- Active and large community
- Supported by all major big data vendors
- Fast release cadence

Evolution of big data

[Diagram: the growing variety of data sources feeding big data platforms.]

Spark on Azure Cloud (HDInsight)
- Fully managed service
- 100% open source Apache Spark and Hadoop bits
- Latest releases of Spark
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC

Tools for data exploration, experimentation and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations; see the sketch below)
- IntelliJ plugin (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
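As an illustration of the notebook experience, here is a sketch of a Jupyter cell using sparkmagic's %%sql magic; the events table and eventType column are hypothetical. The -o flag copies the query result into a local pandas DataFrame, which the notebook can then render with its automatic visualizations.

    %%sql -o top_events
    SELECT eventType, COUNT(*) AS cnt
    FROM events
    GROUP BY eventType
    ORDER BY cnt DESC
    LIMIT 10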

Resource management


YARN resource management
Dynamic resource allocation (Thrift)
- The Thrift server adds executors when processing SQL queries
- After an idle timeout it shrinks back (see the configuration sketch below)
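A minimal sketch of the Spark properties behind this behavior, set at session-creation time for illustration; on a real cluster they usually live in the cluster configuration (for the Thrift server, in its own Spark config) rather than in application code, and the specific values here are assumptions.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-allocation-sketch")
             # Let YARN grow and shrink the executor pool with the workload.
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "1")
             .config("spark.dynamicAllocation.maxExecutors", "20")
             # Idle executors are released after this timeout, shrinking the app back.
             .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
             # The external shuffle service is required for dynamic allocation.
             .config("spark.shuffle.service.enabled", "true")
             .getOrCreate())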

Resource preemption (between queues)
- Thrift takes resources from other applications while active, and vice versa
- When multiple applications are active, resources are shared fairly

YARN resource management: limitations
- Bugs: the capacity scheduler with the DefaultResourceCalculator works, but the DominantResourceCalculator (yarn.scheduler.capacity.resource-calculator) breaks the preemption logic

Limitations
- No resource preemption between applications
- No application sharing between notebooks in Livy

Summary: full list of ingredients
Components
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS] BI tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud

Techniques
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio); sketched below
- Scale out on demand using the elasticity of the cloud
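A short sketch of the CACHE TABLE technique in PySpark; the events table is hypothetical. CACHE TABLE is eager, so the table is materialized in executor memory as soon as the statement runs, and subsequent queries are served from the in-memory columnar cache.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-table-sketch").getOrCreate()

    # Hypothetical table registered in the Hive metastore.
    spark.sql("CACHE TABLE events")                  # eagerly pins the table in memory
    spark.sql("SELECT COUNT(*) FROM events").show()  # served from the in-memory cache
    spark.sql("UNCACHE TABLE events")                # free the memory when done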

Resources

SparkMagic kernel for Jupyter notebooks: https://github.com/jupyter-incubator/sparkmagic

Livy job server: https://github.com/cloudera/livy

IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/

Azure Spark documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/

© Microsoft Corporation. All rights reserved.