Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager, Big Data, Microsoft
@maxiluk
Agenda
- How it all fits together
- Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management
What is your top concern for big data projects?
#1: Length of Development Cycle
- A universal metric to track and improve
- Affects productivity and project risk
Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging
[Architecture diagram: Interactive Spark on Azure. Jupyter notebooks (REST, via the Livy server), the command line (SSH), BI tools (ODBC, via the Thrift server), and IntelliJ IDEA submit Spark applications to YARN queues (Default and Thrift), backed by local HDFS, Azure Blob Storage, and Azure Data Lake Store.]
Ingredients
Apache Spark
- Interactive compute engine
- Interactive on small datasets
- Interactive on large datasets when run on large clusters with in-memory or SSD caching
- Built-in sampling

Upcoming in Spark 2.0
- Tungsten Phase 2 (5–10x speedup)
- Structured Streaming

Great momentum
- Active and large community
- Supported by all major big data vendors
- Fast release cadence
[Diagram: evolution of big data and its data sources]
Spark on Azure Cloud (HDInsight)
- Fully managed service
- 100% open source Apache Spark and Hadoop bits
- Latest releases of Spark
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC

Tools for data exploration, experimentation, and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations)
- IntelliJ plugin (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
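Job submission in this stack goes through the Livy REST server. As a minimal sketch, the Python snippet below builds the JSON body that Livy's POST /batches endpoint expects for a batch job; the jar path, class name, and server hostname are hypothetical, and the actual HTTP call is left commented out because it needs a reachable Livy endpoint.

```python
import json

def build_batch_payload(jar_path, main_class, args=None):
    """Build the JSON body for Livy's POST /batches (batch job submission)."""
    payload = {"file": jar_path, "className": main_class}
    if args:
        payload["args"] = list(args)
    return payload

# Hypothetical jar location and main class, for illustration only.
payload = build_batch_payload(
    "wasb:///example/jars/spark-job.jar",
    "com.example.SparkJob",
    args=["2016-06-01"],
)
body = json.dumps(payload)

# To actually submit (requires a reachable Livy server, default port 8998):
# import requests
# resp = requests.post("http://headnode:8998/batches", data=body,
#                      headers={"Content-Type": "application/json"})
print(body)
```

Interactive sessions follow the same pattern with POST /sessions (e.g. `{"kind": "pyspark"}`) and POST /sessions/{id}/statements, which is what the sparkmagic Jupyter kernel does under the hood.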
Resource management
[Architecture diagram repeated: Interactive Spark on Azure, as shown earlier.]
YARN resource management
- Dynamic resource allocation (Thrift): the Thrift server adds executors while processing SQL queries and shrinks back after a timeout
- Resource preemption (between queues): during activity, Thrift takes resources from other applications, and vice versa; when multiple applications are active, resources are shared fairly
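This queue setup can be sketched in Hadoop configuration. The fragment below is an illustration, not the deck's exact settings: two Capacity Scheduler queues using the Default resource calculator (the configuration the deck reports as working), plus the yarn-site.xml switch that enables the preemption monitor. The queue names and 50/50 capacity split are assumptions.

```xml
<!-- capacity-scheduler.xml: two queues, Default resource calculator -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,thrift</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.thrift.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>

<!-- yarn-site.xml: enable the preemption monitor for cross-queue preemption -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
```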
YARN resource management: limitations
- Bugs: the Capacity Scheduler with the Default resource calculator works, but the Dominant resource calculator breaks the preemption logic
- No resource preemption between applications
- No application sharing between notebooks in Livy
Summary: full list of ingredients
Components:
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS] BI tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud

Techniques:
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio)
- Scale out on demand using the elasticity of the cloud
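The first two techniques combine naturally in Spark SQL: cache a small sample of a large table, then run interactive queries against the cached slice. A sketch, with illustrative table and column names (TABLESAMPLE syntax follows Hive/Spark SQL):

```sql
-- Cache a 1% sample of a large table for interactive exploration
-- (table and column names are illustrative)
CACHE TABLE events_sample AS
SELECT * FROM events TABLESAMPLE (1 PERCENT);

-- Subsequent queries hit the in-memory sample, not the full table
SELECT event_type, COUNT(*) AS n
FROM events_sample
GROUP BY event_type;
```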
Resources
- sparkmagic kernel for Jupyter notebooks: https://github.com/jupyter-incubator/sparkmagic
- Livy job server: https://github.com/cloudera/livy
- IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
- Azure Spark documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/
© Microsoft Corporation. All rights reserved.