Open Source Ingredients for Interactive Data Analysis in Spark

Open Source Ingredients for Interactive Data Analysis in Spark Hadoop Summit San Jose – June 2016 Maxim Lukiyanov, Program Manager Big Data, Microsoft @maxiluk


Page 1: Open Source Ingredients for Interactive Data Analysis in Spark

Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager, Big Data, Microsoft
@maxiluk

Page 2: Open Source Ingredients for Interactive Data Analysis in Spark

Agenda
  How it all fits together
  Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
  Resource management

Page 3: Open Source Ingredients for Interactive Data Analysis in Spark

What is your top concern for big data projects?

Page 4: Open Source Ingredients for Interactive Data Analysis in Spark

#1: Length of Development Cycle

Page 5: Open Source Ingredients for Interactive Data Analysis in Spark

Length of Development Cycle
  Universal metric to track and improve
  Affects productivity and project risk

Page 6: Open Source Ingredients for Interactive Data Analysis in Spark

Development phases
  Data exploration and experimentation
  Data sharing
  Development of production code
  Debugging

Page 7: Open Source Ingredients for Interactive Data Analysis in Spark

Interactive Spark on Azure

[Architecture diagram. Labels shown:
  Clients and tools: Jupyter notebooks, IntelliJ IDEA, BI Tools, Command line
  Access protocols: REST, SSH, ODBC
  Servers: Livy server, Thrift server
  YARN: Default Queue, Thrift Queue, four Spark Applications
  Storage: Local HDFS, Blob Storage, Data Lake Store]
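The diagram lists REST next to the Livy server. As a rough illustration of what that path looks like from a client, the sketch below builds (but does not send) two requests against Livy's REST API: one that creates an interactive session and one that runs a statement in it. The host name `livy.example.com`, the port, and the helper function names are assumptions made for this example and do not come from the slides.

```python
import json
import urllib.request

# Hypothetical endpoint; a real cluster exposes Livy at its own URL.
LIVY_URL = "http://livy.example.com:8998"


def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """POST /sessions asks Livy to start an interactive session of this kind."""
    return build_request("/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """POST /sessions/{id}/statements runs a code snippet in that session."""
    return build_request(f"/sessions/{session_id}/statements", {"code": code})
```

Passing either request to `urllib.request.urlopen` would submit it to a live server. Notebook kernels such as sparkmagic issue calls of this kind on the user's behalf, so this is only needed when scripting against Livy directly.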

Page 8: Open Source Ingredients for Interactive Data Analysis in Spark

Ingredients

Page 9: Open Source Ingredients for Interactive Data Analysis in Spark

Apache Spark
Interactive compute engine
  Interactive on small datasets
  Interactive on large datasets on large clusters with in-memory or SSD caching
  Built-in sampling

Upcoming in Spark 2.0
  Tungsten Phase 2 (5-10x speedup)
  Structured Streaming

Great momentum
  Active and large community
  Supported by all major big data vendors
  Fast release cadence

Page 10: Open Source Ingredients for Interactive Data Analysis in Spark

Evolution of big data

Data Sources

Page 11: Open Source Ingredients for Interactive Data Analysis in Spark

Spark on Azure Cloud (HDInsight)
Fully Managed Service
  100% open source Apache Spark and Hadoop bits
  Latest releases of Spark
  Fully supported by Microsoft and Hortonworks
  99.9% Azure Cloud SLA
  Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC

Tools for data exploration, experimentation and development
  Jupyter Notebooks (Scala, Python, automatic data visualizations)
  IntelliJ plugin (job submission, remote debugging)
  ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.

Page 12: Open Source Ingredients for Interactive Data Analysis in Spark

Resource management

Page 13: Open Source Ingredients for Interactive Data Analysis in Spark

Interactive Spark on Azure

[Architecture diagram, repeated from Page 7 with the same labels.]

Page 14: Open Source Ingredients for Interactive Data Analysis in Spark

YARN resource management
Dynamic resource allocation (Thrift)
  Thrift server adds executors when processing SQL queries
  After timeout it shrinks back

Resource preemption (between queues)
  Thrift will take resources from other apps during activity and vice versa
  When multiple apps are active the resources are shared fairly
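The grow-and-shrink behaviour described for the Thrift server corresponds to Spark's dynamic allocation settings. The sketch below assembles those properties and renders them as `--conf` arguments of the kind accepted by `spark-submit` (and by the Thrift server start script, which takes the same options). The property names are standard Spark configuration keys; the queue name `thrift`, the numeric values, and the helper names are assumptions for illustration, not settings taken from the talk.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout_s,
                            queue="thrift"):
    """Spark properties that let an application grow and shrink on YARN.

    The queue name "thrift" is a hypothetical example.
    """
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files readable
        # after the executor that wrote them has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Idle executors are released after this timeout ("shrinks back").
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
        "spark.yarn.queue": queue,
    }


def to_submit_args(conf):
    """Render the properties as --conf key=value command-line arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

For example, `to_submit_args(dynamic_allocation_conf(1, 20, 60))` yields a list that can be appended to the launch command; suitable minimum, maximum, and timeout values depend on cluster size and workload.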

Page 15: Open Source Ingredients for Interactive Data Analysis in Spark

YARN resource management: Limitations
Bugs
  Capacity resource scheduler + Default resource calculator configuration works
  Dominant resource calculator breaks preemption logic

Limitations
  No resource preemption between applications
  No application sharing between notebooks in Livy
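The configuration advice above can be turned into a small sanity check. The sketch below inspects a dictionary of YARN properties and reports anything that departs from the combination the slide says works (Capacity scheduler with the Default resource calculator, preemption monitor enabled). The property keys and class names are standard Hadoop YARN ones; the function name and the idea of passing the settings as a flat dictionary are assumptions for this example, and the claim about the Dominant resource calculator is the speaker's observation from 2016, not something this code verifies.

```python
PKG = "org.apache.hadoop.yarn"
CAPACITY_SCHEDULER = (
    PKG + ".server.resourcemanager.scheduler.capacity.CapacityScheduler"
)
DEFAULT_CALC = PKG + ".util.resource.DefaultResourceCalculator"
DOMINANT_CALC = PKG + ".util.resource.DominantResourceCalculator"


def preemption_warnings(conf):
    """Return warnings for settings that differ from the slide's working setup.

    `conf` is a flat dict of YARN property names to string values.
    """
    warnings = []
    if conf.get("yarn.resourcemanager.scheduler.class") != CAPACITY_SCHEDULER:
        warnings.append("scheduler is not the Capacity scheduler")
    if conf.get("yarn.resourcemanager.scheduler.monitor.enable") != "true":
        warnings.append("preemption monitor is not enabled")
    # YARN falls back to the Default resource calculator when this is unset.
    calc = conf.get("yarn.scheduler.capacity.resource-calculator", DEFAULT_CALC)
    if calc == DOMINANT_CALC:
        warnings.append("Dominant resource calculator reported to break preemption")
    return warnings
```

An empty list means the settings match the working combination described on the slide; it does not guarantee that preemption behaves as expected on a given Hadoop version.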

Page 16: Open Source Ingredients for Interactive Data Analysis in Spark

Summary: Full list of ingredients
Components
  Apache Spark
  Jupyter + sparkmagic kernel (or Zeppelin)
  Livy job server
  Apache YARN resource management using queues and preemption
  Columnar file formats (Parquet, ORC)
  IntelliJ IDEA + plugin for HDInsight
  [Non-OSS] BI Tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
  Azure Cloud

Techniques
  Sample, sample, sample
  CACHE TABLE (or auto-caching using Alluxio)
  Scale out on demand using elasticity of the cloud
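The first two techniques combine naturally: cache a small sample once, then iterate on it interactively. The sketch below builds a Spark SQL statement that does this with `CACHE TABLE ... AS SELECT` and `TABLESAMPLE (n PERCENT)`, both of which are Spark SQL syntax. The helper name and the table names used in the usage note are hypothetical, and the name check is deliberately strict (it rejects qualified names such as `db.table`) to keep the example short.

```python
def sampled_cache_statement(source_table, cache_name, percent):
    """Build a Spark SQL statement that caches a percentage sample of a table."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    for name in (source_table, cache_name):
        # Accept only letters, digits and underscores to avoid SQL injection.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {name!r}")
    return (
        f"CACHE TABLE {cache_name} AS "
        f"SELECT * FROM {source_table} TABLESAMPLE ({percent} PERCENT)"
    )
```

For instance, `sampled_cache_statement("trips", "trips_sample", 5)` produces a statement that could be run from a notebook SQL cell or over the Thrift server; later exploratory queries would then target `trips_sample` instead of the full table.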

Page 17: Open Source Ingredients for Interactive Data Analysis in Spark

Resources
  SparkMagic kernel for Jupyter notebook: https://github.com/jupyter-incubator/sparkmagic
  Livy job server: https://github.com/cloudera/livy
  IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
  Azure Spark Documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/

Page 18: Open Source Ingredients for Interactive Data Analysis in Spark

© Microsoft Corporation. All rights reserved.