Open Source Ingredients for Interactive Data Analysis in Spark

Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager, Big Data, Microsoft (@maxiluk)



6/28/2016 8:43 AM. Microsoft Corporation. All rights reserved.

Agenda
- How it all fits together
- Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management


What is your top concern for big data projects?

#1: Length of Development Cycle

Length of Development Cycle
- Universal metric to track and improve
- Affects productivity and project risk

Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging

Interactive Spark on Azure
[Architecture diagram. Clients: command line, Jupyter notebooks, IntelliJ IDEA, BI Tools. Access paths: SSH, REST, ODBC. Servers: Livy server, Thrift server. YARN runs four Spark applications across the Default Queue and the Thrift Queue. Storage: Local HDFS, Blob Storage, Data Lake Store.]
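As a rough illustration of the REST path through the Livy server, the sketch below builds the two calls an interactive client typically makes: one to start a session (which launches a Spark application on YARN) and one to run a statement in it. The endpoint paths and JSON fields follow the Livy REST API; the host name is hypothetical, port 8998 is assumed to be Livy's default, and the requests are only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: Livy is reachable at its default port 8998.
# The host name is hypothetical and must be replaced for a real cluster.
LIVY_URL = "http://livy-host.example:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """POST /sessions asks Livy to start an interactive session."""
    return livy_request("/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """POST /sessions/{id}/statements runs a code snippet in that session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


if __name__ == "__main__":
    req = run_statement_request(0, "sc.parallelize(range(10)).count()")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

A notebook kernel such as sparkmagic performs this exchange on the user's behalf, so the sketch is mainly useful for seeing what travels over the wire.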

Ingredients

Apache Spark
Interactive compute engine
- Interactive on small datasets
- Interactive on large datasets on large clusters with in-memory or SSD caching
- Built-in sampling

Upcoming in Spark 2.0
- Tungsten Phase 2 (5-10x speedup)
- Structured Streaming

Great momentum
- Active and large community
- Supported by all major big data vendors
- Fast release cadence


Evolution of big data

Data Sources

Spark on Azure Cloud (HDInsight)
Fully Managed Service
- 100% open source Apache Spark and Hadoop bits
- Latest releases of Spark
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC

Tools for data exploration, experimentation and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations)
- IntelliJ plugin (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.

Resource management


Interactive Spark on Azure
[Architecture diagram. Clients: command line, Jupyter notebooks, IntelliJ IDEA, BI Tools. Access paths: SSH, REST, ODBC. Servers: Livy server, Thrift server. YARN runs four Spark applications across the Default Queue and the Thrift Queue. Storage: Local HDFS, Blob Storage, Data Lake Store.]

YARN resource management
Dynamic resource allocation (Thrift)
- Thrift server adds executors when processing SQL queries
- After a timeout it shrinks back

Resource preemption (between queues)
- Thrift will take resources from other apps during activity and vice versa
- When multiple apps are active the resources are shared fairly
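The grow-and-shrink behavior of the Thrift server corresponds to Spark's dynamic allocation settings. The sketch below collects the relevant properties and renders them either as spark-defaults.conf lines or as --conf arguments. The property names are standard Spark configuration keys; the queue name "thrift" and all numeric values are assumptions chosen for illustration, not settings taken from the talk.

```python
# Property names are standard Spark settings; the values and the queue
# name "thrift" are hypothetical and should be tuned per cluster.
THRIFT_SERVER_CONF = {
    "spark.yarn.queue": "thrift",
    "spark.dynamicAllocation.enabled": "true",
    # Dynamic allocation on YARN relies on the external shuffle service.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Idle executors are released after this timeout ("shrinks back").
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def to_spark_defaults(conf):
    """Render properties in spark-defaults.conf style: key, whitespace, value."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


def to_submit_args(conf):
    """Render the same properties as repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(to_spark_defaults(THRIFT_SERVER_CONF))
```

Keeping minExecutors low is what lets an idle Thrift server hand capacity back to other queues between queries.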

YARN resource management: Limitations
Bugs
- Capacity resource scheduler + Default resource calculator configuration works
- Dominant resource calculator breaks preemption logic

Limitations
- No resource preemption between applications
- No application sharing between notebooks in Livy
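The combination reported as working, the capacity scheduler with the default resource calculator and preemption between two queues, might be expressed roughly as follows. The property names and class names are standard Hadoop YARN settings; the queue names and the 50/50 capacity split are assumptions made for this sketch, and a real cluster would carry many more properties.

```python
import xml.etree.ElementTree as ET

# Goes in yarn-site.xml: select the capacity scheduler and enable the
# monitor that performs preemption between queues.
YARN_SITE = {
    "yarn.resourcemanager.scheduler.class":
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler",
    "yarn.resourcemanager.scheduler.monitor.enable": "true",
    "yarn.resourcemanager.scheduler.monitor.policies":
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        "ProportionalCapacityPreemptionPolicy",
}

# Goes in capacity-scheduler.xml. Queue names and percentages are
# hypothetical; sibling capacities must add up to 100.
CAPACITY_SCHEDULER = {
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator",
    "yarn.scheduler.capacity.root.queues": "default,thrift",
    "yarn.scheduler.capacity.root.default.capacity": "50",
    "yarn.scheduler.capacity.root.default.maximum-capacity": "100",
    "yarn.scheduler.capacity.root.thrift.capacity": "50",
    "yarn.scheduler.capacity.root.thrift.maximum-capacity": "100",
}


def to_hadoop_xml(props):
    """Serialize a dict as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hadoop_xml(YARN_SITE))
    print(to_hadoop_xml(CAPACITY_SCHEDULER))
```

Setting maximum-capacity to 100 for both queues is what allows either queue to borrow the whole cluster while the other is idle, with preemption returning the guaranteed share when both are active.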

Summary: Full list of ingredients
Components
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS] BI Tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud

Techniques
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio)
- Scale out on demand using elasticity of the cloud
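The first two techniques combine naturally: cache a small sample once, then iterate on it interactively. The sketch below only builds the Spark SQL text one might paste into a notebook cell or send through a SQL client; it uses the CACHE TABLE ... AS and TABLESAMPLE (n PERCENT) forms of Spark SQL. The helper names and the table names "trips" and "trips_sample" are hypothetical.

```python
def sample_query(table, percent):
    """Build a SELECT over a random sample using Spark SQL's TABLESAMPLE clause."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"SELECT * FROM {table} TABLESAMPLE ({percent} PERCENT)"


def cache_table_as(name, query):
    """Build a CACHE TABLE ... AS statement that caches the query result."""
    return f"CACHE TABLE {name} AS {query}"


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    print(cache_table_as("trips_sample", sample_query("trips", 1)))
```

Exploratory queries can then target the cached sample, and the full table is touched only once the logic has settled.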

Resources

SparkMagic kernel for Jupyter notebook
https://github.com/jupyter-incubator/sparkmagic

Livy job server
https://github.com/cloudera/livy

IntelliJ IDEA plug-in documentation
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/

Azure Spark Documentation
https://azure.microsoft.com/en-us/documentation/services/hdinsight/
