Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager, Big Data, Microsoft
@maxiluk
Agenda
- How it all fits together
- Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management
What is your top concern for big data projects?
#1: Length of Development Cycle
- A universal metric to track and improve
- Affects productivity and project risk
Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging
[Architecture diagram: Interactive Spark on Azure. Jupyter notebooks (REST, via the Livy server), the command line (SSH), BI tools (ODBC, via the Thrift server), and IntelliJ IDEA submit Spark applications to YARN queues (Default and Thrift), backed by local HDFS, Azure Blob Storage, and Azure Data Lake Store.]
Ingredients
Apache Spark
- Interactive compute engine
- Interactive on small datasets
- Interactive on large datasets when run on large clusters with in-memory or SSD caching
- Built-in sampling

Upcoming in Spark 2.0
- Tungsten Phase 2 (5–10x speedup)
- Structured Streaming

Great momentum
- Active and large community
- Supported by all major big data vendors
- Fast release cadence
[Diagram: evolution of big data and its data sources]
Spark on Azure Cloud (HDInsight)
- Fully managed service
- 100% open source Apache Spark and Hadoop bits
- Latest releases of Spark
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC

Tools for data exploration, experimentation, and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations)
- IntelliJ plugin (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
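Job submission in this stack goes through the Livy REST server. As a minimal sketch, the Python snippet below builds the JSON body that Livy's POST /batches endpoint expects for a batch job; the jar path, class name, and server hostname are hypothetical, and the actual HTTP call is left commented out because it needs a reachable Livy endpoint.

```python
import json

def build_batch_payload(jar_path, main_class, args=None):
    """Build the JSON body for Livy's POST /batches (batch job submission)."""
    payload = {"file": jar_path, "className": main_class}
    if args:
        payload["args"] = list(args)
    return payload

# Hypothetical jar location and main class, for illustration only.
payload = build_batch_payload(
    "wasb:///example/jars/spark-job.jar",
    "com.example.SparkJob",
    args=["2016-06-01"],
)
body = json.dumps(payload)

# To actually submit (requires a reachable Livy server, default port 8998):
# import requests
# resp = requests.post("http://headnode:8998/batches", data=body,
#                      headers={"Content-Type": "application/json"})
print(body)
```

Interactive sessions follow the same pattern with POST /sessions (e.g. `{"kind": "pyspark"}`) and POST /sessions/{id}/statements, which is what the sparkmagic Jupyter kernel does under the hood.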
Resource management
[Architecture diagram repeated: Interactive Spark on Azure, as shown earlier.]
YARN resource management
- Dynamic resource allocation (Thrift): the Thrift server adds executors while processing SQL queries and shrinks back after a timeout
- Resource preemption (between queues): during activity, Thrift takes resources from other applications, and vice versa; when multiple applications are active, resources are shared fairly
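This queue setup can be sketched in Hadoop configuration. The fragment below is an illustration, not the deck's exact settings: two Capacity Scheduler queues using the Default resource calculator (the configuration the deck reports as working), plus the yarn-site.xml switch that enables the preemption monitor. The queue names and 50/50 capacity split are assumptions.

```xml
<!-- capacity-scheduler.xml: two queues, Default resource calculator -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,thrift</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.thrift.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>

<!-- yarn-site.xml: enable the preemption monitor for cross-queue preemption -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
```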
YARN resource management: limitations
- Bugs: the Capacity Scheduler with the Default resource calculator works, but the Dominant resource calculator breaks the preemption logic
- No resource preemption between applications
- No application sharing between notebooks in Livy
Summary: full list of ingredients
Components:
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS] BI tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud

Techniques:
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio)
- Scale out on demand using the elasticity of the cloud
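The first two techniques combine naturally in Spark SQL: cache a small sample of a large table, then run interactive queries against the cached slice. A sketch, with illustrative table and column names (TABLESAMPLE syntax follows Hive/Spark SQL):

```sql
-- Cache a 1% sample of a large table for interactive exploration
-- (table and column names are illustrative)
CACHE TABLE events_sample AS
SELECT * FROM events TABLESAMPLE (1 PERCENT);

-- Subsequent queries hit the in-memory sample, not the full table
SELECT event_type, COUNT(*) AS n
FROM events_sample
GROUP BY event_type;
```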
Resources
- sparkmagic kernel for Jupyter notebooks: https://github.com/jupyter-incubator/sparkmagic
- Livy job server: https://github.com/cloudera/livy
- IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
- Azure Spark documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/
© Microsoft Corporation. All rights reserved.