Open Source Ingredients for Interactive Data Analysis in Spark
Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit, San Jose, June 2016
Maxim Lukiyanov, Program Manager, Big Data, [email protected]
Agenda
- How it all fits together
- Ingredients: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management
What is your top concern for big data projects?
#1: Length of Development Cycle
Length of Development Cycle
- A universal metric to track and improve
- Affects productivity and project risk
Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging
Interactive Spark on Azure (architecture diagram): Spark applications run on YARN in two queues, a Default queue and a Thrift queue. Clients reach them through the command line (SSH), the Livy server (REST), Jupyter notebooks, the Thrift server (ODBC), IntelliJ IDEA, and BI tools. Storage layers: local HDFS, Azure Blob Storage, and Azure Data Lake Store.
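As a concrete illustration of the Livy path in the diagram above, here is a minimal sketch of submitting interactive code over Livy's REST API with Python requests. The cluster URL, credentials, and code snippet are placeholders, not values from the talk; on HDInsight the Livy endpoint is typically exposed through the cluster gateway.

    import json, time, requests

    # Placeholder endpoint and credentials.
    livy = "https://<cluster>.azurehdinsight.net/livy"
    auth = ("admin", "<password>")
    headers = {"Content-Type": "application/json"}

    # 1. Create an interactive PySpark session (a long-lived Spark application in YARN).
    r = requests.post(livy + "/sessions", auth=auth, headers=headers,
                      data=json.dumps({"kind": "pyspark"}))
    session_id = r.json()["id"]

    # 2. Wait for the session to become idle, then run a statement in it.
    while requests.get("%s/sessions/%d" % (livy, session_id), auth=auth).json()["state"] != "idle":
        time.sleep(5)
    r = requests.post("%s/sessions/%d/statements" % (livy, session_id), auth=auth,
                      headers=headers,
                      data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}))
    print(r.json())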
Ingredients
Apache Spark
- Interactive compute engine
- Interactive on small datasets
- Interactive on large datasets on large clusters, with in-memory or SSD caching
- Built-in sampling (see the sketch below)
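A minimal PySpark sketch of the built-in sampling mentioned above; the DataFrame name, storage path, and 1% fraction are illustrative, not from the talk.

    # Explore a small, fast sample instead of the full dataset.
    sample_df = df.sample(withReplacement=False, fraction=0.01, seed=42)
    sample_df.describe().show()

    # RDDs expose the same built-in operation.
    sample_rdd = sc.textFile("wasb:///example/data/events.csv").sample(False, 0.01)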
Upcoming in Spark 2.0
- Tungsten Phase 2 (5-10x speedup)
- Structured Streaming
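A minimal sketch of the Structured Streaming API as introduced in Spark 2.0, counting words from a socket source and printing running counts to the console; the source, host, and port are chosen purely for illustration.

    from pyspark.sql.functions import explode, split

    # 'spark' is a Spark 2.0 SparkSession.
    lines = (spark.readStream.format("socket")
                  .option("host", "localhost").option("port", 9999).load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously update and print the full result table.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()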
Great momentum
- Active and large community
- Supported by all major big data vendors
- Fast release cadence
Evolution of big data (diagram: data sources)
Spark on Azure Cloud (HDInsight)
- Fully managed service
- 100% open source Apache Spark and Hadoop bits
- Latest releases of Spark
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Tools for data exploration, experimentation and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations; see the notebook sketch below)
- IntelliJ plugin (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
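The Jupyter experience above is built on the sparkmagic kernels, which route notebook cells to the cluster through Livy. A sketch of typical cells follows; the session settings, storage path, and table name are illustrative, not from the talk.

    Cell 1 - tune the Livy session before it starts:
        %%configure -f
        {"executorMemory": "4G", "executorCores": 2, "numExecutors": 8}

    Cell 2 - ordinary code cells run remotely on the cluster (Spark 1.6-era sqlContext):
        df = sqlContext.read.parquet("wasb:///data/events.parquet")
        df.registerTempTable("events")

    Cell 3 - SQL cells return a result set with automatic visualizations:
        %%sql
        SELECT country, count(*) AS n FROM events GROUP BY country ORDER BY n DESC LIMIT 10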
Resource management
(The Interactive Spark on Azure architecture diagram from earlier is shown again: Spark applications in the Default and Thrift YARN queues, reached via SSH, Livy (REST), Jupyter notebooks, the Thrift server (ODBC), IntelliJ IDEA, and BI tools, over local HDFS, Blob Storage, and Data Lake Store.)
YARN resource management
- Dynamic resource allocation (Thrift): the Thrift server adds executors while processing SQL queries and shrinks back after an idle timeout (configuration sketch below)
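The dynamic allocation behavior described above is controlled by standard Spark properties on the Thrift server; a sketch of the relevant settings (the values, and the assumption that they live in the Thrift server's Spark configuration, are illustrative):

    spark.dynamicAllocation.enabled              true
    spark.dynamicAllocation.minExecutors         1
    spark.dynamicAllocation.maxExecutors         20
    spark.dynamicAllocation.executorIdleTimeout  120s
    # Dynamic allocation requires the external shuffle service on the node managers.
    spark.shuffle.service.enabled                true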
Resource preemption (between queues)
- While active, the Thrift server's queue preempts resources from other applications, and vice versa
- When multiple applications are active, resources are shared fairly
YARN resource management: limitations and bugs
- The Capacity Scheduler with the DefaultResourceCalculator configuration works (sketch below)
- The DominantResourceCalculator breaks the preemption logic
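A sketch of the YARN configuration behind the queue setup and preemption discussed above. The property names are standard ResourceManager / Capacity Scheduler settings; the queue names and the 50/50 capacities are assumptions based on the architecture diagram.

    # yarn-site.xml: enable the preemption monitor.
    yarn.resourcemanager.scheduler.monitor.enable    true
    yarn.resourcemanager.scheduler.monitor.policies  org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy

    # capacity-scheduler.xml: two queues, and the resource calculator the talk
    # reports as compatible with preemption (Default, not Dominant).
    yarn.scheduler.capacity.resource-calculator      org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
    yarn.scheduler.capacity.root.queues              default,thrift
    yarn.scheduler.capacity.root.default.capacity    50
    yarn.scheduler.capacity.root.thrift.capacity     50
    yarn.scheduler.capacity.root.default.maximum-capacity  100
    yarn.scheduler.capacity.root.thrift.maximum-capacity   100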
Limitations
- No resource preemption between applications
- No application sharing between notebooks in Livy
Summary: full list of ingredients
Components:
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS] BI tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud
Techniques:
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio); see the sketch below
- Scale out on demand using the elasticity of the cloud
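A minimal sketch combining the first two techniques in a Spark 1.6-era notebook; the table name "events" and the 1% fraction are illustrative.

    # Work against a cached 1% sample instead of the full table.
    sample = sqlContext.table("events").sample(False, 0.01)
    sample.registerTempTable("events_sample")
    sqlContext.sql("CACHE TABLE events_sample")   # pin the sample in memory
    sqlContext.sql("SELECT country, count(*) AS n FROM events_sample GROUP BY country").show()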
Resources
- sparkmagic kernel for Jupyter notebooks: https://github.com/jupyter-incubator/sparkmagic
- Livy job server: https://github.com/cloudera/livy
- IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
- Azure Spark documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/