Spark SQL versus Apache Drill: Different Tools with Different Rules

  • Date posted: 09-Jan-2017
  • Category: Technology


# 2014 MapR Technologies

Contact Information

Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & others
VP of Incubator at Apache Foundation
Email: tdunning@apache.org, tdunning@maprtech.com

Twitter @ted_dunning

The growing adoption of Spark SQL and Apache Drill presents an interesting conundrum: who should use which tool, for what, and when? Can they potentially be used together? This session will compare and contrast these technologies across a variety of parameters: architectures, ease of use, data types and data sources supported. We will demo a rich set of queries on some very interesting datasets to bring the representative use cases to life for both Spark SQL and Apache Drill.

What is Drill

  • Dedicated distributed SQL engine
  • Uses Calcite as parser and optimizer
  • Specialized distributed execution engine
  • Execution does massive in-lining of UDFs; the result is performance
  • Advanced treatment of dynamic schema
  • Special zero-copy memory format: ValueVectors (spawned Apache Arrow)
  • Extensive work on security, impersonation
  • Views allow management of complex formats
  • Supports delegation of access rights via views

What is Drill?

A query engine that is:

  • Columnar/vectorized
  • Optimistic/pipelined
  • Compiled at runtime
  • Late-binding
  • Extensible

Table Can Be an Entire Directory Tree

    // On a file
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    group by errorLevel;

    // On the entire data collection: all years, all months
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs`
    group by errorLevel;
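The idea that one query can target either a single file or a whole directory tree can be modeled in a few lines of Python. This is only a sketch of the concept, not how Drill implements it: the JSON-lines files, the `scan` and `count_by_error_level` helpers, and the demo directory layout are all assumptions made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def scan(path):
    """Yield records from a single file, or from every file under a directory."""
    root = Path(path)
    files = [root] if root.is_file() else sorted(p for p in root.rglob("*") if p.is_file())
    for file in files:
        with file.open() as handle:
            for line in handle:
                yield json.loads(line)


def count_by_error_level(path):
    """Rough analogue of: select errorLevel, count(*) ... group by errorLevel."""
    return Counter(record["errorLevel"] for record in scan(path))


def build_demo_logs(base):
    """Write a tiny, made-up log tree: <base>/AppServerLogs/<year>/<month>/part0001.json."""
    layout = {
        ("2014", "Jan"): ["ERROR", "WARN", "ERROR"],
        ("2014", "Feb"): ["WARN"],
        ("2013", "Dec"): ["ERROR"],
    }
    root = Path(base) / "AppServerLogs"
    for (year, month), levels in layout.items():
        part = root / year / month / "part0001.json"
        part.parent.mkdir(parents=True, exist_ok=True)
        part.write_text("".join(json.dumps({"errorLevel": lv}) + "\n" for lv in levels))
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        logs = build_demo_logs(tmp)
        # Same aggregation, two different "tables": one file, then the whole tree.
        print(count_by_error_level(logs / "2014" / "Jan" / "part0001.json"))
        print(count_by_error_level(logs))
```

The caller only changes the path; the aggregation itself is identical in both cases, which is the point the two queries make.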

Basic Process

[Diagram: ZooKeeper coordinating three Drillbits, each with a distributed cache and each running alongside DFS/HBase storage; a query arrives at one Drillbit.]

  1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
  2. Drillbit generates execution plan based on query optimization & locality
  3. Fragments are farmed to individual nodes
  4. Result is returned to driving node

Stages of Query Planning

[Diagram: SQL query -> Parser -> Logical Planner (heuristic and cost based) -> Physical Planner (cost based) -> Foreman -> plan fragments sent to Drillbits.]

Query Execution

[Diagram: Drillbit components. Endpoints: RPC, JDBC, ODBC. Parsers: SQL, HiveQL, Pig. Planning and execution: logical plan, optimizer, physical plan, scheduler, foreman, operators, distributed cache. Storage interface to HDFS, HBase, Mongo and Cassandra.]

Batches of Values

  • Value vectors
      • List of values, with same schema
      • With the 4-value semantics for each value
  • Shipped around in batches
      • Max 256k bytes in a batch
      • Max 64K rows in a batch
  • RPC designed for multiple replies to a request
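To make the two batch limits concrete, here is a minimal Python sketch that splits one column of fixed-width values into batches honoring both the byte cap and the row cap from the slide. It assumes a single fixed-width column and applies the byte limit to that column alone; the `batches` helper and its parameters are hypothetical and are not Drill's API.

```python
MAX_BATCH_BYTES = 256 * 1024   # "max 256k bytes in a batch"
MAX_BATCH_ROWS = 64 * 1024     # "max 64K rows in a batch"


def batches(values, bytes_per_value):
    """Split a column of fixed-width values into batches that respect both limits."""
    rows_by_bytes = MAX_BATCH_BYTES // bytes_per_value
    # Whichever limit is hit first decides the batch size (at least one row).
    rows_per_batch = max(1, min(MAX_BATCH_ROWS, rows_by_bytes))
    for start in range(0, len(values), rows_per_batch):
        yield values[start:start + rows_per_batch]


if __name__ == "__main__":
    column = list(range(100_000))
    # 8-byte values are capped by bytes; 1-byte values are capped by rows.
    print([len(b) for b in batches(column, bytes_per_value=8)])
    print([len(b) for b in batches(column, bytes_per_value=1)])
```

With wide values the byte limit binds first, and with narrow values the row limit does, so both caps matter in practice.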

Fixed Value Vectors

Vectorization

  • Drill operates on more than one record at a time
      • Word-sized manipulations
      • SIMD instructions
      • GCC, LLVM and JVM all do various optimizations automatically
      • Manually code algorithms
  • Logical vectorization
      • Bitmaps allow lightning-fast null checks
      • Avoid branching to speed CPU pipeline
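The bitmap null check can be illustrated with a small Python sketch. It keeps one validity bit per value and inspects 64 values per step instead of testing each value separately. The 64-bit word size and the helper names are assumptions for illustration; Drill's real ValueVectors are off-heap buffers, not Python integers.

```python
WORD_BITS = 64


def pack_validity(values):
    """Build a validity bitmap: bit i is 1 when values[i] is not null."""
    words = [0] * ((len(values) + WORD_BITS - 1) // WORD_BITS)
    for i, value in enumerate(values):
        if value is not None:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


def count_nulls(words, length):
    """Count nulls one word at a time instead of one value at a time."""
    valid = sum(bin(word).count("1") for word in words)
    return length - valid


def has_no_nulls(words, length):
    """A word of all ones means 64 values need no per-value null check."""
    full, rest = divmod(length, WORD_BITS)
    all_ones = (1 << WORD_BITS) - 1
    if any(word != all_ones for word in words[:full]):
        return False
    # The last, partially filled word only needs its low `rest` bits set.
    return rest == 0 or words[full] == (1 << rest) - 1


if __name__ == "__main__":
    values = list(range(130))
    values[70] = None
    bitmap = pack_validity(values)
    print(count_nulls(bitmap, len(values)), has_no_nulls(bitmap, len(values)))
```

When `has_no_nulls` holds for a batch, an operator can take a path with no null handling at all, which is one way to avoid per-value branches.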

Runtime Compilation is Faster

  • JIT is smart, but more gains with runtime compilation
  • Janino: Java-based Java compiler

From http://bit.ly/16Xk32x

Drill compiler

[Diagram: CodeModel generates code; Janino compiles it into runtime byte-code; that byte-code is merged with precompiled byte-code templates ("merge byte-code of the two classes") to produce the loaded class.]
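The general idea of runtime compilation, generating source specialized to one query and compiling it before execution, can be sketched in Python with the built-in `compile` and `exec`. This is an analogy only: Drill generates Java and compiles it with Janino, and the `compile_filter` helper below is hypothetical.

```python
ALLOWED_OPS = {"<", "<=", "==", "!=", ">=", ">"}


def compile_filter(column, op, constant):
    """Generate and compile a batch filter specialised to one predicate."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    # The column name and constant are baked into the generated source, so the
    # compiled function does no per-row interpretation of the predicate.
    source = (
        "def run(batch):\n"
        f"    return [row for row in batch if row[{column!r}] {op} {constant!r}]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["run"]


if __name__ == "__main__":
    only_errors = compile_filter("errorLevel", "==", "ERROR")
    print(only_errors([{"errorLevel": "ERROR"}, {"errorLevel": "WARN"}]))
```

The generated function is built once per query and then applied to every batch, which is where specialization pays off.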

Optimistic

[Diagram: a spectrum running from "No need to checkpoint" to "Checkpoint frequently", with Apache Drill marked on it.]

Optimistic Execution

  • Recovery code trivial
      • Running instances discard the failed query's intermediate state
  • Pipelining possible
      • Send results as soon as batch is large enough
      • Requires barrier-less decomposition of query

Pipelining

  • Record batches are pipelined between nodes
      • ~256kB usually
  • Unit of work for Drill
      • Operators work on a batch
  • Operator reconfiguration happens at batch boundaries

[Diagram: three Drillbits passing record batches between them.]
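Batch-at-a-time pipelining can be sketched with Python generators: each operator consumes one batch, does its work, and hands the result downstream before the next batch is read. The operator names and the tiny batch size are assumptions for illustration and do not correspond to Drill's operator classes.

```python
def scan(rows, batch_size):
    """Source operator: emit the input as a stream of record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_op(batches, predicate):
    """Work on one batch at a time and forward it as soon as it is ready."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_op(batches, columns):
    """Keep only the requested columns, again one batch at a time."""
    for batch in batches:
        yield [{column: row[column] for column in columns} for row in batch]


def run(rows, batch_size=2):
    """Chain the operators; no stage waits for the previous one to finish."""
    is_error = lambda row: row["errorLevel"] == "ERROR"
    return list(project_op(filter_op(scan(rows, batch_size), is_error), ["id"]))


if __name__ == "__main__":
    demo = [{"id": i, "errorLevel": "ERROR" if i % 2 == 0 else "WARN"} for i in range(5)]
    print(run(demo))
```

Because every stage is a generator, asking for the first output batch pulls only the first input batch through the chain, so there is no barrier between operators.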

Pipelining

  • Random access: sort without copy or restructuring
  • Avoids serialization/deserialization
  • Off-heap (no GC woes when lots of memory)
  • Read/write to disk when data larger than memory

[Diagram: a Drillbit whose memory overflow spills to disk.]

Cost-based Optimization

  • Using Optiq (now Apache Calcite), an extensible framework
  • Pluggable rules and cost model
  • Rules for distributed plan generation
  • Insert Exchange operator into physical plan
  • Optiq enhanced to explore parallel query plans
  • Pluggable cost model
      • CPU, IO, memory, network cost (data locality)
      • Storage engine features (HDFS vs Hive vs HBase)

[Diagram: query optimizer with pluggable rules and a pluggable cost model.]
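What a pluggable cost model buys can be shown with a small Python sketch: the same candidate plans are ranked differently depending on which cost function is plugged in. The plan names, cost figures and weights are invented for illustration, and the code is not based on the Optiq/Calcite API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    cpu: float
    io: float
    memory: float
    network: float


def weighted_cost(weights):
    """Build a cost model (a function of PlanCost) from per-resource weights."""
    def model(cost):
        return (cost.cpu * weights["cpu"] + cost.io * weights["io"]
                + cost.memory * weights["memory"] + cost.network * weights["network"])
    return model


def cheapest_plan(candidates, cost_model):
    """Pick the candidate plan with the lowest cost under the given model."""
    return min(candidates, key=lambda name: cost_model(candidates[name]))


# Hypothetical candidate plans with made-up resource estimates.
CANDIDATES = {
    "broadcast_join": PlanCost(cpu=10, io=5, memory=8, network=2),
    "hash_exchange_join": PlanCost(cpu=8, io=5, memory=4, network=20),
}

if __name__ == "__main__":
    slow_network = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 5})
    local_data = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 0})
    print(cheapest_plan(CANDIDATES, slow_network))
    print(cheapest_plan(CANDIDATES, local_data))
```

Swapping the cost function changes the chosen plan without touching the planner, which is the appeal of keeping the cost model pluggable.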

What is SparkSQL?

What is Spark SQL

  • Essentially syntactic sugar over a limited subset of Spark
  • Inherits all the virtues (and vices) of Spark
  • Lambdas can serve as UDFs (has subtle issues for performance)
  • Inputs have to be loaded
      • Perhaps lazily, not obvious when load actually happens
  • Not designed as a streaming engine, requires more memory
  • Some JSON support, but not so much for large or variable objects

Embedded in a real language!

In More Detail

A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)

SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)

Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly

Many Similarities

[Diagram: Spark SQL planning stages (SQL parser, logical plan, optimizer, physical plan) alongside the Java, Scala and Python front ends.]

Important Differences

  • Spark execution assumes RDDs are a complete representation, not a stream of row batches
  • Input sources don't inject optimization rules, nor expose detailed cost models
  • Most RDDs don't have a zero-copy capability
  • Spark inherits the JVM memory model, with very limited use of off-heap memory

    scala> sqlContext.sql("select * from json.`foo.json`").show
    +---+------+----+
    |  a|     b|   c|
    +---+------+----+
    |  3|[3, 2]| xyz|
    |  7|  null| wxy|
    |  7|    []|null|
    +---+------+----+

    scala> sqlContext.sql("select a, explode(b) b_v from json.`bug.json`").show
    +---+---------+
    |  a|      b_v|
    +---+---------+
    |  3|        3|
    |  3|        2|
    +---+---------+
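The `explode` output contains only the rows for `a = 3`: a row whose array is null or empty contributes nothing. The Python sketch below models that behavior. It assumes the second query ran over the same three rows shown in the first result (the transcript uses two different file names), and the `explode` function here is a plain-Python stand-in, not Spark's implementation.

```python
# The three rows from the first query's output (assumed to be the input of the second).
ROWS = [
    {"a": 3, "b": [3, 2], "c": "xyz"},
    {"a": 7, "b": None, "c": "wxy"},
    {"a": 7, "b": [], "c": None},
]


def explode(rows, array_column, output_column, keep):
    """Emit one output row per array element; null or empty arrays emit nothing."""
    for row in rows:
        for element in row[array_column] or []:
            out = {column: row[column] for column in keep}
            out[output_column] = element
            yield out


if __name__ == "__main__":
    for result in explode(ROWS, "b", "b_v", keep=["a"]):
        print(result)
```

Both rows with `a = 7` vanish from the result, which is easy to miss when the array column is sparsely populated.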

First Synthesis

  • Drill has a more nuanced optimizer, better code generation
      • This often leads to ~2x speed advantage
  • Drill has ValueVector and row batches
      • This leads to much less memory pressure
  • Drill has much stricter memory life-cycle
      • Query and done and gone, no need for big GCs even on big memory
  • Drill is all about SQL execution

But

  • Spark can optimize across entire program
      • This often leads to ~2x speed advantage
  • Spark has much more flexible memory structures
      • This can lead to much less memory pressure
  • Spark has much more flexible RDD life-cycle
      • RDDs can be cached, persisted or simply recomputed as necessary
  • Spark is not all about SQL execution
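The flexible life-cycle, where a dataset is either cached or recomputed from its recipe, can be modeled with a toy class in Python. This is a conceptual sketch only: `LazyDataset` and its `runs` counter are hypothetical and far simpler than a real RDD, which is also partitioned and distributed.

```python
class LazyDataset:
    """A toy stand-in for an RDD: a recipe that runs only when results are needed."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = False
        self._data = None
        self.runs = 0  # how many times the recipe was actually executed

    def map(self, fn):
        """Derive a new dataset; nothing is computed until collect() is called."""
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        """Ask for the result to be kept after the first computation."""
        self._cached = True
        return self

    def collect(self):
        if self._cached and self._data is not None:
            return self._data
        self.runs += 1
        data = self._compute()
        if self._cached:
            self._data = data
        return data


if __name__ == "__main__":
    base = LazyDataset(lambda: [1, 2, 3])
    doubled = base.map(lambda x: x * 2)
    doubled.collect()
    doubled.collect()
    print(base.runs)  # the uncached parent is recomputed on every collect
```

Calling `cache()` on the parent would make the second `collect()` reuse the stored result instead of rerunning the recipe.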

The Really Big Differences

  • Drill focuses heavily on secure, multi-tenant access to data
      • Strong impersonation semantics
      • Cascading rights via views
      • Queries co-exist in a cluster and reserve only their momentary resource requirements
  • Spark focuses heavily on fully integrated execution models
      • Any Spark function works with (almost) any RDDs
      • Memory residency of RDDs is the highest goal

Drill security

  • End to end security from BI tools to Hadoop
  • Standard based PAM Authentication
  • 2 level user Impersonation
  • Fine-grained row and column level access control with Drill Views; no centralized security repository required

Granular security permissions through Drill views

Raw File (/raw/cards.csv). Owner: Admins. Permission: Admins.

    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-7914-3865-4817
    John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.view.drill). Not a physical data copy. Owner: Admins. Permission: Data Scientists.

    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-1111-1111-1111
    John  Boulder   CO     1374-1111-1111-1111

Business Analyst View. Owner: Admins. Permission: Business Analysts.

    Name  City      State
    Dave  San Jose  CA
    John  Boulder   CO
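The two views can be thought of as functions over the raw rows: one masks the card number, the other drops the column, and neither copies or alters the raw file. The Python sketch below models that idea using the sample rows from the slide. The function names and the masking rule (keep the first group of digits, overwrite the rest with 1111) are inferred from the sample output and are assumptions; a real Drill view is defined in SQL.

```python
RAW_CARDS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first group of digits and overwrite the rest, as in the masked view."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    """Column-level masking: same rows, card number obscured."""
    return [{**row, "card": mask_card(row["card"])} for row in rows]


def business_analyst_view(rows):
    """Column-level restriction: the card column is not exposed at all."""
    return [{key: value for key, value in row.items() if key != "card"} for row in rows]


if __name__ == "__main__":
    print(data_scientist_view(RAW_CARDS))
    print(business_analyst_view(RAW_CARDS))
```

Each view is computed on read, so the raw data stays in one place and only the permissions on the views differ.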

Ownership Chaining

Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv)

    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-7914-3865-4817
    John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist). Owner: John. Read: Jane.

    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-1111-1111-1111
    John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst). Owner: Jane. Read: Jack.

    Name  City      State
    Dave  San Jose  CA
    John  Boulder   CO
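The chain of checks in this example (does the current identity have access, who owns the object, continue as the owner) can be sketched in Python. The catalog below encodes the owners and readers from the slide; the `resolve` function, the `max_hops` parameter and its default of 2 are hypothetical and only model the idea, not Drill's implementation.

```python
CATALOG = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "/raw/cards.csv"},
    "/raw/cards.csv": {"owner": "John", "readers": set(), "source": None},
}


def resolve(user, name, catalog=CATALOG, max_hops=2):
    """Walk a view chain, switching identity to each object's owner along the way."""
    steps, hops, identity = [], 0, user
    while name is not None:
        entry = catalog[name]
        # Access check for whoever we are currently acting as.
        if identity != entry["owner"] and identity not in entry["readers"]:
            raise PermissionError(f"{identity} may not read {name}")
        # Acting as a different user than before counts as one impersonation hop.
        if identity != entry["owner"]:
            hops += 1
            if hops > max_hops:
                raise PermissionError(f"more than {max_hops} impersonation hops")
            identity = entry["owner"]
        steps.append((name, identity))
        name = entry["source"]
    return steps, hops


if __name__ == "__main__":
    print(resolve("Jack", "V_Analyst"))
```

Jack never needs rights on the raw file or on V_Scientist; each object only has to trust the owner of the view directly above it.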

[Diagram: V_Analyst reads from V_Scientist, which reads from the raw file (owner: John).]

Does Jack have access to V_Analyst? -> YES

Who is the owner of V_Analyst? -> Jane

Drill accesses V_Analyst as Jane (impersonation hop 1)

Does Jane have access to V_Scientist? -> YES

Who is the owner of V_Scientist? -> John

Drill accesses V_Scientist as John (impersonation hop 2)

Does John have permissions on raw file? -> YES

Who is the owner of raw file? -> John

Drill accesses source file as John (no imperson