Spark SQL Deep Dive @ Melbourne Spark Meetup

Transcript of Spark SQL Deep Dive @ Melbourne Spark Meetup

1. Spark SQL Deep Dive. Michael Armbrust, Melbourne Spark Meetup, June 1st 2015.

2. What is Apache Spark? A fast and general cluster computing system, interoperable with Hadoop and included in all major distros.
Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
Up to 100x faster (2-10x on disk), with 2-5x less code.

3. Spark Model: write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, ...)
> Automatically rebuilt on failure

4. More than Map & Reduce: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

5. On-Disk Sort Record: time to sort 100 TB
> 2013 record (Hadoop): 2100 machines, 72 minutes
> 2014 record (Spark): 207 machines, 23 minutes
> Spark also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org

6. Spark Hall of Fame (based on Reynold Xin's personal knowledge)
> Largest single-day intake: Tencent (1 PB+ per day)
> Longest-running job: Alibaba (1 week on 1 PB+ of data)
> Largest shuffle: Databricks PB Sort (1 PB)
> Largest cluster: Tencent (8000+ nodes)
> Most interesting app: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)

7. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.

    val lines = spark.textFile("hdfs://...")           // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    val messages = errors.map(_.split("\t")(2))
    messages.cache()

    messages.filter(_.contains("foo")).count()         // Action
    messages.filter(_.contains("bar")).count()

[Diagram: the driver ships tasks to three workers, each holding a block of lines and caching its partition of messages; results flow back to the driver.]
Result: full-text search of Wikipedia in ...

Spark SQL has been part of the core distribution since Spark 1.0 (April 2014).
[Charts: # of commits per month and # of contributors, 2014-03 through 2015-06]

15-17. Spark SQL (About SQL):

    SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)

> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
> Bindings in Python, Scala, and Java

18. The not-so-secret truth: Spark SQL is not just about SQL.

19. SQL: the whole story. Create and run Spark programs faster:
> Write less code
> Read less data
> Let the optimizer do the hard work

20. DataFrame, noun, [dey-tuh-freym]
  1. A distributed collection of rows organized into named columns.
  2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
  3. Archaic: previously SchemaRDD (cf. Spark < 1.3).

21. Write Less Code: Input & Output. Spark SQL's Data Source API can read and write DataFrames using a variety of formats, both built-in ({ JSON }, JDBC, and more) and external.
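As a rough illustration (not from the deck) of the DataFrame idea on slides 15-20, here is a minimal PySpark sketch in the Spark 1.3/1.4-era API; the JSON path and the name/age columns are assumed for the example:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-sketch")
    sqlContext = SQLContext(sc)

    # Build a DataFrame from a (hypothetical) JSON file; the schema is inferred.
    people = sqlContext.read.format("json").load("/path/to/people.json")

    # The same question asked two ways: the DataFrame API...
    people.where(people.age >= 18).select("name").show()

    # ...and SQL over a registered temporary table.
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age >= 18").show()

Both forms produce the same logical plan, which is the point the later slides on plan optimization develop.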
22-25. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats (built up over four slides):

    df = sqlContext.read
      .format("json")
      .option("samplingRatio", "0.1")
      .load("/home/michael/data.json")

    df.write
      .format("parquet")
      .mode("append")
      .partitionBy("year")
      .saveAsTable("fasterData")

> read and write create new builders for doing I/O.
> Builder methods are used to specify the format, partitioning, handling of existing data, and more.
> load(), save(), or saveAsTable() complete the builder chain and perform the actual I/O.

26. ETL Using Custom Data Sources:

    sqlContext.read
      .format("com.databricks.spark.git")
      .option("url", "https://github.com/apache/spark.git")
      .option("numPartitions", "100")
      .option("branches", "master,branch-1.3,branch-1.2")
      .load()
      .repartition(1)
      .write
      .format("json")
      .save("/home/michael/spark.json")

27. Write Less Code: Powerful Operations. Common operations can be expressed concisely as calls to the DataFrame API (a code sketch of these appears after slide 31 below):
> Selecting required columns
> Joining different data sources
> Aggregation (count, sum, average, etc.)
> Filtering

28-29. Write Less Code: Compute an Average.

Using Hadoop MapReduce:

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split("\t");
      output.set(Integer.parseInt(fields[1]));
      context.write(one, output);
    }

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
      int sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      average.set(sum / (double) count);
      context.write(key, average);
    }

Using RDDs:

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using DataFrames:

    sqlCtx.table("people") \
      .groupBy("name") \
      .agg(avg("age")) \
      .collect()

Using SQL:

    SELECT name, avg(age)
    FROM people
    GROUP BY name

30. Not Just Less Code: Faster Implementations.
[Chart: time to aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]

31. Demo: DataFrames. Using Spark SQL to read, write, slice and dice your data with simple functions.
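The four operations listed on slide 27 (select, join, aggregate, filter) can be sketched as one short DataFrame chain. This is an illustrative sketch, not from the deck; the users and events tables follow the later slides, but the duration and region columns are assumed:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import avg, count

    sc = SparkContext(appName="dataframe-ops-sketch")
    sqlCtx = SQLContext(sc)

    users = sqlCtx.table("users")                               # Hive table, as on slides 34-35
    events = sqlCtx.read.format("json").load("/data/events")    # path used on the slides

    stats = (events
             .join(users, events.user_id == users.user_id)      # join two data sources
             .where(events.duration > 0)                        # filter (duration is assumed)
             .groupBy(users.region)                             # region is assumed
             .agg(count(events.user_id).alias("n_events"),
                  avg(events.duration).alias("avg_duration"))   # aggregate
             .select("region", "n_events", "avg_duration"))     # select required columns

    stats.show()

Because the whole chain is declarative, Catalyst plans it as a single query before any work runs, which is what makes the faster implementations on slide 30 possible.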
32. Read Less Data. Spark SQL can help you read less data automatically:
> Converting to more efficient formats
> Using columnar formats (i.e. Parquet)
> Using partitioning (i.e., /year=2014/month=02/)
> Skipping data using statistics (i.e., min, max)
> Pushing predicates into storage systems (i.e., JDBC)

33. Optimization happens as late as possible, therefore Spark SQL can optimize across functions.

34. For example:

    def add_demographics(events):
        u = sqlCtx.table("users")                      # Load Hive table
        return (events
                .join(u, events.user_id == u.user_id)  # Join on user_id
                .withColumn("city", zipToCity(u.zip))) # udf adds a city column

    events = add_demographics(sqlCtx.load("/data/events", "json"))
    training_data = events.where(events.city == "Palo Alto") \
                          .select(events.timestamp) \
                          .collect()

[Diagram: the logical plan is a filter over a join of the events file and the users table; the join is expensive, and ideally only the relevant users are joined. The physical plan joins scan(events) with a filter over scan(users).]

35. The same program, but with users as a partitioned Hive table and events loaded from Parquet (sqlCtx.load("/data/events", "parquet")): the physical plan now applies predicate pushdown and column pruning, joining an optimized scan of events with an optimized scan of users.

36. Machine Learning Pipelines:

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    df = sqlCtx.load("/path/to/data")
    model = pipeline.fit(df)

[Diagram: df0 -> tokenizer -> df1 -> hashingTF -> df2 -> lr -> df3; fitting the Pipeline produces a Model.]

37. So how does it all work?

38. Plan Optimization & Execution. DataFrames and SQL share the same optimization/execution pipeline: a SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans, from which a Cost Model selects one Physical Plan; Code Generation then turns it into RDDs.

39. An example query:

    SELECT name
    FROM (
      SELECT id, name
      FROM People) p
    WHERE p.id = 1

[Logical plan, top to bottom: Project name; Project id, name; Filter id = 1; People]

40. Naive Query Planning: the same query and logical plan, translated directly into a physical plan: Project name; Project id, name; Filter id = 1; TableScan over People.

41. Optimized Execution. Writing imperative code to optimize all possible patterns is hard. Instead, write simple rules:
> Each rule makes one change
> Run many rules together to fixed point
[Diagram: the logical plan above is rewritten so that the physical plan becomes a single IndexLookup with id = 1 returning name.]

42. Prior Work: Optimizer Gene
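The behaviour described on slides 32-35 and the pipeline on slide 38 can be observed directly from PySpark. The following is a rough sketch, not from the deck; the paths and the year, timestamp, and user_id columns are assumed. It writes events partitioned by year as Parquet, reads them back with a filter, and prints the plans, where the pruned partitions and pushed-down predicate should be visible:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="catalyst-sketch")
    sqlCtx = SQLContext(sc)

    # Write events partitioned by year so later reads can skip whole directories.
    # Assumes the events data has a year column.
    events = sqlCtx.read.format("json").load("/data/events")
    events.write.mode("overwrite").partitionBy("year").parquet("/data/events_parquet")

    # Read back and filter; Catalyst pushes the year predicate into the scan
    # and prunes columns down to the ones actually selected.
    pq = sqlCtx.read.parquet("/data/events_parquet")
    recent = pq.where(pq.year == 2014).select("timestamp", "user_id")

    # Prints the parsed, analyzed, optimized, and physical plans.
    recent.explain(True)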