Shark: Hive on Spark

Borrowed slide 1

Optional Reading (additional material)

Prajakta Kalmegh
Duke University

What is Shark?

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Borrowed slide 2

Motivation

Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Borrowed slide 3

Hive Architecture

[Architecture diagram. Components: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Meta store; MapReduce; HDFS]

Borrowed slide 4

Shark Architecture

[Architecture diagram. Components: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Cache Mgr.; Meta store; Spark; HDFS]

Borrowed slide 5

Shark Engine: Extensions to Hive

• PDE (Partial DAG Execution)
    • Supports dynamic query optimization
    • Allows dynamic alteration of query plans based on data statistics collected at run-time
    • PDE is used to optimize the global structure of the plan at stage boundaries
• Skew Handling and Degree of Parallelism (DoP)
    • Choosing the DoP matters for both mappers and reducers (too few reducers can overload them)
    • Skew mitigation: fine-grained partitions are assigned to coalesced partitions using a greedy bin-packing heuristic
• Distributed Data Loading
    • Loading tasks use the data schema to extract individual fields from rows
    • Each task marshals a partition of data into its columnar representation
    • The resulting columns are stored in memory
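
The skew-mitigation bullet can be made concrete with a small sketch. The slide does not say which greedy bin-packing variant Shark uses, so the code below assumes one common choice: visit the fine-grained partitions from largest to smallest and place each into the currently lightest coalesced partition. The function name `coalesce_partitions` and the sample sizes are hypothetical and chosen only for illustration.

```python
import heapq


def coalesce_partitions(sizes, num_bins):
    """Assign fine-grained partitions to `num_bins` coalesced partitions.

    Greedy heuristic (an assumed variant, not taken from Shark's source):
    largest partition first, always into the lightest bin so far.
    Returns a list of bins, each a list of fine-grained partition indices.
    """
    # Min-heap of (current_load, bin_index); the lightest bin is on top.
    heap = [(0, b) for b in range(num_bins)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_bins)]

    # Visit partition indices in order of decreasing size.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for pid in order:
        load, b = heapq.heappop(heap)
        assignment[b].append(pid)
        heapq.heappush(heap, (load + sizes[pid], b))
    return assignment


# Hypothetical sizes of eight fine-grained partitions, one of them skewed.
sizes = [90, 10, 10, 10, 40, 40, 20, 20]
bins = coalesce_partitions(sizes, 3)
loads = [sum(sizes[p] for p in b) for b in bins]
```

With these sample sizes the skewed partition ends up alone in its bin while the small partitions are packed together, so no coalesced partition is heavier than the largest single input. The statistics that feed `sizes` would come from the run-time collection that PDE performs.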

6

Shark Engine: Extensions to Hive

• Join Optimizations
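
The slide names join optimizations without detail. As a hedged illustration of how run-time statistics could drive one, the sketch below assumes the choice is between a shuffle join and a map (broadcast) join, picked from observed input sizes. The threshold value and the names `choose_join_strategy` and `map_join` are hypothetical, not Shark's API.

```python
# Hypothetical cut-off for broadcasting the smaller input; not a Shark default.
BROADCAST_THRESHOLD_BYTES = 32 * 1024 * 1024


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from input sizes observed at run time."""
    if min(left_bytes, right_bytes) <= threshold:
        # One side is small enough to copy to every task: no shuffle needed.
        side = "left" if left_bytes <= right_bytes else "right"
        return ("map_join", side)
    # Both sides are large: repartition both inputs on the join key.
    return ("shuffle_join", None)


def map_join(small_rows, large_rows, key):
    """Hash the small input once, then stream the large input past it."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    for row in large_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}


# Tiny made-up inputs, used only to exercise the functions.
users = [{"uid": 1, "name": "john"}, {"uid": 2, "name": "mike"}]
visits = [{"uid": 2, "url": "/a"}, {"uid": 1, "url": "/b"}, {"uid": 3, "url": "/c"}]
joined = list(map_join(users, visits, "uid"))
```

The point of deciding at run time is that the size of an intermediate result, for example after a selective filter, is often unknown when the query is compiled.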

7

Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Shark employs column-oriented storage using arrays of primitive types

Column Storage

1     2     3
john  mike  sally
4.1   3.5   6.4

Row Storage

1  john   4.1
2  mike   3.5
3  sally  6.4

Benefit: similarly compact size to serialized data, but >5x faster to access
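
A minimal sketch of the row-to-column conversion, using the three example rows from the slide. Shark runs on the JVM and uses arrays of primitive types there; Python's `array` module stands in for those primitive arrays here, and the schema encoding and the function name `to_columns` are assumptions made for illustration.

```python
from array import array

# Assumed schema encoding: (field name, array typecode), or None for strings.
schema = [("id", "i"), ("name", None), ("score", "d")]

# Row storage: one tuple per record, as on the slide.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]


def to_columns(rows, schema):
    """Marshal a partition of rows into one column per schema field."""
    columns = {}
    for idx, (field, typecode) in enumerate(schema):
        # Use the schema position to extract this field from every row.
        values = [row[idx] for row in rows]
        # Numeric fields become packed primitive arrays; strings stay a list.
        columns[field] = array(typecode, values) if typecode else values
    return columns


cols = to_columns(rows, schema)

# A scan over one field touches a single contiguous array.
avg_score = sum(cols["score"]) / len(cols["score"])
```

Each numeric column holds raw values back to back rather than one object per value, which is the per-object overhead the slide refers to.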

Borrowed slide 9

Shark vs Spark SQL

10

Borrowed slide 11

Spark SQL

Borrowed slide 12

Borrowed slides 13-20
