Cloudera Impala Technical Overview

Cloudera Impala | Justin Erickson, Senior Product Manager | May 2013

description

In this slidecast, Justin Erickson from Cloudera presents a technical overview of Cloudera Impala, an SQL-on-Hadoop solution that enables users to do real-time queries of data stored in Hadoop clusters. "To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration." Watch the presentation video at: http://inside-bigdata.com/?p=2977 or listen to the MP3: http://archive.org/download/ClouderaImpalaPodcast/Cloudera%20Impala%20Podcast.mp3

Transcript of Cloudera Impala Technical Overview

Page 1: Cloudera Impala Technical Overview

Cloudera Impala
Justin Erickson | Senior Product Manager
May 2013

Page 2: Cloudera Impala Technical Overview

Agenda

• Why Impala?
• Architectural Overview
• Real-World Use Cases
• Alternative Approaches
• The Platform for Big Data

©2013 Cloudera, Inc. All Rights Reserved. 2

Page 3: Cloudera Impala Technical Overview

Why Hadoop?

• Scalability
  • Simply scales just by adding nodes
  • Local processing to avoid network bottlenecks

• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything then later analyze what you need

• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)


Page 4: Cloudera Impala Technical Overview

What’s Impala?

• Interactive SQL
  • Typically 5-65x faster than Hive (observed up to 100x faster)
  • Responses in seconds instead of minutes (sometimes sub-second)

• Nearly ANSI-92 standard SQL queries with Hive SQL
  • Compatible SQL interface for existing Hadoop/CDH applications
  • Based on industry standard SQL

• Natively on Hadoop/HBase storage and metadata
  • Flexibility, scale, and cost advantages of Hadoop
  • No duplication/synchronization of data and metadata
  • Local processing to avoid network bottlenecks

• Separate runtime from MapReduce
  • MapReduce is designed and great for batch
  • Impala is purpose-built for low-latency SQL queries on Hadoop


Page 5: Cloudera Impala Technical Overview

Benefits of Impala

More & Faster Value from “Big Data”
§ BI tools impractical on Hadoop before Impala
§ Move from 10s of Hadoop users per cluster to 100s of SQL users
§ No delays from data migration

Flexibility
§ Query across existing data
§ Select best-fit file formats (Parquet, Avro, etc.)
§ Run multiple frameworks on the same data at the same time

Cost Efficiency
§ Reduce movement, duplicate storage & compute
§ 10% to 1% the cost of analytic DBMS

Full Fidelity Analysis
§ No loss from aggregations or fixed schemas


Page 6: Cloudera Impala Technical Overview

Impala Query Execution

[Diagram: a SQL App connects through ODBC. Cluster-wide services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Arrow label: SQL request.]

1) Request arrives via ODBC/JDBC/Beeswax/Shell


Page 7: Cloudera Impala Technical Overview

Impala Query Execution

[Diagram: a SQL App connects through ODBC. Cluster-wide services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]

2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
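The idea behind step 3, running work on the nodes that already hold the data, can be sketched in a few lines of Python. This is only an illustration and not Impala's actual scheduler: the block-to-host map, the host names, and the function `assign_fragments` are all hypothetical.

```python
from collections import defaultdict

def assign_fragments(block_replicas):
    """Assign each block's scan fragment to a host that stores a replica.

    block_replicas: dict mapping block id -> list of hosts holding a replica.
    (Hypothetical input; a real system would obtain block locations from HDFS.)
    The least-loaded replica host is chosen, so work stays local and balanced.
    """
    load = defaultdict(int)   # host -> number of fragments assigned so far
    assignment = {}
    for block in sorted(block_replicas):
        hosts = block_replicas[block]
        # Pick the replica host with the fewest fragments; ties break by name.
        chosen = min(hosts, key=lambda h: (load[h], h))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment

if __name__ == "__main__":
    replicas = {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-a", "node-c"],
        "blk_3": ["node-b", "node-c"],
    }
    for block, host in assign_fragments(replicas).items():
        print(block, "->", host)
```

Every fragment lands on a host that holds a replica of its block, so no scan has to pull its input over the network.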


Page 8: Cloudera Impala Technical Overview

Impala Query Execution

[Diagram: a SQL App connects through ODBC. Cluster-wide services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Arrow label: Query results.]

4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
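The scatter-gather flow in steps 4 and 5 can be sketched in plain Python. This is a minimal illustration assuming a simple count-per-key aggregation; the functions `execute_fragment` and `coordinate` and the sample rows are hypothetical and do not reflect Impala's internal interfaces. Generators stand in for streaming between nodes.

```python
from collections import Counter

def execute_fragment(local_rows):
    """Executor side: aggregate the rows stored locally, then stream
    partial results as (key, partial_count) pairs, one at a time."""
    partial = Counter(row["country"] for row in local_rows)
    for key, count in partial.items():
        yield key, count

def coordinate(partitions):
    """Coordinator side: merge the partial counts from every executor,
    then stream the final rows back to the client in key order."""
    totals = Counter()
    for local_rows in partitions:
        for key, count in execute_fragment(local_rows):
            totals[key] += count
    for key in sorted(totals):
        yield key, totals[key]

if __name__ == "__main__":
    # Each inner list stands for the rows stored on one node.
    partitions = [
        [{"country": "US"}, {"country": "DE"}],
        [{"country": "US"}],
        [{"country": "FR"}, {"country": "US"}],
    ]
    # Similar in spirit to: SELECT country, COUNT(*) ... GROUP BY country
    for row in coordinate(partitions):
        print(row)
```

Each executor ships only its small partial aggregate rather than raw rows, which is what keeps the intermediate traffic between nodes low.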


Page 9: Cloudera Impala Technical Overview

Impala and Hive

Shares Everything Client-Facing
§ Metadata (table definitions)
§ ODBC/JDBC drivers
§ SQL syntax (Hive SQL)
§ Flexible file formats
§ Machine pool
§ Hue GUI

But Built for Different Purposes
§ Hive: runs on MapReduce and ideal for batch processing
§ Impala: native MPP query engine ideal for interactive SQL

[Diagram: shared layers for Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, etc.; HBase: RECORDS), Integration, Resource Management, and Metadata. Above them, Hive (SQL Syntax on the MapReduce Compute Framework) serves Batch Processing, and Impala (SQL Syntax + Compute Framework) serves Interactive SQL.]


Page 10: Cloudera Impala Technical Overview

Impala Use Cases

Cost-effective, ad hoc query environment that offloads the data warehouse for:

• Interactive BI/analytics on more data
• Asking new questions
• Query-able archive w/ full fidelity
• Data processing with tight SLAs


Page 11: Cloudera Impala Technical Overview

Global Financial Services Company

Saving 90% on incremental EDW spend & improving performance by 5x

• Offload data warehouse for query-able archive
• Store decades of data cost-effectively
• Process & analyze on the same system
• Improve capabilities through interactive query on more data


Page 12: Cloudera Impala Technical Overview

Six3 Systems

Boosting performance by 20X for mission-critical, real-time cyber security

• Analyze unstructured data with flexibility & real-time response
• Integrate with existing desktop & BI tools
• Deploy in minutes with Cloudera Manager


Page 13: Cloudera Impala Technical Overview

Expedia

Implementing self-service BI on big data, reducing data latency by 50%

• Offload data warehouse for archiving, ETL & analytics
• Unify IT environment
• Continuously ingest & analyze at scale
• Drive greater usability & adoption of big data stack


Page 14: Cloudera Impala Technical Overview

Our Design Strategy

An Integrated Part of the Hadoop System

• One pool of data
• One metadata model
• One security framework
• One set of system resources

[Diagram: Engines for Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (IMPALA), Machine Learning (MAHOUT, DATAFU), and others run over shared Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, etc.; HBase: RECORDS), Integration, Resource Management, and Metadata layers.]


Page 15: Cloudera Impala Technical Overview

Not All SQL on Hadoop is Created Equal

Batch MapReduce
§ Make MapReduce faster
§ Slow, still batch

Remote Query
§ Pull data from HDFS over the network to the DW compute layer
§ Slow, expensive

Siloed DBMS
§ Load data into a proprietary database file
§ Rigid, siloed data, slow ETL

Impala
§ Native MPP query engine that’s integrated into Hadoop
§ Fast, flexible, cost-effective


Page 16: Cloudera Impala Technical Overview

The Impala Advantage

BI Partners: Building on the Enterprise Standard

[Logo: POWERED BY IMPALA]


Page 17: Cloudera Impala Technical Overview

It’s Not Just About SQL on Hadoop

The Platform for Big Data

• Single platform for processing & analytics
• Scales to ‘000s of servers
• No upfront schema
• 10% the cost per TB
• Open source platform

[Diagram: Engines for Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (IMPALA), Machine Learning (MAHOUT, DATAFU), and others run over shared Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO…; HBase: RECORDS), Integration, Resource Management, and Metadata layers, with Management | Support alongside.]


Page 18: Cloudera Impala Technical Overview