Impala Resource Management - OUTDATED

© Cloudera, Inc. All rights reserved. Impala Resource Management: A Brief Overview. Matthew Jacobs | @mattjacobs. November 2015. Relevant through Impala 2.2/CDH5.4.

Transcript of Impala Resource Management - OUTDATED

Page 1: Impala Resource Management - OUTDATED

© Cloudera, Inc. All rights reserved.

Impala Resource Management: A Brief Overview
Matthew Jacobs | @mattjacobs
November 2015
Relevant through Impala 2.2/CDH5.4

Page 2: Impala Resource Management - OUTDATED

Impala Resource Management: Overview

• Problem: how to best utilize cluster resources
State of the world as of Impala 2.2/CDH5.4:
• Within Impala
  • READY FOR USE: Built-in Admission Control (introduced in Impala 1.3/CDH 5.0)
• Between Impala and the rest of the world
  • READY FOR USE: “Static Partitioning” from Cloudera Manager
  • NOT READY: Integration with YARN
    • Experimental integration shipped in Impala 1.3/CDH 5.0
    • Some known issues exist, do not use it today! More on this later…
    • We’re actively working on this, stay tuned!

Page 3: Impala Resource Management - OUTDATED

Talk Overview

This is a very brief overview! Many details we can’t cover in 20min :(
• How to be successful today (including with Impala 2.3/CDH5.5)
• Overview of Impala on YARN
  • Architecture
  • Why you can’t use it yet
  • How it might look when you can

Page 4: Impala Resource Management - OUTDATED

“Resource Management” Today

• Use one or both of:
  • Static Partitioning with Cloudera Manager (also called “Static Service Pools”)
  • Impala’s built-in Admission Control
• Static Partitioning: dedicate resources for Impala, HBase, YARN, etc.
  • Easy to use and works well. Set up by Cloudera Manager, uses cgroups
  • E.g. Impala gets 100GB/30% CPU, HBase gets 50GB/20% CPU, etc.
• Admission Control: throttle Impala queries
  • Set a limit on the max # of queries or the max memory used by those queries
  • E.g. queue queries once more than 20 queries are running concurrently, or queue once more than 100GB is used
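The admission rule in the last bullet can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the `Pool` class, its field names, and the `submit` method are hypothetical and are not Impala's implementation or API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical model of one admission-control pool (not Impala's code)."""
    max_queries: int            # e.g. 20 concurrent queries
    max_mem_gb: float           # e.g. 100 GB in total
    running: int = 0
    mem_in_use_gb: float = 0.0
    queue: deque = field(default_factory=deque)

    def submit(self, query_id: str, mem_gb: float) -> str:
        """Admit the query if both limits allow it, otherwise queue it."""
        over_count = self.running + 1 > self.max_queries
        over_mem = self.mem_in_use_gb + mem_gb > self.max_mem_gb
        if over_count or over_mem:
            self.queue.append((query_id, mem_gb))
            return "QUEUED"
        self.running += 1
        self.mem_in_use_gb += mem_gb
        return "ADMITTED"
```

With `Pool(max_queries=20, max_mem_gb=100)`, a query is queued as soon as admitting it would exceed either limit; nothing in this sketch kills or throttles a query that is already running.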

Page 5: Impala Resource Management - OUTDATED

When to Use AC? Static Partitioning?

With Static Partitioning, with Admission Control:
• Using Impala with other systems (e.g. Hive, Spark) and need to guarantee each gets resources
• Heavy Impala workload, need to make sure queries aren’t stepping on each other

With Static Partitioning, without Admission Control:
• Using Impala with other systems and need to guarantee each gets resources
• Light to moderate Impala workload, not using all available resources yet

Without Static Partitioning, with Admission Control:
• Impala-only cluster, or other systems have very light, non-competing workloads
• Heavy Impala workload, need to make sure queries aren’t stepping on each other

Without Static Partitioning, without Admission Control:
• Enough cluster resources are available for all workloads to consume as much as necessary

Page 6: Impala Resource Management - OUTDATED

(Aside: A Plethora of Mem Limits)

• Process (impalad) memory limit
  • Max memory the process can use across all queries. When a query consumes memory such that the process hits this limit, the query is killed.
  • Set with the “--mem_limit” impalad command-line argument, or “Impala Daemon Memory Limit” in CM. The value is specified in terms of single-impalad memory.
• Pool (admission control) memory limit
  • Max memory the queries in a pool/queue can use. The value is used only to admit queries; it is not enforced once queries are admitted. The value is specified as the cluster-wide limit, i.e. the aggregate limit across all impalads.
  • http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
• Query (query option) memory limit
  • Max memory a query can use; if a query uses more than this, it may have to be killed (if it can’t spill).
  • Set via the “set mem_limit=Xg” query option. A default query option can be set via impalad command-line arguments (see the next slide).
  • The value is specified in terms of single-impalad memory, e.g. Xg per node.
  • http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_mem_limit.html
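Because the query limit is per impalad and the pool limit is cluster-wide, comparing them requires a unit conversion. The Python sketch below shows that arithmetic. It assumes, purely for illustration, that a query runs on every impalad; the function names are hypothetical and not part of Impala.

```python
def cluster_wide_gb(per_node_gb: float, num_impalads: int) -> float:
    """Convert a per-impalad figure into a cluster-wide aggregate.

    Assumes the query runs on every impalad (an illustrative assumption).
    """
    return per_node_gb * num_impalads


def fits_in_pool(query_mem_limit_gb: float, num_impalads: int,
                 pool_limit_gb: float, pool_in_use_gb: float = 0.0) -> bool:
    """Check a per-node query limit against a cluster-wide pool limit."""
    needed = cluster_wide_gb(query_mem_limit_gb, num_impalads)
    return pool_in_use_gb + needed <= pool_limit_gb
```

For example, under these assumptions a query with mem_limit=5g on 10 impalads counts as 50GB against the pool, so a 100GB pool has room for two such queries at once.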

Page 7: Impala Resource Management - OUTDATED

Important! AC with Mem Limits is Tricky

• Admission based on pool memory limits will use:
  • The query memory limit, if it is set (set MEM_LIMIT=Xg;)
  • Otherwise, an estimate from planning, which is usually wrong!
• Do not use pool memory limits unless you also set query memory limits
• Consider setting a default value for the ‘mem_limit’ query option
  • Set via the ‘--default_query_options’ impalad argument
  • E.g. --default_query_options='mem_limit=5g'
  • Can still override the default with the ‘set mem_limit=X;’ query option
• Picking a good memory limit is hard; use CM’s charts to help understand your workload
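The order of precedence described on this slide can be written down as a small Python sketch. The function names and the use of plain dictionaries for options are assumptions made for illustration; they do not mirror Impala's internals.

```python
from typing import Optional


def resolve_mem_limit(default_query_options: dict,
                      session_options: dict) -> Optional[str]:
    """A 'set mem_limit=...' in the session overrides the impalad-wide default."""
    return session_options.get("mem_limit",
                               default_query_options.get("mem_limit"))


def memory_for_admission(mem_limit: Optional[str], planner_estimate: str) -> str:
    """Use the query memory limit if one is set, else fall back to the estimate."""
    return mem_limit if mem_limit is not None else planner_estimate
```

With `{"mem_limit": "5g"}` as the default and an empty session, admission uses 5g; with neither set, it falls back to the planner estimate, which is the case the slide warns against.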

Page 8: Impala Resource Management - OUTDATED

“Resource Management” Today, Summary

• Today: Use Admission Control and Static Partitioning
• We skipped over a lot of details, see the docs for more information
  • Impala Admission Control: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
  • “Static Partitioning” in Cloudera Manager (also called “Static Service Pools”): http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_service_pools.html
• Ask us questions on impala-[email protected]

Page 9: Impala Resource Management - OUTDATED

Impala on YARN

• YARN is a “resource negotiator” that helps share cluster resources within Hadoop
• Works well for MapReduce and similar batch-oriented processing engines
• Doesn’t work well for services/frameworks that need:
  • Long-running processes
  • Gang scheduling
  • Very low-latency scheduling
• Doesn’t work so well for Impala
  • (And also HBase, MPI, Presto, custom apps, etc.)

Page 10: Impala Resource Management - OUTDATED

Llama to the Rescue

• Llama = Long Lived Application MAster
• On GitHub: http://cloudera.github.io/llama/index.html
• An interface between:
  • YARN’s ApplicationMaster (AM) model (batch jobs where tasks are each a process, coordinated by an AM)
  • Impala’s low-latency, in-process query model
• Llama provides:
  • Gang scheduling
  • “Container” caching (to reduce resource acquisition cost)

Page 11: Impala Resource Management - OUTDATED

How Llama fits in

Page 12: Impala Resource Management - OUTDATED

How Llama fits in

Page 13: Impala Resource Management - OUTDATED

How Llama fits in

Page 14: Impala Resource Management - OUTDATED

How Llama fits in

Page 15: Impala Resource Management - OUTDATED

How Llama fits in

Page 16: Impala Resource Management - OUTDATED

Gang scheduling

• YARN returns resources in a trickle, as they become available
• For MR this is perfect, as tasks are mostly independent (and checkpoint to disk)
• For low-latency queries, we require all resources to be available at once so that query tasks can stream results to one another
• Llama buffers resources between YARN and Impala to make resource requests appear atomic and indivisible
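The buffering idea can be illustrated with a short Python sketch: grants arrive one at a time, and nothing is handed to the query until every requested node has been granted. The `GangRequest` class is hypothetical and only models the behavior described above; it is not Llama's API.

```python
class GangRequest:
    """Hypothetical buffer that releases grants only when the full set is in."""

    def __init__(self, needed_nodes):
        self.needed = set(needed_nodes)
        self.granted = set()

    def on_grant(self, node):
        """Record one grant from the trickle.

        Returns the complete, sorted list of nodes once every requested node
        has been granted, and None while the request is still incomplete.
        """
        if node in self.needed:
            self.granted.add(node)
        if self.granted == self.needed:
            return sorted(self.granted)
        return None
```

From the caller's point of view the request is all-or-nothing: it sees `None` until the last grant arrives, then receives the whole set at once.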

Page 17: Impala Resource Management - OUTDATED

Resource caching

• Every container requires YARN to make an expensive resource allocation decision
• We ask Llama to cache resources between requests
• Containers stay in their queue in Llama, until YARN forcefully reclaims them
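A minimal Python sketch of the caching behavior follows. The `ContainerCache` class and the `allocate` callback (standing in for the expensive YARN allocation) are hypothetical names chosen for illustration, not Llama's actual interface.

```python
class ContainerCache:
    """Hypothetical per-queue cache of idle containers."""

    def __init__(self, allocate):
        self._allocate = allocate   # stand-in for an expensive YARN allocation
        self._idle = {}             # queue name -> list of cached containers

    def acquire(self, queue):
        """Reuse a cached container if one is idle, else allocate a new one."""
        idle = self._idle.get(queue, [])
        if idle:
            return idle.pop()
        return self._allocate(queue)

    def release(self, queue, container):
        """Keep the container cached in its queue instead of returning it."""
        self._idle.setdefault(queue, []).append(container)

    def reclaim(self, queue):
        """Model YARN forcefully taking back the cached containers of a queue."""
        return self._idle.pop(queue, [])
```

A released container is reused by the next request on the same queue without a new allocation; only after a reclaim does the next request pay the allocation cost again.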

Page 18: Impala Resource Management - OUTDATED

Impala on YARN: Current Status

• Experimental integration was shipped in Impala 1.3 / CDH 5.0
• Not ready for use yet!
  • A number of known bugs, see umbrella JIRA IMPALA-2370 to track
  • Some (but not all) important fixes in the upcoming Impala 2.3 / CDH 5.5 release
  • Ongoing scale and performance testing work is needed to provide guidance
• In a future release (post-Impala 2.3), we will be able to recommend usage for some workloads, with guidance

Page 19: Impala Resource Management - OUTDATED

Thank you
@mattjacobs