How Conviva used Spark to Speed Up Video Analytics by 30x Dilip Antony Joseph (@DilipAntony)

Page 1: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics

How Conviva used Spark to Speed Up Video Analytics by 30x
Dilip Antony Joseph (@DilipAntony)

Conviva monitors and optimizes online video for premium content providers

What happens if you don’t use Conviva?

65 million video streams a day

Conviva data processing architecture

[Architecture diagram. Components: Video Player; Live data processing; Monitoring Dashboard; Hadoop for historical data; Spark; Reports; Ad-hoc analysis]

Group By queries dominate our Reporting workload

10s of metrics, 10s of group bys

SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2012_08_22' AND customer='XYZ'
GROUP BY videoName;

Group By query in Spark

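import org.apache.hadoop.io.NullWritable

// Load session summaries from HDFS. The SequenceFile is declared with
// SessionSummary keys and NullWritable values, so the record is read from the key.
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
    pathToSessionSummaryOnHdfs,
    classOf[SessionSummary], classOf[NullWritable])
  .flatMap {
    case (summary, _) => summary.fieldsOfInterest  // keep only the fields the reports need
  }

// Keep only the sessions for the desired day and cache them in memory.
val cachedSessions = sessions
  .filter(whereConditionToFilterSessionsForTheDesiredDay)
  .cache()

// Equivalent of: SELECT videoName, COUNT(1) ... GROUP BY videoName
val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap()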

Spark is 30x faster than Hive

45 minutes versus 24 hours for the weekly Conviva Geo Report (24 hours / 45 minutes = 32, so roughly 30x)

How much data are we talking about?

•  150 GB / week of compressed summary data

•  Compressed ~ 3:1

•  Stored as Hadoop Sequence Files

•  Custom binary serialization format

Spark is faster because it avoids reading data from disk multiple times

[Diagram comparing how the two systems run the same set of Group By queries]

Hive/MapReduce: every query re-reads the data

•  Group By Country: Read from HDFS, Decompress, Deserialize
•  Group By State: Read from HDFS, Decompress, Deserialize
•  Group By Video: Read from HDFS, Decompress, Deserialize
•  10s of Group Bys …
•  Hive/MapReduce startup overhead
•  Overhead of flushing intermediate data to disk

Spark: the data is read once and then served from memory

•  Group By Country: Read from HDFS, Decompress, Deserialize, Cache data in memory (cache only columns of interest)
•  Group By State: Read data from memory
•  Group By Video: Read data from memory
•  10s of Group Bys …
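
The difference between the two execution models can be sketched in plain Scala, without a Spark dependency. This is only an illustration under stated assumptions: `Session`, `loadSessions`, `countBy` and the sample rows are hypothetical names and data, not Conviva's code. `loadSessions` stands in for the expensive "read from HDFS, decompress, deserialize" step, and the `loads` counter records how often that step runs.

```scala
object CacheOnceSketch {
  // Hypothetical, trimmed-down session record holding only the columns of interest.
  final case class Session(country: String, state: String, videoName: String)

  // Counts how often the expensive load step runs.
  var loads = 0

  // Stand-in for "read from HDFS, decompress, deserialize"; rows are made-up sample data.
  def loadSessions(): Vector[Session] = {
    loads += 1
    Vector(
      Session("US", "CA", "video-a"),
      Session("US", "NY", "video-b"),
      Session("IN", "KA", "video-a"))
  }

  // One Group By query: count sessions per key.
  def countBy(sessions: Seq[Session])(key: Session => String): Map[String, Long] =
    sessions.groupBy(key).map { case (k, rows) => k -> rows.size.toLong }

  // The grouping keys: country, state, video.
  val keys: Seq[Session => String] = Seq(_.country, _.state, _.videoName)

  // Hive/MapReduce-style: every Group By pays the load cost again.
  def reloadPerQuery(): Seq[Map[String, Long]] =
    keys.map(key => countBy(loadSessions())(key))

  // Spark-style: load once, keep the decoded rows in memory, reuse them for every Group By.
  def loadOnce(): Seq[Map[String, Long]] = {
    val cached = loadSessions()
    keys.map(key => countBy(cached)(key))
  }
}
```

Both functions return the same counts; with three grouping keys, `reloadPerQuery()` runs the load step three times while `loadOnce()` runs it once. With 10s of Group Bys over the same data, the saving grows accordingly.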

Why not <some other big data system>?

•  Hive

•  MySQL

•  Oracle/SAP HANA

•  Column-oriented DBs

Spark just worked for us

•  30x speed-up

•  Great fit for our report computation model

•  Open-source

•  Flexible

•  Scalable

30% of our reports use Spark

We are working on more projects that use Spark

•  Streaming Spark for unifying batch and live computation

•  SHadoop – Run existing Hadoop jobs on Spark

Problems with Spark

[The architecture diagram again (Video Player; Live data processing; Monitoring Dashboard; Hadoop for historical data; Spark; Reports; Ad-hoc analysis), with the annotation "No Logo"]

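Spark queries are not very succinct

SQL:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2012_08_22' AND customer='XYZ'
GROUP BY videoName;

Spark:

import org.apache.hadoop.io.NullWritable

// Load session summaries from HDFS. The SequenceFile is declared with
// SessionSummary keys and NullWritable values, so the record is read from the key.
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
    pathToSessionSummaryOnHdfs,
    classOf[SessionSummary], classOf[NullWritable])
  .flatMap {
    case (summary, _) => summary.fieldsOfInterest  // keep only the fields the reports need
  }

// Keep only the sessions for the desired day and cache them in memory.
val cachedSessions = sessions
  .filter(whereConditionToFilterSessionsForTheDesiredDay)
  .cache()

// Equivalent of: SELECT videoName, COUNT(1) ... GROUP BY videoName
val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap()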

There is a learning curve associated with Scala, but ...

•  Type-safety offered by Scala is a great boon
   –  Code completion via Eclipse Scala plugin

•  Complex queries are easier in Scala than in Hive
   –  Cascading IF()s in Hive

•  No need to write Scala in the future
   –  Shark
   –  Java, Python APIs
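
To illustrate the point about cascading IF()s, here is a minimal sketch of how branching logic that would nest several IF() calls in Hive can be written as a Scala pattern match. The `Session` fields, the thresholds and the bucket names are all hypothetical and chosen only for illustration; they do not come from the talk.

```scala
object QualityBuckets {
  // Hypothetical session fields, for illustration only.
  final case class Session(joinFailed: Boolean, bufferingRatio: Double)

  // In Hive, the same logic would be written with nested IF() calls, roughly:
  //   IF(joinFailed, 'failed',
  //     IF(bufferingRatio > 0.05, 'poor',
  //       IF(bufferingRatio > 0.01, 'fair', 'good')))
  // The thresholds 0.05 and 0.01 are made-up values.
  def bucket(s: Session): String = s match {
    case Session(true, _)          => "failed" // a failed join wins over any ratio
    case Session(_, r) if r > 0.05 => "poor"
    case Session(_, r) if r > 0.01 => "fair"
    case _                         => "good"
  }
}
```

Each case reads as one rule, cases are tried top to bottom, and the compiler checks that every branch returns a `String`, which ties back to the type-safety point above.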

Additional Hiccups

•  Always on the bleeding edge – getting dependencies right

•  More maintenance/debugging tools required

Spark has been working great for Conviva for over a year

We are Hiring [email protected]  

http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva