cafarella auto bigdata - University of Michigan · cafarella_auto_bigdata.pptx Author: Mike...

Big Data and Automotive IT Michael Cafarella University of Michigan September 11, 2013


Page 1: cafarella auto bigdata - University of Michigan · cafarella_auto_bigdata.pptx Author: Mike Cafarella Created Date: 20130910154602Z ...

Big Data and Automotive IT

Michael Cafarella
University of Michigan
September 11, 2013


A Banner Year for Big Data


Big Data

• Who knows what it means anymore?


Big Data

• Who knows what it means anymore?
• Associated with:
  – Google, Facebook, Twitter
  – Hadoop, MapReduce
  – Cluster computing, cloud computing
  – Machine learning, predictive analytics, data science
  – Magazine covers
• For a large range of tasks, data availability is no longer a serious constraint


Data + Statistics = Predictors

Web pages + user clicks → Google web search
Movie views + user ratings → Netflix recommendations
Tweets about illness → Disease outbreak estimates
Cameras + laser scanners → Self-driving cars
Cell phone sales records → Customer churn prediction

• Statistics grew up “data-poor”
• Old techniques are now very effective thanks to data
• Enabled by the Web, cheap disks, cheap sensors
• Google was among the first to see it coming


Agenda

• A Sample Big Data Task
• Possible Tasks in Automotive IT



Tweets for Macroeconomic Prediction

• Why use Twitter?
• Tweets contain valuable information freely provided by the Tweeter in real time
  – Quick and cheap relative to surveys
  – Better at capturing turning points
  – Permit retrospective analysis because beliefs and actions are “archived”
• Let’s try “unemployment”


The Data

• Tweets are short timestamped messages
• Explicit metadata: author, geography, time
• Implicit metadata: gender, age, many others
• Roughly 1B Tweets every 2 days
• More than 15% of online American adults use Twitter


Processing Pipeline

• How to turn raw text into predictions?


Processing Pipeline

1. Obtain ~13B Tweets in 2011-2013 (compressed ~5 TB)
2. Enumerate and count all unique k-grams in data
3. Group counts by week, build all (k-gram, signal) pairs
4. Choose unemployment-related ones, e.g. “I need a job”, “I got fired”, etc.
5. Use signals to build model to predict new claims

Example counts (date, k-gram, count):

10/17   i need a job     491
12/15   i love you       5092
1/28    justin bieber    940,291

Example (k-gram, signal) pair: (I need a job, [signal plot])
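Step 3 can be sketched as below. This is a minimal illustration, not the pipeline used in the talk: it assumes counts arrive as (day, k-gram, count) records, that a "signal" is simply the weekly count series for one k-gram, and that weeks start on Monday. The names `week_start` and `build_signals` are hypothetical, and the records are made up. A corpus of ~13B Tweets would need a cluster framework rather than in-memory dictionaries.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def build_signals(daily_counts):
    """Turn (day, kgram, count) records into {kgram: {week_start: weekly_count}}."""
    signals = defaultdict(lambda: defaultdict(int))
    for day, kgram, count in daily_counts:
        signals[kgram][week_start(day)] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {kgram: dict(weeks) for kgram, weeks in signals.items()}


# Made-up records for illustration only; not data from the talk.
records = [
    (date(2011, 10, 17), "i need a job", 3),
    (date(2011, 10, 19), "i need a job", 2),
    (date(2011, 10, 24), "i need a job", 4),
    (date(2011, 10, 18), "i love you", 7),
]
signals = build_signals(records)
```

Each entry of `signals` is one (k-gram, signal) pair: the k-gram is the key and its weekly series is the value.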


Deriving Signals

Each signal is derived from counts of k-grams
• Any consecutive sequence of k or fewer words
• Tweet of N words yields ~kN k-grams
• We used k=4 (enough for “I lost my job”)

“I teach at the University of Michigan”
• 1: I, teach, at, the, University, of, Michigan
• 2: “I teach”, “teach at”, “at the”, “the University”, …
• 3: “I teach at”, “teach at the”, “at the University”, …

Our Tweet corpus contains 2.55 billion unique 4-grams in English that appear at least three times
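A minimal sketch of k-gram enumeration, assuming whitespace tokenization and lowercasing (both are assumptions; the talk does not say how Tweets were tokenized). The function name `kgrams` is hypothetical. For a Tweet of N ≥ k words, the exact number of k-grams is kN − k(k−1)/2, which is where the ~kN estimate comes from.

```python
def kgrams(text: str, k: int = 4) -> list[str]:
    """Return every run of 1..k consecutive words in `text`, lowercased."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, k + 1)              # run length: 1, 2, ..., k
        for i in range(len(words) - n + 1)    # start position of the run
    ]


grams = kgrams("I teach at the University of Michigan", k=4)
```

For this 7-word example with k=4, the formula gives 4·7 − 4·3/2 = 22 k-grams, against the rough estimate kN = 28.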


Choosing Signals

• Too many to examine by hand
• Good signals may not be obvious
  – Lysol → flu
• Obvious signals may not be good
  – unemployment benefits
• Automated methods would be great, but are very difficult. Our research focuses on this problem
• For now, manually formulate plausible ones
  – I lost my job, I need a job, I want to work


Experiments  


Signals

Category               Terms
Search signals         find a job, looking for a job, looking for work, need a job
Lost job signals       canned, laid off, fired (get fired, got fired, be fired, fired from, was fired, been fired, fired lol, being fired, just fired)
Unemployment signal    unemployment

• Exclude “benefits,” “fired up,” others
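One way the term lists and exclusions might be applied is sketched below. This is an illustration under assumptions, not the talk's method: it assumes a Tweet containing any excluded phrase is dropped entirely, it matches on word boundaries, and it uses only the listed "fired" variants rather than the bare word. The names `SIGNAL_TERMS`, `EXCLUDED_PHRASES`, and `categorize` are hypothetical.

```python
import re

# Term lists copied from the Signals table; the category keys are made-up labels.
SIGNAL_TERMS = {
    "search": ["find a job", "looking for a job", "looking for work", "need a job"],
    "lost_job": ["canned", "laid off", "get fired", "got fired", "be fired",
                 "fired from", "was fired", "been fired", "fired lol",
                 "being fired", "just fired"],
    "unemployment": ["unemployment"],
}
EXCLUDED_PHRASES = ["benefits", "fired up"]


def _phrase_pattern(phrases):
    """Compile a regex that matches any of `phrases` on word boundaries."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(rf"\b(?:{alternatives})\b")


_EXCLUDE_RE = _phrase_pattern(EXCLUDED_PHRASES)
_CATEGORY_RES = {name: _phrase_pattern(terms) for name, terms in SIGNAL_TERMS.items()}


def categorize(tweet: str) -> set[str]:
    """Return the signal categories a Tweet matches, or an empty set if excluded."""
    text = tweet.lower()
    if _EXCLUDE_RE.search(text):  # assumed rule: any excluded phrase drops the Tweet
        return set()
    return {name for name, pattern in _CATEGORY_RES.items() if pattern.search(text)}
```

Counting the Tweets per category per week would then give one signal per category.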


[Figure: Initial Claims (SA) versus Twitter Index, Learn log(2). Y-axis: thousands, weekly, 320 to 460. X-axis: 2011 to 2013. Series: Initial Claims, Twitter Index.]


Do Twitter Signals Carry Incremental Information?

• A panel of economists predicts unemployment and makes mistakes
• Can we predict the economists’ surprise?
  – If Twitter adds nothing new, this should be impossible
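The test above can be sketched as a simple regression. This is a minimal illustration under assumptions: "surprise" is taken to be the actual figure minus the economists' consensus forecast, the Twitter signal is a single index per period, and the fit is ordinary least squares on one predictor. The function name `fit_surprise_model` and the numbers are made up for illustration; they are not the talk's data or its model.

```python
def fit_surprise_model(twitter_index, actual, consensus):
    """Fit surprise = intercept + slope * twitter_index by ordinary least squares.

    Returns (intercept, slope, r_squared). Surprise is assumed to be
    actual minus consensus forecast.
    """
    surprise = [a - c for a, c in zip(actual, consensus)]
    n = len(surprise)
    mean_x = sum(twitter_index) / n
    mean_y = sum(surprise) / n
    sxx = sum((x - mean_x) ** 2 for x in twitter_index)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(twitter_index, surprise))
    syy = sum((y - mean_y) ** 2 for y in surprise)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return intercept, slope, r_squared


# Made-up toy values (claims in thousands); for illustration only.
index = [0.0, 1.0, 2.0, 3.0]
consensus = [370.0, 370.0, 370.0, 370.0]
actual = [365.0, 372.0, 371.0, 380.0]
intercept, slope, r_squared = fit_surprise_model(index, actual, consensus)
```

If Twitter carried no incremental information, the slope and R² would be near zero. A serious check would evaluate the fit on held-out periods rather than in-sample.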

 


Creating Measures of Labor Market Flows using Social Media

[Figure: Initial Claims for Unemployment Benefits, Revised Data. X-axis: months (J A S O N D J F M A M J J), labeled 2012. Y-axis: -30000 to 40000. Series: Surprise, Predicted with Twitter.]


Agenda

• A Sample Big Data Task
• Possible Tasks in Automotive IT


Finding Novel Applications: Some Rules of Thumb

1. Data is the critical resource, often overlooked
   – Great data makes a middling analyst look good
   – The reverse isn’t true
2. Look for “data exhaust” to exploit
   – Sales records, transaction logs, phone logs
3. Datasets are synergistic
   – Weather data is boring
   – Weather + repair data is compelling
4. Resource optimization often pays off quickly
5. Novel services possible, yield bigger impact
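Rule 3 amounts to joining datasets on a shared key. A minimal sketch, assuming both datasets are keyed by calendar day; the function name `join_on_day`, the field meanings, and the values are all made up for illustration.

```python
from datetime import date


def join_on_day(weather_by_day, repairs_by_day):
    """Inner-join two {day: value} mappings into {day: (weather, repairs)}."""
    shared_days = weather_by_day.keys() & repairs_by_day.keys()
    return {
        day: (weather_by_day[day], repairs_by_day[day])
        for day in sorted(shared_days)
    }


# Hypothetical inputs: daily low temperature (deg C) and battery repair counts.
weather = {date(2013, 1, 7): -12.0, date(2013, 1, 8): -3.0, date(2013, 1, 9): 1.0}
repairs = {date(2013, 1, 7): 41, date(2013, 1, 8): 18}
joined = join_on_day(weather, repairs)
```

Only days present in both datasets survive the join; the joined rows are what a predictor of service load would be trained on.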


Resource Optimization

• Predict demand for models & colors (possibly prior to manufacture)
  – Can be localized to states, probably counties
  – Especially useful for dealer inventory management
  – Also possible for parts, components, accessories
• Predict service issues
  – Manufacturer warranty liability
  – Daily load on service staff
• Predict buyer-specific propensity to purchase
  – (See Charles Duhigg, NYTimes, 2/19/2012)

 


Novel Services

• Auto Owners
  – Better prediction => accurate contract pricing
  – Better service and “Refuel now!” warnings
  – Next-purchase recommendations
• Traffic and Infrastructure
  – Traffic prediction (e.g., “Tell me when to leave work”)
  – Street-specific maintenance and salting
  – Intersection-specific accident prediction
  – Find fun drives