Scaling Mobile Millennium - UC Berkeley AMP...

34
August 22, 2012 Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma, Michael J. Franklin, Pieter Abbeel, Alexandre M. Bayen UC Berkeley

Transcript of Scaling Mobile Millennium - UC Berkeley AMP...

Page 1: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012

Scaling Mobile Millennium with BDAS

Timothy Hunter,Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma,Michael J. Franklin, Pieter Abbeel, Alexandre M. Bayen

UC Berkeley

Page 2: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 2/34

Machine learning at scale

● Combining A, M and P in a real application:● Complex models (car traffic estimation)● Crowd-sourced data (mobile phones)● Computations on the cloud

● How we run Spark inside Mobile Millennium

Page 3: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 3/34

Plan

● Why car traffic estimation● Overview of Mobile Millennium● 2 minutes of applied Machine Learning● Programming with the Spark framework● Conclusion: the good, the bad, the not so beautiful

Page 4: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 4/34

Need for good traffic estimation

● Traffic congestion affects everyone● Up-to-date estimation is critical● Complex for urban streets (arterial roads)

Page 5: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 5/34

Real-time processing of fleet data

● Input: sampled position of taxicabs

● Observed every minute

Page 6: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 6/34

Estimating the travel times

● Input: sampled position of taxicabs

● Observed every minute

● Covers the whole SF Bay

● 0.5 Million points / day(60M / day total)

● 0.1 Million road links

Page 7: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 7/34

Filtering of fleet data

Preprocessing:

● Recovering trajectories from GPS points

Page 8: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 8/34

Mobile Millennium

● A cyberphysical system for participatory sensing

Page 9: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 9/34

Mobile Millennium

● A cyberphysical system for participatory sensing

Today:Batch jobs outsourcedto the cloud

Today:Batch jobs outsourcedto the cloud

Page 10: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 10/34

Estimation of arterial traffic

● Input:● Pieces of trajectories between GPS points

● Output: probability distributions of travel time● For each link● Parametrized by vector θ (mean and variance of link

travel time)

©G

oogl

e, I

nc.

Page 11: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 11/34

The way things work

● Example road network

● Associated link travel times:

Page 12: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 12/34

The way things work

Page 13: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 13/34

The way things work

Measurement sent

Page 14: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 14/34

The way things work

Measurement sent

Page 15: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 15/34

The way things work

...

Page 16: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 16/34

The way things work

...

Page 17: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 17/34

Life is not so simple

Measurement sent

Page 18: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 18/34

Life is not so simple

● Long time between observations Measurement sent

Page 19: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 19/34

Life is not so simple

● Long time between observations Measurement sent

● Solution: sample!

Page 20: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 20/34

Life is not so simple

● Long time between observations Measurement sent

Page 21: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 21/34

Life is not so simple

● Long time between observations Measurement sent

Page 22: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 22/34

Life is not so simple

● Long time between observations Measurement sent

Page 23: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 23/34

Machine learning without saying it

● Procedure called Expectation Maximization● Iterative in nature:

● Alternates between sampling (E step) and learning (M step)

● Some figures:● 50k road links ( parameters)● 50M observations (15GB, avg. 4 links / observation)● 200M partial travel times● x1000 samples per partial travel times

Page 24: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 24/34

System workflow

Start link parameters(on master node)

Observations(distributed, persisted across nodes)

Page 25: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 25/34

System workflow

Network parameters(distributed over the nodes)

Page 26: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 26/34

System workflow

Travel time samplesFor each observation link

Page 27: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 27/34

System workflow

Travel time samples aggregatedon a link basis

Page 28: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 28/34

System workflow

New parameters are generatedThe maximize sampled travel timesfor each link.

The master collects the vector ofnew parameters.

Page 29: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 29/34

Using the Spark programming model

Main loop of the programval observations = spark.textFile(“hdfs:...”)  .map(parseObservation _)  .cache()var params = // Initialize models parameterswhile (!converged) {

  val samples = observations.flatMap( obs =>    generateSamples(obs, params))

  params = samples.groupByKey(false).map(    case (linkId, vals) =>         mostLikelyParam(linkId, vals)  ).collect()}

Step 1 (E step)

Step 2 (M step)

Page 30: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 30/34

The good

● Before using Spark:● 3.5x slower than real-time● Could not even handle all the data

● With Spark:● Similar programming interface (methods on scala

collections)● Very good scalability (near linear)● Each iteration 3x faster than reloading from disk

cores

runtime

NERSC cluster:quad-core Xeon4X QDR InfiniBand interconnect

Page 31: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 31/34

Efficient utilization of memory

● The observation data is stored in memory:● Be careful with the memory footprint● Look at logs to monitor GC status

● We cache pointer-based structures ● Significant overhead in the JVM

● Workaround: use compact collection structures (arrays) and make liberal use of .toArray()

● Workaround: RDDs of serialized data

Page 32: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 32/34

Broadcast of large parameters

● Need to share data between all workers:● At the start of the job (network description, > 40MB)● Between iterations (updated parameters θ)

● Using Spark's broadcast● Data loading time reduced by 79%

val network = // load networkval bc_net = spark.broadcast(network)val observations = spark.textFile(“...”)  .map(parseObservation(_, bc_net.get()))

val network = // load networkval observations = spark.textFile(“...”)  .map(parseObservation(_, network))

No broadcast BT broadcast0

500

1000

1500

2000

2500

Data loading time

Load

ing

time

(sec

)

No broadcast With broadcast

Page 33: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 33/34

Conclusion

● An application of Spark:● Real-world ML problem● Crowd-sourced data

● Implementation now (much) faster than real time● Not limited by computations:

● We can use more complex ML tools than before

Page 34: Scaling Mobile Millennium - UC Berkeley AMP Campampcamp.berkeley.edu/.../tim-hunter-amp-camp-2012-mobile-millenni… · Scaling Mobile Millennium with BDAS Timothy Hunter, Teodor

August 22, 2012 MM on BDAS - AMP Camp 2012 34/34

Thank you