Scaling Mobile Millennium - UC Berkeley AMP...

Post on 27-Jun-2020

2 views 0 download

Transcript of Scaling Mobile Millennium - UC Berkeley AMP...

August 22, 2012

Scaling Mobile Millennium with BDAS

Timothy Hunter,Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma,Michael J. Franklin, Pieter Abbeel, Alexandre M. Bayen

UC Berkeley

August 22, 2012 MM on BDAS - AMP Camp 2012 2/34

Machine learning at scale

● Combining A, M and P in a real application:● Complex models (car traffic estimation)● Crowd-sourced data (mobile phones)● Computations on the cloud

● How we run Spark inside Mobile Millennium

August 22, 2012 MM on BDAS - AMP Camp 2012 3/34

Plan

● Why car traffic estimation● Overview of Mobile Millennium● 2 minutes of applied Machine Learning● Programming with the Spark framework● Conclusion: the good, the bad, the not so beautiful

August 22, 2012 MM on BDAS - AMP Camp 2012 4/34

Need for good traffic estimation

● Traffic congestion affects everyone● Up-to-date estimation is critical● Complex for urban streets (arterial roads)

August 22, 2012 MM on BDAS - AMP Camp 2012 5/34

Real-time processing of fleet data

● Input: sampled position of taxicabs

● Observed every minute

August 22, 2012 MM on BDAS - AMP Camp 2012 6/34

Estimating the travel times

● Input: sampled position of taxicabs

● Observed every minute

● Covers the whole SF Bay

● 0.5 Million points / day(60M / day total)

● 0.1 Million road links

August 22, 2012 MM on BDAS - AMP Camp 2012 7/34

Filtering of fleet data

Preprocessing:

● Recovering trajectories from GPS points

August 22, 2012 MM on BDAS - AMP Camp 2012 8/34

Mobile Millennium

● A cyberphysical system for participatory sensing

August 22, 2012 MM on BDAS - AMP Camp 2012 9/34

Mobile Millennium

● A cyberphysical system for participatory sensing

Today:Batch jobs outsourcedto the cloud

Today:Batch jobs outsourcedto the cloud

August 22, 2012 MM on BDAS - AMP Camp 2012 10/34

Estimation of arterial traffic

● Input:● Pieces of trajectories between GPS points

● Output: probability distributions of travel time● For each link● Parametrized by vector θ (mean and variance of link

travel time)

©G

oogl

e, I

nc.

August 22, 2012 MM on BDAS - AMP Camp 2012 11/34

The way things work

● Example road network

● Associated link travel times:

August 22, 2012 MM on BDAS - AMP Camp 2012 12/34

The way things work

August 22, 2012 MM on BDAS - AMP Camp 2012 13/34

The way things work

Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 14/34

The way things work

Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 15/34

The way things work

...

August 22, 2012 MM on BDAS - AMP Camp 2012 16/34

The way things work

...

August 22, 2012 MM on BDAS - AMP Camp 2012 17/34

Life is not so simple

Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 18/34

Life is not so simple

● Long time between observations Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 19/34

Life is not so simple

● Long time between observations Measurement sent

● Solution: sample!

August 22, 2012 MM on BDAS - AMP Camp 2012 20/34

Life is not so simple

● Long time between observations Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 21/34

Life is not so simple

● Long time between observations Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 22/34

Life is not so simple

● Long time between observations Measurement sent

August 22, 2012 MM on BDAS - AMP Camp 2012 23/34

Machine learning without saying it

● Procedure called Expectation Maximization● Iterative in nature:

● Alternates between sampling (E step) and learning (M step)

● Some figures:● 50k road links ( parameters)● 50M observations (15GB, avg. 4 links / observation)● 200M partial travel times● x1000 samples per partial travel times

August 22, 2012 MM on BDAS - AMP Camp 2012 24/34

System workflow

Start link parameters(on master node)

Observations(distributed, persisted across nodes)

August 22, 2012 MM on BDAS - AMP Camp 2012 25/34

System workflow

Network parameters(distributed over the nodes)

August 22, 2012 MM on BDAS - AMP Camp 2012 26/34

System workflow

Travel time samplesFor each observation link

August 22, 2012 MM on BDAS - AMP Camp 2012 27/34

System workflow

Travel time samples aggregatedon a link basis

August 22, 2012 MM on BDAS - AMP Camp 2012 28/34

System workflow

New parameters are generatedThe maximize sampled travel timesfor each link.

The master collects the vector ofnew parameters.

August 22, 2012 MM on BDAS - AMP Camp 2012 29/34

Using the Spark programming model

Main loop of the programval observations = spark.textFile(“hdfs:...”)  .map(parseObservation _)  .cache()var params = // Initialize models parameterswhile (!converged) {

  val samples = observations.flatMap( obs =>    generateSamples(obs, params))

  params = samples.groupByKey(false).map(    case (linkId, vals) =>         mostLikelyParam(linkId, vals)  ).collect()}

Step 1 (E step)

Step 2 (M step)

August 22, 2012 MM on BDAS - AMP Camp 2012 30/34

The good

● Before using Spark:● 3.5x slower than real-time● Could not even handle all the data

● With Spark:● Similar programming interface (methods on scala

collections)● Very good scalability (near linear)● Each iteration 3x faster than reloading from disk

cores

runtime

NERSC cluster:quad-core Xeon4X QDR InfiniBand interconnect

August 22, 2012 MM on BDAS - AMP Camp 2012 31/34

Efficient utilization of memory

● The observation data is stored in memory:● Be careful with the memory footprint● Look at logs to monitor GC status

● We cache pointer-based structures ● Significant overhead in the JVM

● Workaround: use compact collection structures (arrays) and make liberal use of .toArray()

● Workaround: RDDs of serialized data

August 22, 2012 MM on BDAS - AMP Camp 2012 32/34

Broadcast of large parameters

● Need to share data between all workers:● At the start of the job (network description, > 40MB)● Between iterations (updated parameters θ)

● Using Spark's broadcast● Data loading time reduced by 79%

val network = // load networkval bc_net = spark.broadcast(network)val observations = spark.textFile(“...”)  .map(parseObservation(_, bc_net.get()))

val network = // load networkval observations = spark.textFile(“...”)  .map(parseObservation(_, network))

No broadcast BT broadcast0

500

1000

1500

2000

2500

Data loading time

Load

ing

time

(sec

)

No broadcast With broadcast

August 22, 2012 MM on BDAS - AMP Camp 2012 33/34

Conclusion

● An application of Spark:● Real-world ML problem● Crowd-sourced data

● Implementation now (much) faster than real time● Not limited by computations:

● We can use more complex ML tools than before

August 22, 2012 MM on BDAS - AMP Camp 2012 34/34

Thank you