Scaling by Cheating


Transcript of Scaling by Cheating

Slide 1

Scaling by Cheating: Approximation, Sampling and Fault-Friendliness for Scalable Big Learning

Sean Owen / Director, Data Science @ Cloudera

Slide 2

Two Big Problems

Slide 3

Grow Bigger

"Today's big is just tomorrow's small. We're expected to process arbitrarily large data sets by just adding computers. You can't tell the boss that anything's too big to handle these days."
David, Sr. IT Manager

Slide 4

And Be Faster

"Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x."
Shelly, CTO

Slide 5

Two Big Solutions

Slide 6

Plentiful Resources

"Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply."
Scooter, White Lab

Slide 7

Not Right, but Close Enough

Cheating

Slide 8

Kirk: What would you say the odds are on our getting out of here?

Spock: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.

Kirk: Difficult to be precise? Seven thousand eight hundred and twenty four to one?

Spock: Seven thousand eight hundred twenty four point seven to one.

Kirk: That's a pretty close approximation.

Star Trek, "Errand of Mercy"
Image: http://www.redbubble.com/people/feelmeflow

Slide 9

When To Cheat: Approximate

  • Only a few significant figures matter
  • Least-significant figures are noise
  • Only relative rank matters
  • Only care about high or low

Do you care about 37.94% vs simply 40%?

Slide 10

Approximation

Slide 11

The Mean

Huge stream of values: x1, x2, x3, ... *
Finding the entire population mean μ is expensive.
The mean of a small sample of N values is close:

    x̄_N = (1/N)(x1 + x2 + ... + xN)

How much gets close enough?

* independent, roughly normal distribution

Slide 12

Close Enough Mean

Want: with high probability p, at most ε error: μ = (1 ± ε) x̄_N
Use Student's t-distribution (N-1 d.o.f.):

    t = (μ - x̄_N) / (σ_N / √N)

How the unknown μ behaves relative to the known sample stats.

Slide 13

Close Enough Mean

Critical value for one tail: t_crit = CDF⁻¹((1+p)/2)
Use a library like Commons Math3: TDistribution.inverseCumulativeProbability()
Solve for critical μ_crit:

    CDF⁻¹((1+p)/2) = (μ_crit - x̄_N) / (σ_N / √N)

μ is probably at most μ_crit.
Stop when (μ_crit - x̄_N) / x̄_N is small (≤ ε = 0.1).

[Slides 14-20 are not in the transcript; only the code fragment "... continue; ..." survives here.]

Slide 21

Stop When Close Enough

CloseEnoughMean.java
Stop mapping when % Capitalized is close enough: 10% error, 90% confidence, per Mapper.
18 minutes; 39.8% Capitalized

    ...
    if (m.isCloseEnough()) {
      break;
    }
    ...
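A minimal sketch of the stopping rule from slides 11-13, assuming a stream of values and using Commons Math3's TDistribution (which the slides name). The class name CloseEnoughMean and the method isCloseEnough() mirror the slide, but this is an illustration, not the original code:

    import org.apache.commons.math3.distribution.TDistribution;
    import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

    /**
     * Sketch of the "close enough mean" rule: stop sampling once the
     * one-tailed t-based bound on the true mean is within a relative
     * error epsilon of the sample mean, with confidence p.
     * Not the original CloseEnoughMean.java.
     */
    public class CloseEnoughMean {

      private final double epsilon;   // e.g. 0.1 -> 10% relative error
      private final double p;         // e.g. 0.9 -> 90% confidence
      private final SummaryStatistics stats = new SummaryStatistics();

      public CloseEnoughMean(double epsilon, double p) {
        this.epsilon = epsilon;
        this.p = p;
      }

      public void add(double x) {
        stats.addValue(x);
      }

      public boolean isCloseEnough() {
        long n = stats.getN();
        if (n < 2) {
          return false;               // need at least 2 values for a std dev
        }
        double mean = stats.getMean();
        double stdErr = stats.getStandardDeviation() / Math.sqrt(n);
        // Critical value for one tail: t_crit = CDF^-1((1+p)/2), N-1 d.o.f.
        double tCrit =
            new TDistribution(n - 1).inverseCumulativeProbability((1.0 + p) / 2.0);
        double muCrit = mean + tCrit * stdErr;
        // Stop when the bound is within epsilon of the sample mean
        return Math.abs(muCrit - mean) <= epsilon * Math.abs(mean);
      }

      public double mean() {
        return stats.getMean();
      }
    }

In a Hadoop Mapper, each mapped value would be fed into add(), and the map loop would exit early once isCloseEnough() returns true, as in the "if (m.isCloseEnough()) { break; }" fragment on slide 21.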

Slide 22

Fault-Friendliness

Slide 23

Oryx

Slide 24

Oryx

Computation Layer
  • Offline, Hadoop-based
  • Large-scale model building

Serving Layer
  • Online, REST API
  • Query model in real-time
  • Update model approximately

Few Key Algorithms
  • Recommenders: ALS
  • Clustering: k-means++
  • Classification: random decision forests
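To make "query model in real-time" concrete, here is a small sketch of a client hitting the Serving Layer over HTTP; the host, port, path, and line-oriented response format are illustrative assumptions for the example, not the documented Oryx API:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    /**
     * Sketch of querying a model in real time over the Serving Layer's
     * REST API. Endpoint and response format are hypothetical.
     */
    public class ServingLayerClient {

      public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: recommendations for one user.
        URL url = new URL("http://localhost:8080/recommend/user123");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);   // e.g. one "item,score" pair per line
          }
        } finally {
          conn.disconnect();
        }
      }
    }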

Slide 25

Not A Bank

Slide 26

Oryx

No Transactions!

Slide 27

Serving Layer Designs For...

Fast Availability
  • Independent replicas
  • Need not have a globally consistent view
  • Clients have a consistent view through sticky load balancing

Fast 99.9% Durability
  • Push data into a durable store, HDFS
  • Buffer a little locally
  • Tolerate loss of a little bit
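As a rough illustration of "buffer a little locally, push data into HDFS, tolerate loss of a little bit", here is a hedged sketch that batches updates in memory and flushes them to a new HDFS file when the buffer fills; the class name, path, and batch size are invented for the example, and losing one unflushed buffer is exactly the "little bit" this design accepts:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch of buffering updates locally and pushing them to durable
     * storage (HDFS) in batches. Names and sizes are illustrative.
     */
    public class BufferedUpdateWriter {

      private static final int BUFFER_SIZE = 1000;

      private final FileSystem fs;
      private final List<String> buffer = new ArrayList<>();

      public BufferedUpdateWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
      }

      /** Record one update; at most BUFFER_SIZE updates can be lost on a crash. */
      public synchronized void append(String update) throws IOException {
        buffer.add(update);
        if (buffer.size() >= BUFFER_SIZE) {
          flush();
        }
      }

      /** Push the buffered updates into HDFS as one new file. */
      public synchronized void flush() throws IOException {
        if (buffer.isEmpty()) {
          return;
        }
        Path path = new Path("/updates/batch-" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(path)) {
          for (String update : buffer) {
            out.write((update + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }
        buffer.clear();
      }
    }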

Slide 28

If losing 90% of the data might make