London Data Science - Super-Fast Clustering Report
mapr-technologies -
![Page 1: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/1.jpg)
1©MapR Technologies - Confidential
Super-Fast Clustering
Report from MapR workshop
![Page 2: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/2.jpg)
For Book Discount: @ellen_friedman
Contact:
  – [email protected]
  – @ted_dunning
Twitter for this talk
  – #mapr_uk
Slides and such:
  – http://info.mapr.com/ted-uk-05-2012
![Page 3: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/3.jpg)
Company Background
MapR provides the industry’s best Hadoop distribution
  – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development
Background of team
  – Deep management bench with extensive analytic, storage, virtualization, and open source experience
  – Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
Proven
  – MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
  – Strategic OEM relationship with EMC and Cisco
  – Over 1,000 installs
![Page 4: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/4.jpg)
We Also Do …
Open source development
  – Zookeeper
  – Hadoop
  – Mahout
  – Stuff
Partner workshops
  – Machine learning
  – Information architecture
  – Cluster design
![Page 6: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/6.jpg)
The Problem
A certain bank
  – had lots of customers
  – had lots of prospective customers
  – had a non-trivial number of fraudulent customers
  – had a non-trivial number of fraudulent merchants
They also
  – collected data
  – built models
  – collected more data
  – built more models
![Page 7: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/7.jpg)
But …
These models were arduous to build
And hard to test
So people suggested something simpler
Like k-nearest neighbor
![Page 8: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/8.jpg)
What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
  – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
  – hard because of the stunning amount of math (one distance computation per query-example pair)
  – also hard because we need the top 50,000 results
Initial prototype was massively too slow
  – 3K queries x 200K examples takes hours
  – needed 20M x 25M in the same time
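To make the cost concrete, here is a minimal brute-force sketch of k-nearest-neighbor prediction as described on this slide. It is not the prototype from the talk: the function name `knn_predict`, the toy data, and the use of Euclidean distance are assumptions for illustration only.

```python
import heapq
import math


def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query."""
    # Brute force: one distance computation per training example.
    scored = ((math.dist(query, x), y) for x, y in zip(examples, targets))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)


# Toy data (hypothetical): two tight groups with target values 0 and 1.
examples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
targets = [0.0, 0.0, 1.0, 1.0]
score = knn_predict((0.2, 0.1), examples, targets, k=2)  # 0.0

# Query-example pairs for the prototype versus the required workload.
prototype_pairs = 3_000 * 200_000          # 600 million
target_pairs = 20_000_000 * 25_000_000     # 500 trillion
scale_up = target_pairs / prototype_pairs  # roughly 833,000 times more work
```

Every query touches every example, so the work grows with the product of the two counts. Going from the prototype size to the required size multiplies the number of distance computations by roughly 833,000, which is why a faster search structure was needed.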
![Page 9: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/9.jpg)
What We Did
Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid
Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
  – Kmeans, StreamingKmeans
![Page 10: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/10.jpg)
Projection Search
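The figure for this slide did not survive in the transcript, so the following is only a sketch of the general idea usually meant by projection search, not the Mahout ProjectionSearch code. It assumes the technique is: project all points onto a few random directions, keep each projection sorted, collect the points whose projections fall near the query's, then rank those candidates by true distance. The class name, `num_projections`, and `window` are hypothetical.

```python
import bisect
import math
import random


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search using sorted random projections."""

    def __init__(self, points, num_projections=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = list(points)
        # Random Gaussian directions to project onto.
        self.directions = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_projections)
        ]
        # For each direction, a sorted list of (projection value, point index).
        self.orders = [
            sorted((self._dot(d, p), i) for i, p in enumerate(self.points))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k, window=8):
        """Return indices of up to k candidates, closest first."""
        candidates = set()
        for direction, order in zip(self.directions, self.orders):
            q = self._dot(direction, query)
            pos = bisect.bisect_left(order, (q, -1))
            lo = max(0, pos - window)
            hi = min(len(order), pos + window)
            candidates.update(i for _, i in order[lo:hi])
        # Only the candidates get an exact distance computation.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]


# Toy usage with hypothetical data.
points = [(float(i), float(i % 3)) for i in range(10)]
searcher = ProjectionSearchSketch(points, num_projections=2, seed=1)
nearest = searcher.search((4.1, 1.0), k=1, window=10)
```

With a small `window` the search is approximate, because a true neighbor can be missed if it is not close in any projection. With a window that covers every point, as in the toy usage, it degenerates to an exact brute-force search.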
![Page 11: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/11.jpg)
K-means Search
![Page 12: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/12.jpg)
But These Require k-means!
Need a new k-means algorithm to get speed
Streaming k-means is
  – One pass (through the original data)
  – Very fast (20 µs per data point with threads)
  – Very parallelizable
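As a rough sense of scale, the quoted per-point cost can be turned into a wall-clock estimate for one pass. Applying it to the 25M examples mentioned earlier in the talk is an assumption made here, not a measurement reported in the slides, and the function name is hypothetical.

```python
def one_pass_seconds(num_points, micros_per_point=20.0):
    """Estimated time for a single pass at a fixed cost per data point."""
    return num_points * micros_per_point / 1_000_000


seconds = one_pass_seconds(25_000_000)  # 500.0 seconds, a bit over 8 minutes
```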
![Page 13: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/13.jpg)
How It Works
For each point
  – Find approximately nearest centroid (distance = d)
  – If d > threshold, start a new centroid
  – Else possibly start a new cluster anyway
  – Else add the point to the nearest centroid
If number of centroids > K ≈ C log N
  – Recursively cluster the centroids with a higher threshold
Result is a large set of centroids
  – these provide an approximation of the original distribution
  – we can cluster the centroids to get a close approximation of clustering the original data
  – or we can just use the result directly
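The steps above can be sketched in a few lines. This is an illustrative sketch, not the Mahout StreamingKmeans code, and it rests on several assumptions: "possibly new cluster" is taken to mean a new centroid with probability d / threshold, the threshold grows by a fixed factor `growth` on each collapse, and the nearest centroid is found by an exact linear scan rather than the approximate searcher the slides describe. All names and parameters are hypothetical.

```python
import math
import random


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One-pass clustering; returns (list of [vector, weight], final threshold)."""
    rng = random.Random(seed)

    def add(point, weight, centroids, cutoff):
        if not centroids:
            centroids.append([list(point), weight])
            return
        # Exact scan here; the talk uses an approximate searcher instead.
        d, nearest = min(
            ((math.dist(point, c[0]), c) for c in centroids),
            key=lambda pair: pair[0],
        )
        if d > cutoff:
            centroids.append([list(point), weight])   # far away: new centroid
        elif rng.random() < d / cutoff:
            centroids.append([list(point), weight])   # possibly new cluster
        else:
            total = nearest[1] + weight               # fold into nearest centroid
            nearest[0] = [
                (c * nearest[1] + p * weight) / total
                for c, p in zip(nearest[0], point)
            ]
            nearest[1] = total

    centroids = []
    for point in points:
        add(point, 1.0, centroids, threshold)
        # Too many centroids: re-cluster the centroids with a higher threshold.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vector, weight in old:
                add(vector, weight, centroids, threshold)
    return centroids, threshold


# Toy usage with hypothetical data: three well-separated blobs.
data_rng = random.Random(1)
data = [
    (data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
    for cx, cy in [(0, 0), (10, 10), (0, 10)]
    for _ in range(200)
]
sketch, final_threshold = streaming_kmeans(data, max_centroids=20, threshold=0.01)
```

Each returned centroid carries a weight equal to the number of original points folded into it, so the weighted centroids can stand in for the original data when a final clustering step is run.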
![Page 14: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/14.jpg)
Parallel Speedup?
✓
![Page 16: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/16.jpg)
Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
![Page 17: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/17.jpg)
Contact:
  – [email protected]
  – @ted_dunning
Slides and such:
  – http://info.mapr.com/ted-uk-05-2012
![Page 18: London Data Science - Super-Fast Clustering Report](https://reader033.fdocuments.net/reader033/viewer/2022061218/54b757ba4a795905078b45a1/html5/thumbnails/18.jpg)
Thank You