Oscon Data 2011 Ted Dunning
![Page 1: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/1.jpg)
Hands-on Classification
![Page 2: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/2.jpg)
Preliminaries
• Code is available from github:
  – [email protected]:tdunning/Chapter-16.git
• EC2 instances available
• Thumb drives also available
• Email to [email protected]
• Twitter @ted_dunning
![Page 3: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/3.jpg)
A Quick Review
• What is classification?
  – goes-ins: predictors
  – goes-outs: target variable
• What is classifiable data?
  – continuous, categorical, word-like, text-like
  – uniform schema
• How do we convert from classifiable data to feature vector?
![Page 4: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/4.jpg)
Data Flow
Not quite so simple
![Page 5: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/5.jpg)
Classifiable Data
• Continuous
  – A number that represents a quantity, not an id
  – Blood pressure, stock price, latitude, mass
• Categorical
  – One of a known, small set (color, shape)
• Word-like
  – One of a possibly unknown, possibly large set
• Text-like
  – Many word-like things, usually unordered
![Page 6: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/6.jpg)
But that isn’t quite there
• Learning algorithms need feature vectors
  – Have to convert from data to vector
• Can assign one location per feature (or category, or word)
• Can assign one or more locations with hashing
  – scary
  – but safe on average
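The hashing option can be sketched in a few lines. This is a minimal illustration, not the code from the talk's repository: the function names `hashed_index` and `encode`, the vector size of 32, the use of two probes, and the choice of MD5 as the hash are all assumptions made for the example. Each value is added at more than one hashed location, so a collision at one location still leaves the other location to carry the signal.

```python
import hashlib

def hashed_index(name, probe, size):
    """Map a feature name plus probe number to a stable vector location."""
    digest = hashlib.md5(f"{name}#{probe}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size

def encode(record, size=32, probes=2):
    """Encode a dict of predictors into a fixed-length feature vector.

    Numbers are treated as continuous values, strings as categorical or
    word-like values, and lists of strings as text-like values.
    """
    vector = [0.0] * size
    for field, value in record.items():
        if isinstance(value, (int, float)):
            items = [(field, float(value))]
        elif isinstance(value, str):
            items = [(f"{field}={value}", 1.0)]
        else:
            items = [(f"{field}={word}", 1.0) for word in value]
        for name, weight in items:
            # Add the weight at every probe location for this feature.
            for probe in range(probes):
                vector[hashed_index(name, probe, size)] += weight
    return vector

vector = encode({"mass": 2.5, "color": "red", "body": ["cheap", "pills", "cheap"]})
print(len(vector))
```

The vector length stays fixed no matter how many distinct words show up, which is what makes hashing attractive for word-like and text-like data with an unknown vocabulary.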
![Page 7: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/7.jpg)
Data Flow
![Page 8: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/8.jpg)
![Page 9: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/9.jpg)
Classifiable Data Vectors
![Page 10: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/10.jpg)
![Page 11: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/11.jpg)
![Page 12: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/12.jpg)
Hashed Encoding
![Page 13: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/13.jpg)
What about collisions?
![Page 14: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/14.jpg)
Let’s write some code
(cue relaxing background music)
![Page 15: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/15.jpg)
Generating new features
• Sometimes the existing features are difficult to use
• Restating the geometry using new reference points may help
• Automatic reference points using k-means can be better than manual references
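As a rough sketch of the idea, the snippet below finds reference points with a plain Lloyd's k-means and then restates each point as its distances to those reference points. The helper names (`distance`, `kmeans`, `cluster_features`), the seeding from the first k points, and the fixed iteration count are assumptions for illustration, not the implementation used in the talk.

```python
import math

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def cluster_features(point, centroids):
    """Restate a point as its distances to each reference point."""
    return [distance(point, c) for c in centroids]

centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(cluster_features((0, 0.5), centroids))
```

The distances can be appended to the original predictors or used in place of them; which works better depends on the data.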
![Page 16: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/16.jpg)
K-means using target
![Page 17: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/17.jpg)
K-means features
![Page 18: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/18.jpg)
More code!
(cue relaxing background music)
![Page 19: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/19.jpg)
Integration Issues
• Feature extraction is ideal for map-reduce
  – Side data adds some complexity
• Clustering works great with map-reduce
  – Cluster centroids to HDFS
• Model training works better sequentially
  – Need centroids in normal files
• Model deployment shouldn’t depend on HDFS
![Page 20: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/20.jpg)
Parallel Stochastic Gradient Descent
(Diagram labels: Input, Train sub model, Average models, Model)
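One way to read the diagram labels is: train a sub-model on each slice of the input, then average the sub-models into a single model. The sketch below follows that reading with a tiny logistic regression. The function names, the learning rate, the epoch count, and the use of a simple loop in place of real parallel workers are all assumptions for illustration.

```python
import math

def sgd_logistic(examples, dims, rate=0.1, epochs=50):
    """Train one logistic-regression sub-model with plain SGD."""
    w = [0.0] * dims
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the log-likelihood for a single example.
            for i in range(dims):
                w[i] += rate * (y - p) * x[i]
    return w

def average_models(models):
    """Average the weight vectors of independently trained sub-models."""
    return [sum(ws) / len(models) for ws in zip(*models)]

def parallel_sgd(shards, dims):
    # Each shard could be trained on a different machine; here we just loop.
    return average_models([sgd_logistic(shard, dims) for shard in shards])

shards = [
    [((1.0, 2.0), 1), ((1.0, -2.0), 0)],
    [((1.0, 1.0), 1), ((1.0, -1.0), 0)],
]
print(parallel_sgd(shards, dims=2))
```

Each example is a pair of a feature tuple (with a constant 1.0 as the bias term) and a 0/1 target.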
![Page 21: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/21.jpg)
Variational Dirichlet Assignment
(Diagram labels: Input, Gather sufficient statistics, Update model, Model)
![Page 22: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/22.jpg)
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)
• Output to HDFS
Read from HDFS to local disk by distributed cache
Written by map-reduce
Read from local disk from distributed cache
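The mapper and combiner/reducer steps above can be sketched in plain Python as a single-process simulation of one k-means iteration. The function names and the in-memory grouping are assumptions for illustration; no Hadoop API is used, and reading centroids from the distributed cache is replaced by passing them as an argument.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def mapper(point, centroids):
    # Assign the point to a cluster and emit cluster id, (1, point).
    return nearest(point, centroids), (1, list(point))

def combine(values):
    # Sum counts and take the count-weighted sum of points; emit (n, sum/n).
    n = sum(count for count, _ in values)
    dims = len(values[0][1])
    total = [sum(count * mean[d] for count, mean in values) for d in range(dims)]
    return n, [t / n for t in total]

def kmeans_step(points, centroids):
    """One iteration: map every point, group by cluster id, then reduce."""
    grouped = defaultdict(list)
    for point in points:
        cluster_id, value = mapper(point, centroids)
        grouped[cluster_id].append(value)
    return {cid: combine(vals) for cid, vals in grouped.items()}

print(kmeans_step([(0, 0), (2, 2), (10, 10)], [(0, 0), (10, 10)]))
```

Because `combine` emits values in the same (n, mean) shape that it consumes, the same function can serve as both combiner and reducer: partial means are re-weighted by their counts when they are merged.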
![Page 23: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/23.jpg)
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, 1, point
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, n, sum/n
• Output to MapR FS (in place of HDFS)
Read from NFS
Written by map-reduce
![Page 24: Oscon Data 2011 Ted Dunning](https://reader034.fdocuments.net/reader034/viewer/2022042714/556a7454d8b42a7c758b45ec/html5/thumbnails/24.jpg)
Modeling architecture
(Diagram labels: Input, Side-data, Feature extraction and down sampling, Data join, Sequential SGD Learning, Map-reduce, Now via NFS)