BigML Spring 2014 Webinar - Clustering!
BigML Inc
Today’s Webinar
• Speaker: Poul Petersen, CIO
• Moderator: Andrew Shikiar, VP Business Development
• Enter questions into chat box – we’ll answer some via text; others at the end of the session
• For direct follow-up, email us at [email protected]
Clustering
BigML’s first unsupervised learning offering!
Trees vs Clusters
Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict the label
Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Trees vs Clusters
Trees: Inputs “X” (sepal length, sepal width, petal length, petal width) and Label “Y” (species)
sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …
Learning Task: Find function “f” such that: f(X)≈Y
Clusters: the same inputs, without the label
sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
Clustering Basics
[Figure: clustering example with K=3; centroids labeled]
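As a rough illustration of how K=3 centroids can be found, here is a minimal sketch of k-means (Lloyd's algorithm). The slide does not name the algorithm behind BigML's clusters, so k-means is an assumption chosen because it is widely known. The first three rows come from the iris table in this deck; the other three rows are made up for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    # Deterministic start for this sketch: the first k points act as centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# (sepal length, sepal width, petal length, petal width)
points = [
    (5.1, 3.5, 1.4, 0.2),  # from the slide
    (5.7, 2.6, 3.5, 1.0),  # from the slide
    (6.7, 2.5, 5.8, 1.8),  # from the slide
    (4.9, 3.0, 1.4, 0.2),  # illustrative extra row
    (5.5, 2.4, 3.8, 1.1),  # illustrative extra row
    (6.3, 2.9, 5.6, 1.8),  # illustrative extra row
]
centroids, assignments = kmeans(points, k=3)
```

Each entry in `assignments` is the index of the centroid nearest to the matching row, so rows with the same index fall in the same cluster. Real implementations usually pick starting centroids more carefully than "first k rows".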
Workflow
Supervised Learning
• model: DATASET → MODEL
• prediction: MODEL + INSTANCE → PREDICTION
• batch prediction: MODEL + DATASET → CSV
Unsupervised Learning
• cluster: DATASET → CLUSTER
• centroid: CLUSTER + INSTANCE → CENTROID
• batch centroid: CLUSTER + DATASET → CSV
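The unsupervised side of the workflow (cluster, centroid, batch centroid) can be mimicked locally to show the shape of each step. This is a sketch, not BigML's API: the `cluster` dictionary, its centroid values, and the function names `centroid` and `batch_centroid` are all hypothetical stand-ins.

```python
import csv
import io
import math

# Hypothetical stand-in for a trained CLUSTER: named centroids over
# (sepal length, sepal width, petal length, petal width). Values are made up.
cluster = {
    "Cluster 0": (5.0, 3.4, 1.5, 0.2),
    "Cluster 1": (5.9, 2.8, 4.3, 1.3),
    "Cluster 2": (6.6, 3.0, 5.6, 2.0),
}


def centroid(cluster, instance):
    """CLUSTER + INSTANCE -> CENTROID: name of the nearest centroid."""
    return min(cluster, key=lambda name: math.dist(instance, cluster[name]))


def batch_centroid(cluster, dataset):
    """CLUSTER + DATASET -> CSV: every input row plus its nearest centroid."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["sepal length", "sepal width", "petal length", "petal width", "centroid"]
    )
    for row in dataset:
        writer.writerow(list(row) + [centroid(cluster, row)])
    return out.getvalue()


dataset = [(5.1, 3.5, 1.4, 0.2), (5.7, 2.6, 3.5, 1.0), (6.7, 2.5, 5.8, 1.8)]
single = centroid(cluster, (5.1, 3.5, 1.4, 0.2))
csv_text = batch_centroid(cluster, dataset)
```

The single-instance call returns one centroid name, while the batch call returns CSV text with one labeled row per input row, mirroring the prediction and batch prediction steps on the supervised side.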
Use Cases
• Customer segmentation
• Item discovery
• Data summarization / compression
• Collaborative filtering / recommender
• Active learning
Item Discovery
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have a similar taste.
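Once the whiskies have cluster labels, discovery amounts to listing the other members of a whisky's cluster, closest flavor profile first. The sketch below assumes the labels already exist. The whisky names, the scores, the labels, and the helper `similar_whiskies` are all hypothetical.

```python
import math

# Hypothetical data: 12 flavor scores (each 0 to 4) and an already-assigned
# cluster label per whisky. Names, scores, and labels are made up.
whiskies = {
    "Whisky A": ([2, 2, 3, 1, 0, 2, 2, 1, 1, 2, 2, 2], 0),
    "Whisky B": ([2, 3, 3, 1, 0, 2, 1, 1, 1, 2, 2, 2], 0),
    "Whisky C": ([4, 1, 4, 4, 1, 0, 2, 0, 2, 1, 1, 0], 1),
    "Whisky D": ([1, 2, 3, 0, 0, 2, 2, 1, 2, 2, 2, 3], 0),
}


def similar_whiskies(name):
    """Other whiskies in the same cluster, nearest flavor profile first."""
    profile, label = whiskies[name]
    peers = [
        (math.dist(profile, other_profile), other)
        for other, (other_profile, other_label) in whiskies.items()
        if other != name and other_label == label
    ]
    return [other for _, other in sorted(peers)]


recommendations = similar_whiskies("Whisky A")
```

Here "Whisky C" is never recommended for "Whisky A" because it sits in a different cluster, regardless of how any single flavor score compares.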
Customer Segments
GOAL: Cluster the users by usage statistics, then identify clusters with a higher percentage of high-LTV users. Because users in a cluster have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.
• Assumption: Usage correlates with LTV.
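The segment analysis in the goal can be sketched as two small steps: compute the share of high-LTV users in each cluster, then list the lower-LTV users in clusters where that share is high. The user records, the `HIGH_LTV` threshold, and the `min_share` cutoff are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical users: (user id, cluster label, LTV from in-game purchases).
users = [
    ("u1", 0, 120.0), ("u2", 0, 95.0), ("u3", 0, 4.0),
    ("u4", 1, 0.0), ("u5", 1, 2.5), ("u6", 1, 110.0), ("u7", 1, 1.0),
]
HIGH_LTV = 50.0  # assumed threshold for a "high LTV" user


def high_ltv_share(users, threshold):
    """Fraction of users in each cluster whose LTV meets the threshold."""
    totals, high = defaultdict(int), defaultdict(int)
    for _, label, ltv in users:
        totals[label] += 1
        if ltv >= threshold:
            high[label] += 1
    return {label: high[label] / totals[label] for label in totals}


def upsell_candidates(users, threshold, min_share):
    """Lower-LTV users who sit in clusters dominated by high-LTV users."""
    share = high_ltv_share(users, threshold)
    return [
        uid for uid, label, ltv in users
        if share[label] >= min_share and ltv < threshold
    ]


share = high_ltv_share(users, HIGH_LTV)
candidates = upsell_candidates(users, HIGH_LTV, min_share=0.5)
```

With this toy data, two of the three users in cluster 0 are high-LTV, so the third user ("u3") is the only upsell candidate.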
Active Learning
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each needing to be labeled as fraud/not-fraud, or a million images that need to be labeled as cat/not-cat.
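The sampling step in this goal might look like the sketch below: group patients by cluster, then draw a few from each cluster for the expensive test. The number of clusters, the modulo-based assignment, and the helper `sample_per_cluster` are hypothetical. Only the patient count of 768 comes from the slide.

```python
import random
from collections import defaultdict


def sample_per_cluster(assignments, per_cluster, seed=0):
    """Pick up to `per_cluster` patient ids from each cluster for testing."""
    rng = random.Random(seed)  # seeded so the selection is repeatable
    by_cluster = defaultdict(list)
    for patient_id, label in assignments.items():
        by_cluster[label].append(patient_id)
    return {
        label: rng.sample(members, min(per_cluster, len(members)))
        for label, members in sorted(by_cluster.items())
    }


# Hypothetical cluster labels for 768 patients; a real run would take these
# from the clustering step rather than from the patient id.
assignments = {patient_id: patient_id % 4 for patient_id in range(768)}
to_test = sample_per_cluster(assignments, per_cluster=5)
tests_needed = sum(len(ids) for ids in to_test.values())
```

In this toy setup only 20 patients are tested instead of all 768, and every cluster is represented in the labeled sample.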
Clustering
• High dimensions - 10,000 fields
• Mixed data:
  • numerical: 3.4
  • categorical: red, green, blue
  • date time: 2014-05-14T12:34:56
  • unstructured text: “The quick brown fox…”
• Computing cluster membership for new data
• Using clusters programmatically
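Distance-based clustering needs numbers, so mixed fields have to be turned into numeric features at some point. The slide does not say how BigML does this. The sketch below shows one common approach as an assumption: one-hot encoding for categories, component expansion for date-times, and word counts for text. The field names, the `vocabulary`, and the `encode` helper are hypothetical.

```python
from datetime import datetime

categories = ["red", "green", "blue"]
vocabulary = ["quick", "fox", "dog"]  # hypothetical terms to count


def encode(record):
    """Turn one mixed-type record into a flat list of numbers."""
    features = [float(record["numerical"])]
    # Categorical: one slot per category, 1.0 for the matching one.
    features += [1.0 if record["categorical"] == c else 0.0 for c in categories]
    # Date time: expand into separate numeric components.
    ts = datetime.fromisoformat(record["date time"])
    features += [float(ts.year), float(ts.month), float(ts.day), float(ts.hour)]
    # Unstructured text: count occurrences of each vocabulary term.
    words = record["text"].lower().split()
    features += [float(words.count(term)) for term in vocabulary]
    return features


record = {
    "numerical": 3.4,
    "categorical": "red",
    "date time": "2014-05-14T12:34:56",
    "text": "The quick brown fox",
}
features = encode(record)
```

A practical encoder would also rescale the features so that large-valued fields such as the year do not dominate the distance. That step is left out here to keep the sketch short.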
Get Started Today!
• FEEDBACK
• TWITTER: @bigmlcom
• RESOURCES
• Join us for future webinars & hangouts