BigML Spring 2014 Webinar - Clustering!

BigML Inc


Today’s Webinar

• Speaker:

• Poul Petersen, CIO

• Moderator:

• Andrew Shikiar, VP Business Development

• Enter questions into chat box – we’ll answer some via text; others at the end of the session

• For direct follow-up, email us at [email protected]



Clustering

BigML’s first unsupervised learning offering!


Trees vs Clusters

Trees (Supervised Learning)

• Provide: labeled data

• Learning Task: be able to predict label

Clusters (Unsupervised Learning)

• Provide: unlabeled data

• Learning Task: group data by similarity


Trees vs Clusters

Trees: Inputs “X” are the four measurements; Label “Y” is the species.

sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …

Learning Task: Find function “f” such that: f(X)≈Y

Clusters: the same inputs, with no label column.

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar
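
The two learning tasks can be contrasted on the three iris rows shown on this slide. The sketch below is only an illustration: the supervised side uses a 1-nearest-neighbor lookup as a stand-in for "f" (not a decision tree, and not BigML's implementation), and the unsupervised side simply measures which rows are most alike, since without a label column similarity is all there is to work with. The names `labeled`, `unlabeled`, `sq_dist` and `f` are made up for this example.

```python
from itertools import combinations

# The three labeled rows shown on the slide: inputs X and label Y.
labeled = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((5.7, 2.6, 3.5, 1.0), "versicolor"),
    ((6.7, 2.5, 5.8, 1.8), "virginica"),
]


def sq_dist(a, b):
    """Squared Euclidean distance between two rows of measurements."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def f(x):
    """Supervised stand-in (1-nearest-neighbor, not a tree): f(X) ~ Y."""
    _, nearest_label = min(labeled, key=lambda row: sq_dist(row[0], x))
    return nearest_label


# Unsupervised: the same inputs with the label column dropped.
unlabeled = [inputs for inputs, _ in labeled]

# With no Y available, find the pair of rows that are most similar.
most_similar = min(
    combinations(range(len(unlabeled)), 2),
    key=lambda pair: sq_dist(unlabeled[pair[0]], unlabeled[pair[1]]),
)

print(f((5.0, 3.4, 1.5, 0.2)))  # predicted label for a new flower
print(most_similar)             # indices of the two most similar rows
```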


Clustering Basics

[Figure: example clustering with K=3 centroids]
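
The slide does not say which algorithm finds the centroids. As a minimal sketch, assuming a plain K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), K=3 might look like the following. The toy points, the farthest-point initialization, and all function names are assumptions made for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean(members)
    return centroids, assign(points, centroids)


# Toy 2-D data with three visible groups (made up for this sketch).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),
]
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

K has to be chosen up front here; a different K would give a different grouping of the same points.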

Workflow

Supervised Learning

• CSV → DATASET → MODEL (model)

• MODEL + INSTANCE → PREDICTION (prediction)

• MODEL + DATASET → CSV (batchprediction)

Unsupervised Learning

• CSV → DATASET → CLUSTER (cluster)

• CLUSTER + INSTANCE → CENTROID (centroid)

• CLUSTER + DATASET → CSV (batchcentroid)
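
The unsupervised half of the workflow can be mimicked locally. This is a rough sketch, not BigML's API: the `cluster` dictionary, its centroid values, and the `centroid` and `batch_centroid` functions are hypothetical stand-ins that only mirror the resource names on the slide. A single instance is mapped to its nearest centroid, and a whole dataset is mapped to CSV text with one extra column.

```python
import csv
import io

# Hypothetical "cluster": centroid name -> centroid coordinates (made-up values).
cluster = {
    "Cluster 0": {"petal length": 1.4, "petal width": 0.2},
    "Cluster 1": {"petal length": 4.3, "petal width": 1.3},
    "Cluster 2": {"petal length": 5.6, "petal width": 2.0},
}


def centroid(cluster, instance):
    """CLUSTER + INSTANCE -> CENTROID: name of the nearest centroid."""
    def sq_dist(center):
        return sum((instance[field] - value) ** 2 for field, value in center.items())
    return min(cluster, key=lambda name: sq_dist(cluster[name]))


def batch_centroid(cluster, dataset):
    """CLUSTER + DATASET -> CSV: every row plus its centroid, as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(dataset[0]) + ["centroid"])
    writer.writeheader()
    for row in dataset:
        writer.writerow({**row, "centroid": centroid(cluster, row)})
    return out.getvalue()


# Petal measurements of the three iris rows shown earlier in the deck.
dataset = [
    {"petal length": 1.4, "petal width": 0.2},
    {"petal length": 3.5, "petal width": 1.0},
    {"petal length": 5.8, "petal width": 1.8},
]

print(centroid(cluster, {"petal length": 1.5, "petal width": 0.3}))
print(batch_centroid(cluster, dataset))
```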


Use Cases

• Customer segmentation

• Item discovery

• Data summarization / compression

• Collaborative filtering / recommender

• Active learning


Item Discovery

• Dataset of 86 whiskies

• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.


Customer Segments

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV (lifetime value) users. Since users in a cluster have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.

• Dataset of mobile game users.

• Data for each user consists of usage statistics and an LTV based on in-game purchases

• Assumption: Usage correlates with LTV
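
Once each user has a cluster, the goal above reduces to a per-cluster count. The sketch below assumes the clustering has already been done; the user rows, the `HIGH_LTV` threshold, and the `min_share` cutoff are all hypothetical values picked for illustration, not figures from the webinar.

```python
from collections import defaultdict

# Hypothetical rows: (user id, cluster from usage statistics, LTV).
users = [
    ("u1", 0, 40.0), ("u2", 0, 0.0), ("u3", 0, 55.0),
    ("u4", 1, 0.0), ("u5", 1, 2.0), ("u6", 1, 0.0),
]
HIGH_LTV = 20.0  # assumed threshold for a "high LTV" user


def upsell_candidates(users, high_ltv, min_share):
    """For clusters rich in high-LTV users, list the remaining members."""
    by_cluster = defaultdict(list)
    for user_id, cluster_id, ltv in users:
        by_cluster[cluster_id].append((user_id, ltv))

    candidates = {}
    for cluster_id, members in by_cluster.items():
        share = sum(ltv >= high_ltv for _, ltv in members) / len(members)
        if share >= min_share:
            candidates[cluster_id] = [u for u, ltv in members if ltv < high_ltv]
    return candidates


print(upsell_candidates(users, HIGH_LTV, min_share=0.5))
```

Note that LTV is only used after clustering, to score the clusters; the clusters themselves come from usage statistics, which is where the stated assumption matters.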


Active Learning

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

• Dataset of diagnostic measurements of 768 patients.

• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud, or a million images that need to be labeled as cat/not-cat.
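
Picking which patients to test can be sketched as sampling a few members from every cluster instead of sampling the whole dataset at random. The cluster assignment below is a placeholder (patient index modulo 4), and the number of clusters and the sample size per cluster are assumptions for illustration only.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_of, per_cluster, seed=0):
    """Pick up to `per_cluster` patients from each cluster for the costly test."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_cluster = defaultdict(list)
    for patient_id, cluster_id in cluster_of.items():
        by_cluster[cluster_id].append(patient_id)
    return {
        cluster_id: rng.sample(ids, min(per_cluster, len(ids)))
        for cluster_id, ids in sorted(by_cluster.items())
    }


# Placeholder assignment of the 768 patients to 4 clusters (not real output).
cluster_of = {patient_id: patient_id % 4 for patient_id in range(768)}

to_test = sample_per_cluster(cluster_of, per_cluster=5)
print(to_test)
print(sum(len(ids) for ids in to_test.values()), "tests instead of", len(cluster_of))
```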


Clustering

• High dimensions - 10,000 fields

• Mixed data:

• numerical: 3.4

• categorical: red, green, blue

• date time: 2014-05-14T12:34:56

• unstructured text: “The quick brown fox…”

• Computing cluster membership for new data

• Using clusters programmatically
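
Clustering mixed data needs a notion of distance that works across field types. The slide does not describe how BigML handles this, so the following is only one possible sketch: each field type gets its own distance, scaled to be roughly comparable, and the row distance is their average. The field names, the scales, and the word-overlap measure for text are all assumptions.

```python
from datetime import datetime


def field_distance(a, b, kind, scale=1.0):
    """Distance between two values of one field, by field type."""
    if kind == "numeric":
        return abs(a - b) / scale
    if kind == "categorical":
        return 0.0 if a == b else 1.0
    if kind == "datetime":
        delta = datetime.fromisoformat(a) - datetime.fromisoformat(b)
        return abs(delta.total_seconds()) / scale
    if kind == "text":
        # Jaccard distance on lower-cased word sets.
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        union = words_a | words_b
        return 1.0 - len(words_a & words_b) / len(union) if union else 0.0
    raise ValueError(f"unknown field type: {kind}")


def mixed_distance(row_a, row_b, schema):
    """Average of the per-field distances over all fields in the schema."""
    total = sum(
        field_distance(row_a[name], row_b[name], kind, scale)
        for name, (kind, scale) in schema.items()
    )
    return total / len(schema)


# Hypothetical schema: field name -> (type, scale used for normalization).
schema = {
    "size": ("numeric", 10.0),
    "color": ("categorical", 1.0),
    "seen": ("datetime", 86400.0),  # one day, in seconds
    "note": ("text", 1.0),
}
row_a = {"size": 3.4, "color": "red", "seen": "2014-05-14T12:34:56",
         "note": "The quick brown fox"}
row_b = {"size": 5.4, "color": "green", "seen": "2014-05-15T12:34:56",
         "note": "The quick red fox"}

print(mixed_distance(row_a, row_b, schema))
```

A function like this could replace the plain Euclidean distance when assigning new rows to their nearest centroid.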


FEEDBACK

@bigmlcom TWITTER

[email protected]

Get Started Today!

RESOURCES

Join us for future webinars & hangouts
