VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

Transcript, September 8-9, 2016

Association Discovery

Geoff Webb, Professor of Information Technology Research
Monash University, Melbourne, Australia

Finding interesting correlations


Association Discovery

• Algorithm: “Magnum Opus” from Geoff Webb
• Unsupervised Learning: works with unlabelled data, like clustering and anomaly detection
• Learning Task: find “interesting” relations between variables


Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140    135
Tue   Bob       3421     sign  food     46140    401
Tue   Alice     2456     pin   food     12222    234
Wed   Sally     6788     pin   gas      26339     94
Wed   Bob       3421     pin   tech     21350   2459
Wed   Bob       3421     pin   gas      46140     83
Thu   Sally     6788     sign  food     26339     51

Clustering groups similar instances; anomaly detection flags unusual ones.


Association Rules

Rules take the form Antecedent → Consequent. From the transaction table above:

{class = gas} → amount < 100
{customer = Bob, account = 3421} → zip = 46140


Use Cases

• Market Basket Analysis

• Web usage patterns

• Intrusion detection

• Fraud detection

• Bioinformatics

• Medical risk factors


Magnum Opus: What's wrong with frequent pattern mining?

• Feast or famine: it often results in too few or too many patterns
• The vodka-and-caviar problem: some high-value patterns are infrequent
• Cannot handle dense data
• Minimum support may not be relevant:
  • it cannot be low enough to capture all valid rules
  • it cannot be high enough to exclude all spurious rules


Magnum Opus: Very infrequent patterns can be significant

Data file: Brijs retail.itl, 88,162 cases / 16,470 items

237 → 1         [Coverage=3032; Support=28; Lift=3.06;   p=1.99E-007]
237 & 4685 → 1  [Coverage=19;   Support=9;  Lift=157.00; p=5.03E-012]
1159 → 1        [Coverage=197;  Support=9;  Lift=15.14;  p=1.13E-008]
4685 → 1        [Coverage=270;  Support=9;  Lift=11.05;  p=1.68E-007]
168 → 1         [Coverage=293;  Support=9;  Lift=10.18;  p=3.33E-007]
4382 → 1        [Coverage=72;   Support=8;  Lift=36.83;  p=6.26E-011]
168 & 4685 → 1  [Coverage=9;    Support=7;  Lift=257.78; p=6.66E-011]


Magnum Opus: Very high support patterns can be spurious

Data file: covtype.data, 581,012 cases / 125 values

ST15=0 → ST07=0 [Coverage=581009; Support=580904; Confidence=1.000]
ST07=0 → ST15=0 [Coverage=580907; Support=580904; Confidence=1.000]
ST15=0 → ST36=0 [Coverage=581009; Support=580890; Confidence=1.000]
ST36=0 → ST15=0 [Coverage=580893; Support=580890; Confidence=1.000]
ST15=0 → ST08=0 [Coverage=581009; Support=580830; Confidence=1.000]
ST08=0 → ST15=0 [Coverage=580833; Support=580830; Confidence=1.000]
… 197,183,686 such rules have the highest support


Magnum Opus

• The user selects a measure of interest
• The system finds the top-k associations on that measure, within constraints (a brute-force sketch follows):
  • there must be a statistically significant interaction between antecedent and consequent
  • every item in the antecedent must increase the strength of the association
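Magnum Opus itself is proprietary and far more sophisticated, but a brute-force sketch of the "pick a measure, return the top-k rules" idea might look like this; the baskets, the items, and the restriction to single-item antecedents are all my invention:

# Brute-force sketch (mine, not Magnum Opus itself): rank every
# single-item rule by a chosen measure and keep the top k.
from itertools import permutations

baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
           {"milk", "bread", "eggs"}, {"vodka", "caviar"}]
n = len(baskets)

def leverage(a, c):
    """Observed support minus the support expected under independence."""
    p_a = sum(a in b for b in baskets) / n
    p_c = sum(c in b for b in baskets) / n
    support = sum(a in b and c in b for b in baskets) / n
    return support - p_a * p_c

items = sorted(set().union(*baskets))
rules = [(a, c, leverage(a, c)) for a, c in permutations(items, 2)]
top_k = sorted(rules, key=lambda r: r[2], reverse=True)[:3]
for a, c, lev in top_k:
    print(f"{a} -> {c}  leverage={lev:.3f}")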


Association Metrics

Coverage

Percentage of instances which match the antecedent “A”:

Coverage = p(A)

[Venn diagram: instances matching A among all instances]


Association Metrics

Support

Percentage of instances which match both the antecedent “A” and the consequent “C”:

Support = p(A and C)

[Venn diagram: instances matching both A and C]


Association Metrics

Confidence

Percentage of instances matching the antecedent which also match the consequent:

Confidence = Support / Coverage

[Venn diagram: the fraction of A's region that overlaps C]


Association Metrics

[Diagram: Confidence runs from 0% to 100%.
 0%: A and C never co-occur (A never implies C);
 partial overlap: A sometimes implies C;
 100%: every instance matching A also matches C (A always implies C).]


Association Metrics

Lift

Ratio of the observed support to the support expected if A and C were statistically independent:

Lift = Support / (p(A) * p(C)) = Confidence / p(C)

[Diagram: observed overlap of A and C vs. the overlap expected under independence]


Association Metrics

[Diagram: interpreting Lift.
 Lift < 1: negative correlation (A and C overlap less than expected under independence);
 Lift = 1: no association (A and C independent);
 Lift > 1: positive correlation (A and C overlap more than expected under independence).]


Association Metrics

Leverage

Difference between the observed support and the support expected if A and C were statistically independent:

Leverage = Support - p(A) * p(C)

[Diagram: observed overlap of A and C vs. the overlap expected under independence]


Association Metrics

[Diagram: interpreting Leverage.
 Leverage < 0: negative correlation;
 Leverage = 0: no association (A and C independent);
 Leverage > 0: positive correlation.]
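To make the five metrics concrete, here is a minimal Python sketch (mine, not from the talk) that computes all of them for one of the example rules over the toy transaction table shown earlier:

# Minimal sketch: coverage, support, confidence, lift, and leverage
# for the rule {class = gas} -> amount < 100 over the toy transactions.
transactions = [
    {"customer": "Bob",   "class": "clothes", "amount": 135},
    {"customer": "Bob",   "class": "food",    "amount": 401},
    {"customer": "Alice", "class": "food",    "amount": 234},
    {"customer": "Sally", "class": "gas",     "amount": 94},
    {"customer": "Bob",   "class": "tech",    "amount": 2459},
    {"customer": "Bob",   "class": "gas",     "amount": 83},
    {"customer": "Sally", "class": "food",    "amount": 51},
]

def antecedent(t):  # {class = gas}
    return t["class"] == "gas"

def consequent(t):  # amount < 100
    return t["amount"] < 100

n = len(transactions)
n_a = sum(1 for t in transactions if antecedent(t))                     # matches A
n_c = sum(1 for t in transactions if consequent(t))                     # matches C
n_ac = sum(1 for t in transactions if antecedent(t) and consequent(t))  # matches both

coverage = n_a / n                        # p(A)
support = n_ac / n                        # p(A and C)
confidence = support / coverage           # Support / Coverage
lift = support / (coverage * (n_c / n))   # observed vs. independent support
leverage = support - coverage * (n_c / n)

print(f"coverage={coverage:.2f} support={support:.2f} "
      f"confidence={confidence:.2f} lift={lift:.2f} leverage={leverage:.3f}")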


Use Cases

GOAL: Discover “interesting” rules about what store items are typically purchased together.

• Dataset of 9,834 grocery cart transactions

• Each row is a list of all items in a cart at checkout


Association Discovery Demo #1


Use Cases

GOAL: Find general rules that indicate diabetes.

• Dataset of diagnostic measurements of 768 patients.

• Each patient labelled True/False for diabetes.


Association Discovery Demo #2


Medical Risks

Decision Tree:

If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44, then diabetes = TRUE

Association Rule:

If plasma glucose > 146, then diabetes = TRUE

Latent Dirichlet Allocation

#VSSML16

September 2016


Outline

1 Understanding the Limits of Simple Text Analysis

2 Aside: Generative Processes

3 Latent Dirichlet Allocation

4 A Couple of Instructive Examples

5 Applications


1 Understanding the Limits of Simple Text Analysis

Bag of Words Analysis

• The easiest way of analyzing a text field is just to treat it as a “bag of words”
• Each word is a separate feature (usually an occurrence count)
• When modeling, the features are treated in isolation from one another, essentially “one at a time” (see the sketch below)
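As a minimal sketch of the idea (mine, not the slides'), turning raw text into per-word occurrence counts:

# Minimal bag-of-words sketch: each distinct word becomes a feature
# whose value is its occurrence count in the document.
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on non-letters, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "Lead the team. The team follows the lead."
print(bag_of_words(doc))
# Counter({'the': 3, 'lead': 2, 'team': 2, 'follows': 1})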


Limitations

• Words are sometimes ambiguous
• Both because of multiple definitions and differences in tone
• How do we usually disambiguate words? Context


An Instructive Example

• One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data
• Suppose I have a collection of documents where some of them are about two different topics (via Ted Underwood's blog):
  • Leadership (CEOs, organization, management)
  • Chemistry (elements, compounds, reactions)
• If I do a keyword search for “lead” (or try to classify documents based on that word alone), I'll get documents from either category and documents that are a mix of both
• Can we build a feature that better isolates which set of documents we're looking for?


2 Aside: Generative Processes

Generative Modeling

• Posit a parameterized structure that is responsible for generating the data
• Use the data to fit the parameters
• A notion of causality is important for these models


Example of a Generative Model

• Consider a patient with some disease
• Class: disease present/absent; Features: test results
• Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease
• This generative process implies a structure, in this case the so-called “Naive Bayes” model (sketched below)
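A toy sketch of that generative story, with made-up probabilities: draw the class first, then draw each test result independently given the class.

# Toy Naive Bayes generative process (illustrative, invented numbers):
# the class (disease) is drawn first, and each feature (test result)
# is then drawn independently conditioned on the class.
import random

P_DISEASE = 0.1                              # p(class = present)
P_POSITIVE_TEST = {True: 0.9, False: 0.05}   # p(test positive | class)

def generate_patient(n_tests: int = 3):
    disease = random.random() < P_DISEASE
    tests = [random.random() < P_POSITIVE_TEST[disease] for _ in range(n_tests)]
    return disease, tests

random.seed(0)
for _ in range(3):
    print(generate_patient())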


Generative vs. Discriminative

• This is an important distinction in machine learning generally
• Generative models try to model / assume a structure for the process generating the data
• More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data
• Discriminative models don't care; they “solve the prediction problem directly” (Vapnik), and model only the conditional p(y|x)


Which is Better?

• No general answer to this question (not that we haven't tried); see the paper “On Discriminative vs. Generative Classifiers”:
  http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
• Discriminative models tend to be faster to fit, quicker to predict, and in the case of non-parametrics are often guaranteed to converge to the correct answer given enough data
• Generative models tend to be more probabilistically sound and able to do more than just classify


3 Latent Dirichlet Allocation

A New Way of Thinking About Documents

• Three entities: documents, terms, and topics
• A term is a single lexical token (usually one or more words, but can be any arbitrary string)
• A document has many terms
• A topic is a distribution over terms


A Generative Model for Documents

• A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions
• To create a document, repeatedly draw a topic at random from that distribution, then draw a term from the topic (which, remember, is a distribution over terms); a toy version is sketched below
• The main thing we want to infer is the topic distribution
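A toy sketch of that process (my illustration; the topic names, vocabularies, and probabilities are invented), using numpy's Dirichlet sampler:

# Toy version of the LDA generative story. A document is a distribution
# over topics; each word is drawn by first picking a topic, then a term
# from that topic's distribution over terms.
import numpy as np

rng = np.random.default_rng(0)

topics = {
    "leadership": (["lead", "ceo", "team", "manage"], [0.4, 0.2, 0.2, 0.2]),
    "chemistry":  (["lead", "atom", "acid", "bond"],  [0.1, 0.3, 0.3, 0.3]),
}
names = list(topics)

# The document's topic mixture is itself drawn from a Dirichlet prior.
doc_topic_dist = rng.dirichlet(alpha=[0.5] * len(names))

def generate_document(n_words: int = 8):
    words = []
    for _ in range(n_words):
        topic = names[rng.choice(len(names), p=doc_topic_dist)]
        terms, probs = topics[topic]
        words.append(terms[rng.choice(len(terms), p=probs)])
    return words

print(doc_topic_dist, generate_document())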


Dirichlet Process Intuition: Rich Get Richer

• We use a Dirichlet process to model the relationship between documents, topics, and terms
• We're more likely to think a word came from a topic if we've already seen a bunch of words from that topic
• We're more likely to think the topic was responsible for generating the document if we've already seen a bunch of words in the document from that topic
• Here lies the disambiguation: if a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has
• Note that there's a little bit of self-fulfilling prophecy going on here, by design (see the toy sampler below)
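A Polya-urn style toy sampler (my illustration) shows the rich-get-richer dynamic: whichever topic accumulates words early becomes ever more likely to receive the next one.

# "Rich get richer" sketch: the chance of assigning the next word to a
# topic grows with the number of words already assigned to that topic.
import random

random.seed(1)
counts = {"leadership": 1, "chemistry": 1}  # one pseudo-count each to start

for _ in range(20):
    total = sum(counts.values())
    r = random.uniform(0, total)
    for topic, c in counts.items():
        r -= c
        if r <= 0:
            counts[topic] += 1  # the chosen topic becomes more likely next time
            break

print(counts)  # one topic tends to end up dominating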


4 A Couple of Instructive Examples

Usenet Movie Reviews

Library of over 26,000 movie reviews:

A solid noir melodrama from Vincent Sherman, who takes a standard story and dresses it up with moving characterizations and beautifully expressionistic B&W photography from cinematographer James Wong Howe. The director took a songwriter Paul Webster's short magazine story called "The Man Who Died Twice" and improved the story by rounding out the characters to give them both strong and weak points, so that they would not be one-note characters as was the case in the original story. The film was made by Warner Brothers, who needed a film for their contract star Ann Sheridan and asked Sherman to change the story around so that her part as Nora Prentiss, a nightclub singer, is expanded


Supreme Court Cases

Library of about 7,500 Supreme Court cases:

NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F. SUPP. 819, REVERSED.

FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA, ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH


5 Applications

Visualizing Changes in Topic Over Time

• Plot changes in topic distribution over time (see the sketch below)
• Especially nice for dated historical collections (e.g., novels, newspapers)
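A minimal plotting sketch (mine; the years and topic weights are invented) of what such a visualization could look like:

# Sketch: plot each topic's average weight per year for a dated
# collection, given per-document topic distributions and document years.
import numpy as np
import matplotlib.pyplot as plt

years = np.array([1900, 1900, 1910, 1910, 1920, 1920])   # one per document
doc_topics = np.array([                                   # rows sum to 1
    [0.8, 0.2], [0.7, 0.3], [0.5, 0.5],
    [0.4, 0.6], [0.2, 0.8], [0.3, 0.7],
])

for k in range(doc_topics.shape[1]):
    means = [doc_topics[years == y, k].mean() for y in np.unique(years)]
    plt.plot(np.unique(years), means, marker="o", label=f"topic {k}")

plt.xlabel("year"); plt.ylabel("mean topic weight"); plt.legend(); plt.show()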


Search Without Keywords

• Keyword search is great, if you know the keywords
• Good for finding search terms
• Great for, e.g., legal discovery
• Nice for finding “outliers”
• Surprise topics (from the recycle bin)


Feature Spaces for Classification

• Just classify the documents in “topic space” rather than “bag space” (see the sketch after this list)
• The topics that come out of LDA have some nice benefits as features:
  • can reduce a feature space of thousands to a few dozen (faster to fit)
  • nicely interpretable
  • automatically tailored to the documents you've provided
• Foreshadowing alert: when using LDA in this way, we're doing a form of feature engineering, which we'll hear more about tomorrow
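One way to realize this in practice (my sketch, using scikit-learn rather than anything shown in the talk): fit LDA on word counts, then train a classifier on the per-document topic proportions instead of the raw bag of words.

# Sketch: classify documents in topic space. LDA compresses bag-of-words
# counts into a handful of topic proportions, which then serve as the
# feature vector for an ordinary classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["the ceo leads the team", "acid reacts with the compound",
        "management organizes the team", "the element forms a bond"]
labels = [0, 1, 0, 1]  # 0 = leadership, 1 = chemistry (toy labels)

counts = CountVectorizer().fit_transform(docs)               # bag space
topic_space = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(counts)   # topic space

clf = LogisticRegression().fit(topic_space, labels)
print(clf.predict(topic_space))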


Some Caveats

• You need to choose the number of topics beforehand
• Takes forever, both to fit and to do inference
• Takes a lot of text to make it meaningful
• Tends to focus on “meaningless minutiae”
• While it sometimes makes a nice classification space, it's a rare case where it provides dramatic improvement over bag-of-words
• I find it nice just for exploration


Thus Ends The Lesson

Questions?
