Phenomenal Sli

28
PHENOMENAL DATA MINING: FROM OBSERVATIONS TO PHENOMENA www-formal.stanford.edu/jmc/data-mining.html John McCarthy Stanford University  [email protected] Conventional data mining infers relations among d e.g. the fraction of supermarket baskets with diapers also contain beer. Phenomenal data mining concerns relations between data and the phenomena underlying the data, e.g. yo married couples keeping old friends buy diapers and b Example: The sales receipts of a supermarket usuall not identify the customers. Grouping baskets by custo is possible and useful but requires new techniques. 1

Transcript of Phenomenal Sli

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 1/28

PHENOMENAL DATA MINING: FROM

OBSERVATIONS TO PHENOMENA

www-formal.stanford.edu/jmc/data-mining.ht

John McCarthyStanford University

 [email protected]

• Conventional data mining infers relations amon

e.g. the fraction of supermarket baskets with diape

also contain beer.

• Phenomenal  data mining concerns relations betw

data and the phenomena underlying the data, e.g

married couples keeping old friends buy diapers an

• Example: The sales receipts of a supermarket us

not identify the customers. Grouping baskets by cu

is possible and useful but requires new techniques

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 2/28

OBSERVATIONS versus PHENOMENA

Events occur in the world.

The events sometimes cause some observations in

server.

Two cars collide and a blind person hears the noi

A person buys some groceries and a database e

generated.

The observer infers from the sound of collision and

quent shouts that someone was injured. He furthe

that it was someone he knows.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 3/28

OBSERVATIONS and PHENOMENA

• Databases of purchases are observations of cu

behavior.

• Programs going beyond observations need knoof the world.

• A supermarket program needs facts about rconsumption, items that go together, needs of kinds of customers.

• The data-mining program infers that 30 baskepurchased by the same female customer with

three children whose husband goes on long tr

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 4/28

THE MAIN TOOLS

• Extend the relational database to include entitcustomers not present in the original database

• Knowledge base of facts represented as sente

a first order logical language.

• Minimize the total anomaly of the extended da

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 5/28

SUPERMARKET PROBLEM WITH MADE-

NUMBERS

• Chain has 1,000 supermarkets.

• Supermarket stocks 10,000 items.

• Supermarket has 10,000 customers.

• 1,000 purchase “baskets” per day.

• 20 items per “basket”.

Group baskets purchased by the same customer.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 6/28

GROUPING BASKETS BY CUSTOMER

• Data records purchases but not always customecustomer info is useful.

• Can a suitable data miner group baskets by cus

well enough to be useful?

• We call this identifying customers even though

give us the customers’ names.

• Grouping by customer is not a clustering  p

although there are some resemblances. Why?

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 7/28

• Use any available information about people

sumption and buying habits.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 8/28

EXAMPLES OF FACTS

• Rates of consumption vary less than rates of pu

• Children consume milk at steady rates.

•A family that buys diapers will soon buy baband six months later junior food.

• Variety in detergents is not a consumer goal.

• Variety in soft drinks is often wanted.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 9/28

• Italians buy much olive oil.

Which of these facts can a program use—and how

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 10/28

THE SIGNATURE HYPOTHESIS

• Most customers have enough unique purcha

terns among the 10,000 items to constitute a

tifying signature.

• Signature based on items for which variety is

pecially desired by customer, e.g. brand of dish

detergent.

• Problem: Customers don’t buy much of thei

tures each time they go to the store.

Signatures are only one of many tools for identifyitomers.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 11/28

ASSIGNMENTS AND THEIR ANOMALIES

• An assignment  assigns each basket to a putat

tomer.

• A partial assignment  assigns some baskets

tomers.

• If  α is an assignment, anomaly(α) measures h

the assignment is. (Partial assignments too.)

• anomaly(α) is a sum with terms associated w

putative customers and terms associated with

signment as a whole.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 12/28

• The data miner hill climbs in the space of (

assignments minimizing (total) anomaly.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 13/28

PER CUSTOMER ANOMALIES

• Badness of best signature. The signature as

gives probabilities of purchase.

• Badness of consumption continuity. It is u

though not impossible, that a family of three w

ten pounds of sugar on each of two successiv

• Badness of demographic ascription.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 14/28

SIGNATURES CHANGE

• Customers change their buying habits and hen

signatures. If the changes aren’t too great, th

be tracked.

• Some changes don’t count as raising an anoma

change from buying baby food to buying junio

• A fact about the world:

Buys(Babyf ood, customer, s) → (∃s′)(s < s′

Buys(J uniorf ood, customer, s′).

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 15/28

• A corresponding fact about the data mining:

x ∈ P urchases(Basket1) ∧ x ∈ Babyfood

∧Time(Basket1) < Time(Basket2) ∧ y ∈ Bas

∧y ∈ Juniorfood ∧ Ascribed(Basket1, custome

→ Anomaly(y, Basket2, customer) = 0.

What does it take to derive (2) from (1)?

information must be in the knowledge base fo

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 16/28

BADNESS OF ASSIGNMENT AS A WHOL

• Number of distinct customers

• Wrong demographics

• Violates beliefs of marketing experts

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 17/28

WHAT USE IS PHENOMENAL DATA MININ

• Stop buying hula hoops. Although sales havincreasing, they are only among preeteen gir

they buy just one.

• Decide that product A will sell well in stores

customers have been identified by phenomenmining as having a certain distribution of ag

ethnic, social class and taste characteristics.

waste of shelf space and of capital to sell it i

stores.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 18/28

REMARKS

•An experiment to identify customers from supket data is worth making. The experiment w

best if customer identification were available b

used to verify identifications. Enough facts are

obtained.

• How far away does the customer live? Don’t

this can’t be inferred.

• There are other applications and experiments.

wants data mining on data returned from spac

Phenomenal data mining is what they need.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 19/28

• Donal Lyons and Gregory Tseytain did PD

Dublin Transport data.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 20/28

APPLICATION TO CATCHING TERRORIS

•The members of a terrorist group may use fin a common way that yields a signature. Th

component of the Sept 11 terrorist signature

be using Travelocity.

• Groups with signatures can be inferred withoindividual having been previously suspected.

• The FBI does a lot of what is essentially phen

data mining by hand, but some methods of

groups are computationally intensive.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 21/28

FORMULAS

Separate credit cards for terrorist expenses (dubio

Has( person, creditcard1) ∧ Has( person, creditcard2∧Approximately-included(P urchases(creditcard1), T

∧Approximately-disjoint(P urchases(creditcard1), T

→ TwoCards ∈ Suspicions( person)

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 22/28

TERRORIST FORMULAS 2

Signatures:

Terrorists, like other groups of people, undoubte

the facilities of our society in special ways, some o

show up in databases of air travel, car rentals, tel

calls, credit card use, etc. They need to be distin

from other groups, e.g. employees of some com

researchers in AI.

(∃signature)((∀ person ∈ group)(adheres(signature,

∧¬(∃employer)(Members(group) ⊂ Employees(emp

→ suspicious(group)

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 23/28

TERRORIST FORMULAS 3

Identifying a group as common postponers of trip

Occurs(P ostponement(meeting), s) → (∀x)(Attende

→ Holds(Must(x, P ostpone(Trip-Meeting(x))),Nex

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 24/28

MORE REMARKS

• Suppose a customer of type i has a probabilit

including item j in a basket. We can infer an

imate number of types by looking at the appro

rank of the matrix P ij.

• Classifying customers into discrete types may n

as good results as a more complex model thinto account the age of the customer as a con

variable.

• A linear relation between phenomena and obser

is the simplest case, and such relations can pdiscovered by methods akin to factor analysis

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 25/28

• We could infer that there were two subpopula

we didn’t already know about sex.

• We might infer from data from our stores in

that there was a substantial part of the pop

that didn’t purchase meat products. We can t

from a situation in which everyone buys me

less, because certain other purchase patterns

sociated with not buying meat.

• If a customer buys a certain product but doesn

necessary complementary product, we can inf

he buys the complementary product from so

else.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 26/28

• Some brain storming is appropriate in thinking

tomer patterns, because the more we can th

the better the chances of identification.

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 27/28

HARANGUE about BAD PHILOSOPHY an

INADEQUATE COMPUTER SCIENCE

Extreme positivism held that science consisted

tions among sense data.

Much learning research and even logical AI resea

volves making inferences about existing data ex

directly in terms of this data.

Science does better. We and our environment ar

plex structures built up from atoms.

The phenomena are not immediately apparent in

servations and are not just relations among observ

7/28/2019 Phenomenal Sli

http://slidepdf.com/reader/full/phenomenal-sli 28/28

Like science, phenomenal data mining uses whate

main dependent information about the phenome

be available and useful.