Page 1:

An introduction to machine learning and probabilistic graphical models

Kevin Murphy

MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003

Page 2: . An introduction to machine learning and probabilistic ...

2

Overview

• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Page 3: . An introduction to machine learning and probabilistic ...

3

Supervised learning

Color Shape Size Output

Blue Torus Big Y

Blue Square Small Y

Blue Star Small Y

Red Arrow Small N

Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs

Page 4: . An introduction to machine learning and probabilistic ...

4

Supervised learning

Training data:
X1 X2 X3 T
B  T  B  Y
B  S  S  Y
B  S  S  Y
R  A  S  N

Testing data:
X1 X2 X3 T
B  A  S  ?
Y  C  S  ?

[Diagram: training data → Learner → Hypothesis → predictions (Y, N) for the testing data]

Page 5: . An introduction to machine learning and probabilistic ...

5

Key issue: generalization

[Figure: labelled training examples (yes/no) and new, unlabelled examples (?)]

Can’t just memorize the training set (overfitting)

Page 6: . An introduction to machine learning and probabilistic ...

6

Hypothesis spaces

• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …

Page 7: . An introduction to machine learning and probabilistic ...

7

Perceptron (a neural net with no hidden layers)

Linearly separable data
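Not part of the slide, but a minimal sketch of perceptron training on a toy linearly separable set (the data and learning rate are invented for the example):

```python
import numpy as np

# Toy linearly separable data: label +1 if x0 + x1 > 1, else -1.
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.9, 0.8], [1.0, 0.6]])
y = np.array([-1, -1, +1, +1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 1.0          # learning rate

# Classic perceptron updates: adjust w, b whenever a point is misclassified.
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w += lr * yi * xi
            b += lr * yi
            errors += 1
    if errors == 0:   # converged: every point is on the correct side
        break

print("weights:", w, "bias:", b)
print("predictions:", np.sign(X @ w + b))
```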

Page 8: . An introduction to machine learning and probabilistic ...

8

Which separating hyperplane?

Page 9: . An introduction to machine learning and probabilistic ...

9

The linear separator with the largest margin is the best one to pick

margin

Page 10: . An introduction to machine learning and probabilistic ...

10

What if the data is not linearly separable?

Page 11: . An introduction to machine learning and probabilistic ...

11

Kernel trick

Kernel feature map: (x, y) → (z1, z2, z3) = (x², √2·x·y, y²)

The kernel implicitly maps from 2D to 3D, making the problem linearly separable
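A small check of this identity (not from the slide): the explicit 2D → 3D map gives the same inner products as a quadratic kernel evaluated directly in 2D.

```python
import numpy as np

def phi(p):
    """Explicit 2D -> 3D feature map (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x * x, np.sqrt(2) * x * y, y * y])

def quad_kernel(p, q):
    """Kernel computed directly in 2D: k(p, q) = (p . q)^2."""
    return np.dot(p, q) ** 2

p, q = np.array([1.0, 2.0]), np.array([0.5, -1.5])

# The kernel equals the inner product in the 3D feature space,
# so the map phi never needs to be constructed explicitly.
print(np.dot(phi(p), phi(q)))   # 6.25
print(quad_kernel(p, q))        # 6.25
```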

Page 12: . An introduction to machine learning and probabilistic ...

12

Support Vector Machines (SVMs)

Two key ideas:
• Large margins
• Kernel trick

Page 13: . An introduction to machine learning and probabilistic ...

13

Boosting

Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations of them

Boosting maximizes the margin
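A minimal AdaBoost-style sketch of this idea, boosting decision stumps on an invented 1D data set:

```python
import numpy as np

# Toy 1D data set; labels are +/-1. Data invented for illustration.
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([+1, +1, -1, -1, +1, -1, +1, +1])

def best_stump(X, y, w):
    """Weak learner: threshold classifier sign(s*(x - t)) minimizing weighted error."""
    best = None
    for t in np.unique(X):
        for s in (+1, -1):
            pred = s * np.sign(X - t + 1e-12)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

w = np.ones(len(X)) / len(X)      # uniform example weights
stumps, alphas = [], []

for _ in range(10):                # 10 rounds of boosting
    err, t, s = best_stump(X, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)    # weight of this stump in the vote
    pred = s * np.sign(X - t + 1e-12)
    w *= np.exp(-alpha * y * pred)           # up-weight the mistakes
    w /= w.sum()
    stumps.append((t, s))
    alphas.append(alpha)

def predict(x):
    """Weighted vote of all the stumps."""
    return np.sign(sum(a * s * np.sign(x - t + 1e-12)
                       for a, (t, s) in zip(alphas, stumps)))

print([int(predict(x)) for x in X])   # typically reproduces y on this tiny set
```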

Page 14: . An introduction to machine learning and probabilistic ...

14

Supervised learning success stories

• Face detection
• Steering an autonomous car across the US
• Detecting credit card fraud
• Medical diagnosis
• …

Page 15: . An introduction to machine learning and probabilistic ...

15

Unsupervised learning

What if there are no output labels?

Page 16: . An introduction to machine learning and probabilistic ...

16

K-means clustering
1. Guess the number of clusters, K
2. Guess initial cluster centers, μ1, μ2
3. Assign data points xi to the nearest cluster center
4. Re-compute the cluster centers based on the assignments
Iterate steps 3 and 4 until convergence.
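A minimal sketch of these four steps (synthetic data, K chosen by hand):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means, following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # 2. Guess initial cluster centers (here: K random data points).
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 3. Assign each point to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 4. Re-compute each center as the mean of its assigned points.
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, assign

# Two synthetic blobs (invented data); K guessed as 2 (step 1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers, assign = kmeans(X, K=2)
print(centers)
```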

Page 17: . An introduction to machine learning and probabilistic ...

17

AutoClass (Cheeseman et al, 1986)

• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means
• Uses a Bayesian criterion to select K
• Discovered new types of stars from spectral data
• Discovered new classes of proteins and introns from DNA/protein sequence databases
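AutoClass itself is not reproduced here; as a minimal sketch of the “soft K-means” idea behind it, this is EM for a two-component mixture of Gaussians in 1D on invented data (the component count and initial guesses are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented 1D data drawn from two Gaussians.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.7, 300)])

# Initial guesses for mixing weights, means, and standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(100):
    # E step: soft responsibilities (cf. the hard assignments of K-means).
    r = np.stack([p * gauss(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters from the weighted data.
    n_k = r.sum(axis=0)
    pi = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(pi, mu, sigma)   # roughly (0.4, 0.6), (-2, 3), (1, 0.7)
```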

Page 18: . An introduction to machine learning and probabilistic ...

18

Hierarchical clustering

Page 19: . An introduction to machine learning and probabilistic ...

19

Principal Component Analysis (PCA)

PCA seeks a projection that best represents the data in a least-squares sense.

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
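As a minimal sketch of this least-squares view of PCA (not part of the slides), the projection can be computed from the SVD of the centered data; the data here is invented:

```python
import numpy as np

def pca(X, k):
    """Project X onto the k directions of greatest scatter (least-squares optimal)."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                        # top-k principal directions
    return Xc @ components.T, components       # low-dimensional coordinates

# Invented 3D data that mostly varies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

Z, comps = pca(X, k=1)
print(comps)   # roughly proportional to (2, 1, 0.5), up to sign
```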

Page 20: . An introduction to machine learning and probabilistic ...

20

Discovering nonlinear manifolds

Page 21: . An introduction to machine learning and probabilistic ...

21

Combining supervised and unsupervised learning

Page 22: . An introduction to machine learning and probabilistic ...

22

Discovering rules (data mining)

Occup.  Income  Educ.  Sex  Married  Age

Student $10k MA M S 22

Student $20k PhD F S 24

Doctor $80k MD M M 30

Retired $30k HS F M 60

Find the most frequent patterns (association rules), e.g.:

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
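A minimal sketch of mining such patterns by brute force; the support and confidence measures are standard, and the records below are invented in the spirit of the table above:

```python
from itertools import combinations

# Records as attribute=value sets (invented for illustration).
records = [
    {"occup=Student", "educ=MA",  "sex=M", "married=S"},
    {"occup=Student", "educ=PhD", "sex=F", "married=S"},
    {"occup=Doctor",  "educ=MD",  "sex=M", "married=M"},
    {"occup=Retired", "educ=HS",  "sex=F", "married=M"},
]

def support(itemset):
    """Fraction of records containing every item in the set."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(lhs, rhs):
    """Estimated P(rhs | lhs) for the rule lhs => rhs."""
    return support(lhs | rhs) / support(lhs)

# Enumerate frequent pairs and print the corresponding simple rules.
items = sorted(set().union(*records))
for a, b in combinations(items, 2):
    s = support({a, b})
    if s >= 0.5:
        print(f"{{{a}, {b}}}  support={s:.2f}  conf({a} => {b})={confidence({a}, {b}):.2f}")
```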

Page 23: . An introduction to machine learning and probabilistic ...

23

Unsupervised learning: summary

• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dimensionality reduction
• Learning rules

Page 24: . An introduction to machine learning and probabilistic ...

24

Discovering networks

?

From data visualization to causal discovery

Page 25: . An introduction to machine learning and probabilistic ...

25

Networks in biology

Most processes in the cell are controlled by networks of interacting molecules:
• Metabolic networks
• Signal transduction networks
• Regulatory networks

Networks can be modeled at multiple levels of detail/realism (in order of decreasing detail):
• Molecular level
• Concentration level
• Qualitative level

Page 26: . An introduction to machine learning and probabilistic ...

26

Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

5 genes, 67 parameters based on 50 years of research. The stochastic simulation required a supercomputer.

Page 27: . An introduction to machine learning and probabilistic ...

27

Concentration level: metabolic pathways

Usually modeled with differential equations

[Figure: a small gene network g1…g5 with weighted interactions (w12, w23, w55, …)]

Page 28: . An introduction to machine learning and probabilistic ...

28

Qualitative level: Boolean Networks

Page 29: . An introduction to machine learning and probabilistic ...

29

Probabilistic graphical models

Supports graph-based modeling at various levels of detail

• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
• But can also model deterministic, causal processes

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace

Page 30: . An introduction to machine learning and probabilistic ...

30

Graphical models: outline

• What are graphical models?
• Inference
• Structure learning

Page 31: . An introduction to machine learning and probabilistic ...

31

Simple probabilistic model: linear regression

Y = α + β X + noise

[Figure: scatter plot of Y against X with the fitted line, the deterministic (functional) part of the relationship]

Page 32: . An introduction to machine learning and probabilistic ...

32

Simple probabilistic model: linear regression

Y = α + β X + noise

[Figure: scatter plot of Y against X with the fitted line]

“Learning” = estimating the parameters α, β, σ from the (x, y) pairs:
• α, β can be estimated by least squares
• the estimated mean is the empirical mean
• σ² is the residual variance
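Not on the slide, but a minimal numeric sketch of this estimation step (data and noise level invented for the example):

```python
import numpy as np

# Invented (x, y) pairs generated as y = alpha + beta*x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)

# Least-squares estimates of beta and alpha, then the residual variance sigma^2.
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()
sigma2_hat = np.mean((y - (alpha_hat + beta_hat * x)) ** 2)

print(alpha_hat, beta_hat, sigma2_hat)   # roughly 2.0, 0.5, 1.0
```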

Page 33: . An introduction to machine learning and probabilistic ...

33

Piecewise linear regression

Latent “switch” variable – hidden process at work

Page 34: . An introduction to machine learning and probabilistic ...

34

Probabilistic graphical model for piecewise linear regression

[Figure: graphical model with arcs X → Q and X, Q → Y]

• Hidden variable Q chooses which set of parameters to use for predicting the output Y from the input X.
• The value of Q depends on the value of the input X.
• This is an example of a “mixture of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (cf. K-means).

Page 35: . An introduction to machine learning and probabilistic ...

35

Classes of graphical models

Probabilistic models
  Graphical models
    Directed: Bayes nets (and DBNs)
    Undirected: MRFs

Page 36: . An introduction to machine learning and probabilistic ...

36

Bayesian Networks

Qualitative part: a directed acyclic graph (DAG)
• Nodes - random variables
• Edges - direct influence

Quantitative part: a set of conditional probability distributions

[Figure: the Alarm network (Earthquake and Burglary are parents of Alarm, Earthquake is a parent of Radio, Alarm is a parent of Call), together with the conditional probability table P(A | E, B) for the family of Alarm; its rows over the (e, b) combinations contain the entries 0.9/0.1, 0.2/0.8, 0.9/0.1, 0.01/0.99]

Compact representation of probability distributions via conditional independence

Together, they define a unique distribution in a factored form:

P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
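As a minimal sketch of how such a factored distribution is evaluated (the CPT numbers below are illustrative, not the ones from the slide):

```python
# P(B), P(E), P(A|B,E), P(R|E), P(C|A) as plain dictionaries.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=True | B, E)
P_R = {True: 0.9, False: 0.01}                        # P(R=True | E)
P_C = {True: 0.7, False: 0.05}                        # P(C=True | A)

def bernoulli(p_true, value):
    """P(X = value) given P(X = True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    """P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_R[e], r)
            * bernoulli(P_C[a], c))

print(joint(b=False, e=False, a=True, c=True, r=False))
```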

Page 37: . An introduction to machine learning and probabilistic ...

37

Example: the “ICU Alarm” network
Domain: monitoring intensive-care patients
• 37 variables
• 509 parameters …instead of 2^54

[Figure: the 37-node ICU Alarm network, with variables such as PCWP, CO, HRBP, SAO2, VENTLUNG, HYPOVOLEMIA, CVP, and BP]

Page 38: . An introduction to machine learning and probabilistic ...

38

Success stories for graphical models

• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at the Shannon limit
• Genetic pedigree analysis
• …

Page 39: . An introduction to machine learning and probabilistic ...

39

Graphical models: outline

• What are graphical models? ✓
• Inference
• Structure learning

Page 40: . An introduction to machine learning and probabilistic ...

40

Probabilistic Inference

• Posterior probabilities: the probability of any event given any evidence, P(X|E)

[Figure: the Alarm network (Earthquake, Burglary, Alarm, Radio, Call), with Radio and Call observed as evidence]

Page 41: . An introduction to machine learning and probabilistic ...

41

Viterbi decoding

[Figure: a Hidden Markov Model (HMM) with hidden states X1, X2, X3 and observed outputs Y1, Y2, Y3; the example observation is the word “Tomato”]

Compute the most probable explanation (MPE) of the observed data.
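The slide does not give the algorithm; as a minimal sketch, Viterbi decoding for a small discrete HMM might look like this (the transition and emission numbers are invented):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden state sequence (MPE) for a discrete HMM.
    pi[i]: initial prob, A[i, j]: transition prob, B[i, k]: emission prob of symbol k."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best path score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state.
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t][states[-1]]))
    return states[::-1]

# Hypothetical 2-state HMM with 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], pi, A, B))
```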

Page 42: . An introduction to machine learning and probabilistic ...

42

Inference: computational issues

[Figure: the ICU Alarm network]

Easy → Hard: chains, trees, grids, dense loopy graphs

Page 43: . An introduction to machine learning and probabilistic ...

43

Inference: computational issues

[Figure: the ICU Alarm network]

Easy → Hard: chains, trees, grids, dense loopy graphs

There are many different inference algorithms, both exact and approximate.

Page 44: . An introduction to machine learning and probabilistic ...

44

Bayesian inference

• Bayesian probability treats parameters as random variables
• Learning / parameter estimation is replaced by probabilistic inference of P(θ | D)
• Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

[Figure: graphical model in which θ is a shared parent of each (Xi, Yi) pair, i = 1, …, n]

The parameters are tied (shared) across repetitions of the data.

Page 45: . An introduction to machine learning and probabilistic ...

45

Bayesian inference

+ Elegant: no distinction between parameters and other hidden variables
+ Can use priors to learn from small data sets (cf. one-shot learning by humans)
- Math can get hairy
- Often computationally intractable

Page 46: . An introduction to machine learning and probabilistic ...

46

Graphical models: outline

• What are graphical models? ✓
• Inference ✓
• Structure learning

Page 47: . An introduction to machine learning and probabilistic ...

47

Why Struggle for Accurate Structure?

Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure

Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure

[Figure: the true network over Earthquake, Burglary, Alarm Set, and Sound, next to variants with an added arc and a missing arc]

Page 48: . An introduction to machine learning and probabilistic ...

48

Score based Learning

Data over (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>

[Figure: several candidate network structures over E, B, and A]

• Define a scoring function that evaluates how well a structure matches the data
• Search for a structure that maximizes the score

Page 49: . An introduction to machine learning and probabilistic ...

49

Learning Trees

Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree

If some of the variables are hidden, problem becomes hard again, but can use EM to fit mixtures of trees
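As a rough sketch of the spanning-tree step (the inner loop of Chow-Liu), assuming the pairwise mutual-information weights have already been estimated from data:

```python
import numpy as np

def max_spanning_tree(W):
    """Prim-style maximum-weight spanning tree over a symmetric weight matrix W.
    For tree-structured Bayes nets, W[i, j] would hold the pairwise mutual
    information estimated from the data."""
    n = len(W)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or W[i, j] > best[0]):
                    best = (W[i, j], i, j)
        _, i, j = best
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical mutual-information matrix for 4 variables.
W = np.array([[0.0, 0.9, 0.1, 0.2],
              [0.9, 0.0, 0.3, 0.6],
              [0.1, 0.3, 0.0, 0.4],
              [0.2, 0.6, 0.4, 0.0]])
print(max_spanning_tree(W))   # [(0, 1), (1, 3), (3, 2)]
```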

Page 50: . An introduction to machine learning and probabilistic ...

50

Heuristic Search

Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search.

Define a search space:
• search states are possible structures
• operators make small changes to a structure

Traverse the space looking for high-scoring structures. Search techniques (a minimal sketch follows this list):
• Greedy hill-climbing
• Best-first search
• Simulated annealing
• ...
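A minimal sketch of greedy hill-climbing over structures, with a toy stand-in for the scoring function (a real learner would score structures against data, e.g., with a BIC or Bayesian score):

```python
import itertools

def neighbors(graph, nodes):
    """Structures reachable by adding or deleting one arc (acyclicity checks and
    arc reversal are omitted to keep the sketch short)."""
    for i, j in itertools.permutations(nodes, 2):
        g = set(graph)
        g.symmetric_difference_update({(i, j)})   # toggle the arc i -> j
        yield frozenset(g)

def hill_climb(nodes, score):
    """Greedy hill-climbing: repeatedly move to the best-scoring neighbour."""
    g = frozenset()            # start from the empty graph
    s = score(g)
    while True:
        best = max(neighbors(g, nodes), key=score)
        if score(best) <= s:
            return g, s        # local optimum reached
        g, s = best, score(best)

# Toy stand-in score: closer to a fixed target structure = higher score.
target = {("E", "A"), ("B", "A"), ("A", "C")}
score = lambda g: -len(set(g) ^ target)
print(hill_climb(["E", "B", "A", "C"], score))
```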

Page 51: . An introduction to machine learning and probabilistic ...

51

Local Search Operations

Typical operations:

[Figure: a network over S, C, E, and D, showing the operations “add arc C → D”, “delete arc C → E”, and “reverse arc C → E”]

Δscore = S({C,E} → D) − S({E} → D)

Page 52: . An introduction to machine learning and probabilistic ...

52

Problems with local search

[Figure: the score S(G|D) plotted over structures, with a local optimum (“you”) far below the global optimum (“truth”)]

Easy to get stuck in local optima.

Page 53: . An introduction to machine learning and probabilistic ...

53

Problems with local search II

[Figure: one candidate structure over E, R, B, A, C and its posterior P(G|D)]

Picking a single best model can be misleading.

Page 54: . An introduction to machine learning and probabilistic ...

54

Problems with local search II

• Small sample size ⇒ many high-scoring models
• An answer based on one model is often useless
• We want features common to many models

[Figure: several different structures over E, R, B, A, C, each with comparable posterior P(G|D)]

Picking a single best model can be misleading.

Page 55: . An introduction to machine learning and probabilistic ...

55

Bayesian Approach to Structure Learning

• Posterior distribution over structures
• Estimate the probability of features, e.g., an edge X → Y, or a path X → … → Y

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is the indicator function for the feature f (e.g., X → Y) and P(G | D) is the Bayesian score for G.

Page 56: . An introduction to machine learning and probabilistic ...

56

Bayesian approach: computational issues

Posterior distribution over structures:

P(f | D) = Σ_G f(G) P(G | D)

How can we compute this sum over a super-exponential number of graphs?
• MCMC over networks
• MCMC over node orderings (Rao-Blackwellisation)

Page 57: . An introduction to machine learning and probabilistic ...

57

Structure learning: other issues

• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning

Page 58: . An introduction to machine learning and probabilistic ...

58

Discovering latent variables

a) 17 parameters b) 59 parameters

There are some techniques for automatically detecting the possible presence of latent variables

Page 59: . An introduction to machine learning and probabilistic ...

59

Learning causal models

So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.

However, we often want to interpret directed arrows causally.

This is uncontroversial for the arrow of time. But can we infer causality from static observational data?

Page 60: . An introduction to machine learning and probabilistic ...

60

Learning causal models

We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.

See the books by Pearl and by Spirtes et al. However, we can only learn structure up to Markov equivalence, no matter how much data we have.

[Figure: four structures over X, Y, Z (the chains, the common cause, and the v-structure X → Y ← Z) illustrating Markov equivalence]

Page 61: . An introduction to machine learning and probabilistic ...

61

Learning from interventional data

The only way to distinguish between Markov equivalent networks is to perform interventions, e.g., gene knockouts.

We need to (slightly) modify our learning algorithms.

[Figure: smoking → yellow fingers]
• P(smoker | observe(yellow)) >> prior
• P(smoker | do(paint yellow)) = prior

Cut the arcs coming into nodes which were set by intervention

Page 62: . An introduction to machine learning and probabilistic ...

62

Active learning

Which experiments (interventions) should we perform to learn structure as efficiently as possible?

This problem can be modeled using decision theory.

Exact solutions are wildly computationally intractable.

Can we come up with good approximate decision making techniques?

Can we implement hardware to automatically perform the experiments?

“AB: Automated Biologist”

Page 63: . An introduction to machine learning and probabilistic ...

63

Learning from relational data

Can we learn concepts from a set of relations between objects, instead of (or in addition to) just their attributes?

Page 64: . An introduction to machine learning and probabilistic ...

64

Learning from relational data: approaches

• Probabilistic relational models (PRMs): reify a relationship (the arcs between nodes/objects) by making it into a node of its own (a hypergraph)
• Inductive Logic Programming (ILP):
  - top-down, e.g., FOIL (a generalization of C4.5)
  - bottom-up, e.g., PROGOL (inverse deduction)

Page 65: . An introduction to machine learning and probabilistic ...

65

ILP for learning protein folding: input

[Figure: a positive (yes) and a negative (no) example protein structure]

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing structure of each pos/neg example

Page 66: . An introduction to machine learning and probabilistic ...

66

ILP for learning protein folding: results

PROGOL learned the following rule to predict if a protein will form a “four-helical up-and-down bundle”:

In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix”

Page 67: . An introduction to machine learning and probabilistic ...

67

ILP: Pros and Cons

+ Can discover new predicates (concepts) automatically

+ Can learn relational models from relational (or flat) data

- Computationally intractable
- Poor handling of noise

Page 68: . An introduction to machine learning and probabilistic ...

68

The future of machine learning for bioinformatics?

Oracle

Page 69: . An introduction to machine learning and probabilistic ...

69

[Figure: a closed loop in which prior knowledge, the biological literature, and replicated experiments feed a Learner; the Learner produces hypotheses and experiment designs, which are tested in the real world]

The future of machine learning for bioinformatics

• “Computer-assisted pathway refinement”

Page 70: . An introduction to machine learning and probabilistic ...

70

The end

Page 71: . An introduction to machine learning and probabilistic ...

71

Decision trees

[Figure: a decision tree that tests blue?, then big?, then oval?, with yes/no leaves]
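The transcript only preserves the node labels of this tree, so as a rough, hand-coded illustration (the branch ordering is assumed, not recovered from the figure), the same kind of classifier can be written as nested if/else tests:

```python
# Hand-coded decision tree over boolean features 'blue', 'big', 'oval'.
# The branch assignments below are illustrative, not the ones from the slide.
def classify(item):
    """item: dict with boolean features 'blue', 'big', 'oval'."""
    if item["blue"]:
        if item["big"]:
            return "no"
        return "yes" if item["oval"] else "no"
    return "no"

print(classify({"blue": True, "big": False, "oval": True}))   # 'yes'
print(classify({"blue": False, "big": True, "oval": False}))  # 'no'
```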

Page 72: . An introduction to machine learning and probabilistic ...

72

Decision trees

[Figure: the same decision tree (blue?, big?, oval?)]

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power

Page 73: . An introduction to machine learning and probabilistic ...

73

Feedforward neural network

output of each node = f(Σ_i w_i · input_i),  where f(x) = 1/(1 + e^(−c x)) is a sigmoid

input Hidden layer Output

• Weights on each arc
• Sigmoid function at each node
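A minimal sketch of the forward pass this describes, with hypothetical weights (the layer sizes and values are made up):

```python
import numpy as np

def sigmoid(x, c=1.0):
    """f(x) = 1 / (1 + exp(-c x)), the squashing function at each node."""
    return 1.0 / (1.0 + np.exp(-c * x))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: weights on each arc, sigmoid at each node."""
    h = sigmoid(W1 @ x + b1)      # hidden-layer activations
    return sigmoid(W2 @ h + b2)   # output

# Hypothetical weights: 3 inputs -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```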

Page 74: . An introduction to machine learning and probabilistic ...

74

Feedforward neural network

input Hidden layer Output

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predicts poorly

Page 75: . An introduction to machine learning and probabilistic ...

75

Nearest Neighbor

• Remember all your data
• When someone asks a question:
  - find the nearest old data point
  - return the answer associated with it
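A minimal 1-nearest-neighbour sketch of the procedure above, on invented 2D data:

```python
import numpy as np

def nearest_neighbor(query, X, labels):
    """Return the label of the stored point closest to the query (1-NN)."""
    dists = np.linalg.norm(X - query, axis=1)
    return labels[int(dists.argmin())]

# "Remember all your data": invented 2D points with yes/no labels.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
labels = ["no", "no", "yes", "yes"]

print(nearest_neighbor(np.array([0.1, 0.05]), X, labels))  # 'no'
print(nearest_neighbor(np.array([1.1, 0.9]), X, labels))   # 'yes'
```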

Page 76: . An introduction to machine learning and probabilistic ...

76

Nearest Neighbor

[Figure: a query point (?) among labelled training points]

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Page 77: . An introduction to machine learning and probabilistic ...

77

Support Vector Machines (SVMs)

Two key ideas:
• Large margins are good
• Kernel trick

Page 78: . An introduction to machine learning and probabilistic ...

78

SVM: mathematical details

Training data: {x_i, y_i}, where x_i ∈ R^l is an l-dimensional vector and y_i ∈ {−1, 1} is its flag (true or false)
Separating hyperplane: w · x + b = 0
Inequalities: y_i (x_i · w + b) − 1 ≥ 0 for all i
Margin: d = 2 / ||w||
Support vectors: the training points for which the inequality holds with equality
Support vector expansion: w = Σ_i α_i x_i
Decision: sign(w · x + b)

[Figure: the separating hyperplane, its margin, and the support vectors]
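A minimal sketch of the resulting decision rule in its kernelized (dual) form, with the label written out explicitly as α_i·y_i; the support vectors and coefficients below are hypothetical:

```python
import numpy as np

def svm_decision(x, svs, alphas, labels, b, kernel):
    """Kernelized SVM decision rule: sign( sum_i alpha_i y_i K(x_i, x) + b )."""
    s = sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, labels, svs))
    return int(np.sign(s + b))

# Hypothetical support vectors and coefficients with a linear kernel;
# together they encode the separator x0 + x1 = 1.
linear = lambda u, v: float(np.dot(u, v))
svs    = [np.array([1.0, 1.0]), np.array([0.0, 0.0])]
alphas = [1.0, 1.0]
labels = [+1, -1]
b      = -1.0

print(svm_decision(np.array([2.0, 0.5]), svs, alphas, labels, b, linear))   # +1
print(svm_decision(np.array([0.2, 0.2]), svs, alphas, labels, b, linear))   # -1
```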

Page 79: . An introduction to machine learning and probabilistic ...

79

Replace all inner products x_i · x_j with a kernel function K(x_i, x_j) = Φ(x_i) · Φ(x_j)

Page 80: . An introduction to machine learning and probabilistic ...

80

SVMs: summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:
• The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
• Large margin classifiers are good

Page 81: . An introduction to machine learning and probabilistic ...

81

Boosting: summary

• Can boost any weak learner
• Most commonly: boosted decision “stumps”

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

Page 82: . An introduction to machine learning and probabilistic ...

82

Supervised learning: summary

Learn mapping F from inputs to outputs using a training set of (x,t) pairs

F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear

Algorithms offer a variety of tradeoffs.

Many good books, e.g.:
• “The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman, 2001
• “Pattern Classification”, Duda, Hart, Stork, 2001

Page 83: . An introduction to machine learning and probabilistic ...

83

Inference

• Posterior probabilities: the probability of any event given any evidence
• Most likely explanation: the scenario that explains the evidence
• Rational decision making: maximize expected utility
• Value of information
• Effect of intervention

[Figure: the Alarm network (Earthquake, Burglary, Alarm, Radio, Call), with Radio and Call as evidence]

Page 84: . An introduction to machine learning and probabilistic ...

84

Assumption needed to make learning work

We need to assume “Future futures will resemble past futures” (B. Russell)

Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.

Page 85: . An introduction to machine learning and probabilistic ...

85

Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al 2000]

• 600 genes
• 300 experiments

Page 86: . An introduction to machine learning and probabilistic ...

86

Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: Biological sequences

Human CGTTGC…

Chimp CCTAGG…

Orang CGAACG…….

Output: a phylogeny

[Figure: a reconstructed phylogeny, labelled “10 billion years”, with the observed sequences at the leaves]

Uses structural EM, with a max-spanning-tree step in the inner loop

Page 87: . An introduction to machine learning and probabilistic ...

87

Instances of graphical models

Probabilistic models
  Graphical models
    Directed: Bayes nets (e.g., Naïve Bayes classifier, mixtures of experts), including DBNs (e.g., Hidden Markov Model (HMM), Kalman filter model)
    Undirected: MRFs (e.g., Ising model)

Page 88: . An introduction to machine learning and probabilistic ...

88

ML enabling technologies

• Faster computers
• More data
  - The web
  - Parallel corpora (machine translation)
  - Multiple sequenced genomes
  - Gene expression arrays
• New ideas
  - Kernel trick
  - Large margins
  - Boosting
  - Graphical models
  - …