An introduction to machine learning and probabilistic graphical models
Kevin Murphy
MIT AI Lab
Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003
2
Overview
• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models
Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
3
Supervised learning
[Figure: example objects labeled yes / no]
Color Shape Size Output
Blue Torus Big Y
Blue Square Small Y
Blue Star Small Y
Red Arrow Small N
Learn to approximate function F(x1, x2, x3) -> t from a training set of (x, t) pairs
4
Supervised learning
Training data:
X1 X2 X3 T
B T B Y
B S S Y
B S S Y
R A S N
Testing data:
X1 X2 X3 T
B A S ?
Y C S ?
[Figure: training data -> Learner -> Hypothesis; testing data + Hypothesis -> Prediction of T (Y, N)]
5
Key issue: generalization
[Figure: training examples labeled yes / no, and two new examples labeled "?"]
Can’t just memorize the training set (overfitting)
6
Hypothesis spaces
• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …
7
Perceptron (neural net with no hidden layers)
Linearly separable data
8
Which separating hyperplane?
9
The linear separator with the largest margin is the best one to pick
margin
10
What if the data is not linearly separable?
11
Kernel trick
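[Figure: data plotted on axes x1, x2 (left) and on axes z1, z2, z3 after applying the kernel (right)]
kernel: $(x, y) \mapsto (x^2, \sqrt{2}\,xy, y^2)$
Kernel implicitly maps from 2D to 3D, making problem linearly separable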
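A minimal sketch of the idea, assuming the 2D-to-3D map is the degree-2 polynomial feature map (x, y) -> (x^2, sqrt(2)xy, y^2). The names `phi` and `poly_kernel` and the two sample points are illustrative, not from the slides. It checks that the kernel evaluated in 2D equals the inner product in the 3D feature space, so the 3D coordinates never need to be computed explicitly.

```python
import math


def phi(v):
    """Explicit 2D -> 3D feature map (assumed form of the mapping)."""
    x, y = v
    return (x * x, math.sqrt(2) * x * y, y * y)


def dot(a, b):
    """Ordinary inner product of two equal-length tuples."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(u, v):
    """Degree-2 polynomial kernel, computed directly in the 2D input space."""
    return dot(u, v) ** 2


# Two hypothetical 2D points.
u, v = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(u), phi(v))   # inner product after mapping to 3D
implicit = poly_kernel(u, v)     # same value without ever building phi
```

Any linear method that only uses inner products can swap `dot` for `poly_kernel` and thereby work in the 3D space implicitly.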
12
Support Vector Machines (SVMs)
Two key ideas:
• Large margins
• Kernel trick
13
Boosting
Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations
Boosting maximizes the margin
14
Supervised learning success stories
• Face detection
• Steering an autonomous car across the US
• Detecting credit card fraud
• Medical diagnosis
• …
15
Unsupervised learning
What if there are no output labels?
16
K-means clustering
1. Guess number of clusters, K
2. Guess initial cluster centers, μ1, μ2
3. Assign data points xi to nearest cluster center
4. Re-compute cluster centers based on assignments
Reiterate (steps 3 and 4)
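The four steps can be sketched in a few lines of plain Python. This is an illustrative implementation, not the one behind the slides: the function name `kmeans`, the random initialisation from data points, the stopping rule and the toy data are all assumptions.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on tuples of floats; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: guess initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 3: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[j].append(p)
        # Step 4: move each center to the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # assignments have stabilised
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated blobs, K = 2 (step 1).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
```

K-means only finds a local optimum, so in practice it is usually restarted from several initial guesses.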
17
AutoClass (Cheeseman et al, 1986)
• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means
• Uses Bayesian criterion to select K
• Discovered new types of stars from spectral data
• Discovered new classes of proteins and introns from DNA/protein sequence databases
18
Hierarchical clustering
19
Principal Component Analysis (PCA)
PCA seeks a projection that best represents the data in a least-squares sense.
PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
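A small sketch of what "direction of greatest scatter" means, using only the standard library. It is illustrative only: it finds just the first principal component by power iteration on the sample covariance matrix, and the function name and data are hypothetical.

```python
def top_principal_component(points, iters=200):
    """Return (mean, unit vector along the direction of greatest variance)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v


# Hypothetical 2D cloud lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
mean, direction = top_principal_component(points)

# Reduce each 2D point to a single coordinate along that direction.
projected = [
    sum((p[j] - mean[j]) * direction[j] for j in range(2)) for p in points
]
```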
20
Discovering nonlinear manifolds
21
Combining supervised and unsupervised learning
22
Discovering rules (data mining)
Occup. Income Educ. Sex Married Age
Student $10k MA M S 22
Student $20k PhD F S 24
Doctor $80k MD M M 30
Retired $30k HS F M 60
Find the most frequent patterns (association rules)
Num in household = 1 ^ num children = 0 => language = English
Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
23
Unsupervised learning: summary
• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dimensionality reduction
• Learning rules
24
Discovering networks
?
From data visualization to causal discovery
25
Networks in biology
Most processes in the cell are controlled by networks of interacting molecules:
• Metabolic networks
• Signal transduction networks
• Regulatory networks
Networks can be modeled at multiple levels of detail/realism (in order of decreasing detail):
• Molecular level
• Concentration level
• Qualitative level
26
Molecular level: Lysis-Lysogeny circuit in Lambda phage
Arkin et al. (1998), Genetics 149(4):1633-48
5 genes, 67 parameters based on 50 years of research
Stochastic simulation required supercomputer
27
Concentration level: metabolic pathways
Usually modeled with differential equations
[Figure: network over genes g1 to g5 with weighted arcs, e.g., w12, w23, w55]
28
Qualitative level: Boolean Networks
29
Probabilistic graphical models
• Supports graph-based modeling at various levels of detail
• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
• But can also model deterministic, causal processes.
"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell
"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
30
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
31
Simple probabilistic model: linear regression
Y = α + βX + noise
Deterministic (functional) relationship
[Figure: plot of Y against X]
32
Simple probabilistic model: linear regression
Y = α + βX + noise
Deterministic (functional) relationship
[Figure: plot of Y against X]
“Learning” = estimating the parameters α, β, σ from (x, y) pairs.
[Figure annotations on the parameters: “can be estimated by least squares”, “is the empirical mean”, “is the residual variance”]
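As a hedged illustration of "learning = estimating parameters", the sketch below fits a line by ordinary least squares. The Greek symbols are missing from the extracted slide text, so the names `alpha` (intercept), `beta` (slope) and `sigma2` (residual variance) are assumptions, as are the function name and the data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x; also returns residual variance."""
    n = len(xs)
    x_bar = sum(xs) / n          # empirical mean of the inputs
    y_bar = sum(ys) / n          # empirical mean of the outputs
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx             # slope
    alpha = y_bar - beta * x_bar # intercept
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / n   # variance of the noise term
    return alpha, beta, sigma2


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
alpha, beta, sigma2 = fit_line(xs, ys)
```

If the inputs are centered so that their mean is zero, the intercept reduces to the empirical mean of the outputs.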
33
Piecewise linear regression
Latent “switch” variable – hidden process at work
34
Probabilistic graphical model for piecewise linear regression
[Figure: graphical model with input X, hidden variable Q and output Y]
• Hidden variable Q chooses which set of parameters to use for predicting Y.
• Value of Q depends on value of input X.
• This is an example of “mixtures of experts”
Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; can be solved with EM (cf. K-means)
35
Classes of graphical models
Probabilistic models
  Graphical models
    Directed: Bayes nets, DBNs
    Undirected: MRFs
36
Bayesian Networks
Compact representation of probability distributions via conditional independence
Qualitative part:
• Directed acyclic graph (DAG)
• Nodes: random variables
• Edges: direct influence
Quantitative part:
• Set of conditional probability distributions
[Figure: network with arcs Earthquake -> Radio, Earthquake -> Alarm, Burglary -> Alarm, Alarm -> Call; the “family of Alarm” (Alarm and its parents) is highlighted]
E   B    P(A | E,B)
e   b    0.9   0.1
e   ¬b   0.2   0.8
¬e  b    0.9   0.1
¬e  ¬b   0.01  0.99
Together: define a unique distribution in a factored form
$P(B,E,A,C,R) = P(B)\,P(E)\,P(A \mid B,E)\,P(R \mid E)\,P(C \mid A)$
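To make the factored form concrete, the sketch below builds the joint P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A) from small local tables and answers one query by brute-force enumeration. All numeric values are illustrative assumptions, and the names (`joint`, `bern`, and so on) are hypothetical.

```python
from itertools import product

# Local probability tables; every number here is an illustrative assumption.
P_B = {True: 0.01, False: 0.99}                  # P(Burglary)
P_E = {True: 0.02, False: 0.98}                  # P(Earthquake)
P_A = {(True, True): 0.9, (True, False): 0.2,    # P(Alarm=true | E, B), keyed (e, b)
       (False, True): 0.9, (False, False): 0.01}
P_R = {True: 0.95, False: 0.001}                 # P(Radio=true | E)
P_C = {True: 0.7, False: 0.05}                   # P(Call=true | A)


def bern(p_true, value):
    """Probability of a binary value given P(value is true)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, r, c):
    """Joint probability as a product of one factor per node."""
    return (P_B[b] * P_E[e] * bern(P_A[(e, b)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))


# The factors define a proper distribution: all 32 entries sum to 1.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))

# Posterior P(Burglary | Call) by summing out the other variables.
num = sum(joint(True, e, a, r, True)
          for e, a, r in product([True, False], repeat=3))
den = sum(joint(b, e, a, r, True)
          for b, e, a, r in product([True, False], repeat=4))
p_burglary_given_call = num / den

# Free parameters: factored form vs. a full joint table over 5 binary variables.
factored_params = 1 + 1 + 4 + 2 + 2
full_table_params = 2 ** 5 - 1
```

Enumeration is exponential in the number of variables, which is why the inference algorithms discussed later matter.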
37
Example: “ICU Alarm” network
Domain: monitoring intensive-care patients
• 37 variables
• 509 parameters … instead of 2^54
[Figure: ICU Alarm network; nodes include PCWP, CO, CATECHOL, SAO2, ARTCO2, VENTLUNG, HYPOVOLEMIA, LVFAILURE, CVP and BP]
38
Success stories for graphical models
• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at Shannon limit
• Genetic pedigree analysis
• …
39
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
40
Probabilistic Inference
• Posterior probabilities
  • Probability of any event given any evidence, P(X|E)
[Figure: the Earthquake, Burglary, Radio, Alarm, Call network, with Radio and Call highlighted]
41
Viterbi decoding
Compute most probable explanation (MPE) of observed data
Hidden Markov Model (HMM)
[Figure: HMM with nodes X1, X2, X3 and Y1, Y2, Y3, one layer hidden and one observed; example word: “Tomato”]
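A compact sketch of Viterbi decoding for a discrete HMM. The two-state weather model used to exercise it is a standard textbook toy, not taken from the slides; all state names, observation names and probabilities are assumptions.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable hidden state sequence, its joint probability)."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][obs[t]]
            back[t][s] = prev            # remember the best predecessor
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path, prob = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
```

The cost is linear in the sequence length and quadratic in the number of states, which is why chains sit at the "easy" end of the inference spectrum.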
42
Inference: computational issues
[Figure: the ICU Alarm network]
From easy to hard: chains, trees, grids, dense loopy graphs
43
Inference: computational issues
[Figure: the ICU Alarm network]
From easy to hard: chains, trees, grids, dense loopy graphs
Many different inference algorithms, both exact and approximate
44
Bayesian inference
Bayesian probability treats parameters as random variables
Learning/parameter estimation is replaced by probabilistic inference P(θ|D)
Example: Bayesian linear regression; parameters are θ = (α, β, σ)
[Figure: graphical model in which the parameters point to every pair (X1, Y1), …, (Xn, Yn)]
Parameters are tied (shared) across repetitions of the data
45
Bayesian inference
+ Elegant – no distinction between parameters and other hidden variables
+ Can use priors to learn from small data sets (cf. one-shot learning by humans)
- Math can get hairy
- Often computationally intractable
46
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning
47
Why Struggle for Accurate Structure?
[Figure: a network over Earthquake, Alarm Set, Burglary and Sound, shown next to a variant with an added arc and a variant with a missing arc]
Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure
Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure
48
Score based Learning
Define scoring function that evaluates how well a structure matches the data
Data over E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
[Figure: candidate structures over E, B and A]
Search for a structure that maximizes the score
49
Learning Trees
Can find optimal tree structure in O(n^2 log n) time: just find the max-weight spanning tree
If some of the variables are hidden, problem becomes hard again, but can use EM to fit mixtures of trees
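A sketch of tree learning in the style of the Chow-Liu algorithm: weight every pair of variables by its empirical mutual information, then grow a maximum-weight spanning tree. The choice of mutual information as the edge weight and of Kruskal's algorithm is an assumption made for illustration; the slide only says "max-weight spanning tree". Function names and data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j of the data."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum(
        (c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )


def chow_liu_tree(data):
    """Return the edges of a maximum-weight spanning tree over the variables."""
    d = len(data[0])
    # Sorting the O(n^2) candidate edges gives the O(n^2 log n) running time.
    edges = sorted(
        ((mutual_information(data, i, j), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True,
    )
    parent = list(range(d))          # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Hypothetical data from a chain X0 - X1 - X2: neighbours mostly agree.
data = ([(0, 0, 0)] * 3 + [(1, 1, 1)] * 3
        + [(0, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 0)])
tree = chow_liu_tree(data)
```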
50
Heuristic Search
Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
• Define a search space:
  • search states are possible structures
  • operators make small changes to structure
• Traverse space looking for high-scoring structures
• Search techniques:
  • Greedy hill-climbing
  • Best-first search
  • Simulated annealing
  • ...
51
Local Search Operations
Typical operations:
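• Add C -> D
• Delete C -> E
• Reverse C -> E
[Figure: a network over S, C, E and D, and the three networks obtained by applying each operation]
$\Delta\mathrm{score} = S(\{C,E\} \to D) - S(\{E\} \to D)$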
52
Problems with local search
Easy to get stuck in local optima
[Figure: score S(G|D) over the space of structures, with “truth” at the global optimum and “you” at a local optimum]
53
Problems with local search II
Picking a single best model can be misleading
[Figure: a single network over E, R, B, A and C, with its posterior P(G|D)]
54
Problems with local search II
Picking a single best model can be misleading
• Small sample size => many high-scoring models
• Answer based on one model often useless
• Want features common to many models
[Figure: five different networks over E, R, B, A and C, each with its posterior P(G|D)]
55
Bayesian Approach to Structure Learning
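• Posterior distribution over structures
• Estimate probability of features:
  • Edge X -> Y
  • Path X -> … -> Y
  • …
$P(f \mid D) = \sum_G f(G)\, P(G \mid D)$
where f is a feature of G (e.g., the edge X -> Y), f(G) is the indicator function for feature f, and P(G | D) is the Bayesian score for G.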
56
Bayesian approach: computational issues
Posterior distribution over structures
$P(f \mid D) = \sum_G f(G)\, P(G \mid D)$
How can we compute the sum over a super-exponential number of graphs?
• MCMC over networks
• MCMC over node-orderings (Rao-Blackwellisation)
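For a handful of candidate graphs the sum over G can be written out directly; the sketch below does so with a hypothetical posterior over three structures on X, Y, Z. The graphs, the probabilities and the helper names are illustrative assumptions. With realistic numbers of variables the sum has to be approximated, e.g., by averaging f(G) over MCMC samples.

```python
# Hypothetical posterior P(G | D) over three candidate structures,
# each graph given as a set of directed edges.
posterior = {
    frozenset({("X", "Y"), ("Y", "Z")}): 0.40,
    frozenset({("X", "Y"), ("X", "Z")}): 0.35,
    frozenset({("Y", "X"), ("Y", "Z")}): 0.25,
}


def has_edge(u, v):
    """Indicator function f(G) for the feature 'G contains the edge u -> v'."""
    return lambda graph: (u, v) in graph


def feature_probability(feature, posterior):
    """P(f | D): posterior mass of all graphs in which the feature holds."""
    return sum(p for graph, p in posterior.items() if feature(graph))


p_x_to_y = feature_probability(has_edge("X", "Y"), posterior)
p_y_to_z = feature_probability(has_edge("Y", "Z"), posterior)
```

Here no single graph has posterior above 0.4, yet the edge X -> Y is supported with probability 0.75, which is the point of averaging over models.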
57
Structure learning: other issues
• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning
58
Discovering latent variables
a) 17 parameters b) 59 parameters
There are some techniques for automatically detecting the possible presence of latent variables
59
Learning causal models
So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
However, we often want to interpret directed arrows causally.
This is uncontroversial for the arrow of time.
But can we infer causality from static observational data?
60
Learning causal models
We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
See books by Pearl and Spirtes et al.
However, we can only learn up to Markov equivalence, no matter how much data we have.
[Figure: four networks over X, Y and Z]
61
Learning from interventional data
The only way to distinguish between Markov equivalent networks is to perform interventions, e.g., gene knockouts.
We need to (slightly) modify our learning algorithms.
Observation: smoking -> yellow fingers
P(smoker | observe(yellow)) >> prior
Intervention: smoking, yellow fingers (arc cut)
P(smoker | do(paint yellow)) = prior
Cut arcs coming into nodes which were set by intervention
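The difference between observing and intervening can be checked numerically on the two-node network smoking -> yellow fingers. The prior and the conditional probabilities below are made-up numbers chosen for illustration.

```python
# Hypothetical parameters of the network smoking -> yellow fingers.
P_SMOKER = 0.2
P_YELLOW = {True: 0.8, False: 0.05}   # P(yellow fingers | smoker?)


def p_smoker_given_observed_yellow():
    """Bayes' rule: yellow fingers seen in the wild are evidence of smoking."""
    num = P_SMOKER * P_YELLOW[True]
    den = num + (1 - P_SMOKER) * P_YELLOW[False]
    return num / den


def p_smoker_given_do_yellow():
    """do(paint yellow) cuts the arc into 'yellow fingers', so the painted
    fingers carry no information about smoking: the prior is unchanged."""
    return P_SMOKER


observed = p_smoker_given_observed_yellow()
intervened = p_smoker_given_do_yellow()
```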
62
Active learning
Which experiments (interventions) should we perform to learn structure as efficiently as possible?
This problem can be modeled using decision theory.
Exact solutions are wildly computationally intractable.
Can we come up with good approximate decision making techniques?
Can we implement hardware to automatically perform the experiments?
“AB: Automated Biologist”
63
Learning from relational data
Can we learn concepts from a set of relations between objects, instead of/in addition to just their attributes?
64
Learning from relational data: approaches
• Probabilistic relational models (PRMs)
  • Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
• Inductive Logic Programming (ILP)
  • Top-down, e.g., FOIL (generalization of C4.5)
  • Bottom-up, e.g., PROGOL (inverse deduction)
65
ILP for learning protein folding: input
yes no
TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …
100 conjuncts describing structure of each pos/neg example
66
ILP for learning protein folding: results
PROGOL learned the following rule to predict if a protein will form a “four-helical up-and-down bundle”:
In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix”
67
ILP: Pros and Cons
+ Can discover new predicates (concepts) automatically
+ Can learn relational models from relational (or flat) data
- Computationally intractable
- Poor handling of noise
68
The future of machine learning for bioinformatics?
Oracle
69
Learner
Prior knowledge
Replicated experiments
Biological literature
Hypotheses
Expt.design
Real world
The future of machine learning for bioinformatics
•“Computer assisted pathway refinement”
70
The end
71
Decision trees
blue?
big?
oval?
no
no
yes
yes
72
Decision trees
blue?
big?
oval?
no
no
yes
yes
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
73
Feedforward neural network
$f\left(\sum_i J_i s_i\right), \quad f(x) = \frac{1}{1 + e^{-cx}}$
input -> Hidden layer -> Output
• Weights on each arc
• Sigmoid function at each node
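A minimal forward pass for such a network, assuming each node computes f(sum_i J_i s_i) with the sigmoid f(x) = 1/(1 + e^(-cx)). The weights and layer sizes are arbitrary illustrative values, and bias terms are omitted to match the formula.

```python
import math


def sigmoid(x, c=1.0):
    """Sigmoid squashing function f(x) = 1 / (1 + exp(-c x))."""
    return 1.0 / (1.0 + math.exp(-c * x))


def layer(inputs, weights):
    """One layer: each node applies the sigmoid to its weighted input sum.
    weights[k][i] is the weight J_i on the arc from input i to node k."""
    return [sigmoid(sum(j * s for j, s in zip(row, inputs))) for row in weights]


def forward(x, hidden_weights, output_weights):
    """Input -> hidden layer -> output."""
    return layer(layer(x, hidden_weights), output_weights)


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output.
hidden_weights = [[0.5, -0.3], [0.8, 0.2]]
output_weights = [[1.0, -1.0]]
y = forward([0.0, 0.0], hidden_weights, output_weights)
```

Training (choosing the weights) is a separate step, typically done by gradient descent on a loss over the training set.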
74
Feedforward neural network
input Hidden layer Output
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
75
Nearest Neighbor
• Remember all your data
• When someone asks a question,
  • find the nearest old data point
  • return the answer associated with it
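The two steps translate almost literally into code. This sketch assumes numeric features and squared Euclidean distance; the training points, labels and function name are hypothetical.

```python
def nearest_neighbor(train, query):
    """Return the label of the stored example closest to the query."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # "Remember all your data": train is simply the stored list of examples.
    _, label = min(train, key=lambda pair: sqdist(pair[0], query))
    return label


# Hypothetical training set of (features, label) pairs.
train = [((1.0, 1.0), "yes"), ((1.5, 2.0), "yes"),
         ((5.0, 5.0), "no"), ((6.0, 4.5), "no")]

answer_a = nearest_neighbor(train, (1.2, 1.4))
answer_b = nearest_neighbor(train, (5.5, 5.0))
```

Every query scans the whole training set and every feature counts equally in the distance, which is one way to read the minus signs for large data sets and irrelevant attributes.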
76
Nearest Neighbor
?
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
77
Support Vector Machines (SVMs)
Two key ideas:
• Large margins are good
• Kernel trick
78
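SVM: mathematical details
Training data: l-dimensional vectors with a flag of true or false: $\{(\mathbf{x}_i, y_i)\}, \ \mathbf{x}_i \in \mathbb{R}^l, \ y_i \in \{-1, 1\}$
Separating hyperplane: $\mathbf{w} \cdot \mathbf{x} + b = 0$
Inequalities: $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0, \ \forall i$
Margin: $d = 2 / \|\mathbf{w}\|$
Support vectors:
Support vector expansion: $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$
Decision:
[Figure: separating hyperplane with the margin marked]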
79
Replace all inner products with kernels
Kernel function
80
SVMs: summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
General lessons from SVM success:
• Kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
• Large margin classifiers are good
81
Boosting: summary
• Can boost any weak learner
• Most commonly: boosted decision “stumps”
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power
82
Supervised learning: summary
Learn mapping F from inputs to outputs using a training set of (x,t) pairs
F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
Algorithms offer a variety of tradeoffs
Many good books, e.g.,
• “The elements of statistical learning”, Hastie, Tibshirani, Friedman, 2001
• “Pattern classification”, Duda, Hart, Stork, 2001
83
Inference
• Posterior probabilities
  • Probability of any event given any evidence
• Most likely explanation
  • Scenario that explains evidence
• Rational decision making
  • Maximize expected utility
  • Value of information
• Effect of intervention
[Figure: the Earthquake, Burglary, Radio, Alarm, Call network, with Radio and Call highlighted]
84
Assumption needed to make learning work
• We need to assume “Future futures will resemble past futures” (B. Russell)
• Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.
85
Structure learning success stories: gene regulation network (Friedman et al.)
Yeast data [Hughes et al 2000]
600 genes 300 experiments
86
Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)
Input: Biological sequences
Human CGTTGC…
Chimp CCTAGG…
Orang CGAACG…….
Output: a phylogeny
leaf
10 billion years
Uses structural EM, with max-spanning-tree in the inner loop
87
Instances of graphical models
Probabilistic models
  Graphical models
    Directed: Bayes nets, DBNs
    Undirected: MRFs
Instances: Hidden Markov Model (HMM), Naïve Bayes classifier, mixtures of experts, Kalman filter model, Ising model
88
ML enabling technologies
• Faster computers
• More data
  • The web
  • Parallel corpora (machine translation)
  • Multiple sequenced genomes
  • Gene expression arrays
• New ideas
  • Kernel trick
  • Large margins
  • Boosting
  • Graphical models
  • …