Lesson 8:
Machine Learning
(with Legionella as a case study)
Biological Sequences Analysis, MTA
Introduction to Machine Learning
Some cool examples
Types of learning
Supervised learning - uses "labeled" examples of input and desired output.
Unsupervised learning - models a set of inputs; labeled examples are not available.
Reinforcement learning - learns from feedback on its actions, obtained by observing the environment (maximizing long-term reward).
Clustering
Clustering definition
Input: a set of instances.
Output: subsets (called clusters) such that observations in the same cluster are similar.
Is it supervised or not?
What does "similar" mean?
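One possible answer to the similarity question is to pick a distance function: the smaller the distance, the more similar two instances are. The sketch below shows two common choices; the function names are illustrative assumptions, not part of the lecture.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(s, t) if x != y)

print(euclidean_distance((0, 0), (3, 4)))
print(hamming_distance("ACGT", "ACCA"))
```

Euclidean distance suits numeric feature vectors, while Hamming distance is one simple option for aligned sequences of equal length.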
K-means clustering
0. Choose the number of clusters (k)
1. Initialization: randomly generate k centers
2. Assignment: assign each point to the nearest cluster center
3. Update: move each center to the mean of the points assigned to it
4. Repeat steps 2-3 until there is no further change
K-means - Interactive demo
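The steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation: it assumes squared Euclidean distance, and it initializes the centers by sampling k of the input points (one common way to "randomly generate" centers). The names `kmeans` and `squared_distance` are illustrative.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Step 1: initialization (assumption: sample k distinct input points).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

On data with less clear structure, the result can depend on the random initialization, so it is common to run the procedure several times with different seeds.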
Other clustering algorithms
Take into account:
homogeneity: similarity of instances inside a cluster.
separation: dissimilarity of instances in different clusters.
Allow "fuzzy clustering": an instance may belong to more than one cluster.
Hierarchical clustering
Hierarchical clustering
[Figure: a raw table (items 1-5, columns C1, C2, C3, ...) is turned into a similarity matrix of scores by a similarity criterion; a cluster criterion then produces the hierarchical clustering]
Hierarchical clustering
Cluster criteria:
UPGMA (you should already know it…)
Neighbor-joining
[Figure: scores for items A-E are clustered step by step, e.g. (C,B), then ((C,B),E)]
Hierarchical clustering
Wait a minute… A tree is clustering?!
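One way to see why a tree is a clustering: stop the merging early, and the subtrees built so far are the clusters. The sketch below uses average linkage (the cluster distance used by UPGMA) without computing branch lengths. The distances are made-up toy values chosen so that B and C merge first and E joins them next; the function names are illustrative assumptions.

```python
from itertools import combinations

def leaves(tree):
    """Flatten a nested-tuple tree into the list of its leaf labels."""
    if isinstance(tree, tuple):
        return [leaf for child in tree for leaf in leaves(child)]
    return [tree]

def average_linkage(dist, labels, n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = list(labels)

    def avg(a, b):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append((a, b))  # the merged pair becomes a subtree
    return clusters

labels = ["A", "B", "C", "D", "E"]
# Toy distances (hypothetical): every pair is 8.0 apart unless overridden.
dist = {frozenset(pair): 8.0 for pair in combinations(labels, 2)}
dist[frozenset("BC")] = 1.0
dist[frozenset("BE")] = 3.0
dist[frozenset("CE")] = 3.0
dist[frozenset("AD")] = 4.0

print(average_linkage(dist, labels, n_clusters=3))  # three clusters
print(average_linkage(dist, labels))                # the full tree
```

Asking for fewer clusters simply lets the merging continue further up the same tree.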
Classifying
What is classification
Input: a labeled training set and an unlabeled data set.
Learn to classify (assign labels) according to the features of the training set.
Output: labels for the unlabeled data set.
Example: qualified boy/girlfriend
Where to draw the line?!?!
[Figure, repeated over several slides: scatter plot of Y (0-100) against X (0-6), with points labeled Unqualified or Qualified]
[Figure: Effectors and NonEffectors plotted against E-Value (0-6)]
Now consider dozens of features…
How to classify
KNN (K Nearest Neighbors)
Decision trees
SVM (Support Vector Machine)
Naïve Bayes
Bayesian Networks
NN (Neural Networks)
Many, many more…
KNN (K Nearest Neighbors)
[Figure: scatter plot of Y (0-100) against X (0-6)]
Lazy (no pre-processing)
Local
Can deal with complex patterns
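A minimal sketch of the KNN idea: to label a new point, look at the k closest training points and take the majority label. The training points below are invented for illustration, and squared Euclidean distance is an assumption; `knn_predict` is a hypothetical name.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs. Return the majority label
    among the k training points nearest to query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy (X, Y) points, hypothetical values in the ranges shown on the slide.
train = [
    ((1.0, 20), "Unqualified"), ((1.5, 30), "Unqualified"), ((2.0, 25), "Unqualified"),
    ((4.0, 80), "Qualified"), ((5.0, 70), "Qualified"), ((4.5, 90), "Qualified"),
]
print(knn_predict(train, (4.2, 75)))
print(knn_predict(train, (1.2, 22)))
```

There is no training step at all, which is why the method is called lazy: all the work happens when a query arrives. Because X and Y are on very different scales here, rescaling the features first would often be sensible in practice.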
Decision trees
[Figure: scatter plot of Y (0-100) against X (0-6) next to a decision tree; the root splits on X ≥ 1.7 vs. X < 1.7, one branch is split further on Y ≥ 36 vs. Y < 36, and leaves are marked "?"]
Tree actually means something!
Can deal with complex patterns
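A decision tree can be stored as nested nodes and read by walking from the root to a leaf. The sketch below reuses the two thresholds from the slide (X at 1.7, Y at 36), but which branch carries the Y split and what the leaves predict are assumptions made for illustration, since the slide leaves them open.

```python
# Hypothetical tree: the thresholds come from the slide, the leaf labels
# and the placement of the Y split are assumed.
TREE = {
    "feature": "X", "threshold": 1.7,
    "below": "Unqualified",
    "at_or_above": {
        "feature": "Y", "threshold": 36,
        "below": "Unqualified",
        "at_or_above": "Qualified",
    },
}

def predict(node, sample):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        if sample[node["feature"]] >= node["threshold"]:
            node = node["at_or_above"]
        else:
            node = node["below"]
    return node

print(predict(TREE, {"X": 3.0, "Y": 50}))
print(predict(TREE, {"X": 1.0, "Y": 90}))
```

Each path from root to leaf reads as a rule, for example "X ≥ 1.7 and Y ≥ 36", which is what makes the tree itself interpretable.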
SVM (Support Vector Machine)
[Figure: scatter plot of Y (0-100) against X (0-6)]
Finds the optimal linear separation
Maximizes the margin between the two data sets
Can use a transformation to a higher dimension when the data are not linearly separable
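The two ideas above, margin and transformation to a higher dimension, can be illustrated without a full SVM solver. The sketch below only evaluates a given separating line; it does not search for the best one. The data, the mapping x -> (x, x^2), and the chosen line are all assumptions made for illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-signed distance to the hyperplane.
    Positive only if every point lies on its own side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

def lift(x):
    """Map a 1-D value into 2-D: x -> (x, x^2)."""
    return (x, x * x)

# Toy 1-D data: class +1 when |x| >= 2, class -1 otherwise.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1 if abs(x) >= 2 else -1 for x in xs]

# In 1-D a single threshold cannot separate the classes: the margin is negative.
print(margin((1.0,), -1.5, [(x,) for x in xs], labels))

# After lifting, the horizontal line x^2 = 2.5 separates them with a positive margin.
print(margin((0.0, 1.0), -2.5, [lift(x) for x in xs], labels))
```

An actual SVM would search over all separating hyperplanes for the one with the largest margin; here the line is simply picked by hand.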
Naïve Bayes
Can easily compute, for each class:
P(class | X)
and can do the same for:
P(class | Y)
Score(class) = P(class | X, Y) = P(class | X) · P(class | Y)
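The score can be sketched by estimating each per-feature probability from counts in the training set and multiplying them. This follows the product as written on the slide; textbook naïve Bayes instead multiplies class-conditional likelihoods by a class prior, which this sketch does not do. The discretized feature values and all names are assumptions for illustration.

```python
from collections import Counter

def fit_counts(samples, labels):
    """Count how often each feature value occurs, overall and per class."""
    joint = Counter()     # (feature index, value, class) -> count
    marginal = Counter()  # (feature index, value) -> count
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            joint[(i, value, label)] += 1
            marginal[(i, value)] += 1
    return joint, marginal

def score(joint, marginal, features, label):
    """Product over features of the estimated P(label | feature value)."""
    result = 1.0
    for i, value in enumerate(features):
        seen = marginal[(i, value)]
        # An unseen feature value gives probability 0 in this simple sketch.
        result *= joint[(i, value, label)] / seen if seen else 0.0
    return result

# Toy training set (hypothetical): each sample is (X, Y), discretized.
samples = [("high", "high"), ("high", "low"), ("high", "high"),
           ("low", "low"), ("low", "high"), ("high", "low")]
labels = ["Qualified", "Qualified", "Qualified",
          "Unqualified", "Unqualified", "Unqualified"]

joint, marginal = fit_counts(samples, labels)
for label in ("Qualified", "Unqualified"):
    print(label, score(joint, marginal, ("high", "high"), label))
```

The predicted class is the one with the higher score. Treating each feature separately is exactly the "naïve" part: it ignores any dependency between X and Y.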
Naïve Bayes - graphical representation
[Figure: graph linking the class to each of X, Y and Z separately, labeled P(class | X), P(class | Y) and P(class | Z)]
Score(class) = P(class | X, Y, Z) = P(class | X) · P(class | Y) · P(class | Z)
What if there are dependencies??
Bayesian Network
[Figure: graph over X, Y, Z and the class, labeled P(class | X, Z), P(class | Y) and P(X | Z)]
Score(class) = P(class | X, Y, Z) = P(class | X, Z) · P(class | Y)
A Bayesian Network takes dependencies into account
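A rough sketch of the difference from the naïve score: dependent features (here X and Z) are conditioned on jointly instead of one at a time. The code is a literal, counting-based reading of the score on the slide, not a general Bayesian network implementation; the toy data and the name `grouped_score` are assumptions.

```python
from collections import Counter

def grouped_score(samples, labels, query, label, groups):
    """Product over feature groups of the estimated
    P(label | joint values of the features in that group)."""
    result = 1.0
    for group in groups:
        key = tuple(query[i] for i in group)
        matches = [
            l for s, l in zip(samples, labels)
            if tuple(s[i] for i in group) == key
        ]
        # A combination never seen in training gives probability 0 here.
        result *= Counter(matches)[label] / len(matches) if matches else 0.0
    return result

# Toy training set (hypothetical): each sample is (X, Y, Z).
samples = [("hi", "hi", "a"), ("hi", "lo", "a"), ("hi", "hi", "b"),
           ("lo", "hi", "b"), ("lo", "lo", "a"), ("hi", "hi", "a")]
labels = ["Q", "Q", "U", "U", "U", "Q"]
query = ("hi", "hi", "a")

# Network-style score: P(class | X, Z) * P(class | Y)
print(grouped_score(samples, labels, query, "Q", groups=[(0, 2), (1,)]))
# Naive score for comparison: P(class | X) * P(class | Y) * P(class | Z)
print(grouped_score(samples, labels, query, "Q", groups=[(0,), (1,), (2,)]))
```

Grouping features this way needs more training data, because every combination of values within a group has to be observed often enough to estimate its probability.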
How to choose a classifier (estimate performance)?
Use a labeled test set (in addition to the training set)
Cross-validation: 10-fold, leave-one-out
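Cross-validation can be sketched as follows: split the labeled data into k folds, hold each fold out in turn, train on the rest, and count how many held-out labels are predicted correctly. Leave-one-out is the special case where k equals the number of samples. The function names, the toy data, and the simple 1-nearest-neighbor classifier used to exercise it are assumptions for illustration; the folds are not shuffled.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(samples, labels, k, train_and_predict):
    """Return the fraction of samples labeled correctly while held out."""
    n = len(samples)
    correct = 0
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train_x = [samples[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        test_x = [samples[i] for i in fold]
        predictions = train_and_predict(train_x, train_y, test_x)
        correct += sum(p == labels[i] for p, i in zip(predictions, fold))
    return correct / n

def nearest_neighbor(train_x, train_y, test_x):
    """1-NN on 1-D values: copy the label of the closest training value."""
    return [
        train_y[min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))]
        for x in test_x
    ]

# Toy 1-D data (hypothetical): ten "low" values and ten "high" values.
samples = list(range(1, 11)) + list(range(21, 31))
labels = ["low"] * 10 + ["high"] * 10

print(cross_validate(samples, labels, 10, nearest_neighbor))            # 10-fold
print(cross_validate(samples, labels, len(samples), nearest_neighbor))  # leave-one-out
```

Running the same procedure with different classifiers on the same folds gives comparable accuracy estimates, which is one way to choose between them.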
Legionella pneumophila case study
How did it all begin?
Legionnaires' disease nowadays
Legionella pneumophila
Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology
Identifying the effectors
The features
Homology to host proteins
Regulatory elements
Genome proximity to other effectors
Secretion signal
Abundance in Metazoa / Bacteria
GC content
Sequence homology
The effectors machine
The big picture
[Figure: flowchart with the following elements]
Prior knowledge: validated effectors, non-effectors
Features: similarity to known effectors, similarity to host proteins, regulatory elements, G-C content, secretory signals, abundance in Metazoa / Bacteria, genome arrangement
Feature selection
Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting
Trained model
Unclassified genes
Predicted effectors, predicted non-effectors
Experimental validation
Newly validated effectors
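The flowchart lists Voting next to the individual classifiers. How the votes are combined is not stated, so the sketch below assumes a simple majority vote. The three stand-in classifiers, their feature names, and their thresholds are entirely hypothetical and only serve to make the example runnable.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Ask every classifier for a label and return the most common one."""
    votes = Counter(classifier(sample) for classifier in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained models; each maps a feature dict to a label.
def by_secretion_signal(gene):
    return "effector" if gene["secretion_signal"] >= 0.5 else "non-effector"

def by_host_similarity(gene):
    return "effector" if gene["host_similarity"] >= 0.3 else "non-effector"

def by_gc_content(gene):
    return "effector" if gene["gc_content"] < 0.36 else "non-effector"

classifiers = [by_secretion_signal, by_host_similarity, by_gc_content]
gene = {"secretion_signal": 0.8, "host_similarity": 0.1, "gc_content": 0.34}
print(majority_vote(classifiers, gene))
```

With an odd number of classifiers and two labels there are no ties; other combination schemes, such as weighting each vote by the classifier's estimated accuracy, are possible as well.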
Does it really work??