April 9th 2010 Paolo Marcatili - Master in Bioinformatica 1
Machine Learning Methods: an overview
Master in Bioinformatica – April 9th, 2010
Paolo Marcatili, University of Rome “Sapienza”, Dept. of Biochemical Sciences “Rossi Fanelli”
Overview
Supervised Learning
Unsupervised Learning
Caveats
Agenda
Overview: Why, How, Datasets, Methods, Assessment
Supervised Learning: SVM, HMM, Decision Trees – RF, Bayesian Networks, Neural Networks
Unsupervised Learning: Clustering, PCA
Caveats: Data Independence, Biases, No free lunch?
Overview
Why

Large amount of data
Large dimensionality
Complex dynamics
Data Noisiness
Computational efficiency
Because we can
How

Numerical analysis
Graphs
Systems theory
Geometry
Statistics
Probability!!

Probability and statistics are fundamental: they provide a solid framework for creating models and acquiring knowledge.
Datasets
Most common data used with ML:
Genomes (genes, promoters, phylogeny, regulation...)
Proteomes (secondary/tertiary structure, disorder, motifs, epitopes...)
Clinical Data (drug evaluation, medical protocols, tool design...)
Interactomes (PPI prediction and filtering, complexes...)
Metabolomes (metabolic pathway identification, flux analysis, essentiality)
Methods

Machine Learning can:
Predict unknown function values
Infer classes and assign samples

Machine Learning can not:
Provide knowledge
Learn

Where is the information: in the data? In the model?
Methods
Work Schema:
Choose a Learning-Validation Setting
Prepare data (Training, Test, Validation sets)
Train (1 or more times)
Validate
Use
Love all, trust a few, do wrong to none.

(Scatter plot: 4 patients and 4 controls.)
(The same plot with 2 more samples, and then with 10 more.)
Assessment
Prediction of unknown data!
Problems: Few data, robustness.
Solutions:
Training, Test and Validation sets
Leave one Out
K-fold Cross Validation
Assessment
50% Training set: used to tune the model parameters
25% Test set: used to verify that the machine has “learnt”
25% Validation set: final assessment of the results
Not feasible with few data.
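The 50/25/25 split above can be sketched in a few lines of Python (the function name and the fixed seed are our own choices, added for reproducibility):

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split samples into 50% training, 25% test, 25% validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n // 2            # 50%: used to tune the model parameters
    n_test = n // 4             # 25%: used to verify that the machine has learnt
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]   # remaining ~25%: final assessment
    return train, test, validation

train, test, validation = split_dataset(list(range(100)))
```

Shuffling before splitting matters: if the data are ordered (e.g. all patients first), an unshuffled split would give the model a biased view.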
![Page 25: Machine Learning](https://reader035.fdocuments.net/reader035/viewer/2022070319/557f2b2ad8b42a46658b4988/html5/thumbnails/25.jpg)
April 9th 2010 Paolo Marcatili - Master in Bioinformatica 25
Assessment
Leave-one-out:
for each sample Ai:
Training set: all samples minus {Ai}
Test set: {Ai}
Repeat

Computationally intensive; gives a good estimate of the mean error but has high variance.
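The leave-one-out loop above can be sketched as a small generator (an illustrative helper, not from the slides):

```python
def leave_one_out(samples):
    """Yield (training set, test set) pairs, leaving out one sample each time."""
    for i in range(len(samples)):
        training = samples[:i] + samples[i + 1:]   # all samples minus {Ai}
        test = [samples[i]]                        # {Ai}
        yield training, test

folds = list(leave_one_out([10, 20, 30, 40]))
```

With n samples this trains the model n times, which is where the computational cost comes from.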
![Page 26: Machine Learning](https://reader035.fdocuments.net/reader035/viewer/2022070319/557f2b2ad8b42a46658b4988/html5/thumbnails/26.jpg)
April 9th 2010 Paolo Marcatili - Master in Bioinformatica 26
Assessment
K-fold cross validation:
Divide your data into K subsets S1..SK
Training set: all samples minus Si
Test set: Si
Repeat for each i

A good compromise between the two previous approaches.
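The K-fold scheme can be sketched the same way (the slicing trick used to build the folds is our own choice of illustration):

```python
def k_fold_splits(samples, k):
    """Divide samples into k subsets; yield one (training, test) pair per fold."""
    folds = [samples[i::k] for i in range(k)]   # k roughly equal subsets S1..Sk
    for i in range(k):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test

splits = list(k_fold_splits(list(range(12)), k=3))
```

Each sample is tested exactly once, but the model is trained only k times instead of n.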
Assessment

Sensitivity: TP / [TP + FN]. Given the disease is present, the likelihood of testing positive.
Specificity: TN / [TN + FP]. Given the disease is not present, the likelihood of testing negative.
Positive Predictive Value: TP / [TP + FP]. Given a positive test, the likelihood the disease is present.

The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied.
The area under the ROC curve (AROC) is often used as a parameter to compare different classifiers.
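These definitions translate directly into code (function names are ours; the counts in the example are invented):

```python
def sensitivity(tp, fn):
    """Given the disease is present, the likelihood of testing positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Given the disease is not present, the likelihood of testing negative."""
    return tn / (tn + fp)

def ppv(tp, fp):
    """Given a positive test, the likelihood the disease is present."""
    return tp / (tp + fp)

# One classifier threshold gives one ROC point: (1 - specificity, sensitivity).
roc_point = (1 - specificity(tn=50, fp=50), sensitivity(tp=80, fn=20))
```

Sweeping the threshold and collecting these points traces the ROC curve; the area under it is the AROC.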
Supervised Learning

Basic idea: use data + classification of known samples to find “fingerprints” of the classes in the data.
Example: use microarray data under different conditions; classes: genes related/unrelated to different cancer types.
Support Vector Machines

Basic idea: plot your data in an N-dimensional space.
Find the best hyperplane that separates the different classes.
Further samples are classified according to the region of space they belong to.
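Once a separating hyperplane w·x + b = 0 has been found, classifying a new sample is just a sign check. A minimal sketch (the hyperplane here is hand-picked for illustration, not learned):

```python
def classify(x, w, b):
    """Assign x to a class by which side of the hyperplane w.x + b = 0 it lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "Pass" if score > 0 else "Fail"

# Toy 2-D hyperplane: length + weight - 5 = 0
w, b = (1.0, 1.0), -5.0
labels = [classify(x, w, b) for x in [(4.0, 3.0), (1.0, 2.0)]]
```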
(Figure: samples plotted by length vs. weight, labeled Pass/Fail. The Optimal Hyperplane (OHP) separates the two classes with maximum margin; the samples lying on the margin are the support vectors. An SVM built on the OHP is a simple kind of SVM, called an LSVM.)
Support Vector Machines

What if data are not linearly separable?
Allow mismatches: soft margins (add a weight matrix).
Or map the data to a higher-dimensional space, e.g. (weight², length², weight * length), where a separating hyperplane may exist.
April 9th 2010 Paolo Marcatili - Master in Bioinformatica 41
Support Vector MachinesSupervised Learning
Only Inner product is needed to calculate Dual problem and decision function
weight2
length2
weight * length
Hypersurface
length
sd
Kernelization
Hyperplane
Original Data
What if data are not linearly separable?The Kernel trick!
![Page 42: Machine Learning](https://reader035.fdocuments.net/reader035/viewer/2022070319/557f2b2ad8b42a46658b4988/html5/thumbnails/42.jpg)
April 9th 2010 Paolo Marcatili - Master in Bioinformatica 42
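A quick numerical check of the trick for the quadratic map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2): the inner product of the mapped vectors equals the kernel k(x, y) = (x.y)^2, so the mapping itself is never needed:

```python
import math

def phi(x):
    """Explicit quadratic feature map for 2-D input."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def kernel(x, y):
    """Degree-2 polynomial kernel: computes <phi(x), phi(y)> without mapping."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
via_kernel = kernel(x, y)
```

In a kernelized SVM, every inner product in the dual problem is simply replaced by such a kernel function.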
SVM example

Knowledge-based analysis of microarray gene expression data by using support vector machines
Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.
Hidden Markov Models

There is a regular and a biased coin.
You don't know which one is being used.
During the game the coins are exchanged with a certain fixed probability.
All you know is the output sequence:

HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH

Given the parameters, what is the probability of the output sequence? Which parameters are most likely to have produced the output? Which coin was being used at a given point of the sequence?
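The first question (the probability of the output sequence given the parameters) is answered by the forward algorithm. A sketch for the coin example; all the probabilities below are made-up illustrative numbers, not from the slides:

```python
# Hidden states: F (fair coin) and B (biased coin).
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}                 # assumed initial distribution
TRANS = {"F": {"F": 0.9, "B": 0.1},          # assumed keep/exchange probabilities
         "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5},           # fair coin
        "B": {"H": 0.8, "T": 0.2}}           # assumed bias toward heads

def forward(sequence):
    """Probability of the observed H/T sequence under the model (forward algorithm)."""
    alpha = {s: START[s] * EMIT[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        alpha = {s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][symbol]
                 for s in STATES}
    return sum(alpha.values())
```

Dynamic programming keeps the cost linear in the sequence length, instead of exponential in the number of hidden coin-switching paths.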
Decision trees

Mimics the behavior of an expert.
Pros: easy to interpret, amenable to statistical analysis, informative results.
Cons: one variable at a time, not optimal, not robust.

Majority rules!
Random Forests

Split the data into several subsets and construct a decision tree for each.
Each tree casts a vote; the majority wins.
Much more accurate and robust (bootstrap).
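The bootstrap-and-vote idea can be sketched with one-split "stump" trees on 1-D data (everything here, data and thresholds included, is illustrative):

```python
import random

def train_stump(xs, ys):
    """A one-split 'tree': threshold halfway between the two class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2

def random_forest(xs, ys, n_trees=15, seed=0):
    """Train each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    stumps = []
    while len(stumps) < n_trees:
        idx = [rng.randrange(len(xs)) for _ in xs]      # sample with replacement
        sample_ys = [ys[i] for i in idx]
        if len(set(sample_ys)) < 2:                     # degenerate resample: skip
            continue
        stumps.append(train_stump([xs[i] for i in idx], sample_ys))
    return stumps

def predict(stumps, x):
    """Each tree votes (x above/below its threshold); the majority wins."""
    votes = sum(1 for t in stumps if x > t)
    return 1 if votes > len(stumps) / 2 else 0

xs = [0, 1, 2, 3, 9, 10, 11, 12]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(xs, ys)
```

Because each tree sees a different bootstrap sample, the ensemble averages away much of the instability of a single tree.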
Prediction of protein–protein interactions using random decision forest framework
Xue-Wen Chen and Mei Liu
Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions.
Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
Bayesian Networks

The probabilistic approach is extremely powerful, but it requires a huge amount of information/data for a complete representation.
Not all correlations or cause-effect relationships between variables are significant.

Consider only meaningful links!
I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
Network topology reflects "causal" knowledge:
A burglar can set the alarm off.
An earthquake can set the alarm off.
The alarm can cause Mary to call.
The alarm can cause John to call.

Bayes' Theorem again!
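With conditional probability tables attached to that topology, the query P(Burglary | JohnCalls, not MaryCalls) can be answered by summing the factorized joint distribution over the hidden variables. All the numbers below are illustrative assumptions, not from the slides:

```python
from itertools import product

P_B, P_E = 0.001, 0.002                            # assumed priors
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                    # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """Joint probability factorized along the network topology."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def p_burglary(john, mary):
    """P(Burglary | JohnCalls=john, MaryCalls=mary) by enumeration."""
    num = sum(joint(True, e, a, john, mary) for e, a in product((True, False), repeat=2))
    den = sum(joint(b, e, a, john, mary) for b, e, a in product((True, False), repeat=3))
    return num / den
```

The factorization is the whole point of the network: five binary variables need 31 free parameters in general, but only 10 here, because each variable depends only on its parents.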
We don't know the joint probability distribution; how can we learn it from the data?
Optimize the likelihood, i.e. the probability that the model generated the data:
maximum likelihood (simplest), maximum posterior, marginal likelihood (hardest).

We don't know which relationships hold between the variables; how can we learn them from the data?
The number of possible graphs is super-exponential, so enumeration is impossible: heuristics, random sampling, Monte Carlo.
Does the independence assumption hold? Is the correlation informative? (BIC, Occam's razor, AIC)
Neural Networks

Neural Networks interpolate functions. They have nothing to do with brains.
Parameter settings: avoid overfitting.
Learning --> validation --> usage. No underlying model, but it often works.
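The smallest possible illustration of "interpolating a function": a single linear neuron fitted by gradient descent to samples of y = 2x + 1 (the toy data and learning rate are our own choices):

```python
def train_neuron(points, lr=0.05, epochs=500):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

points = [(x, 2 * x + 1) for x in (-2, -1, 0, 1, 2)]
w, b = train_neuron(points)
```

Real networks stack many such units with non-linear activations, but the learn-by-gradient loop is the same; with too many units and too many epochs, the network starts fitting the noise, which is the overfitting the slide warns about.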
Protein Disorder Prediction: Implications for Structural Proteomics
Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson, and Robert B. Russell
Abstract
A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.
Unsupervised Learning

If we have no idea of the actual data classification, we can try to guess it.
Clustering

Put together similar objects to define classes.

How? K-means, hierarchical top-down, hierarchical bottom-up, fuzzy.
Which metric? Euclidean, correlation, Spearman rank, Manhattan.
Which "shape"? Compact, concave, outliers, inner radius, cluster separation.
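A bare-bones K-means on 1-D data with the Euclidean metric (initial centers chosen by hand for the illustration; real implementations pick them randomly):

```python
def kmeans(points, centers, iterations=20):
    """Alternate assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1, 2, 3, 10, 11, 12], centers=[1, 12])
```

Swapping the distance function (Manhattan, correlation, ...) changes the metric; the assign/re-center loop stays the same.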
![Page 72: Machine Learning](https://reader035.fdocuments.net/reader035/viewer/2022070319/557f2b2ad8b42a46658b4988/html5/thumbnails/72.jpg)
April 9th 2010 Paolo Marcatili - Master in Bioinformatica 72
Hierarchical ClusteringUnsupervised Learning
•We start with every data point in a separate cluster•We keep merging the most similar pairs of data points/clusters until we have one big cluster left•This is called a bottom-up or agglomerative method
K-means (Unsupervised Learning)

- Start with K random centers.
- Assign each sample to the closest center.
- Recompute the centers (as the average of their samples).
- Repeat until converged.
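The four steps above map directly onto a short numpy implementation (a minimal sketch; the empty-cluster fallback and the seeded initialization are implementation choices, not part of the algorithm definition):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. start with K random centers (here: K distinct samples)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each sample to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute centers as the average of their samples
        #    (an empty cluster keeps its old center)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # 4. repeat until converged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Convergence is to a local optimum only: different random initializations can give different partitions, which is why the method is usually restarted several times.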
PCA (Unsupervised Learning)

- Multidimensional data are hard to visualize.
- Data variability is not equally distributed across dimensions.
- Variables are correlated.

Idea: change the coordinate system to remove the correlation, then retain only the most variable coordinates.

How: generalized eigenvectors, or SVD of the centered data matrix.

Pro: noise (and information) reduction.
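The SVD route mentioned above, as a minimal numpy sketch (the function name and return values are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto the directions of largest variance (via SVD)."""
    Xc = X - X.mean(axis=0)                    # remove the mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # new, uncorrelated coordinate axes
    scores = Xc @ components.T                 # data expressed in the new coordinates
    explained_var = (S ** 2) / (len(X) - 1)    # variance captured along each axis
    return scores, components, explained_var[:n_components]
```

On data lying near a line in 2D, the first component captures essentially all the variance and the second is pure noise — dropping it is exactly the "retain only the most variable coordinates" step.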
Caveats
Data independence

Training set, test set, and validation set must be clearly separated.

Example: a neural network to infer gene function from sequence.
- Training set: annotated gene sequences with a deposit date before Jan 2007.
- Test set: annotated gene sequences with a deposit date after Jan 2007.

But the annotation of new sequences is often inferred from the old ones — so the two sets are not independent!
Biases

Data should be unbiased, i.e. a good sample of our "space".

Example: a neural network to find disordered regions.
- Training set: solved structures, using residues present in SEQRES but missing from ATOM records.

But solved structures are typically small, globular, cytoplasmic proteins — so the training data are a biased sample.
Take-home message

- Always look at the data: ML methods are extremely error-prone.
- Use probability and statistics where possible.
- Work in this order: model, data, validation, algorithm.
- Be careful with biases, redundancy, and hidden variables.
- Occam's razor: simpler is better.
- Be careful with overfitting and overparametrization.
- Common sense is a powerful tool (but don't abuse it).
References

- Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR (2007) A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Comput Biol 3(8): e129. doi:10.1371/journal.pcbi.0030129
- Tarca AL, Carey VJ, Chen X-w, Romero R, Drăghici S (2007) Machine Learning and Its Applications to Biology. PLoS Comput Biol 3(6): e116. doi:10.1371/journal.pcbi.0030116
- Eddy SR (2004) What is a hidden Markov model? Nature Biotechnology 22: 1315–1316. doi:10.1038/nbt1004-1315
- Stanford Engineering Everywhere, Machine Learning (CS229): http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
Bayes Theorem (Supplementary)

a) AIDS affects 0.01% of the population.
b) The AIDS test, when performed on infected patients, is correct 99.9% of the time.
c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time.

If a person tests positive, how likely is it that he is infected?
P(A|T) = P(T|A)·P(A) / (P(T|A)·P(A) + P(T|¬A)·P(¬A))

P(A|T) ≈ 50% (precisely, 49.98%): despite the very accurate test, a positive result is still close to a coin flip, because the disease is so rare.
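The computation above, spelled out (the function name is illustrative):

```python
def posterior(prior, sensitivity, specificity):
    """P(infected | positive test), by Bayes' theorem."""
    true_pos = sensitivity * prior                    # P(T|A) * P(A)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(T|not A) * P(not A)
    return true_pos / (true_pos + false_pos)

# prior 0.01%, sensitivity 99.9%, specificity 99.99%
p = posterior(prior=0.0001, sensitivity=0.999, specificity=0.9999)
```

With a 0.01% prior, true positives (0.999 × 0.0001) and false positives (0.0001 × 0.9999) are almost equally frequent — hence the roughly 50/50 posterior.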