
Bioinformatics - Lecture 07

Bioinformatics
Clusters and networks

Martin Saturka

http://www.bioplexity.org/lectures/

EBI version 0.4

Creative Commons Attribution-Share Alike 2.5 License

Martin Saturka www.Bioplexity.org Bioinformatics - Learning


Learning on profiles

Supervised and unsupervised methods on expression data.
Approximation, clustering, classification, inference.
Crisp and fuzzy relation models. Graphical models.

Main topics

vector approaches
- SVM, ANN, kernels
- classification

data organization
- clustering techniques
- map generation

regulatory systems
- Bayesian networks
- algebraic description


Inference types

reasoning
- logical: reasoning based on standard logical rules
- statistical: reasoning based on frequent co-occurrence

deduction (logic, recursion)
- A, A → B ⊢ B

induction (frequentist statistics)
- many A, B ⊢ A ∼ B
- few A, ¬B ⊢ A → B

abduction (Bayesian statistics)
- A1 → B, ..., An → B, B ⊢ Ai


Machine learning methods

supervised learning methods
with known correct outputs on training data

- approximation
  measured data to a (continuous) output function
  growth rate → nutrition supply regulation
- classification
  measured data to a discrete output function
  expression profiles → illness diagnosis
- regression
  (cor)relations of continuous measured data
  a gene expression magnitude → growth rate

unsupervised learning methods
without known desired outputs on the used data

- data granulation
  internal data organization and distribution
- data visualization
  overall outer view onto the internal data


MLE

maximal likelihood estimation
- $L(y \mid X) = \Pr(X \mid Y = y)$
- conditional probability as a function of the unknown condition, with the outcome known: a reverse view on probability

Bernoulli trials example

$L(\theta \mid H = 11, T = 10) \equiv \Pr(H = 11, T = 10 \mid p = \theta) = \binom{21}{11}\,\theta^{11}(1-\theta)^{10}$

$0 = \frac{\partial}{\partial\theta} L(\theta \mid H, T) = \binom{21}{11}\,\theta^{10}(1-\theta)^{9}(11 - 21\theta) \;\rightarrow\; \theta = 11/21$

- used when no better model is available
- maximizations inside dynamic programming techniques
- drawback: variance estimation leads to the biased sample variance
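As a numerical check of the Bernoulli example, the sketch below evaluates the likelihood on a grid of θ values and compares the maximizer with the closed form 11/21. The function name `likelihood` and the grid resolution are illustrative choices, not part of the lecture.

```python
from math import comb

def likelihood(theta, heads=11, tails=10):
    """Binomial likelihood L(theta | H, T) for a Bernoulli parameter theta."""
    n = heads + tails
    return comb(n, heads) * theta**heads * (1 - theta)**tails

# Evaluate the likelihood on a fine grid over (0, 1) and pick the maximizer.
grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=likelihood)

print(theta_hat, 11 / 21)
```

The grid maximizer should agree with 11/21 up to the grid spacing.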


Regression

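linear regression

least squares
- for homoskedastic distributions
- the sample mean is the best estimate
- $\min_{\bar{x} \in \mathbb{R}} \sum_i |x_i - \bar{x}|^2$
- $\left(\sum_i |x_i - \bar{x}|^2\right)' = 0$ → arithmetic mean: $\bar{x} = \sum_i x_i / n$

least absolute deviations
- robust version
- the sample median is a safe estimate
- $\min_{\bar{x} \in \mathbb{R}} \sum_i |x_i - \bar{x}|$
- $\left(\sum_i |x_i - \bar{x}|\right)' = 0$ → median: $\#\{x_i : x_i < \bar{x}\} = \#\{x_i : x_i > \bar{x}\}$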
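The two minimizations can be checked numerically. The following sketch searches a grid of candidate centers for a made-up sample with one outlier; the sample values and helper names are assumptions for illustration only.

```python
from statistics import mean, median

def squared_loss(center, xs):
    """Sum of squared deviations from a candidate center."""
    return sum((x - center) ** 2 for x in xs)

def absolute_loss(center, xs):
    """Sum of absolute deviations from a candidate center."""
    return sum(abs(x - center) for x in xs)

xs = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up sample with one outlier

# Brute-force search over candidate centers 0.0, 0.1, ..., 100.0.
grid = [i / 10 for i in range(0, 1001)]
best_sq = min(grid, key=lambda c: squared_loss(c, xs))
best_abs = min(grid, key=lambda c: absolute_loss(c, xs))

print(best_sq, mean(xs))     # least squares center vs. arithmetic mean
print(best_abs, median(xs))  # least absolute deviations center vs. median
```

The outlier pulls the least squares center far from the bulk of the data, while the least absolute deviations center stays at the median, which is the robustness the slide refers to.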


Parametrization

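parametrized curve crossing
- assumption: equal amounts of points above / below the curve
- the same probabilities to cross / not to cross the curve between consecutive points
- the cross count distribution approaches the normal distribution
- over/under-fitting if the cross count is not in $\frac{N-1}{2} \pm \frac{\sqrt{N-1}}{2}$ (mean ± standard deviation of a binomial count over the N − 1 gaps between consecutive points)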


Distinctions

empirical risk minimization for discrimination

metamethodology
- boosting
  increasing the weights of training cases with wrong results
- probably approximately correct learning
  to achieve, with high probability, approximately correct predictions

particular methods
- support vector machines
- artificial neural networks
- case-based reasoning
- nearest neighbour algorithm
- (naive) Bayes classifier
- decision trees
- random forests


SVM

Support vector machines

linear classifier
- maximal margin linear separation
- minimal distances for misclassified points


Kernel methods

non-linear separations are turned into linear separations in a higher-dimensional space by a mapping F(.)

the linear discriminant is given by the dot product ⟨F(x_i), F(x_j)⟩

back into the low-dimensional space by a kernel K(x_i, x_j) = ⟨F(x_i), F(x_j)⟩
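A small worked instance may help here. The sketch below uses the degree-2 polynomial kernel on 2D inputs, chosen only as an illustration (the lecture does not name a specific kernel), and checks that the kernel value computed in the low-dimensional space equals the dot product of the explicitly mapped vectors.

```python
from math import sqrt, isclose

def feature_map(x):
    """Explicit map F: R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, y):
    """K(x, y) = <x, y>^2, evaluated without ever building F(x)."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = dot(feature_map(x), feature_map(y))  # dot product in the 3D space
rhs = kernel(x, y)                         # same quantity from 2D inputs
print(lhs, rhs)
```

The kernel never constructs the high-dimensional vectors, which is why it stays cheap even when F maps into a very large space.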


VC dimension

Vapnik-Chervonenkis dimension

classification error estimation
- the more power a method has, the more prone it is to overfitting
- misclassifications for a binary classifier f(α)
- iid samples drawn from an unknown distribution

R(α): probability of misclassification in real usage
R_emp(α): fraction of misclassified cases of a training set

then, with probability 1 − η, for a training set of size N:

$R(\alpha) < R_{emp}(\alpha) + \sqrt{\frac{h\,(1 + \ln(2N/h)) - \ln(\eta/4)}{N}}$

h: the VC dimension
- size of the maximal sets that f(α) can shatter
- 3 for a line classifier in a 2D space
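To get a feeling for the size of the confidence term, the bound can be evaluated directly. The function name `vc_bound` and the example values (empirical risk 0.05, h = 3, η = 0.05) are assumptions for illustration.

```python
from math import log, sqrt

def vc_bound(r_emp, h, n, eta):
    """Upper bound on the true risk R, holding with probability 1 - eta."""
    confidence_term = sqrt((h * (1 + log(2 * n / h)) - log(eta / 4)) / n)
    return r_emp + confidence_term

# The bound tightens as the training set grows (h = 3: a line in 2D).
for n in (100, 1000, 10000):
    print(n, round(vc_bound(r_emp=0.05, h=3, n=n, eta=0.05), 3))
```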


ANN

artificial neural networks

[Figure: feed-forward network with inputs X_i (iA, iB, iC), n hidden neurons (numbered 1-8), weights w (wA1, wA2, wB2, ..., w15, w16, ..., w5o, ..., w8o) and a single output]

neuron activation function
- f2(e) = f2(iA · wA2 + iB · wB2 − c2)
- f_i is non-linear, usually sigmoid, with c_i given constants
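A single neuron of this kind can be sketched in a few lines. The logistic sigmoid is assumed as the activation, and the function names and example numbers are illustrative.

```python
import math

def sigmoid(e):
    """Logistic activation, one common choice of non-linear f."""
    return 1.0 / (1.0 + math.exp(-e))

def neuron(inputs, weights, c):
    """Weighted input sum minus the constant c, passed through the activation."""
    e = sum(i * w for i, w in zip(inputs, weights)) - c
    return sigmoid(e)

# Neuron 2 of the slide: inputs iA, iB with weights wA2, wB2 and constant c2.
print(neuron(inputs=[0.5, -1.0], weights=[0.8, 0.3], c=0.1))
```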


ANN learning

error backpropagation

iterative weight adjusting
- compute errors for each training case: δ = desired − computed
- propagate the δ backward: δ5 = δ · w5o, δ1 = δ5 · w15 + δ6 · w16, ...
- adjust weights to new values:
  $w_{A1}^{new} = w_{A1} + \eta \cdot \delta_1 \cdot \frac{df_1(e)}{de} \cdot i_A$
  $w_{15}^{new} = w_{15} + \eta \cdot \delta_5 \cdot \frac{df_5(e)}{de} \cdot f_1(e)$
  ...

- a kind of gradient descent method
- other (sigmoid function) parameters can be adjusted as well
- converges to a local minimum of errors
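The update scheme can be sketched on a tiny network with one hidden layer. This is a minimal illustration under assumptions: sigmoid activations, squared error, a constant 1.0 input standing in for the constants c_i, and the derivative factor df/de folded into each δ (the usual gradient descent bookkeeping, whereas the slide lists that factor separately in the weight update). All names and starting weights are made up.

```python
import math

def sigmoid(e):
    return 1.0 / (1.0 + math.exp(-e))

def forward(x, w_hidden, w_out):
    """Return hidden activations and the network output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out

def train_step(x, desired, w_hidden, w_out, eta=0.5):
    """One backpropagation step; returns the adjusted weights."""
    hidden, out = forward(x, w_hidden, w_out)
    # Output error times the sigmoid derivative out * (1 - out).
    delta_out = (desired - out) * out * (1.0 - out)
    # Propagate the error backward along the old output weights.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w + eta * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + eta * delta_hidden[j] * xi
                     for w, xi in zip(w_hidden[j], x)]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out

# Two inputs plus a constant 1.0, two hidden neurons, one output.
x, desired = [1.0, 0.0, 1.0], 1.0
w_hidden = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
w_out = [0.2, -0.3]

error_before = abs(desired - forward(x, w_hidden, w_out)[1])
for _ in range(200):
    w_hidden, w_out = train_step(x, desired, w_hidden, w_out)
error_after = abs(desired - forward(x, w_hidden, w_out)[1])
print(error_before, error_after)
```

With a single training case the error simply shrinks; on real data the same loop runs over all cases and may settle in a local minimum, as the slide notes.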


SOM

self-organizing maps

a kind of unsupervised version of ANNs
- the map is commonly a 2D array
- array nodes exhibit a simple property
- each input is connected to each output
- used to visualize multidimensional data
- similar parts should behave similarly

competitive learning of the network
- nodes compete to represent particular data objects
- each node of the array has its vector of weights
- initially either random or taken from the two principal components
- iterative adjusting of node weights / property:
  - take a random data object
  - find its best matching node according to the nodes' weights
  - adjust the node weights / property to be more similar to the data
  - adjust other neighboring nodes somewhat too
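The iterative step can be sketched as follows. The grid size, the learning rate, the Manhattan neighborhood on the grid, and the `rate / (1 + grid_dist)` damping are all assumptions chosen for brevity; practical SOMs usually also shrink the rate and radius over time.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def best_matching_node(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: squared_distance(weights[node], x))

def som_step(weights, x, rate=0.5, radius=1):
    """Move the best matching node and its grid neighbors toward x."""
    bmu = best_matching_node(weights, x)
    for node in weights:
        grid_dist = abs(node[0] - bmu[0]) + abs(node[1] - bmu[1])
        if grid_dist <= radius:
            influence = rate / (1 + grid_dist)  # neighbors move less
            weights[node] = [w + influence * (xi - w)
                             for w, xi in zip(weights[node], x)]
    return bmu

random.seed(0)
# 4 x 4 map with random initial weights for 3-dimensional data.
weights = {(r, c): [random.random() for _ in range(3)]
           for r in range(4) for c in range(4)}
data = [[random.random() for _ in range(3)] for _ in range(50)]

for _ in range(200):
    som_step(weights, random.choice(data))
```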


GTM

generative topographic map

GTM characteristics
- non-linear latent variable model
- probabilistic counterpart to the SOM model
- a generative model

- actual data are modeled as created by mappings from a low-dimensional space into the actual high-dimensional space
- data visualization is gained according to Bayes' theorem
- the latent-to-data space mappings are Gaussian distributions
- the created densities are iteratively fitted to approximate the real data distribution
- uses known Gaussian mixture and radial basis function algorithms


Nearest neighbor

case-based reasoning classification
- the diagnosis is set to that of the most similar determined case
- how to measure distances between particular cases?

k-NN
- take the k most similar cases, each of them has a vote
- simple but frequently works for (binary) classification

common problems
- which properties are significant, which are just noise
- suitable sizes of similar cases, how to avoid outliers
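A minimal k-NN vote can be sketched like this. Euclidean distance is assumed as the similarity measure, and the two-feature "expression profiles" with their labels are made-up illustration data.

```python
from collections import Counter
from math import dist

def knn_classify(query, cases, k=3):
    """cases: list of (feature_vector, label) pairs; majority vote of the k nearest."""
    nearest = sorted(cases, key=lambda case: dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

cases = [((1.0, 1.2), "healthy"), ((0.9, 1.0), "healthy"), ((1.1, 0.8), "healthy"),
         ((3.0, 3.2), "ill"), ((3.1, 2.9), "ill"), ((2.8, 3.0), "ill")]

print(knn_classify((1.0, 1.0), cases))
print(knn_classify((3.0, 3.0), cases))
```

Choosing an odd k avoids ties in binary classification.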


Relations

the right descriptive features - the right similar cases

search for important gene expressions and patient cases

unsupervised methods
- data clustering: for the similar data cases
- data mining: for the important features

supervised methods
- Bayesian network inference: informatics and statistics
- minimum message length: informatics and algebra
- inductive logic programming: informatics and logic


Cliques

graph approach to clustering
- transform the given data table to a graph: vertices for genes, edges for gene pairs with similarity above a threshold
- find the least graph alteration that results in a clique graph

CAST algorithm
- iterative heuristic clique generation
- a clique construction from the available vertices:
  - initiate with a vertex of maximal degree
  - while a distant vertex is taken or a close vertex is free:
    - add the closest vertex into the clique
    - remove the farthest vertex from the clique


Clustering

standard clustering techniques

to make separated homogeneous groups

center based methods
- k-means as the standard
- c-means, qt-clustering

hierarchy methods
- agglomerative, bottom-up
- divisive, top-down

combinations
- two-step approach


Cluster structures

[Figure panels: single linkage, complete linkage; hierarchical clustering, neighbor joining; k-means clustering, qt-clustering]


Distances

how to measure object-object (dis)similarity

- Euclidean distance: $\left[\sum_i (x_i - y_i)^2\right]^{1/2}$
- Manhattan distance: $\sum_i |x_i - y_i|$
- power distance: $\left[\sum_i |x_i - y_i|^p\right]^{1/p}$
- maximum distance: $\max_i \{|x_i - y_i|\}$
- Pearson's correlation: dot product for normalized data
- percentage disagreement: fraction of positions with $x_i \neq y_i$

metric significance
- different powers usually do not significantly alter the results
- more different distance measures should change the cluster compositions
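The listed measures are short enough to write out directly. The sketch below is one plain-Python rendering; the function names are illustrative, and `pearson` centers and normalizes the vectors itself before taking the dot product.

```python
from math import sqrt

def minkowski(x, y, p):
    """Power distance; p = 2 gives Euclidean, p = 1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)

def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Correlation as the dot product of centered, unit-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mx for a in x]
    cy = [b - my for b in y]
    norm = sqrt(sum(a * a for a in cx)) * sqrt(sum(b * b for b in cy))
    return sum(a * b for a, b in zip(cx, cy)) / norm

def percentage_disagreement(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)), maximum((0, 0), (3, 4)))
print(percentage_disagreement("AACG", "AATG"))
```

Note that Pearson's correlation is a similarity rather than a distance; a common convention is to cluster on 1 − correlation.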


Hierarchical clustering

neighbor joining - common agglomerative method

hierarchical tree creation

joining the most similar clusters
- single linkage - nearest neighbor
  distances according to the most similar cluster objects
- complete linkage - furthest neighbor
  distances according to the most distant cluster objects
- average linkage
  cluster distances as mean distances of the respective elements
- Ward's method - information loss minimization
  takes the minimal variance increase over the possible cluster pairs
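The first three linkage rules differ only in how pairwise object distances are aggregated, which a short sketch makes concrete. This is a naive agglomerative loop for illustration (Ward's method is not covered); the function names and the 1D example points are assumptions.

```python
from itertools import combinations

def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":      # most similar pair of objects
        return min(pairwise)
    if linkage == "complete":    # most distant pair of objects
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, dist, linkage="single"):
    """Bottom-up merging of the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]],
                                                   dist, linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

points = [0.0, 1.0, 1.5, 10.0, 11.0]
abs_dist = lambda a, b: abs(a - b)
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(points, 2, abs_dist, linkage))
```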


The k-means

the most frequently used kind of clustering

k-means clustering algorithm
- start: choose initial k centers
- iterate for the objects (e.g. genes) being clustered:
  - compute new distances to the centers, choose the nearest one
  - for each cluster compute a new center
- end when no cluster changes
- to put less weight on similar microarrays

pros
- usually fast, does not compute all object-object distances

cons
- the amount and initial positions of centers highly affect the results
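The loop described above can be sketched in plain Python. Euclidean distance and explicitly supplied initial centers are assumed; the function name and sample points are made up.

```python
from math import dist

def kmeans(points, initial_centers, max_iter=100):
    """Alternate nearest-center assignment and center recomputation."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(centers)),
                              key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:  # no cluster changed: stop
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, initial_centers=[(0, 0), (10, 10)])
print(centers, assignment)
```

Rerunning with different `initial_centers` is a quick way to see how strongly the result depends on the starting positions.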


Center count

do not add more clusters when it does not increase the gained information sufficiently

center selection
- random
  run k-means clustering several times
- PCA
  principal components lie in the data clouds
- data objects
  choose distant objects, with weights
- two steps
  take a larger amount of clusters, then do hierarchical clustering on the resulting centers


Cluster fit

how good are the obtained clusters

k-means clustering
- intra-cluster vs. out-of-cluster distances for objects
- ratios of inter-cluster to intra-cluster distances: Dunn's index
- inter / intra variances

hierarchical clustering
- variances of each cluster
- similarity for cluster means
- bootstrapping for objects with suitable inner structures
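One common form of Dunn's index divides the smallest distance between objects of different clusters by the largest cluster diameter. The sketch below implements that variant with Euclidean distance (other definitions of the inter- and intra-cluster terms exist); it needs at least two clusters and at least one cluster with two or more objects.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Smallest between-cluster distance over largest within-cluster diameter."""
    min_inter = min(dist(a, b)
                    for c1, c2 in combinations(clusters, 2)
                    for a in c1 for b in c2)
    max_intra = max(dist(a, b)
                    for c in clusters
                    for a, b in combinations(c, 2))
    return min_inter / max_intra

compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
overlapping = [[(0, 0), (0, 5)], [(1, 0), (1, 5)]]
print(dunn_index(compact), dunn_index(overlapping))
```

Higher values indicate compact, well separated clusters.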


Alternative clustering

qt (quality threshold) clustering
- choose a maximal cluster diameter instead of a center count
- try to make a maximal cluster around each data object
- take the one with the greatest amount of objects inside
- call it recursively on the rest of the data
- more computation intensive, more plausible than k-means
- can be done alike for maximal cluster sizes

soft c-means
- each object (gene) is in more clusters (gene families)
- object belonging degrees, with sums equal to one
- similar to k-means, stop when the cluster changes are small
- suitable for lower amounts of clusters

spectral clustering
- object segmentation according to the similarity Laplacian matrix
- eigenvector of the second smallest eigenvalue
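The qt-clustering steps can be sketched as a greedy procedure. This is an illustrative implementation under assumptions: candidates grow by always adding the object that keeps the diameter smallest, ties between equally large candidates go to the first one found, and the loop replaces the recursion mentioned on the slide. Names and sample data are made up.

```python
def qt_cluster(points, max_diameter, dist):
    """Repeatedly extract the largest candidate cluster within max_diameter."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Diameter contribution of adding object i to the candidate.
                def grown_diameter(i):
                    return max(dist(points[i], points[j]) for j in candidate)
                nxt = min(others, key=grown_diameter)
                if grown_diameter(nxt) > max_diameter:
                    break
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append([points[i] for i in best])
        remaining = [i for i in remaining if i not in best]
    return clusters

points = [0.0, 0.5, 1.0, 10.0, 10.4]
clusters = qt_cluster(points, max_diameter=1.0, dist=lambda a, b: abs(a - b))
print(clusters)
```

Building a candidate around every remaining object in every round is what makes the method more computation intensive than k-means.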


Dependency description

used for characterization, classification and compression

BN: Bayesian networks
- what depends on what, which variables are independent
- then fast, suitable inference computing

MML: minimal message length
- the shortest form of an object / feature description
- the real amount of information an object contains

ILP: inductive logic programming
- to prove most of the positive / least of the negative cases
- accurate characterization of the given objects vs. the background


Graphical models

joint distributions
- Pr(x1, x2, x3) = Pr(x1 | x2, x3) · Pr(x2 | x3) · Pr(x3)
- intractable for slightly larger amounts of variables
- used to compute important probabilities themselves
- used for conditional probabilities → for Bayesian inference

conditional independence
- A and B independent under C: A ⊥⊥ B | C
- A ⊥⊥ B | C: Pr(A, B | C) = Pr(A | C) · Pr(B | C)
- after a few rearrangements: Pr(A | B, C) = Pr(A | C)

Markov processes are just one example of conditional independence

graphs and relations
- simplifying the structure by stating just some valid dependencies
- no edges → (conditional) independence
- edges state the only dependencies


Bayesian systems

[Figure: DAG of dependencies over nodes 1-9]

x5 depends on x1, x2

Pr(x5 | x1, x2, x3, x4) = Pr(x5 | x1, x2)

any node is, given all its parents, (conditionally) independent of all the nodes which are not its descendants (e.g. x3 and x5 are independent altogether)


Bayesian network

inference
- inference along the DAG is computed according to the decomposition of conditional probabilities
- inference along the opposite directions is computed according to the Bayesian approach

[Figure: network over the nodes A, B, C]

Pr(A, B, C) = Pr(A | B, C) · Pr(B | C) · Pr(C)

how to compute Pr(A = True | C = True):

$\Pr(A = True \mid C = True) = \frac{\Pr(C = True, A = True)}{\Pr(C = True)} = \frac{\sum_{B} \Pr(C = True, B, A = True)}{\sum_{A,B} \Pr(C = True, B, A)}$
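The marginalization can be carried out by brute-force enumeration on a toy network. All probability tables below are made-up numbers for illustration; only the factorization Pr(A, B, C) = Pr(A | B, C) · Pr(B | C) · Pr(C) comes from the slide.

```python
from itertools import product

# Hypothetical conditional probability tables (illustration only).
p_c = {True: 0.3, False: 0.7}                      # Pr(C)
p_b_given_c = {True: 0.8, False: 0.1}              # Pr(B = True | C)
p_a_given_bc = {(True, True): 0.9, (True, False): 0.6,
                (False, True): 0.5, (False, False): 0.05}  # Pr(A = True | B, C)

def bernoulli(p_true, value):
    """Probability of a Boolean value given Pr(value is True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c):
    """Pr(A, B, C) via the chain-rule decomposition."""
    return (bernoulli(p_a_given_bc[(b, c)], a)
            * bernoulli(p_b_given_c[c], b)
            * p_c[c])

# Sum out B in the numerator, A and B in the denominator.
numerator = sum(joint(True, b, True) for b in (True, False))
denominator = sum(joint(a, b, True)
                  for a, b in product((True, False), repeat=2))
posterior = numerator / denominator
print(posterior)
```

Enumeration grows exponentially with the number of variables, which is why the independence structure of the DAG matters for larger networks.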


Naive Bayes

[Figure: node S with child nodes w1, w2, ..., wn]

S: mail/spam state
w_i: individual words

- assumption of independent outcomes
- used e.g. for spam classification
- $S = \arg\max_s \Pr(S = s) \prod_j \Pr(O_j = w_j \mid S = s)$
- possible to compute fast online with $10^4$ items
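The argmax rule can be sketched as a tiny word-count classifier. Two details are assumptions not stated on the slide: log-probabilities are summed instead of multiplying probabilities (to avoid underflow), and add-one smoothing handles unseen words. The training messages are made up.

```python
from collections import Counter
from math import log

def train(messages):
    """messages: list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in messages)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in messages:
        word_counts[label].update(words)
    vocabulary = {w for words, _ in messages for w in words}
    return label_counts, word_counts, vocabulary

def classify(words, label_counts, word_counts, vocabulary):
    """argmax over s of log Pr(S = s) + sum_j log Pr(w_j | S = s)."""
    total = sum(label_counts.values())

    def log_score(label):
        score = log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for w in words:
            score += log((word_counts[label][w] + 1) / denom)  # add-one smoothing
        return score

    return max(label_counts, key=log_score)

messages = [("buy cheap pills now".split(), "spam"),
            ("cheap offer buy now".split(), "spam"),
            ("meeting agenda for monday".split(), "mail"),
            ("lecture notes for the meeting".split(), "mail")]
model = train(messages)

print(classify("cheap pills offer".split(), *model))
print(classify("monday lecture agenda".split(), *model))
```

Classification is a single pass over the words of a message, which is what makes the online computation fast.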


Items to remember

Nota bene:

classification methods

learning techniques
- vectors, kernels, SVM, ANN
- conditional independence

clustering methods
- hierarchical clustering
- k-means, qt-clustering
