Data preprocessing and unsupervised learning methods in Bioinformatics

Data Preprocessing Unsupervised learning Elena Sügis [email protected] Introduction to Bioinformatics, LVSC2016

Transcript of Data preprocessing and unsupervised learning methods in Bioinformatics

Page 1: Data preprocessing and unsupervised learning methods in Bioinformatics

Data Preprocessing Unsupervised learning Elena Sügis [email protected]

Introduction to Bioinformatics, LVSC2016

Page 2: Data preprocessing and unsupervised learning methods in Bioinformatics

? ??

Page 3: Data preprocessing and unsupervised learning methods in Bioinformatics

VS VS

Questions we ask

Page 4: Data preprocessing and unsupervised learning methods in Bioinformatics

Questions we ask

Page 5: Data preprocessing and unsupervised learning methods in Bioinformatics

Questions we ask

Page 6: Data preprocessing and unsupervised learning methods in Bioinformatics

Experiments

Page 7: Data preprocessing and unsupervised learning methods in Bioinformatics

Data comes in different forms

Page 8: Data preprocessing and unsupervised learning methods in Bioinformatics

Data ≠ Knowledge

Page 9: Data preprocessing and unsupervised learning methods in Bioinformatics

Simple data analysis pipeline

Data → Black Magic → Result

high quality data → machine learning method → awesome result

Page 10: Data preprocessing and unsupervised learning methods in Bioinformatics

Simple data analysis pipeline

Data → Black Magic → Result

poor quality data → machine learning method → not so awesome result

Page 11: Data preprocessing and unsupervised learning methods in Bioinformatics

Data Preprocessing

Clean

Page 12: Data preprocessing and unsupervised learning methods in Bioinformatics

Massage your data

Page 13: Data preprocessing and unsupervised learning methods in Bioinformatics

80 %

Page 14: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

?

Page 15: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 16: Data preprocessing and unsupervised learning methods in Bioinformatics

Meet your data

Page 17: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 18: Data preprocessing and unsupervised learning methods in Bioinformatics

Missing Values

Origins:
•  Malfunctioning measurement equipment
•  Very low intensity signal
•  Deleted due to inconsistency with other recorded data
•  Data removed/not entered by mistake

Page 19: Data preprocessing and unsupervised learning methods in Bioinformatics

Missing Values

How to deal with them:
•  Filter out
•  Replace missing values by 0
•  Replace by the mean or median value
•  K nearest neighbor imputation (KNN imputation)
•  Expectation-Maximization (EM) based imputations

Page 20: Data preprocessing and unsupervised learning methods in Bioinformatics

k-nearest neighbors

Page 21: Data preprocessing and unsupervised learning methods in Bioinformatics

KNN

•  We are given a gene expression matrix M
•  Let X = (x1, x2, …, xi, …, xn) be a vector in the matrix M with a missing value xi at dimension i
•  Find in the gene expression matrix the vectors X1, X2, …, Xk that are the k closest vectors to X in M (with a chosen distance measure) among the vectors that do not have a missing value at dimension i
•  Replace the missing value xi with the mean (or median) of X1i, X2i, …, Xki, i.e., the mean (median) of the values at dimension i of the vectors X1, X2, …, Xk
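The KNN steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `knn_impute` is hypothetical, missing values are encoded as NaN, rows of M are the vectors, and the distance is a root-mean-square difference over the dimensions observed in both vectors (any other distance measure could be swapped in).

```python
import numpy as np

def knn_impute(M, k=3):
    """Fill NaNs in M (rows = vectors, e.g. genes) with the mean of the
    k nearest rows that have an observed value in the same column."""
    M = np.asarray(M, dtype=float)
    filled = M.copy()
    for r, c in zip(*np.where(np.isnan(M))):
        # Candidate neighbours: other rows with a value at dimension c.
        candidates = [j for j in range(M.shape[0])
                      if j != r and not np.isnan(M[j, c])]
        dists = []
        for j in candidates:
            # Assumption: compare only dimensions observed in both rows.
            shared = ~np.isnan(M[r]) & ~np.isnan(M[j])
            if shared.any():
                d = np.sqrt(np.mean((M[r, shared] - M[j, shared]) ** 2))
                dists.append((d, j))
        dists.sort()
        nearest = [j for _, j in dists[:k]]
        if nearest:
            # Mean of the neighbours' values at the missing dimension.
            filled[r, c] = np.mean(M[nearest, c])
    return filled
```

Replacing `np.mean` in the last step with `np.median` gives the median variant mentioned on the slide.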

Page 22: Data preprocessing and unsupervised learning methods in Bioinformatics

KNN

Healthy people Patients

Gene expression matrix

Page 23: Data preprocessing and unsupervised learning methods in Bioinformatics

Imputed missing values

Healthy people Patients

Gene expression matrix

Page 24: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 25: Data preprocessing and unsupervised learning methods in Bioinformatics

Technical vs Biological

Page 26: Data preprocessing and unsupervised learning methods in Bioinformatics

Normalization & Standardization

Objective: adjust measurements so that they can be appropriately compared among samples

Key ideas:
•  Remove technological biases
•  Make samples comparable

Methods:
•  Z-scores (centering and scaling)
•  Logarithmization
•  Quantile normalization
•  Linear model based normalization

Page 27: Data preprocessing and unsupervised learning methods in Bioinformatics

Z-scores

Centering a variable is subtracting the mean of the variable from each data point so that the new variable's mean is 0.

Scaling a variable is multiplying each data point by a constant in order to alter the range of the data.

$$z = \frac{x - \mu}{\sigma}$$

where:
µ is the mean of the population.
σ is the standard deviation of the population.
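As a small illustration of centering and scaling, the sketch below computes z-scores per column with NumPy. The function name `z_score` and the choice to treat columns as variables are assumptions made for the example; the population standard deviation (ddof=0) is used to match the definition of σ above.

```python
import numpy as np

def z_score(X, axis=0):
    """Center and scale X along `axis`: z = (x - mean) / sd."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=axis, keepdims=True)
    sigma = X.std(axis=axis, keepdims=True)  # population SD (ddof=0)
    # Assumption: constant variables are left at 0 rather than divided by 0.
    sigma[sigma == 0] = 1.0
    return (X - mu) / sigma
```

After the transformation each variable has mean 0 and standard deviation 1, so variables measured on different scales become comparable.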

Page 28: Data preprocessing and unsupervised learning methods in Bioinformatics

Principal Component Analysis

Transforms the data by a linear projection onto a lower-dimensional space that preserves as much data variation as possible

Page 29: Data preprocessing and unsupervised learning methods in Bioinformatics

Principal Component Analysis Objective: Reduce dimensionality while preserving as much variance as possible

http://setosa.io/ev/principal-component-analysis/
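One common way to compute such a projection is the singular value decomposition of the centered data. The sketch below is a minimal version under assumptions: the function name `pca` is hypothetical, rows are samples and columns are variables (e.g. genes).

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (samples x variables) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # directions of largest variance
    scores = Xc @ components.T               # sample coordinates in the new space
    explained = (S ** 2) / np.sum(S ** 2)    # fraction of variance per component
    return scores, components, explained[:n_components]
```

Plotting the first two columns of `scores`, colored by group or by processing day, gives the kind of visual inspection shown on the following slides.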

Page 30: Data preprocessing and unsupervised learning methods in Bioinformatics

Visualize normalized data

Groups

Healthy

Patients group1

Patients group2

Page 31: Data preprocessing and unsupervised learning methods in Bioinformatics

Visual inspection after normalization

Page 32: Data preprocessing and unsupervised learning methods in Bioinformatics

Visual Inspection. PCA

Highlight groups

Patients Healthy people

Page 33: Data preprocessing and unsupervised learning methods in Bioinformatics

Arrrgh!!! Why aren’t you together ?!?!

Page 34: Data preprocessing and unsupervised learning methods in Bioinformatics

Visual Inspection. PCA

Color by experiment/dataset/day etc.

DAY1 DAY2

Page 35: Data preprocessing and unsupervised learning methods in Bioinformatics

Batch Effects

Batch effects are technical sources of variation that have been added to the samples during handling. They are unrelated to the biological or scientific variables in a study.

Measurements are affected by:
•  Laboratory conditions
•  Reagent lots
•  Personnel differences

Major problem: batch effects might be correlated with an outcome of interest and lead to incorrect conclusions

Page 36: Data preprocessing and unsupervised learning methods in Bioinformatics

Fighting The Batch Effects

Experimental design solutions:
•  Shorter experiment time
•  Equally distributed samples between multiple laboratories and across different processing times, etc.
•  Provide info about changes in personnel, reagents, storage and laboratories

Statistical solutions:
•  ComBat
•  SVA (Surrogate variable analysis, SVD + linear models)
•  PAMR (Mean-centering)
•  DWD (Distance-weighted discrimination based on SVM)
•  Ratio_G (Geometric ratio-based)

J.T. Leek, Nature Reviews Genetics 11, 733-739 (October 2010); Chao Chen, PLoS ONE, 2011
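To give a feel for the simplest statistical idea in the list, mean-centering, here is a rough sketch. It is not the PAMR or ComBat implementation: the function name `center_by_batch` is hypothetical, rows are assumed to be samples and columns genes, and each gene is simply shifted to mean 0 within every batch.

```python
import numpy as np

def center_by_batch(X, batches):
    """Subtract the per-batch mean of every column (gene) from its samples."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    out = X.copy()
    for b in np.unique(batches):
        rows = batches == b
        # Assumption: a batch shifts all its samples by a constant per gene.
        out[rows] -= X[rows].mean(axis=0)
    return out
```

Note that if the batch is correlated with the outcome of interest (for example all patients measured on DAY1 and all healthy samples on DAY2), this correction also removes the biological difference, which is why balanced experimental design comes first.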

Page 37: Data preprocessing and unsupervised learning methods in Bioinformatics
Page 38: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 39: Data preprocessing and unsupervised learning methods in Bioinformatics

Outliers Detection

Page 40: Data preprocessing and unsupervised learning methods in Bioinformatics

Interquartile range

outlier
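A widely used rule based on the interquartile range (IQR = Q3 - Q1) flags values lying more than 1.5 × IQR below the first quartile or above the third quartile, which is what a standard boxplot draws as individual points. The sketch below assumes this 1.5 factor and a hypothetical function name `iqr_outliers`; the slide itself does not state the threshold.

```python
import numpy as np

def iqr_outliers(values, factor=1.5):
    """Return a boolean mask marking values outside [Q1 - f*IQR, Q3 + f*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)
```

Flagged values should be inspected before removal, since an extreme measurement can be a technical artifact or a genuine biological signal.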

Page 41: Data preprocessing and unsupervised learning methods in Bioinformatics
Page 42: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 43: Data preprocessing and unsupervised learning methods in Bioinformatics

IF YOU TORTURE THE DATA LONG ENOUGH

IT WILL CONFESS TO ANYTHING

Ronald Coase, Economist, Nobel Prize winner

Page 44: Data preprocessing and unsupervised learning methods in Bioinformatics

What is cluster analysis?

Clustering is finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

Page 45: Data preprocessing and unsupervised learning methods in Bioinformatics

Properties

•  Classes/labels for each instance are derived only from the data

•  For that reason, cluster analysis is referred to as unsupervised classification

Page 46: Data preprocessing and unsupervised learning methods in Bioinformatics

Why cluster biological data?

•  Intuition building: finding hidden internal structure of the high-dimensional data
•  Hypothesis generation: finding and characterizing similar groups of objects in the data
•  Knowledge discovery in data, e.g. underlying rules, reoccurring patterns, topics, etc.
•  Summarizing / compressing large data
•  Data visualization

Page 47: Data preprocessing and unsupervised learning methods in Bioinformatics

Intuition building

cardiopulmonary/metabolic disorders

neurological diseases

sensory conditions

cerebral vascular accident

cancer

http://bmcgeriatr.biomedcentral.com/articles/10.1186/1471-2318-11-45

Comorbidity: presence of one or more additional diseases or disorders co-occurring with a primary disease or disorder

Page 49: Data preprocessing and unsupervised learning methods in Bioinformatics

Hypothesis generation

[Figure: clustered heatmap of gene expression. Columns: compounds SAHA, Trichostatin A, Valproic acid, Cyproconazole, PCB180, PCB170, CdCl2, As2O3, Triadimenol, Triadimefon, PCB153, Tubacin, MeHg (CH3HgCl), Rotenon, Pb-acetate, Mannitol, Thimerosal; one group of compounds is labelled HDACi. Rows: genes EGF, ILK, RHOC, ACTN1, BCAR1, ITGB3, ACTN4, MYH9, CAV1, HGF, MET, DPP4, MYLK, PLD1, ITGA4, ITGB1, ROCK1, MMP14, RHOB, MMP2, CAPN1, PTPN1, SRC, PLCG1, RAC2, MYH10, BAIAP2, STAT3, RND3, MMP9, RAC1, RHOA, SH3PXD2A, CSF1, DIAPH1. Color scale from -3 to 3: color coded scaled fold change (FC) vs control.]

Page 51: Data preprocessing and unsupervised learning methods in Bioinformatics

Knowledge discovery in data Ex. Underlying rules, reoccurring patterns, topics, etc.

Page 53: Data preprocessing and unsupervised learning methods in Bioinformatics

Summarizing/compressing the data

Page 54: Data preprocessing and unsupervised learning methods in Bioinformatics

Partitional vs Hierarchical

Partitional: each sample (point) is assigned to a unique cluster

Hierarchical: creates a nested, hierarchical set of partitions/clusters

Page 55: Data preprocessing and unsupervised learning methods in Bioinformatics

Fuzzy vs Non-Fuzzy

Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)

Non-fuzzy: each object belongs to exactly one cluster

Page 56: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering

Hierarchical clustering is usually depicted as a dendrogram (tree)

Page 57: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering

•  Each subtree corresponds to a cluster
•  Height of branching shows distance

Page 58: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (0)

Algorithm for Agglomerative Hierarchical Clustering: Join the two closest objects

Page 59: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (1)

Join the two closest objects

Page 60: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (2)

Keep joining the closest pairs

Page 61: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (3)

Keep joining the closest pairs

Page 62: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (4)

Keep joining the closest pairs

Page 63: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (5)

Keep joining the closest pairs

Page 64: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (10)

After 10 steps we have 4 clusters left

Page 65: Data preprocessing and unsupervised learning methods in Bioinformatics

Q: Which clusters do we merge next? Hierarchical clustering (10)

After 10 steps we have 4 clusters left

Page 66: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (10)

Several ways to measure distance between clusters:
•  Single linkage (MIN)

Page 67: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (10)

Several ways to measure distance between clusters:
•  Single linkage (MIN)
•  Complete linkage (MAX)

Page 68: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (10)

Several ways to measure distance between clusters:
•  Single linkage (MIN)
•  Complete linkage (MAX)
•  Average linkage
    •  Weighted
    •  Unweighted
    •  ...
•  Ward’s method
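The agglomerative procedure and the linkage choices can be sketched as follows. This is a deliberately naive illustration under assumptions: the function name `agglomerative` is hypothetical, Euclidean distance is used between objects, and only single, complete and (unweighted) average linkage are covered; Ward's method is not implemented here.

```python
import numpy as np

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all objects.
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(points))]  # start: one object per cluster
    # Cluster distance = MIN, MAX or mean of the pairwise object distances.
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # join the closest pair
        del clusters[b]
    return clusters
```

Recording the merge distance at every step would give the branch heights of the dendrogram; stopping at `n_clusters` corresponds to cutting the tree at one level.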

Page 69: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (11)

In this example and at this stage we have the same result as in partitional clustering

Page 70: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (12)

In the final step the two remaining clusters are joined into a single cluster

Page 71: Data preprocessing and unsupervised learning methods in Bioinformatics

Hierarchical clustering (13)

In the final step the two remaining clusters are joined into a single cluster

Page 72: Data preprocessing and unsupervised learning methods in Bioinformatics

Examples of Hierarchical Clustering in Bioinformatics

•  Phylogeny
•  Gene expression clustering

Page 73: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means clustering

•  Partitional, non-fuzzy
•  Partitions the data into K clusters
•  K is given by the user

Algorithm:
•  Choose K initial centers for the clusters
•  Assign each object to its closest center
•  Recalculate cluster centers
•  Repeat until the assignment converges

Page 74: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means (1)

Page 75: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means (2)

Page 76: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means (3)

Page 77: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means (4)

Page 78: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means (5) K-means (6)

Page 79: Data preprocessing and unsupervised learning methods in Bioinformatics

Elbow method Estimate the number of clusters

Page 80: Data preprocessing and unsupervised learning methods in Bioinformatics

K-means clustering summary

•  One of the fastest clustering algorithms
•  Therefore very widely used
•  Sensitive to the choice of initial centres
    •  many algorithms to choose initial centres cleverly
•  Assumes that the mean can be calculated
    •  can be used on vector data
    •  cannot be used on sequences (what is the mean of A and T?)

Page 81: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids clustering

•  The same as K-means, except that the center is required to be at an object

•  Medoid - an object which has minimal total distance to all other objects in its cluster

•  Can be used on more complex data, with any distance measure

•  Slower than K-means
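Because K-medoids only needs distances between objects, it can work from a precomputed distance matrix. The sketch below uses a simple alternating scheme (assign to nearest medoid, then re-pick each cluster's medoid); the function name `kmedoids` is hypothetical and this is not the full PAM swap algorithm.

```python
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """K-medoids on a square distance matrix D: returns (labels, medoid indices)."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each object to its closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                # Medoid: member with minimal total distance to its cluster.
                totals = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[totals.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids
```

Since D can hold any distance measure, the same function could be applied to sequences, for example with a matrix of pairwise edit distances.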

Page 82: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (1)

Page 83: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (2)

Page 84: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (3)

Page 85: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (4)

Page 86: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (5)

Page 87: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (6)

Page 88: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (7)

Page 89: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (8)

Page 90: Data preprocessing and unsupervised learning methods in Bioinformatics

K-medoids (9)

Page 91: Data preprocessing and unsupervised learning methods in Bioinformatics

Examples of K-means and K-medoids in Bioinformatics

•  Gene expression clustering
•  Sequence clustering

Page 92: Data preprocessing and unsupervised learning methods in Bioinformatics

Distance measures

Distance of vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$:

•  Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
•  Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
•  Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient

Distance of sequences ACCTTG and TACCTG:

•  Hamming distance => 3

    ACCTTG
    TACCTG

•  Levenshtein distance => 2

    .ACCTTG
    TACC.TG
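The distance measures above can be written out directly. The sketch below is a plain reference version with hypothetical function names; the Levenshtein distance uses the standard dynamic-programming recurrence with unit costs for insertion, deletion and substitution.

```python
import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def correlation_distance(x, y):
    # 1 - Pearson correlation coefficient r(x, y)
    return float(1.0 - np.corrcoef(x, y)[0, 1])

def hamming(s, t):
    # Number of positions at which two equal-length sequences differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    # Minimal number of insertions, deletions and substitutions turning s into t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution or match
        prev = cur
    return prev[-1]
```

For the slide's example, `hamming("ACCTTG", "TACCTG")` gives 3 and `levenshtein("ACCTTG", "TACCTG")` gives 2, matching the values shown above.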

Page 93: Data preprocessing and unsupervised learning methods in Bioinformatics
Page 94: Data preprocessing and unsupervised learning methods in Bioinformatics

Interpretation

Summarize/ plot raw data

Impute missing values

Normalize/ Standardize

Handle outliers

Data analysis

Import data validation

Page 95: Data preprocessing and unsupervised learning methods in Bioinformatics

Put it into words & Discover

https://mastrianascience.wikispaces.com/file/view/scientific_method_wordle.png/253466620/694x406/scientific_method_wordle.png

Page 96: Data preprocessing and unsupervised learning methods in Bioinformatics

Gene ontology

What the genes we found are doing:

•  Molecular Function - elemental activity or task
•  Biological Process - broad objective or goal
•  Cellular Component - location or complex

Page 97: Data preprocessing and unsupervised learning methods in Bioinformatics

Functional annotations & Significance

Hypergeometric test: the statistical significance of having drawn a sample consisting of a specific number of k successes out of n total draws from a population of size N containing K successes.
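For cluster annotation this is typically read as: N genes in total, K of them annotated with a given GO term, n genes in the cluster, k of those carrying the term. A minimal sketch of the one-sided enrichment p-value, P(X >= k), using only the standard library is shown below; the function name `hypergeom_pvalue` is hypothetical and the gene-set reading of the symbols is an assumption about how the slide's test is applied.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from a
    population of N items, K of which are successes."""
    total = comb(N, n)
    upper = min(n, K)
    # Sum the hypergeometric probabilities of k, k+1, ..., upper successes.
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total
```

When many GO terms are tested for the same cluster, the resulting p-values need a multiple-testing correction before they are interpreted.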

Page 98: Data preprocessing and unsupervised learning methods in Bioinformatics

Cluster annotation

GOsummaries https://www.bioconductor.org/packages/release/bioc/html/GOsummaries.html

Page 99: Data preprocessing and unsupervised learning methods in Bioinformatics

Practice time!