Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics...

Post on 04-Jan-2016

244 views 0 download

Tags:

Transcript of Eric Xing © Eric Xing @ CMU, 2006-2010 1 Machine Learning Application I: Computational Genomics...

Eric Xing

© Eric Xing @ CMU, 2006-2010 1

Machine LearningMachine Learning

Application I: Application I:

Computational GenomicsComputational Genomics

Eric XingEric Xing

Lecture 18, August 16, 2010

Eric Xing

© Eric Xing @ CMU, 2006-2010 2

Biological Data Analysis

Dynamic, noisy, heterogeneous, high-dimensional data

“High-resolution” inference Parsimonious Scalability Stability Sample complexity Confidence bound

Eric Xing

© Eric Xing @ CMU, 2006-2010 3

Healthy

Sick

ACTCGTACGTAGACCTAGCATTTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Single nucleotide polymorphism (SNP)

Causal (or "associated") SNP

Genetic Basis of Diseases

Eric Xing

© Eric Xing @ CMU, 2006-2010 4

A . . . . T . . G . . . . . C . . . . . . T . . . . . A . . G .

A . . . . A . . C . . . . . C . . . . . . T . . . . . A . . G .

T . . . . T . . G . . . . . G . . . . . . T . . . . . T . . C .

A . . . . A . . C . . . . . C . . . . . . T . . . . . T . . C .

A . . . . T . . G . . . . . G . . . . . . A . . . . . A . . G .

T . . . . T . . C . . . . . G . . . . . . A . . . . . A . . C .

A . . . . A . . C . . . . . C . . . . . . A . . . . . T . . C .

Data

Genotype Phenotype

• Cancer: Dunning et al. 2009.• Diabetes: Dupuis et al. 2010.• Atopic dermatitis: Esparza-Gordillo et al. 2009.• Arthritis: Suzuki et al. 2008

Standard Approach

causal SNPcausal SNP

a univariate a univariate phenotypephenotype::e.g., disease/controle.g., disease/control

Genetic Association Mapping

Eric Xing

© Eric Xing @ CMU, 2006-2010 5

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Genetic Basis of Complex Diseases

Eric Xing

© Eric Xing @ CMU, 2006-2010 6

Genetic Basis of Complex Diseases

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Biological mechanism

Eric Xing

© Eric Xing @ CMU, 2006-2010 7

Intermediate Phenotype

Genetic Basis of Complex Diseases

Healthy

Cancer

ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA

ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA

TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA

ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA

ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA

TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA

ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA

Causal SNPs

Clinical records

Gene expression

Association to intermediate phenotypes

Eric Xing

© Eric Xing @ CMU, 2006-2010 8

Association with Phenome

ACGTTTTACTGTACAATT

Multivariate complex syndrome (e.g., asthma)Multivariate complex syndrome (e.g., asthma)age at onset, history of eczemaage at onset, history of eczema

genome-wide expression profilegenome-wide expression profile

ACGTTTTACTGTACAATT

Traditional Approachcausal SNPcausal SNP

a univariate phenotype:a univariate phenotype:gene expression levelgene expression level

Structured Association

Eric Xing

© Eric Xing @ CMU, 2006-2010 9

Considerone phenotype at a time

Consider multiple correlated phenotypes (phenome) jointly

vs.

Standard Approach New Approach

Phenotypes Phenome

Structured Association : a New Paradigm

Eric Xing

© Eric Xing @ CMU, 2006-2010 10

Subnetworks for lung physiology Subnetwork for

quality of life

Genetic Association for Asthma Clinical Traits

TCGACGTTTTACTGTACAATT

Statistical challengeHow to find

associations to a multivariate entity in a graph?

Eric Xing

© Eric Xing @ CMU, 2006-2010 11

Gene Expression

Trait Analysis

TCGACGTTTTACTGTACAATT

Genes

Samples

Statistical challengeHow to find

associations to a multivariate entity in a tree?

Eric Xing

© Eric Xing @ CMU, 2006-2010 12

Sparse Associations

Pleotropic effectsPleotropic effects

Epistatic effectsEpistatic effects

Eric Xing

© Eric Xing @ CMU, 2006-2010 13

Sparse Learning

Linear Model:

Lasso (Sparse Linear Regression)

Why sparse solution?

penalizing

constraining

[R.Tibshirani 96]

Eric Xing

© Eric Xing @ CMU, 2006-2010 14

Multi-Task Extension Multi-Task Linear Model:

Coefficients for k-th task:

Coefficient Matrix:

Input:

Output:

Coefficients for a variable (2nd)

Coefficients for a task (2nd)

Eric Xing

© Eric Xing @ CMU, 2006-2010 15

Background: Sparse multivariate regression for disease association studies

Structured association – a new paradigm Association to a graph-structured phenome

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Association to a tree-structured phenome Tree-guided group lasso (Kim & Xing, ICML 2010)

Outline

Eric Xing

© Eric Xing @ CMU, 2006-2010 16

T G

A A

C C

A T

G A

A G

T A

GenotypeTrait

Multivariate Regression for Single-Trait Association Analysis

Xy

2.1 x=

x= β

Association Strength

?

Eric Xing

© Eric Xing @ CMU, 2006-2010 17

T G

A A

C C

A T

G A

A G

T A

GenotypeTrait

Multivariate Regression for Single-Trait Association Analysis

Many non-zero associations: Which SNPs are truly significant?

2.1 x=

Association Strength

Eric Xing

© Eric Xing @ CMU, 2006-2010 18

Genotype

x=2.1

Trait

Lasso for Reducing False Positives (Tibshirani, 1996)

Many zero associations (sparse results), but what if there are multiple related traits?

+

J

j 1

|βj |

T G

A A

C C

A T

G A

A G

T A

Lasso Penalty for sparsity

Association Strength

Eric Xing

© Eric Xing @ CMU, 2006-2010 19

Genotype

(3.4, 1.5, 2.1, 0.9, 1.8)

Multivariate Regression for Multiple-Trait Association Analysis

T G

A A

C C

A T

G A

A G

T A

How to combine information across multiple traits to increase the power?

Association Strength

x=

Association strength between SNP j and Trait i: βj,i

Allerg

y fo

r cat

s

Allerg

y fo

r roa

ches

Allerg

y in

spr

ing

FEVFEF

Allergy Lung physiology

LD

Syntheti

c lethal

+ ji,

|βj,i|

Eric Xing

© Eric Xing @ CMU, 2006-2010 20

Genotype

(3.4, 1.5, 2.1, 0.9, 1.8)

Trait

Multivariate Regression for Multiple-Trait Association Analysis

T G

A A

C C

A T

G A

A G

T A

Allergy Lung physiology

Association Strength

x=

+We introduce

graph-guided fusion penalty

Association strength between SNP j and Trait i: βj,i

+ ji,

|βj,i|

Eric Xing

© Eric Xing @ CMU, 2006-2010 21

Multiple-trait Association:Graph-Constrained Fused Lasso

Step 1: Thresholded correlation graph of phenotypes

ACGTTTTACTGTACAATT

Step 2: Graph-constrained fused lasso

Lasso Penalty

Graph-constrained fusion penalty

Fusion

Eric Xing

© Eric Xing @ CMU, 2006-2010 22

Fusion Penalty

Fusion Penalty: | βjk - βjm |

For two correlated traits (connected in the network), the association strengths may have similar values.

ACGTTTTACTGTACAATT

SNP j

Trait m

Trait k

Association strength between SNP j and Trait k: βjk

Association strength between SNP j and Trait m: βjm

Eric Xing

© Eric Xing @ CMU, 2006-2010 23

ACGTTTTACTGTACAATT

Overall effect

Graph-Constrained Fused Lasso

Fusion effect propagates to the entire network Association between SNPs and subnetworks of traits

Eric Xing

© Eric Xing @ CMU, 2006-2010 24

ACGTTTTACTGTACAATT

Multiple-trait Association: Graph-Weighted Fused Lasso

Subnetwork structure is embedded as a densely connected nodes with large edge weights

Edges with small weights are effectively ignored

Overall effect

Eric Xing

© Eric Xing @ CMU, 2006-2010 25

Estimating Parameters

Quadratic programming formulation Graph-constrained fused lasso

Graph-weighted fused lasso

Many publicly available software packages for solving convex optimization problems can be used

Eric Xing

© Eric Xing @ CMU, 2006-2010 26

Improving Scalability

Iterative optimization• Update βk

• Update djk’s, djml’s

Original problem

Equivalently

Using a variational formulation

Eric Xing

© Eric Xing @ CMU, 2006-2010 27

Previous Works vs. Our Approach

Previous approach Our approach

PCA-based approach (Weller et al., 1996, Mangin et al., 1998)

Implicit representation of trait correlations

Hard to interpret the derived traits

Explicit representation of trait correlations

Extension of module network for eQTL study (Lee et al., 2009)

Average traits within each trait cluster

Loss of information

Original data for traits are used

Network-based approach (Chen et al., 2008, Emilsson et al., 2008)

Separate association analysis for each trait (no information sharing)

Single-trait association are combined in light of trait network modules

Joint association analysis of multiple traits

Eric Xing

© Eric Xing @ CMU, 2006-2010 28

• 50 SNPs taken from HapMap chromosome 7, CEU population

• 10 traits

Trait Correlation Matrix

True Regression Coefficients

Single SNP-Single Trait Test

Significant at α = 0.01

LassoGraph-guided Fused Lasso

Thresholded Trait Correlation Network

Simulation Results

Phenotypes

SN

Ps

No association

High association

Eric Xing

© Eric Xing @ CMU, 2006-2010 29

Simulation Results

Eric Xing

© Eric Xing @ CMU, 2006-2010 30

Asthma Association Study

543 severe asthma patients from the Severe Asthma Research Program (SARP)

Genotypes : 34 SNPs in IL4R gene 40kb region of chromosome 16 Impute missing genotypes with PHASE (Li and Stephens, 2003)

Traits : 53 asthma-related clinical traits Quality of Life: emotion, environment, activity, symptom Family history: number of siblings with allergy, does the father has asthma? Asthma symptoms: chest tightness, wheeziness

Eric Xing

© Eric Xing @ CMU, 2006-2010 31

Asthma and IL4R Gene

T cell

In normal individuals: allergen ignored by the immune system

Allergen (ragweed, grass, etc.)

Eric Xing

© Eric Xing @ CMU, 2006-2010 32

Asthma and IL4R Gene

T cell

In asthma patientsAllergen (ragweed, grass, etc.)

Th2 cell B cell

ImmunoglobulinE (IgE) antibody

Inflammation!

• Airway in lung narrows• Difficult to breathe• Asthma symptoms

Interleukin-4 Receptor regulates IgE antibody production

Eric Xing

© Eric Xing @ CMU, 2006-2010 33

IL4R Gene

Chr16: 27,325,251-27,376,098

IL4R

SNPs in intron

SNPs in exon

SNPs in promotor region

exonsintrons

Eric Xing

© Eric Xing @ CMU, 2006-2010 34

Asthma Trait Network

Subnetwork for quality of life

Subnetwork for lung physiology

Phenotype Correlation Network

Subnetwork for Asthma symptoms

Eric Xing

© Eric Xing @ CMU, 2006-2010 35

Results from Single-SNP/Trait Test

Single-Marker Single-Trait Test

SN

Ps

Permutation test α = 0.05

Permutation test α = 0.01

No association

High association

Trait Network

Phenotypes

Phe

noty

pes

Lung physiology-related traits I• Baseline FEV1 predicted value: MPVLung • Pre FEF 25-75 predicted value • Average nitric oxide value: online • Body Mass Index • Postbronchodilation FEV1, liters: Spirometry • Baseline FEV1 % predicted: Spirometry • Baseline predrug FEV1, % predicted • Baseline predrug FEV1, % predicted

Q551R SNP• Codes for amino-acid changes in the intracellular signaling portion of the receptor• Exon 11

Eric Xing

© Eric Xing @ CMU, 2006-2010 36

Trait Network

Lasso Graph-guided Fused Lasso

Comparison of Gflasso with Others

Single-Marker Single-Trait Test

SN

Ps

Phenotypes

Phe

noty

pes

?

Lung physiology-related traits I• Baseline FEV1 predicted value: MPVLung • Pre FEF 25-75 predicted value • Average nitric oxide value: online • Body Mass Index • Postbronchodilation FEV1, liters: Spirometry • Baseline FEV1 % predicted: Spirometry • Baseline predrug FEV1, % predicted • Baseline predrug FEV1, % predicted

Q551R SNP• Codes for amino-acid changes in the intracellular signaling portion of the receptor• Exon 11

No association

High association

Eric Xing

© Eric Xing @ CMU, 2006-2010 37

Linkage Disequilibrium Structure in IL-4R gene

SNP Q551R

SNP rs3024660

SNP rs3024622

r2 =0.64

r2 =0.07

Eric Xing

© Eric Xing @ CMU, 2006-2010 38

IL4R Gene

Chr16: 27,325,251-27,376,098

IL4R

SNPs in intron

SNPs in exon

SNPs in promotor region

exonsintrons

Q551R

rs3024660

rs3024622

Eric Xing

© Eric Xing @ CMU, 2006-2010 39

Association to a Tree-structured

Phenome

TCGACGTTTTACTGTACAATT

Eric Xing

© Eric Xing @ CMU, 2006-2010 40

TCGACGTTTTACTGTACAATT

Tree-guided Group Lasso

Why tree? Tree represents a clustering structure

Scalability to a very large number of phenotypes Graph : O(|V|2) edges Tree : O(|V|) edges

Expression quantitative trait mapping (eQTL) Agglomerative hierarchical clustering is a popular tool

Eric Xing

© Eric Xing @ CMU, 2006-2010 41

Tree-Guided Group Lasso

In a simple case of two genes

• Low height• Tight correlation• Joint selection

• Large height• Weak correlation• Separate selection

h

h

Gen

otyp

es

Gen

otyp

es

Eric Xing

© Eric Xing @ CMU, 2006-2010 42

L1 penalty• Lasso penalty• Separate selection

L2 penalty • Group lasso• Joint selection

h

Elastic net

Select the child nodes jointly or separately?

Tree-Guided Group Lasso

In a simple case of two genes

Tree-guided group lasso

Eric Xing

© Eric Xing @ CMU, 2006-2010 43

h2

h1

Select the child nodes jointly or separately?

Joint selection

Separate selection

Tree-Guided Group Lasso

For a general tree

Tree-guided group lasso

Eric Xing

© Eric Xing @ CMU, 2006-2010 44

h2

h1

Select the child nodes jointly or separately?

Tree-Guided Group Lasso

For a general tree

Tree-guided group lasso

Joint selection

Separate selection

Eric Xing

© Eric Xing @ CMU, 2006-2010 45

Previously, in Jenatton, Audibert & Bach, 2009

Balanced Shrinkage

Eric Xing

© Eric Xing @ CMU, 2006-2010 46

Estimating Parameters

Second-order cone program

Many publicly available software packages for solving convex optimization problems can be used

Also, variational formulation

Eric Xing

© Eric Xing @ CMU, 2006-2010 47

Illustration with Simulated Data

True association strengths Lasso

Tree-guided group lasso

SN

Ps

Phenotypes

No association

High association

Eric Xing

© Eric Xing @ CMU, 2006-2010 48

Yeast eQTL Analysis

Tree-guided group lasso

Single-Marker Single-Trait Test

SN

Ps

Phenotypes

No association

High association

Hierarchical clustering tree

Eric Xing

© Eric Xing @ CMU, 2006-2010 49

Structured Association

Phenome Structure

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Graph

Tree-guided fused lasso (Kim & Xing, Submitted)

Tree

Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted)

Dynamic Trait

Genome Structure

Stochastic block regression (Kim & Xing, UAI, 2008)

Linkage Disequilibrium

Multi-population group lasso (Puniyani, Kim, Xing, Submitted)

Population Structure

EpistasisACGTTTTACTGTACAATT

Group lasso with networks(Lee, Kim, Xing, Submitted)

Eric Xing

© Eric Xing @ CMU, 2006-2010 50

Structured Association

Phenome Structure

Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)

Graph

Tree-guided fused lasso (Kim & Xing, Submitted)

Tree

Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted)

Dynamic Trait

Genome Structure

Stochastic block regression (Kim & Xing, UAI, 2008)

Linkage Disequilibrium

Multi-population group lasso (Puniyani, Kim, Xing, Submitted)

Population Structure

EpistasisACGTTTTACTGTACAATT

Group lasso with networks(Lee, Kim, Xing, Submitted)

Eric Xing

© Eric Xing @ CMU, 2006-2010 51

Computation Time

Eric Xing

© Eric Xing @ CMU, 2006-2010 52

Proximal Gradient Descent

Original Problem:

Approximation Problem:

Gradient of theApproximation:

Eric Xing

© Eric Xing @ CMU, 2006-2010 53

Geometric Interpretation

Smooth approximation

Uppermost LineNonsmooth

Uppermost LineSmooth

Eric Xing

© Eric Xing @ CMU, 2006-2010 54

Convergence Rate

Theorem: If we require and set , the

number of iterations is upper bounded by:

Remarks: state of the art IPM method for for SOCP converges at a rate

Eric Xing

© Eric Xing @ CMU, 2006-2010 55

Multi-Task Time Complexity

Pre-compute:

Per-iteration Complexity (computing gradient)

Tree:

Graph:

IPM for SOCP

Proximal-Gradient

IPM for SOCP

Proximal-Gradient

Proximal-Gradient: Independent of Sample Size Linear in #.of Tasks

Eric Xing

© Eric Xing @ CMU, 2006-2010 56

Experiments

Multi-task Graph Structured Sparse Learning (GFlasso)

Eric Xing

© Eric Xing @ CMU, 2006-2010 57

Experiments

Multi-task Tree-Structured Sparse Learning (TreeLasso)

Eric Xing

© Eric Xing @ CMU, 2006-2010 58

Conclusions

Novel statistical methods for joint association analysis to correlated phenotypes Graph-structured phenome : graph-guided fused lasso Tree-structured phenome : tree-guided group lasso

Advantages Greater power to detect weak association signals Fewer false positives Joint association to multiple correlated phenotypes

Other structures In phenotypes: dynamic trait In genotypes: linkage disequilibrium, population structure, epistasis

58

Eric Xing

© Eric Xing @ CMU, 2006-2010 59

Reference• Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B

58:267–288.• Weller J, Wiggans G, Vanraden P, Ron M (1996) Application of a canonical transformation to detection of quantitative trait

loci with the aid of genetic markers in a multi-trait experiment. Theoretical and Applied Genetics 92:998–1002.• Mangin B, Thoquet B, Grimsley N (1998) Pleiotropic QTL analysis. Biometrics 54:89–99.• Chen Y, Zhu J, Lum P, Yang X, Pinto S, et al. (2008) Variations in DNA elucidate molecular networks that cause disease.

Nature 452:429–35.• Lee SI, Dudley A, Drubin D, Silver P, Krogan N, et al. (2009) Learning a prior on regulatory potential from eQTL data.

PLoS Genetics 5:e1000358.• Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of gene expression and its effect on

disease. Nature 452:423–28.• Kim S, Xing EP (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS

Genetics 5(8): e1000587.• Kim S, Xing EP (2008) Sparse feature learning in high-dimensional space via block regularized regression. In

Proceedings of the 24th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 325-332. AUAI Press.

• Kim S, Xing EP (2010) Exploiting a hierarchical clustering tree of gene-expression traits in eQTL analysis. Submitted.• Kim S, Howrylak J, Xing EP (2010) Dynamic-trait association analysis via temporally-smoothed lasso. Submitted.• Puniyani K, Kim S, Xing EP (2010) Multi-population GWA mapping via multi-task regularized regression. Submitted.• Lee S, Kim S, Xing EP (2010) Leveraging genetic interaction networks and regulatory pathways for joint mapping of

epistatic and marginal eQTLs. Submitted.• Dunning AM et al. (2009) Association of ESR1 gene tagging SNPs with breast cancer risk. Hum Mol Genet. 18(6):1131-9. • Esparza-Gordillo J et al. (2009) A common variant on chromosome 11q13 is associated with atopic dermatitis. Nature

Genetics 41:596-601.• Suzuki A et al. (2008) Functional SNPs in CD244 increase the risk of rheumatoid arthritis in a Japanese population.

Nature Genetics 40:1224-1229.• Dupuis J et al. (2010) New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk.

Nature Genetics 42:105-116.59