Incorporating linkage disequilibrium blocks in Genome-Wide Association Studies
Alia Dehman, Christophe Ambroise, Pierre Neuvial
Laboratoire Statistique et Génome
July 2nd, 2013
ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 1 / 22
Outline
1 Genome-Wide Association Studies
    The regression model
    Sparsity and high-dimension contexts
    Biological context: LD
2 Taking the group structure into account
    Classical approach
    A Two-Step Approach
    Competing methods
3 Results
    True number of clusters
    Misspecified number of clusters
4 Current works
Genome-Wide Association Studies
The regression model
Goal: to identify genetic markers that are significantly associated with a phenotype of interest.
Phenotypic trait: qualitative or quantitative
Genetic markers: Single Nucleotide Polymorphisms (SNPs)
$$ Y_i = \beta_0 + \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \dots, n $$
- n: number of individuals
- p: number of covariates
- Y_i: response for individual i
- X_{.j}: observations for covariate j (coded as 0, 1 or 2)
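As an illustration of the model above, here is a minimal Python sketch that simulates genotypes coded 0, 1 or 2 and a quantitative phenotype. The function name simulate_gwas, the minor allele frequency maf, the noise level noise_sd and the independence between SNPs are assumptions made for the example; they do not come from the slides.

```python
import random

def simulate_gwas(n, p, causal, beta0=0.0, maf=0.3, noise_sd=1.0, seed=0):
    """Draw genotypes (0/1/2) and a phenotype from Y = beta0 + X beta + eps.

    causal maps a SNP index to its (hypothetical) effect size; all other
    coefficients are zero, which gives a sparse beta.
    """
    rng = random.Random(seed)
    # A genotype is the number of minor alleles carried on two chromosomes.
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)]
         for _ in range(n)]
    beta = [0.0] * p
    for j, effect in causal.items():
        beta[j] = effect
    # Y_i = beta_0 + sum_j X_ij * beta_j + eps_i
    Y = [beta0 + sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, Y, beta

# Two causal SNPs among p = 512, observed on n = 200 individuals.
X, Y, beta = simulate_gwas(n=200, p=512, causal={0: 1.0, 1: -0.5})
```

Only two of the 512 coefficients are non-zero here, and p is larger than n.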
Sparsity and high-dimension contexts
Sparsity: only a subset of SNPs is significantly associated with the phenotype.
$$ \mathrm{Card}\{j : \beta_j \neq 0\} \ll p $$
High dimension: many thousands of markers versus a few hundred observations.
$$ p \gg n $$
The LD measures
Linkage Disequilibrium (or Gametic Disequilibrium): the non-random association of alleles at two or more loci. Its amount depends on the difference between the observed allelic frequencies and those expected under a homogeneous, randomly distributed model.
Z_j: indicator of the presence of the minor allele for SNP j, with Z_j ∼ B(p_j).
$$ D(j, k) = \mathrm{cov}(Z_j, Z_k) $$
$$ r^2(j, k) = \mathrm{corr}(Z_j, Z_k)^2 $$
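The two measures can be computed directly from allele indicators when haplotypes are known. The sketch below is a minimal Python illustration under that assumption; the function name ld_measures and the toy inputs are hypothetical.

```python
def ld_measures(zj, zk):
    """Return (D, r2) for two 0/1 minor-allele indicator sequences."""
    n = len(zj)
    pj = sum(zj) / n                       # frequency of the minor allele at SNP j
    pk = sum(zk) / n                       # frequency of the minor allele at SNP k
    pjk = sum(a * b for a, b in zip(zj, zk)) / n
    d = pjk - pj * pk                      # D = cov(Z_j, Z_k)
    var_j = pj * (1 - pj)                  # variance of a Bernoulli(p_j)
    var_k = pk * (1 - pk)
    if var_j == 0 or var_k == 0:
        return d, float("nan")             # r2 is undefined for a monomorphic SNP
    return d, d * d / (var_j * var_k)      # r2 = squared correlation

# Identical indicators: complete LD.
d_same, r2_same = ld_measures([1, 1, 0, 0], [1, 1, 0, 0])
# Indicators that vary independently: no LD.
d_indep, r2_indep = ld_measures([1, 1, 0, 0], [1, 0, 1, 0])
```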
How to estimate it ?
Genotype table (observed counts):

    snp   vv   vV   VV
    uu    a    b    c
    uU    d    e    f
    UU    g    h    i

Haplotype table (unobserved counts):

    snp   v    V
    u     α    β
    U     γ    δ

Only the genotype data table is observed.
α, β, γ, δ are estimated by solving a system of equations, e.g. α = 2a + b + d + p e,
with p the « probability » of the haplotype pair (uv, UV) for the double heterozygotes (cell e).
⇒ estimate p, then (α, β, γ, δ), and finally D = p_{UV} − p_U p_V.
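One common way to carry out this estimation is an EM-type iteration that alternates between the haplotype counts and p. The Python sketch below assumes that scheme; the slides do not say which solver was used, and the name estimate_ld_em, the starting value 0.5 and the fixed iteration count are choices made for the example.

```python
def estimate_ld_em(table, n_iter=100):
    """Estimate haplotype counts and D from a 3 x 3 genotype count table.

    Rows of table are (uu, uU, UU), columns are (vv, vV, VV).
    Returns (p, (alpha, beta, gamma, delta), D). The table must be non-empty.
    """
    (a, b, c), (d, e, f), (g, h, i) = table
    n_haplotypes = 2.0 * (a + b + c + d + e + f + g + h + i)

    def haplotype_counts(p):
        alpha = 2 * a + b + d + p * e          # uv
        beta = 2 * c + b + f + (1 - p) * e     # uV
        gamma = 2 * g + d + h + (1 - p) * e    # Uv
        delta = 2 * i + f + h + p * e          # UV
        return alpha, beta, gamma, delta

    p = 0.5  # initial guess for the phase of the double heterozygotes
    for _ in range(n_iter):
        alpha, beta, gamma, delta = haplotype_counts(p)
        cis, trans = alpha * delta, beta * gamma
        if cis + trans > 0:
            # probability that a double heterozygote carries (uv, UV)
            p = cis / (cis + trans)
    alpha, beta, gamma, delta = haplotype_counts(p)
    p_UV = delta / n_haplotypes
    p_U = (gamma + delta) / n_haplotypes
    p_V = (beta + delta) / n_haplotypes
    return p, (alpha, beta, gamma, delta), p_UV - p_U * p_V

# Toy table with only uu/vv and UU/VV individuals.
p_hat, counts, D_hat = estimate_ld_em([[10, 0, 0], [0, 0, 0], [0, 0, 10]])
```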
The LD-block structure
Figure: the r² coefficients among the first 50 SNPs of chromosome 22 (Dalmasso et al. 2008), displayed as a 50 x 50 matrix (rows and columns: SNP index; color scale: 0 to 1). LD is structured in blocks.
Taking the group structure into account
Classical approach : tag-SNP
To deal with high-dimensional problems and dependence among SNPs:
- based on LD
- selection of a « representative » SNP for each LD block: tagging
Loss of information
Loss of power: the tag-SNP is not necessarily the causal SNP.
A different approach: block selection
A Two-Step Approach
Inference of blocks
- Only the genotype data X are used.
- A p × p matrix of pairwise LD measures is calculated.
- Ward Constrained Hierarchical Clustering (R package rioja)
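The constraint means that only neighbouring clusters along the genome may be merged, so every cluster is a block of consecutive SNPs. The Python sketch below illustrates that idea with the Euclidean Ward criterion applied to genotype columns. This is a simplifying assumption: the approach described here works from LD measures, and the function name constrained_ward is hypothetical.

```python
def constrained_ward(columns, n_clusters):
    """Adjacency-constrained Ward clustering of ordered items.

    columns: list of p vectors, one per SNP, in genomic order.
    Returns a list of blocks, each a list of consecutive SNP indices.
    """
    blocks = [[j] for j in range(len(columns))]
    sums = [list(col) for col in columns]      # per-block coordinate sums

    def merge_cost(k):
        # Ward criterion for merging block k with its right neighbour k + 1
        na, nb = len(blocks[k]), len(blocks[k + 1])
        sq = sum((sa / na - sb / nb) ** 2
                 for sa, sb in zip(sums[k], sums[k + 1]))
        return na * nb / (na + nb) * sq

    while len(blocks) > n_clusters:
        # Only adjacent pairs are candidates: this is the constraint.
        k = min(range(len(blocks) - 1), key=merge_cost)
        blocks[k] = blocks[k] + blocks.pop(k + 1)
        right = sums.pop(k + 1)
        sums[k] = [s + r for s, r in zip(sums[k], right)]
    return blocks

# Five toy "SNP" columns forming three obvious consecutive blocks.
toy = [[0, 0], [0, 0.1], [5, 5], [5, 5.1], [10, 10]]
toy_blocks = constrained_ward(toy, n_clusters=3)
```

Because only the p − 1 adjacent pairs are examined at each step, the result is always a partition of the SNPs into contiguous intervals.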
Selection of blocks associated with phenotype
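The Group Lasso: well adapted to group-structured variables
$$ \hat{\beta}^{\lambda} = \arg\min_{\beta} \sum_{i} (y_i - X_{i.}\beta)^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2 $$
with p_g the size of group g and β_g the coefficients of the SNPs in group g.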
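The group penalty shrinks all coefficients of a block together, so a block is either kept or dropped as a whole. The Python sketch below shows the penalty and the block-wise soft-thresholding operator used by proximal solvers. Proximal methods are one standard way to solve this problem and are assumed here for illustration; the function names are hypothetical.

```python
import math

def group_lasso_penalty(beta, groups, lam):
    """lam * sum_g sqrt(p_g) * ||beta_g||_2 for a list of index groups."""
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )

def group_soft_threshold(v, threshold):
    """Proximal operator of threshold * ||.||_2 applied to one group.

    The whole group is set to zero when its norm is below the threshold;
    otherwise every coordinate is shrunk by the same factor.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= threshold:
        return [0.0] * len(v)
    scale = 1.0 - threshold / norm
    return [scale * x for x in v]

penalty = group_lasso_penalty([3.0, 4.0, 0.0, 0.0], [[0, 1], [2, 3]], lam=1.0)
kept = group_soft_threshold([3.0, 4.0], 2.5)      # norm 5 > 2.5: shrunk
dropped = group_soft_threshold([0.3, 0.4], 1.0)   # norm 0.5 <= 1: zeroed
```

In a solver, the threshold passed for group g would be proportional to λ √p_g times the step size.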
Competing methods
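Lasso
$$ \hat{\beta}^{\ell_1} = \arg\min_{\beta} \sum_{i} (y_i - X_{i.}\beta)^2 + \lambda \|\beta\|_1 $$
Elastic-Net
$$ \hat{\beta}^{EN} = \arg\min_{\beta} \sum_{i} (y_i - X_{i.}\beta)^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 $$
with λ, λ1 and λ2 three regularization parameters (R package quadrupen).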
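Both criteria can be minimized by cyclic coordinate descent, where each coefficient is updated by soft-thresholding. The Python sketch below assumes that solver for illustration only; it is not meant to describe the algorithm used in quadrupen. It uses the same scaling as the formulas above (no factor 1/2, no intercept), and the function names are hypothetical. Setting lam2 = 0 gives the Lasso.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, lam1, lam2=0.0, n_sweeps=200):
    """Minimize sum_i (y_i - X_i. beta)^2 + lam1 ||beta||_1 + lam2 ||beta||_2^2."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                       # y - X beta, with beta = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the residual that ignores SNP j
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            denom = col_sq[j] + lam2
            new = soft_threshold(rho, lam1 / 2.0) / denom if denom > 0 else 0.0
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Toy orthogonal design: the solution is available in closed form.
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [3.0, 0.2]
beta_lasso = elastic_net_cd(X_toy, y_toy, lam1=1.0)
beta_enet = elastic_net_cd(X_toy, y_toy, lam1=1.0, lam2=1.0)
```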
Evaluation
Parameters
- n = 200, p = 512, K = 9 groups of sizes (2, 2, 4, 8, 16, 32, 64, 128, 256).
- The first 2 SNPs of the groups of sizes 2, 2, 4 and 8 are associated with the phenotype.
- cov(X_{.j}, X_{.j'}) = ρ 1_{j=j'}.
- Coefficient of determination: R² = 0.2.
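The block layout and the list of associated SNPs follow directly from these parameters. The Python sketch below builds them, and also shows how a noise variance can be chosen to reach a target R². The relation used for that last step (R² as the share of the phenotype variance explained by the signal) and the function names are assumptions made for the example.

```python
def block_layout(sizes):
    """Return consecutive index groups with the given sizes."""
    groups, start = [], 0
    for size in sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def noise_variance(signal_variance, r2):
    """Noise variance such that signal / (signal + noise) equals r2."""
    return signal_variance * (1.0 - r2) / r2

sizes = [2, 2, 4, 8, 16, 32, 64, 128, 256]
groups = block_layout(sizes)
# First 2 SNPs of the groups of sizes 2, 2, 4 and 8 (the first four groups).
causal = [j for g in groups[:4] for j in g[:2]]
sigma2 = noise_variance(signal_variance=1.0, r2=0.2)
```

The nine group sizes add up to p = 512, and eight SNPs spread over four blocks carry the signal.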
Definition of associated SNPs
SNP-level / Block-level
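The two evaluation levels can be illustrated with a short Python sketch that computes one (FPR, TPR) point for a given selection. The definitions used below are assumptions, since the slide's illustration is not available in this transcript: at SNP level only the causal SNPs count as positives, and at block level every SNP of a block containing a causal SNP counts as positive. The function names are hypothetical.

```python
def tpr_fpr(selected, truth, p):
    """True and false positive rates of a selection among p SNPs.

    Assumes truth is neither empty nor the full set of SNPs.
    """
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    fp = len(selected - truth)
    return tp / len(truth), fp / (p - len(truth))

def block_level_truth(causal, groups):
    """All SNPs belonging to a block that contains at least one causal SNP."""
    causal = set(causal)
    return [j for g in groups if causal & set(g) for j in g]

toy_groups = [[0, 1], [2, 3], [4, 5, 6, 7]]
toy_causal = [0]
toy_selected = [0, 2]
snp_level = tpr_fpr(toy_selected, toy_causal, p=8)
block_level = tpr_fpr(toy_selected, block_level_truth(toy_causal, toy_groups), p=8)
```

Repeating this along the regularization path, from the sparsest to the densest model, traces a curve of TPR against FPR for each method.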
Results
True number of clusters
Figure: ROC curves (TPR against FPR) for groupLasso, lasso and elasticNet2, at block level and at SNP level, for ρ = 0, ρ = 0.1 and ρ = 0.2. The number of clusters is set to 9.
Misspecified number of clusters : too few
Figure: ROC curves (TPR against FPR) for groupLasso, lasso and elasticNet2, at block level and at SNP level, for ρ = 0.1 and ρ = 0.2. The number of clusters is set to 5.
Misspecified number of clusters : too many
Figure: ROC curves (TPR against FPR) for groupLasso, lasso and elasticNet2, at block level and at SNP level, for ρ = 0.1 and ρ = 0.2. The number of clusters is set to 13.
Current works
Memory requirement of the clustering
Ward Constrained Hierarchical Clustering
$$ d(A, B) = \frac{n_A n_B}{n_A + n_B} \left( \frac{1}{n_A^2} S_{A,A} + \frac{1}{n_B^2} S_{B,B} - \frac{2}{n_A n_B} S_{A,B} \right) $$

                     rioja                          cWard
Type of entry        p × p dissimilarity matrix     the n × p design matrix
Time complexity      O(np²)                         O(np²)
Memory complexity    O(p²)                          O(np)
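The distance above can be evaluated from pairwise similarities alone. The Python sketch below assumes that S_{A,B} denotes the sum of the similarities s_ij over i in A and j in B, which the slide does not state explicitly. It also compares the storage of a dense p × p matrix with that of the n × p design matrix, assuming 8 bytes per entry. All function names are hypothetical.

```python
def ward_distance(S, A, B):
    """Ward distance between clusters A and B from a similarity matrix S."""
    nA, nB = len(A), len(B)
    sAA = sum(S[i][j] for i in A for j in A)
    sBB = sum(S[i][j] for i in B for j in B)
    sAB = sum(S[i][j] for i in A for j in B)
    return nA * nB / (nA + nB) * (
        sAA / nA ** 2 + sBB / nB ** 2 - 2.0 * sAB / (nA * nB)
    )

def dense_bytes(rows, cols, bytes_per_entry=8):
    """Storage of a dense rows x cols matrix."""
    return rows * cols * bytes_per_entry

# Toy check with inner products as similarities: columns (1,0), (0,1), (1,1).
S_toy = [[1, 0, 1],
         [0, 1, 1],
         [1, 1, 2]]
d_toy = ward_distance(S_toy, [0, 1], [2])

# Storage for the simulation setting n = 200, p = 512.
bytes_pp = dense_bytes(512, 512)   # p x p matrix
bytes_np = dense_bytes(200, 512)   # n x p design matrix
```

With inner products as similarities, the formula reduces to the usual Ward criterion n_A n_B / (n_A + n_B) times the squared distance between the cluster means, which is what the toy check relies on.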
Automatic model selection
Inferring the number of clusters:
- maximal gap (Bühlmann et al., 2012, arXiv:1209.5908v1)
- BIC criterion
- Gap Statistic (Tibshirani et al., 2001, JRSSB)
Tree-Group Lasso
$$ \hat{\beta}^{Tree} = \arg\min_{\beta} \sum_{i} (y_i - X_{i.}\beta)^2 + \lambda \sum_{h=0}^{d} \sum_{g=1}^{G_h} \omega_g^h \|\beta_g^h\|_2 $$
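The tree penalty sums a group penalty over every level h of the clustering hierarchy, so nested groups are penalized jointly. The Python sketch below evaluates that penalty for a given coefficient vector. The default weight, the square root of the group size, is an assumption made for the example, since the slide does not specify ω; the function name is hypothetical.

```python
import math

def tree_group_penalty(beta, levels, lam,
                       weight=lambda h, g: math.sqrt(len(g))):
    """lam * sum over levels h and groups g of weight(h, g) * ||beta_g||_2.

    levels[h] is the partition of the SNP indices at depth h of the tree.
    """
    total = 0.0
    for h, groups in enumerate(levels):
        for g in groups:
            total += weight(h, g) * math.sqrt(sum(beta[j] ** 2 for j in g))
    return lam * total

# Two-level toy hierarchy: one root group, then two blocks of two SNPs.
toy_levels = [[[0, 1, 2, 3]], [[0, 1], [2, 3]]]
toy_penalty = tree_group_penalty([3.0, 4.0, 0.0, 0.0], toy_levels, lam=1.0)
```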
Automatic model selection
Figure: ROC curves (TPR against FPR) for groupLasso, lasso, elasticNet2, groupLasso.Gap and tree-lasso, at block level and at SNP level, for ρ = 0, ρ = 0.1 and ρ = 0.2. ρ = 0: 1 cluster; ρ = 0.1: 5 clusters; ρ = 0.2: 6 clusters.
Thank you for your attention !