Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare...

Scalable Statistical Inference for Massive Health Science Data

Xihong Lin

Department of Biostatistics and Department of Statistics

Harvard University

Examples of Genome, Exposome and Phenome

Smartphone DataWhole Genome Sequencing

Electronic Medical Records

Genome

ExposomePhenome

Our Niche in Big Data Era: Scalable Statistical Inference

Data KnowledgeScalable

ActionsStatistical Inference

Broader

Partnership

Goal: To solve big problems

Whole Genome Sequencing Studies (2015-)

CCGATCCAAGTCCATATATACCGATTTAACCGAA

CCGATCCAAGTCCATATATACCAATTTAACCGAA

CCGATCCAAGTCCATACATACCGATTTAACCGAACCGATCCAAGTCCATACATACCGATTTAACCGAA

CCAATCCAAGTTCATATATACCGATTTGACCGAA

CCGATCTAAGTCCATATATACCGATTTAACCGAA

CCGATCCAAGTCCATACATACCGATTTAACCGAA

WGS Covers 100% of the Genome

Rare Variants

GWAS Common Variants <3%

Rare variants are more likely to cause diseases and their coded proteins are more likely to be drug targets.

TOPMed Freeze 5 (n=54,000): 430M Variants (97% are rare variants)

1000 Genomes N=1000

GSP(NHGRI)N=200,000

TOPMed(NHLBI)

N=150,000

Large Scale WGS Timeline2018

Biobanks(N=millions)

First Goal of WGS Analysis:Signal Detection

Scan the genome to identify genomic regions associated with diseases/traits

Challenges in Rare Variant Analysis of WGS Data

• Simple single SNP analysis does not work

• Need to perform SNP-set analysis

• Estimation is very difficult

APOE Promoter

LDL= 𝐆𝐆𝐆𝐆 + 𝐞𝐞

Sequencing Kernel Association Test (SKAT)• Wu, et al, 2011, AJHG.

(Citations=1400)• SMMAT• STAAR

Generalized Higher Criticism (GHC) /Generalized Berk-Jones (GBJ)/ACAT

• Murkerjee, et al, Ann. Stat, 2015

• Barnett, et al 2017 (GHC), JASA

• Sun and Lin (GBJ), 2017

• Liu, et al (ACAT), 2018

Model:

Sparse AlternativeDense Alternative

Test for Dense & Sparse High-Dimensional Alternatives

𝐘𝐘 = 𝐆𝐆𝐆𝐆 + 𝐞𝐞

Model and Hypothesis

Yi is phenotype (outcome) (i = 1, · · · , n)

Xi contains q covariates

Gi contains p SNPs (AA, AB, BB=0,1,2) in a SNV set, e.g.,variants in the promoter region of APOE.

α and β contain regression coefficients.

µi = E (Yi |Gi ,Xi)

h(µi) = XTi α +G

Hypothesis of no gene/network effect (p might be large):

H0 : β = 0 and H1 : β 6= 0 (weak).

March 8, 2019 1 / 23

• 𝑝𝑝 = dim(𝜷𝜷) might not be small

• Full GLMs hard to fit due to rare variants

• Solution:

• Use score statistics 𝑍𝑍𝑗𝑗 = ∑𝑖𝑖=1𝑛𝑛 𝐺𝐺𝑖𝑖𝑗𝑗(𝑌𝑌𝑖𝑖 − �𝜇𝜇𝑖𝑖𝑖)

• Scability: Fit the null same null model 𝑔𝑔 𝜇𝜇𝑖𝑖 = 𝑿𝑿𝒊𝒊′𝜶𝜶 only once

when scanning the genome

Challenges Addressed in Scalable Inference for WGS Data

Dense/Sparse Alternatives

Unknown Truth: k = p1−α of βj ’s 6= 0

Hypothesis

H0 : β = 0H1 : Some βj 6= 0

Dense alternative (α < 1/2):

Ex: p = 100, α = 0.4⇒ k = 16

Sparse alternative (α > 1/2):

Ex: p = 100, α = 0.6⇒ k = 7

March 8, 2019 2 / 23

Difficulties in Testing

No global optimal most powerful test exists.

Test optimality depends on

Genotype matrix(G ) : Sparsity, LD (correlation)

Signals β: Sparsity, strength, and sign

Distribution of Y

March 8, 2019 3 / 23

• Burden(B) (if all variants are causal with effects (β’s) in the same direction)

𝐵𝐵 = �𝑗𝑗

𝑝𝑝

𝑤𝑤𝑗𝑗𝑍𝑍𝑗𝑗

• SKAT (if there are neural variants and/or with effects (β’s) in different directions)

𝑆𝑆 = �𝑗𝑗

𝑝𝑝

𝑤𝑤𝑗𝑗𝑍𝑍𝑗𝑗2

Dense Regime

Sparse Regime: Higher Criticism (HC) (Tukey,1976)

S(t) =

p∑j=1

1{|Zj |≥t}

Assumes Σ = Ip or sparse (G is a low coherence matrix)

The HC test statistic is (Ingster, 1998; Donoho and Jin, 2003;Arias-Castro, et al, 2011)

HC = supt>0

{S(t)− 2pΦ(t)√

2pΦ(t)(1− 2Φ(t))

March 8, 2019 5 / 23

The Higher Criticism

−4 −2 0 2 4

Histogram of the Zi

argmax{HC(t)}

March 8, 2019 6 / 23

Linear Regression: Existing Results on DetectionBoundary

Dense Regime (α ≤ 12) Sparse Regime (α > 1

A �√

pα− 12

n⇒ all tests

powerless.A <

√2t log p

n, t <

ρ∗gaussian(α) ⇒ all tests pow-erless.

A �√

pα− 12

n⇒ SKAT pow-

erfulA >

√2t log p

r, t >

ρ∗gaussian(α) ⇒ HC powerful.

Setting

Low coherence matrix G (sparse correlation Σ)

A=signal strength of β.

Sparsity index: k = p1−α

March 8, 2019 7 / 23

The results for binary regression are different fromlinear regression (Mukherjee, et al, 2015, Ann Stat)

If design matrices are too sparse, then signal detection isimpossible no matter how strong signals are.

Two point detection boundary: Maximal Sparsity of G andMinimal Signal Strength β.

March 8, 2019 8 / 23

Asymptotic p-values for HC Does Not Work Wellfor Finte p

The supremum of this standardized empirical process follows aGumbel distribution asymptotically.

Jaeschke (1979) shows that this converges in distribution at anabysmal rate of O{(log p)−1/2}

March 8, 2019 9 / 23

Slow Convergence to Asymptotic Distribution of HC

0 1 2 3 4

Theoretical; p=∞ Empirical; p=102 Empirical; p=106

In genetic studies, gene and network sizes

(p=# of SNPs=dozens to thousands)

March 8, 2019 10 / 23

Analytic p-values for HC for Finite p(Barnett and Lin, Biometrika, 2015)

Letting h be the observed HC statistic:

p-value = pr

(supt>0

{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))

}≥ h

)There exists 0 < t1 < · · · < tp, such that

p-value = 1− pr

{S∗(tk) ≤ p − k}

Then apply the chain rule of conditioning to get a product ofbinomial probabilities.

March 8, 2019 11 / 23

Simulation Study of Type I error rates of HC:Analytic(Exact) vs Asymptotic

α 10 50

1·0 9·92× 10−1(7·31× 10−1) 1·01(1·59× 10−1)

1·0× 10−1 1·01× 10−1(6·03× 10−2) 9·75× 10−2(4·90× 10−3)

1·0× 10−2 1·12× 10−2(7·30× 10−3) 9·80× 10−3(4·00× 10−4)

March 8, 2019 12 / 23

Need to account for Correlation among SNPs (LD))

CHRNA3-5 Gene Region

March 8, 2019 13 / 23

Accounting for correlation: Innovated HC (iHC)(Hall and Jin, 2011)

Letting UUT = Cov(Z ) = Σ

Define the transformed (decorrelated) test statistics:

Z∗ = U

L−−−→n→∞

MVN(0, Ip)

S∗(t) =

p∑j=1

1{|Z∗j |≥t}

The innovated Higher Criticism test (iHC) statistic is:

iHC = supt>0

{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))

March 8, 2019 14 / 23

Decorrelating using dampens true signals and causesiHC to lose power: CGEM Breast Cancer GWAS:FGFR2 gene

Marginal test statistics

−4 −2 0 2 4

March 8, 2019 15 / 23

Generalized Higher Critcism (GHC) (Barnett, et al,2016, JASA)

Recall

S(t) =

p∑j=1

1{|Zj |≥t}

Now we allow Σ to have arbitrary correlation structure.

S(t) is no longer binomial. Instead we approximate withBeta-binomial, matching on first two moments.

The Generalized Higher Criticism (GHC) test statistic is:

GHC = supt>0

S(t)− 2pΦ(t)√Var(S(t))

GHC achieves the same as detection boundary as HC .

March 8, 2019 16 / 23

The variance estimator Var(S(t))

Let rn = 2p(1−p)

∑1≤k<l≤p(Σkl)

n and let Hi(t) be the Hermite

polynomials: H0(t) = 1, H1(t) = t, H2(t) = t2 − 1 and so on. Then

(S(tk), S(tj)

)= p[2Φ(max{tj , tk})− 4Φ(tj)Φ(tk)]

+4p(p − 1)φ(tj)φ(tk)∞∑i=1

H2i−1(tj)H2i−1(tk)r 2i

March 8, 2019 17 / 23

Analytic p-values for the GHC

Letting h be the observed GHC statistic:

p-value = pr

supt>0

S(t)− 2pΦ(t)√Var(S(t))

There exists 0 < t1 < · · · < tp, such that

p-value = 1− pr

{S(tk) ≤ p − k}

March 8, 2019 18 / 23

Generalized Berk-Jones

Motivation: GHC works well in the very sparse signal case butless well in the moderately sparse signal case in finite samples.

Let s be the realized value of S(t).

Berk-Jones (Sup LR test):

BJ = maxt>0

{Pr [S(t) = s|π = s/p]

Pr [S(t) = s|π = π0]

{π0 <

}Generalized Berk-Jones (Account for correlation):

GBJ = maxt>0

{Pr [S(t) = s|π = s/p, cor(Z) = Σ]

Pr [S(t) = s|π = π0, cor(Z) = Σ]

{π0 <

March 8, 2019 20 / 23

Inference using Generalized Higher Criticism andGeneralized Berk-Jones

The distribution of S(t) is over-dispersed binomial and its exactdistribution is hard to calculate.

Approximate the distribution of S(t) using extendedbeta-binomial.

The sups in GHC and GBJ are achieved at the design points andboth GHC/GBJ and their distributions are calculated analyticallyusing approximations.

March 8, 2019 21 / 23

Rejection Boundary Comparisons: GHC vs GBJ

20 SNPs, 100% correlated with ρ=0.3

2 4 6 8 10 12 14 16 18 20

Ordered Magnitudes of Test Statistics, |Z|(j)

Note how we gain ’volume’ in the rejection region near the expectedsignals.

March 8, 2019 22 / 23

Simulation (Main advantage of GBJ: Power gain infinite sample for moderate sparsity)

200 SNPs, ρ1=0.3, ρ2=0, ρ3=0, R2=0.01

2 4 6 8 10 12 14

Number of causal SNPs

Extremely sparse regime: 1-3 causal. Moderately sparse regime: 4-13causal. Dense regime: 14+ causal.

March 8, 2019 23 / 23

Key features:

• A general method for combining p-values.• Super fast computation under arbitrary correlation and robust to

correlation. • Powerful when signals are sparse.• Can be used for constructing robust test.

Sparse Regime: ACAT: Aggregated Cauchy Association Test

Yaowu Liu, et al (JASA 2018, AJHG, 2019)

𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 = �𝑖𝑖=1

𝑑𝑑

𝑤𝑤𝑖𝑖 tan 0.5 − 𝒑𝒑𝒊𝒊 𝜋𝜋

Transform p-value to Cauchy

Weights

Aggregated Cauchy Association Test (ACAT)

Dense signals Sparse signals

Alternative

Neutral variantCausal variant

SlowFastComputation

SKAT / Burden MinP/GHC/GBJ Tests

No prior knowledge about the sparsity of signals. Need robust test.

Existing SNV-set tests

Assumptions: 𝐼𝐼. 𝑝𝑝𝑖𝑖 |𝑍𝑍𝑖𝑖| (z-score) 𝐼𝐼𝐼𝐼. ∀𝑖𝑖, 𝑗𝑗, 𝑍𝑍𝑖𝑖 ,𝑍𝑍𝑗𝑗 ~𝑁𝑁2 0,Θ𝑖𝑖𝑗𝑗Theorem: For any Σ ≥ 0, we have

lim𝑡𝑡→+∞

𝑃𝑃{𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 > 𝑡𝑡}𝑃𝑃{Cauchy 0,1 > 𝑡𝑡}

P-value calculation:p − value ≈ 1/2 − {𝑎𝑎𝑎𝑎𝑎𝑎𝑡𝑡𝑎𝑎𝑛𝑛(𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴)}/π

Correlation of p-values Not required Super fast

Tail is Cauchy

Theory about ACAT

IndependentPerfectly dependent

Sample mean ( �𝑋𝑋 = 1

𝑑𝑑∑𝑖𝑖=1𝑑𝑑 𝑋𝑋𝑖𝑖)

𝑋𝑋𝑖𝑖 ~ Cauchy(0,1)

𝑋𝑋𝑖𝑖 ~ Normal(0,1) �𝑋𝑋 ~ N(0,1/d)

�𝑋𝑋 ~ Cauchy(0,1)�𝑋𝑋 ~ Cauchy(0,1)

�𝑋𝑋 ~ N(0,1)

≈Cauchy(0,1)

General Dependency

Heavy tail makes Cauchy distribution insensitive to correlation

Some insights

P-values

0.35 0.510.25 1.000.15 1.960.05 6.31

0.45 0.16

2e-03 159 5e-03 63.7

Cauchy values

ACAT uses a few smallest p-values to represent the significance.

ACAT is powerful against sparse alternatives

…………MAC<10 MAC>=10

SNVs in a region

Burden 𝑝𝑝0 𝑝𝑝1 𝑝𝑝2 𝑝𝑝3 𝑝𝑝𝑘𝑘……

𝑇𝑇

P-value ≈ 1 − 𝐹𝐹𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐(𝑇𝑇) Super fastAccurate

𝑤𝑤𝑖𝑖,𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴−𝑉𝑉 = 𝑤𝑤𝑖𝑖,𝑆𝑆𝑆𝑆𝐴𝐴𝐴𝐴 × )𝑀𝑀𝑀𝑀𝐹𝐹𝑖𝑖(1 −𝑀𝑀𝑀𝑀𝐹𝐹𝑖𝑖

Saddlepoint method (Dey, et al, 2017)

ACAT-V for testing a SNV-set

Key features:

• Boost RV analysis power by optimally combining statistical evidence of MAFs (default in SKAT), functional annotations, and phenotypic information

• Computationally scalable

• Applicable to any given variant-set

STAAR: variant-Set Test for Association using Annotation infoRmation

Xihao Li and Zilin Li

Optimal weighting: True effect sizes (unknown)

Signal Regions (Effect Sizes (𝜷𝜷)) in the Genome

Question: Which functional scores to use boost power of RV association analysis in a variant-Set

Use Functional Annotations to Prioritize Variants in a Variant-Set

Functional Annotation Database

Annovar

ENCODE

EPIGENOME

Individual Scores

Choosing Weights 𝒘𝒘𝒋𝒋 to Empower WGS Association Analysis

>260 annotations

15 Types of Annotations

80% built on hg38

Genome Functional Variant Annotations (GSP+TOPMED) Hufeng Zhou)

Dynamically incorporate multiple annotation weights in RV Tests

Coding Variants Non-coding Variants

Existing Integrative Annotation Scores are Mainly Driven by Protein and Conservation Scores with Little Correlation with Epigenetic Scores

• APC1: Epigenetics• APC2: Conservation• APC3: Protein Function• APC4: Negative Selection• APC5: Distance to Coding• APC6: Mutation Density• APC7: Transcription Factor• APC8: MapAbility• APC9: Distance to TEE/TSE• APC10: MicroRNA

Correlation Heatmap with Annotation PCs (GSP Freeze 1, hg38)

STAAR: Incorporate Multiple Functional Scores to Boost Power of RV Association Analysis Using ACAT

Type I Error Rates Using STAAR are Protected: Simulated WGS data Using COSI (n = 10,000)

𝜶𝜶 = 𝟏𝟏𝟎𝟎−𝟔𝟔 Continuous Traits Dichotomous Traits

STAAR-B 1.1 × 10−6 1.0 × 10−6

STAAR-S 9.9 × 10−7 7.8 × 10−7

STAAR-O 9.3 × 10−7 1.0 × 10−6

STAAR-O uses ACAT to combine STAAR-B and STAAR-O

ARIC WGS data of LPA (AA, n=1800): Significant 4KB Sliding Windows in Chr 6

LPA (AA): Significant 4KB Sliding Windows in Chr 6

Area 1 and Area 2: Weights

• Scalable statistical inference is a critical niche for analysis of big data.

• It is important to integrate domain science and computational science in scalable statistical inference to accelerate statistical science and scientific discovery.

• “Optimal” statistical inference needs to context-specific, e.g., dense and sparse regimes for high-dimensional hypothesis testing

• Asymptotic and finite sample results are both important.

Final Remarks

Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare...

Documents

Transcript of Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare...

Rare coding variants in 35 genes associate with circulating ......2020/12/23 · 1 Rare coding variants in 35 genes associate with circulating lipid levels – a multi-ancestry analysis

Search for Rare Copy-Number Variants in Congenital Heart ...circgenetics.ahajournals.org/content/circcvg/early/2015/12/04/CIRC... · Search for Rare Copy-Number Variants in Congenital

Identification of rare and common variants in BNIP3L: a … · 2020. 5. 11. · PRIMARY RESEARCH Open Access Identification of rare and common variants in BNIP3L: a schizophrenia

Computational methods for the analysis of rare variants · Computational methods for the analysis of rare variants Shamil Sunyaev ... Alkes Price, Lee-Jen Wei, Paul de Bakker, Shaun

Rare and common variants of APOB and PCSK9 in Korean ... · RESEARCH ARTICLE Rare and common variants of APOB and PCSK9 in Korean patients with extremely low low-density lipoprotein-cholest

Rare and common variants: twenty arguments G.Gibson

National Human Genome Research Institute Home - Nancy’s … · 2014. 8. 1. · Nancy’s $100M Project. Common Variants Rare Variants Relating Variation to Phenotype Genome ...

Novel and rare functional genomic variants in multiple ... · Johar et al. J Transl Med DOI 10.1186/s12967-015-0525-x RESEARCH Novel and rare functional genomic variants in multiple

Common and Rare Variants in SCN10A Modulate the Risk …circgenetics.ahajournals.org/content/circcvg/early/2014/07/31/CIRC... · Common and Rare Variants in SCN10A Modulate the Risk

Surviving Rare Sherman Variants - Freethe.shadock.free.fr/Surviving_Rare_Sherman_Variants.pdfSurviving rare Sherman variants ... tank and serve to restore the Sherman Jumbo “Cobra

Evaluating the contribution of rare variants to type 2 diabetes and … · 2017-11-28 · Results To determine the extent to which private and rare variants contribute to T2D and

Beyond GWAS. Outline Multiple testing Gene-environment interaction Gene-gene interaction Rare variants Pharmacogenetics, Phamacogenomics.

Genetic Analysis of Rare Human Variants of Regulators of G ...

Quantitative trait loci, genome wide association mapping ...€¦ · GWAS (Genome wide association mapping) Coupling of molecular variants to (quantitative) traits, like weight of

Common genetic variants influence human subcortical brain ...enigma.ini.usc.edu/wp-content/uploads/2015/07/ENIGMA-GWAS-Nat… · Allsubcorticalgenome-widesignificantvariantsidentifiedinthedis-covery

Contribution of Global Rare Copy-Number Variants … AJHG 91 489...ARTICLE Contribution of Global Rare Copy-Number Variants to the Risk of Sporadic Congenital Heart Disease Rachel

Improving the interpretation of rare variants through ... · Sian Ellard, Thalia Antoniadi and Emma Baple Improving the interpretation of rare variants through inter-disciplinary

Finding Disease Variants in Mendelian Disorders By …staff.ustc.edu.cn/~ynyang/group-meeting/2012-2013/rare-variants/... · Finding Disease Variants in Mendelian Disorders By Using

Common genetic variants contribute to risk of rare …...Common genetic variants contribute to risk of rare severe neurodevelopmental disorders Mari e. K. Niemi 1, Hilary C. Martin

Family-specific aggregation of lipid GWAS variants confers ...