Genome-wide association studies

Usman Roshan

SNP

• Single nucleotide polymorphism
• Specific position and specific chromosome

SNP genotype

Suppose this is the DNA on chromosome 1 starting from position 1.

There is a C/G SNP at position 5, a C/T SNP at position 14, and a G/T SNP at position 21. This person is heterozygous at the first SNP and homozygous at the other two.

F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC

SNP genotype representation

The example

F: AACACAATTAGTACAATTATGAC

M: AACAGAATTAGTACAATTATGAC

is represented as

CG CC GG …

SNP genotype

• For several individuals

A/T C/T G/T …

H0: AA TT GG …

H1: AT CC GT …

H2: AA CT GT …

.

.

.

SNP genotype encoding

• If SNP is A/B (alphabetically ordered) then count number of times we see B.

• Previous example becomes

    A/T C/T G/T …       A/T C/T G/T …
H0: AA  TT  GG  …        0   2   0  …
H1: AT  CC  GT  …   =>   1   0   1  …
H2: AA  CT  GT  …        0   1   1  …

Now we have data in numerical format
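The encoding rule can be sketched in a few lines of Python. This is only an illustration: the function name `encode_genotype` and the dictionary layout are assumptions made here, not part of the slides.

```python
def encode_genotype(genotype, snp):
    """Count copies of the alphabetically second allele of `snp` in `genotype`.

    `snp` is written like "A/T"; `genotype` is a two-letter string like "AT".
    (Hypothetical helper, named here for illustration.)
    """
    first, second = sorted(snp.split("/"))
    if any(allele not in (first, second) for allele in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {snp!r}")
    return genotype.count(second)


snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

# Encode every genotype of every individual against its SNP.
encoded = {
    name: [encode_genotype(g, s) for g, s in zip(genotypes, snps)]
    for name, genotypes in individuals.items()
}
print(encoded)
```

Running it on H0, H1, H2 reproduces the rows 0 2 0, 1 0 1 and 0 1 1 from the slide.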

Genome wide association studies (GWAS)

• Aim to identify which regions (or SNPs) in the genome are associated with a disease or a certain phenotype.

• Design:
– Identify population structure
– Select case subjects (those with disease)
– Select control subjects (healthy)
– Genotype a million SNPs for each subject
– Determine which SNP is associated.

Example GWAS

A/T C/G A/G …

Case 1 AA CC AA

Case 2 AT CG AA

Case 3 AA CG AA

Control 1 TT GG GG

Control 2 TT CC GG

Control 3 TA CG GG

Encoded data

A/T C/G A/G A/T C/G A/G

Case1 AA CC AA 0 0 0

Case2 AT CG AA 1 1 0

Case3 AA CG AA => 0 1 0

Con1 TT GG GG 2 2 2

Con2 TT CC GG 2 0 2

Con3 TA CG GG 1 1 2

Ranking SNPs

SNP1 SNP2 SNP3 SNP1 SNP2 SNP3

A/T C/G A/G A/T C/G A/G

Case1 AA CC AA 0 0 0

Case2 AT CG AA 1 1 0

Case3 AA CG AA => 0 1 0

Con1 TT GG GG 2 2 2

Con2 TT CC GG 2 0 2

Con3 TA CG GG 1 1 2

A good ranking strategy would produce SNP3, SNP1, SNP2: SNP3 separates cases from controls perfectly, SNP1 only partially, and SNP2 hardly at all.

Chi-square test

• Gold standard is the univariate non-parametric chi-square test with two degrees of freedom.

• Search for SNPs that deviate from the independence assumption.

• Rank SNPs by p-values
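As a rough sketch of this ranking, the code below applies a genotype-based chi-square test (case/control by genotype 0/1/2) to the six-subject toy data from the "Ranking SNPs" slide. The function names are made up for illustration, genotypes that nobody carries are simply skipped, and the p-value assumes two degrees of freedom as stated above. With only six subjects the chi-square approximation is not reliable, so the numbers are meant to show the mechanics only.

```python
import math

# Encoded genotypes from the toy example (rows: subjects, columns: SNP1..SNP3).
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]


def genotype_chi_square(case_values, control_values):
    """Pearson chi-square statistic for a 2x3 table (case/control by genotype)."""
    table = [[vals.count(g) for g in (0, 1, 2)]
             for vals in (case_values, control_values)]
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip genotypes that nobody carries
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def p_value_2df(stat):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)


results = []
for j in range(3):
    stat = genotype_chi_square([row[j] for row in cases],
                               [row[j] for row in controls])
    results.append((f"SNP{j + 1}", stat, p_value_2df(stat)))

# Smallest p-value first.
ranking = [name for name, _, _ in sorted(results, key=lambda r: r[2])]
print(ranking)
```

On this data the statistics are 4 for SNP1, about 1.33 for SNP2 and 6 for SNP3, so the ranking is SNP3, SNP1, SNP2.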

Case-control example

• Study of 100 people:
– Case: 50 subjects with cancer
– Control: 50 subjects without cancer

• Count number of alleles and form a 2x2 contingency table

• Relative risk: RR = Pr(disease | one copy of risk allele) / Pr(disease | zero copies of risk allele) (Jewell '03)

• Because cases and controls are sampled by disease status, we cannot estimate the relative risk from a case-control study

• But we can estimate the odds-ratio

          #Allele1   #Allele2
Case         10         90
Control       2         98

Symmetry in odds ratio

• The odds ratio is symmetric in disease and genotype:

OR = Odds(D|G=1)/Odds(D|G=0) = Odds(G|D=1)/Odds(G|D=0)

• This is useful because we can estimate P(G|D) from a case-control study, so we can use the OR as an estimate of one's risk of disease.
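A short derivation shows why the symmetry holds. The shorthand p_{dg} = Pr(D=d, G=g) is introduced here for convenience and does not appear in the slides:

```latex
\frac{\mathrm{Odds}(D \mid G=1)}{\mathrm{Odds}(D \mid G=0)}
  = \frac{p_{11}/p_{01}}{p_{10}/p_{00}}
  = \frac{p_{11}\,p_{00}}{p_{01}\,p_{10}}
  = \frac{p_{11}/p_{10}}{p_{01}/p_{00}}
  = \frac{\mathrm{Odds}(G \mid D=1)}{\mathrm{Odds}(G \mid D=0)}
```

The conditioning probabilities cancel inside each odds, so both forms reduce to the same cross-product of joint probabilities.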

Example

• Odds of risk allele in case = (10/100)/(90/100)=1/9

• Odds of risk allele in control = (2/100)/(98/100)=1/49

• Odds ratio of risk allele = (1/9)/(1/49) = 49/9 ≈ 5.4

          #Allele1 (risk)   #Allele2 (wildtype)
Case            10                 90
Control          2                 98
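The arithmetic above can be checked with exact fractions. This is a small sketch; the function name `odds_ratio` and its argument names are chosen here for illustration.

```python
from fractions import Fraction


def odds_ratio(case_risk, case_wild, control_risk, control_wild):
    """Odds ratio of the risk allele from the four cells of a 2x2 allele table."""
    odds_case = Fraction(case_risk, case_wild)           # (10/100)/(90/100) = 10/90
    odds_control = Fraction(control_risk, control_wild)  # (2/100)/(98/100) = 2/98
    return odds_case / odds_control


or_example = odds_ratio(10, 90, 2, 98)
print(or_example, float(or_example))
```

Using `Fraction` keeps the result as 49/9 instead of a rounded decimal, which makes it easy to compare with the slide.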

What about significance?

• Okay, so the OR measures the risk. But is it significant? Perhaps it is due to chance.

• Let’s look at the chi-square test for measuring significance.

Statistical test of association (P-values)

• P-value = probability of the observed data (or something more extreme) under the null hypothesis

• Example:
– Suppose we are given a series of coin-tosses
– We feel that a biased coin produced the tosses
– We can ask the following question: what is the probability that a fair coin produced the tosses?
– If this probability is very small then we can say there is a small chance that a fair coin produced the observed tosses.
– In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin

Binomial distribution

• Bernoulli random variable:
– Two outcomes: success or failure
– Example: coin toss

• Binomial random variable:
– Number of successes in a series of independent Bernoulli trials

• Example:
– Probability of heads = 0.5
– Given four coin tosses what is the probability of three heads?
– Possible outcomes: HHHT, HHTH, HTHH, THHH
– Each outcome has probability = 0.5^4
– Total probability = 4 * 0.5^4 = 0.25

Binomial distribution

• Bernoulli trial probability of success=p, probability of failure = 1-p

• Given n independent Bernoulli trials what is the probability of k successes?

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

• Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html
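To connect the formula with the four-toss example, the sketch below evaluates the binomial probability and also enumerates the sequences with exactly three heads. The function name `binomial_pmf` is an illustrative choice.

```python
from itertools import product
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# All sequences of four tosses that contain exactly three heads.
outcomes = ["".join(t) for t in product("HT", repeat=4) if t.count("H") == 3]

print(outcomes)                 # the four sequences counted by comb(4, 3)
print(binomial_pmf(3, 4, 0.5))  # 4 * 0.5**4
```

The enumeration gives four sequences, matching the binomial coefficient for n = 4 and k = 3.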

Hypothesis testing under Binomial hypothesis

• Null hypothesis: fair coin (probability of heads = probability of tails = 0.5)

• Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses)
• P-value under null hypothesis = probability that #heads >= 15
• This probability is 0.021
• Since it is below 0.05 we can reject the null hypothesis
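The p-value quoted above can be reproduced by summing the upper tail of the binomial distribution under the fair-coin null. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from math import comb

tosses = "HHHHTHTHHHHHHHTHTHTH"
n = len(tosses)            # number of tosses
heads = tosses.count("H")  # observed number of heads

# Under a fair coin every sequence has probability 1 / 2**n, so the tail
# probability is the number of sequences with at least `heads` heads over 2**n.
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

print(n, heads, round(p_value, 3))
```

This is a one-sided test: only outcomes with at least as many heads as observed count as more extreme.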

Chi-square statistic

• Define four random variables Xi, each of which is binomially distributed, Xi ~ B(n, pi), where n=c1+c2+c3+c4 is the total number of observations in the table and pi is the probability of success of Xi.

• Each variable Xi is the count in one cell of the table: case or control, crossed with risk or wildtype allele.

• The expected value E(Xi) = npi since each Xi is binomial.

          #Allele1 (risk)   #Allele2 (wildtype)
Case         c1 (X1)             c2 (X2)
Control      c3 (X3)             c4 (X4)

Chi-square statistic

Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

where

c_i = observed frequency for ith outcome
e_i = expected frequency for ith outcome
n = total outcomes

The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf

Great. But how do we use this to get a SNP p-value?

Null hypothesis for case control contingency table

• We have two random variables:
– D: disease status
– G: allele type.

• Null hypothesis: the two variables are independent of each other (unrelated)

• Under independence
– P(D,G) = P(D)P(G)
– P(D=case) = (c1+c2)/n
– P(G=risk) = (c1+c3)/n

• Expected values
– E(X1) = P(D=case)P(G=risk)n

• We can calculate the chi-square statistic for a given SNP and its p-value: the probability of seeing a statistic at least this large if the SNP were independent of disease status.

• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

          #Allele1 (risk)   #Allele2 (wildtype)
Case            c1                 c2
Control         c3                 c4
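For reference, applying the same independence rule to every cell gives the four expected values. Only E(X1) is stated in the slides; the other three follow the same pattern with the matching row and column totals:

```latex
E(X_1) = \frac{(c_1 + c_2)(c_1 + c_3)}{n}, \qquad
E(X_2) = \frac{(c_1 + c_2)(c_2 + c_4)}{n}, \\
E(X_3) = \frac{(c_3 + c_4)(c_1 + c_3)}{n}, \qquad
E(X_4) = \frac{(c_3 + c_4)(c_2 + c_4)}{n}
```

Because the row and column totals are estimated from the same table, the standard independence test on a 2x2 table refers the statistic to a chi-square distribution with (2-1)(2-1) = 1 degree of freedom.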

Chi-square statistic exercise

          #Allele1   #Allele2
Case         15         35
Control       2         48

• Compute expected values and chi-square statistic
• Compute chi-square p-value by referring to chi-square distribution
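One way to check the exercise is sketched below. It assumes the standard independence test on a 2x2 table with one degree of freedom and no continuity correction; for one degree of freedom the chi-square upper tail can be written with the complementary error function, so only the standard library is needed.

```python
import math

# Rows: case, control. Columns: #Allele1, #Allele2.
observed = [[15, 35], [2, 48]]

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Upper tail of the chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(expected)
print(chi_square, p_value)
```

The expected counts come out as 8.5 and 41.5 in both rows, and the statistic is about 11.98, which corresponds to a p-value well below 0.05.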

Logistic regression

• The odds ratio estimated from the contingency table directly has a skewed sampling distribution.

• A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function. In other words:

$$\log\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = w^T G + w_0$$

• Why:
– Log likelihood ratio is a powerful statistic
– Modeling as a linear function yields a simple algorithm to estimate parameters

• G is number of copies of the risk allele

• With some manipulation this becomes

$$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$

How to find w and w0?

• And so e^w is our odds ratio. But how do we find w and w0?

– We assume that one’s disease status D given their genotype G is a Bernoulli random variable.

– Using this we form the sample likelihood

– Differentiate the log-likelihood with respect to w and w0

– Use gradient descent on the negative log-likelihood
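The steps above can be sketched in plain Python. To keep the example small it makes an assumption that is not in the slides: each allele from the earlier 2x2 example (10/90 in cases, 2/98 in controls) is treated as one observation, with G = 1 for the risk allele and G = 0 for wildtype. The learning rate and iteration count are arbitrary choices that happen to converge on this data.

```python
import math

# (G, D, count): G = 1 for a risk allele, 0 for wildtype;
# D = 1 for case, 0 for control. Counts come from the earlier 2x2 example.
cells = [(1, 1, 10), (0, 1, 90), (1, 0, 2), (0, 0, 98)]
total = sum(count for _, _, count in cells)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


w, w0 = 0.0, 0.0
learning_rate = 1.0  # illustrative choice
for _ in range(20000):
    grad_w = grad_w0 = 0.0
    for g, d, count in cells:
        # Derivative of the Bernoulli negative log-likelihood
        # with respect to (w * g + w0) is (predicted probability - label).
        err = sigmoid(w * g + w0) - d
        grad_w += count * err * g
        grad_w0 += count * err
    # Step against the average gradient.
    w -= learning_rate * grad_w / total
    w0 -= learning_rate * grad_w0 / total

odds_ratio = math.exp(w)
print(w, w0, odds_ratio)
```

With a single binary predictor the fitted model reproduces the table exactly, so e^w should agree with the odds ratio 49/9 computed directly from the counts.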
