
Miranda Durkie, January 2010

Association is a statistical measure of the co-occurrence of certain phenotypic traits with certain alleles.

An association study is an examination of genetic variation across a given genome, designed to identify genetic associations with observable traits.

1. Direct causation: having allele A makes you susceptible to disease D. Possession of A may not be sufficient in itself to give you D but it makes it more likely you’ll develop D.

2. Natural selection: people who have disease D may be more likely to survive and reproduce if they have allele A.

3. Population stratification: the population contains several distinct genetic subsets, and disease D and allele A both happen to be more common in one particular subset.

4. Type 1 error: association studies test a large number of markers to find significant associations (p < 0.05). However, by chance alone 5% of tests will be significant at p = 0.05 and 1% at p = 0.01. The data therefore need correction for multiple testing; in the past this was not done adequately, so results could not be replicated.

5. Linkage disequilibrium: the aim of association studies is to discover associations caused by linkage disequilibrium between allele A and disease D.

Linkage analysis is used to track the inheritance of alleles within a family.

Linked markers or alleles are only separated if a recombination event occurs.

The closer a marker is to the disease/susceptibility allele, the less likely it is to be separated from it by recombination over several generations. This leads to a common haplotype which occurs more often than would be expected by chance.

Within an individual family this linkage will extend up to 20 cM, but in association studies it extends only a few kb.

Linkage disequilibrium is the non-random association between two or more alleles located together on the same chromosome.

Consider 2 markers with alleles A/a and B/b.

Frequency of allele A = p and a = 1 - p

Frequency of allele B = q and b = 1 - q

If there is no association then haplotype AB occurs at frequency pq.

However, if the frequency of AB > pq then A and B must be in positive LD.
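The definition above can be turned into a small calculation. The sketch below computes the disequilibrium coefficient D = freq(AB) - pq from the slide's own symbols. It also returns D' and r², which are standard normalised LD measures but are not mentioned in the original slides; the function name `ld_stats` and the example frequencies are purely illustrative.

```python
def ld_stats(p_ab, p, q):
    """Return (D, D_prime, r_squared) for two biallelic markers.

    p_ab : observed frequency of haplotype AB
    p    : frequency of allele A (so a has frequency 1 - p)
    q    : frequency of allele B (so b has frequency 1 - q)
    """
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("both markers must be polymorphic")

    # D > 0 means AB occurs more often than expected (positive LD).
    d = p_ab - p * q

    # Largest |D| that the allele frequencies allow, used to scale D'.
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))

    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p * (1 - p) * q * (1 - q))
    return d, d_prime, r_squared


# Illustrative (made-up) frequencies: AB is seen at 0.40, expected 0.6 * 0.5 = 0.30.
d, d_prime, r_squared = ld_stats(p_ab=0.40, p=0.6, q=0.5)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r_squared:.3f}")
```

D on its own depends on the allele frequencies, which is why the scaled measures are usually reported when comparing marker pairs.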

Linkage is the relationship between alleles, whilst association is the relationship between alleles and phenotypes.

Association studies do not study families but instead look for differences in allele frequencies between different groups of individuals with defined phenotypes.

For both studies, the disease-causing mutation and/or susceptibility allele does not need to be known. Instead SNPs or other markers such as di-, tri- or tetra-nucleotide repeats which are in linkage disequilibrium with the disease/susceptibility allele are used.

1. Identify SNPs to analyse

2. Genotype all SNPs in subset of the samples

3. Identify tagSNPs

4. Genotype tagSNPs in all samples

5. Analyse data

Work out region of interest, or choose regions of known homology from a mouse or other animal model.

Work out the size of the area you wish to study, e.g. choose a 1Mb region around your locus of interest and choose one SNP every 500bp.

If possible include SNPs that have been validated in the same ethnic group as the one you are studying.

Prioritise SNPs with higher polymorphic frequencies (>10%)

If looking within genes prioritise possible functional variants e.g. non-synonymous SNPs within exons

Read current literature to find out if any of the SNPs have been associated with similar phenotypes in other studies

Ensure that there are no SNPs under the primer or probe binding sites which could lead to non-amplification of one allele and skew your results

Due to advances in technology, the majority of current association studies now look at the whole genome: genome-wide association studies (GWAS)

Ensure cases and controls are ethnically matched

Ensure methodology is robust, accurate and high-throughput e.g. SNP arrays - which one? Exonic only? Platform? Cost? No of SNPs?

Genotype at least 96 controls and if you wish 96 cases

Record the genotypes conservatively i.e. if unsure mark as unknown

Analyse the data to

Check for deviation from Hardy-Weinberg equilibrium for all alleles - if a deviation is found it is likely that genotyping errors have been made so re-check

Calculate LD scores for SNPs in the region

Identify tagSNPs (also called haplotype tagging or htSNPs)
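The Hardy-Weinberg check in the analysis steps above could be sketched as follows. This is a minimal illustration assuming a single biallelic SNP and a 1-degree-of-freedom chi-square goodness-of-fit test; the function name `hwe_chi_square` and the genotype counts are hypothetical, and dedicated packages typically offer exact tests that behave better with small counts.

```python
import math


def hwe_chi_square(n_aa_major, n_het, n_aa_minor):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    Takes observed genotype counts (AA, Aa, aa) and returns
    (chi_square, p_value) using 1 degree of freedom.
    """
    n = n_aa_major + n_het + n_aa_minor
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa_major + n_het) / (2 * n)
    q = 1 - p

    observed = [n_aa_major, n_het, n_aa_minor]
    expected = [n * p * p, 2 * n * p * q, n * q * q]

    # Skip classes with zero expectation (monomorphic SNP).
    chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )

    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical counts for 96 controls genotyped at one SNP.
chi_square, p_value = hwe_chi_square(60, 30, 6)
print(f"chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

A small p-value flags a SNP whose genotype calls should be re-checked before it is used to calculate LD or select tagSNPs.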

Over 10 million SNPs in human genome

Linked SNPs are often inherited together as a block and the genotypes of these SNPs can be used to generate a haplotype.

The key SNPs that uniquely define the haplotype are called tagSNPs or haplotype tagging SNPs

The HapMap project started in 2002 as an international collaboration to describe common patterns of genetic variation between individuals

Identified around 500,000 key tagSNPs which can be used to generate inferred haplotypes of surrounding SNPs

This has made genome-wide scans more efficient and comprehensive.

Commercially available SNP arrays have been designed by several companies e.g. Affymetrix and Illumina to cover hundreds of thousands of SNPs across the whole genome.

They can have slightly different target SNPs e.g. Illumina Human-1 focuses on exonic SNPs thus concentrating on potential functional variants.

These arrays use tagSNPs to maximise the amount of data generated by as few SNPs as possible.

In recognition of the potential role of CNVs in complex disease susceptibility many arrays also study CNVs.

Must ensure sufficient cases and controls are tested to reach statistical significance

The lower the odds ratio for an increase in susceptibility, the more samples are required for the testing to reach statistical significance.

It is estimated that common susceptibility loci are likely to have odds ratios (OR) of 1.1 to 1.5.

Therefore, for example, in order to achieve 90% power to detect an allele with 0.2 frequency and an OR of 1.2, more than 6000 affected cases and more than double that number of normal controls are required.

If the frequency of the variant is only 0.05 you would need 20,000 cases.
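The relationship between odds ratio, allele frequency and sample size can be illustrated with a rough calculation. The sketch below uses the textbook normal-approximation formula for comparing two allele frequencies; this is an assumption made for illustration, not the method behind the figures quoted above. Those figures depend on the significance threshold and genetic model used, so this sketch should not be expected to reproduce them. The function name `cases_needed` and its defaults are hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(freq, odds_ratio, power=0.9, alpha=0.05, controls_per_case=1.0):
    """Approximate number of cases needed for an allelic case-control test.

    freq       : risk-allele frequency in controls
    odds_ratio : per-allele odds ratio (must differ from 1)
    alpha      : two-sided significance threshold for this single test
    """
    p0 = freq
    # Risk-allele frequency in cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Variance of the frequency difference, per case allele.
    variance = p1 * (1 - p1) + p0 * (1 - p0) / controls_per_case
    case_alleles = (z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2

    # Each individual contributes two alleles.
    return math.ceil(case_alleles / 2)


# Compare scenarios at a stringent, illustrative threshold with 2 controls per case.
for freq, odds_ratio in [(0.2, 1.5), (0.2, 1.2), (0.05, 1.2)]:
    n = cases_needed(freq, odds_ratio, alpha=1e-7, controls_per_case=2.0)
    print(f"freq = {freq}, OR = {odds_ratio}: about {n} cases")
```

The trend is the point here: lowering either the odds ratio or the allele frequency increases the number of cases required.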

Do single-point analysis first by looking at individual SNPs and calculating χ² and odds ratios.
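A single-point test for one SNP can be sketched from a 2x2 table of allele counts. The example below assumes an allelic (rather than genotypic) test and uses the Pearson chi-square statistic without continuity correction; the function name `allelic_test` and the counts are made up for illustration.

```python
import math


def allelic_test(case_risk, case_other, control_risk, control_other):
    """Chi-square statistic, p-value and odds ratio for a 2x2 allele table.

    Rows are cases and controls; columns are risk allele and other allele.
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    row_col_product = (a + b) * (c + d) * (a + c) * (b + d)
    if row_col_product == 0 or b * c == 0:
        raise ValueError("table has an empty margin or undefined odds ratio")

    # Pearson chi-square for a 2x2 table (1 degree of freedom).
    chi_square = n * (a * d - b * c) ** 2 / row_col_product
    p_value = math.erfc(math.sqrt(chi_square / 2))

    # Odds of carrying the risk allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c)
    return chi_square, p_value, odds_ratio


# Hypothetical allele counts from 500 cases and 500 controls (1000 alleles each).
chi_square, p_value, odds_ratio = allelic_test(240, 760, 200, 800)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}, OR = {odds_ratio:.2f}")
```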

Need to apply a correction for multiple testing e.g. the Bonferroni correction, which is conservative when the alleles being tested are in LD with each other (non-independent tests)
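The effect of multiple testing and of a Bonferroni correction can be shown with simple arithmetic. In this sketch the number of SNPs and the p-values are invented for illustration, and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Number of tests expected to pass the threshold by chance alone."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that survives Bonferroni correction."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p < threshold for p in p_values]


# Illustrative scale: an array testing 500,000 SNPs.
n_snps = 500_000
print("expected chance hits at p < 0.05:", expected_false_positives(n_snps))
print("Bonferroni per-SNP threshold:", bonferroni_threshold(n_snps))

# Four made-up p-values; the threshold here is 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.04, 0.012, 0.0001, 0.5]))
```

Because Bonferroni treats every test as independent, it over-corrects when neighbouring SNPs are correlated through LD.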

Once you have tested each individual SNP for association you can then construct haplotypes and study them for association with the disease/trait

Use bioinformatics programs such as HelixTree, SNPHAP and Stata

Because of the problems with sample size for detecting low susceptibility traits, meta-analysis has been increasingly used. Meta-analysis of GWA datasets can increase the power to detect association signals by increasing sample size and by examining more variants throughout the genome than each dataset alone.

In 2007 the Wellcome Trust Case Control Consortium published a GWA study looking at 2,000 cases for each of seven common diseases and 3,000 shared controls.

Found 24 associations: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes.

Linked 10 genes not previously implicated to common disorders

Colorectal cancer GWA has found 10 associated SNPs, 5 of which are linked to the TGFβ superfamily signalling pathway

GWA studies have led to the discovery of at least 24 loci linked to type 2 diabetes

Mainly linked to insulin secretion pathway rather than insulin resistance

However it is estimated that these loci only account for 5% of the factors contributing to heritability of T2D

Studies of hundreds of thousands or even millions of individuals are required to identify low susceptibility alleles

CNV associations have been found with schizophrenia, Alzheimer's disease and Parkinson's disease

Study of gene-gene and gene-environment interactions, which may be missed by single-point GWA, is crucial

Majority of associated variants will not be functional therefore work will be required to identify causal variants

SNPs account for 78% variation in genome but only 26% of total nucleotide differences

Further study of CNVs will be crucial

Study of rare rather than common variants (1000G)

Study of regulatory variants

Next generation sequencing