Miranda Durkie, January 2010
Association is a statistical measure of the co-occurrence of certain phenotypic traits with certain alleles.
An association study is an examination of genetic variation across a given genome, designed to identify genetic associations with observable traits.
1. Direct causation: having allele A makes you susceptible to disease D. Possession of A may not be sufficient in itself to give you D but it makes it more likely you’ll develop D.
2. Natural selection: people who have disease D may be more likely to survive and reproduce if they have allele A.
3. Population stratification: the population contains several genetically distinct subsets, and disease D and allele A both happen to be more common in one particular subset.
4. Type 1 error: association studies test a large number of markers to find significant associations (p < 0.05). However, where there is no true association, 5% of tests will still be significant at p = 0.05 and 1% at p = 0.01 by chance. The data therefore need correction for multiple testing; in the past this was not done adequately, so results could not be replicated.
5. Linkage disequilibrium: the aim of association studies is to discover associations caused by linkage disequilibrium between allele A and disease D.
Linkage analysis is used to track the inheritance of alleles within a family.
Linked markers or alleles are only separated if a recombination event occurs.
The closer a marker is to the disease/susceptibility allele, the less likely it is to be separated from it by recombination over several generations. This leads to a common haplotype which occurs more often than would be expected by chance.
Within an individual family this linkage will extend up to 20 cM, but in association studies it extends only a few kb.
Linkage disequilibrium is the non-random association between two or more alleles located together on the same chromosome.
Two markers with alleles A/a and B/b
Frequency of allele A = p and a = 1 - p
Frequency of allele B = q and b = 1 - q
If there is no association then AB occurs at frequency pq
However, if the frequency of AB > pq then AB must be in positive LD.
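The comparison of the AB frequency with pq can be sketched in a few lines of Python. The function name is hypothetical. The source only describes the raw difference D = freq(AB) - pq; the normalised measures D' and r² are standard additions included here as an assumption.

```python
def linkage_disequilibrium(freq_ab, p, q):
    """Return (D, D_prime, r_squared) for two biallelic markers.

    freq_ab: observed frequency of the AB haplotype
    p:       frequency of allele A (a has frequency 1 - p)
    q:       frequency of allele B (b has frequency 1 - q)
    Assumes 0 < p < 1 and 0 < q < 1.
    """
    # D is the excess of AB over the frequency expected with no association.
    d = freq_ab - p * q
    # The largest |D| possible for these allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p * (1 - p) * q * (1 - q))
    return d, d_prime, r_squared


# Illustrative numbers only: p = q = 0.5, so pq = 0.25, but AB is seen at 0.4.
d, d_prime, r_squared = linkage_disequilibrium(0.4, 0.5, 0.5)
print(d > 0)  # True means positive LD
```

A positive D corresponds to the "frequency of AB > pq" case above; D = 0 means no association.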
Linkage is the relationship between alleles, whilst association is the relationship between alleles and phenotypes.
Association studies do not study families but instead look for differences in allele frequencies between different groups of individuals with defined phenotypes.
For both studies, the disease-causing mutation and/or susceptibility allele does not need to be known. Instead SNPs or other markers such as di-, tri- or tetra-nucleotide repeats which are in linkage disequilibrium with the disease/susceptibility allele are used.
1. Identify SNPs to analyse
2. Genotype all SNPs in subset of the samples
3. Identify tagSNPs
4. Genotype tagSNPs in all samples
5. Analyse data
Work out region of interest, or choose regions of known homology from a mouse or other animal model.
Work out the size of the area you wish to study, e.g. choose a 1 Mb region around your locus of interest and choose one SNP every 500 bp.
If possible include SNPs that have been validated in the same ethnic group as the one you are studying.
Prioritise SNPs with higher minor allele frequencies (>10%)
If looking within genes prioritise possible functional variants e.g. non-synonymous SNPs within exons
Read the current literature to find out if any of the SNPs have been associated with similar phenotypes in other studies
Ensure that there are no SNPs under the primer or probe binding sites which could lead to non-amplification of one allele and skew your results
Due to advances in technology, the majority of current association studies now look at the whole genome: genome-wide association studies (GWAS)
Ensure cases and controls are ethnically matched
Ensure methodology is robust, accurate and high-throughput e.g. SNP arrays - which one? Exonic only? Platform? Cost? No. of SNPs?
Genotype at least 96 controls and if you wish 96 cases
Record the genotypes conservatively i.e. if unsure mark as unknown
Analyse the data to:
Check for deviation from Hardy-Weinberg equilibrium for all alleles - if a deviation is found it is likely that genotyping errors have been made so re-check
Calculate LD scores for SNPs in the region
Identify tagSNPs (also called haplotype tagging or htSNPs)
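The Hardy-Weinberg check on the pilot genotypes can be sketched as a chi-squared goodness-of-fit test on the three genotype counts of one SNP. This is a minimal illustration under stated assumptions, not a prescribed method: the function name and the example counts are hypothetical, and it assumes a biallelic SNP and 1 degree of freedom.

```python
import math


def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Chi-squared test for deviation from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the three genotypes at one SNP.
    Returns (chi2, p_value) using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for 96 controls with no heterozygote deficit or excess.
chi2, p_value = hardy_weinberg_test(24, 48, 24)
print(chi2, p_value)
```

A small p-value would flag the SNP for the genotype re-check described above. With small expected counts an exact test may be more appropriate than this approximation.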
Over 10 million SNPs in the human genome
Linked SNPs are often inherited together as a block and the genotypes of these SNPs can be used to generate a haplotype.
The key SNPs that uniquely define the haplotype are called tagSNPs or haplotype tagging SNPs
The HapMap project started in 2002 as an international collaboration to describe common patterns of genetic variation between individuals
Identified around 500,000 key tagSNPs which can be used to generate inferred haplotypes of surrounding SNPs
This has made genome-wide scans more efficient and comprehensive.
Commercially available SNP arrays have been designed by several companies e.g. Affymetrix and Illumina to cover hundreds of thousands of SNPs across the whole genome.
They can have slightly different target SNPs e.g. Illumina Human-1 focuses on exonic SNPs thus concentrating on potential functional variants.
These arrays use tagSNPs to maximise the amount of data generated by as few SNPs as possible.
In recognition of the potential role of CNVs in complex disease susceptibility many arrays also study CNVs.
Must ensure sufficient cases and controls are tested to reach statistical significance
The lower the odds ratio for an increase in susceptibility, the more samples are required for the testing to reach statistical significance.
It is estimated that common susceptibility loci are likely to have odds ratios (OR) of 1.1 to 1.5.
Therefore, for example, in order to achieve 90% power to detect an allele with 0.2 frequency and an OR of 1.2, more than 6000 affected cases and more than double that number of normal controls are required.
If the frequency of the variant is only 0.05 you would need 20,000 cases.
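How sample size grows as the odds ratio or allele frequency falls can be illustrated with the standard normal-approximation formula for comparing two proportions, applied to allele counts. This is only a sketch. The function name is hypothetical, the genome-wide significance level of 5e-8 and the two-controls-per-case ratio are assumptions rather than values stated in the source, and the quoted frequency is treated as the allele frequency in controls. It need not reproduce the figures above exactly.

```python
import math
from statistics import NormalDist


def cases_needed(control_freq, odds_ratio, power=0.90, alpha=5e-8,
                 controls_per_case=2.0):
    """Approximate number of cases for an allele-count test (odds_ratio != 1)."""
    p0 = control_freq
    # Allele frequency in cases implied by the per-allele odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    k = controls_per_case
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + k * p0) / (1 + k)  # pooled allele frequency
    numerator = (
        z_alpha * math.sqrt((1 + 1 / k) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / k)
    ) ** 2
    case_alleles = numerator / (p1 - p0) ** 2
    # Each person contributes two alleles.
    return math.ceil(case_alleles / 2)


for freq in (0.2, 0.05):
    print(freq, cases_needed(freq, odds_ratio=1.2))
```

Under these assumptions, lowering either the odds ratio or the allele frequency increases the number of cases required, which is the pattern described above.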
Do single-point analysis first by looking at individual SNPs and calculating χ² and odds ratios.
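A single-point test for one SNP can be sketched from a 2x2 table of allele counts in cases and controls. The function name and the counts below are hypothetical, and the chi-squared statistic is the uncorrected Pearson form with 1 degree of freedom, chosen here for simplicity.

```python
import math


def single_snp_association(case_a, case_other, control_a, control_other):
    """Chi-squared statistic, p-value and odds ratio for one SNP.

    Arguments are allele counts: allele A versus the other allele,
    in cases and in controls. All four counts must be non-zero.
    """
    a, b, c, d = case_a, case_other, control_a, control_other
    n = a + b + c + d
    # Pearson chi-squared for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Tail probability of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Odds of carrying allele A in cases relative to controls.
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts: 60/40 in cases, 40/60 in controls.
chi2, p_value, odds_ratio = single_snp_association(60, 40, 40, 60)
print(chi2, p_value, odds_ratio)
```

The same calculation would be repeated for every SNP, which is why a multiple-testing correction is then needed.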
Need to apply a correction for multiple testing, e.g. the Bonferroni correction. This is a conservative correction when studying multiple alleles that are in LD with each other, because it treats the tests as independent when they are not.
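The effect of multiple testing, and of a Bonferroni correction, can be shown with simple arithmetic. The function names are hypothetical and the figure of 500,000 SNPs is used purely for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Number of tests expected to pass the threshold by chance alone."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the overall error rate at alpha."""
    return alpha / n_tests


n_snps = 500_000  # illustrative number of SNPs tested
print(expected_false_positives(n_snps))
print(bonferroni_threshold(n_snps))
```

Because the correction divides by the total number of tests, it over-corrects when many of the SNPs are in LD and so are not independent.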
Once you have tested each individual SNP for association you can then construct haplotypes and study them for association with the disease/trait
Use bioinformatics programs such as HelixTree, SNPHAP and Stata
Because of the problems with sample size for detecting low susceptibility traits, meta-analysis has been increasingly used. Meta-analysis of GWA datasets can increase the power to detect association signals by increasing sample size and by examining more variants throughout the genome than each dataset alone.
In 2007 the Wellcome Trust Case Control Consortium published a GWA study looking at 2,000 cases for each of seven common diseases and 3,000 shared controls.
Found 24 associations: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes.
Linked 10 genes to common disorders that had not previously been implicated
GWA studies of colorectal cancer have found 10 associated SNPs, 5 of which are linked to the TGFβ superfamily signalling pathway
GWA studies have led to the discovery of at least 24 loci linked to type 2 diabetes
Mainly linked to insulin secretion pathway rather than insulin resistance
However it is estimated that these loci only account for 5% of the factors contributing to heritability of T2D
Studies of hundreds of thousands or even millions of individuals will be required to identify low susceptibility alleles
CNV associations have been found for schizophrenia, Alzheimer's disease and Parkinson's disease
Study of gene-gene and gene-environment interactions is crucial, as these may be missed by single-point GWA
The majority of associated variants will not be functional, therefore further work will be required to identify causal variants
SNPs account for 78% of variation in the genome but only 26% of total nucleotide differences
Further study of CNVs will be crucial
Study of rare rather than common variants (1000G)
Study of regulatory variants
Next generation sequencing