# LDPS - Newell

date post

10-Apr-2022Category

## Documents

view

0download

0

Embed Size (px)

### Transcript of LDPS - Newell

Microsoft Word - LDPS - Newell.docxA Short Course On Linkage Disequilibria And Population Structure Developed By Mark Newell,

With Editorial Contributions By Bill Beavis And Jean-Luc Jannink

1. Introduction Molecular markers can be used in a variety of ways to explore or make inferences about populations. These can include exploring population structure, looking at the extent of linkage disequilibrium, quantitative trait locus (QTL) mapping, marker-assisted selection, and genomic selection. The objectives of this module are to:

i. Learn about the major sources of genetic variation and their role in breeding. ii. Grasp the concept of population structure in populations of breeding lines,

become aware of the ways in which it arises, and how it may affect analyses. iii. Understand linkage disequilibrium related to its importance for identification of

QTL, factors responsible for the generation and decay of LD, and population/genetic parameters that affect the level of LD in populations.

iv. Explore and visualize the level of population structure using principal components analysis.

v. Define discrete subpopulations for genetic data by understanding and implementing cluster analysis.

2. Genetic diversity

Genetic diversity is the variation on which the breeder is able to select. It is most often described for a random mating population where no selection occurs between individuals. There are three major sources of genetic variation.

i. Mutation, a random change in DNA sequence, is the ultimate source of all variation in populations.

ii. Recombination from crossing individuals does not create new alleles but creates new combinations of alleles.

iii. Migration, from a breeding perspective is bringing in new alleles from other sources including germplasm banks or breeders from other regions.

3. Population structure

Genetic relatedness within a population of lines is not always constant across pairs of individuals; this leads to some level of population structure. Population structure can be defined as the differential relatedness between lines in a population. The different levels of relatedness that result in population structure are most often caused by two phenomena in breeding populations, subpopulation structure and pedigree. Differential relatedness caused by pedigree simply refers to the fact that individuals within a family are more related to each other than to individuals outside of the family. This idea can be extended to larger groups of individuals that are not related by known pedigree. A group of individuals in a population that are more similar to each other than to individuals outside the group can be referred to as a subpopulation. Thus, the variation present in a population due to structure can be described in terms of the variation within a subpopulation versus the variation between subpopulations.

• A clear example of subpopulation structure can be seen in barley where structure

is largely due to three factors, including morphology, phenology, and end-use. The figure below shows the subpopulation structure for barley with two distinctive groups within each factor. In the case for barley, breeders generally cross lines within each class, which results in differing levels of relatedness between groups of lines.

• Structure due to pedigree is usually a result that occurs due to strategies implemented by breeders. For example, breeders tend to identify lines with good performance and recycle them in a breeding program to develop new populations for selection.

Study Questions:

1) How can the three major sources of genetic variation be applied in a breeding program to increase genetic diversity?

2) What are the driving factors that result in population structure for a crop of your choice? Are the identified factors a product of subpopulation structure and/or pedigree?

4. Linkage disequilibrium (Flint-Garcia et al. 2003)

• Linkage disequilibrium (LD) is the non-independence of alleles at two loci. This concept can be illustrated in a simple 2x2 contingency table. The table below shows the case for two loci, A and B (each with two alleles), when the loci are in linkage equilibrium. In this case, the joint probability (shaded red) is equal to the product of the marginal probabilities (shaded blue), thus the alleles at locus A and B are independent. Intuitively, “independence” means that knowing what allele is present at locus A does not help you guess what allele is present at locus B (or vice versa).

Locus A

Barley

pApB+pAqB=pA qApB+qAqB=qA

For the case when there is linkage disequilibrium between the two loci, the joint probability does not equal the product of the marginal probabilities, instead there is a deviation denoted as D (disequilibrium). In this situation, the probability of an allele at one locus is dependent on the allele at the other locus and vice versa. Thus, linkage disequilibrium can be thought of as the dependence of alleles at two loci. Intuitively, dependence means that knowing what allele is present at locus A does help you guess what allele is present at locus B. In the table below, for example, if D is positive and allele A1 is present, the probability that B1 is present is greater than pB.

Locus A

Pr(B1)=pB Pr(A1B1)=pApB+D Pr(A2B1)=qApB-D pApB+qApB=pB

Pr(B2)=qB Pr(A1B2)=pAqB-D Pr(A2B2)=qAqB+D pAqB+qAqB=qB

pApB+pAqB=pA qApB+qAqB=qA

Study Questions:

1) Describe the concept of LD in your own words. 2) In the tables above, use algebra to show that pApB+qApB=pB. Explain why in a given

row or column of the table above, one cell has +D and the other –D.

• There are multiple measures of LD between two loci, including D, D’, and r2.

o The first and simplest method of estimating LD is denoted D and is calculated as:

D=pAB-pApB

In this calculation, D is equal to the difference between the joint frequency of the two alleles (pAB) and the product of their marginal frequencies. The value D ranges between -0.25 and 0.25 and is highly dependent on allele frequencies.

o Standardized D, denoted D’, is scaled based on the observed allele frequencies, therefore it will range from zero to one. It is calculated as:

D’=–D/min(pAqB,qApB) for D<0 D’=D/min(pApB,qAqB) for D>0

In this standardization procedure, D’ is less dependent on allele frequencies than D, although if one haplotype has a low frequency, D’ is often close to one.

o Lastly, the most common estimate of LD is r2, or the squared correlation coefficient between two loci. It is calculated as:

r2=D2/pAqApBqB

This estimate is typically the preferred estimate for genome-wide association studies.

Study Questions:

1) Scores for eight individuals at two SNP sites are shown in the table below. What is the level of LD measured as D, D’, and r2 for this data?

Si

2

Ind. 1 A G Ind. 2 A G Ind. 3 A G Ind. 4 A G Ind. 5 T G Ind. 6 T G Ind. 7 T C Ind. 8 T C

2) Consider a variable x = 1 when allele A1 is present and x=0 when allele A2 is present.

Likewise, a variable y = 1 when allele B1 is present and y=0 when allele B2 is present. Calculate the covariance between x and y. Give this covariance in terms of pA, qA, pB, and qB.

• Why is LD important?

o In general, we do not know what loci in the genome affect a trait, and we

cannot look at all loci because there are too many of them. Instead, we look at a subset of loci that can be scored using DNA marker technologies. These markers themselves do not affect the phenotype, but LD between a marker and a QTL generates a relationship between the marker and the trait, giving evidence of the importance of that genomic region. LD is also important because it estimates the ability to predict an allele present at one locus from the allele present at another locus. Imagine that a marker locus (M) and a QTL locus (Q) are in LD with r2 = 0.4. Although the breeder cannot select directly for Q, since M explains approximately 40% of the variation at Q, then selecting on M shifts the frequency of Q, resulting in the desired change of phenotype.

• Causes of LD: There are four major driving factors causing LD between loci. These include mutation, migration, drift/inbreeding, and population structure. Figures below depict the four causes of LD show haploid genotypes (gametes or haplotypes) in populations.

o Mutation: LD in the base population is equal to zero because the ‘b’

locus is monomorphic. After a single mutation occurs in one of the haplotypes, namely from ‘b’ to ‘B,’ the LD between the A and B loci is no longer equal to zero, instead it is 0.05. Thus, a single mutation in a population can lead to LD between to loci.

o Migration: A small population of haplotypes migrate into a larger population, where the alleles have different states. Alleles having different states simply means that the alleles are fixed at both loci for differing alleles. As seen, the migrating haplotypes are fixed for the A and B loci with the ‘a’ and ‘b’ aleles, respectively, and the base population is fixed for ‘A’ and ‘B.’ The influx of migrating haplotypes with allele frequencies differing from those of the base population increases the LD from zero to 0.18.

o Drift/Sampling: In the example below, when drift due to inbreeding occurs by random sampling alone, the ‘A’ allele happened to be transmitted more often with the ‘B’ allele, leading to LD between the two loci.

pAB-pApB= 0 pAB-pApB= 0.05

A B b

a

ba

o Population structure: Two distinct subpopulations are present and LD is zero when calculated within subpopulations. In contrast, the level of LD between locus A and B increases to 0.25 when the subpopulations are taken together.

• Decay of LD o Recombination is the only force systematically causing LD to decay.

⊗

A B

a b

a b

a b

a b

a b

a b

a b

pAB-pApB= 0.25

over longer periods of time for loci that are closer than for loci that are farther apart.

• Population/genetic parameters that affect of LD o Linkage: From the graphs above it can be seen that when two loci have

tighter linkage (less recombination) the LD between them is more persistent over time. Likewise, for two loci that are loosely linked (greater recombination), LD decays at a faster rate. In fact, the expected value of LD measured as r2 is

! = 1

1 + 4!

where ! is the effective population size and is the recombination rate. From this expectation we can see that as the recombination rate, c, increases the LD, r2, decreases.

o Effective population size: Application of the same expectation of LD as above, shows that as the effective population size, !, is increased, the extent of LD decreases. This occurs because as Ne increases, the power of drift to increase LD decreases and so the expectation reaches a lower level.

o Mating system: The extent of LD is generally higher for autogamous than allogamous crops. The reason for this is that for an autogamous crop there are fewer effective recombination events. An effective recombination is a crossover event that results in recombination generating new (non-parental) allele combinations. If a locus is homozygous in an individual, recombinations with it are not effective.

Study Questions:

1) Is the concept of LD important for QTL mapping in F2 populations? 2) What is the difference in the level of LD in an F2 compared to and F7 population?

What would the implications of this be on the marker density requirement and mapping resolution?

3) Verify that a recombination with a homozygous locus does not create allele combinations that are not already present in the parent.

Generation

Recombination

0.001

0.01

0.05

0.1

0.2

0.5

Generation

Recombination

0.001

0.01

0.05

0.1

0.2

0.5

5. Principal components analysis (PCA)

The major objective of PCA is to summarize and visualize genetic diversity for a set of lines based on marker data. Most often, marker data includes hundreds to thousands of markers, PCA allows us to summarize and visualize this diversity.

• Imagine we have two variables, denoted x1 and x2 with the following relationship (left). The first PC, also called the first eigenvector, can be thought of as the perpendicular distances (blue line) of a linear regression through the direction of maximum variance. As such, the second PC follows the same definition except that it is through the direction of maximum variance orthogonal to all previous PCs, in this case PC1. This means that each PC is uncorrelated to all previous, an important characteristic of PCA. After PCA is computed (right), hidden structure in the data can be more easily seen.

• PCA also summarizes the diversity in terms of variation. The first eigenvalue is

the percent of variation explained by the first PC, for the preceding example this is equal to 99.7%. As such, the second eigenvalue is equal to 0.3%. Since the first PC is the direction of maximum variance, its eigenvalue is always the largest and each consecutive PC accounts for less than the one before.

• The above example is simple compared to a typical data set comprised of hundreds or thousands of lines and markers. The following example is a set of 1816 barley lines scored for 1416 SNPs (Hamblin et al. 2010). For a data set of this size, it is nearly impossible to explore its diversity without PCA. By plotting PC1 versus PC2, one can see that there are at least four distinct clusters within this data set. These distinct clusters represent 2-row, 6-row, spring, and winter types. For this example, PC1 and PC2 account for 24.5 and 10.1% of the variation in the data.

x1

x2

5

10

15

P C 2

PCA

Study Questions:

1) PCA is a robust approach that can be used for a wide variety of data sets besides genotypes, what other types of data related to plant breeding could PCA be applied?

2) The PCs can be thought of as a subset of variables that explain the majority of the variation for a given data set. If the first few PCs explain a lot of the variation, what is explained by the larger PCs?

6. Cluster Analysis

Similar to PCA, the purpose of cluster analysis is to explore genetic diversity, but in a more clear-cut fashion.

a) A common approach to cluster analysis is K-means clustering, where K is a pre- determined number of clusters. This is an iterative procedure with the following steps:

i. Initialization: An initial set number of K means (seed points) are

determined; these are the initial means for each of K clusters. ii. Each observation is then assigned to the nearest cluster mean.

iii. Means for each cluster are then re-calculated and observations are re- assigned to the nearest K mean.

iv. Steps ii and iii are then repeated until no more changes occur.

• Running K-means clustering on the barley data set from above where K is equal to 6 gives the following results:

PC1 (24.5%) P

-20 -10 0 10 20

• By looking at the PC plots with each line colored by the cluster, we can start to visualize the different clusters with respect to one another. The PC plot of PC1 versus PC3 also demonstrates the value of PCs beyond PC1 and PC2. Although PC3 accounts for only 4.5% of the variation in the data, it adequately separates what seemed to be a single cluster when looking at PC1 and PC2, the bottom left group with low PC1 and low PC2 values.

b) Another common approach to cluster analysis for genetic data is hierarchical

clustering. This approach sequentially lumps or splits observations to make clusters. • Applying the hierarchical approach to the barley data set we can visualize the

results using a cluster dendrogram. Observations are arrayed along the x- axis and the y-axis refers to the distance between breakpoints. For example, the horizontal line at 4e+05 indicates that there are two major groups with a distance between them of 4e+05. The user determines the height (distance along the y-axis) at which a horizontal line is drawn and the number of clusters is chosen, this is drawn below in red for 6 clusters. The user may determine this by using the PC plots, cluster dendrogram, and any prior information that is known about the germplasm.

PC1 (24.5%)

P C

2 (1

0. 1%

P C

3 (4

0e +0 0

1e +0 5

2e +0 5

3e +0 5

4e +0 5

M an

ha tta

n D

is ta

nc e

• Hierarchical clustering is implemented on a matrix of distance between every pair of observations called the distance matrix. For the example above the Manhattan distance was used but another common distance is the Euclidean distance. Additionally, the user must also define the type of linkage method to use, this is the criterion which the algorithm uses to lump or split observations. For genetic data, the most common linkage method is Ward’s linkage that attempts to minimize the variance within clusters and maximize the variance between clusters.

• Similar to K-means clustering, we can look at the PC plots to explore the results for hierarchical clustering to see how the lines were assigned to clusters.

Study Questions:

1) It is apparent that results differ with respect to the clustering method, propose a design to explore a data set combining PCA and various cluster analyses.

2)

P C

3 (4

With Editorial Contributions By Bill Beavis And Jean-Luc Jannink

1. Introduction Molecular markers can be used in a variety of ways to explore or make inferences about populations. These can include exploring population structure, looking at the extent of linkage disequilibrium, quantitative trait locus (QTL) mapping, marker-assisted selection, and genomic selection. The objectives of this module are to:

i. Learn about the major sources of genetic variation and their role in breeding. ii. Grasp the concept of population structure in populations of breeding lines,

become aware of the ways in which it arises, and how it may affect analyses. iii. Understand linkage disequilibrium related to its importance for identification of

QTL, factors responsible for the generation and decay of LD, and population/genetic parameters that affect the level of LD in populations.

iv. Explore and visualize the level of population structure using principal components analysis.

v. Define discrete subpopulations for genetic data by understanding and implementing cluster analysis.

2. Genetic diversity

Genetic diversity is the variation on which the breeder is able to select. It is most often described for a random mating population where no selection occurs between individuals. There are three major sources of genetic variation.

i. Mutation, a random change in DNA sequence, is the ultimate source of all variation in populations.

ii. Recombination from crossing individuals does not create new alleles but creates new combinations of alleles.

iii. Migration, from a breeding perspective is bringing in new alleles from other sources including germplasm banks or breeders from other regions.

3. Population structure

Genetic relatedness within a population of lines is not always constant across pairs of individuals; this leads to some level of population structure. Population structure can be defined as the differential relatedness between lines in a population. The different levels of relatedness that result in population structure are most often caused by two phenomena in breeding populations, subpopulation structure and pedigree. Differential relatedness caused by pedigree simply refers to the fact that individuals within a family are more related to each other than to individuals outside of the family. This idea can be extended to larger groups of individuals that are not related by known pedigree. A group of individuals in a population that are more similar to each other than to individuals outside the group can be referred to as a subpopulation. Thus, the variation present in a population due to structure can be described in terms of the variation within a subpopulation versus the variation between subpopulations.

• A clear example of subpopulation structure can be seen in barley where structure

is largely due to three factors, including morphology, phenology, and end-use. The figure below shows the subpopulation structure for barley with two distinctive groups within each factor. In the case for barley, breeders generally cross lines within each class, which results in differing levels of relatedness between groups of lines.

• Structure due to pedigree is usually a result that occurs due to strategies implemented by breeders. For example, breeders tend to identify lines with good performance and recycle them in a breeding program to develop new populations for selection.

Study Questions:

1) How can the three major sources of genetic variation be applied in a breeding program to increase genetic diversity?

2) What are the driving factors that result in population structure for a crop of your choice? Are the identified factors a product of subpopulation structure and/or pedigree?

4. Linkage disequilibrium (Flint-Garcia et al. 2003)

• Linkage disequilibrium (LD) is the non-independence of alleles at two loci. This concept can be illustrated in a simple 2x2 contingency table. The table below shows the case for two loci, A and B (each with two alleles), when the loci are in linkage equilibrium. In this case, the joint probability (shaded red) is equal to the product of the marginal probabilities (shaded blue), thus the alleles at locus A and B are independent. Intuitively, “independence” means that knowing what allele is present at locus A does not help you guess what allele is present at locus B (or vice versa).

Locus A

Barley

pApB+pAqB=pA qApB+qAqB=qA

For the case when there is linkage disequilibrium between the two loci, the joint probability does not equal the product of the marginal probabilities, instead there is a deviation denoted as D (disequilibrium). In this situation, the probability of an allele at one locus is dependent on the allele at the other locus and vice versa. Thus, linkage disequilibrium can be thought of as the dependence of alleles at two loci. Intuitively, dependence means that knowing what allele is present at locus A does help you guess what allele is present at locus B. In the table below, for example, if D is positive and allele A1 is present, the probability that B1 is present is greater than pB.

Locus A

Pr(B1)=pB Pr(A1B1)=pApB+D Pr(A2B1)=qApB-D pApB+qApB=pB

Pr(B2)=qB Pr(A1B2)=pAqB-D Pr(A2B2)=qAqB+D pAqB+qAqB=qB

pApB+pAqB=pA qApB+qAqB=qA

Study Questions:

1) Describe the concept of LD in your own words. 2) In the tables above, use algebra to show that pApB+qApB=pB. Explain why in a given

row or column of the table above, one cell has +D and the other –D.

• There are multiple measures of LD between two loci, including D, D’, and r2.

o The first and simplest method of estimating LD is denoted D and is calculated as:

D=pAB-pApB

In this calculation, D is equal to the difference between the joint frequency of the two alleles (pAB) and the product of their marginal frequencies. The value D ranges between -0.25 and 0.25 and is highly dependent on allele frequencies.

o Standardized D, denoted D’, is scaled based on the observed allele frequencies, therefore it will range from zero to one. It is calculated as:

D’=–D/min(pAqB,qApB) for D<0 D’=D/min(pApB,qAqB) for D>0

In this standardization procedure, D’ is less dependent on allele frequencies than D, although if one haplotype has a low frequency, D’ is often close to one.

o Lastly, the most common estimate of LD is r2, or the squared correlation coefficient between two loci. It is calculated as:

r2=D2/pAqApBqB

This estimate is typically the preferred estimate for genome-wide association studies.

Study Questions:

1) Scores for eight individuals at two SNP sites are shown in the table below. What is the level of LD measured as D, D’, and r2 for this data?

Si

2

Ind. 1 A G Ind. 2 A G Ind. 3 A G Ind. 4 A G Ind. 5 T G Ind. 6 T G Ind. 7 T C Ind. 8 T C

2) Consider a variable x = 1 when allele A1 is present and x=0 when allele A2 is present.

Likewise, a variable y = 1 when allele B1 is present and y=0 when allele B2 is present. Calculate the covariance between x and y. Give this covariance in terms of pA, qA, pB, and qB.

• Why is LD important?

o In general, we do not know what loci in the genome affect a trait, and we

cannot look at all loci because there are too many of them. Instead, we look at a subset of loci that can be scored using DNA marker technologies. These markers themselves do not affect the phenotype, but LD between a marker and a QTL generates a relationship between the marker and the trait, giving evidence of the importance of that genomic region. LD is also important because it estimates the ability to predict an allele present at one locus from the allele present at another locus. Imagine that a marker locus (M) and a QTL locus (Q) are in LD with r2 = 0.4. Although the breeder cannot select directly for Q, since M explains approximately 40% of the variation at Q, then selecting on M shifts the frequency of Q, resulting in the desired change of phenotype.

• Causes of LD: There are four major driving factors causing LD between loci. These include mutation, migration, drift/inbreeding, and population structure. Figures below depict the four causes of LD show haploid genotypes (gametes or haplotypes) in populations.

o Mutation: LD in the base population is equal to zero because the ‘b’

locus is monomorphic. After a single mutation occurs in one of the haplotypes, namely from ‘b’ to ‘B,’ the LD between the A and B loci is no longer equal to zero, instead it is 0.05. Thus, a single mutation in a population can lead to LD between to loci.

o Migration: A small population of haplotypes migrate into a larger population, where the alleles have different states. Alleles having different states simply means that the alleles are fixed at both loci for differing alleles. As seen, the migrating haplotypes are fixed for the A and B loci with the ‘a’ and ‘b’ aleles, respectively, and the base population is fixed for ‘A’ and ‘B.’ The influx of migrating haplotypes with allele frequencies differing from those of the base population increases the LD from zero to 0.18.

o Drift/Sampling: In the example below, when drift due to inbreeding occurs by random sampling alone, the ‘A’ allele happened to be transmitted more often with the ‘B’ allele, leading to LD between the two loci.

pAB-pApB= 0 pAB-pApB= 0.05

A B b

a

ba

o Population structure: Two distinct subpopulations are present and LD is zero when calculated within subpopulations. In contrast, the level of LD between locus A and B increases to 0.25 when the subpopulations are taken together.

• Decay of LD o Recombination is the only force systematically causing LD to decay.

⊗

A B

a b

a b

a b

a b

a b

a b

a b

pAB-pApB= 0.25

over longer periods of time for loci that are closer than for loci that are farther apart.

• Population/genetic parameters that affect of LD o Linkage: From the graphs above it can be seen that when two loci have

tighter linkage (less recombination) the LD between them is more persistent over time. Likewise, for two loci that are loosely linked (greater recombination), LD decays at a faster rate. In fact, the expected value of LD measured as r2 is

! = 1

1 + 4!

where ! is the effective population size and is the recombination rate. From this expectation we can see that as the recombination rate, c, increases the LD, r2, decreases.

o Effective population size: Application of the same expectation of LD as above, shows that as the effective population size, !, is increased, the extent of LD decreases. This occurs because as Ne increases, the power of drift to increase LD decreases and so the expectation reaches a lower level.

o Mating system: The extent of LD is generally higher for autogamous than allogamous crops. The reason for this is that for an autogamous crop there are fewer effective recombination events. An effective recombination is a crossover event that results in recombination generating new (non-parental) allele combinations. If a locus is homozygous in an individual, recombinations with it are not effective.

Study Questions:

1) Is the concept of LD important for QTL mapping in F2 populations? 2) What is the difference in the level of LD in an F2 compared to and F7 population?

What would the implications of this be on the marker density requirement and mapping resolution?

3) Verify that a recombination with a homozygous locus does not create allele combinations that are not already present in the parent.

Generation

Recombination

0.001

0.01

0.05

0.1

0.2

0.5

Generation

Recombination

0.001

0.01

0.05

0.1

0.2

0.5

5. Principal components analysis (PCA)

The major objective of PCA is to summarize and visualize genetic diversity for a set of lines based on marker data. Most often, marker data includes hundreds to thousands of markers, PCA allows us to summarize and visualize this diversity.

• Imagine we have two variables, denoted x1 and x2 with the following relationship (left). The first PC, also called the first eigenvector, can be thought of as the perpendicular distances (blue line) of a linear regression through the direction of maximum variance. As such, the second PC follows the same definition except that it is through the direction of maximum variance orthogonal to all previous PCs, in this case PC1. This means that each PC is uncorrelated to all previous, an important characteristic of PCA. After PCA is computed (right), hidden structure in the data can be more easily seen.

• PCA also summarizes the diversity in terms of variation. The first eigenvalue is

the percent of variation explained by the first PC, for the preceding example this is equal to 99.7%. As such, the second eigenvalue is equal to 0.3%. Since the first PC is the direction of maximum variance, its eigenvalue is always the largest and each consecutive PC accounts for less than the one before.

• The above example is simple compared to a typical data set comprised of hundreds or thousands of lines and markers. The following example is a set of 1816 barley lines scored for 1416 SNPs (Hamblin et al. 2010). For a data set of this size, it is nearly impossible to explore its diversity without PCA. By plotting PC1 versus PC2, one can see that there are at least four distinct clusters within this data set. These distinct clusters represent 2-row, 6-row, spring, and winter types. For this example, PC1 and PC2 account for 24.5 and 10.1% of the variation in the data.

x1

x2

5

10

15

P C 2

PCA

Study Questions:

1) PCA is a robust approach that can be used for a wide variety of data sets besides genotypes, what other types of data related to plant breeding could PCA be applied?

2) The PCs can be thought of as a subset of variables that explain the majority of the variation for a given data set. If the first few PCs explain a lot of the variation, what is explained by the larger PCs?

6. Cluster Analysis

Similar to PCA, the purpose of cluster analysis is to explore genetic diversity, but in a more clear-cut fashion.

a) A common approach to cluster analysis is K-means clustering, where K is a pre- determined number of clusters. This is an iterative procedure with the following steps:

i. Initialization: An initial set number of K means (seed points) are

determined; these are the initial means for each of K clusters. ii. Each observation is then assigned to the nearest cluster mean.

iii. Means for each cluster are then re-calculated and observations are re- assigned to the nearest K mean.

iv. Steps ii and iii are then repeated until no more changes occur.

• Running K-means clustering on the barley data set from above where K is equal to 6 gives the following results:

PC1 (24.5%) P

-20 -10 0 10 20

• By looking at the PC plots with each line colored by the cluster, we can start to visualize the different clusters with respect to one another. The PC plot of PC1 versus PC3 also demonstrates the value of PCs beyond PC1 and PC2. Although PC3 accounts for only 4.5% of the variation in the data, it adequately separates what seemed to be a single cluster when looking at PC1 and PC2, the bottom left group with low PC1 and low PC2 values.

b) Another common approach to cluster analysis for genetic data is hierarchical

clustering. This approach sequentially lumps or splits observations to make clusters. • Applying the hierarchical approach to the barley data set we can visualize the

results using a cluster dendrogram. Observations are arrayed along the x- axis and the y-axis refers to the distance between breakpoints. For example, the horizontal line at 4e+05 indicates that there are two major groups with a distance between them of 4e+05. The user determines the height (distance along the y-axis) at which a horizontal line is drawn and the number of clusters is chosen, this is drawn below in red for 6 clusters. The user may determine this by using the PC plots, cluster dendrogram, and any prior information that is known about the germplasm.

PC1 (24.5%)

P C

2 (1

0. 1%

P C

3 (4

0e +0 0

1e +0 5

2e +0 5

3e +0 5

4e +0 5

M an

ha tta

n D

is ta

nc e

• Hierarchical clustering is implemented on a matrix of distance between every pair of observations called the distance matrix. For the example above the Manhattan distance was used but another common distance is the Euclidean distance. Additionally, the user must also define the type of linkage method to use, this is the criterion which the algorithm uses to lump or split observations. For genetic data, the most common linkage method is Ward’s linkage that attempts to minimize the variance within clusters and maximize the variance between clusters.

• Similar to K-means clustering, we can look at the PC plots to explore the results for hierarchical clustering to see how the lines were assigned to clusters.

Study Questions:

1) It is apparent that results differ with respect to the clustering method, propose a design to explore a data set combining PCA and various cluster analyses.

2)

P C

3 (4