Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP...

36
Downstream analysis with SNP markers Part I: Introduction to computer software for data Part I: Introduction to computer software for data analysis Sung-Chur Sim Th Ohi St t Ui it OARDC The Ohio State University, OARDC SolCAP workshop

Transcript of Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP...

Page 1: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Downstream analysis with SNP markersPart I: Introduction to computer software for dataPart I: Introduction to computer software for data

analysis

Sung-Chur SimTh Ohi St t U i it OARDCThe Ohio State University, OARDC

SolCAP workshop

Page 2: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

OutlineOutline

Part I: Introduction to computer softwarePart I: Introduction to computer software• MicroSatellite Analyzer (MSA)• Graphical GenoType (GTT)p yp ( )• STRUCTURE

What can you do using the software?Wh d l d th ft ?Where can you download the software?How can you format input data ?

Part II: The use of STRUCTURE for associationPart II: The use of STRUCTURE for association mapping• Detail steps to generate a Q-matrix using STRUCTURE

Page 3: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

MicroSatellite Analyzer: MSAMicroSatellite Analyzer: MSA

An independent analysis tool for large data sets (Dieringer p y g ( gand Schlӧtterer 2003)

• Descriptive statistics per population and locus (e.g. allelic richness heterozygosity and Shannon index of diversity)richness, heterozygosity, and Shannon index of diversity)

• FST, FIS, and FIT based on the Weir and Cockerham method• FST per locus and population pair ; P-value for FST determined ST ST

by permuting genotypes among groups• Genetic distance including Nei’s standard genetic distance • Converts your data into the formats of GENEPOP• Converts your data into the formats of GENEPOP,

STRUCTURE, ARLEQUIN, etc. Version 4.05 available for Windows, Linux, and Mac: (http://i122server.vu-wien.ac.at/MSA/MSA_download.html)

Page 4: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

MSA: MicroSatellite AnalyzerMSA: MicroSatellite Analyzer

Page 5: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput format

http://i122server.vu-wien.ac.at/MSA/MSA_download.htmlp _

Page 6: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput formatOne or two column format• Specify one (1) or two (2) column p y ( ) ( )

format in the cell A1• Enter name of population in the first

column (no empty cell)( p y )• Specify inbred (h) or outbred (d)

for your species in the second column (no empty cell)( p y )

• Enter group number of population (no empty cell)

• SNP data converted from letter codesSNP data converted from letter codes to numerical coding

• Missing data cam be indicated by -1, nd dot( ) or empty cellnd, dot(.), or empty cell

• Save your data in the format “TAB DELIMITED”

Page 7: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Identify loci that distinguish populations

60

Processing vs. Freshmarket60

Processing vs. Heirloom

20

30

40

50

umbe

r of

loci

20

30

40

50

umbe

r of

lociovate: 0

fw2.2: 0sp6: 0.14

ovate: 0.26fw2.2: 0sp6: 0.73

0

10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Nu

FST

0

10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Nu

FSTFST ST

50

60

i

Freshmarket vs. Heirloom

ovate: 0.31

FST = 0: an allele of a gene is fixed or the gene is under balancing selection

10

20

30

40

50

Num

ber

of lo

ci fw2.2: 0sp6: 0.47

g g

FST = 1: a gene under diversifying selection

0

10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

FST

Page 8: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Graphical GenoType: GGTGraphical GenoType: GGT

A tool for representing molecular marker data by graphical p g y g prepresentation and color coding of chromosomes• Useful for evaluation of plant material and selection of a desired

genotypegenotype

Advanced genetic analyses • Marker-trait association• Genetic distance

Li k di ilib i• Linkage disequilibrium

Version 2.0 available for Windows (http://www.plantbreeding.wur.nl/UK/software_ggt.html)

Page 9: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE
Page 10: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

http://www.plantbreeding.wur.nl/UK/software_ggt.html

Page 11: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput format

Two data files derived from locus and map data pLocus file• Contains data on marker alleles using the MapMaker or

JoinMap type of coding• A plain text file

Page 12: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Locus fileLocus file

Page 13: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput format

Two data files derived from locus and map data pLocus file• Contains data on marker alleles using the MapMaker or

JoinMap type of coding• A plain text file

M filMap file• Specifies marker positions on a linkage map• A plain text fileA plain text file

Page 14: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Map fileMap file

Page 15: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput format

Two data files derived from locus and map data pLocus file• Contains data on marker alleles using the MapMaker or

JoinMap type of coding• A plain text file

M filMap file• Specifies marker positions on a linkage map• A plain text fileA plain text file

Build a GGT file by merging the locus and map files using the ‘Build GGT-file’ optionThe GGT file can also be prepared from an Excel spreadsheet

Page 16: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE
Page 17: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Fresh Market Heirloom Processing Wild

Identify Recombinants

Chromosome 5

Regions of low variation

Chromosome 6

Page 18: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

STRUCTURESTRUCTURE

A model-based clustering method (Pritchard et al. 2000)g ( )• Inferring population structure using multi-locus genotype data • Generating a Q-matrix to correct for population subdivision

d i k t it i ti l i i l l tiduring marker-trait association analysis in complex populations (e.g. breeding populations)

• Identifying migrants and admixed individuals

Version 2.3.3 available for Windows, Linux, and Mac: (http://pritch bsd uchicago edu/structure html)(http://pritch.bsd.uchicago.edu/structure.html)

Page 19: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE
Page 20: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

h // i h b d hi d / h lhttp://pritch.bsd.uchicago.edu/structure.html

Page 21: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Input formatInput format

A matrix where the data for individuals are in rows, the loci are in column• n consecutive rows have the data• n consecutive rows have the data

for each individual of n-ploidspecies

• Integer should be used for coding• Integer should be used for coding genotype

• Missing data should be indicated by a number which doesn’t occura number which doesn t occur elsewhere in the data (e.g. -1)

• The data file should be a text file ( txt) not an excel file ( xls) for(.txt) not an excel file (.xls) for running STRUCTURE

Page 22: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

SummarySummary

Three computer programs, MSA, GGT, and STRUCTURE were introduced for SNP data analysis by providing the following information:

• What can the programs do?• Where can you download them?• Where can you download them?• How can you format input data for each program?

Page 23: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Downstream analysis with SNP markersPart II: The use of STUCTURE software for associationPart II: The use of STUCTURE software for association

mapping of bacterial spot resistance in tomato

Sung-Chur SimTh Ohi St t U i it OARDCThe Ohio State University, OARDC

SolCAP workshop

Page 24: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Bacterial spot in tomatoBacterial spot in tomato

A disease complex caused by speciesA disease complex caused by species of Xanthomonas bacteria.

Five physiological races: T1-T5

Sources of resistance from close relatives of cultivated tomato (Solanum lycopersicum L ) or S(Solanum lycopersicum L.) or S. pimpinellifolium• Hawaii 7998 (T1)

• Hawaii 7981 (T3)

• PI128216 (T3)

• PI114490 (T1 T2 T3 and T4)• PI114490 (T1, T2, T3, and T4)

Page 25: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Association analysis models incorporate a correction for population structure

Unified mixed model (Yu et al. 2006)

Y = μ REPy + Qw + Markerα + Zv + Error

Addi t i Q f l ti t t t f d li k dAdding a matrix, Qw, of population structure can correct for pseudo-linkage and can add insight to which crosses, pedigrees, subpopulations have the highest breeding value

Page 26: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

STRUCTURE analysis

Format marker data

y

The marker data file used in this example is available on the workshop URL: http://pbgworks.org/tomato-workshop (file name: STRUCTURE_InputData.txt)

Page 27: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

STRUCTURE analysis

Format marker data

y

Decide how long to run STRUCTURE (burnin and MCIC)(burnin and MCIC)

Burnin length: how long to run the simulation before collecting data to minimize the effect of the starting configuration (Recommendation: 10,000 ~100,000)

MCIC length: how long to run the simulation after the burnin to get accurate parameter estimates (R d ti 500 000 1 000 000)(Recommendation: 500,000~1,000,000)

Page 28: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

STRUCTURE analysis

Format marker data

y

Decide how long to run STRUCTURE (burnin and MCIC) ( )

Run simulations 20 times for each of several different Ks

Identify the best K based on the log likelihood values from the 20 simulations for each K using the non parametric and/or ∆K20 simulations for each K using the non-parametric and/or ∆K

methods

Page 29: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Inference of best K (number of populations)Inference of best K (number of populations)

The log likelihood for each K Ln P(D) = L(K)The log likelihood for each K, Ln P(D) = L(K)

Two approaches to determine the best K

1. Use of L(K): When K is approaching a true value, L(K) plateaus (or continues increasing slightly) and has high variance between runs (Rosenberg et al. 2001, E t l 2005)Evanno et al. 2005).

nonparametric test (Wilcoxin test)

2. Use of an ad hoc quantity (∆K): Calculated based on q y ( )the second order rate of change of the likelihood (∆K) (Evanno et al. 2005). The ∆K shows a clear peak at the true value of K.

∆K = m([L’’K])/s[L(K)]∆K = m([L K])/s[L(K)]

Evanno et al. 2005. Molecular Ecology 14: 2611-2620

Page 30: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Log likelihood values

Page 31: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Inference of best K using the delta K methodg

The best K = 8

L(K) = an average of 20 values of Ln P(D) L’(K) = L(K)n – L(K)n-1 The Excel file used in this example isL’’(K) = L’(K)n – L’(K)n-1Delta K = [L’’(K)]/Stdev

The Excel file used in this example is available on the workshop URL:http://pbgworks.org/tomato-workshop(file name: The best K analysis.xls)

Page 32: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

STRUCTURE analysis

Format marker data

y

Decide how long to run STRUCTURE (burnin and MCIC) ( )

Run simulations 20 times for each of several different Ks

Identify the best K based on the log likelihood values from the 20 simulations for each K using the non parametric and/or ∆K20 simulations for each K using the non-parametric and/or ∆K

methods

Retrieve a Q-matrix of the best K

Page 33: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Bar plot showing clustering of individual genotype

Page 34: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

Q-matrix

Page 35: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

SAS codesSAS codes

%macro Mol(mark);%macro Mol(mark);proc mixed data = three;class &mark gen rep;

d l T1 1 2 3 4 5 6 7 8 & k /model T1 = pop1 pop2 pop3 pop4 pop5 pop6 pop7 pop8 &mark / solution;random gen rep; Qwg p%mend;

%Mol(M1);%Mol(M2);

Markerα

%Mol(M2);%Mol(M3);%Mol(M4);

run; The SAS code used in this example is available on the workshop URL: http://pbgworks.org/tomato-workshop(file name: SAScode.txt)

Page 36: Downstream analysis with SNP markers - PBGworkspbgworks.org/sites/pbgworks.org/files/06 Sim SNP Markers_0.pdf · Downstream analysis with SNP markers Part II: The use of STUCTURE

SummarySummarySTRUCTURE is a useful tool to detect population subdivisionsubdivision

Th f th Q t i t f b l tiThe use of the Q-matrix can correct for subpopulations during association analysis in breeding populations; avoids detection of false-positivesavoids detection of false-positives

The SNP resources from SolCAP are a powerfulThe SNP resources from SolCAP are a powerful survey tool; we should be thinking beyond bi-parental populations toward analysis of complex breeding p p y p gpopulations