Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)

Overall goal:

–  User-friendly access to relevant information on plant genetic resources.

–  Increased utilization of germplasm for genetic diversity in food crops.

Strategies to improve the utilization of germplasm in seedbank collections to increase the genetic diversity of food crops for enhanced food security.


•  Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.

•  How does the scientist select a small subset likely to have the useful trait?

•  More than 560 000 wheat accessions in genebanks worldwide.

Slide adapted from a slide by Ken Street, ICARDA (FIGS team)

  “I am screening for variations in powdery mildew resistance genes can you send me 1200 landrace accessions of bread wheat”…

  “I am screening for drought – could you send me some landraces from Afghanistan and some other dry countries”…

  “I am screening for rust can you send me 9000 bread wheat samples”…

  “I am looking for new salt tolerance genes can you send me some wild relatives from salty areas”…

  “I want about 500 bread durum acc to screen for RWA”…

  “I am screening for Sunn Pest and can handle about 200 acc – can you send me a selection of Triticum species”…


•  The scientist or the breeder needs a smaller subset to cope with the field screening experiments.

•  A common approach is to create a so-called core collection.

Sir Otto H. Frankel (1900-1998) proposed that a limited or "core collection" could be established from an existing collection. With minimum similarity between its entries, the core collection is of limited size and chosen to represent the genetic diversity of a large collection, a crop, a wild species or group of species (1984).

•  Given that the trait property you are looking for is relatively rare:

•  Perhaps as rare as a unique allele for one single landrace cultivar...

•  Getting what you want is largely a question of LUCK!
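The "luck" argument can be put in numbers. The following is a minimal sketch, assuming a purely random draw without replacement (a hypergeometric model); the carrier count and the request size of 500 are illustrative assumptions, only the 560 000 figure comes from the slides.

```python
def prob_at_least_one(n_total: int, n_carriers: int, n_sample: int) -> float:
    """Probability that a random sample, drawn without replacement, contains
    at least one accession carrying the trait (hypergeometric model)."""
    p_none = 1.0
    for i in range(n_sample):
        # Chance that draw number i also misses every carrier.
        p_none *= (n_total - n_carriers - i) / (n_total - i)
    return 1.0 - p_none


# 560 000 wheat accessions (from the slides); one carrier and a request for
# 500 accessions are assumptions chosen only for illustration.
p = prob_at_least_one(560_000, 1, 500)
print(f"Chance of catching the single carrier: {p:.5f}")
```

With a single carrier the result reduces to the sampling fraction (500 / 560 000), which is why a random subset or a small core collection will rarely capture a rare allele.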


Objective of this study:

–  Explore climate data as a prediction model for “pre-screening” of crop traits BEFORE full scale field trials.

–  Identification of landraces with a higher probability of holding an interesting trait property.

•  Primitive crops and traditional landraces are the source of exotic traits and crop properties.

•  Landraces are an interesting source of novel traits for improvement of modern crops.

•  Landraces are often not described for the economically valuable trait in question.

•  Identification of crop traits is often the result of a larger field trial screening project (thousands of individual plants).

•  Large scale field trials are very costly (land area and human working hours).


The underlying assumption is that the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait.

The aim is to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).


Wild relatives are shaped by climate

Primitive cultivated crops are shaped by climate and humans

Traditional cultivated crops (landraces) are shaped by climate and humans

Modern cultivated crops (cultivars) are mostly shaped by humans (plant breeders)

Perhaps future crops are shaped in the molecular laboratory…?

1)  Landrace samples (genebank seed accessions)

2)  Trait observations (experimental design)

3)  Climate data (for the landrace origin locations)

•  The accession identifier (accession number) provides the bridge to the crop trait observations.

•  The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
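The two "bridges" can be sketched as a simple join. This is only an illustration under stated assumptions: the field names (`acc_num`, `lat`, `lon`), the function `link_records` and the placeholder climate lookup are hypothetical, and the trait values are made up; only the NGB27 coordinates come from the accession table in this talk.

```python
def link_records(passport, trait_obs, climate_lookup):
    """Join trait observations to climate data via the accession number
    (trait side) and the collecting-site coordinates (climate side)."""
    rows = []
    for obs in trait_obs:
        site = passport.get(obs["acc_num"])
        if site is None:
            continue  # accession without passport data cannot be linked
        climate = climate_lookup(site["lat"], site["lon"])
        rows.append({**obs, **site, **climate})
    return rows


# Hypothetical example data: coordinates of NGB27 are from the talk,
# trait values and the climate lookup are placeholders.
passport = {"NGB27": {"lat": 61.0333, "lon": 27.3333}}
trait_obs = [
    {"acc_num": "NGB27", "trait": "heading_days", "value": 60.0},
    {"acc_num": "UNKNOWN", "trait": "heading_days", "value": 58.0},
]


def dummy_climate(lat, lon):
    """Stand-in for a real lookup in a gridded climate surface."""
    return {"prec_annual": None}


linked = link_records(passport, trait_obs, dummy_climate)
print(linked)
```

Accessions without coordinates drop out of the join, which is why georeferencing of the collecting sites matters for this approach.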


More than 6 million genebank accessions, more than 1 400 genebanks, worldwide.

http://barley.ipk-gatersleben.de

Photo captions:

Powdery mildew, Blumeria graminis

Leaf spots, Ascochyta sp.

Yellow rust, Puccinia striiformis

Black stem rust, Puccinia graminis

Faba bean, Finland

Field trials, Gatersleben, Germany

Forage crops, Dotnuva, Lithuania

Radish (S. Jeppson)

Cauliflower (S. Jeppson)

Linnés äpple

The climate data is extracted from the WorldClim dataset. http://www.worldclim.org/

Data from weather stations worldwide are combined to a continuous surface layer.

Climate data for each landrace is extracted from this surface layer.

Precipitation: 20 590 stations

Temperature: 7 280 stations
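Extracting a value from a gridded surface amounts to converting the coordinates into a row and column index. The sketch below assumes a regular latitude/longitude grid starting at the north-west corner with a hypothetical cell size of 0.5 degrees; it does not reproduce the actual WorldClim file layout or resolution, and the stored value is made up.

```python
def cell_index(lat, lon, cell_deg=0.5, north=90.0, west=-180.0):
    """Row and column of the grid cell containing (lat, lon) on a regular
    grid whose first cell sits at the north-west corner."""
    row = int((north - lat) // cell_deg)
    col = int((lon - west) // cell_deg)
    return row, col


def extract(grid, lat, lon, cell_deg=0.5):
    """Read the surface value for one collecting site."""
    row, col = cell_index(lat, lon, cell_deg)
    return grid[row][col]


# Hypothetical global grid (360 rows x 720 columns at 0.5 degrees).
grid = [[0.0] * 720 for _ in range(360)]
row, col = cell_index(56.5667, 9.35)  # Lyderupgaard, NGB9529
grid[row][col] = 700.0                # made-up value, for illustration only

print(row, col, extract(grid, 56.5667, 9.35))
```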

This study is part of a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.


FIGS

The FIGS technology takes much of the guesswork out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments. http://www.figstraitmine.org/


Origin of Concept (1980s): Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for tolerance to boron toxicity.


Slide made by M C Mackay 1995


Map labels: Queensland, Australia; Mediterranean region


•  No sources of Sunn pest resistance previously found in hexaploid wheat.

•  2000 accessions screened at ICARDA without result.

•  A FIGS set of 534 accessions was developed and screened.

•  10 resistant accessions were found!

•  The FIGS selection started from 16 000 landraces from VIR, ICARDA and AWCC.

•  Exclude origins CHN, PAK and IND, where Sunn pest was only recently reported (6 328 acc).

•  Only one accession per collecting site (2 830 acc).

•  Excluding dry environments below 280 mm/year.

•  Excluding sites of low winter temperature below 10 degrees Celsius (1 502 acc).

Slide adapted from Ken Street, ICARDA (FIGS team)
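The successive exclusion steps can be sketched as a filter over passport records. This is a hedged illustration, not the FIGS team's implementation: the record fields, the function name `figs_filter` and the example records are hypothetical, and the thresholds are passed in exactly as printed on the slide.

```python
def figs_filter(records, excluded_origins, min_prec_mm, min_winter_temp_c):
    """Keep at most one accession per collecting site, after excluding
    unwanted origins, dry sites and sites with low winter temperature."""
    seen_sites = set()
    selected = []
    for rec in records:
        if rec["origin"] in excluded_origins:
            continue
        if rec["site_id"] in seen_sites:
            continue  # only one accession per collecting site
        if rec["prec_annual_mm"] < min_prec_mm:
            continue
        if rec["winter_temp_c"] < min_winter_temp_c:
            continue
        seen_sites.add(rec["site_id"])
        selected.append(rec)
    return selected


# Made-up records, for illustration only.
records = [
    {"acc": "A1", "origin": "CHN", "site_id": "s1", "prec_annual_mm": 500, "winter_temp_c": 12},
    {"acc": "A2", "origin": "SYR", "site_id": "s2", "prec_annual_mm": 350, "winter_temp_c": 11},
    {"acc": "A3", "origin": "SYR", "site_id": "s2", "prec_annual_mm": 350, "winter_temp_c": 11},
    {"acc": "A4", "origin": "TUR", "site_id": "s3", "prec_annual_mm": 200, "winter_temp_c": 12},
    {"acc": "A5", "origin": "IRN", "site_id": "s4", "prec_annual_mm": 400, "winter_temp_c": 5},
]

# Thresholds as printed on the slide (280 mm/year, 10 degrees Celsius).
selected = figs_filter(records, {"CHN", "PAK", "IND"}, 280, 10)
print([rec["acc"] for rec in selected])
```

Because the climate attributes belong to the site rather than to the accession, applying the one-per-site rule before or after the climate filters gives the same sites.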

•  The fundamental ecological niche of an organism was formalized by Hutchinson[1] in 1957 as a multidimensional hypervolume defining the ecological conditions that allow a species to exist.

•  Full understanding of all the environmental conditions for any organism is a monumental task[2].

•  A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche.

•  Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc.

George Evelyn Hutchinson (1903 – 1991)


openModeller: a flexible, user-friendly, cross-platform environment where the entire process of a fundamental niche modeling experiment can be carried out.

Input: species occurrence and environmental data.

Output: a fundamental niche model and projection of the model into an environmental scenario.

http://openmodeller.sourceforge.net/


–  The initial model is developed from the training set

–  Fine tuning of model parameters and settings

–  No model can ever be absolutely correct!

–  A simulation model can only be an approximation

–  A model is always created for a specific purpose!

–  The simulation model is applied to make predictions based on new fresh data

–  Be aware of extrapolation

–  For the initial calibration or training step.

–  Further calibration, tuning step.

–  Often cross-validation on the training set is used to reduce the consumption of raw data.

–  For the model validation or goodness-of-fit testing.

–  External data, not used in the model calibration.
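The roles listed above (training, tuning, external testing) correspond to a three-way split of the samples. A minimal sketch, assuming a random split with hypothetical fractions of 60/20/20; the function name and the fractions are illustrative and not taken from the talk.

```python
import random


def split_dataset(samples, frac_train=0.6, frac_tune=0.2, seed=0):
    """Shuffle the samples and split them into a training set, a tuning
    (further calibration) set and an external test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * frac_train)
    n_tune = round(len(shuffled) * frac_tune)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]  # never used during calibration
    return train, tune, test


train, tune, test = split_dataset(range(10))
print(len(train), len(tune), len(test))
```

With few samples, as in the Nordic barley example, holding out a separate tuning set is costly, which is the motivation for cross-validation on the training set.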


•  A number of different coefficients have been developed to measure correlation in different situations.

•  The best known is the Pearson product-moment correlation coefficient.

•  The correlation coefficient (r) indicates the strength and direction of a linear relationship between two random variables.

•  The coefficient of determination (r2) indicates how well future outcomes are likely to be predicted by a statistical model.

Name of the statistic          Symbol   Range
Correlation coefficient        r        -1 to 1
Coefficient of determination   r2       0 to 1

The covariance of the two variables is divided by the product of their standard deviations.
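The definition of Pearson's r translates directly into code. A minimal sketch in plain Python (the function name is illustrative); the common 1/(n-1) factors in the covariance and the standard deviations cancel, so they are left out.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation: covariance of x and y divided
    by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r, r ** 2)  # r and the coefficient of determination r2
```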


Be aware of over-fitting! NB! Model validation!

The distances between the model (predictions) and the reference values (validation) are the residuals.

Example of a bad model calibration

Cross-validation indicates the appropriate model complexity.
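Cross-validation can be sketched with a leave-one-out loop: each sample is predicted by a model calibrated without it, and the root mean square of those prediction errors (RMSECV) is compared across candidate models. The example below uses a one-predictor straight line as a stand-in model; it is not the multi-way regression used later in the talk, and the function names are illustrative.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def rmsecv_loo(x, y):
    """Leave-one-out cross-validation error for the straight-line model."""
    squared_errors = []
    for i in range(len(x)):
        # Calibrate without sample i, then predict sample i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        residual = y[i] - (slope * x[i] + intercept)
        squared_errors.append(residual ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up values, for illustration only
print(rmsecv_loo(x, y))
```

A model that fits the calibration data closely but gives a much larger RMSECV is over-fitted.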

Station                      Altitude   Latitude   Longitude
Priekuli, Latvia             83 m       57.3167    25.3667
Bjørke forsøksgård, Norway   149 m      60.7667    11.2167
Landskrona, Sweden           3 m        55.8667    12.8333

accide  AccNum    Country        Locality                Elevation  Latitude  Longitude  Coordinate
7436    NGB27     Finland        Sarkalahti, Luumäki     95 m       61.0333   27.3333    SESTO
9717    NGB456    Norway         Dønna, Nordland         71 m       66.1167   12.5       Georeferenced
9601    NGB468    Norway         Trysil                  400 m      61.2833   12.2833    Georeferenced
9600    NGB469    Norway         BJØRNEBY                400 m      61.2833   12.2833    Georeferenced
7966    NGB775    Sweden         Överkalix, Allsån       45 m       66.4      22.9333    SESTO
8510    NGB776    Sweden         Överkalix               100 m      66.4      22.7667    SESTO
7810    NGB792    Finland        Luusua, Kemijärvi       145 m      66.4833   27.35      SESTO
9538    NGB2072   Norway         Finset                  1220 m     60.6      7.5        Georeferenced
8482    NGB2565   Sweden         Öland                   11 m       56.7333   16.6667    Georeferenced
9102    NGB4641   Denmark        Støvring, Jylland       55 m       56.8833   9.8333     Georeferenced
9015    NGB4701   Faroe Islands  Faroe Islands           81 m       62.0167   -6.7667    Georeferenced
9039    NGB6300   Faroe Islands  Faroe Islands           81 m       62.0167   -6.7667    Georeferenced
8531    NGB9529   Denmark        Lyderupgaard            9 m        56.5667   9.35       Georeferenced
7344    NGB13458  Finland        Koskenkylä, Rovaniemi   91 m       66.5167   25.8667    Georeferenced

From a total of 19 landrace accessions included in the dataset, only 4 included geo-referenced coordinates in the NordGen SESTO database.

10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources.

For 5 accessions there was not enough information available to locate the original gathering location.

Right side illustration: Example of georeferencing for NGB9529, landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com

Score plots

The observations made at Priekuli (Latvia) are separated from the observations made at Bjørke (Norway) and Landskrona (Sweden) in PC1 and PC2.

The combined observations from each year (2002 and 2003) are less separated.

The two replicate series are NOT separated.

The bi-plot shows heading days and ripening days as the most influential trait variables for the separation of the observations from the different observation locations. Length of plant participates in spreading out the scores (in PC1 and PC2), but is less active in the separation of the groups.

The influence plot (residuals against leverage) shows the sample observed at Priekuli in 2003 (replicate 2) with a very high leverage, well separated from the “data cloud”. After looking into the raw data (see next slide), this data point was removed as an outlier (set to NaN).

Sample (FRO) observed at Priekuli in 2003 (replicate 2) has the lowest score for harvest index in the entire dataset.

After looking into the raw data (see the table above), this observation point was removed as an outlier (set to NaN).

The initial PCA of the climate data showed a nice spread of the scores. No surprises.

The influence plot identified sample (NOR) as a mild outlier. I decided to keep this sample, but to keep an eye out for it in the multi-way analysis.

•  Plot of the trait scores (max – min) from each observation location and year.

•  The different experimental conditions have a significant effect on the trait observations.

Mode 3 (climate variables) has very different ranges of numerical values (tmin, tmax, and prec). Scaling across mode 3 is thus applied to the multi-way models.

To the left, the box-plot is displayed for the 3-way data, unfolded so as to keep the dimensions of mode 3.

The 3-way climate data was reasonably well described by a PARAFAC model of two components.

Box-plot labels: tmin, tmax, prec; scaling across mode 3.
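The idea behind the scaling can be sketched as follows. The exact scaling used for these models is not given on the slide, so this is an assumption: each mode-3 slab (all values of one climate variable) is divided by its root mean square, which is one common way to give tmin, tmax and prec comparable numerical ranges. The array layout and function name are hypothetical.

```python
import math


def scale_mode3(cube):
    """cube[i][j][k] holds sample i, month j, climate variable k.
    Returns a copy where each mode-3 slab has root mean square 1.
    Assumes no slab consists only of zeros."""
    n_vars = len(cube[0][0])
    scaled = [[list(row) for row in sample] for sample in cube]
    for k in range(n_vars):
        values = [row[k] for sample in cube for row in sample]
        rms = math.sqrt(sum(v * v for v in values) / len(values))
        for sample in scaled:
            for row in sample:
                row[k] /= rms
    return scaled


# Tiny made-up cube: 2 samples x 2 months x 2 variables (temperature, rainfall).
cube = [
    [[-5.0, 40.0], [10.0, 70.0]],
    [[-8.0, 30.0], [12.0, 90.0]],
]
scaled = scale_mode3(cube)
print(scaled)
```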


Figure: structure of the trait dataset. 14 landraces (x2 replicates) give 28 records with 6 traits each, for every combination of observation location and year: Bjørke (N) 2002 and 2003, Landskrona (S) 2002 and 2003, Priekuli (Lv) 2002 and 2003.

Mode 2 (Traits): heading days, ripening days, length of plant, harvest index, volumetric weight, grain weight.

Mode 3: LVA 2002, LVA 2003, NOR 2002, NOR 2003, SWE 2002, SWE 2003.

Figure: structure of the climate dataset (14 x 12 x 3). 14 landraces (location of origin, 14 samples) by 12 monthly means (Jan, Feb, Mar, …) by 3 climate variables.

Climate data (mode 3):

•  Minimum temperature

•  Maximum temperature

•  Precipitation

•  … (many more can be added)

(NOR) was identified as a mild outlier from the influence plot.

Notice that both replications are located in the same part of the plot, and that they (together) are not isolated from the “data cloud”.

•  The initial PARAFAC models calibrated from the 4-way trait dataset failed to converge to any good models. The core-consistency remained very low.

•  The problem proved to be a lack of systematic independent variation between instances of mode 3 (observation years) and mode 4 (observation locations).

•  A two-component PARAFAC model was chosen for the new 3-way trait dataset, in which each combination of location and year is one instance of mode 3.

PARAFAC split-half (mode 1) analysis:

The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a solution very similar to the model calibrated from the complete dataset.

The PARAFAC model is thus a general and stable model for the scope of Scandinavia.

Further search for any good PARAFAC split-half for the climate dataset:

A systematic recording of results from 10 different split-half alternatives resulted in two good split-halves.

The PARAFAC model for the climate data is thus reasonably general (for Scandinavia), but less stable than the model for the 3-way trait data.

•  Often the critical level (α) for the p-value is set as 0.05, 0.01 or 0.001.

•  The modeling of 14 samples (landraces) gives:

–  12 degrees of freedom for the correlation tests

–  One-tailed test (looking only at positive correlation of predictions versus the reference values).

–  A coefficient of determination (r2) larger than 0.56 is significant at the 0.001 (0.1%) level for 14 values/samples.

Many introductory textbooks on statistics include a table of Critical Values for Pearson’s r.
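The 0.56 threshold can be checked from a t table, since a critical value of t converts to a critical value of r through r = t / sqrt(t^2 + df). The sketch below assumes the tabulated one-tailed critical t of 3.930 for α = 0.001 and 12 degrees of freedom; that constant is a table lookup, not something computed here.

```python
import math


def critical_r(t_crit: float, df: int) -> float:
    """Convert a critical t value into the matching critical Pearson r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# One-tailed critical t for alpha = 0.001 and df = 12, taken from a standard
# t table (assumption: value as tabulated, 3.930).
T_CRIT = 3.930
r_crit = critical_r(T_CRIT, 12)
print(f"critical r = {r_crit:.3f}, critical r2 = {r_crit ** 2:.3f}")
```

The resulting r2 lies just above 0.56, in line with the threshold quoted above for 14 samples.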


•  Latvia 2002 (LY11)

–  May 2002 was extremely dry in Priekuli.

–  June 2002 was extremely wet in Priekuli.

–  The wet June caused germination on the spikes for many of the early varieties.

•  Landskrona 2003 (LY32)

–  June 2003 was extremely dry in Landskrona.

–  June was the time for grain filling here.

•  Too extreme for the genotype to be “normally” expressed?

•  Too large an effect from “G by E” interaction?

Rainfall (mm)

Station                      Year   Sowing week   May    June    July    August
Bjørke forsøksgård, Norway   2002   17            82.9   67.4    128.5   136.5
                             2003   21            75.1   85.7    67.1    53.2
Landskrona, Sweden           2002   13            53.5   75.3    76.4    68.9
                             2003   15            70.7   40.4    76.0    45.7
Priekuli, Latvia             2002   17            38.2   111.1   67.0    11.3
                             2003   19            88.0   59.2    87.8    175.8

Exploring why some of the subsets (LY) give very bad N-PLS regressions...

Model            RMSECV   Expl. X   Expl. y   r2 cal   r2 cv
All samples      3.72     96%       54%       0.54     0.16
Without NGB456   3.18     98%       64%       0.64     0.33

•  The first dataset I started to work with is a “FIGS” dataset with genebank accessions of barley (Hordeum vulgare ssp. vulgare) collected from different countries worldwide and tested for susceptibility to net blotch infection. Net blotch is a common disease of barley caused by the fungus Pyrenophora teres.

•  The barley plants were inoculated with the fungus and the percentage of the leaves infected with the disease was normalized to an interval scale (1 to 9).

•  1-3 are basically resistant (group 1)

•  4-6 are intermediate (group 2)

•  7-9 are susceptible (group 3)
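The grouping of the 1 to 9 scores can be written as a small function. A minimal sketch; the function name is illustrative.

```python
def resistance_group(score: int) -> int:
    """Map a net blotch score (1 to 9) to group 1 (resistant),
    group 2 (intermediate) or group 3 (susceptible)."""
    if not 1 <= score <= 9:
        raise ValueError(f"score must be between 1 and 9, got {score}")
    if score <= 3:
        return 1
    if score <= 6:
        return 2
    return 3


print([resistance_group(s) for s in range(1, 10)])
```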


•  Field locations (USA)

–  Athens, Georgia (273 observations)

–  Fargo, North Dakota (3381 observations)

–  Langdon, North Dakota (858 observations)

–  Stephen, Minnesota (139 observations)

•  Observation years (1987 – 2004)

–  9 distinct years

•  Greenhouse versus field trials

–  Greenhouse (1676 observations)

–  Field trial (2975 observations)

•  One-way ANOVA test for difference between the observation locations. The p-value of 0.000 rejects the null hypothesis of no difference.

•  The Tukey pair-wise comparison test gave the same result.

Individual 95% CIs For Mean Based on Pooled StDev
Level        N    Mean   StDev  -----+---------+---------+---------+-
ATHENS     262  2,0840  0,6555                           (---*---)
FARGO      789  1,6793  0,6023         (-*-)
LANGDON   1558  1,6727  0,6466         (-*)
STEPHEN    136  1,6103  0,7810  (-----*----)
                                -----+---------+---------+---------+-
                                   1,60      1,80      2,00      2,20

•  Agro-climatic Zone (UNESCO classification)

•  Soil classification (FAO Soil map)

•  Aridity (dryness)

•  Precipitation

•  Potential evapotranspiration (water loss)

•  Temperature

•  Maximum temperatures

•  Minimum temperatures

(mean values for month and year)

•  The proportion of correctly classified samples for the training dataset was 45.9%, and we would expect a similar success rate for the prediction of the “blinded” values.

•  Remember that the expected success rate for random classification into three groups is 33.3%.

•  A test set of 9 samples showed a proportion of correct classifications of 44.4%.

Discriminant Analysis: obs_nb versus acz_moisture; ...

Quadratic Method for Response: obs_nb

Predictors: acz_moisture; acz_winter_temp; acz_summer_temp; arid_annual; pet_annual; prec_annual; temp_annual; tmax_annual; tmin_annual

Group       1      2     3
Count    1049   1190   234

Summary of classification

Put into Group       1       2       3
1                  523     427      48
2                  287     451      25
3                  238     314     163
Total N           1048    1192     236
N correct          523     451     163
Proportion       0,499   0,378   0,691

N = 2476    N Correct = 1137    Proportion Correct = 0,459
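The summary figures follow directly from the classification table. The sketch below recomputes them from the counts printed above, reading the rows as "put into group" and the columns as the group the totals refer to; the function name is illustrative.

```python
# Classification counts as printed in the summary table.
confusion = [
    [523, 427, 48],
    [287, 451, 25],
    [238, 314, 163],
]


def summarize(table):
    """Column totals, per-group proportion correct and overall proportion."""
    n_groups = len(table)
    totals = [sum(row[j] for row in table) for j in range(n_groups)]
    correct = [table[j][j] for j in range(n_groups)]
    per_group = [c / t for c, t in zip(correct, totals)]
    overall = sum(correct) / sum(totals)
    return totals, per_group, overall


totals, per_group, overall = summarize(confusion)
print(totals, [round(p, 3) for p in per_group], round(overall, 3))
```

The overall proportion of 0.459 is the figure to compare with the 33.3% expected from random assignment to three groups.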


Eddy De Pauw: Climate data

Harold Bockelman: Net blotch data

Ken Street: FIGS project leader

Michael Mackay: FIGS coordinator

Dag Endresen: Data analysis
