Post on 16-Sep-2019
Final report: Transgene flow risk analysis
Raúl Jiménez Rosenberg (rauljr@stanford.edu)
Introduction

Mexico is one of the most important centers of origin and diversification of many food plants, such as cucurbits, avocado, and maize, which is used worldwide and was domesticated over thousands of years in Mexico. By law, Mexico does not allow commercial cultivation of genetically modified maize, only experimental parcels, but there is already evidence of genetic contamination in Mexico [1, 2], and there is a lot of pressure to open the country to commercial genetically modified maize crops.

Motivation

Genetic diversity is very important in biodiversity assessment; it has been shown in many cases that genetic diversity is a key component in the ability of species to cope with environmental shifts, diseases, and pests, among many other things. So far, Mexico has decided not to allow commercial crops of genetically modified organisms for those plants for which Mexico is the center of origin and diversification, out of concern about transgene flow to wild and domesticated relatives. Experimental crops should also be carefully analyzed to minimize transgene flow. One way to decide whether a crop should be allowed is to compare the geographical distribution of the relatives’ species with the place where the experimental (or commercial) crop is proposed. Having maps of species distributions is therefore essential.

Short background on Species Distribution Modeling

Potential Species Distribution Models (SDM) are related to the ecological conditions a species requires to maintain populations in a given region [15], so the idea is to find the biotic and abiotic conditions that are suitable for the species. There are many reasons why a species does not occupy all suitable areas (e.g. geographical barriers such as mountains or sea, the presence of predators, etc.), which is why these are called potential distributions. There is a branch of ecology that studies the actual distribution of species, but it relies on extensive, well-designed species inventories or field work to validate the model, which is feasible only in cases such as endemic species or small regions.

Many approaches have been applied to SDM. If the data set includes “true absences” (absences due to the species not being present, rather than to insufficient exploration), then the data are binary and many statistical tools for this kind of data can be used, such as regression, GLM, and GAM. However, most specimen data sets lack “true absences”; these are called presence-only data. For this kind of data there has been a long discussion about methods. To mention some, there are profile methods (Bioclim, Domain, Mahalanobis), statistical methods using pseudo-absences (GLM, GAM), geographic models (Convex Hull, Inverse Distance Weighted), and what the SDM field calls Machine Learning (ML) methods (Maximum Entropy, Boosted Regression Trees, SVM, etc.). Recently, the most widely used ML algorithm, MaxEnt, has been shown to be mathematically equivalent to other algorithms: the work of William Fithian and Trevor Hastie at Stanford University concludes that MaxEnt, Boosted Regression Trees, and others are motivated by the same underlying model, the Inhomogeneous Poisson Process ([4], [16], [17]).

The approach explored in this work is based on the idea of the BAM diagram (figure 1), which aims at a practical (and ‘realistic’) potential SDM by illustrating the relations between biotic (B), abiotic (A), and movement (M) elements.
Figure 1. BAM diagram. Solid circles indicate presences and open circles indicate absences of the species. GA = the abiotically suitable area, GO = the occupied area, and GI = the invadable area, so the potential distribution area is GP = GO ∪ GI [6]. Figure taken from [15].

Methods

Since this work is based on the ideas of Soberón at Kansas University [12] (the BAM idea), the first step was to build a reliable, biologically sound dataset of biotic and abiotic conditions for the 59 races of maize present in Mexico, starting with specimen occurrences from two main sources: Conabio (www.conabio.gob.mx) and GBIF (www.gbif.org). There is almost no dataset for M for any species, but there is literature about the biology of the species: preferred precipitation, altitude, temperature, etc. Therefore, to get an idea of M, I read about the biology of the maize races and propose a simplistic idea based on what I learned [14].
Datasets

Specimen occurrences: around 36,000 records were collated from Conabio and GBIF, but a quality assessment reduced the dataset to 10,950 records. The main problems were missing or inconsistent geographical data (latitude and longitude) and the lack of a valid scientific name for the maize race. The time spent cleaning the data was far greater than for any other part of the project. 10,950 records is a sufficient amount of data to work with, but four races had fewer than 20 records each, so instead of working with each race I follow a classification into groups (or complexes), in which maize is classified into seven groups [14] (table 1).

Group  Representative maize races                     Records
1      Palomero, Cacahuacintle, Cónico                   3574
2      Apachito, Cristalino de Chihuahua                  222
3      Harinoso de Ocho, Elotes Occidentales, Bofo       1655
4      Chapalote, Reventador                              165
5      Nal-tel, Zapalote Chico                            681
6      Tepecintle, Choapaneco, Tuxpeño                   3575
7      Olotillo, Dzit Bacal, Olotón                      1078
Total                                                   10950
Table 1. Occurrence dataset

On this occurrence dataset I overlay climate data to build the abiotic and biotic part; these climate data are the variables used in the species distribution algorithm. The climate dataset comes from WorldClim (www.worldclim.org) [18]. The resolution is 30 arc-seconds, which for Mexico is ~1 km per pixel. The extent of the study area is the geographic box ({-120, 34}, {-85, 12}), so each climate raster has 4,200 × 2,640 = 11,088,000 pixels; ocean pixels have a no-data value. All data were processed in R.
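The raster size quoted above follows directly from the extent and the resolution. The following is a minimal sketch of that arithmetic in Python (the project itself used R); the function name `raster_shape` is hypothetical and only serves this illustration.

```python
ARC_SECONDS_PER_DEGREE = 3600

def raster_shape(lon_min, lon_max, lat_min, lat_max, res_arcsec):
    """Return (columns, rows) of a raster covering the box at the given resolution."""
    pixels_per_degree = ARC_SECONDS_PER_DEGREE // res_arcsec  # 120 pixels for 30 arc-seconds
    n_cols = (lon_max - lon_min) * pixels_per_degree
    n_rows = (lat_max - lat_min) * pixels_per_degree
    return n_cols, n_rows

# Study area from the text: longitude -120 to -85, latitude 12 to 34, 30 arc-seconds
n_cols, n_rows = raster_shape(-120, -85, 12, 34, 30)
print(n_cols, n_rows, n_cols * n_rows)  # 4200 2640 11088000
```

With 12 predictors, this means roughly 133 million cell values per maize group before ocean pixels are dropped.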
From my reading on the biology of maize, the algorithm was fed only the following 12 variables:

BIO3 = Isothermality
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

Most biological descriptions characterize maize based on altitude, so a physiographic provinces map [3] was used for M. Currently there is no algorithm that uses the BAM concept directly to build species distribution maps; instead, this map was used to provide the background data (similar to pseudo-absences, but with a very different use): the background points were randomly sampled from the physiographic regions where the maize group has an occurrence (based on the dataset). So M acts as a mask for the random selection of background points. Briefly, the difference between background data and pseudo-absences is that pseudo-absences are labeled ‘0’ (absences) and then used in supervised classification algorithms, whereas background data are used to characterize the environments in the region where the species occurs and serve as constraints for the optimization; MaxEnt, in particular, uses Lagrange multipliers. It is therefore very important to choose good constraints, i.e. M should rely mostly on biological knowledge about the species.
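The use of M as a mask for background points can be sketched as follows. This is only an illustration in Python, not the R code used in the project: `toy_region` is a hypothetical stand-in for the physiographic provinces map (it simply cuts the study area into 5-degree longitude bands), and the occurrence coordinates are made up.

```python
import random

def build_mask(occurrences, region_of):
    """M: the set of region ids that contain at least one occurrence."""
    return {region_of(pt) for pt in occurrences}

def sample_background(candidates, region_of, mask, n, seed=0):
    """Randomly pick n candidate points that fall inside the mask M."""
    inside = [pt for pt in candidates if region_of(pt) in mask]
    return random.Random(seed).sample(inside, n)

def toy_region(pt):
    """Hypothetical province lookup: 5-degree wide longitude bands."""
    lon, _lat = pt
    return int(lon // 5)

# Made-up occurrences (lon, lat) and a coarse 1-degree grid of candidate points
occurrences = [(-103.2, 20.1), (-101.7, 19.4), (-98.9, 19.0)]
candidates = [(lon, lat) for lon in range(-120, -85) for lat in range(12, 34)]

mask = build_mask(occurrences, toy_region)
# Same number of background points as occurrences, as done for each maize group
background = sample_background(candidates, toy_region, mask, n=len(occurrences))
print(sorted(mask), background)
```

A real implementation would replace `toy_region` with a point-in-polygon lookup against the provinces map and would draw candidates from the raster cells that have data.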
Figure 2. Overview of maize groups distribution
Data preprocess

The data handling, model fitting, evaluation, and prediction were done in R (the code will be shared; R is getting much attention within the SDM community). Depending on the settings, it took up to one hour to run the seven maize groups. All the data use a geographic projection with datum WGS84. For each maize group, M was built from the polygons of the physiographic provinces map [3] that contain occurrence points. Figure 3 shows the M map for group 4.
Figure 3. Red area (provinces) is M for maize group 4

For the background points, a sample of the same size as the maize group (table 1) was selected, using the idea of Hijmans at Berkeley University to remove spatial sorting bias [7] (point-wise distance sampling).

Model Fitting

Two algorithms were used: Maximum Entropy (MaxEnt [9]) and Boosted Regression Trees (BRT [5]). This report includes only the MaxEnt results (BRT requires further work; however, based on visual inspection, its predictions look a lot like those of MaxEnt, which is expected [4]). MaxEnt maximizes the information entropy H = -Σ p_i log p_i. To assess the relative contribution of each variable to the model, a jackknife process was performed. Recall that, based on the biology of the maize races, 12 ‘important’ variables were already pre-selected. The result can be used to answer which variables are more important for the species, but this analysis depends entirely on how the model was built (optimization), so different algorithms can give different answers.
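To make the entropy formula concrete, here is a small Python sketch that evaluates H for two made-up distributions over four cells. It only computes the quantity that MaxEnt maximizes; the constraints that MaxEnt imposes from the environmental variables are not modeled here.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum(p_i * log(p_i)), with 0 * log(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # no cell preferred
peaked = [0.7, 0.1, 0.1, 0.1]       # one cell strongly preferred

print(entropy(uniform))  # equals log(4), the maximum for four cells
print(entropy(peaked))   # lower: the distribution is more concentrated
```

Without constraints the uniform distribution has the highest entropy; the constraints are what pull the solution toward the environments where the species was observed.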
Bio \ Group    1      2      3      4      5      6      7
10           5.3    2.5    1.9    0.8    1.0    0.4    0.4
11           4.0    4.2     12   32.6   11.6   32.0    0.7
16           6.9    9.3   33.7   13.7    0.4   25.5   20.0
17           1.3    1.0   13.8    1.0   8.45    4.9    1.6
18           7.7    0.6    2.5   11.7   3.43    0.5    7.8
19           8.4    2.0    0.3    1.7   26.3   11.5    3.6
3            3.1    7.0     21    6.0    8.7    3.1   31.4
5            1.4    0.9    0.1    2.3   17.9    1.5    4.1
6            0.6   54.7    5.0    7.1    8.8    4.7   10.0
7            0.9   7.15    5.8   10.5    3.0    6.9    7.3
8           57.4    0.1    1.4    2.4    1.8    4.6    1.9
9            2.3   10.0    1.9    9.6    8.1    3.8   10.0

Table 2. Percent contribution of each variable (rows, BIO number) for each maize group (columns)
In addition, from the test of variable importance we get the variable that decreases the gain of the model the most when it is omitted, i.e. the variable that appears to have the most information that is not present in the other variables.

Group     1      2     3      4     5     6      7
Variable  bio16  bio3  bio16  bio3  bio9  bio16  bio3

Table 3. Variable whose omission decreases the gain the most

Model evaluation

In the SDM field, the most widely used measure of model fit is Receiver Operating Characteristic (ROC) analysis. Cross-validation was used, following the rule of thumb of building the model with 80% of the data (training) and testing with the remaining 20%; these sets were built through k-fold data partitioning. Phillips et al. at Princeton University discuss the use of the Area Under the Curve (AUC) in the presence-only context [9]. Recall that we have only occurrences and no absence data, so for the commission rate, instead of the fraction of absences predicted present, the fraction of the total study area predicted present is used.
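An 80/20 split through k-fold partitioning can be sketched as below. The choice k = 5 is an assumption inferred from the 80/20 rule (one fold of five held out for testing); the report does not state k, and the function names are hypothetical.

```python
import random

def kfold_groups(n_records, k, seed=0):
    """Assign each record index to one of k folds of (nearly) equal size."""
    folds = [i % k for i in range(n_records)]
    random.Random(seed).shuffle(folds)
    return folds

def train_test_split(folds, test_fold):
    """Indices outside test_fold are used for training, the rest for testing."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test

# 165 = number of records in maize group 4 (table 1); k = 5 is assumed
folds = kfold_groups(165, k=5)
train, test = train_test_split(folds, test_fold=0)
print(len(train), len(test))  # 132 33
```

Repeating the split with each fold as the test set gives five train/test pairs per maize group.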
Maize group  AUC-Train  AUC-Test
1            0.837      0.816
2            0.923      0.875
3            0.828      0.802
4            0.880      0.846
5            0.809      0.761
6            0.756      0.734
7            0.867      0.843
Table 4. AUC for each maize group on the training and test data sets

The expected area is above 0.5, which corresponds to a random model (black line in figures 4 and 5). Based on this measure we get very good models, such as group 2 (test AUC: 0.875); the ‘worst’ case is group 6 (test AUC: 0.734), and even the latter is considered a useful model for national-level decisions (the models are very informative). However, I am sure that better data cleaning, model tweaking, or ensemble methods can lead to better models.
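For reference, the AUC can be computed directly as the probability that a randomly chosen presence point receives a higher predicted value than a randomly chosen background point. The sketch below uses made-up scores and is not the R routine used for table 4.

```python
def auc(presence_scores, background_scores):
    """Fraction of (presence, background) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))

# Made-up predicted values for four presences and four background points
presences = [0.9, 0.8, 0.6, 0.4]
background = [0.7, 0.5, 0.3, 0.2]
print(auc(presences, background))  # 0.8125 (13 of 16 pairs ranked correctly)
```

A model that cannot separate the two sets scores 0.5, which is the random baseline mentioned above.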
Figure 4. ROC for maize group 2
Figure 5. ROC for maize group 6

Prediction

Based on what was discussed earlier in this report, we can build our potential species distribution maps; we only need the fitted model for each maize group and the predictors to feed to it. MaxEnt is a probabilistic algorithm, so the prediction is based on probabilities, and the map therefore has values from 0 to 1 (probability of presence).
Figure 6. Maize group 7 prediction map
This map can be misleading or difficult to interpret, because you have to decide which map values stand for presences and which for absences. A first thought is that the model is over-fitted, because the prediction covers all the data points (figure 6). So we have to use a threshold, which, like many things in the SDM field, is a subject of ongoing research [8]. Thresholds are hard to choose and much debated. I use the equal training sensitivity and specificity rule, which works well (but not always). One way to get a sense of its usefulness is to review the density (frequency) of the predicted values for the presences and absences of the training data, for example for maize group 1. In this case the threshold is 0.377 (figure 7), so we classify the probability-of-presence map into absences (< threshold) and presences (≥ threshold). However, the densities do not always look like this example (figure 7); see, for instance, the density plots for maize group 5 (figure 8).
Figure 7. Threshold graphic representation
Figure 8. Densities of absence and presence points from the training data of maize group 5.

It is easy to see why in this case another threshold should perhaps be used (the areas under the curves overlap considerably; figure 8). Let’s classify maize group 7 (figure 6). The threshold in this case is 0.39; the resulting map is more interpretable, but changing the threshold approach can change the presence areas a lot.
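The equal sensitivity and specificity rule can be sketched as picking the threshold at which the fraction of presences predicted present equals (or is closest to) the fraction of absences predicted absent. The Python below is an illustration with made-up predicted values, where "absences" stands for the training absence (background) points; it is not the R routine used in the project.

```python
def sensitivity(presences, t):
    """Fraction of presence points with predicted value >= t."""
    return sum(1 for s in presences if s >= t) / len(presences)

def specificity(absences, t):
    """Fraction of absence points with predicted value < t."""
    return sum(1 for s in absences if s < t) / len(absences)

def equal_sens_spec_threshold(presences, absences):
    """Observed value where |sensitivity - specificity| is smallest."""
    candidates = sorted(set(presences) | set(absences))
    return min(candidates,
               key=lambda t: abs(sensitivity(presences, t) - specificity(absences, t)))

# Made-up predicted values for training presences and absences
presences = [0.35, 0.45, 0.55, 0.65, 0.80]
absences = [0.05, 0.10, 0.20, 0.30, 0.50]

t = equal_sens_spec_threshold(presences, absences)
labels = [1 if s >= t else 0 for s in [0.2, 0.45, 0.9]]  # classify three map cells
print(t, labels)  # 0.45 [0, 1, 1]
```

When the two densities overlap heavily, as for maize group 5, the difference is small over a wide range of thresholds and the chosen value becomes much less stable.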
Figure 9. Maize group 7 presence classification
[Density plots of predicted values for absences and presences, one panel each for maize groups 1, 5, and 7 (panel labels grupo1, grupo5, grupo7); x-axis: predicted value; y-axis: density; bandwidths 0.03832, 0.06835, and 0.06415, respectively.]
Conclusion and future work

Applying Machine Learning to Species Distribution Modeling is an active research field, and many methods have been applied to produce geographical distribution maps. The field is particularly interesting because of the challenge posed by how the data are generated: mostly presences, sampled in highly biased ways, e.g. near roads, with very different collecting methods, in datasets aggregated from decades of field work, etc. Even so, models based on Machine Learning have proven very successful, especially in mega-diverse and highly heterogeneous countries. These tools can help decision makers use the best available information and analysis to promote biological conservation and sustainable use. Tweaking these algorithms will certainly provide better results, but eventually they will reach a limit that is difficult to overcome, so ensemble methods can be an interesting research area. Some ensembles have already been built, but many of them are based on algebraic combinations of the results of different models, which may deliver a consensus map but are not strictly ensemble methods.
References

[1] A Piñeyro-Nelson, J Van Heerwaarden, H R Perales, J A Serratos-Hernández, A Rangel, M B Hufford, P Gepts, A Garay-Arroyo, R Rivera-Bustamante, E R Álvarez-Buylla. Transgenes in Mexican maize: molecular evidence and methodological considerations for GMO detection in landrace populations. Mol Ecol. 2009 February; 18(4): 750–761. doi: 10.1111/j.1365-294X.2008.03993.x
[2] Araújo, Miguel and A. Townsend Peterson. 2012. Uses and misuses of bioclimatic envelope modeling. Ecology 93: 1527-1539.
[3] Cervantes-Zamora, Y., Cornejo-Olgín, S. L., Lucero-Márquez, R., Espinoza-Rodríguez, J. M., Miranda-Viquez, E. and Pineda-Velázquez, A. 1990. 'Provincias Fisiográficas de México'. Extracted from Clasificación de Regiones Naturales de México II, IV.10.2. Atlas Nacional de México. Vol. II. Scale 1:4000000. Instituto de Geografía, UNAM. México. http://www.conabio.gob.mx/informacion/gis/?vns=gis_root/region/fisica/rfisio4mgw
[4] Fithian, William and Hastie, Trevor. 2012. Statistical Models for Presence-Only Data: Finite-Sample Equivalence and Addressing Observer Bias. arXiv:1207.6950.
[5] Friedman, J.H. 2001. Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29: 1189-1232. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
[6] Gaston, K. J. 2003. The Structure and Dynamics of Geographic Ranges. Oxford University Press, Oxford, UK.
[7] Hijmans, R.J. 2012. Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null-model. Ecology 93: 679-688.
[8] Liu C., P.M. Berry, T.P. Dawson, and R.G. Pearson. 2005. Selecting thresholds of occurrence in the prediction of species distributions. Ecography 28: 385-393.
[9] Phillips, Steven J, Miroslav Dudík and R.E. Schapire. 2004. A maximum entropy approach to species distribution modeling. In Proceedings of the Twenty-First International Conference on Machine Learning, page 83- ACM, 2004.
[10] Quist D, Chapela I. 2001. Transgenic DNA introgressed into traditional maize landraces in Oaxaca, Mexico. Nature 414: 541–543.
[11] Phillips, S.J. and M. Dudík. 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31: 161-175.
[12] Soberón, J. and Peterson, A. T. 2005. Interpretation of models of fundamental ecological niches and species' distributional areas. Biodiversity Informatics 2: 1-10.
[13] Stockwell, D. R. B. 1999. Genetic algorithms II. Pages 123-144 in A. H. Fielding, editor. Machine learning methods for ecological applications. Kluwer Academic Publishers, Boston.
[14] Takeo Ángel Kato Yamakake, Cristina Mapes Sánchez, Luz María Mera Ovando, José Antonio Serratos Hernández and Robert Arthur Bye Boettles. 2009. Origen y Diversificacion del Maíz. Una Revisión Analítica. Comisión Nacional para el Conocimiento y Uso de la Biodiversidad.
[15] Townsend Peterson, A., Jorge Soberón, Richard G. Pearson, Robert P. Anderson, Enrique Martínez-Meyer, Miguel Nakamura and Miguel Bastos Araújo. 2011. Ecological niches and geographic distributions. Princeton University Press.
[16] Trevor Hastie and Will Fithian. 2013. Inference from presence-only data; the ongoing controversy. Ecography 36: 864-867.
[17] Warton, David I. and Leah C. Shepherd. 2010. Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology. Annals of Applied Statistics 4(3): 1383-1402.
[18] WorldClim, 2013 (consulted in October 2013). Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.