5.1.2 Analysis of stressors-responses relations with decision trees Lidija Globevnik, Nataša Atanasova, Mateja Škerjanec, Maja Koprivšek (University of Ljubljana, Faculty of Civil and Geodetic Engineering) WP5 Meeting: Ljubljana, 17-18 June, 2015


5.1.2 Analysis of stressors-responses relations with decision trees

Lidija Globevnik, Nataša Atanasova, Mateja Škerjanec, Maja Koprivšek (University of Ljubljana, Faculty of Civil and Geodetic Engineering)

WP5 Meeting: Ljubljana, 17-18 June, 2015


Investigation of pressures/stressors correspondence with water quality data and geo-climatic factors

• The geodatabase will contain datasets on multiple stressors, ecological status, water quantity, water quality, and ecosystem services.
• We will use a data-driven modelling approach (namely regression and classification trees) to investigate the relationship between pressures/stressors, geo-climatic factors and the state of the water bodies.


Data-driven modelling approach and decision trees

• The goal of data-driven methods is to learn the dependencies between the inputs and the outputs of the observed system from measured data.
• Decision tree learning is a commonly used method in data mining. A tree can be learned by splitting the source dataset into subsets based on attribute value tests.
• Two types of decision trees:
  • Classification trees: when the predicted (target) variable is a class
  • Regression trees: when the target variable is numeric or continuous
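A minimal sketch of a single "attribute value test", assuming Gini impurity as the split criterion (the slides do not name a criterion). The data rows and the helper names `gini` and `best_split` are hypothetical and serve only to illustrate how a learner picks a threshold.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best test `value <= threshold`."""
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: forest share of the hinterland (%) and status class.
forest = [95.0, 92.0, 91.0, 80.0, 60.0, 40.0]
status = ["Good", "Good", "Good", "Moderate", "Moderate", "Poor"]
threshold, impurity = best_split(forest, status)
print(threshold, round(impurity, 4))
```

A full tree learner would repeat this search over every attribute and then recurse into the two resulting subsets.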


Classification trees

• Classification trees are used to separate the dataset into classes.

ATT 1    ATT 2    ATT 3    …    TARGET (CLASS)
3.67     8.500    0.005    …    Poor
4.15     7.207    0.005    …    Moderate
5.32     8.357    0.011    …    Good
7.80     7.929    0.005    …    Good
8.11     7.096    0.005    …    Poor
9.36     7.804    0.005    …    Poor
10.87    6.018    0.005    …    Moderate
11.10    7.400    0.006    …    Poor
10.23    5.457    0.011    …    Good
8.39     5.486    0.014    …    Moderate
7.42     5.486    0.013    …    Moderate
4.06     8.307    0.005    …    Good
…        …        …        …    …

[Figure: the data set (examples) is used to build a classification tree with the leaves class1, class2 and class3; the tree can be read as a set of IF-THEN rules.]

Set of IF-THEN rules:

IF (ATT 1_value ≤ value1) THEN (class_value = class1)

IF (ATT 1_value > value1 and ATT 2_value ≤ value2) THEN (class_value = class2)

IF (ATT 1_value > value1 and ATT 2_value > value2) THEN (class_value = class3)
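The three IF-THEN rules can be written as one small function. This is only a sketch: the slide leaves value1 and value2 unspecified, so the thresholds `VALUE1` and `VALUE2` below are hypothetical and chosen only so that each rule fires once. The attribute values are taken from the example table.

```python
def classify(att1, att2, value1, value2):
    """Apply the three IF-THEN rules of the example classification tree."""
    if att1 <= value1:
        return "class1"
    if att2 <= value2:
        return "class2"
    return "class3"


# Hypothetical thresholds (not given on the slide).
VALUE1, VALUE2 = 5.0, 7.0

# (ATT 1, ATT 2) pairs from the example data set.
examples = [(3.67, 8.500), (10.87, 6.018), (9.36, 7.804)]
predicted = [classify(a1, a2, VALUE1, VALUE2) for a1, a2 in examples]
print(predicted)
```

Because the tests are evaluated from the root down, each later rule implicitly carries the negation of the earlier tests, which is why the second and third rules repeat "ATT 1_value > value1".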


Regression trees

• Regression trees are needed when the target variable is numeric or continuous. They are used for the prediction of the target value.

ATT 1    ATT 2    ATT 3    …    TARGET
3.67     8.500    0.005    …    2.133
4.15     7.207    0.005    …    2.601
5.32     8.357    0.011    …    3.718
7.80     7.929    0.005    …    3.481
8.11     7.096    0.005    …    1.791
9.36     7.804    0.005    …    1.128
10.87    6.018    0.005    …    1.471
11.10    7.400    0.006    …    1.521
10.23    5.457    0.011    …    0.869
8.39     5.486    0.014    …    0.535
7.42     5.486    0.013    …    1.034
4.06     8.307    0.005    …    1.636
…        …        …        …    …

[Figure: the data set (examples) is used to build a regression tree. The nodes test attribute values; the leaves are where the target variable is predicted.]

ATT 1 < 10?
  YES: TARGET = 0.2*ATT 1 + 3*ATT 2 + 4*ATT 3
  NO:  ATT 2 < 0.011?
    YES: TARGET = 0.01*ATT 1 + 0*ATT 2 + 5*ATT 3
    NO:  TARGET = 2*ATT 1 + 0.4*ATT 2 + 0*ATT 3

Set of equations for the prediction of a target value (i.e. regression model):

IF (ATT 1 < 10) THEN (target = 0.2∙ATT 1 + 3∙ATT 2 + 4∙ATT 3)

IF (ATT 1 ≥ 10 and ATT 2 < 0.011) THEN (target = 0.01∙ATT 1 + 0∙ATT 2 + 5∙ATT 3)

IF (ATT 1 ≥ 10 and ATT 2 ≥ 0.011) THEN (target = 2∙ATT 1 + 0.4∙ATT 2 + 0∙ATT 3)
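The regression tree can be sketched as a function with one linear equation per leaf. The coefficients are copied from the slide; they appear to be illustrative and are not expected to reproduce the TARGET column of the example table. The function name `predict_target` is hypothetical, and the boundary case ATT 1 = 10 is assumed to follow the NO branch of the first test.

```python
def predict_target(att1, att2, att3):
    """Walk the example regression tree and evaluate the equation in the leaf."""
    if att1 < 10:
        return 0.2 * att1 + 3 * att2 + 4 * att3
    # ATT 1 is 10 or more from here on (assumed NO branch of the first node).
    if att2 < 0.011:
        return 0.01 * att1 + 0 * att2 + 5 * att3
    return 2 * att1 + 0.4 * att2 + 0 * att3


print(predict_target(3.67, 8.500, 0.005))    # first leaf
print(predict_target(12.0, 0.005, 0.020))    # second leaf
print(predict_target(10.87, 6.018, 0.005))   # third leaf
```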


Finding the most important relationship between state and pressure/drivers data

The MARS Geodatabase is structured so that pressure data (multiple stressors) and state data (water quality) can be linked to spatial objects.


Preparing datasets:

Spatial objects:
• River segments with a unique identifier "tr"
• Main drains in FEC and other river segments
• Linkage of river segments with FEC and its "hinterland" (all FECs in drainage area): "tr" linked to "zhyd"
• Linkage of SoE monitoring stations with main drains

Water quality and quantity data:
• Data on nutrients

Pressure data:
• From EUROSTAT, E-PRTR and UWWTD; to include also modelled data (Moneris, JRC - GreenModel?)


FEC and hinterland polygons data:
• Surface area
• H_ave (average altitude), H_min, H_max
• Average slope (%)
• Precipitation, Temperature (1950-2000)
• Population number, density
• Hydroecoregion, Bioregion, Ecoregion (FEC)
• Corine land cover (1st order)
• River Strahler order (for main drain)
• River name (main drain)
• WFD WB ID
• WFD ecological status
• Monitoring station ID on main drain
• Available water quality data (state)

Field                 Description [Unit]
Prefix F              data applies to FEC
Prefix H              data applies to hinterland
Prefix S              state
WaterbaseI            WISE SoE quality station ID
tr                    ID of ECRINS river segment on which SoE quality station is located
ZHYD_FEC              ID of ECRINS FEC in which SoE quality station is located
hinterland            Does SoE quality station have a hinterland or not (YES/NO)
zhyd_hinterland       ID of hinterland of SoE quality station
Hinterland_Area_km2   Hinterland area [km2]
strahler              Strahler order of tr where SoE quality station is located
SoE_RivM              Name of river
WaterBody_ID          WFD water body ID
WFD_ecol_st           Ecological status of river segment from WFD
SoE_RiverDisch        River discharge (data from SoE quality database)
DEM_altitu            Altitude of SoE quality station from DEM
H_CLC1                Agricultural areas [share of hinterland area]
H_CLC2                Artificial surfaces [share of hinterland area]
H_CLC3                Forest and semi-natural areas [share of hinterland area]
H_CLC4                Water bodies [share of hinterland area]
H_CLC5                Wetlands [share of hinterland area]
F_DEM_A               Average (mean) altitude derived from DEM [m a.s.l.]
F_DEM_Mi              Minimal altitude derived from DEM [m a.s.l.]
F_DEM_Mx              Maximal altitude derived from DEM [m a.s.l.]
F_SLOP_A              Average slope derived from DEM [percent rise]
F_PON_A_W             Population count of the World Version 3 (Wv3), year 2000 [people]
F_POD_A_W             Population density of the World Version 3 (Wv3), year 2000 [people/km2]
F_POD_A_JRC           Population density disaggregated with Corine land cover 2000, year 2000 [people/km2]
F_AR_km2              Functional elementary catchment (FEC) area [m2]
F_PRE_5000            Average yearly precipitation for the period 1950-2000 [mm/year]
F_PRE1_5000           Average January precipitation for the period 1950-2000 [mm/month]
F_PRE7_5000           Average July precipitation for the period 1950-2000 [mm/month]
F_TEM_5000            Average yearly temperature for the period 1950-2000 [°C]
F_TEM1_5000           Average January temperature for the period 1950-2000 [°C]
F_TEM7_5000           Average July temperature for the period 1950-2000 [°C]
F_ECOR_ID             Eco regions ID (AREA_ID) 1-25 [share of FEC area]
F_BIOR_ID             Biogeographical regions ID (ABBRE) [share of FEC area]
F_HER_ID              Hydro eco region ID (HERCODE) - European Hydro-Ecoregions [share of FEC area]
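The field prefixes make it possible to sort attributes by the spatial unit they describe. The sketch below assumes that F marks FEC-level data, H marks hinterland-level data (as the H_ field descriptions "share of hinterland area" indicate) and S marks state data. The helper name `scope_of` is hypothetical.

```python
PREFIXES = {"F": "FEC", "H": "hinterland", "S": "state"}


def scope_of(field):
    """Return the spatial scope implied by a field-name prefix, or 'other'."""
    prefix, separator, _ = field.partition("_")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix]
    return "other"


fields = ["F_DEM_A", "H_CLC3", "S_nitrate", "strahler", "SoE_RivM"]
scopes = {field: scope_of(field) for field in fields}
print(scopes)
```

Identification fields such as `strahler` or `SoE_RivM` carry no scope prefix and fall into "other".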


Data from EUROSTAT (mainly on agriculture):
• Farms
• Livestock
• Crops
• Irrigation

All data are at NUTS 2 level, for the year 2010.

Where nothing was reported for 2010, we used averages (from 2005 to 2009) or the last reported values.

Field            Description [Unit]
H_beehives       Beehives [number]
H_cattle         Cattle [heads]
H_dairy_cattle   Dairy cattle [heads]
H_equidae        Equidae [heads]
H_farms_ls       Farms with livestock [number]
H_irr_area       Total irrigable area [ha]
H_irr_volume     Irrigation water volume [m3]
H_maize          Maize yield [100 kg/ha]
H_oth_cattle     Other cattle [heads]
H_oth_pigs       Other pigs [heads]
H_pigs           Pigs total [heads]
H_potatoes       Potato yields [100 kg/ha]
H_poultry        Poultry [1000 heads]
H_rabbits        Rabbits [heads]
H_sheep          Sheep [heads]
H_sows           Sows [heads]
H_uaa            Utilized agricultural area [ha]
H_vineyards      Vineyards [100 kg/ha]
H_wheat          Wheat yield [100 kg/ha]
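The gap-filling rule for missing 2010 reports can be sketched as follows. The slide does not say which fallback takes precedence, so the order used here (2010 value, then the mean of whatever was reported for 2005 to 2009, then the last value reported before 2005) is an assumption, as is the function name `value_for_2010`.

```python
from statistics import mean


def value_for_2010(reported):
    """Pick a 2010 value from a {year: value} dict for one region and field."""
    if 2010 in reported:
        return reported[2010]
    # Assumed first fallback: average of the values reported for 2005-2009.
    recent = [reported[year] for year in range(2005, 2010) if year in reported]
    if recent:
        return mean(recent)
    # Assumed second fallback: last value reported before 2005.
    earlier = [year for year in reported if year < 2005]
    if earlier:
        return reported[max(earlier)]
    return None


print(value_for_2010({2010: 5.0, 2009: 3.0}))
print(value_for_2010({2007: 2.0, 2009: 4.0}))
print(value_for_2010({2001: 9.0, 2003: 7.0}))
```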


Data from WISE:
- UWWTD
- SoE water quality

Field              Description [Unit]
H_BOD_discharge    Sum of BOD discharges in hinterland (UWWTD) [t/y]
H_COD_discharge    Sum of COD discharges in hinterland (UWWTD) [t/y]
H_P_discharge      Sum of P discharges in hinterland (UWWTD) [t/y]
H_N_discharge      Sum of N discharges in hinterland (UWWTD) [t/y]
H_uwws_count       Number of UWW systems in hinterland (UWWTD) [number]
H_TN_release       Total nitrogen release (E-PRTR) [t/y]
H_TP_release       Total phosphorus release (E-PRTR) [t/y]
S_ammonium         [mg/l N]
S_total ammonium   [mg/l N]
S_bod5             [mg/l O2]
S_bod7             [mg/l O2]
S_chlorophyll_a    [µg/l]
S_codcr            [mg/l O2]
S_codmn            [mg/l O2]
S_DOC              Dissolved organic carbon [mg/l C]
S_DO               Dissolved oxygen [mg/l O2]
S_EC               Electrical conductivity [µS/cm]
S_KN               Kjeldahl nitrogen [mg/l N]
S_nitrate          [mg/l N]
S_orthophosphates  [mg/l P]
S_OS               Oxygen saturation [%]
S_ph
S_silicate         [mg/l Si]
S_T                Water temperature [°C]
S_TOC              Total organic carbon (TOC) [mg/l C]
S_TP               Total phosphorus [mg/l P]


Case study: Drava river catchment (1)

[Maps: temperature (°C); precipitation (mm/year)]


Case study: Drava river catchment (2)

[Maps: ecoregions (Illies); hydro-ecoregions (Rebecca project)]


Drava river: 107 monitoring stations (water quantity)


[Figures: HR_RV_29111 hinterland; AT_RV_FW61400127 hinterland]


[Maps: temperature; population density (people/km2); maize yield (100 kg/ha, 2010); irrigation water volume (m3/year, 2010); pigs (heads, 2010)]


Modelling exercise – Drava river catchment

• The previously mentioned data were used to generate different classification trees using WEKA software.
• We decided to use the ecological status of water bodies (according to the WFD) as the target variable.
• Our target variable can fall into one of the following three classes:
  • Good: 33 examples
  • Moderate: 42 examples
  • Poor: 13 examples
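As a point of reference for the accuracies reported for the trees, the class counts above give the accuracy of a trivial classifier that always predicts the most frequent class. This baseline is not part of the original analysis; it is a short sketch computed only from the three counts on the slide.

```python
class_counts = {"Good": 33, "Moderate": 42, "Poor": 13}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_class] / total
shares = {name: count / total for name, count in class_counts.items()}

print(total, majority_class, round(baseline_accuracy, 3))
```

With 88 examples, always predicting "Moderate" is correct in about 47.7 % of cases, so a tree has to exceed that figure to add information.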


Test results – Drava river catchment (1)

1st classification tree (no. of parameters: 32; cross-validation (CV): 64 %; training data: 85 %, i.e. 85 % of all cases are classified correctly by these rules):
- The most important parameter is the eco-region; the eco-region "Hungarian Lowlands" has poor ecological status, while the eco-region "Dinaric Western Balkans" has moderate ecological status.
- In the eco-region "Alps" the most significant parameter is the percentage of urban areas, followed by the percentage of water surface.
- Interestingly, a greater number of beehives is associated with a better ecological status of the watercourse.

[Figure: 1st classification tree, including a split on the number of beehives. Legend: CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]


Test results – Drava river catchment (2)

2nd classification tree (no. of parameters: 12; CV: 64 %; training set: 84 %):
- Here the most important parameter is altitude. If it is lower than 161.24 m a.s.l., the ecological status is poor.
- At higher altitudes, the most important parameter is the percentage of forest land. If forest covers more than 89.61 % of the area, the status is good; otherwise the percentage of urban areas becomes important: if it is less than 1.22 %, the status is good.
- Interestingly (and logically), the ecological status of the higher-lying sections (Strahler <= 5) is better (good vs. moderate).

[Figure: 2nd classification tree with split thresholds; annotation on one leaf: 17 cases prove the rule, 3 fail. Legend: CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]

If more than 90 % of a station's hinterland is covered by forest, there is a high probability that the water body (WB) has good ecological status (ES). If not, the outcome depends on other land uses: with less than 90 % forest and less than 1.2 % urban area, we can also expect good ES; otherwise we check the forest coverage again. If it is less than 90 % but more than 70 %, we check the river discharge. If forest is less than 70 % (that is, urban area more than 1.2 % and forest less than 70 %), we check the agricultural coverage: if it is more than 30 %, the status is moderate; otherwise we go to the last check.
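The walk through the 2nd tree can be sketched as a chain of tests. The thresholds (161.24 m, 89.61 %, 1.22 %, 70 %, 30 %) are taken from the slide; the river-discharge threshold and the "last check" are not given there, so the function returns a descriptive label for those branches instead of a status. The function name and argument names are hypothetical.

```python
def tree2_status(altitude_m, forest_pct, urban_pct, agric_pct):
    """Follow the described branches of the 2nd classification tree."""
    if altitude_m < 161.24:
        return "poor"
    if forest_pct > 89.61:
        return "good"
    if urban_pct < 1.22:
        return "good"
    if forest_pct > 70:
        # Threshold for river discharge is not stated on the slide.
        return "depends on river discharge"
    if agric_pct > 30:
        return "moderate"
    # The final check is not described on the slide.
    return "undetermined (further checks)"


print(tree2_status(150, 95, 0.5, 10))
print(tree2_status(500, 80, 2.0, 10))
print(tree2_status(500, 50, 2.0, 40))
```

Such a function is valid only for the Drava data the tree was trained on.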


Test results – Drava river catchment (3)

3rd classification tree (no. of parameters: 13; CV: 65 %; training data: 73 %). This tree is more robust, i.e. it performs better on the validation dataset. In this case we used techniques to obtain smaller trees. Some information may be lost, but the tree is more robust against new (validation) data. The tree is similar to tree no. 2, only slightly shorter and easier to interpret. The important attributes are altitude, the percentage of forest areas, and the percentage of water surface.

If more than 90 % of a station's hinterland is covered by forest, there is a high probability that the water body (WB) will have good ecological status (ES). If not, the outcome depends on other land uses: with less than 90 % forest and more than 0.18 % water surface, we can expect moderate status. Otherwise, with less than 73 % forest, we can hardly expect moderate status. The model indicates the importance of forest and water surface: if we have enough forest area, we can afford other activities.

[Figure: 3rd classification tree. Legend: CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]
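One way to read the robustness claim is to compare, for each of the three trees, the accuracy on the training data with the cross-validation accuracy. The sketch below uses only the percentages reported on the three results slides; treating the difference as a "generalization gap" is an interpretation added here, not a measure used in the original analysis.

```python
trees = {
    "tree 1": {"parameters": 32, "cv": 64, "training": 85},
    "tree 2": {"parameters": 12, "cv": 64, "training": 84},
    "tree 3": {"parameters": 13, "cv": 65, "training": 73},
}

# Gap in percentage points between training and cross-validation accuracy.
gaps = {name: t["training"] - t["cv"] for name, t in trees.items()}
most_robust = min(gaps, key=gaps.get)

print(gaps, most_robust)
```

The third tree gives up training accuracy but shows by far the smallest gap, which matches the remark that smaller trees lose some information yet hold up better on new data.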


Conclusions (1)

1) The most important driver/pressure is land use
• Forest cover is the dominant land use class, and the threshold of forest coverage for good status is 89 % in the hinterland
• Water surface in the hinterland is the second most important land use class: the threshold is 0.179 % (if more than 0.179 % of the hinterland is water, one can expect moderate status)
• (These are drivers that reduce pressures from other drivers)

2) For high altitude: if we have enough forest area (more than 90 %), we can expect good ecological status; if not, we have to see what else we are doing in the hinterland:
• If forest area is between 70 % and 90 %, then ecological status depends on river discharge (more is better)
• If agriculture covers more than 30 %, then we can expect moderate status
• If agriculture covers less than 30 %, then …

3) Also seems important: irrigation and urban areas


How to interpret and use the trees

• The clear message from all models is that forest coverage is the most important attribute for the ecological status of water bodies in the Drava catchment.
• The models provide threshold values of the attributes, based on which a strategy for land use management in the hinterland can be developed.
• For example, a clear guideline for managers is:
  • In the Drava catchment, hinterlands below 161 m a.s.l. tend to be problematic regardless of land use and need more attention.
  • Hinterlands above 161 m a.s.l.:
    • If more than 90 % of the land is kept as forest, there is a high probability of good ES.
    • If there is less forest: pay attention to the percentages of water surface, agricultural areas and urban areas, and to river discharge. The thresholds are given in the models on the previous slides.
• Important to note: the classification trees were trained on the Drava catchment, so the information they disclose is valid for this catchment only.


Modelling exercise – further work

• Each SoE station is affected by the corresponding drainage area. Therefore, it is more reasonable to aggregate data at hinterland level instead of at FEC level, especially for the geo-climatic factors (e.g. average slope, average annual precipitation).
• Besides ecological status, we can model other variables from the SoE stations as well (e.g. P and N ranges).
• We will model other catchments and find similarities.
• The size of the catchments is to be discussed.
• We still need to include point-source data (from the E-PRTR and UWWTD databases), which will hopefully improve the interpretability of the models.
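Aggregation at hinterland level could, for example, be an area-weighted mean over all FECs in the drainage area of a station. The slides do not specify the aggregation method, so the area weighting, the function name `hinterland_mean` and the field names `area_km2` and `slope_pct` are assumptions made for illustration.

```python
def hinterland_mean(fecs, field):
    """Area-weighted mean of `field` over the FECs that form one hinterland."""
    total_area = sum(fec["area_km2"] for fec in fecs)
    if total_area == 0:
        raise ValueError("hinterland has zero total area")
    return sum(fec[field] * fec["area_km2"] for fec in fecs) / total_area


# Hypothetical hinterland made of two FECs.
fecs = [
    {"area_km2": 10.0, "slope_pct": 4.0},
    {"area_km2": 30.0, "slope_pct": 8.0},
]
slope = hinterland_mean(fecs, "slope_pct")
print(slope)
```

Weighting by area keeps a small, steep FEC from dominating the value assigned to a large, flat hinterland.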


Points for further discussion

• Which additional attributes should we include in our modelling tasks?
• Which target variables should we predict?
• Which type of decision trees seems more useful: classification or regression trees?
• Should we perform modelling tasks for single test cases, for river groups with common properties, or for Europe as a whole?
• Where do you see a potential use of the constructed decision trees within the other MARS tasks?