Data Screening and Adjustments - UMass Amherst


Data Screening and Adjustments

Purpose:
- Detect and correct data errors
- Detect and treat missing data
- Detect and handle insufficiently sampled variables (e.g., rare species)
- Conduct transformations and standardizations
- Detect and handle outliers

Data Screening for Errors

- Examine summary statistics (e.g., n, mean, min, max) and check for irregularities

[Figure: summary statistics output annotated "Unrealistic value?" and "Where did all the data go?"]

Action: correct errors in the raw data
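The check described above can be sketched in a few lines of Python. This is only an illustration: the variable names and values are hypothetical, and missing values are assumed to be stored as `None`.

```python
import math

def summarize(values):
    """Return n, number missing, mean, min and max, ignoring None entries."""
    present = [v for v in values if v is not None]
    n = len(present)
    return {
        "n": n,
        "missing": len(values) - n,
        "mean": sum(present) / n if n else math.nan,
        "min": min(present) if n else math.nan,
        "max": max(present) if n else math.nan,
    }

# Hypothetical variables, invented for this illustration.
data = {
    "canopy_cover_pct": [35.0, 62.0, 48.0, 620.0, 51.0],  # 620% is not a realistic value
    "stem_count": [None, None, None, None, 12],            # almost all values are missing
}

summary = {name: summarize(column) for name, column in data.items()}
for name, stats in summary.items():
    print(name, stats)
```

A maximum far outside the plausible range, or an n far below the number of sample units, points back to the raw data for correction.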


Data Screening for Missing Data

- Evaluate the amount and pattern of missing data and take corrective action, if needed

Action: replace with prior knowledge; insert means or medians; use regression to estimate values

E.g., median replacement
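As a minimal sketch of median replacement, assuming missing values are stored as `None` and that one variable (column) is treated at a time; the function name and the example values are hypothetical:

```python
from statistics import median

def fill_with_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]

# Hypothetical variable with two missing values.
column = [4.0, None, 7.0, 5.0, None, 30.0]
filled = fill_with_median(column)
print(filled)  # [4.0, 6.0, 7.0, 5.0, 6.0, 30.0]
```

The median of the observed values (4, 5, 7, 30) is 6, and the extreme value 30 does not pull the replacement upward the way a mean would.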

Data Screening for Sufficiency

- Check for and drop "insufficient" variables
  - E.g., rare species in community datasets

Sufficiency is the extent to which each variable, e.g., each species' ecological character, is accurately and meaningfully described by the data.

E.g., species with very few records are not likely to be accurately placed in ecological space. You must decide at what level of frequency of occurrence you want to accept the 'message' and eliminate species below this level.


Data Screening for Sufficiency

- Other issues:
  - Influence of abundant generalists in community datasets

    Abundant generalists define strong dimensions of the data cloud that have no meaningful pattern on them. They can overwhelm the message of rarer species in some types of analysis. You must decide whether to include or exclude these "dominant" species.

  - Variables with too little variation (i.e., no signature)

    Variables with too little variation have no meaningful pattern (or influence) and are therefore unnecessary.

Data Screening for Sufficiency

[Figure: typical community dataset; labels: rare species, dominant species, 5% occurrence, median occurrence, 95% occurrence]


Data Screening for Sufficiency
Some Rules of Thumb

- Drop "insufficient" variables (species) and conduct sensitivity analysis
  - Rare species (e.g., <5% occurrence)
  - Too little variability (e.g., <5-10% CV)

[Figure annotations: "Too few occurrences?", "Too little variation?"]

Data Screening for Sufficiency
Some Rules of Thumb

- Drop "abundant generalist" species and conduct sensitivity analysis
  - Dominant species (e.g., >95% occurrence)

[Figure annotation: "Too ubiquitous?"]
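The rules of thumb for sufficiency can be sketched as a simple screen. The 5% and 95% occurrence cut-offs come from the rules above; the species names, the environmental variable, and the 40-site dataset are hypothetical, and the CV is computed with the sample standard deviation as one possible choice.

```python
from statistics import mean, stdev

def frequency_of_occurrence(column):
    """Fraction of sample units in which the species is present (value > 0)."""
    return sum(1 for v in column if v > 0) / len(column)

def cv_percent(column):
    """Coefficient of variation in percent (sample SD divided by mean)."""
    return 100.0 * stdev(column) / mean(column)

def screen_species(abundance, low=0.05, high=0.95):
    """Split species into kept and dropped using occurrence thresholds."""
    kept, dropped = {}, {}
    for name, column in abundance.items():
        f = frequency_of_occurrence(column)
        if f < low:
            dropped[name] = "rare"
        elif f > high:
            dropped[name] = "dominant"
        else:
            kept[name] = column
    return kept, dropped

# Hypothetical community dataset with 40 sites.
abundance = {
    "rare_sp": [1] + [0] * 39,      # present at 1 of 40 sites (2.5%)
    "mid_sp": [3, 0] * 20,          # present at 20 of 40 sites (50%)
    "dominant_sp": [5] * 39 + [0],  # present at 39 of 40 sites (97.5%)
}
kept, dropped = screen_species(abundance)
print(sorted(kept), dropped)

# Hypothetical environmental variable with very little variation.
elevation = [100.0, 101.0, 99.0, 100.0, 102.0]
print(round(cv_percent(elevation), 2))
```

Because the thresholds are judgment calls, rerunning the analysis with and without the dropped variables is the sensitivity analysis the rules call for.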


Data Transformations & Standardizations

Purpose:
- Statistical
  - Improve assumptions of normality, linearity, homogeneity of variance, etc.
  - Make units of variables comparable when measured on different scales.
- Ecological
  - Make ecological distance measures work better.
  - Reduce effect of total quantity in sample units, to put focus on relative quantities.
  - Equalize (or otherwise alter) the relative importance of variables (e.g., common and rare species).
  - Emphasize informative variables (species) at the expense of uninformative variables (species).

Data Transformations & Standardizations

Transformations are applied to each element of the data matrix, independent of the other elements. Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).

Raw Data Matrix

Site       A      B      C      D      E      F   Total
1          1      1      1      3      3      1      10
2          2      2      4      6      6      0      20
3         10     10     20     30     30      0     100
4          3      3      2      1      1      0      10
5          0      0      0      0      1      0       1
6          0      0      0      0     20      0      20
Total     16     16     27     40     61      1     161

Log Transformation: $b_{ij} = \log(x_{ij} + 1)$

Site       A      B      C      D      E      F   Total
1       0.30   0.30   0.30   0.60   0.60   0.30    2.41
2       0.48   0.48   0.70   0.85   0.85   0.00    3.34
3       1.04   1.04   1.32   1.49   1.49   0.00    6.39
4       0.60   0.60   0.48   0.30   0.30   0.00    2.28
5       0.00   0.00   0.00   0.00   0.30   0.00    0.30
6       0.00   0.00   0.00   0.00   1.32   0.00    1.32
Total   2.42   2.42   2.80   3.24   4.86   0.30   16.05

Column Z-score Standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Site       A      B      C      D      E      F   Total
1      -0.44  -0.44  -0.45  -0.31  -0.59   2.04   -0.20
2      -0.18  -0.18  -0.06  -0.06  -0.35  -0.41   -1.23
3       1.94   1.94   2.00   2.00   1.64  -0.41    9.12
4       0.09   0.09  -0.32  -0.49  -0.76  -0.41   -1.80
5      -0.71  -0.71  -0.58  -0.57  -0.76  -0.41   -3.73
6      -0.71  -0.71  -0.58  -0.57   0.82  -0.41   -2.16
Total      0      0      0      0      0      0       0


Monotonic Transformations

When to Transform?
- To adjust for highly skewed variables
- To better meet assumptions of statistical tests (e.g., normality, constant variance, etc.)
- To emphasize presence/absence (nonquantitative) signature

Which Transformation?
- Depends on type of data
- Whichever works best

Monotonic Transformations

Binary (presence/absence) Transformation: $b_{ij} = x_{ij}^{0}$ (power)

Acceptable domain of x: all
Range of f(x): 0 and 1 only

- Converts quantitative data into nonquantitative data
- Applicable for species data
- Most useful when there is little quantitative information present
- Can be a severe transformation

Binary transformation of the raw data matrix:

Site       A      B      C      D      E      F   Total
1          1      1      1      1      1      1       6
2          1      1      1      1      1      0       5
3          1      1      1      1      1      0       5
4          1      1      1      1      1      0       5
5          0      0      0      0      1      0       1
6          0      0      0      0      1      0       1
Total      4      4      4      4      6      1      23


Monotonic Transformations

Log Transformation: $b_{ij} = \log(x_{ij} + 1)$

Acceptable domain of x: >0
Range of f(x): all

- Compresses high values and spreads low values by expressing values as orders of magnitude
- Useful when there is a high degree of variation, the ratio of largest to smallest value is >10, or the data are highly positively skewed
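A short sketch of the log transformation applied to the example raw data matrix. Base 10 is used here because it reproduces the tabulated values (e.g., log10(1 + 1) = 0.30); the variable names are hypothetical.

```python
import math

# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

# b_ij = log10(x_ij + 1); adding 1 keeps zeros at zero after the transformation.
logged = [[math.log10(x + 1) for x in row] for row in RAW]

for site, row in enumerate(logged, start=1):
    print(site, [round(v, 2) for v in row])
```

Each element is transformed on its own, without reference to any row or column summary, which is what distinguishes a transformation from a standardization.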

Monotonic Transformations

[Figure: effect of the log transformation, $b_{ij} = \log(x_{ij} + 1)$]


Monotonic Transformations

Square Root Transformation: $b_{ij} = x_{ij}^{1/2}$ (power)

Acceptable domain of x: ≥0
Range of f(x): ≥0

- Similar in effect to, but less dramatic than, the log transformation
- Often used with count (meristic) data; e.g., when mean equals the variance (Poisson distribution)

Square root transformation of the raw data matrix:

Site       A      B      C      D      E      F   Total
1       1.00   1.00   1.00   1.73   1.73   1.00    7.46
2       1.41   1.41   2.00   2.45   2.45   0.00    9.73
3       3.16   3.16   4.47   5.48   5.48   0.00   21.75
4       1.73   1.73   1.41   1.00   1.00   0.00    6.88
5       0.00   0.00   0.00   0.00   1.00   0.00    1.00
6       0.00   0.00   0.00   0.00   4.47   0.00    4.47
Total   7.31   7.31   8.89  10.66  16.13   1.00   51.29

Monotonic Transformations

Power Family Transformation: $b_{ij} = x_{ij}^{1/p}$

Acceptable domain of x: ≥0
Range of f(x): ≥0

- Different exponents change the effect of the transformation; the smaller the exponent, the more compression applied to high values
- Flexible transformation useful for a wide variety of data

[Figure: Power Transformations; b plotted against x (0 to 100) for p = 1/2, 1/3, 1/4, 1/5, 1/10]
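The power family, including the square root and the binary (power 0) cases, can be sketched as follows. The function takes the exponent directly as a hypothetical `exponent` argument. One detail is an assumption made explicit in the code: Python evaluates `0 ** 0` as 1, so the presence/absence case is handled separately to keep absences at 0.

```python
# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def power_transform(matrix, exponent):
    """Raise every element to `exponent`; exponent 0 gives presence/absence."""
    if exponent == 0:
        # Python returns 1 for 0 ** 0, so absences are set to 0 explicitly.
        return [[1 if x > 0 else 0 for x in row] for row in matrix]
    return [[x ** exponent for x in row] for row in matrix]

square_root = power_transform(RAW, 0.5)
fourth_root = power_transform(RAW, 0.25)
binary = power_transform(RAW, 0)

print([round(v, 2) for v in square_root[2]])  # site 3 after the square root
print([round(v, 2) for v in fourth_root[2]])  # a smaller exponent compresses more
print(binary[4])                              # site 5 as presence/absence
```

Comparing the square root and fourth root of site 3 shows the stated effect: the smaller the exponent, the closer the large counts are pulled toward the small ones.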


Monotonic Transformations

[Figure: Power Family Transformation, $b_{ij} = x_{ij}^{1/p}$]

Monotonic Transformations

Arcsin Square Root Transformation: $b_{ij} = (2/\pi) \sin^{-1}(x_{ij}^{1/2})$

Acceptable domain of x: 0-1
Range of f(x): 0-1

- Spreads ends of the scale while compressing the middle for proportion data
- Useful for proportion data with positive skew (can use arcsine transformation for negative skew)

Raw data matrix expressed as proportions of column totals:

Site       A      B      C      D      E      F   Total
1       0.06   0.06   0.04   0.08   0.05   1.00    1.29
2       0.13   0.13   0.15   0.15   0.10   0.00    0.65
3       0.63   0.63   0.74   0.75   0.49   0.00    3.23
4       0.19   0.19   0.07   0.03   0.02   0.00    0.49
5       0.00   0.00   0.00   0.00   0.02   0.00    0.02
6       0.00   0.00   0.00   0.00   0.33   0.00    0.33
Total      1      1      1      1      1      1       6

Arcsin square root transformation of these proportions:

Site       A      B      C      D      E      F   Total
1       0.16   0.16   0.12   0.18   0.14   1.00    1.76
2       0.23   0.23   0.25   0.25   0.20   0.00    1.17
3       0.58   0.58   0.66   0.67   0.49   0.00    2.98
4       0.29   0.29   0.18   0.10   0.08   0.00    0.93
5       0.00   0.00   0.00   0.00   0.08   0.00    0.08
6       0.00   0.00   0.00   0.00   0.39   0.00    0.39
Total  1.256  1.256   1.21  1.198  1.392      1  7.3125
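A minimal sketch of the arcsine square root transformation for proportions. The function name is hypothetical; the two example proportions are species A at sites 1 and 3 of the example data after division by the column total (1/16 and 10/16).

```python
import math

def arcsine_sqrt(p):
    """b = (2 / pi) * asin(sqrt(p)) for a proportion p between 0 and 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("the transformation is defined for proportions in [0, 1]")
    return (2.0 / math.pi) * math.asin(math.sqrt(p))

proportions = [0.0, 1 / 16, 10 / 16, 1.0]
transformed = [arcsine_sqrt(p) for p in proportions]
print([round(b, 2) for b in transformed])  # [0.0, 0.16, 0.58, 1.0]
```

The factor 2/π rescales the result so that it stays between 0 and 1, like the input proportions.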


Monotonic Transformations

[Figure: effect of the arcsin square root transformation, $b_{ij} = (2/\pi) \sin^{-1}(x_{ij}^{1/2})$]

Monotonic Transformations
Some Rules of Thumb

- Use a log or square root transformation for "highly" skewed data or data ranging over several (>2) orders of magnitude
- Use arcsine square root transformation for proportion data
- Consider binary (presence/absence) transformation when:
  - percent zeros high (say >50%)
  - number of distinct values low (say <10)
  - beta diversity high (say >5)
- If applied to a related variable set (e.g., species), then use the same transformation (e.g., log) so that all are scaled the same; otherwise, transform independently


Standardizations

When to Standardize?
- To place on equal footing highly unequal sample units or variables (species)
- To better represent the patterns of interest

Which Standardization?
- Depends on objective (sample or variable adjustment) and statistical technique (ordination, cluster, etc.)
- Which standard (variance, totals, max, etc.) makes sense?

Example, row max standardization of the raw data matrix: $b_{ij} = x_{ij} / \max(x_i)$

Site       A      B      C      D      E      F   Total
1       0.33   0.33   0.33   1.00   1.00   0.33    3.33
2       0.33   0.33   0.67   1.00   1.00   0.00    3.33
3       0.33   0.33   0.67   1.00   1.00   0.00    3.33
4       1.00   1.00   0.67   0.33   0.33   0.00    3.33
5       0.00   0.00   0.00   0.00   1.00   0.00    1.00
6       0.00   0.00   0.00   0.00   1.00   0.00    1.00
Total   2.00   2.00   2.33   3.33   5.33   0.33   15.33

Standardizations

- Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.).
- All standardizations can be applied to either rows or columns (or both)

Column z-score standardization of the raw data matrix: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Site       A      B      C      D      E      F   Total
1      -0.48  -0.48  -0.50  -0.34  -0.65   2.24   -0.22
2      -0.19  -0.19  -0.07  -0.06  -0.38  -0.45   -1.35
3       2.13   2.13   2.19   2.19   1.80  -0.45   10.00
4       0.10   0.10  -0.35  -0.53  -0.83  -0.45   -1.97
5      -0.77  -0.77  -0.64  -0.63  -0.83  -0.45   -4.09
6      -0.77  -0.77  -0.64  -0.63   0.89  -0.45   -2.36
Total      0      0      0      0      0      0       0

Row z-score standardization of the raw data matrix: $b_{ij} = (x_{ij} - \bar{x}_i) / s_i$

Site       A      B      C      D      E      F   Total
1      -0.71  -0.71  -0.71   1.41   1.41  -0.71    0.00
2      -0.60  -0.60   0.30   1.21   1.21  -1.51    0.00
3      -0.60  -0.60   0.30   1.21   1.21  -1.51    0.00
4       1.21   1.21   0.30  -0.60  -0.60  -1.51    0.00
5      -0.45  -0.45  -0.45  -0.45   2.24  -0.45    0.00
6      -0.45  -0.45  -0.45  -0.45   2.24  -0.45    0.00
Total   -1.6   -1.6   -0.7  2.329  7.695  -6.12       0


Column or Row Standardizations?

Column Standardization
- When the principal concern is to adjust for differences (e.g., variances, total abundance, ubiquity) among variables (species) in order to place them on equal footing.
- When the focus is on the profile across sample units.

Row Standardization
- When the principal concern is to adjust for differences (e.g., total abundance, diversity) among sample units in order to place them on equal footing.
- When the focus is on the profile within a sample unit.

Common Standardizations

- Total: divide by margin total
- Max: divide by margin maximum
- Range: standardize values to range 0-1
- Frequency: divide by margin total and multiply by number of non-zero items, so that the average of non-zero items is 1
- Hellinger: square root of method=total
- Normalization: make margin sums of squares equal 1
- Standardize: scale to zero mean and unit variance (z-scores)
- Chi.square: divide by row sums and square root of column sums, and adjust for square root of matrix total


Standardizations

Column Z-score Standardization: $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$

Acceptable domain of x: all
Range of f(x): all

- Converts data to z-scores (mean=0, variance=1)
- Commonly used to place variables on equal footing
- Essential when variables have different scales or units of measurement
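A sketch of the column z-score standardization for the example raw data matrix. It uses the sample standard deviation (`statistics.stdev`, n - 1 denominator), which gives -0.44 for species A at site 1; the population form (`statistics.pstdev`) would give -0.48 instead. The function name is hypothetical.

```python
from statistics import mean, stdev

# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def column_zscores(matrix):
    """b_ij = (x_ij - column mean) / column sample SD."""
    columns = list(zip(*matrix))  # transpose rows into columns
    stats = [(mean(col), stdev(col)) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in matrix]

z = column_zscores(RAW)
for site, row in enumerate(z, start=1):
    print(site, [round(v, 2) for v in row])
```

After standardization every column has mean 0 and unit variance, so a species counted in the tens and one counted in single digits carry equal weight.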

Standardizations

Column Total Standardization: $b_{ij} = x_{ij} / \sum_i x_{ij}$

Acceptable domain of x: ≥0
Range of f(x): 0-1

- Commonly used with species data to adjust for unequal abundances among species
- Equalizes areas under curves of species response profiles
- Relative abundance profiles of samples depend on species' relative abundances across all sites


Standardizations

Column Max Standardization: $b_{ij} = x_{ij} / \max(x_j)$

Acceptable domain of x: ≥0
Range of f(x): 0-1

- Similar to column total, except:
  - Equalizes heights of peaks of species response curves
  - Based on extreme values which can introduce noise
  - Can exacerbate importance of rare species

Column max standardization of the raw data matrix:

Site       A      B      C      D      E      F   Total
1       0.10   0.10   0.05   0.10   0.10   1.00    1.45
2       0.20   0.20   0.20   0.20   0.20   0.00    1.00
3       1.00   1.00   1.00   1.00   1.00   0.00    5.00
4       0.30   0.30   0.10   0.03   0.03   0.00    0.77
5       0.00   0.00   0.00   0.00   0.03   0.00    0.03
6       0.00   0.00   0.00   0.00   0.67   0.00    0.67
Total   1.60   1.60   1.35   1.33   2.03   1.00    8.92

Standardizations

[Figure: frequency vs. abundance (count) for Species A and Species B in three panels: raw data; Column Total Standardization (equalizes area under curve); Column Max Standardization (equalizes peaks of curves)]
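The column total and column max standardizations differ only in the standard used, which the following sketch makes explicit. The helper name `standardize_columns` is hypothetical, and the code assumes every column has a non-zero total.

```python
# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def standardize_columns(matrix, standard):
    """Divide each element by a column standard such as sum or max."""
    standards = [standard(col) for col in zip(*matrix)]
    return [[x / s for x, s in zip(row, standards)] for row in matrix]

col_total = standardize_columns(RAW, sum)  # each column sums to 1
col_max = standardize_columns(RAW, max)    # each column peaks at 1

print([round(v, 2) for v in col_total[2]])  # site 3, column total
print([round(v, 2) for v in col_max[0]])    # site 1, column max
```

Species F, recorded once, receives the value 1.00 at site 1 under both standards, which illustrates how these standardizations can raise the importance of rare species.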


Standardizations

Row Total Standardization: $b_{ij} = x_{ij} / \sum_j x_{ij}$

Acceptable domain of x: ≥0
Range of f(x): 0-1

- Commonly used with species data to adjust for unequal abundances among sample units
- Equalizes areas under curves of sample unit profiles
- Shifts emphasis to relative abundance within a sample unit
- Relative abundance profiles of samples are independent

Row total standardization of the raw data matrix:

Site       A      B      C      D      E      F   Total
1       0.10   0.10   0.10   0.30   0.30   0.10       1
2       0.10   0.10   0.20   0.30   0.30   0.00       1
3       0.10   0.10   0.20   0.30   0.30   0.00       1
4       0.30   0.30   0.20   0.10   0.10   0.00       1
5       0.00   0.00   0.00   0.00   1.00   0.00       1
6       0.00   0.00   0.00   0.00   1.00   0.00       1
Total   0.60   0.60   0.70   1.00   3.00   0.10       6

Standardizations

Row Max Standardization: $b_{ij} = x_{ij} / \max(x_i)$

Acceptable domain of x: ≥0
Range of f(x): 0-1

- Similar to row total, except:
  - Equalizes heights of peaks of sample unit profiles
  - Based on extreme values which can introduce noise
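The row versions can be sketched the same way, dividing each element by a standard computed from its own row. The helper name `standardize_rows` is hypothetical, and every row is assumed to have a non-zero total.

```python
# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def standardize_rows(matrix, standard):
    """Divide each element by a row standard such as sum or max."""
    return [[x / standard(row) for x in row] for row in matrix]

row_total = standardize_rows(RAW, sum)  # each row sums to 1
row_max = standardize_rows(RAW, max)    # each row peaks at 1

print([round(v, 2) for v in row_total[1]])  # site 2
print([round(v, 2) for v in row_total[2]])  # site 3, identical profile
print([round(v, 2) for v in row_max[3]])    # site 4, row max
```

Sites 2 and 3 differ fivefold in total abundance but have identical relative abundance profiles, so they become indistinguishable after row total standardization.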


Standardizations

Wisconsin Double Standardization

Acceptable domain of x: ≥0
Range of f(x): 0-1

- First standardize by species (column) maxima, then by row totals
- Equalize emphasis among sample units and among species
- Appealing, but comes at cost of diminishing the intuitive meaning for individual data values

Result for the raw data matrix (step 1: column max; step 2: row total):

Site       A      B      C      D      E      F   Total
1       0.01   0.01   0.01   0.01   0.01   0.10    0.15
2       0.01   0.01   0.01   0.01   0.01   0.00    0.05
3       0.01   0.01   0.01   0.01   0.01   0.00    0.05
4       0.03   0.03   0.01   0.00   0.00   0.00    0.08
5       0.00   0.00   0.00   0.00   0.03   0.00    0.03
6       0.00   0.00   0.00   0.00   0.03   0.00    0.03
Total   0.06   0.06   0.04   0.03   0.10   0.10    0.39

Standardizations
Some Rules of Thumb

- Effect of standardization on analysis depends on variability among rows and/or columns


Standardizations
Some Rules of Thumb

- Consider row standardizations for species data sets, commonly:
  - Row normalize (Euclidean distance (ED) = chord distance)
  - Row chi.square (ED = chi.square distance of CA/CCA)
  - Row total (ED = species profile distance)
  - Row hellinger (ED = Hellinger distance)

(Legendre and Gallagher 2001)
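The link between a row standardization and a distance measure can be sketched directly: Euclidean distance computed on Hellinger-standardized rows is the Hellinger distance, and on row-normalized rows it is the chord distance. The function names below are hypothetical, and every row is assumed to have a non-zero total.

```python
import math

# Example raw data matrix: rows are sites 1-6, columns are species A-F.
RAW = [
    [1, 1, 1, 3, 3, 1],
    [2, 2, 4, 6, 6, 0],
    [10, 10, 20, 30, 30, 0],
    [3, 3, 2, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 20, 0],
]

def row_hellinger(matrix):
    """Square root of the row total standardization."""
    return [[math.sqrt(x / sum(row)) for x in row] for row in matrix]

def row_normalize(matrix):
    """Scale each row so that its sum of squares equals 1."""
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row] for row in matrix]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

hel = row_hellinger(RAW)
chord = row_normalize(RAW)

hellinger_1_5 = euclidean(hel[0], hel[4])  # sites 1 and 5
hellinger_2_3 = euclidean(hel[1], hel[2])  # sites 2 and 3, same proportions
chord_2_3 = euclidean(chord[1], chord[2])
print(round(hellinger_1_5, 3), round(hellinger_2_3, 3), round(chord_2_3, 3))
```

Sites 2 and 3 have the same species proportions, so both distances between them are zero even though their raw counts differ fivefold.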

Standardizations
Some Rules of Thumb

- Consider column standardizations to "equalize" variables measured in different units and scales, commonly:
  - Column standardize (z-scores = zero mean and unit variance): $b_{ij} = (x_{ij} - \bar{x}_j) / s_j$
  - Column normalize (uncentered, with column sums of squares = 1)
  - Column total (col sums = 1)
  - Column range (col range 0-1)


Standardizations
Some Rules of Thumb

- Standardizations may not matter, depending on the subsequent analysis, e.g.:
  - Principal components analysis of a correlation matrix has a built-in column standardization
  - Correspondence analysis of a species data set has essentially a built-in chi-square standardization
- No theoretical basis for selecting the "best" standardization; should justify on biological grounds and perhaps conduct sensitivity analysis

Data Screening for Outliers

- What are outliers?
  - Sample units with extreme values for individual variables (univariate outliers) or sample units with an unusual combination of values for more than one variable (multivariate outliers).
- Why worry about outliers?
  - Outliers can have a large effect on the outcome of an analysis and therefore can lead to erroneous conclusions.


Data Screening for Outliers

- Univariate outliers:
  - Examine sample standard deviation scores on each variable separately.

Extreme observations (standard deviation scores >3):

       AMGO   AMRO   BAEA   BCCH
81     6.50   4.91     NA     NA
82     6.50     NA     NA     NA
83     4.27   4.30     NA   4.29
84     6.50     NA     NA     NA
85       NA     NA     NA   5.44
87       NA     NA     NA     NA
89       NA     NA     NA   5.44
90       NA     NA     NA     NA
91       NA     NA  12.73     NA
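A minimal sketch of the univariate screen: convert one variable to standard deviation scores and flag sample units whose absolute score exceeds 3. The 20 values are hypothetical, the function names are illustrative, and the sample standard deviation is used.

```python
from statistics import mean, stdev

def sd_scores(values):
    """Standard deviation scores (z-scores) for one variable."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def univariate_outliers(values, threshold=3.0):
    """Indices of sample units with |standard deviation score| > threshold."""
    return [i for i, z in enumerate(sd_scores(values)) if abs(z) > threshold]

# Hypothetical variable measured on 20 sample units; the last value is extreme.
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 60]

flagged = univariate_outliers(values)
print(flagged)  # [19]
```

With few sample units a single extreme value inflates the standard deviation enough that its own score cannot exceed 3, so this screen is most informative for larger samples.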

Data Screening for Outliers

- Multivariate outliers:
  - Examine deviations of the sample average distances to other samples.

[Figure: extreme observations, standard deviation scores >3]
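The average-distance screen can be sketched as follows: compute each sample's average distance to all other samples, convert those averages to standard deviation scores, and flag scores above 3. Euclidean distance is assumed here for simplicity (any ecological distance could be substituted), and the 15 samples are hypothetical.

```python
import math
from statistics import mean, stdev

def average_distances(samples):
    """Average Euclidean distance from each sample to all other samples."""
    n = len(samples)
    return [
        sum(math.dist(samples[i], samples[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def multivariate_outliers(samples, threshold=3.0):
    """Indices of samples whose average distance has an SD score > threshold."""
    avg = average_distances(samples)
    m, s = mean(avg), stdev(avg)
    return [i for i, a in enumerate(avg) if (a - m) / s > threshold]

# Hypothetical data: 14 similar samples on three variables plus one unusual sample.
samples = [[10 + (i % 3), 5 + (i % 2), 8 - (i % 4)] for i in range(14)]
samples.append([40, 30, 0])

flagged = multivariate_outliers(samples)
print(flagged)  # [14]
```

Only large positive scores are flagged, since an unusually small average distance indicates a typical sample rather than an outlier.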


Data Screening for Outliers

- Multivariate outliers:
  - Examine each sample's Mahalanobis distance to the group of remaining samples.

[Figure: Mahalanobis distance, $D^2$]

Data Screening for Outliers

- Multivariate outliers:
  - Examine results of subsequent analyses for extreme values (e.g., isolated points in ordination plots, single-member clusters in cluster analysis, etc.)

[Figure: ordination plot with axes PC1, PC2, PC3]


Data Screening for Outliers
Some Rules of Thumb

- Examine data at all stages of analysis (i.e., input data, transformed/standardized data, ecological distance matrix, results of analysis) for extreme values
- Be aware of potential impact of extreme values in chosen analysis
- Delete extreme values only if justifiable on ecological grounds
- Conduct sensitivity analysis