Environmental System Factor analysis, Principle … Credit ※Att d & S b i i f tAttendance &...

2016 Winter / Graduate School

Environmental System Analysis

KURISU Kiyo

Scope of this class

Learn basic statistics using EXCEL

Learn how to use SPSS (popular software for statistics)

Learn multivariate analysis, such as Factor Analysis, Principal Component Analysis, and Structural Equation Modeling, from basics to practical analysis using SPSS

→ If you already know these, you need not take this class.

Style of this class

Exercise style

I explain the basics of each statistical method and show the analytical procedure using EXCEL and SPSS one by one.

Students also use PCs* and try the same procedure.

* I will bring 8-10 notebook PCs equipped with EXCEL and SPSS (& Amos) to this class. You can use these PCs.

→ If SPSS & Amos are installed on your PC, you can bring and use it.

Schedule

No.  Date      Contents
1    Sept. 26  Outline; basic analysis for comparison of means: t-test
-    Oct. 3    Cancelled
-    Oct. 10   Holiday
2    Oct. 17   Basic analysis for comparison of means, repeated data analysis: t-test, ANOVA, repeated ANOVA
3    Oct. 24   Regression Analysis & Multiple regression analysis
4    Oct. 31   Principal Component Analysis & Factor Analysis & Cluster Analysis
5    Nov. 7    Path analysis using Amos
6    Nov. 14   Structural Equation Modeling using Amos

For Credit

※ Attendance & Submission of report

Analyze data using the statistical analysis learned in this class.

You can use your own data as well as public statistical data.

You should use analytical methods shown during this class, but you can use any tools (such as EXCEL, SPSS, R, etc.).

Deadline: Dec. 16 (Fri)
Submission: [email protected] via e-mail

(0) What kind of data do you want to analyze?

Data Characteristics

Type of Data   Function                                                                    Example
Nominal        Labeling                                                                    ID no.
Ordinal        Labeling + Order                                                            A, B, C (ranked 1, 2, 3)
Interval       Labeling + Order + Same interval for 1 unit                                 Temperature (℃: degrees Celsius)
Proportional   Labeling + Order + Same interval for 1 unit + Origin (0) has meaning        Concentration (mg/L)

First of all, you should check what scale is used for your data.


Data gained by lab experiments are usually proportional-scale data.

Condition   NO3 (mgN/L)   BOD (mgO/L)   PO4 (mgP/L)
A           5.8           2.2           0.03
B           10.5          7.8           0.20
C           2.3           5.6           0.04
D           6.6           3.2           0.09
E           2.5           4.5           0.12
F           20.3          15.0          0.23
G           7.8           3.4           0.02


Data gained by questionnaire surveys can involve various types of data.

Columns: Respondent; Gender (male: 1, female: 2); Blood cholesterol (µg/L); Body temperature (℃); Daily exercise (1: always, 2: sometimes, 3: rarely, 4: never)

Respondent   Gender   Blood cholesterol   Body temperature   Daily exercise
A            1        25                  36.8               1
B            2        50                  35.4               4
C            2        15                  36.3               2
D            1        32                  37.1               2
E            2        45                  36.2               1
F            1        63                  37.3               2
G            1        78                  36.4               3

How to handle a Likert scale?

Example response options: Think so very much / Think so / Think so a bit / Do not think so very much / Do not think so / Never think so

Strictly: "Ordinal Scale" data
5 or more components: can be handled as "Interval Scale" data
3 or fewer components: it is better not to apply "Interval Scale"
4 components: gray zone >>> Interval scale can be applied.

If respondents can distinguish among 5 or more components, you can use 5 or more components. But if people cannot differentiate 5 or more components, 4 components are often used.

Likert's Scale

Rensis Likert: Creator of Organizations (1 SEPTEMBER 2010)

Rensis Likert served as vice president of the American Statistical Association from 1953 to 1955 and was president of the ASA in 1959. His service to the ASA, however, hardly begins to indicate the breadth of the achievements of his long and vigorous career.

Likert was born in Cheyenne, Wyoming, in 1903. After living with his parents in several states, he entered the University of Michigan in 1922. There, he concentrated on civil engineering for a couple of years, until he took a course in sociology. Robert Angell recalls him from that class as his brightest engineering student. The respect was mutual, and Likert found that his scientific interests were more in people than in things. He transferred in his senior year from the college of engineering and took his bachelor's degree in sociology in 1926.

Likert went from Michigan to Union Theological Seminary for one year, then to the Columbia University Department of Psychology, where he took his PhD in 1932. At Columbia, Likert gradually moved from traditional fields of psychology to the new social psychology. In this, he was influenced by Gardner Murphy, who served as chair for his dissertation committee. His doctoral research dealt with a wide-ranging set of attitudes, interrelating student attitudes toward race and international affairs; it was published in 1932 as "A Technique for the Measurement of Attitudes."

Engineering, sociology, psychology, ethics, and statistics. Likert maintained throughout his life an energetic appreciation of concepts and values from all these areas. He was always curious about how things work[ed] and how to fix them when they did not. His strong feel for structures and measurements also showed in his quantitative and pragmatic approaches to social problems and social measurements. The year at Union Theological Seminary was reflected in his openness, his optimism, and both his desire and his ability to do good.

Likert did well in several areas. During the course of his doctoral research, he developed what soon became the famous "Likert scale," an early example of "sufficing" that exemplifies Likert's pragmatic approach. He showed, with empirical comparisons, that his much simpler method (asking the respondent to place himself on a scale of favor/disfavor with a neutral midpoint) gave results very similar to those of the much more cumbersome (though more theoretically elegant) Thurstone procedure (based on the psychophysical method of equal-appearing intervals). The Likert scale has been adopted throughout the world, and continuing demand for the thesis led to its re-publication in the series Classic Contributions to Social Psychology.

This article was originally written by Leslie Kish and appeared in The American Statistician in 1982.

http://magazine.amstat.org/blog/2010/09/01/rensislikertsep10/

(1) How to analyze the difference between groups

Two groups

Comparison of two groups (Type II)

Want to compare averages from two groups: which is taller, Japan or USA, in the parent populations? We have to judge from sampled data.

Body height of 25-year-old ladies in Japan (cm)
160, 158, 155, 165, 152, 166, 163, 158, 156, 165
Sampled data: sampled 25-year-old ladies in Japan. Number of samples: nA, Mean: x̄A, Variance: sA²
Parent population: all 25-year-old ladies in Japan. Mean: μA, Variance: σA²

Body height of 25-year-old ladies in USA (cm)
170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168
Sampled data: sampled 25-year-old ladies in USA. Number of samples: nB, Mean: x̄B, Variance: sB²
Parent population: all 25-year-old ladies in USA. Mean: μB, Variance: σB²

Which methodology can you choose? (Type II)

Want to compare averages from two groups

Parametric test: data for each group are normally distributed
  Equal variance between groups: Student's t-test
  Not equal variance: Welch's test

Non-parametric test: data are not normally distributed
  Wilcoxon test (W-test)
  Mann-Whitney test (U-test)
  Kolmogorov-Smirnov test

What are Parametric & Nonparametric?

Parametric test
It assumes that the parent data follow a normal distribution.
The t-test & variance test assume that averages or variances are normally distributed.

Non-parametric test
It does not assume a normal distribution.
The calculation methods are complicated, so they were not often used for a long time.
However, various methods are now available in statistical software, thanks to high-spec computers.

How to check Normal distribution?

(A) Make a histogram

(B) Check on EXCEL
1) Rank each data: RANK(data, range, 1)
2) Calculate p: (RANK - 1/2)/n
3) NORMSINV(p)
[Figure: plot of X against NORMSINV(p)]

(C) Check on SPSS
[Normality plots with tests]
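The EXCEL steps in (B) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `normal_scores` is hypothetical, `NormalDist().inv_cdf` plays the role of NORMSINV, and ties are given the lowest rank, as RANK(data, range, 1) does. If the points (x, z) fall close to a straight line, the data are roughly normally distributed.

```python
from statistics import NormalDist

def normal_scores(data):
    """Return (value, z) pairs for a normal probability plot."""
    n = len(data)
    pairs = []
    for x in data:
        # Ascending rank; tied values share the lowest rank (like Excel RANK(..., 1))
        rank = 1 + sum(1 for y in data if y < x)
        p = (rank - 0.5) / n             # step 2: p = (RANK - 1/2) / n
        z = NormalDist().inv_cdf(p)      # step 3: counterpart of NORMSINV(p)
        pairs.append((x, z))
    return pairs

# Body height of the Japanese sample used in this class (cm)
heights_japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
for x, z in sorted(normal_scores(heights_japan)):
    print(f"{x:5d}  {z:+.3f}")
```

Plotting the first column against the second gives the same picture as the EXCEL chart.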

How to check Equal Variance?

Bartlett's test
Levene's test
These tests are usually included in statistical software.

Example in R:

bartlett.test(y$●● ~ y$▲▲)

Bartlett test of homogeneity of variances
data: y$●● by y$▲▲
Bartlett's K-squared = ××, df = 2, p-value = □□

Concept of t-test (Type II)

(H1: Alternative hypothesis) μA≠μB
In my opinion, there should be a significant difference between the groups.

(H0: Null hypothesis) μA=μB
Temporarily assume no difference.

Estimate the probability p of getting data like yours in no-difference cases (μA=μB).

If this probability is quite small (p<0.05, p<0.01), it means that such data rarely occur when μA=μB.

The null hypothesis is rejected.

You can say there is a significant difference between the groups (μA≠μB).

Concept of Student's t-test

[Figure: t distribution with regions labeled "Possibly occurs" and "Rarely occurs", and the position of the t value calculated from your data]

One-tailed test or Two-tailed test?

One-tailed (left)   Two-tailed    One-tailed (right)
H0: μ=μ0            H0: μ=μ0      H0: μ=μ0
H1: μ<μ0            H1: μ≠μ0      H1: μ>μ0

If your alternative (first) hypothesis is "one average is bigger/smaller than the other", you should apply a one-tailed test.
If you just want to check whether the averages of two groups are the same or not, you can apply a two-tailed test.










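Parametric analysis - two groups comparison (Type II)
Text Page7

Your t value

Student's t-test (equal variance):
$$ t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s^2}{n_A} + \dfrac{s^2}{n_B}}}, \qquad s^2 = \frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2}, \qquad df = n_A + n_B - 2 $$

Welch's test (not equal variance):
$$ t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}, \qquad df = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{s_A^4}{n_A^2 (n_A - 1)} + \dfrac{s_B^4}{n_B^2 (n_B - 1)}} $$

Reference t value

=TINV(a, df)
95% significance: Two tailed: TINV(0.05, df), One tailed: TINV(0.1, df)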
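As a rough illustration of "your t value", the sketch below computes Student's t (pooled variance, df = nA+nB-2) and Welch's t (separate variances, approximate df) for the Japan and USA height samples of this class. The function names `student_t` and `welch_t` are hypothetical; the reference t value still has to be looked up separately, for example with TINV in EXCEL.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Student's t-test statistic and df (assumes equal variance)."""
    na, nb = len(a), len(b)
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(s2 / na + s2 / nb)
    return t, na + nb - 2

def welch_t(a, b):
    """Welch's test statistic and approximate df (no equal-variance assumption)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb   # sA^2/nA and sB^2/nB
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
usa = [170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168]

print(student_t(japan, usa))   # t is about -4.01 with df = 20
print(welch_t(japan, usa))     # t is about -3.84 with df between 14 and 15
```

The absolute value of t is then compared with the reference t value for the chosen significance level and df.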

How to analyze the Equal-Variance (Type II)

Levene's test
Text Page12; Excel TypeII(Levene's test)

1) H0: sA² = sB²
If you want to use an analysis which requires equal variance, this hypothesis should not be rejected (p>0.05 is required). This is different from usual tests.

2) Calculate La
$$ L_a = \frac{n - N_G}{N_G - 1} \cdot \frac{\sum_{k=1}^{N_G} n_k \left(\bar{Z}_k - \bar{Z}\right)^2}{\sum_{k=1}^{N_G} \sum_{i=1}^{n_k} \left(Z_{ki} - \bar{Z}_k\right)^2}, \qquad Z_{ki} = \left| x_{ki} - \bar{x}_k \right| $$

nk: number of samples in 'Group k'
NG: number of groups
n: total sample number
x̄k: average of 'Group k'
xi: each data

3) Compare with F(a, NG-1, n-NG)

[Figure: SPSS's output]
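A minimal sketch of step 2, assuming the statistic is built from absolute deviations from each group average (Z = |x - group average|). The function name `levene_statistic` is hypothetical, and the comparison with F(a, NG-1, n-NG) in step 3 is left to a table or to statistical software.

```python
from statistics import mean

def levene_statistic(groups):
    """Levene's La from absolute deviations Z = |x - group average|."""
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    n = sum(len(g) for g in groups)          # total sample number
    ng = len(groups)                         # number of groups
    z_bar_k = [mean(zk) for zk in z]         # average of Z in each group
    z_bar = sum(sum(zk) for zk in z) / n     # average of all Z
    between = sum(len(zk) * (m - z_bar) ** 2 for zk, m in zip(z, z_bar_k))
    within = sum((v - m) ** 2 for zk, m in zip(z, z_bar_k) for v in zk)
    return (n - ng) / (ng - 1) * between / within

japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
usa = [170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168]
print(levene_statistic([japan, usa]))
```

A small La (large p) means the equal-variance hypothesis is not rejected, which is what Student's t-test requires.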










Non-parametric analysis - two groups comparison (Type II)

Parametric test: Original data → Statistic value → Compare with reference value
Non-parametric test: Ranking data → Statistic value → Compare with reference value

Wilcoxon test (Type II)
Text Page8; Excel TypeII(3)

1) Calculate the rank Ri of each value (over both groups together)
2) Calculate the summation of ranks for each group (WA, WB)
3) Compare with the reference value (Wilcoxon distribution curve)

Body height of 25-year-old ladies in Japan (cm), with rank Ri in parentheses
160 (6), 158 (4), 155 (2), 165 (11), 152 (1), 166 (16), 163 (8), 158 (4), 156 (3), 165 (11)
WA = 66

Body height of 25-year-old ladies in USA (cm), with rank Ri in parentheses
170 (21), 162 (7), 165 (11), 168 (17), 164 (10), 168 (17), 165 (11), 163 (8), 168 (17), 172 (22), 165 (11), 168 (17)
WB = 169

Reference values: ω(0.025, 10, 12) = 84 (lower) and 146 (upper)

Reference table (lower ω, upper ω)
N1   N2   a = 0.05    a = 0.025   a = 0.01
9    19   96, 165     90, 171     83, 17…
9    20   99, 171     93, 177     85, 18…
10   10   82, 128     78, 132     74, 13…
10   11   86, 134     81, 139     77, 14…
10   12   89, 141     84, 146     79, 15…
10   13   92, 148     88, 152     82, 15…
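The rank sums can be reproduced with a short sketch. It assumes that tied values share the lowest rank, as EXCEL's RANK(data, range, 1) does; many textbooks use average ranks for ties instead, which gives slightly different sums. The function name `rank_sums` is hypothetical.

```python
def rank_sums(a, b):
    """Sum of pooled ranks for each group; ties share the lowest rank."""
    pooled = a + b

    def rank(x):
        # 1 + number of pooled values that are strictly smaller than x
        return 1 + sum(1 for y in pooled if y < x)

    return sum(rank(x) for x in a), sum(rank(x) for x in b)

japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
usa = [170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168]

w_a, w_b = rank_sums(japan, usa)
print(w_a, w_b)  # 66 169
```

With N1 = 10 and N2 = 12, WA = 66 lies below the lower reference value ω(0.025, 10, 12) = 84, so the two groups differ significantly.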


Three or more groups

Differences among three groups (Type III)

Body height of 25-year-old ladies in Japan (cm)
160, 158, 155, 165, 152, 166, 163, 158, 156, 165
Sampled data: sampled 25-year-old ladies in Japan. Number of samples: nA, Mean: x̄A, Variance: sA²
Parent population: all 25-year-old ladies in Japan. Mean: μA, Variance: σA²

Body height of 25-year-old ladies in USA (cm)
170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168
Sampled data: sampled 25-year-old ladies in USA. Number of samples: nB, Mean: x̄B, Variance: sB²
Parent population: all 25-year-old ladies in USA. Mean: μB, Variance: σB²

Body height of 25-year-old ladies in Italy (cm)
155, 162, 165, 163, 160, 160, 158, 165, 161, 166, 156, 160
Sampled data: sampled 25-year-old ladies in Italy. Number of samples: nC, Average: x̄C, Variance: sC²
Parent population: all 25-year-old ladies in Italy. Average: μC, Variance: σC²

Why you should not repeat the t-test? (Type III)

A vs B: significant with 95%
A vs C: significant with 95%
B vs C: significant with 95%

If these three tests are repeated, the overall probability decreases to
0.95 × 0.95 × 0.95 = 0.857

Instead of the above, we want to analyze the differences among the three groups (A, B, C) with 95% significance.
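The arithmetic behind this warning can be checked with a few lines. This is only a sketch: the function name `overall_confidence` is hypothetical, and the tests are treated as independent, which is the simplification used in the multiplication above.

```python
from math import comb

def overall_confidence(alpha, n_tests):
    """Probability that none of n_tests independent tests gives a false positive."""
    return (1 - alpha) ** n_tests

n_pairs = comb(3, 2)                         # 3 pairwise t-tests for groups A, B, C
print(n_pairs)                               # 3
print(overall_confidence(0.05, n_pairs))     # about 0.857
print(1 - overall_confidence(0.05, n_pairs)) # about 0.143 chance of at least one false positive
```

With more groups the number of pairs grows quickly, so the overall confidence drops even further.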

Which method can we choose? (Type III)
Text Page13

Compare among three or more groups

Normally distributed (variance equal / variance not equal): One-way ANOVA
Not normally distributed: Kruskal-Wallis one-way analysis of variance (H-test)

Analysis of variance is considered to be a very robust analytical method, and the results are not influenced much by the requirements of normal distribution and equal variance. So, ANOVA is sometimes conducted without checking the prerequisites.

Concept of Variance Analysis (F-test) (Type III)

Total Variance (SST) = Variance between Groups (SSG) + Variance within group (SSE)

$$ F = \frac{SSG / df_G}{SSE / df_E} $$

If SSG is much bigger than SSE, it means there is a difference between groups.
If SSE is much bigger than SSG, it means there is no significant difference between groups.

[Figure: height distributions for Japan, Italy, USA]
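The F value can be computed directly from its definition. The sketch below is illustrative only: the function name `one_way_anova_f` and the small example groups are hypothetical, and the reference F value for the chosen significance level still has to come from a table or from software.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (SSG / dfG) / (SSE / dfE) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    # Variance between groups: group averages around the overall average
    ssg = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variance within groups: each value around its own group average
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_g = len(groups) - 1
    df_e = len(all_values) - len(groups)
    return (ssg / df_g) / (sse / df_e)

# Hypothetical example: SSG = 6, SSE = 6, dfG = 2, dfE = 6
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

A large F means the differences between group averages are large compared with the scatter inside each group.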

Concept of F-test (Type III)

(H1: Alternative hypothesis) μA≠μB or μB≠μC or μC≠μA
In my opinion, there should be a significant difference between the groups.

(H0: Null hypothesis) μA=μB=μC
Temporarily assume no difference.

Estimate the probability p of getting data like yours in no-difference cases (μA=μB=μC).

If this probability is quite small (p<0.05, p<0.01), it means that such data rarely occur when μA=μB=μC.

The null hypothesis is rejected.

You can say there is a significant difference between the groups (μA≠μB or μB≠μC or μC≠μA).

But just by the F-test, you cannot know where the significant difference is. >>> A post hoc test is needed.

Type III
Text Page13; Excel TypeIII

nk: number of samples in 'Group k'
NG: number of groups
n: total sample number
x̄k: average of 'Group k'
x̄: average of all data

What is Levene's Test?

ANOVA: ANOVA of "X". We see "X" (the data themselves). H0: µA = µB
Levene's test: ANOVA of "Z". We see "Z" (Z = difference from the group average). H0: sA² = sB²

[Figure: X and Z for Japan, Italy, USA]

Kruskal-Wallis analysis
Text Page14-15; Excel TypeIII(KW)

1) Rank all data: Rki (rank of data Xi in group k)
2) Group average of the ranks: R̄k = Σi Rki / nk
3) Total average of ranks: R̄ = Σ Ri / n
4) Calculate the KW value
$$ KW = \frac{12}{n(n+1)} \sum_{k} n_k \left(\bar{R}_k - \bar{R}\right)^2 $$
5) Calculate the χ²(g-1; a) value
KW > χ²(g-1; a) >>>> H0 is rejected

The value of KW approximately follows a χ² distribution. g: number of groups
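Steps 1 to 4 can be sketched as follows. The function name `kruskal_wallis` and the tiny example are hypothetical; tied values receive average ranks here, and no tie correction is applied to the KW value, which is a simplification.

```python
def kruskal_wallis(groups):
    """KW = 12 / (n (n + 1)) * sum over groups of nk * (Rk_bar - R_bar)^2."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)

    def avg_rank(x):
        lower = sum(1 for y in pooled if y < x)
        equal = sum(1 for y in pooled if y == x)
        return lower + (equal + 1) / 2       # average rank for tied values

    r_bar = (n + 1) / 2                      # total average of ranks
    total = 0.0
    for g in groups:
        rk_bar = sum(avg_rank(x) for x in g) / len(g)   # group average of ranks
        total += len(g) * (rk_bar - r_bar) ** 2
    return 12 / (n * (n + 1)) * total

# Hypothetical example with three clearly separated groups
print(kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # about 7.2
```

For three groups (g - 1 = 2) the 5% reference value of χ² is about 5.99, so KW = 7.2 would reject H0 in this made-up example.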

Post Hoc Test (Multiple comparison): where does the difference exist?
Text Page14-15

Multiple groups comparison

The method is chosen by:
Sample numbers: different / equal
Variance: equal / not equal
Error degrees of freedom: dfE≥75 / dfE<75

Methods: Tukey-Kramer, Hochberg GT2, Gabriel, Tukey HSD, REGWQ, Games-Howell, Dunnett's C, Dunnett's T3


Paired Data

How to handle Paired data? (Type IV)
Text Page17

Paired samples or paired comparisons occur when a single individual is tested twice (e.g. before and after) or the sampling station is retested.

Subject   Body weight before Apple Diet (kg)   Body weight after Apple Diet (kg)
A         60.8                                 59.2
B         55.7                                 54.7
C         50.4                                 50.2
D         48.2                                 47.2
E         75.2                                 65.6
F         56.4                                 56.3
G         47.3                                 47.0
H         65.1                                 60.0
I         80.3                                 75.3
J         52.4                                 52.3

Sampled data: Before Apple Diet. Number of samples: n, Average: x̄A, Variance: sA²
Parent population: Before Apple Diet. Average: μA, Variance: σA²
Sampled data: After Apple Diet. Number of samples: n, Average: x̄B, Variance: sB²
Parent population: After Apple Diet. Average: μB, Variance: σB²

Calculate the difference between A (before) and B (after), and then you can construct the null hypothesis that the difference is 0.

Null Hypothesis H0: μA-μB=0
Alternative Hypothesis H1: μA-μB≠0

Why you must not use the basic t-test?

Body weight before and after the Apple Diet (kg)

Subject   Before   After <Original data>   After <If you exchange some of the data…>
A         60.8     59.2                    56.3
B         55.7     54.7                    50.2
C         50.4     50.2                    52.3
D         48.2     47.2                    47.2
E         75.2     65.6                    60.0
F         56.4     56.3                    59.2
G         47.3     47.0                    47.0
H         65.1     60.0                    65.6
I         80.3     75.3                    75.3
J         52.4     52.3                    54.7

Original data: you can find the trend of decrease.
Exchanged data: you cannot find any trend.

If you apply the basic t-test, the above two data sets give the same result.

>>> Repeated t-Test
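The point of this slide can be checked numerically. The sketch below uses the Apple Diet data of this class; the function names `unpaired_t` and `paired_t` are hypothetical. The basic (unpaired) t value only depends on the set of values in each column, while the paired t value depends on which "after" value belongs to which person.

```python
from math import sqrt
from statistics import mean, stdev, variance

before = [60.8, 55.7, 50.4, 48.2, 75.2, 56.4, 47.3, 65.1, 80.3, 52.4]
after_original = [59.2, 54.7, 50.2, 47.2, 65.6, 56.3, 47.0, 60.0, 75.3, 52.3]
after_exchanged = [56.3, 50.2, 52.3, 47.2, 60.0, 59.2, 47.0, 65.6, 75.3, 54.7]

def unpaired_t(a, b):
    """Basic Student's t: ignores which values belong to the same person."""
    na, nb = len(a), len(b)
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(s2 / na + s2 / nb)

def paired_t(a, b):
    """Paired t: one-sample t-test on the differences, df = n - 1."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

print(unpaired_t(before, after_original), unpaired_t(before, after_exchanged))  # same value
print(paired_t(before, after_original))    # about 2.40
print(paired_t(before, after_exchanged))   # about 1.39
```

With df = 9 the one-tailed 5% reference value is about 1.83, so only the original pairing shows a significant decrease.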

Basic procedure of paired data (Type IV)
Excel Type IV

Body weight before diet   Body weight after diet   Change of body weight yi
60.8                      59.2                     1.6
55.7                      54.7                     1.0
50.4                      50.2                     0.2
48.2                      47.2                     1.0
75.2                      65.6                     9.6
56.4                      56.3                     0.1
47.3                      47.0                     0.3
65.1                      60.0                     5.1
80.3                      75.3                     5.0
52.4                      52.3                     0.1

Hypothesis H1: μBefore > μAfter, that is, μBefore – μAfter > 0
Null Hypothesis H0: μBefore – μAfter = 0
>>> t-test (one tailed)

If you cannot assume which is larger and just assume a difference, you can apply a two-tailed test.
Hypothesis H1: μBefore ≠ μAfter
Null Hypothesis H0: μBefore = μAfter
>>> t-test (two tailed)

For Paired Data: Repeated ANOVA (Type V)

Length of root after toxic chemical is dosed

Sample   After 1 day   After 1 month   After 2 months
A        15.2          13.8            12.6
B        22.6          15.6            13.5
C        17.8          15.2            14.8
D        20.5          18.2            16.8
E        18.6          16.5            14.6
F        14.2          14.1            14.0
G        19.5          17.2            16.5
H        17.6          14.8            11.2
I        21.8          19.5            16.2
J        22.4          19.6            17.2

When you want to check the temporal change among 3 or more time points, you should use repeated ANOVA.

We want to know the effect of the chemical.
>> If the length of root decreases over time, you can say there is a significant effect of the chemical.

[Figure: root length 1 day, 1 month, and 2 months after chemical application]

Why you must not use basic ANOVA?

Length of root after toxic chemical is dosed

<Original data>
Sample   After 1 day   After 1 month   After 2 months
A        15.2          13.8            12.6
B        22.6          15.6            13.5
C        17.8          15.2            14.8
D        20.5          18.2            16.8
E        18.6          16.5            14.6
F        14.2          14.1            14.0
G        19.5          17.2            16.5
H        17.6          14.8            11.2
I        21.8          19.5            16.2
J        22.4          19.6            17.2
You can find the trend of decrease.

<If you exchange some of the data…>
Sample   After 1 day   After 1 month   After 2 months
A        15.2          18.2            12.6
B        22.6          15.6            13.5
C        17.8          15.2            16.2
D        20.5          13.8            16.8
E        18.6          19.5            14.6
F        14.2          14.1            14.0
G        19.5          17.2            16.5
H        17.6          19.6            11.2
I        21.8          16.5            17.2
J        22.4          14.8            14.8
You cannot find any trend.

If you apply basic one-way ANOVA, the above two data sets give the same result.

>>> Repeated ANOVA
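A small sketch can show the difference numerically. It assumes the usual repeated-measures decomposition, in which the variance between individuals (rows) is removed from the error term; the function names `one_way_f` and `repeated_f` are hypothetical, and no sphericity check or correction is included.

```python
from statistics import mean

original = [
    [15.2, 13.8, 12.6], [22.6, 15.6, 13.5], [17.8, 15.2, 14.8], [20.5, 18.2, 16.8],
    [18.6, 16.5, 14.6], [14.2, 14.1, 14.0], [19.5, 17.2, 16.5], [17.6, 14.8, 11.2],
    [21.8, 19.5, 16.2], [22.4, 19.6, 17.2],
]
exchanged = [
    [15.2, 18.2, 12.6], [22.6, 15.6, 13.5], [17.8, 15.2, 16.2], [20.5, 13.8, 16.8],
    [18.6, 19.5, 14.6], [14.2, 14.1, 14.0], [19.5, 17.2, 16.5], [17.6, 19.6, 11.2],
    [21.8, 16.5, 17.2], [22.4, 14.8, 14.8],
]

def sums_of_squares(rows):
    """Return (SS between time points, SS between individuals, SS total)."""
    n, k = len(rows), len(rows[0])
    grand = mean([x for r in rows for x in r])
    ss_time = n * sum((mean([r[j] for r in rows]) - grand) ** 2 for j in range(k))
    ss_subj = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    return ss_time, ss_subj, ss_total

def one_way_f(rows):
    """Basic one-way ANOVA: individuals are ignored."""
    n, k = len(rows), len(rows[0])
    ss_time, _, ss_total = sums_of_squares(rows)
    return (ss_time / (k - 1)) / ((ss_total - ss_time) / (n * k - k))

def repeated_f(rows):
    """Repeated ANOVA: variance between individuals is removed from the error."""
    n, k = len(rows), len(rows[0])
    ss_time, ss_subj, ss_total = sums_of_squares(rows)
    ss_err = ss_total - ss_time - ss_subj
    return (ss_time / (k - 1)) / (ss_err / ((n - 1) * (k - 1)))

print(one_way_f(original), one_way_f(exchanged))    # same F value
print(repeated_f(original), repeated_f(exchanged))  # F is larger for the original data
```

The one-way F cannot distinguish the two tables, while the repeated-measures F drops once the values are shuffled between individuals.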

(2) Analyze relationships between variables

First of all… describe a scatter graph (Type VI)

Is it OK just seeing the correlation coefficient?
Linear? Outlier? Should be grouped?

[Figure: three scatter graphs labeled "High correlation, but not linear", "Influenced by outlier", and "Grouping is better"]

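Correlation (Type VI)

$$ S_x = \sum_i (X_i - \bar{X})^2, \qquad S_y = \sum_i (Y_i - \bar{Y})^2, \qquad S_{xy} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) $$

$$ r = \frac{S_{xy}}{\sqrt{S_x S_y}} $$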
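The correlation coefficient can be computed directly from the three sums. The sketch below is illustrative: the function name `pearson_r` and the small example data are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """r = Sxy / sqrt(Sx * Sy), with S as sums of squared deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x = sum((x - x_bar) ** 2 for x in xs)
    s_y = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_x * s_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear: r is 1 (up to rounding)
print(pearson_r([1, 2, 3, 4], [1, 4, 9, 16]))  # curved relationship, r still close to 1
```

The second example shows why the scatter graph should be checked first: r is high even though the relationship is not linear.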

Spurious Correlation (Type VI)

This is the case when there is a correlation but actually there is no cause-effect relationship between "A" and "B". The omitted third variable "C" can describe the relationship.

[You can get a high correlation between "body height" and "knowledge" in elementary school students]
A: Body Height, B: Knowledge
Do the taller students have more knowledge? ... Is it true?

[Actual structure]
C: Age → A: Body Height and B: Knowledge
Another factor, "age", influences both "body height" and "knowledge", so an apparent correlation is seen between "body height" and "knowledge".

Regression (Type VI)

1) Calculate a and b, minimizing
$$ \sum_{i=1}^{n} \left(Y_{pi} - Y_i\right)^2 = \sum_{i=1}^{n} \left(a X_{pi} + b - Y_i\right)^2 $$

Xi: measured data
Yi: measured data
Yp: predicted value
Xp: predicted data
n: sample number
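For a single independent variable, the a and b that minimize the squared differences have a closed form. The sketch below uses that standard least-squares solution; the function name `fit_line` and the example data are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares a and b for Yp = a X + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_x            # slope
    b = y_bar - a * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return a, b

a, b = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
print(a, b)                                   # slope close to 2, intercept close to 1
predicted = [a * x + b for x in [0, 1, 2, 3]] # Yp for each measured X
```

EXCEL and SPSS perform the same minimization when a regression line is fitted.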

Concept of Regression Analysis (Type VI)

Your data of Y; modeled Yp = aX+b
Calculate the difference between Y and Yp
To minimize the difference, determine a and b

Y=aX+b is applied to your data, and the values of a and b are determined.

For fitness >>> What is R²? (Type VI)

Total Variance (SST) = Variance explained by independent variables, or model variance (SSr) + Error Variance (SSE)

How much is explained by the model?
$$ R^2 = \frac{SS_r}{SS_T} = 1 - \frac{SS_E}{SS_T} $$

Xi: measured data
Yi: measured data
Yp: predicted value
n: sample number

What is adjusted R²? (Type VI)

If you have a large sample number, R² easily becomes significant.

When the number of independent variables is large, R² is overestimated.

→ Adjusted R²: the above influences are removed

$$ R^{*2} = 1 - \frac{SS_E / (n - p - 1)}{SS_T / (n - 1)} $$

n: total sample number
p: number of independent variables
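Both fitness measures follow directly from SST and SSE. The sketch below is illustrative; the function name `r_squared` and the example values are hypothetical.

```python
from statistics import mean

def r_squared(y, y_pred, p):
    """Return (R2, adjusted R2) for measured y, predicted y_pred, p independent variables."""
    n = len(y)
    y_bar = mean(y)
    sst = sum((v - y_bar) ** 2 for v in y)                 # total variance
    sse = sum((v - w) ** 2 for v, w in zip(y, y_pred))     # error variance
    r2 = 1 - sse / sst
    adjusted = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
    return r2, adjusted

# Hypothetical measured and predicted values with one independent variable
y = [1, 2, 3, 4, 5]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(r_squared(y, y_pred, p=1))   # about (0.99, 0.987)
```

Adjusted R² is always at most R², and the gap widens as p grows relative to n.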

Multiple Regression Analysis (Type VII)

Columns: Place; Nitrogen concentration in the groundwater [Y]; Fertilizer Input [X1]; Area of Farm land [X2]; Livestock numbers [X3]; Soil Types [X4]

Place   [Y]     [X1]   [X2]   [X3]   [X4]
A       13.5    50     100    300    Sand
B       22.1    120    220    100    Clay
C       4.5     30     50     200    Clay
D       16.1    80     170    400    Silt
E       28.5    150    290    150    Silt
F       40.0    200    500    300    Silt
G       69.1    350    750    220    Sand
H       5.2     10     20     60     Sand
I       115.2   600    1200   400    Sand
J       45.8    280    550    350    Clay

[Diagram: Fertilizer, Livestock, Farmland, and Soil Type → N]

Multiple Regression Analysis (Type VII)

Y = a + β1 X1 + β2 X2 + β3 X3 + ……… + βm Xm

Y: Dependent Variable
X: Independent Variable

Nitrogen in groundwater = a + β1 [Fertilizer] + β2 [Farm area] + β3 [Livestock] + β4 [Soil]

Concept of Multiple Regression Analysis (Type VII)

Your data of Y; modeled Yp = β1 X1 + β2 X2 + … + βm Xm
Calculate the difference between Y and Yp
To minimize the difference, determine β1 to βm

Y = β1 X1 + β2 X2 + … + βm Xm is applied to your data, and the values of β1 to βm are determined.

Multiple Regression Analysis (Type VII)

Y = a + β1 X1 + β2 X2 + β3 X3 + ……… + βm Xm

Y: Dependent Variable
X: Independent Variable

Nitrogen in groundwater = a + β1 [Fertilizer] + β2 [Farm area] + β3 [Livestock] + β4 [Soil]

How to handle categorical variables?

How to make Dummy Variables

Use the category number directly

Make dummy variables for all categories
D1+D2+D3=1, so D3 is automatically decided by D1 & D2.

Make dummy variables for [Number of categories]-1
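As an illustration of the [Number of categories]-1 rule, the sketch below codes the three soil types from the groundwater example with two dummy variables. The function name `make_dummies` is hypothetical, and choosing "Silt" as the reference category is an assumption made for the example.

```python
def make_dummies(values, reference):
    """One 0/1 dummy per category except the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"{c}_dummy": int(v == c) for c in categories} for v in values]

# Soil types of places A to J in the groundwater example
soil = ["Sand", "Clay", "Clay", "Silt", "Silt", "Silt", "Sand", "Sand", "Sand", "Clay"]

for place, row in zip("ABCDEFGHIJ", make_dummies(soil, reference="Silt")):
    print(place, row)
# Place A (Sand) has Sand_dummy = 1 and Clay_dummy = 0; place D (Silt) has both dummies = 0
```

The reference category is represented by all dummies being 0, so no information is lost by dropping its own dummy.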

Multicollinearity

Multicollinearity refers to the correlation among three or more independent variables. Multicollinearity creates "shared" variance between variables, thus decreasing the ability to predict the dependent measure as well as to ascertain the relative roles of each independent variable.

How to check multicollinearity

VIF (Variance Inflation Factor)
VIF > 2: There is multicollinearity

Condition Index
<10: No collinearity
15 to 30: Some collinearity
>100: High collinearity >>> Decrease the variables

Standardized / Unstandardized

Nitrogen in groundwater = a + β1 [Fertilizer] + β2 [Farm area] + β3 [Livestock] + β4 [Sand Dummy] + β5 [Clay Dummy]

What you can know from Standardized Results
β: usually between -1 and 1
Each β can be directly compared
Which independent variable has the bigger effect on the dependent variable

What you can know from Un-Standardized Results
β: various values
Each β cannot be directly compared
How much the dependent variable changes for a one-unit change in an independent variable

Principal Component Analysis (PCA)

Compile multiple variables and make an index.

Place   NH4+ [X1]   NO3- [X2]   Coliform [X3]   BOD [X4]   P [X5]
A       50.0        2.0         100             10.8       0.1
B       25.7        1.2         250             13.2       0.3
C       5.0         10.5        0               2.2        2.0
D       35.2        2.2         30              15.8       0.7
E       18.2        1.9         310             20.1       0.1
F       3.2         21.2        20              1.1        1.5
G       20.5        9.9         410             17.9       0.4
H       5.8         40.1        25              3.6        3.1
I       2.9         30.1        30              3.0        2.2
J       28.7        10.0        560             22.1       0.5

[Diagram: Observed variables (NH4, NO3, Coli, BOD, P) → Components (Human/Agricultural Activity, Organic Contamination)]

Factor Analysis (FA)

Extract latent variables which influence the observed variables.

Place   NH4+ [X1]   NO3- [X2]   Coliform [X3]   BOD [X4]   P [X5]
A       50.0        2.0         100             10.8       0.1
B       25.7        1.2         250             13.2       0.3
C       5.0         10.5        0               2.2        2.0
D       35.2        2.2         30              15.8       0.7
E       18.2        1.9         310             20.1       0.1
F       3.2         21.2        20              1.1        1.5
G       20.5        9.9         410             17.9       0.4
H       5.8         40.1        25              3.6        3.1
I       2.9         30.1        30              3.0        2.2
J       28.7        10.0        560             22.1       0.5

[Diagram: Latent variables (Agricultural Activity, Organic Contamination) → Observed variables (NH4, NO3, Coli, BOD, P)]

Different Concept between PCA and FA

Factor Analysis (FA)
[Diagram: factors f1, f2 → observed variables X1, X2, X3, X4, each with an error e1, e2, e3, e4]
X1 = a1 f1 + a2 f2 + e1
Rotation: OK

Principal Component Analysis (PCA)
[Diagram: observed variables X1, X2, X3, X4 → components Z1, Z2]
Z1 = b1 X1 + b2 X2 + b3 X3 + b4 X4 + e1
Rotation: None (in principle)

Factor Analysis: Rotation

For finding appropriate axes, rotation is conducted.

Orthogonal Rotation
No correlation between factors is assumed.
Varimax Rotation etc.

Oblique Rotation
Some correlation between factors is assumed.
Promax Rotation etc.

In actual cases, some correlation can be observed between factors, so the oblique rotation (e.g. Promax Rotation) is usually selected.

How to decide Factor Numbers?

1) Eigenvalue
Kaiser criterion: eigenvalue of 1.0 as the cutoff (you can modify)

2) Scree plot
Select the point where the slope changes

3) Cumulative variance
How much variance is explained by the factors

4) Interpretation

The above points are not rigid. Besides, you should consider the interpretation of your results when deciding the factor number.

Hierarchical vs. Non-Hierarchical Cluster Analysis

[Figure: Hierarchical / Non-Hierarchical]

Non-Hierarchical Cluster Analysis

You need to decide the number of clusters first.

(3) Structural Equation Modeling

Structural Equation Modeling & Path Analysis

SEM and Path Analysis are statistical models that seek to explain the relationships among multiple variables.

They examine the structure of relationships expressed in a series of equations, similar to a series of multiple regression equations. These equations depict all of the relationships among constructs (both independent and dependent variables) involved in the analysis.

When only the relationships among the observed variables are investigated, it is called Path Analysis. It is similar to multiple regression analysis, but it differs in that it can also involve relationships among the dependent variables and among the independent variables.

When the relationships among not only the observed variables but also the latent variables are investigated, it is called Structural Equation Modeling. This modeling is like a combination of multiple regression analysis and factor analysis.

Multiple Regression Analysis vs. Path Analysis

[Figure: Multiple Regression Analysis / Path Analysis]

Path Analysis vs. Structural Equation Modeling

[Figure: Path Analysis / Structural Equation Modeling (SEM)]

Terminology & Description

Observed variable (観測変数), also called measured variable
The variable directly measured by a survey. This variable should be shown in a box.

Latent variable (潜在変数), also called unobserved variable
The variable that cannot be measured directly. This variable should be shown in a circle.

Endogenous variable (内生変数)
The variable determined by the other variables. In other words, this is a variable pointed to by arrows from other variables.

Exogenous variable (外生変数)
The independent variable not determined by other variables. In other words, this variable is not pointed to by any arrows.

Path Analysis

Columns: Place; N2O emission [X1]; Total Organic Compound (TOC) [X2]; Nitrate Concentration [X3]; Dissolved Oxygen (DO) [X4]; Fertilizer Input [X5]; Livestock Numbers [X6]

Place   [X1]    [X2]    [X3]   [X4]   [X5]   [X6]
A       12.5    13.4    9.7    4.5    100    120
B       50.3    34.8    24.9   0.9    260    700
C       2.3     5.5     3.5    5.9    300    60
D       125.4   56.8    34.5   0.8    400    800
E       34.2    29.1    8.9    1.1    80     650
F       22.1    30.5    18.1   0.8    200    800
G       9.8     5.9     8.3    6.2    50     20
H       5.5     7.7     6.1    7.8    60     15
I       12.1    18.9    5.3    5.5    30     70
J       23.8    22.1    9.9    2.1    120    600
K       2.2     7.1     6.2    6.8    50     40
L       13.8    12.1    13.2   4.5    140    180
M       45.9    24.1    10.2   1.5    90     600
N       180.1   145.1   56.3   0.3    700    1200
O       56.2    33.6    18.9   0.6    150    700
P       12.1    8.8     7.6    6.5    30     40
Q       3.3     3.4     4.5    3.2    60     35
R       4.5     6.6     8.2    4.5    40     80
S       22.8    15.3    12.1   3.3    150    220
T       45.9    22.8    20.1   1.1    200    500

We want to explain the N2O emission using the other variables.

[Diagram: N2O, TOC, Nitrate, DO, Fertilizer, Livestock]

Path Analysis (SEM) - STEP 1
Construct a hypothetical model

[Diagram: hypothetical model connecting Fertilizer Input, Livestock Numbers, Nitrate, DO, TOC, and N2O emission]

Path Analysis (SEM) - STEP 2
Estimate the coefficients

[What does each path mean?]

[Diagram: the hypothetical model with path coefficients a1 to a7, errors e1 to e4, variances v1 to v6, and covariance c]

The SEM & Path models are defined as a set of linear equations. In this case, the following equations can be constructed.

[N2O emission] = a1*[Nitrate] + a2*[DO] + a3*[TOC] + [Error4]
[Nitrate] = a6*[Livestock] + a7*[Fertilizer] + [Error1]
[DO] = a4*[TOC] + [Error2]
[TOC] = a5*[Livestock] + [Error3]

Variance of [e1] = v3
Variance of [e2] = v4
Variance of [e3] = v5
Variance of [e4] = v6
Variance of [Fertilizer] = v1
Variance of [Livestock] = v2
Covariance between [Fertilizer] & [Livestock] = c

Where should [Error] be added?
Any variable explained by other variables (endogenous variable) should have an Error.
For example, [N2O emission] is explained by [Nitrate], [DO], and [TOC] in the model, but some other influences can exist. They are defined as Error.

Path Analysis (SEM) - STEP 2
How to estimate the coefficients?

[Figure: Variance of data / Variance explained by model]
