2016 Winter / Graduate School
Environmental System Analysis
KURISU Kiyo
Scope of this class
Learn basic statistics using EXCEL
Learn how to use SPSS (popular software for statistics)
Learn multivariable analysis, such as Factor Analysis, Principal Component Analysis, and Structural Equation Modeling, from basics to practical analysis using SPSS
→ If you already know these, you need not take this class
Style of this class
Exercise style
I explain the basics of each statistical method and show the analytical procedure using EXCEL and SPSS one by one.
Students also use PCs* and try to do the same procedure.
* I will bring 8-10 notebook PCs equipped with EXCEL, SPSS (& Amos) to this class. You can use these PCs.
→ If SPSS & Amos are installed on your PC, you can bring and use it.
Schedule
No. | Date | Contents
1 | Sept. 26 | Outline; basic analysis for comparison of means (t-test)
  | Oct. 3 | Cancelled
  | Oct. 10 | Holiday
2 | Oct. 17 | Basic analysis for comparison of means, repeated data analysis (t-test, ANOVA, repeated ANOVA)
3 | Oct. 24 | Regression Analysis & Multiple Regression Analysis
4 | Oct. 31 | Principal Component Analysis, Factor Analysis & Cluster Analysis
5 | Nov. 7 | Path analysis using Amos
6 | Nov. 14 | Structural Equation Modeling using Amos
For Credit
※ Attendance & submission of a report
Analyze data using the statistical analysis learned in this class
You can use your own data as well as public statistical data
You should use analytical methods shown during this class, but you can use any tools (EXCEL, SPSS, R, etc.)
Deadline: Dec. 16 (Fri)
Submission: [email protected] via e-mail
(0) What kind of data do you want to analyze?
Data Characteristics
Type of data | Function | Example
Nominal | Labeling | ID no.
Ordinal | Labeling + Order | A, B, C
Interval | Labeling + Order + Same interval for 1 unit | Temperature (℃: degrees Celsius)
Proportional | Labeling + Order + Same interval for 1 unit + Origin (0) has meaning | Concentration (mg/L)
First of all, you should check what scale is used for your data.
Data Characteristics
Data gained by lab experiments are usually proportional-scale data.
Condition | NO3 (mgN/L) | BOD (mgO/L) | PO4 (mgP/L)
A | 5.8 | 2.2 | 0.03
B | 10.5 | 7.8 | 0.20
C | 2.3 | 5.6 | 0.04
D | 6.6 | 3.2 | 0.09
E | 2.5 | 4.5 | 0.12
F | 20.3 | 15.0 | 0.23
G | 7.8 | 3.4 | 0.02
Data Characteristics
Data gained by questionnaire survey can involve various types of data.
Respondent | Gender (male: 1, female: 2) | Blood cholesterol (µg/L) | Body temperature (℃) | Daily exercise (1: always, 2: sometimes, 3: rarely, 4: never)
A | 1 | 25 | 36.8 | 1
B | 2 | 50 | 35.4 | 4
C | 2 | 15 | 36.3 | 2
D | 1 | 32 | 37.1 | 2
E | 2 | 45 | 36.2 | 1
F | 1 | 63 | 37.3 | 2
G | 1 | 78 | 36.4 | 3
How to handle Likert scale?
Example response options: Think so very much / Think so / Think so a bit / Not very much think so / Not think so / Never think so
Strictly: "Ordinal Scale" data
5 or more components: can be handled as "Interval Scale" data
3 or fewer components: it is better not to apply "Interval Scale"
4 components: gray zone >>> Interval scale can be applied.
If respondents can identify the difference among 5 or more components, you can use 5 or more components. But if people cannot differentiate 5 or more components, 4 components are often used.
Likert's Scale
Rensis Likert: Creator of Organizations (1 September 2010)
This article was originally written by Leslie Kish and appeared in The American Statistician in 1982.
Rensis Likert served as vice president of the American Statistical Association from 1953 to 1955 and was president of the ASA in 1959. His service to the ASA, however, hardly begins to indicate the breadth of the achievements of his long and vigorous career.
Likert was born in Cheyenne, Wyoming, in 1903. After living with his parents in several states, he entered the University of Michigan in 1922. There, he concentrated on civil engineering for a couple of years, until he took a course in sociology. Robert Angell recalls him from that class as his brightest engineering student. The respect was mutual, and Likert found that his scientific interests were more in people than in things. He transferred in his senior year from the college of engineering and took his bachelor's degree in sociology in 1926.
Likert went from Michigan to Union Theological Seminary for one year, then to the Columbia University Department of Psychology, where he took his PhD in 1932. At Columbia, Likert gradually moved from traditional fields of psychology to the new social psychology. In this, he was influenced by Gardner Murphy, who served as chair for his dissertation committee. His doctoral research dealt with a wide-ranging set of attitudes, interrelating student attitudes toward race and international affairs; it was published in 1932 as "A Technique for the Measurement of Attitudes."
Engineering, sociology, psychology, ethics, and statistics. Likert maintained throughout his life an energetic appreciation of concepts and values from all these areas. He was always curious about how things work[ed] and how to fix them when they did not. His strong feel for structures and measurements also showed in his quantitative and pragmatic approaches to social problems and social measurements. The year at Union Theological Seminary was reflected in his openness, his optimism, and both his desire and his ability to do good.
Likert did well in several areas. During the course of his doctoral research, he developed what soon became the famous "Likert scale," an early example of "sufficing" that exemplifies Likert's pragmatic approach. He showed, with empirical comparisons, that his much simpler method (asking the respondent to place himself on a scale of favor/disfavor with a neutral midpoint) gave results very similar to those of the much more cumbersome (though more theoretically elegant) Thurstone procedure (based on the psychophysical method of equal-appearing intervals). The Likert scale has been adopted throughout the world, and continuing demand for the thesis led to its re-publication in the series Classic Contributions to Social Psychology.
http://magazine.amstat.org/blog/2010/09/01/rensislikertsep10/
(1)How to analyze the difference of groups
Two groups
Comparison of two groups (Type II)
Want to compare averages from two groups

Body height of 25-year-old ladies in Japan (cm): 160, 158, 155, 165, 152, 166, 163, 158, 156, 165
Sampled data: sampled 25-year-old ladies in Japan. Number of samples: nA, Mean: x̄A, Variance: sA²
Parent population: all 25-year-old ladies in Japan. Mean: μA, Variance: σA²

Body height of 25-year-old ladies in USA (cm): 170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168
Sampled data: sampled 25-year-old ladies in USA. Number of samples: nB, Mean: x̄B, Variance: sB²
Parent population: all 25-year-old ladies in USA. Mean: μB, Variance: σB²

From sampled data..... Which is taller, Japan or USA, in the parent populations?
Which methodology can you choose? (Type II)
Want to compare averages from two groups
  Data for each group are normally distributed → Parametric test
    Equal variance between groups → Student's t-test
    Not equal variance → Welch's test
  Not normally distributed → Non-parametric test
    Wilcoxon test (W-test)
    Mann-Whitney test (U-test)
    Kolmogorov-Smirnov test
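The flowchart above can be written as a small decision function. This is only a sketch: the function name `choose_two_group_test` and its two boolean arguments are hypothetical, and it assumes normality and equal variance have already been checked by some other means.

```python
def choose_two_group_test(normal: bool, equal_variance: bool) -> str:
    """Return the test suggested by the flowchart for comparing two groups."""
    if not normal:
        # Non-parametric branch: rank-based tests
        return "Wilcoxon / Mann-Whitney test"
    if equal_variance:
        return "Student's t-test"
    return "Welch's test"


print(choose_two_group_test(normal=True, equal_variance=False))
```

The Kolmogorov-Smirnov test is also listed on the non-parametric branch; it is left out of the return value here only to keep the sketch short.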
What are Parametric & Nonparametric?
Parametric test
It assumes that the parent data follow a Normal Distribution.
The t-test & variance test assume that averages or variances are normally distributed.
Non-parametric test
Does not assume a normal distribution.
Methods for calculation are complicated, so they were not often used for a long time.
However, various methods are now available in statistical software, thanks to high-spec computers.
How to check Normal distribution?
(A) Make a histogram
(B) Check on EXCEL
1) Rank each data: RANK(data, range, 1)
2) Calculate p: (RANK - 1/2)/n
3) NORMSINV(p)
[Plot: data X against NORMSINV(p)]
(C) Check on SPSS
[Normality plots with tests]
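The EXCEL steps in (B) can be mirrored outside a spreadsheet. The sketch below assumes Python's standard `statistics.NormalDist().inv_cdf` as a stand-in for NORMSINV and reproduces RANK(data, range, 1) by counting smaller values, so ties share the lowest rank. The function name `normal_scores` is hypothetical, and the sample is the Japan height data from the earlier slide.

```python
from statistics import NormalDist


def normal_scores(data):
    """Return (value, z) pairs for a normal probability plot."""
    n = len(data)
    pairs = []
    for x in data:
        rank = 1 + sum(1 for y in data if y < x)  # like RANK(x, range, 1)
        p = (rank - 0.5) / n                      # step 2: (RANK - 1/2)/n
        z = NormalDist().inv_cdf(p)               # step 3: like NORMSINV(p)
        pairs.append((x, z))
    return pairs


heights = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
for x, z in sorted(normal_scores(heights)):
    print(x, round(z, 2))
```

If the points (x, z) fall roughly on a straight line, the data are compatible with a normal distribution.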
How to check Equal Variance?
Bartlett's test
Levene's test
These tests are usually included in statistical software.
Example (R):
bartlett.test(y$●● ~ y$▲▲)
        Bartlett test of homogeneity of variances
data:  y$●● by y$▲▲
Bartlett's K-squared = ××, df = 2, p-value = □□
Concept of t-test (Type II)
(H1: Alternative hypothesis) In my opinion, there should be a significant difference between the groups: μA≠μB
(H0: Null hypothesis) Temporarily assume no difference: μA=μB
Estimate the probability of no-difference cases (probability of μA=μB).
If the probability of no-difference cases is quite small (p<0.05, p<0.01), it means that no-difference cases rarely occur (μA=μB rarely occurs).
The null hypothesis is rejected.
You can say there is a significant difference between the groups: μA≠μB
Concept of Student's t-test
[Figure: t distribution. The central region "possibly occurs", the tails "rarely occur"; the t value calculated from your data is located on this curve.]
One-tailed test or Two-tailed test?
H0: μ=μ0, H1: μ<μ0 (one-tailed)
H0: μ=μ0, H1: μ≠μ0 (two-tailed)
H0: μ=μ0, H1: μ>μ0 (one-tailed)
If your alternative (first) hypothesis is "one average is bigger/smaller than the other", you should apply a one-tailed test.
If you just want to check whether the averages of two groups are the same or not, you can apply a two-tailed test.
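Parametric analysis - two groups comparison (Type II)
Text Page 7

Student's t-test
$t = \dfrac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s^2}{n_A} + \dfrac{s^2}{n_B}}}$, where $s^2 = \dfrac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}$
$df = n_A + n_B - 2$

Welch's test
$t = \dfrac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$
$df = \dfrac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{s_A^4}{n_A^2 (n_A - 1)} + \dfrac{s_B^4}{n_B^2 (n_B - 1)}}$

Compare your t value with the reference t value:
=TINV(a, df)
95% significance: Two tailed: TINV(0.05, df), One tailed: TINV(0.1, df)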
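The two statistics above can be computed directly from the sample data. The following is a minimal sketch using only the Python standard library; the function names `student_t` and `welch_t` are hypothetical, and the reference t value still has to be looked up separately (for example with TINV in EXCEL).

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student's t with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(s2 / na + s2 / nb)
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t with its approximate degrees of freedom; returns (t, df)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # sA^2/nA and sB^2/nB
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
usa = [170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168]
print(student_t(japan, usa))
print(welch_t(japan, usa))
```

With equal sample sizes and equal sample variances the two functions return the same t, which is a quick sanity check on the formulas.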
How to analyze the Equal-Variance (Type II)
Levene's test    Text Page 12    Excel TypeII (Levene's test)
1) H0: sA² = sB²
   If you want to use an analysis which requires equal variance, this hypothesis should not be rejected (p>0.05 is required). This is different from usual tests.
2) Calculate La
   $L_a = \dfrac{n - N_G}{N_G - 1} \cdot \dfrac{\sum_{k}^{N_G} n_k (\bar{Z}_k - \bar{Z})^2}{\sum_{k}^{N_G} \sum_{i}^{n_k} (Z_i - \bar{Z}_k)^2}$, where $Z_i = |x_i - \bar{x}_k|$
   nk: number of samples in 'Group k'
   NG: number of groups
   n: total sample number
   x̄k: average of 'Group k'
   xi: each data
   Z̄k: average of Z in 'Group k'; Z̄: average of all Z
3) Compare with F(a, NG-1, n-NG)
[SPSS's output]
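Step 2 can be sketched in a few lines. The function name `levene_statistic` is hypothetical, and the code only returns La; comparing it with F(a, NG-1, n-NG) in step 3 still needs an F table or statistical software.

```python
from statistics import mean


def levene_statistic(groups):
    """Levene's La computed from absolute deviations from each group mean."""
    # Z_i = |x_i - group mean|, kept per group
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    n = sum(len(g) for g in groups)
    ng = len(groups)
    z_all = mean([v for zg in z for v in zg])
    z_means = [mean(zg) for zg in z]
    between = sum(len(zg) * (zm - z_all) ** 2 for zg, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zg, zm in zip(z, z_means) for v in zg)
    return (n - ng) / (ng - 1) * between / within


print(levene_statistic([[1, 2, 3], [10, 20, 30]]))
```

Two groups with the same spread give La = 0, because their Z values have identical group means.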
Non-parametric analysis - two groups comparison (Type II)
Parametric test: Original data → Statistic value → Compare with reference value
Non-parametric test: Ranking data → Statistic value → Compare with reference value
Wilcoxon test (Type II)
Text Page 8    Excel TypeII(3)
1) Calculate the rank of each value (both groups ranked together)
2) Calculate the summation of ranks for each group
3) Compare with the reference value (Wilcoxon distribution curve)

Body height of 25-year-old ladies in Japan (cm), with rank Ri in parentheses:
160 (6), 158 (4), 155 (2), 165 (11), 152 (1), 166 (16), 163 (8), 158 (4), 156 (3), 165 (11)
WA = 66

Body height of 25-year-old ladies in USA (cm), with rank Ri in parentheses:
170 (21), 162 (7), 165 (11), 168 (17), 164 (10), 168 (17), 165 (11), 163 (8), 168 (17), 172 (22), 165 (11), 168 (17)
WB = 169

Reference values for N1 = 10 (lower ω / upper ω):
N2 | a = 0.05 | a = 0.025
10 | 82 / 128 | 78 / 132
11 | 86 / 134 | 81 / 139
12 | 89 / 141 | 84 / 146
13 | 92 / 148 | 88 / 152
For this data: lower ω(0.025, 10, 12) = 84, upper ω(0.025, 10, 12) = 146
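The rank sums on this slide can be reproduced with a short script. The sketch below assumes the tie handling visible in the slide's ranks, where tied values share the lowest rank (as EXCEL's RANK does); many textbooks use average ranks instead, which would change the sums. The function name `rank_sums` is hypothetical.

```python
japan = [160, 158, 155, 165, 152, 166, 163, 158, 156, 165]
usa = [170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168]


def rank_sums(a, b):
    """Rank all values together and return the rank sum of each group."""
    pooled = a + b

    def rank(x):
        # Ties share the lowest rank, as in RANK(x, range, 1)
        return 1 + sum(1 for y in pooled if y < x)

    return sum(rank(x) for x in a), sum(rank(x) for x in b)


w_a, w_b = rank_sums(japan, usa)
print(w_a, w_b)  # 66 169
```

The two sums are then compared with the lower and upper reference values ω(0.025, 10, 12) = 84 and 146.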
(1) How to analyze the difference of groups
3 or more groups
Differences among three groups (Type III)

Body height of 25-year-old ladies in Japan (cm): 160, 158, 155, 165, 152, 166, 163, 158, 156, 165
Sampled data: sampled 25-year-old ladies in Japan. Number of samples: nA, Mean: x̄A, Variance: sA²
Parent population: all 25-year-old ladies in Japan. Mean: μA, Variance: σA²

Body height of 25-year-old ladies in USA (cm): 170, 162, 165, 168, 164, 168, 165, 163, 168, 172, 165, 168
Sampled data: sampled 25-year-old ladies in USA. Number of samples: nB, Mean: x̄B, Variance: sB²
Parent population: all 25-year-old ladies in USA. Mean: μB, Variance: σB²

Body height of 25-year-old ladies in Italy (cm): 155, 162, 165, 163, 160, 158, 165, 161, 166, 156, 160
Sampled data: sampled 25-year-old ladies in Italy. Number of samples: nC, Average: x̄C, Variance: sC²
Parent population: all 25-year-old ladies in Italy. Average: μC, Variance: σC²
Why should you not repeat the t-Test? (Type III)
A vs. B: significant with 95%
A vs. C: significant with 95%
B vs. C: significant with 95%
If these three tests are repeated, the overall confidence decreases to 0.95×0.95×0.95 = 0.857.
Instead of the above, we want to analyze the differences among the three groups with 95% significance.
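The arithmetic behind this warning is short enough to check directly. The sketch below assumes the three comparisons are independent, which is the simplification the slide's multiplication implies; the variable names are hypothetical.

```python
alpha = 0.05
comparisons = 3  # A-B, A-C, B-C

overall_confidence = (1 - alpha) ** comparisons
familywise_error = 1 - overall_confidence

print(round(overall_confidence, 3), round(familywise_error, 3))  # 0.857 0.143
```

With more groups the number of pairwise comparisons grows quickly, so the overall confidence drops further.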
Which method can we choose? (Type III)
Text Page 13
Compare among three or more groups
  Normally distributed → Parametric test (equal variance / variance is not equal)
    One-way ANOVA
  Not distributed normally → Non-parametric test
    Kruskal-Wallis one-way analysis of variance (H-test)
Analysis of variance is considered to be a very robust analytical method, and the results are not influenced much by the requirements of normal distribution and equal variance. So, ANOVA is sometimes conducted without checking the prerequisites.
Concept of Variance Analysis (F-test) (Type III)
Total variance (SST) = Variance between groups (SSG) + Variance within groups (SSE)
F value = (SSG / dfG) / (SSE / dfE)
If SSG is much bigger than SSE, it means there is a difference between groups.
If SSE is much bigger than SSG, it means there is no significant difference between groups.
[Figure: height data for Japan, Italy, USA]
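The F value can be computed by hand from the two sums of squares. This is a minimal sketch; the function name `one_way_f` is hypothetical, and the degrees of freedom follow the usual definitions dfG = (number of groups - 1) and dfE = (total samples - number of groups), which are an assumption here because the slide does not spell them out.

```python
from statistics import mean


def one_way_f(groups):
    """Return the one-way ANOVA F value: (SSG / dfG) / (SSE / dfE)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_g = len(groups) - 1
    df_e = len(all_values) - len(groups)
    return (ssg / df_g) / (sse / df_e)


print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

The resulting F is then compared with the reference F value for (dfG, dfE) from a table or from statistical software.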
Concept of F-test (Type III)
(H1: Alternative hypothesis) In my opinion, there should be a significant difference between the groups: μA≠μB or μB≠μC or μC≠μA
(H0: Null hypothesis) Temporarily assume no difference: μA=μB=μC
Estimate the probability of no-difference cases (probability of μA=μB=μC).
If the probability of no-difference cases is quite small (p<0.05, p<0.01), it means that no-difference cases rarely occur (μA=μB=μC rarely occurs).
The null hypothesis is rejected.
You can say there is a significant difference between the groups: μA≠μB or μB≠μC or μC≠μA
But just by the F-test, you cannot know where the significant difference is. >>> A post hoc test is needed.
Type III    Text Page 13    Excel TypeIII
nk: number of samples in 'Group k'
NG: number of groups
n: total sample number
x̄k: average of 'Group k'
x̄: average of all data
What is Levene's Test?
ANOVA: we see "X". Ho: µA = µB
Levene's test: we see "Z" (Z = difference from group average). Ho: sA² = sB²
[Figure: "X" and "Z" plotted for Japan, Italy, USA]
Kruskal-Wallis analysis
Text Page 14-15    Excel TypeIII(KW)
1) Rank all data: $R_{ki}$ (rank of data $X_i$ in group k)
2) Group average of the ranks: $\bar{R}_k = \sum_i R_{ki} / n_k$
3) Total average of ranks: $\bar{R} = \sum R_i / n$
4) Calculate KW value: $KW = \dfrac{12}{n(n+1)} \sum_k n_k (\bar{R}_k - \bar{R})^2$
5) Calculate the χ²(g-1; a) value. KW > χ²(g-1; a) >>>> Ho is rejected
The value of KW approximately follows a χ² distribution. g: number of groups
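Steps 1 to 4 can be sketched as follows. The function name `kruskal_wallis` is hypothetical, and the handling of ties (average ranks, no tie correction) is an assumption, since the slide does not say how ties are ranked.

```python
def kruskal_wallis(groups):
    """Return the KW value: 12 / (n(n+1)) * sum of n_k * (Rk_bar - R_bar)^2."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    def rank(x):
        lower = sum(1 for y in pooled if y < x)
        equal = pooled.count(x)
        return lower + (equal + 1) / 2  # average rank for tied values

    r_bar = (n + 1) / 2  # total average of ranks
    total = 0.0
    for g in groups:
        rk_bar = sum(rank(x) for x in g) / len(g)  # group average of ranks
        total += len(g) * (rk_bar - r_bar) ** 2
    return 12 / (n * (n + 1)) * total


print(round(kruskal_wallis([[1, 2], [3, 4], [5, 6]]), 4))  # 4.5714
```

The KW value is then compared with the χ² reference value for g-1 degrees of freedom (step 5), which needs a table or statistical software.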
Post Hoc Test (Multiple comparison)
Where does the difference exist?
Multiple groups comparison    Text Page 14-15
[Flowchart: the method is chosen by (1) different or equal sample numbers, (2) equal or not equal variance, and (3) dfE≥75 or dfE<75. Methods shown: Tukey-Kramer, Hochberg GT2, Gabriel, Tukey HSD, REGWQ, Games-Howell, Dunnett's C, Dunnett's T3]
(1)How to analyze the difference of groups
Paired Data
How to handle Paired data? (Type IV)
Text Page 17
Paired samples or paired comparisons occur when a single individual is tested twice (e.g. before and after) or the sampling station is retested.

Person | Body weight before Apple Diet (kg) | Body weight after Apple Diet (kg)
A | 60.8 | 59.2
B | 55.7 | 54.7
C | 50.4 | 50.2
D | 48.2 | 47.2
E | 75.2 | 65.6
F | 56.4 | 56.3
G | 47.3 | 47.0
H | 65.1 | 60.0
I | 80.3 | 75.3
J | 52.4 | 52.3

Sampled data, before Apple Diet: number of samples n, average x̄A, variance sA²
Parent population, before Apple Diet: average μA, variance σA²
Sampled data, after Apple Diet: number of samples n, average x̄B, variance sB²
Parent population, after Apple Diet: average μB, variance σB²

Calculate the difference between A (before) and B (after), and then you can construct the null hypothesis that the difference is 0.
Null Hypothesis H0: μA-μB=0
Alternate Hypothesis H1: μA-μB≠0
Why must you not use the basic t-Test?
Body weight before and after Apple Diet (kg): <Original data> and <If you exchange some of the data…>

Person | Before | After (original) | After (exchanged)
A | 60.8 | 59.2 | 56.3
B | 55.7 | 54.7 | 50.2
C | 50.4 | 50.2 | 52.3
D | 48.2 | 47.2 | 47.2
E | 75.2 | 65.6 | 60.0
F | 56.4 | 56.3 | 59.2
G | 47.3 | 47.0 | 47.0
H | 65.1 | 60.0 | 65.6
I | 80.3 | 75.3 | 75.3
J | 52.4 | 52.3 | 54.7

Original data: you can find the trend of decrease. Exchanged data: you cannot find any trend….
If you apply the basic t-Test, the above two data sets give the same results…..
→ Repeated t-Test
Basic procedure of paired data (Type IV)
Excel Type IV

Body weight before diet | Body weight after diet | Change of body weight yi
60.8 | 59.2 | 1.6
55.7 | 54.7 | 1
50.4 | 50.2 | 0.2
48.2 | 47.2 | 1
75.2 | 65.6 | 9.6
56.4 | 56.3 | 0.1
47.3 | 47 | 0.3
65.1 | 60 | 5.1
80.3 | 75.3 | 5
52.4 | 52.3 | 0.1

Hypothesis H1: μBefore > μAfter, i.e. μBefore - μAfter > 0
Null Hypothesis H0: μBefore - μAfter = 0
→ t-test (one tailed)

If you cannot assume which is larger and just assume a difference, you can apply a two-tailed test.
Hypothesis H1: μBefore ≠ μAfter
Null Hypothesis H0: μBefore = μAfter
→ t-test (two tailed)
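The paired procedure reduces to a one-sample t-test on the changes yi. The sketch below uses the standard paired statistic t = ȳ / (s_y / √n) with df = n - 1; this formula is the usual textbook form and is an assumption here, since the slide only names the test. Variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

before = [60.8, 55.7, 50.4, 48.2, 75.2, 56.4, 47.3, 65.1, 80.3, 52.4]
after = [59.2, 54.7, 50.2, 47.2, 65.6, 56.3, 47.0, 60.0, 75.3, 52.3]

# Change of body weight for each person (before minus after)
diffs = [b - a for b, a in zip(before, after)]
mean_d = mean(diffs)
t_value = mean_d / (stdev(diffs) / sqrt(len(diffs)))
df = len(diffs) - 1

print(round(mean_d, 2), df)  # 2.4 9
```

The computed `t_value` is compared with TINV(0.1, df) for the one-tailed test or TINV(0.05, df) for the two-tailed test at the 95% level.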
For Paired Data: Repeated ANOVA (Type V)
Length of root after toxic chemical is dosed

Sample | After 1 day | After 1 month | After 2 months
A | 15.2 | 13.8 | 12.6
B | 22.6 | 15.6 | 13.5
C | 17.8 | 15.2 | 14.8
D | 20.5 | 18.2 | 16.8
E | 18.6 | 16.5 | 14.6
F | 14.2 | 14.1 | 14.0
G | 19.5 | 17.2 | 16.5
H | 17.6 | 14.8 | 11.2
I | 21.8 | 19.5 | 16.2
J | 22.4 | 19.6 | 17.2

When you want to check the temporal change among 3 or more time points, you should use repeated ANOVA.
We want to know the effect of the chemical >> If the length of root decreases over time, you can say there is a significant effect of the chemical.
[Figure: 1 day, 1 month, and 2 months after chemical application]
Why must you not use basic ANOVA?
Length of root after toxic chemical is dosed

<Original data>
Sample | After 1 day | After 1 month | After 2 months
A | 15.2 | 13.8 | 12.6
B | 22.6 | 15.6 | 13.5
C | 17.8 | 15.2 | 14.8
D | 20.5 | 18.2 | 16.8
E | 18.6 | 16.5 | 14.6
F | 14.2 | 14.1 | 14.0
G | 19.5 | 17.2 | 16.5
H | 17.6 | 14.8 | 11.2
I | 21.8 | 19.5 | 16.2
J | 22.4 | 19.6 | 17.2
You can find the trend of decrease.

<If you exchange some of the data…>
Sample | After 1 day | After 1 month | After 2 months
A | 15.2 | 18.2 | 12.6
B | 22.6 | 15.6 | 13.5
C | 17.8 | 15.2 | 16.2
D | 20.5 | 13.8 | 16.8
E | 18.6 | 19.5 | 14.6
F | 14.2 | 14.1 | 14.0
G | 19.5 | 17.2 | 16.5
H | 17.6 | 19.6 | 11.2
I | 21.8 | 16.5 | 17.2
J | 22.4 | 14.8 | 14.8
You cannot find any trend….

If you apply basic One-way ANOVA, the above two data sets give the same results…..
→ Repeated ANOVA
(2)Analyze relationships between variables
43
First of all… (Type VI)
Draw a scatter plot
Is it OK just seeing the correlation coefficient? Linear?? Outlier?? Should be grouped??
[Figure panels: "High correlation, but not linear"; "Influenced by outlier"; "Grouping is better"]
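Correlation (Type VI)
$S_x = \sum_i (X_i - \bar{X})^2$
$S_y = \sum_i (Y_i - \bar{Y})^2$
$S_{xy} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y})$
$r = \dfrac{S_{xy}}{\sqrt{S_x S_y}}$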
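The correlation coefficient follows directly from the three sums. A minimal sketch, with a hypothetical function name `pearson_r` and made-up example data:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sx * Sy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sum((x - x_bar) ** 2 for x in xs)
    s_y = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_x * s_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 5, 9]))
```

As the scatter-plot slide warns, r only measures linear association, so it should be read together with the plot.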
Spurious Correlation (Type VI)
This is the case when there is correlation but actually there is no cause-effect relationship between "A" and "B". The omitted third variable "C" can explain the relationship.
[Example] You can get a high correlation between "body height" (A) and "knowledge" (B) in elementary school students.
The taller students have more knowledge? ・・・Is it true?
[Actual structure] Another factor, "age" (C), influences both "body height" and "knowledge", so an apparent correlation is seen between "body height" and "knowledge".
Regression (Type VI)
1) Calculate a and b, minimizing
$\sum_i^n (Y_{pi} - Y_i)^2 = \sum_i^n (a X_i + b - Y_i)^2$
Xi: measured data
Yi: measured data
Yp: predicted value
n: sample number
Concept of Regression Analysis (Type VI)
Your data of Y; modeled Yp = aX+b
Calculate the difference between Y and Yp, and minimize the difference to determine a and b.
Y=aX+b is applied to your data, and the values of a and b are determined.
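For a single independent variable, the a and b that minimize the squared difference have a closed form. The sketch below uses the standard least-squares solution a = Sxy / Sx and b = Ȳ - a·X̄; the function name `fit_line` and the example data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for Y = aX + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_x = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_x
    b = y_bar - a * x_bar
    return a, b


a, b = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
predicted = [a * x + b for x in [0, 1, 2, 3]]
print(round(a, 3), round(b, 3))
```

The list `predicted` holds the Yp values that are compared with the measured Y when judging the fit.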
For fitness >>> What is R²? (Type VI)
Total variance (SST) = Variance explained by independent variables, i.e. model variance (SSr) + Error variance (SSE)
$R^2 = \dfrac{SS_r}{SS_T} = 1 - \dfrac{SS_E}{SS_T}$
How much is explained by the model?
Xi: measured data
Yi: measured data
Yp: predicted value
n: sample number
What is adjusted R²? (Type VI)
If you have a large sample number, R² easily becomes significant.
When the number of independent variables is large, R² is overestimated.
→ Adjusted R²: the above influences are removed
$R^{*2} = 1 - \dfrac{SS_E / (n - p - 1)}{SS_T / (n - 1)}$
n: total sample number
p: number of independent variables
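Both fit measures can be computed from the measured and predicted values. A minimal sketch, with hypothetical function names and made-up data:

```python
def r_squared(y, y_pred):
    """R^2 = 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sst = sum((v - y_bar) ** 2 for v in y)
    sse = sum((v - p) ** 2 for v, p in zip(y, y_pred))
    return 1 - sse / sst


def adjusted_r_squared(y, y_pred, p):
    """Adjusted R^2 = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((v - y_bar) ** 2 for v in y)
    sse = sum((v - q) ** 2 for v, q in zip(y, y_pred))
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))


y = [1, 2, 3, 4, 5]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(r_squared(y, y_pred), 4), round(adjusted_r_squared(y, y_pred, p=1), 4))
```

The adjusted value is never larger than R² when p ≥ 1, which is the penalty for adding independent variables.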
Multiple Regression Analysis (Type VII)

Place | Nitrogen concentration in the groundwater [Y] | Fertilizer input [X1] | Area of farm land [X2] | Livestock numbers [X3] | Soil types [X4]
A | 13.5 | 50 | 100 | 300 | Sand
B | 22.1 | 120 | 220 | 100 | Clay
C | 4.5 | 30 | 50 | 200 | Clay
D | 16.1 | 80 | 170 | 400 | Silt
E | 28.5 | 150 | 290 | 150 | Silt
F | 40.0 | 200 | 500 | 300 | Silt
G | 69.1 | 350 | 750 | 220 | Sand
H | 5.2 | 10 | 20 | 60 | Sand
I | 115.2 | 600 | 1200 | 400 | Sand
J | 45.8 | 280 | 550 | 350 | Clay

[Diagram: Fertilizer, Farmland, Livestock, and Soil Type → N]
Multiple Regression Analysis (Type VII)
Y = a + β1 X1 + β2 X2 + β3 X3 + ……… + βm Xm
Y: Dependent Variable
X: Independent Variable
Nitrogen in groundwater = a + β1 [Fertilizer] + β2 [Farm area] + β3 [Livestock] + β4 [Soil]
Concept of Multiple Regression Analysis (Type VII)
Your data of Y; modeled Yp = β1 X1 + β2 X2 + … + βm Xm
Calculate the difference between Y and Yp, and minimize the difference to determine β1 to βm.
Y = β1 X1 + β2 X2 + … + βm Xm is applied to your data, and the values of β1 to βm are determined.
How to handle categorical variables?
How to make Dummy Variables
Use the category number directly
Make dummy variables for all categories: D1+D2+D3=1, so D3 is automatically decided by D1 & D2.
→ Make dummy variables for [Number of categories]-1
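The [Number of categories]-1 rule can be illustrated with the soil types from the multiple regression table. In this sketch the function name `make_dummies`, the `D_` prefix, and the choice of "Silt" as the reference category are all assumptions made for illustration.

```python
def make_dummies(values, reference):
    """Create one 0/1 dummy per category, except the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"D_{c}": int(v == c) for c in categories} for v in values]


soil = ["Sand", "Clay", "Clay", "Silt", "Silt", "Silt", "Sand", "Sand", "Sand", "Clay"]
rows = make_dummies(soil, reference="Silt")
print(rows[0])  # {'D_Clay': 0, 'D_Sand': 1}
print(rows[3])  # {'D_Clay': 0, 'D_Sand': 0}
```

A row with all dummies equal to 0 represents the reference category, so no information is lost by dropping its column.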
Multicollinearity
Multicollinearity refers to the correlation among three or more independent variables. Multicollinearity creates "shared" variance between variables, thus decreasing the ability to predict the dependent measure as well as ascertain the relative roles of each independent variable.
How to check multicollinearity
VIF (Variance Inflation Factor): VIF > 2 → There is multicollinearity
Condition Index: <10 No collinearity; 15 to 30 Some collinearity; >100 High collinearity
→ Decrease the variables
Standardized / Unstandardized
Nitrogen in groundwater = a + β1 [Fertilizer] + β2 [Farm area] + β3 [Livestock] + β4 [Sand dummy] + β5 [Clay dummy]

What you can know from Standardized Results
β: usually between -1 and 1
Each β can be directly compared
Which independent variable has the bigger effect on the dependent variable

What you can know from Unstandardized Results
β: various values
Each β cannot be directly compared
How much the dependent variable changes for a one-unit change in an independent variable
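The multicollinearity check and the standardized coefficient can both be illustrated with the groundwater data from the multiple regression table. This is only a sketch of the simplest cases: with two predictors, VIF reduces to 1 / (1 - r²), and with one predictor the standardized coefficient equals the unstandardized slope times sd(X) / sd(Y). The general multi-predictor versions need a full regression, which statistical software provides. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_x = sum((x - x_bar) ** 2 for x in xs)
    s_y = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_x * s_y)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case: 1 / (1 - r^2)."""
    return 1 / (1 - pearson_r(x1, x2) ** 2)


def standardized_slope(xs, ys):
    """One-predictor case: unstandardized slope times sd(x) / sd(y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope * stdev(xs) / stdev(ys)


fertilizer = [50, 120, 30, 80, 150, 200, 350, 10, 600, 280]
farmland = [100, 220, 50, 170, 290, 500, 750, 20, 1200, 550]
nitrogen = [13.5, 22.1, 4.5, 16.1, 28.5, 40.0, 69.1, 5.2, 115.2, 45.8]

print(vif_two_predictors(fertilizer, farmland))
print(standardized_slope(fertilizer, nitrogen))
```

Fertilizer input and farmland area are almost proportional in this table, so their VIF is far above the threshold on the slide, which suggests keeping only one of them.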
Principal Component Analysis (PCA)
Compile multiple variables and make an index.

Place | NH4+ [X1] | NO3- [X2] | Coliform [X3] | BOD [X4] | P [X5]
A | 50.0 | 2.0 | 100 | 10.8 | 0.1
B | 25.7 | 1.2 | 250 | 13.2 | 0.3
C | 5.0 | 10.5 | 0 | 2.2 | 2.0
D | 35.2 | 2.2 | 30 | 15.8 | 0.7
E | 18.2 | 1.9 | 310 | 20.1 | 0.1
F | 3.2 | 21.2 | 20 | 1.1 | 1.5
G | 20.5 | 9.9 | 410 | 17.9 | 0.4
H | 5.8 | 40.1 | 25 | 3.6 | 3.1
I | 2.9 | 30.1 | 30 | 3.0 | 2.2
J | 28.7 | 10.0 | 560 | 22.1 | 0.5

[Diagram: observed variables (NH4, NO3, Coli, BOD, P) → components (Human/Agricultural Activity, Organic Contamination)]
Factor Analysis (FA)
Extract latent variables which influence the observed variables.
Data: the same water-quality table as on the PCA slide (Place A to J; NH4+ [X1], NO3- [X2], Coliform [X3], BOD [X4], P [X5]).
[Diagram: latent variables (Agricultural Activity, Organic Contamination) → observed variables (NH4, NO3, Coli, BOD, P)]
Different Concept between PCA and FA
Factor Analysis (FA)
[Diagram: factors f1, f2 → observed variables X1, X2, X3, X4, each with an error e1, e2, e3, e4]
X1 = a1 f1 + a2 f2 + e1
Rotation: OK

Principal Component Analysis (PCA)
[Diagram: observed variables X1, X2, X3, X4 → components Z1, Z2]
Z1 = b1 X1 + b2 X2 + b3 X3 + b4 X4 + e1
Rotation: None (in principle)
Factor Analysis: Rotation
For finding appropriate axes, rotation is conducted.
Orthogonal Rotation: no correlation between factors is assumed. Varimax Rotation etc.
Oblique Rotation: some correlation between factors is assumed. Promax Rotation etc.
In actual cases, some correlation can be observed between factors, so the oblique rotation (e.g. Promax Rotation) is usually selected.
How to decide Factor Numbers?
1) Eigenvalue
   Kaiser criterion: threshold = 1.0 (you can modify)
2) Scree plot
   Select the point where the slope changes
3) Cumulative variance
   How much variance is explained by the factors
4) Interpretation
The above points are not rigid. Besides, you should consider the interpretation of your results and decide the factor number.
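Criteria 1) and 3) are simple to apply once the eigenvalues are known. The sketch below uses a made-up list of eigenvalues purely for illustration; in practice they come from the SPSS output, and the variable names here are hypothetical.

```python
eigenvalues = [2.8, 1.3, 0.5, 0.3, 0.1]  # hypothetical values, not from the slides
threshold = 1.0                          # Kaiser criterion (can be modified)

# 1) Count factors whose eigenvalue reaches the threshold
n_factors = sum(1 for ev in eigenvalues if ev >= threshold)

# 3) Cumulative proportion of variance explained
total = sum(eigenvalues)
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total)

print(n_factors, [round(c, 2) for c in cumulative])
```

Here two factors pass the threshold and together explain 82% of the variance; whether that is enough still depends on the scree plot and on interpretation.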
Hierarchical vs. Non-Hierarchical Cluster Analysis
[Figures: Hierarchical; Non-Hierarchical]
Non-Hierarchical Cluster Analysis
You need to decide the number of clusters first.
(3)Structural Equation Modeling
Structural Equation Modeling & Path Analysis
SEM and Path Analysis are statistical models that seek to explain the relationships among multiple variables.
They examine the structure of relationships expressed in a series of equations, similar to a series of multiple regression equations. These equations depict all of the relationships among constructs (both independent and dependent variables) involved in the analysis.
When only the relationships among the observed variables are investigated, it is called Path Analysis. It is similar to multiple regression analysis, but fundamentally different because it can involve the relationships among the dependent and independent variables.
When the relationships among not only the observed variables but also the latent variables are investigated, it is called Structural Equation Modeling. This modeling is like a combination of multiple regression analysis and factor analysis.
Multiple Regression Analysis vs. Path Analysis
[Figures: Multiple Regression Analysis; Path Analysis]
Path Analysis vs. Structural Equation Modeling
[Figures: Path Analysis; Structural Equation Modeling (SEM)]
Terminology & Description
Observed variable (Measured variable)
The variable directly measured by survey. This variable should be shown in a box.
Latent variable (Unobserved variable)
The variable that cannot be measured directly. This variable should be shown in a circle.
Endogenous variable
The variable determined by the other variables. In other words, this is the variable pointed to by arrows from other variables.
Exogenous variable
The independent variable not determined by other variables. In other words, this variable is not pointed to by any arrows.
Path Analysis
We want to explain the N2O emission using the other variables.

Place | N2O emission [X1] | Total Organic Compound (TOC) [X2] | Nitrate Concentration [X3] | Dissolved Oxygen (DO) [X4] | Fertilizer Input [X5] | Livestock Numbers [X6]
A | 12.5 | 13.4 | 9.7 | 4.5 | 100 | 120
B | 50.3 | 34.8 | 24.9 | 0.9 | 260 | 700
C | 2.3 | 5.5 | 3.5 | 5.9 | 300 | 60
D | 125.4 | 56.8 | 34.5 | 0.8 | 400 | 800
E | 34.2 | 29.1 | 8.9 | 1.1 | 80 | 650
F | 22.1 | 30.5 | 18.1 | 0.8 | 200 | 800
G | 9.8 | 5.9 | 8.3 | 6.2 | 50 | 20
H | 5.5 | 7.7 | 6.1 | 7.8 | 60 | 15
I | 12.1 | 18.9 | 5.3 | 5.5 | 30 | 70
J | 23.8 | 22.1 | 9.9 | 2.1 | 120 | 600
K | 2.2 | 7.1 | 6.2 | 6.8 | 50 | 40
L | 13.8 | 12.1 | 13.2 | 4.5 | 140 | 180
M | 45.9 | 24.1 | 10.2 | 1.5 | 90 | 600
N | 180.1 | 145.1 | 56.3 | 0.3 | 700 | 1200
O | 56.2 | 33.6 | 18.9 | 0.6 | 150 | 700
P | 12.1 | 8.8 | 7.6 | 6.5 | 30 | 40
Q | 3.3 | 3.4 | 4.5 | 3.2 | 60 | 35
R | 4.5 | 6.6 | 8.2 | 4.5 | 40 | 80
S | 22.8 | 15.3 | 12.1 | 3.3 | 150 | 220
T | 45.9 | 22.8 | 20.1 | 1.1 | 200 | 500

[Diagram variables: N2O, TOC, Nitrate, DO, Fertilizer, Livestock]
Path Analysis (SEM) - STEP 1
Construct a hypothetical model
[Diagram variables: Fertilizer Input, Livestock Numbers, Nitrate, DO, TOC, N2O emission]
Path Analysis (SEM) - STEP 2
Estimate the coefficients
[What does each path mean?]
The SEM & Path models are defined as a set of linear equations. In this case, the following equations can be constructed.

[N2O emission] = a1*[Nitrate] + a2*[DO] + a3*[TOC] + [Error4]
[Nitrate] = a6*[Livestock] + a7*[Fertilizer] + [Error1]
[DO] = a4*[TOC] + [Error2]
[TOC] = a5*[Livestock] + [Error3]

Variance of [e1] = v3
Variance of [e2] = v4
Variance of [e3] = v5
Variance of [e4] = v6
Variance of [Fertilizer] = v1
Variance of [Livestock] = v2
Covariance between [Fertilizer] & [Livestock] = c

Where should [Error] be added?
Any variable explained by other variables (endogenous variables) should have an Error. For example, [N2O emission] is explained by [Nitrate], [DO], and [TOC] in the model, but some other influences can exist. They are defined as Error.
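The structure of these equations can be written down as a small data structure, which makes it easy to see which variables are endogenous (and therefore need an Error term) and how many path coefficients must be estimated. This sketch only describes the model; it does not estimate a1 to a7, which is what Amos does. The dictionary layout and variable names are hypothetical.

```python
# Each endogenous variable maps to the variables whose arrows point to it
model = {
    "N2O emission": ["Nitrate", "DO", "TOC"],
    "Nitrate": ["Livestock", "Fertilizer"],
    "DO": ["TOC"],
    "TOC": ["Livestock"],
}

endogenous = set(model)  # each of these needs an Error term
exogenous = {p for preds in model.values() for p in preds} - endogenous
n_paths = sum(len(preds) for preds in model.values())  # coefficients a1..a7

print(sorted(endogenous))
print(sorted(exogenous))  # ['Fertilizer', 'Livestock']
print(n_paths)            # 7
```

The four endogenous variables match the four Error terms e1 to e4, and the two exogenous variables are the ones given a variance (v1, v2) and a covariance (c) on the slide.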
Path Analysis (SEM) - STEP 2
How to estimate the coefficients?
[Figure: variance of data compared with variance explained by the model]