Introduction to SPSS – Part 1

Vignes Gopal Krishna
Fast-track PhD student, SLAI fellow, and Research Assistant
University of Malaya

SPSS

• Statistical Package/Product for Social Sciences (Economics, Sociology, Population Studies, etc.) – Subjects: People/Society

• Statistical Package/Product for Sciences (SPS) (Health Sciences, Neurosciences, Medical Sciences, Economics, Sociology, etc.) – Subjects: People/Society/Patients/Animals/Neurons

• SPSS – Rows X Columns X Cells (RCC)
Rows – Subjects, Columns – Variables, Cells – Values/Statements
SPSS = Main Inputs (Data View and Variable View) X Outputs (Results)
Additional inputs (Scripts & Syntax)

Advantages
• Deals with the process of quantifying qualitative data
• Numerical presentation of qualitative data (Descriptive and Inferential Statistics)
• Deals with both parametric and non-parametric approaches
• Deals with Cross Sectional Data, Time Series Data, and Panel Data

SPSS Layout

[Figure: SPSS window with labels for Rows, Cells, Columns, Icons, and Menus]

SPSS – Multi-dimensional Matrix

Will you be able to find the number of rows and columns?

Data View

Variable View

Disadvantages
• Doesn’t deal with advanced modes of modeling and quantitative techniques (not possible through the menus)
• Doesn’t deal with advanced data types (not possible through the menus)

Common measurement
(a) Categorical variable (CAV) – Nominal & Ordinal
(b) Continuous variable (COV) – Scale (Ratio & Interval)
(c) String – Qualitative statements (not important in SPSS); handled instead by NVivo, QDA Miner, Dedoose, ATLAS.ti, etc.

Classification variable = a partial element of the categorical variable.

Classification variable – a variable used to classify qualitative arguments/statements: variable by categories (categorical variable) + variable by statements (non-categorical variable)

Categorical variable
(a) Dichotomous variable (Binomial) – 2 values – NO/OR – Independent & Dependent samples
(b) Polychotomous variable (Multinomial) – more than 2 values – NO/OR – Independent & Dependent samples

Categorical variable
(a) Constant and fixed
(b) Separated by categories
(c) Gradual change = 0, static
(d) Nominal (no order) and Ordinal (order/rank)

Continuous variables
(a) Not constant or fixed
(b) Separated by ratios and intervals
(c) Gradual change != 0, dynamic

Types of Variables
(a) Binary variable (bi + nary) = 2 groups coded 0 and 1. Examples: Gender (0=Male, 1=Female), Case and Control (0=Healthy, 1=Disease), Fluctuations (0=Increase, 1=Decrease).
(b) Dichotomous variable = 2 groups (can be any 2 values). Examples: Gender (2=Male, 3=Female), Case and Control (0=Before Treatment, 1=Present Treatment).
(c) Independent variable = stand-alone variable – Cor(x1, x2, x3) = 0 – Predictor/Regressor/Indicator
(d) Dependent variable = relies on factors – Cor(y, x1, x2) != 0 – Predictand/Regressand/Outcome
(e) Confounding variable = distorts the effect of one variable on another. Expanding the matching reduces the effects of confounding.
(f) Control variable – controls the effects of the IV on the DV
(g) Controlled variable – another term for the dependent variable
(h) Instrumental variable – a variable that has zero correlation with the residuals/error terms but is correlated with the dependent variable
(i) Criterion variable – a variable that has a presumed effect – non-experimental research
(j) Discrete variable – a variable that takes distinct values
(k) Dummy variable – similar to a binary variable – classification variable
(l) Endogenous variable – inside the system – influenced by variables entering the system
(m) Exogenous variable – outside the system – enters the system – influences the endogenous variable
(n) Interval variable – a form of scale variable
(o) Ratio variable – a form of scale variable
(p) Intervening variable – intervenes in the association between the main variables – moderating and mediating variables
(q) Mediating variable – indirect effect on the association between the main variables
(r) Moderating variable – indirect effect through interaction effects between related variables
(s) Polychotomous variable – takes more than 2 values/groups
(t) Manifest variable – indicator variable that can indicate the presence of a latent variable
(u) Latent variable – a variable that cannot be measured directly – it depends on manifest variables
(v) Manipulated variable – similar to IV
(w) Outcome variable – similar to DV – presumed effect
(x) Predictor variable – similar to IV – presumed cause
(y) Nominal variable – takes any value – doesn’t follow orders/ranks
(z) Ordinal variable – takes values based on orders/ranks
* Treatment variable – similar to IV

Types of Quantitative Data

(a) Time Series Data – data follow a series over time – a single country/industry/activity/firm/organization/stock market/society, etc. – multiple sampling periods

(b) Cross Sectional Data – data cut across various subjects (countries/industries/activities/firms) – single point in time

(c) Panel Data – Time Series Data + Cross Sectional Data – with different characteristics

(d) Pooled Data – Combined version of data – with similar characteristics

(e) Longitudinal Data – Wider scope of data – variation of timing

Types of Qualitative Data

(a) Factual Data – Demographic data (Marital Status, Level of Education, Age, Position, etc.) – (Experimental and Non-experimental Data) – Yes/No versus Yes/No/Don’t know, True or False

(b) Positive and Normative Data – Actual versus predicted, Agreement to Disagreement, Likes to Dislikes

(c) Logical Arguments – True or False

(d) Boolean Statements – AND, OR, NOT

Which one is more preferable?

Likert Scale(LS) and Scale(S)

LS != S
For example: 5 levels of Likert Scale
1 = Strongly Agree
2 = Agree
3 = Neither Agree nor Disagree
4 = Disagree
5 = Strongly Disagree

In a normal case, Scale refers to ratio or interval?

Sample and Population

The association between Sample and Population can be seen in the context of a donut.

Which one is good? “RVRCNB” Approach

Parameter and Statistics

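Parameter = Population (Actual)
Statistics = Sample (Prediction)

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ (Parameter)

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\varepsilon}$ (Statistics)

Statistics ~ Parameter (the actual population is unknown, so the population is estimated)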
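The idea that sample statistics approximate unknown population parameters can be sketched in Python. This is a minimal illustration only, not an SPSS procedure: it assumes a simulated population with one predictor instead of the two in the equation, and the "true" values 1.0 and 2.0, the helper name ols_simple, and the seed are all hypothetical choices made for the example.

```python
import random


def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical population parameters (normally unknown to the researcher).
TRUE_B0, TRUE_B1 = 1.0, 2.0

rng = random.Random(42)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 10) for _ in range(200)]
y = [TRUE_B0 + TRUE_B1 * xi + rng.gauss(0, 1) for xi in x]

# Sample statistics: estimates of the parameters from 200 observations.
b0_hat, b1_hat = ols_simple(x, y)
print(f"estimated intercept = {b0_hat:.3f}, estimated slope = {b1_hat:.3f}")
```

The estimates should land near the assumed parameters without matching them exactly, which is the sense in which Statistics ~ Parameter.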

Descriptive and Inferential Statistics

* For quantitative mode of single/multi-purposes
* Descriptive = Describe + Narrative (describing subjects) – Single Purpose (SP)
* Inferential = Investigation + Narrative (investigating subjects) – Multi Purposes (MP)

Descriptive Analysis – Quantitative research
(a) Descriptive Statistics (Continuous variables) – [Mean, Median, Variance, Standard deviation, Max, Min, Range, Skewness, Kurtosis, Standard error of mean, Histogram with normal curve, Normal Q-Q plot, Normal P-P plot] – Uni-variate
(b) Frequency Distribution (Categorical variables) – [Mode (similar to frequency), Median, Variance and Standard Deviation, Max, Min, Range] – Uni-variate

Inferential Analysis – Quantitative research
(c) Normality tests – hypothesis testing – SPSS (Shapiro-Wilk and Kolmogorov-Smirnov)
(d) Non-normality tests – hypothesis testing – SPSS (One-Sample Kolmogorov-Smirnov tests for uniform, Poisson, and exponential distributions) – others are possible through Scripts and Syntax
(e) Mean differences – Single mean test, One sample t-test, Two samples (Independent and Dependent sample tests)
(f) Association – Linear and Non-Linear modes of regression
(g) Correlation – Linear and Non-Linear modes of correlation
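Several of the descriptive statistics listed under (a) and (b) can be reproduced outside SPSS for checking purposes. The sketch below uses Python's standard statistics module on a small made-up list of scores; the data are hypothetical, and the variance and standard deviation are the sample (n − 1) versions.

```python
import math
import statistics

# Hypothetical scores for eight subjects (one continuous variable).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)            # most frequent value
variance = statistics.variance(scores)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)        # sample standard deviation
value_range = max(scores) - min(scores)
std_error = std_dev / math.sqrt(n)        # standard error of the mean

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", round(variance, 3), "std dev:", round(std_dev, 3))
print("range:", value_range, "SE of mean:", round(std_error, 3))
```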

Types of Samplings

All research starts with a single purpose or multiple purposes ... Purposive Sampling

Additional types of sampling
(a) Simple random sampling – samples selected randomly – equal probability of selection – unbiased sampling
(b) Systematic sampling – samples selected from an ordered sampling frame
(c) Stratified sampling – the population is divided into homogeneous subgroups
(d) Cluster sampling – the population is divided into groups with similar characteristics
(e) Convenience sampling – easy sampling – choose groups of interest
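The first three sampling types can be sketched with Python's random module. Everything here is an assumption made for illustration: the sampling frame of 100 subject IDs, the sample size of 10, the "urban" and "rural" strata, and the seed.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable
frame = list(range(1, 101))     # hypothetical sampling frame: subject IDs 1..100
sample_size = 10

# (a) Simple random sampling: every subject has an equal chance of selection.
simple = rng.sample(frame, sample_size)

# (b) Systematic sampling: random start, then every k-th subject in the ordered frame.
k = len(frame) // sample_size
start = rng.randrange(k)
systematic = frame[start::k]

# (c) Stratified sampling: draw from each homogeneous subgroup in proportion to its size.
strata = {"urban": frame[:60], "rural": frame[60:]}   # hypothetical strata
stratified = []
for members in strata.values():
    share = round(sample_size * len(members) / len(frame))
    stratified.extend(rng.sample(members, share))

print("simple random:", sorted(simple))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
```

Cluster sampling would instead pick whole groups at random and keep every member of the chosen groups.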

What type of research?

Sampling with replacement and no replacement

* These are tied to the probability of sample selection.
* For example, suppose we have five letters (A, B, C, D, E).

(a) Sampling with replacement – Select one letter and put it back into the sample space before selecting again. Two letters are chosen. The sample space is:
AA, AB, AC, AD, AE
BA, BB, BC, BD, BE
CA, CB, CC, CD, CE
DA, DB, DC, DD, DE
EA, EB, EC, ED, EE
The probability of choosing at least one letter “A” [AA, AB, AC, AD, AE, BA, CA, DA, EA]: Probability = 9/25 = 0.36

(b) Sampling without replacement – Select one letter and do not put it back into the sample space, so the same letter cannot be selected twice. Using the previous example with two letters chosen, the sample space has 20 outcomes:
AB, AC, AD, AE
BA, BC, BD, BE
CA, CB, CD, CE
DA, DB, DC, DE
EA, EB, EC, ED
The probability of choosing at least one letter “A” [AB, AC, AD, AE, BA, CA, DA, EA]: Probability = 8/20 = 0.4
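The two probabilities can be checked by enumerating the sample spaces. This is a small sketch using Python's itertools; the helper name prob_at_least_one_a is made up for the example.

```python
from fractions import Fraction
from itertools import permutations, product

letters = "ABCDE"

# With replacement: ordered pairs where the same letter may repeat.
with_replacement = ["".join(pair) for pair in product(letters, repeat=2)]

# Without replacement: ordered pairs of two different letters.
without_replacement = ["".join(pair) for pair in permutations(letters, 2)]


def prob_at_least_one_a(sample_space):
    """Share of equally likely outcomes that contain the letter A."""
    favourable = [outcome for outcome in sample_space if "A" in outcome]
    return Fraction(len(favourable), len(sample_space))


print(len(with_replacement), prob_at_least_one_a(with_replacement))
print(len(without_replacement), prob_at_least_one_a(without_replacement))
```

The enumeration gives 25 outcomes with replacement and 20 without, matching 9/25 and 8/20.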

Dependent and Independent Samples

Dependent Samples – Same subjects at different levels (Very Highly Correlated)

Independent Samples – Different subjects at same and different levels.(Low and Moderate Correlations)

[Figure: Independent and Dependent samples. Labels: Population 1 (Sample 1, Sample 2); Population 1 (Sample 3, Sample 4)]
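The practical difference between dependent and independent samples shows up in how the test statistic is computed. The sketch below assumes hypothetical before/after scores for five subjects and compares a paired t statistic (dependent samples) with a pooled-variance t statistic (treating the same numbers as two independent groups). The function names are made up for the example, and no p-values are computed.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for dependent samples: based on within-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def independent_t(sample_a, sample_b):
    """Pooled-variance t statistic for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error


# Hypothetical scores for the same five subjects measured twice.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]

t_paired = paired_t(before, after)
t_independent = independent_t(after, before)
print(f"paired t = {t_paired:.3f}, independent t = {t_independent:.3f}")
```

Because the paired version removes between-subject variation, it gives a much larger statistic on these highly correlated measurements than the independent version does.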

Sample Size

• Should be representative of the population size (N)
• In a general/normal case, n >= pN (p = 0.5 and above)
• Manual computation of sample size (n)

Margin of error/standard error in percentage (when the population size is unknown):

$ME = z \sqrt{\hat{P}(1 - \hat{P}) / n}$

$n = z^2 \hat{P}(1 - \hat{P}) / ME^2$

Computation of sample size with the finite population correction factor, where n on the right-hand side is the uncorrected sample size:

$n_{adj} = \dfrac{n N}{n + (N - 1)}$
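The sample size formulas can be worked through numerically. The sketch below assumes commonly used illustrative inputs that are not taken from the slides: z = 1.96 (95% confidence), an estimated proportion of 0.5, a 5% margin of error, and a population of 1,000. The function names are made up for the example.

```python
import math


def sample_size(z, p, margin_of_error):
    """n = z^2 * p * (1 - p) / ME^2, for an unknown or very large population."""
    return z ** 2 * p * (1 - p) / margin_of_error ** 2


def finite_population_correction(n, population_size):
    """Adjusted n = n * N / (n + (N - 1)) for a known, finite population."""
    return n * population_size / (n + (population_size - 1))


# Illustrative inputs (assumptions, not values from the slides).
n0 = sample_size(z=1.96, p=0.5, margin_of_error=0.05)
n_adj = finite_population_correction(n0, population_size=1000)

# Round up, since a fraction of a subject cannot be sampled.
print("uncorrected sample size:", math.ceil(n0))
print("corrected for N = 1000: ", math.ceil(n_adj))
```

With these inputs the uncorrected size rounds up to 385 and the corrected size to 278, showing how the correction lowers the requirement when the population is small.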

Useful Software to deal with the selection of sample size

(a) G*Power (http://www.gpower.hhu.de/)

(b) Power sample size (http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize)

(c) Power Analysis & Sample Size (http://www.ncss.com/software/pass/)


Parametric versus Non-parametric

Introduction

The terms “parametric” and “non-parametric” were coined by Jacob Wolfowitz in 1942.

Parametric – (distribution is known)
Non-parametric – (distribution is unknown)

In my view, this distinction is a general statistical idea, and it should serve as a benchmark or baseline for developing the various schools of thought on statistical tests.

Characteristics of parametric approach
(a) Data – follow a probability distribution
(b) Tied to probability modes of sampling (Simple random sampling, Stratified random sampling, Systematic random sampling, Random cluster, Stratified random cluster, Complex multi-stage random, Random mode of purposive sampling)
(c) Deals with statistical inferences on the distributions of parameters
(d) Always linked with linearity of data (variables and errors/residuals (uncertainty))
(e) Patterns of data (variables and errors/residuals) follow the style of homogeneity
(f) Follows strict assumptions (robust if the assumptions are fulfilled)

I would classify this approach as the classical approach because it does not follow the evolutionary direction of momentum.

Assumptions of parametric approach

(a) Linearity of parameters
(b) Homogeneity/homogeneous mode of existing variables and omitted variables (error terms/residuals) – symmetrical form of distribution
(c) Dependent variables/residuals should be normally distributed
(d) Randomness among the selected samples should be maintained (only if it has to do with random sampling)
(e) Expansionary use of non-categorical variables (continuous variables) in the statistical tests
(f) Minimization of outliers
(g) Mean, Mode, and Median of the variables are approximately the same (for the case of normal distribution) – Bell-Shaped Normal Curve
(h) Doesn’t deal with the process of re-sampling (Bootstrapping)

Identification on the statistical approach is an important step that should be taken before moving to existing forms of statistical tests.

Distributional tests are needed to determine the nature of data(variables and residuals)

In a simple context:
Parametric – follows normal distribution
Non-parametric – follows free distribution

Distribution tests of normality

Graphical approach
(a) Histogram with normal curve
(b) Box plot
(c) Normal Q-Q plot
(d) Normal P-P plot
(e) Leverage Plot

Numerical approach
Uni-variate tests
(a) Jarque-Bera test
(b) Coefficient of variation
(c) Coefficient of Skewness and Kurtosis
(d) Kolmogorov-Smirnov test
(e) Shapiro-Wilk test
(f) Shapiro-Francia test
(g) Anderson-Darling test
Multi-variate tests
(h) Multivariate tests of normality
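Of the numerical tests listed, the Jarque-Bera test is simple enough to sketch by hand, since it is built from the coefficients of skewness and kurtosis. The following is a minimal sketch in plain Python, not the SPSS implementation; the function name and the example data are made up, and the p-value relies on the large-sample chi-square approximation with 2 degrees of freedom, so it is unreliable for small samples.

```python
import math


def jarque_bera(values):
    """Return skewness, excess kurtosis, the JB statistic, and an asymptotic p-value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments using the population (divide by n) convention.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    jb = n / 6 * (skewness ** 2 + excess_kurtosis ** 2 / 4)
    # For a chi-square distribution with 2 df, P(X > jb) = exp(-jb / 2).
    p_value = math.exp(-jb / 2)
    return skewness, excess_kurtosis, jb, p_value


# Hypothetical data: one symmetric variable and one right-skewed variable.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 15, 40]

for label, data in (("symmetric", symmetric), ("skewed", skewed)):
    skew, kurt, jb, p = jarque_bera(data)
    print(f"{label}: skewness={skew:.3f}, excess kurtosis={kurt:.3f}, JB={jb:.3f}, p={p:.3f}")
```

A small p-value suggests departure from normality, which would point towards the non-parametric approach.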

Parametric tests of correlation
(a) Pearson product moment correlation coefficient (Bivariate analysis)
(b) Stepwise mode of linear regression (Multivariate analysis)
(c) Auxiliary mode of linear regression (Multivariate analysis)
(d) Scatter plot/Scatterplot matrix with fit line (linear form) (Bivariate analysis)

Non-parametric tests of correlation
(a) Spearman rank correlation (Bivariate analysis)
(b) Kendall’s tau rank correlation (Bivariate analysis)
(c) Stepwise mode of Non-linear regression (Multivariate analysis)
(d) Auxiliary mode of Non-Linear regression (Multivariate analysis)
(e) Scatter plot/Scatterplot matrix with fit line (non-linear form) (Bivariate analysis)
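The contrast between the Pearson and Spearman coefficients can be sketched in plain Python. This is an illustration under assumptions, not SPSS output: the helper names are made up, the data are hypothetical, and Spearman is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            result[order[position]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical monotonic but non-linear relationship (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(f"Pearson  = {pearson(x, y):.4f}")
print(f"Spearman = {spearman(x, y):.4f}")
```

For this relationship the rank-based coefficient reaches 1 while Pearson stays below 1, because Pearson measures only the linear part of the association.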

Parametric tests of associations
(a) Linear regression (Bivariate and Multivariate)
(b) Stepwise mode of Linear regression (Bivariate and Multivariate)
(c) Auxiliary mode of Linear regression (Bivariate and Multivariate)
(d) Linear mode of co-integration tests
(e) Linear mode of causality tests

Non-parametric tests of associations
(f) Non-Linear regression (Bivariate and Multivariate)
(g) Logistic regression (LR) – DV (categorical variable)
* Ordered LR (Ordinal variable)
* Un-ordered LR (Nominal variable)
(h) Correspondence Analysis, independent sample (Pearson Chi-Square, Contingency Coefficient (Nominal), Phi and Cramer’s V (Nominal), Lambda (Nominal))

Main features of SPSS –Inferential Statistics

Regression

Parametric

Linear Regression

Linear Curve Estimation

Linear Weight Estimation & Different types of estimation

Probit Regression

Tobit Regression

Linear mode of Scatter plot

Simultaneous regression

Non-Parametric

Non-Linear Regression

Non-Linear Curve Estimation

Non-Linear Weight Estimation & Different types of estimation

Linear mode of Leverage Plot and Residual Plot

Non-Parametric Regression

Logit Regression

Non-Linear mode of Scatter Plot

Non-Linear mode of Simultaneous equation

Parametric correlation

Pearson correlation

Linear Mode of Stepwise Regression

Linear Mode of Auxiliary regression

VIF & Tolerance Value

Linear mode of Scatter Plot

Non-Linear mode of Leverage plot and Residual plot

Non-Parametric Correlation

Spearman rank correlation

Kendall’s tau-b rank correlation

Non-Linear Step Wise regression

Non-Linear Auxiliary Regression

VIF & Tolerance Value

Non-Linear Mode of Scatter Plot

Parametric mode of testing on differences

Single test of mean

One sample t-test

Two sample t-test

Dependent Samples

* Paired sample t-test
* ANOVA repeated measures

Independent Samples

* Independent sample t-test
* ANOVA – one way/two way/multiple factors
* MANOVA, GANOVA, SPANOVA, ANCOVA, MANCOVA, SPANCOVA

Non-Parametric mode of testing on differences

Chi-Square test

2 sample tests

Dependent samples

Binomial test

* Wilcoxon test
* Sign test
* McNemar test
* Marginal Homogeneity
* Friedman test
* Kendall’s W test
* Cochran’s Q test

Independent samples

* Mann-Whitney U test
* Moses extreme reactions
* Kolmogorov-Smirnov Z
* Wald-Wolfowitz runs test
* Kruskal-Wallis H test
* Median test
* Jonckheere-Terpstra test
