Introduction to SPSS for GHA Staff

Prof Gwilym Pryce: [email protected]

Tutors: George Vlachos, Christian Holz

Lab notes based on material by John Malcolm

Plan:
• A. Data Types
– 1. Variables
– 2. Constants
• B. Introduction to SPSS
– 1. SPSS Menu Bar
– 2. File Types
• C. Tabulating Data
– 1. Categorical variables
– 2. Continuous variables
• D. Graphing Data
– 1. Categorical variables
– 2. Continuous variables

A. Data Types

• 1. Variables

• 2. Constants

1. What is a variable?

– A measurement or quantity that can take on more than one value:
• E.g. size of planet: varies from planet to planet
• E.g. weight: varies from person to person
• E.g. gender: varies from person to person
• E.g. fear of crime: varies from person to person
• E.g. income: varies from household (HH) to household
– I.e. values vary across ‘individuals’ = the objects described by our data
• Individuals = the basic units of a data set, which we observe or experiment on in a controlled way
– not necessarily persons
• (could be schools, organisations, countries, groups, policies, or objects such as cars or safety pins)
• Variables = information that can vary across the individuals we observe
– e.g. age, height, gender, income, exam scores, whether signed Nuclear Test Ban Treaty

Variable Type, for Coding Purposes: Variable View of the Data

Variable Type for Coding Purposes:

• Available data types in SPSS are as follows:
– Numeric (the default for new variables)
– Comma
– Dot
– Scientific Notation
– Date
– Custom currency
– String

• Numeric
– A variable whose values are numbers. Values are displayed in standard numeric format.
– The Data Editor accepts numeric values in standard format or in scientific notation.

• Comma
– A numeric variable whose values are displayed with commas delimiting every three places, and with the period as a decimal delimiter.
– The Data Editor accepts numeric values for comma variables with or without commas, or in scientific notation.
• Values cannot contain commas to the right of the decimal indicator.

• Dot
– A numeric variable whose values are displayed with periods delimiting every three places and with the comma as a decimal delimiter.
– The Data Editor accepts numeric values for dot variables with or without periods, or in scientific notation.
• Values cannot contain periods to the right of the decimal indicator.
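The difference between the Comma and Dot formats is only in how the same number is displayed. A minimal Python sketch of the two display conventions (the function names are illustrative and are not part of SPSS):

```python
def format_comma(value: float, decimals: int = 2) -> str:
    """Comma format: commas every three places, period as decimal delimiter."""
    return f"{value:,.{decimals}f}"


def format_dot(value: float, decimals: int = 2) -> str:
    """Dot format: periods every three places, comma as decimal delimiter."""
    # Start from the comma style and swap the two delimiters in one pass.
    return format_comma(value, decimals).translate(str.maketrans(",.", ".,"))


print(format_comma(1234567.891))  # 1,234,567.89
print(format_dot(1234567.891))    # 1.234.567,89
```

The stored value is identical in both cases; only the delimiters shown on screen change.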

• Scientific notation
– A numeric variable whose values are displayed with an embedded E and a signed power-of-ten exponent.
– The Data Editor accepts numeric values for such variables with or without an exponent.
• The exponent can be preceded either by E or D with an optional sign, or by the sign alone. For example: 123, 1.23E2, 1.23D2, 1.23E+2, and even 1.23+2.
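All five spellings in the example above denote the same number, 123. The following Python sketch shows one way such input could be normalised; it is an illustration of the notation under the rules just listed, not SPSS's own parser, and the name parse_scientific is hypothetical.

```python
import re

# A mantissa followed by an optional exponent, written either as
# E or D with an optional sign, or as a sign alone (e.g. 1.23+2).
_NUMBER = re.compile(
    r"^\s*([+-]?(?:\d+\.?\d*|\.\d+))"       # mantissa
    r"(?:[EeDd]([+-]?\d+)|([+-]\d+))?\s*$"  # the two exponent forms
)


def parse_scientific(text: str) -> float:
    """Convert text such as '1.23D2' or '1.23+2' to a float."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"not a recognised numeric value: {text!r}")
    mantissa, marked_exponent, bare_exponent = match.groups()
    exponent = marked_exponent or bare_exponent or "0"
    return float(f"{mantissa}e{exponent}")


for text in ["123", "1.23E2", "1.23D2", "1.23E+2", "1.23+2"]:
    print(text, "->", parse_scientific(text))
```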

• Date
– A numeric variable whose values are displayed in one of several calendar-date or clock-time formats. Select a format from the list.
– You can enter dates with slashes, hyphens, periods, commas, or blank spaces as delimiters.
– The century range for two-digit year values is determined by your Options settings (from the Edit menu, choose Options and click the Data tab).
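A century range is a 100-year window that decides which century a two-digit year belongs to. The Python sketch below illustrates the idea only; the function name and the window start of 1950 are assumptions made for the example, not the SPSS default.

```python
def expand_two_digit_year(yy: int, range_start: int) -> int:
    """Map a two-digit year into the 100-year window starting at range_start."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = range_start - range_start % 100
    year = century + yy
    # Years that fall before the window start belong to the next century.
    if year < range_start:
        year += 100
    return year


# Assumed window: 1950 to 2049.
print(expand_two_digit_year(49, 1950))  # 2049
print(expand_two_digit_year(50, 1950))  # 1950
print(expand_two_digit_year(99, 1950))  # 1999
```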

• Custom currency
– A numeric variable whose values are displayed in one of the custom currency formats that you have defined in the Currency tab of the Options dialog box.
– Defined custom currency characters cannot be used in data entry but are displayed in the Data Editor.

• String
– Values of a string variable are not numeric and therefore are not used in calculations.
– They can contain any characters up to the defined length.
– Uppercase and lowercase letters are considered distinct.
– Also known as an alphanumeric variable.

Conceptual Approach to Variable Type:

• Numeric = values are numbers that can be used in calculations.
• String = values are not numeric, and hence not used in calculations.
– But can often be coded, i.e. transformed into a numerical variable:
• e.g. IF (LA = 'Aberdeen') X = 1.
IF (LA = 'East Renfrewshire') X = 2.
etc.
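The same coding step can be sketched outside SPSS. The Python below is an illustration only: the code book LA_CODES and the function name are hypothetical, and just the two local authorities named above are included.

```python
from typing import Optional

# Hypothetical code book: local authority name -> numeric code.
LA_CODES = {"Aberdeen": 1, "East Renfrewshire": 2}


def code_local_authority(name: str) -> Optional[int]:
    """Return the numeric code for a local authority, or None if it has no code."""
    return LA_CODES.get(name)


la_values = ["Aberdeen", "East Renfrewshire", "aberdeen"]
x = [code_local_authority(la) for la in la_values]
print(x)  # [1, 2, None]
```

The last value is left uncoded because string comparison is case sensitive: 'aberdeen' and 'Aberdeen' are different values.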

Continuous vs Categorical

• Continuous (or Scale or quantitative) Variables = data values are numeric values on an interval or ratio scale
– (e.g., age, income). Scale variables must be numeric.
– E.g. dimmer switch: brightness of light can be measured along a continuum from dark to full brightness
• Categorical Variables = variables that have values which fall into two or more discrete categories
– E.g. conventional light switch: either total darkness or full brightness, on or off.
– Male or female, employment category, country of origin

Two types of Categorical variables: Ordinal & Nominal

• Ordinal variables = Data values represent categories with some intrinsic order
– (e.g., low, medium, high; strongly agree, agree, disagree, strongly disagree).
– Ordinal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=low, 2=medium, 3=high).

Ordinal variables:
• Values fall within discrete but ordered categories
– I.e. the sequence of categories has meaning
• e.g. education categories:
– 1 = primary
– 2 = secondary
– 3 = college
– 4 = university undergraduate
– 5 = university postgraduate masters
– 6 = university postgraduate PhD
• e.g. 1 = very poor, 2 = poor, 3 = good, 4 = very good
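Because the codes of an ordinal variable are ordered, order-based summaries such as the median category or the highest category are meaningful. A short Python sketch using the education codes above; the list of responses is made up for illustration.

```python
import statistics

EDUCATION_LABELS = {
    1: "primary",
    2: "secondary",
    3: "college",
    4: "university undergraduate",
    5: "university postgraduate masters",
    6: "university postgraduate PhD",
}

# Hypothetical responses from seven individuals.
responses = [2, 4, 1, 3, 2, 6, 4]

# median_low always returns one of the observed codes, never a value in between.
median_code = statistics.median_low(responses)
highest_code = max(responses)

print("Median category:", EDUCATION_LABELS[median_code])
print("Highest category:", EDUCATION_LABELS[highest_code])
```

The same calculation would be meaningless for a nominal variable, where the numeric codes are arbitrary labels.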

Nominal variables
• Nominal Variables = Data values represent categories with no intrinsic order
– sequence of categories is arbitrary: ordering has no meaning in and of itself:
• e.g. country of origin: Wales, Scotland, Germany…
• e.g. make of car: Ford, Vauxhall
• e.g. job category
• e.g. company division
– Nominal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=Male, 2=Female).

2. What is a constant?

– A measurement or quantity that has only one value for all the objects described in our data
– Also called a ‘scalar’ or ‘intercept’ or ‘parameter’
• E.g. speed of light in a vacuum: constant for all light transmissions
• E.g. ratio of diameter to circumference: constant for all circles
• E.g. average increase in life expectancy: constant at 1 year per annum since 1900
• E.g. price elasticity of housing supply: assumed constant for a particular market
• Often it is a constant that we want to estimate:
– we employ statistical techniques to estimate ‘parameters’ or ‘constants’ that summarise or link variables.
• e.g. mean = ‘typical’ value of a variable = measure of central tendency
• e.g. standard deviation = measure of the variability of a variable = measure of spread
• e.g. correlation coefficient = measures the strength of the linear relationship between two variables
• e.g. slope coefficient = how much y changes, on average, when x increases by one unit
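Each of these constants is a single number estimated from many observations. A minimal Python sketch with a small made-up data set (x and y are hypothetical values, chosen so the arithmetic is easy to check by hand):

```python
import math
import statistics

# Hypothetical paired observations on five individuals.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
sd_x = statistics.stdev(x)  # sample standard deviation

# Sums of squares and cross-products around the means.
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

correlation = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation coefficient
slope = s_xy / s_xx                          # least-squares slope of y on x

print(f"mean of y = {mean_y}, sd of x = {sd_x:.3f}")
print(f"correlation = {correlation:.3f}, slope = {slope:.1f}")
```

Here the slope of 0.6 says that y rises by 0.6 on average for each one-unit increase in x.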


B. Introduction to SPSS
1. SPSS Menu Bar
• When you first open SPSS, you will usually be presented with a blank Data View window
– The Data View lists variables as columns and observations (also called “cases” or “individuals”) as rows
• Data View without and with data looks like this…

Data View of Home Sales data:

• Variable View looks like this…

Variable View of Home Sales data:

SPSS Menu Bar

B.2. File Types & SPSS Structure
• If you try opening a new file (File, New), you will see that you are presented with five choices of file type.
• These choices reflect the basic structure of SPSS:
– Data
– Syntax
• Steep learning curve, but essential for larger projects
– Backup
– Record/checking
– Re-use
– Output
• Graphs, tables, commands, error messages

SPSS Scripting Facility

• The scripting facility allows you to automate tasks, including:
– Automatically customize output in the Viewer.
– Open and save data files.
– Display and manipulate dialog boxes.
– Run data transformations and statistical procedures using command syntax.
– Export charts as graphic files in a number of formats.


C. Tabulating Data
• 1. Categorical Data: Frequency Tables
– E.g. Neighbourhood type (House Sales data)
• Analyze, Descriptive Statistics, Frequencies

Neighborhood

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid   A             42       1.7             1.7                  1.7
        B            319      13.1            13.1                 14.8
        C            258      10.6            10.6                 25.4
        D            467      19.1            19.1                 44.5
        E            500      20.5            20.5                 65.0
        F            372      15.2            15.2                 80.2
        G            482      19.8            19.8                100.0
        Total       2440     100.0           100.0
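The Percent and Cumulative Percent columns follow directly from the frequencies. A short Python sketch that recomputes them from the counts in the Neighborhood table (the variable names are illustrative):

```python
# Frequencies from the Neighborhood table.
counts = {"A": 42, "B": 319, "C": 258, "D": 467, "E": 500, "F": 372, "G": 482}

total = sum(counts.values())
percents = []
cumulative = []
running = 0
for category, n in counts.items():
    running += n
    percents.append(round(100 * n / total, 1))           # share of all cases
    cumulative.append(round(100 * running / total, 1))   # share up to this row

print("Total:", total)
for category, pct, cum in zip(counts, percents, cumulative):
    print(f"{category}  {counts[category]:5d}  {pct:5.1f}  {cum:5.1f}")
```

Percent and Valid Percent coincide in this table because there are no missing values.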

• Categorical Data: Crosstabs (2-Way Tables)
– E.g. Does Ethnic Minority Status affect job type? (Employment data)
• Analyze, Descriptive Statistics, Crosstabs

Minority Classification * Employment Category Crosstabulation

                                                        Employment Category
Minority Classification                                 Clerical   Custodial   Manager    Total
No      Count                                                276          14        80      370
        % within Employment Category                       76.0%       51.9%     95.2%    78.1%
Yes     Count                                                 87          13         4      104
        % within Employment Category                       24.0%       48.1%      4.8%    21.9%
Total   Count                                                363          27        84      474
        % within Employment Category                      100.0%      100.0%    100.0%   100.0%
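The "% within Employment Category" figures are column percentages: each count divided by the total for its employment category. A Python sketch that reproduces them from the counts in the crosstab (the dictionary layout is just one convenient choice):

```python
# Counts from the crosstab: minority classification by employment category.
counts = {
    "No":  {"Clerical": 276, "Custodial": 14, "Manager": 80},
    "Yes": {"Clerical": 87,  "Custodial": 13, "Manager": 4},
}
categories = ["Clerical", "Custodial", "Manager"]

# Column totals: number of employees in each employment category.
column_totals = {c: sum(row[c] for row in counts.values()) for c in categories}

# Column percentages: share of each employment category that is No / Yes.
column_percent = {
    minority: {c: round(100 * row[c] / column_totals[c], 1) for c in categories}
    for minority, row in counts.items()
}

print(column_totals)
print(column_percent)
```

Comparing the "Yes" row across columns (24.0%, 48.1%, 4.8%) is what answers the question of whether job type differs by minority status.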

2. Scale Data

• Scale or quantitative data: usually a measurement of size or quantity
– not meaningful to report % or count
• Not unless you break the variable into categories (& then it becomes categorical data!)
• e.g. income bands = “grouped data”
• Tables of raw data are not much use unless there are only a few values...
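Grouping a scale variable into bands turns it into categorical data that can be counted. A Python sketch of the idea; the band width of 5,000 is an arbitrary choice for the example, and the incomes are a handful of values taken from the borrower income listing.

```python
from collections import Counter

# A few borrower incomes from the raw data listing.
incomes = [10800, 19072, 7216, 12000, 9758, 6084, 18336]

BAND_WIDTH = 5000  # assumed band width, chosen only for illustration


def income_band(income: int) -> int:
    """Return the lower limit of the band that contains the income."""
    return (income // BAND_WIDTH) * BAND_WIDTH


band_counts = Counter(income_band(value) for value in incomes)

for lower in sorted(band_counts):
    print(f"{lower:>6} to {lower + BAND_WIDTH - 1:>6}: {band_counts[lower]}")
```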

How do you tabulate 129,000 observations?

CM SML 1988: Borrower Total Income ("." = missing value)

Case  Income    Case  Income    Case  Income    Case  Income
   1       .      21   10800      41       .      61       .
   2       .      22       .      42    7216      62       .
   3       .      23   19072      43       .      63       .
   4       .      24       .      44   12000      64       .
   5       .      25       .      45    9758      65       .
   6       .      26       .      46    6084      66       .
   7       .      27       .      47       .      67       .
   8       .      28       .      48       .      68       .
   9       .      29       .      49       .      69   18336
  10       .      30       .      50    9345      70   15096
  11       .      31       .      51    9810      71       .
  12       .      32       .      52   14406      72   12597
  13       .      33       .      53    9190      73    9700
  14       .      34       .      54       .      74       .
  15   18720      35       .      55       .      75       .
  16   16000      36       .      56       .      76       .
  17   16455      37       .      57       .      77       .
  18       .      38   11500      58       .      78    5295
  19    7020      39    2912      59       .      79    4539
  20    4576      40   11745      60       .      80       .

Tables of Summary Statistics for Continuous Data:

• Descriptives Function in SPSS:
– E.g. House Sales data
• On SPSS Menu Bar select:
– Analyze, Descriptive Statistics, Descriptives

Descriptive Statistics

                       N     Minimum    Maximum    Mean       Std. Deviation
Current Salary         474   $15,750    $135,000   $34419.6   $17,075.661
Valid N (listwise)     474
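A Descriptives table reports N, minimum, maximum, mean and standard deviation for the valid (non-missing) cases only. A Python sketch of the same summary, using cases 15 to 20 of the borrower income listing, with None standing in for a missing value:

```python
import statistics

# Borrower incomes for cases 15 to 20; None marks a missing value.
incomes = [18720, 16000, 16455, None, 7020, 4576]

# Drop missing values before summarising.
valid = [value for value in incomes if value is not None]

summary = {
    "N": len(valid),
    "Minimum": min(valid),
    "Maximum": max(valid),
    "Mean": statistics.mean(valid),
    "Std. Deviation": statistics.stdev(valid),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>15}: {value:,.1f}")
```

Five numbers summarise the variable however many cases there are, which is why this kind of table scales to very large data sets where a raw listing does not.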

• Explore Function in SPSS:
– On SPSS Menu Bar select:
• Analyze, Descriptive Statistics, Explore

Descriptives: Current Salary

                                                   Statistic    Std. Error
Mean                                               $34419.6     $784.311
95% Confidence Interval for Mean   Lower Bound     $32878.4
                                   Upper Bound     $35960.7
5% Trimmed Mean                                    $32455.2
Median                                             $28875.0
Variance                                           3E+008
Std. Deviation                                     $17075.7
Minimum                                            $15,750
Maximum                                            $135,000
Range                                              $119,250
Interquartile Range                                $13,163
Skewness                                           2.125        .112
Kurtosis                                           5.378        .224
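Several entries in the Explore output can be derived from the others, which is a useful check on how to read the table. The Python sketch below starts from N = 474 and the reported mean, standard deviation, minimum and maximum. The confidence interval here uses a normal approximation, which is an assumption of this sketch, so it differs slightly from the bounds in the table.

```python
import math
import statistics

# Values reported for Current Salary.
n = 474
mean = 34419.6
std_dev = 17075.661
minimum, maximum = 15750, 135000

std_error = std_dev / math.sqrt(n)   # standard error of the mean
value_range = maximum - minimum      # range
variance = std_dev ** 2              # variance, displayed as 3E+008

# Approximate 95% confidence interval using the normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)
lower = mean - z * std_error
upper = mean + z * std_error

print(f"Std. Error = {std_error:.3f}")
print(f"Range = {value_range}")
print(f"Variance = {variance:.3g}")
print(f"Approximate 95% CI = ({lower:.1f}, {upper:.1f})")
```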


D. Graphs of Variables: 1. Graphs of Categorical Data

• Pie Charts
– If all the categories sum to a meaningful total, then you can use a pie chart
– Pie charts emphasise the differences in proportions between categories
– OK for a single snapshot, but not very good for showing trends
• would need to have a separate pie chart for each year
• On SPSS Menu Bar select:
• Graphs, Pie, Summaries for Groups of Cases
• Bar Charts
– can show either % or count
– not very good for showing trends in more than one category

[Figure: Income Support claimants with housing costs by statistical group in May 1999. X-axis: Category of Claimant (Aged 60 or over, Lone Parents, Disabled, Other); y-axis: 000's, 0 to 100. Source: DSS Quarterly Statistical Enquiry]


[Figure: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year, 1993 to 1999; y-axis: 000's, 0 to 140. Series: Other, Disabled, Lone Parents, Aged 60 or over]

Beware of scaling...

[Figure: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year, 1993 to 1999; y-axis: 000's, 60 to 120. Series: Lone Parents]

[Figure: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year, 1993 to 1999; y-axis: 000's, 0 to 200. Series: Lone Parents]
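The two Lone Parents charts plot the same series on different vertical scales, and the choice of scale changes how large a movement looks. A Python sketch of the effect with made-up figures (the values 100 and 110 are hypothetical, not read from the charts):

```python
def apparent_ratio(first: float, second: float, axis_min: float) -> float:
    """Ratio of plotted heights when the vertical axis starts at axis_min."""
    return (second - axis_min) / (first - axis_min)


# Hypothetical values in 000's for two successive years.
first, second = 100, 110

print(apparent_ratio(first, second, axis_min=0))   # axis starting at 0
print(apparent_ratio(first, second, axis_min=60))  # axis starting at 60
```

With the axis starting at 0 the second point is drawn 10% higher than the first, matching the true change. With the axis starting at 60 it is drawn 25% higher, so the same change looks much larger.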

D. Graphs of Variables: 2. Graphs of Continuous Data

• What are we interested in when describing data?
• E.g. income:
– Is income evenly spread?
– Or are most people rich?
– Or are most people poor?
– Or are most reasonably well off?
• These are all questions about the variable’s Distribution
– We can represent the whole data set with one picture...

[Figure: histogram of TOTAL INCOME OF BORROWER(S). X-axis: income in bins of width 1,500, from 0 to 58,500; y-axis ticks: 0 to 12,000. Std. Dev = 12830.02, Mean = 17993.3, N = 125541.00]

• On SPSS Menu Bar select:
• Graphs, Histogram, and select variable

[Figure: histogram. LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV, bins of width 0.5 from -0.25 to 17.25; y-axis: Frequency, 0 to 60,000. Std. Dev = .25, Mean = .80, N = 74736.00]

[Figure: histogram. LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV, bins of width 0.05 from 0.00 to 1.50; y-axis: Frequency, 0 to 30,000. Std. Dev = .22, Mean = .80, N = 74552.00]

[Figure: histogram. LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV, bins of width 0.05 from 0.00 to 1.00; y-axis: Frequency, 0 to 30,000. Std. Dev = .22, Mean = .78, N = 70545.00]

[Figure: histogram. LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV, three bins: 0.00 - .50, .50 - 1.00, 1.00 - 1.50; y-axis: Frequency, 0 to 70,000. Std. Dev = .22, Mean = .80, N = 74552.00]
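A histogram counts how many values fall into each bin, so the picture depends on the bin width chosen. The Python sketch below bins the same made-up LTV values with a wide and a narrow width; the values and the function name are hypothetical, and empty bins are simply not listed.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width):
    """Return (lower bin edge, count) pairs for the non-empty bins."""
    counts = Counter(math.floor(value / bin_width) for value in values)
    return [(round(index * bin_width, 10), counts[index]) for index in sorted(counts)]


# Hypothetical loan-to-value ratios.
ltv = [0.42, 0.63, 0.77, 0.78, 0.81, 0.83, 0.93, 0.97, 1.02]

wide = histogram_counts(ltv, 0.5)
narrow = histogram_counts(ltv, 0.05)

print("width 0.5: ", wide)
print("width 0.05:", narrow)
```

With a width of 0.5 almost every case lands in a single bin and the shape of the distribution is hidden; with a width of 0.05 the clustering around 0.75 to 0.85 becomes visible.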

Summary
• A. Data Types
– 1. Variables
– 2. Constants
• B. Introduction to SPSS
– 1. SPSS Menu Bar
– 2. File Types
• C. Tabulating Data
– 1. Categorical variables
– 2. Continuous variables
• D. Graphing Data
– 1. Categorical variables
– 2. Continuous variables