Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 3 Precise & Approximate Relationships Between Variables Dr Gwilym Pryce


Page 1:

Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 3

Precise & Approximate Relationships Between Variables

Dr Gwilym Pryce

Page 2:

Plan:

1. Introduction
2. Precise Relationships
3. Approximate Relationships
4. Relationships between categorical variables

Page 3:

A token of transatlantic friendship… the relationship between variables:

Page 5:

1. Introduction to relationships between variables

Often of greatest interest in social science is investigation into relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is work alienation related to job monotony?

We are also interested in the direction of causation, but this is more difficult to prove empirically:
– our empirical models are usually structured assuming a particular theory of causation

Page 6:

Exercise:

Q/ Does the main research question that interests you involve a relationship between variables?

Think about:
– what the variables are
– the direction of causation
– the rationale for this causation
– whether it is a precise or approximate relationship

Page 7:

2. Precise relationships

No random or error component:
• Circumference = π × Diameter, where π ≈ 3.14
– (linear)
• Fahrenheit = 32 + (9/5) × Centigrade
– (linear)
• F = ma
– (non-linear)
– where F = force; m = mass; a = acceleration
• e = mc²
– (non-linear)
– where e = energy; m = mass; c = speed of light

Page 8:

– linear relationships have straight line graphical representations

– non-linear relationships have curved graphical representations
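
One rough way to see the distinction numerically rather than graphically: along a straight line, y changes by the same amount for every unit step in x, whereas along a curve the change varies. The sketch below uses the Fahrenheit formula from the lecture as the linear case and y = x² as the non-linear case; the function names are illustrative and not part of the lecture.

```python
def fahrenheit(c):
    """Linear relationship: Fahrenheit = 32 + 9/5 * Centigrade."""
    return 32 + 9 / 5 * c


def square(x):
    """Non-linear relationship: y = x squared."""
    return x ** 2


def first_differences(f, xs):
    """Change in f(x) between each pair of consecutive x values."""
    ys = [f(x) for x in xs]
    return [later - earlier for earlier, later in zip(ys, ys[1:])]


xs = [0, 1, 2, 3, 4]
linear_steps = first_differences(fahrenheit, xs)   # all (approximately) 1.8
curved_steps = first_differences(square, xs)       # grows with x

print("linear steps:", linear_steps)
print("curved steps:", curved_steps)
```

The constant step in the linear case is the slope of the line; the growing steps in the second case are what make the graph curve.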

Page 9:

Precise Linear Relationships

Exercise:
– Write a column of integers from 0 to 10 and call this variable ‘C’
– Then construct a new column called ‘F’ where F = 32 + 2C
– Then plot F and C on a graph with F on the vertical axis, and C on the horizontal axis.

Page 10:

C     F
0     32
1     34
2     36
3     38
4     40
5     42
6     44
7     46
8     48
9     50
10    52
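
The two columns of the exercise can also be generated programmatically. This is only a sketch in Python (the lecture itself does the exercise by hand); the variable names `c_values` and `f_values` are my own.

```python
# C runs over the integers 0 to 10; F = 32 + 2C as in the exercise.
c_values = list(range(11))
f_values = [32 + 2 * c for c in c_values]

print("C     F")
for c, f in zip(c_values, f_values):
    print(f"{c:<5} {f}")
```

Each one-unit increase in C raises F by exactly 2, which is why the plotted points fall on a straight line.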

Page 12:

Equation of a straight line:

Traditional to:
– call the dependent variable “y”
• i.e. the variable that’s being determined or explained
– call the explanatory variable “x”
• i.e. the determinant of y; the factor that explains the variation in y

Page 13:

y = a + bx
where:
• a is the vertical intercept
» measures how much y would be if x is zero
» changes in a simply move the line up or down in parallel shifts
• b is the slope coefficient
» measures how much y increases for every unit increase in x
» the greater the value of b the steeper the slope and the more sensitive y is to x.
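
The roles of a and b can be checked with a few numbers. The sketch below evaluates y = a + bx for three illustrative parameter choices (the specific values 32, 40, 2 and 5 are assumptions chosen for the example, not taken from the slide).

```python
def line(a, b, x):
    """Evaluate the straight line y = a + b*x."""
    return a + b * x


xs = [0, 1, 2, 3]
base = [line(32, 2, x) for x in xs]      # a = 32, b = 2
shifted = [line(40, 2, x) for x in xs]   # larger a: parallel shift upwards
steeper = [line(32, 5, x) for x in xs]   # larger b: steeper slope

print("base:   ", base)
print("shifted:", shifted)
print("steeper:", steeper)
```

Raising a moves every point up by the same amount, while raising b leaves the value at x = 0 unchanged and increases the gain in y per unit of x.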

Page 14:

Graphing exact relationships

Axes:
– put the dependent variable y on the vertical axis
– put the explanatory variable x on the horizontal axis

Equation is fully summarised with a line

Page 15:

y = ln(x)

[Figure: graph of y = ln(x), with x on the horizontal axis and y on the vertical axis (y-axis ticks from -3 to 2)]

Page 16:

y = exp(x) = e^x, where e ≈ 2.7

[Figure: graph of y = exp(x), with x on the horizontal axis (-4.9 to 14.9) and y on the vertical axis (0 to 4500000)]

Page 17:

y = x²

[Figure: graph of y = x², with x on the horizontal axis and y on the vertical axis (0 to 1000)]
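
The three non-linear functions graphed on these slides (the natural log, the exponential and the square) can be tabulated side by side to compare how quickly each grows. This is a small sketch using Python's standard `math` module; the x values are arbitrary choices for illustration.

```python
import math

xs = [0.5, 1.0, 2.0, 4.0]

# One row per x: (x, ln(x), exp(x), x squared)
rows = [(x, math.log(x), math.exp(x), x ** 2) for x in xs]

print(f"{'x':>5} {'ln(x)':>8} {'exp(x)':>8} {'x^2':>8}")
for x, ln_x, exp_x, x_sq in rows:
    print(f"{x:>5.1f} {ln_x:>8.3f} {exp_x:>8.3f} {x_sq:>8.1f}")
```

Note that ln(x) is only defined for x > 0 and rises ever more slowly, while exp(x) rises ever more quickly.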

Page 18:

3. Approximate relationships

In social science/epidemiology/history we don’t tend to get precise relationships
– e.g. Relationship between heart disease and smoking
– e.g. Educational achievement and social class of parents
– e.g. Rate of teenage pregnancy and area deprivation

Page 19:

Modelling approximate relationships:

Such relationships can sometimes be approximated/summarised by a precise relationship plus an error term:
– Linear:
• Risk of heart disease = a + b × no. of cigarettes + e
• y = a + b x + e
– Multivariate:
• y = a + b x + c z + e
– Non-linear:
• y = a + b x² + e
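
To make the idea of "precise relationship plus error term" concrete, the following sketch simulates data from y = a + bx + e. Everything numeric here is an assumption for illustration: the "true" values a = 10 and b = 2, the normally distributed error with standard deviation 3, and the fixed random seed.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same numbers each run

a, b = 10.0, 2.0                             # assumed "true" intercept and slope
xs = [float(x) for x in range(20)]
errors = [random.gauss(0, 3) for _ in xs]    # e: the random error term
precise = [a + b * x for x in xs]            # the precise part, a + bx
ys = [p + e for p, e in zip(precise, errors)]

for x, p, y in list(zip(xs, precise, ys))[:5]:
    print(f"x = {x:4.1f}   a + bx = {p:5.1f}   y = {y:6.2f}")
```

A scatter plot of `ys` against `xs` would show points scattered around, rather than exactly on, the line a + bx.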

Page 20:

Graphing approximate relationships

The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– Again, traditional to:
• put the dependent variable (i.e. the “effect”) on the vertical axis, or “y axis”
• put the explanatory variable (i.e. the “cause”) on the horizontal axis, or “x axis”

Page 21:

Scatter plot of IQ and Income:

[Figure: scatter plot with INCOME on the vertical axis (10000 to 40000) and IQ on the horizontal axis (60 to 160)]

Page 22:

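We would like to find the line of best fit:

[Figure: scatter plot with INCOME on the vertical axis (10000 to 40000) and IQ on the horizontal axis (60 to 160), with a straight line drawn through the points]

$\hat{y} = a + bx$

where a = y intercept and b = slope of line

INCOME = a + b IQ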

Page 23:

Sometimes the relationship appears non-linear:

[Figure: scatter plot with INCOME on the vertical axis (10000 to 40000) and IQ2 on the horizontal axis (0 to 300)]

Page 24:

… and so a straight line of best fit is not always very satisfactory:

[Figure: scatter plot of INCOME (10000 to 40000) against IQ2 (0 to 300) with a straight line of best fit]

Page 25:

Could try a quadratic line of best fit:

[Figure: scatter plot of INCOME (10000 to 40000) against IQ2 (0 to 300) with a quadratic line of best fit]

Page 26:

… or a cubic line of best fit: (overfitted?)

[Figure: scatter plot of INCOME (10000 to 40000) against IQ2 (0 to 300) with a cubic line of best fit]

Page 27:

Could try two linear lines: “structural break”

[Figure: scatter plot of INCOME (10000 to 40000) against IQ2 (0 to 300) with two separate straight lines fitted]

Page 28:

Q/ How do we best fit a straight line?
A/ Regression analysis
– The most popular algorithm for drawing the line of best fit
– minimises the sum of squared deviations from the line to each observation
– also called ‘Ordinary Least Squares’ (OLS)

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
$y_i$ = observed value of y
$\hat{y}_i$ = predicted value of $y_i$ = the value on the line of best fit corresponding to $x_i$
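
The quantity being minimised can be computed directly for any candidate line. The sketch below does this for two candidate lines on a small made-up data set (the five data points and both candidate lines are hypothetical, chosen only to illustrate the criterion); the line with the smaller sum of squared deviations is the better fit in the least-squares sense.

```python
def sum_squared_deviations(a, b, xs, ys):
    """Sum over all observations of (observed y - predicted y) squared,
    where the predicted y for the line y = a + b*x is a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

good = sum_squared_deviations(0.0, 2.0, xs, ys)   # candidate line y = 2x
poor = sum_squared_deviations(0.0, 1.0, xs, ys)   # candidate line y = x

print(f"y = 2x: {good:.2f}")
print(f"y = x:  {poor:.2f}")
```

OLS can be thought of as searching over all possible values of a and b for the pair that makes this sum as small as possible.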

Page 29:

Regression estimates of a, b:

This algorithm yields estimates of the slope b and y-intercept a of the straight line
– b is usually the parameter of most interest since it tells us what happens to y if x increases by 1.
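
For a single explanatory variable, the least-squares estimates have a standard closed form: b is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x. The lecture does not show these formulas, so the sketch below is supplementary; the function name `ols_fit` and the data are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = y_bar - b * x_bar      # intercept: the line passes through the means
    return a, b


# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_fit(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Here the estimated b says that, on this made-up data, y rises by about 2 for each one-unit increase in x.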

Page 30:

But sometimes the line of best fit doesn’t seem to explain the variation in y very well:

[Figure: scatter plot with Purchase Price on the vertical axis (0 to 300000) and Floor Area (sq meters) on the horizontal axis (0 to 300)]

Q/ Why do you think this might be?

Page 31:

Is floor area the only factor? What other variables determine purchase price?

[Figure: scatter plot with Purchase Price on the vertical axis (0 to 300000) and Floor Area (sq meters) on the horizontal axis (0 to 300)]

Page 32:

Omitted explanatory variables:

If the line of best fit doesn’t seem to explain much of the variation in y this might be because there are other factors determining y:

Page 33:

Scatter plot (with floor spikes)

[Figure: 3D scatter plot of Purchase Price (100000 to 300000) against Floor Area (sq meters) (100 to 300) and Number of Bathrooms (1.0 to 3.5)]

Page 34:

Fitting non-linear lines of best fit:

Regression analysis can be used to summarise non-linear relationships, both bi-variate and multivariate:
– e.g. y = a + b x² + c z²
• multivariate and quadratic in x and z
– e.g. y = a + b x + c z²
• multivariate: linear relationship between y and x but quadratic relationship between y and z
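
A simple bi-variate case shows why ordinary regression can handle such curves: y = a + b x² is still linear in the transformed variable w = x², so a straight-line fit of y on w recovers a and b. The sketch below is an illustration under that assumption, not the lecture's own method; the helper `ols_fit` and the data (generated exactly from y = 5 + 3x²) are hypothetical, and the multivariate examples above would need a multiple-regression routine instead.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Hypothetical data generated exactly from y = 5 + 3 * x^2
xs = [-2, -1, 0, 1, 2]
ys = [5 + 3 * x ** 2 for x in xs]

# Transform the explanatory variable, then fit a straight line in w = x^2
ws = [x ** 2 for x in xs]
a, b = ols_fit(ws, ys)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Fitting y on x directly would give a slope of zero for this symmetric data, which is one sign that a straight line in x is the wrong functional form.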

Page 35:

3D Surface Plots: Construction, Price & Unemployment

[Figure: 3D surface plot with axes labelled Q, P and Ut-1]

Page 36:

Construction Equation in a Slump

[Figure: 3D surface plot of the construction equation]

=> new construction has a linear relationship with Price, but a quadratic relationship with unemployment.

Page 37:

4. Relationships between categorical variables:

The easiest way to represent relationships between categorical variables is to use contingency tables
– also called cross-tabulations or cross tabs
– also called two way tables

They show the number of observations (or % of observations) in particular categories and naturally lead to a test of independence whose test statistic has a Chi-square (or “χ²”) distribution.

Page 38:

Contingency Tables in SPSS:

Page 40:

Most basic cross tab just lists the count in each category:

You can add % in each category by returning to the cross-tabs window, selecting the cells button, and choosing which percentages you want:

first time buyer y=2 n=1 * House County from Postcode Crosstabulation

Count
                                   House County from Postcode
                                   Cumber    Durham    Total
first time buyer y=2 n=1     N     203       154       357
                             Y     104       95        199
Total                              307       249       556
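
The counts in this cross tab are enough to compute a chi-square statistic for the test of independence by hand. The sketch below is an assumption about how one might do so outside SPSS: it computes the Pearson chi-square statistic without a continuity correction, comparing each observed count with the count expected if first-time-buyer status and county were independent (row total × column total / grand total). SPSS output may report additional or corrected statistics.

```python
# Observed counts from the cross tab: rows are first-time buyer N / Y,
# columns are Cumber / Durham.
counts = [
    [203, 154],   # N
    [104, 95],    # Y
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected count in each cell if the two variables were independent
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(counts, expected)
    for obs, exp in zip(obs_row, exp_row)
)

print(f"chi-square = {chi_square:.2f}")
```

For a 2 × 2 table the statistic has 1 degree of freedom, for which the 5% critical value is about 3.84. The value here comes out at roughly 1.09, so on these counts there is little evidence against independence.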

Page 41:

If you select all three (row, column and total), you will end up with:

Page 42:

first time buyer y=2 n=1 * House County from Postcode Crosstabulation

                                                        House County from Postcode
first time buyer y=2 n=1                                Cumber    Durham    Total
N       Count                                           203       154       357
        % within first time buyer y=2 n=1               56.9%     43.1%     100.0%
        % within House County from Postcode             66.1%     61.8%     64.2%
        % of Total                                      36.5%     27.7%     64.2%
Y       Count                                           104       95        199
        % within first time buyer y=2 n=1               52.3%     47.7%     100.0%
        % within House County from Postcode             33.9%     38.2%     35.8%
        % of Total                                      18.7%     17.1%     35.8%
Total   Count                                           307       249       556
        % within first time buyer y=2 n=1               55.2%     44.8%     100.0%
        % within House County from Postcode             100.0%    100.0%    100.0%
        % of Total                                      55.2%     44.8%     100.0%
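
The three kinds of percentage in this table differ only in the denominator: the row total, the column total, or the grand total. The sketch below recomputes them from the raw counts as a check; it is an illustration in Python rather than a description of what SPSS does internally, and the dictionary layout and helper name are my own.

```python
# Observed counts: first-time buyer (N / Y) by county (Cumber / Durham)
counts = {
    "N": {"Cumber": 203, "Durham": 154},
    "Y": {"Cumber": 104, "Durham": 95},
}

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {
    col: sum(counts[row][col] for row in counts)
    for col in ["Cumber", "Durham"]
}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage of whole accounted for by part, rounded to 1 decimal place."""
    return round(100 * part / whole, 1)


for row, cells in counts.items():
    for col, count in cells.items():
        print(
            f"{row}, {col}: count = {count}, "
            f"% within row = {pct(count, row_totals[row])}, "
            f"% within column = {pct(count, col_totals[col])}, "
            f"% of total = {pct(count, grand_total)}"
        )
```

Comparing the "% within column" figures across counties (for example, the share of first-time buyers in each) is usually the quickest way to see whether the two variables look related.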