
Transcript of Part 1 Module 3 Single File


Module 3 - Relationships between variables:  association, correlation and regression

This module is about relationships between variables: bivariate statistics for interval variables, covering Association, Correlation and Regression.

Introduction - Relationships between variables: 'bivariate statistics'

Fitting a regression line to two interval level variables

How should we draw a line to best 'summarise' the data in the scatterplot when the points do not fall exactly on one line?

Correlation and Association

The Pearson product moment correlation coefficient.

Strength of association - How strong is strong?

Transforming variables

A Golden Rule: correlation is not causation

Conclusion

Introduction - Relationships between variables: 'bivariate statistics'

As we started to see in modules 1 and 2, quantitative data analysis becomes powerful when we look at more than one variable at a time, so that we are able to investigate relationships between variables. That is to say, when we can establish that, across the cases in our dataset, the values of one variable vary in a consistent way with, or relate to, the values of another variable.

The frequency distribution of values for one variable may be of interest in itself (what is called univariate statistics). For example, a variable describing people's voting intentions might give us some insight into the likely result of an election. However if we were able to link voting intention to other variables, such as social class, or age, gender, area of residence etc., then we start to be able to understand not just what an election result might be, but also what kinds of people vote in particular ways, or even why they vote as they do. We can thus start to build up a picture of how and why the political system works as it does. When we examine two variables in combination we speak of bivariate statistics.

What constitutes a 'relationship' between two variables?

It's useful to think of two limiting situations. If there were no relationship at all between our variables, then knowing the value of one of our variables for any case would give us no guide at all to the value of the other variable for that case.


For example, imagine we had data on a group of people's views about whether pornography ought to be banned or not, and on their sex and age. We might find that sex was not associated with any difference in views: men might be as likely to support or oppose a ban as women. On the other hand, we might find that older people (male or female) were more in favour of some kind of restriction. In that situation, while knowing the sex of respondent would give us no indication of their likely views, knowing their age would give us some clue. We would have a relationship between age and views on banning pornography.

If it were the case that all old respondents wanted a ban and all young respondents rejected this idea, knowing their age would allow us to predict their views perfectly. We would have the strongest possible relationship between the two variables.

Relationships between variables allow us to build models or theories of social relations. In this example, the relationship between age and views on whether restrictions on pornography were desirable might form part of a theory about changing attitudes towards sexuality across generations.

Glossary

Bivariate

Involving two variates. Technically bivariate is an expression that relates two dependent variables with one or more independent variables, although in colloquial usage it is taken to mean two variables.

Univariate

Involving a single variate. Technically a variate is an expression that relates a single dependent variable with one or more independent variables, although in colloquial usage it is taken to mean a single variable.

A scattergram for two interval level variables

When we have variables at the interval level of measurement and especially when these variables are continuous or take a large number of values, then the first step in investigating whether a relationship exists is to plot a 'scattergram' or 'scatterplot'. We simply plot the values of each variable along one of the two axes of the graph, so that any one point, or coordinate, in the resulting plot corresponds to the combination of values for those two variables for a particular case.

A scatterplot has two axes - a vertical axis, called the Y axis and a horizontal axis, called the X axis. It is customary (but not obligatory) to put any variable that is thought of as a potential cause (the independent, predictor or explanatory variable) on the X-axis and place the variable that is thought of as an effect (the dependent or response variable) on the Y-axis. Each case or observation is represented in the scatterplot by a coordinate which is located at the intersection of a line drawn vertically from the point on the X axis representing its value for that variable, with a line drawn horizontally from the point on the Y axis representing its value for the other variable.


The scatterplot below (Figure 1) plots the variable for respondent's age (on the X axis) against the variable for age at which the respondent finished their education (on the Y axis) for a selection of cases from the GHS Simple dataset. The vertical line drawn on the scatterplot at age 44 and horizontal line representing an age upon completing education of 26 intersect at the coordinate for the case that takes that combination of values (representing a respondent 44 years old who finished their education when they were 26).

Figure 1.

What statisticians routinely call 'visual inspection of' but lesser mortals call 'looking at' a scatterplot can tell us all kinds of things about whether a relationship exists between our variables, and if so, what kind of relationship. First, imagine what the absence of any relationship would look like. If the values of one variable for the cases had no relationship at all to the values of the other variable, then the scatterplot would simply look like a mass of dots with no visible pattern. Dots would pepper the graph at random. 

The following scatterplot (Figure 2) shows the relationship between the code number of respondent and their gross weekly pay for a subset of the GHS survey. There is no obvious pattern to the dots (nor should there be! The code number is just a unique identifier for each participant in the survey).

 

Figure 2.

To produce scatterplots in SPSS, go to the 'graphs' menu, choose 'legacy dialogs' and then choose 'scatter'. The dialog box will ask you what kind of plot you want to produce: choose a 'simple' scatterplot. The dialog box which then opens will ask you which variables you want to place on the horizontal (X) and vertical (Y) axes.

Glossary

Scattergram

This is a 2 (or 3) dimensional plot of 2 (or 3) variables, with each data point representing the values of each case on each of the variables. Typically, in a 2-dimensional plot, this would show an independent variable along the horizontal (X) axis and a dependent variable along the vertical (Y) axis. A 3-dimensional plot would allow a second independent variable to be shown in the third dimension (Z).


Fitting a regression line to a scattergram

Next, imagine what would happen if the relationship was so strong that almost every case that took a particular value for one variable, took a common value for the other variable. All the dots on our scatterplot would tend to cluster along a single line. Such a line could take all kinds of different forms: it might form a smooth curve, or bend more erratically. Or it might also be straight. This last state of affairs is of particular interest because it makes it easier for us to model the nature of the relationship. When this is the case we can speak of a linear relationship. If we have a linear relationship we can use what is called the general linear model to investigate it.

The figure below shows the relationship between the age in years and average height in centimetres of boys in Toronto (each point of the plot records the results for the average height of boys of that age).

[Figure: average height in centimetres of Toronto boys by age in years]

Age is on the X axis, height on the Y axis. The coordinate in red, for example, shows that boys aged 10 had an average height of 125cm. When the relationship on the scatterplot approximates a straight line we can summarise the relationship between our two variables in terms of a formula or equation. It looks as if for each 5-year increase in age, height increases by about 20cm, that is, about 4cm per year. We could extend our line so as to mark exactly where it would cross the Y axis at the point where X is equal to zero, the point known as the intercept. Since height at 5 years is about 105cm, if we extended the line down to 0 years it would cross the Y axis at about 105 - 20 = 85cm. We can use this information to produce a formula that would give us the value of Y for any value of X. This formula is called a regression equation. For the above example it looks as if our regression equation would be something like:

Height in cm = 85 + (4*age in years)


We could now use this formula to predict the average height of Toronto boys for any age that interests us. E.g. for those aged 13, height would be 85 + (4*13) = 137cm.
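For readers who like to check such calculations in code, here is a minimal Python sketch (not part of the original module) that simply applies the example equation above:

# Hypothetical helper applying the module's example equation:
# Height in cm = 85 + (4 * age in years)
def predicted_height_cm(age_years):
    return 85 + 4 * age_years

print(predicted_height_cm(13))   # 137, matching the worked example above
print(predicted_height_cm(10))   # 125, the average height of 10-year-olds in the scatterplot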

In fact, we can represent the relationship between the two variables in terms of an algebraic formula:

Y = a + bX

This equation is called a regression equation and describes a regression line that summarises the relationship between our two variables.

In this formula 'a' is the ‘intercept’ or the point at which our line would cross the Y axis when X is equal to zero. 'b' is the ‘slope’ of our line. If our line rises steeply, then the value of Y increases a lot for any increase in the value of X. If it rises gently, the value of Y increases more gradually. Or it may be that as the value of X increases, the value of Y decreases. In this case the slope of our line would be negative.

Imagine that boys in Toronto got smaller as they got older. Our line would slope downwards from left to right. The gradient of the slope (whether positive or negative) represents the form of the relationship between the two variables. Think of the example of a perfectly flat line: it would imply that the value of X had no effect at all on the value of Y: there would be no relationship between the two variables. If the slope (positive or negative) of the line is steep, this implies that small differences in X are associated with rather large differences in Y.

What is more, our formula can be used to describe any possible relationship that might exist between any two variables, because by substituting just two values, 'a' (representing the intercept) and 'b' (representing the slope), we can describe any possible straight line that we could draw on a graph. This gives us a very powerful 'summary statement' that can summarise the form of the relationship between two variables in just two numbers, provided the relationship is a linear one.

Glossary

Regression

A model of the relationship between variables from which the mean value of a dependent variable can be estimated from one or more independent variables.

Note for those unfamiliar with algebra

Algebra is a shorthand way of representing relationships between numbers such as constants (things which do not change) and variables (things that do).

We use capital and lower case letters to represent different numbers of interest. The formula Y = a + bX thus means 'any value of the variable Y for a given case equals a constant number (a) plus another constant number (b) times the value of variable X for that same case'.


If you are rusty or unfamiliar with basic algebra, try the algebra help website, which provides a friendly introduction.

Fitting a regression line - continued

 We can note a couple of other points from our example of the height of boys in Toronto.

 

 

First, although we can use the formula we have constructed to predict the values of our second variable (Y) from the values of the first, we can only do this safely for the range of values for which we already have observations. In this case we have data for ages from about 5 to 18 years.

We do not actually know what relationship observations outside this range might give us, because we do not know if the line would continue to be straight: that is to say whether the relationship would continue to be linear. In fact, we could guess that it almost certainly will NOT be straight. Boys in Toronto are not, on average, 85cm tall when they are born, to the great relief of Toronto mothers. Similarly, we might guess that they do not continue to grow at around 4cm a year for ever. Not many middle-aged men in Toronto are three metres tall. Were we to continue to plot height against age we would see that at some point after age 18 the line levels off as boys stop getting taller: the line is not straight for all values of X.

Second, although the points in our scatterplot are very near a straight line in the example here, we might expect that such a situation is rare. The reason our coordinates fit a line so well in this example is that each observation was already based on averages from many cases. Had we plotted the height of individual boys we would have obtained a range of values for each age. This kind of situation raises the question:

How should we draw a line to best ‘summarise’ the data in the scatterplot when the points do not fall exactly on one line?

 

Fitting an 'Ordinary Least Squares' (OLS) regression line and calculating a regression equation

How should we draw a line to best 'summarise' the data in the scatterplot when the points do not fall exactly on one line?

The first thing we need to do is look at the data to see if a straight line is a good approximation to the layout of the data points. If it looks as if the data follows some kind of curve, it would make no sense to try to summarise it with a straight line. If we decide that a straight line is appropriate, Marsh & Elliott (2008: 195-6) suggest a number of different possible rules.

1. Make half the points lie above the line and half below along the full length of the line.

2. Make each point as near to the line as possible (minimizing distances perpendicular to the line).

3. Make each point as near to the line in the Y direction as possible (minimizing vertical distances).

4. Make the squared distance between each point and the line in the Y direction as small as possible (minimizing squared vertical distances).

For reasons we'll explore later, we almost always use rule 4. Because of this we sometimes refer to ordinary least squares (OLS) regression, because we base our line on minimizing the sum of the squared vertical distances between our line and the observed data points.
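To make rule 4 concrete, here is a short Python sketch (illustrative only, with made-up data points) that scores two candidate lines by the sum of their squared vertical distances; under the OLS criterion the line with the smaller sum is the better fit:

# Score a candidate line Y = a + bX by the sum of squared vertical distances (rule 4)
def sum_squared_residuals(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]                 # hypothetical X values
ys = [2.1, 3.9, 6.2, 7.8]         # hypothetical Y values
print(sum_squared_residuals(0, 2.0, xs, ys))   # 0.10 - a close fit
print(sum_squared_residuals(1, 1.5, xs, ys))   # 1.30 - a worse fit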

Whichever rule we use, our calculation of Y = a + bX can be thought of as an example of the more general D = F + R formula we discussed in module 2.

Need a reminder?

It simplifies reality in order to avoid overwhelming us with unnecessary detail. However such simplification has a cost (few things in life are free!). Being a simplification of reality, our model is no longer a wholly accurate description of reality in all its detail. We can think of the difference or distance between our model and reality as all the information we have left out, or what is known in statistics as the residual. This gives us the following formula:

Data = Fit + Residual (D = F + R)

This equation describes the idea that our data can be thought of as our summary of it (the mean age of respondents in our example above), which we call the fit or model, and the distance, for each case, between the value given in our model and that in the actual data - which we call the residual.

We have our Data (the points in our scatterplot), our Fit (the line we are trying to draw to best summarise them) and our Residual (the distance between the actual points and where our summary line passes). A good fit both minimizes the residuals, and leaves them without a pattern. If we have a pattern in the residuals it suggests that we could probably have drawn a better fitting line, because our actual data diverges from it in some kind of systematic way.

In practice our formula is almost never perfect. The 'fit' or 'model' will almost never be exactly the same as the actual data. There will always be residuals or errors. Thus we add a term to express this in our formula. This is the Greek letter epsilon ε which stands for 'error':

Y = a + bX + ε

 

To illustrate the calculations involved, let us take a simple example creating a scatterplot and calculating a regression equation.

The following table shows the value for two variables for an imaginary group of six countries, one showing GDP (or wealth produced) per head and the other showing the number of cars per 1000 population.

 

    GDP $ per head    Cars per 1000 pop.
a   1600              139
b   5000              200
c   10000             248
d   24000             458
e   33000             516
f   19000             320

It looks as if there is some kind of relationship between these two variables. Countries with higher GDP also seem to have more cars.

First we can examine the distribution of the 'cars' variable by calculating the mean, variance and standard deviation as we did in the previous module.

Mean value = 313.5

Variance = sum of squared deviations / (number of cases -1)

= (30450.25 + 12882.25 + 4290.25 + 20880.25 + 41006.25 + 42.25 ) / 5


= 109551.5 / 5

= 21910.3

standard deviation = square root of the variance = 148.02 cars

Thus across these six countries, there is just under one car for every three people, while the variation across countries is substantial but not enormous, with the standard deviation equal to about half the value of the mean.
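The same calculation can be checked with a few lines of Python (a sketch using the figures from the table above, not part of the original module):

cars = [139, 200, 248, 458, 516, 320]                       # cars per 1000 population
n = len(cars)
mean = sum(cars) / n                                        # 313.5
variance = sum((y - mean) ** 2 for y in cars) / (n - 1)     # 21910.3
sd = variance ** 0.5                                        # about 148.0 cars
print(mean, variance, round(sd, 2))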

The following scatterplot shows the relationship between the two variables, and plots a regression line for them.

[Figure: scatterplot of cars per 1000 population (Y axis) by GDP per head (X axis), with the fitted regression line and residuals shown]

Note that the relationship is approximately linear for the observations we have, but this clearly cannot be the case for all possible values of the variables. The value of our intercept (where the value for our X variable, GDP per head, is zero) is about 126 cars per 1000 people (the value of our Y variable)! Note too that only one of our cases lies on the regression line itself. This is quite a normal situation, since the regression line is our best attempt at summarising the relationship for all of our observations taken together. The scatterplot also shows the vertical distances, or residuals, between each coordinate and the regression line. Let us continue to model the relationship for countries with GDPs per head within the range of our observations.

To calculate the slope of the regression line we could fit various lines by trial and error each time trying to minimise the sum of the squared vertical distances between the coordinates and the line. Even for a simple line such as ours this would be a tedious process. However there is a formula that we use to calculate the line's slope, or the value for 'b' in our regression equation. For each coordinate we can multiply its deviation from the mean value of the X variable by its deviation from the mean value of the Y variable. We sum these. This sum will be greater to the extent that values above the mean on one variable are also above the mean on the other variable, and vice versa. We then divide this by the sum of the squared deviations from the mean of the X variable, which takes account of how much variation there is in our X variable. The result will give us an average measure of how much the value of Y changes for a change in the value of X. The formula for this equation is given on p. 178 of Fielding & Gilbert, but there is no need to memorise it, as in practice we can always use SPSS to calculate this value. For our example here the result of this calculation gives us a value for 'b' in our equation of 12.1.

Once we have a value for b we can calculate the value for 'a', the intercept. Since Y = a + b(X) it follows that a = Y - b(X). We can put the mean values of X and Y into this equation (the OLS regression line will always pass through the mean value of each variable).

This gives us

a = 313.5 - (12.11 * 15.43), taking b to two decimal places

= 313.5 - 186.9

= 126.6
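The same result can be reproduced with a short Python sketch (illustrative only, using the data from the table above); it applies the deviations-from-the-mean formula for b described earlier and then derives a:

gdp  = [1.6, 5.0, 10.0, 24.0, 33.0, 19.0]      # GDP in thousands of dollars per head
cars = [139, 200, 248, 458, 516, 320]          # cars per 1000 population
x_mean = sum(gdp) / len(gdp)                   # about 15.43
y_mean = sum(cars) / len(cars)                 # 313.5
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(gdp, cars)) / sum((x - x_mean) ** 2 for x in gdp)
a = y_mean - b * x_mean
print(round(b, 2), round(a, 1))                # about 12.11 and 126.6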

 

Now we have the values for our regression equation we can use it to make a very concise summary statement about the relationship between our two variables. A one unit (thousand dollar) increase in GDP per head is associated with an increase of 12.1 cars per 1000 population. Where GDP is zero, there will be 127 cars for every thousand people. We only have observations of GDP for the range 1600 to 33000 dollars, and we can be fairly sure that the relationship is not linear for low values of GDP. Thus the intercept (127 cars) is more useful as part of the equation for estimating car ownership at higher levels of GDP than as a prediction of car ownership in a very poor country.

We can also calculate how much of the variation in car ownership can be accounted for by variation in GDP. We do this by comparing how much of the overall variation in car ownership is accounted for by the variation in GDP as expressed in our regression equation.


The following table shows the number of cars per 1000 people estimated for each case by our regression line (the 'fit' of our regression line), the vertical distance between this fit and the actual value (the observed number of cars per 1000 people), which is the residual, and the square of this residual.

 

 

    X (GDP, $1000 per head)    Y (Cars per 1000 pop.)    Estimated (fitted) cars per 1000 pop.    Residual (Y - fitted Y)    Squared residual
a   1.6                        139                        146                                      -7                         49
b   5.0                        200                        187                                      13                         169
c   10.0                       248                        248                                      0                          0
d   24.0                       458                        417                                      41                         1681
e   33.0                       516                        526                                      -10                        100
f   19.0                       320                        357                                      -37                        1369

 

The sum of the squared residuals is (49 + 169 + 0 + 1681 + 100 +1369) = 3,368. To calculate their variance we can divide this by the number of cases minus one = 3368/5 = 673.6. We can think of this variance as the amount that the values for car ownership vary that our regression line does not take account of, or does not explain. If we had a straight line that fitted every point perfectly, there would be no residuals, and the value for this unexplained variance would be zero, since our independent variable, GDP per head, would explain all of the variation in amount of car ownership. If, on the other hand, our coordinates remained pretty far from even our best fitting line, the value for this variance would be large.

We now have two variances that we can compare: the original variance of the car ownership variable, and the variance left unexplained by variation in GDP per head, which is represented by the variance of the residuals. The difference between these two variances is, logically, the amount of variance in the variable for car ownership explained by GDP per head. Thus:

Total variance = variance explained by the regression + unexplained variance

21910 = explained variance + 673.6

explained variance = (21910 - 673.6) = 21236.4

We can express this explained variance as a proportion of the original total variance. This is called the coefficient of determination, or r squared.

In our example it is 21236.4 / 21910 = 0.97


This means that the variation in GDP per head accounts for 97% of the variation in cars per 1000 population. Our value for r squared is very high because in our example here our regression line was a very good fit indeed: all our points were fairly close to the line.
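If you want to verify the whole chain of calculations, the following Python sketch (illustrative only, not part of the original module) fits the line, computes the residuals and derives r squared for the same six countries:

gdp  = [1.6, 5.0, 10.0, 24.0, 33.0, 19.0]
cars = [139, 200, 248, 458, 516, 320]
x_mean, y_mean = sum(gdp) / 6, sum(cars) / 6
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(gdp, cars)) / sum((x - x_mean) ** 2 for x in gdp)
a = y_mean - b * x_mean
residuals = [y - (a + b * x) for x, y in zip(gdp, cars)]
residual_variance = sum(e ** 2 for e in residuals) / 5     # about 665 (673.6 above used rounded fitted values)
total_variance = sum((y - y_mean) ** 2 for y in cars) / 5  # 21910.3
r_squared = (total_variance - residual_variance) / total_variance
print(round(r_squared, 2))                                 # 0.97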

 

View animation of how to run a regression in SPSS

 

The regression output in SPSS is not very user friendly, but it does not take too long to learn how to identify the key parts of the output that we are interested in. The figure below shows the SPSS output for the example we have just worked through. SPSS treats the calculation of a regression equation as the building of a model. Our model is a simple one in which the dependent variable is 'cars' and the independent variable is 'GDP'. The first box of SPSS output simply records this fact. The second box gives the value for r squared, while the third box 'ANOVA', or analysis of variance, reports the sum of squared deviations for the dependent variable (Total), the regression model (Regression) and the residuals (Residual). If we wanted the values for the variance rather than the sum of squares we could divide these numbers by five (n-1). You can check for yourself that dividing the value for the regression by that for the total will give you the value for r squared.

In the final part of the output labelled 'coefficients', under the columns of the table headed 'Unstandardised Coefficients', SPSS calculates the value of a (the intercept) and b (the slope) in our regression equation. 'a', or the intercept, is referred to by SPSS as the constant. The value of 'b' appears just underneath the value of the constant, against the name of the independent or X variable. We will discover the interpretation of the standardised coefficients in the columns to the right in a later module that considers multiple regression.

 


 

Regression is a tool that we can use in a number of different ways. First, as we have just seen, it allows us to summarise the nature of the relationship between two variables with just two numbers. Second, once we have summarised the relationship we can investigate how good our summary is. If the residuals in our model are small, we have evidence of a strong relationship: knowing the value of the X variable gives us a fairly good idea of the likely value of the Y variable. If the residuals are large, we have evidence of a weak relationship, or possibly no relationship at all. Third, under certain circumstances, we can use our model to make predictions about the value of the Y variable on the basis of the value of the X variable, when we do not know what the value of the Y variable actually is. Next we will look more closely at the second point, and how we can use correlation to summarise the strength of association between two variables.

References

Marsh, C. and Elliott, J. (2008) Exploring Data, Cambridge: Polity.


 Correlation and Association

As well as the form of the relationship between our variables, it would be good to have some summary measure of how close the fit between our two variables is: of how closely the points cluster around our line, or how strong the relationship between them is. As we saw in the previous section, this is a matter of establishing how much of the variation in one variable is associated with the variation in another variable.

Think again of our two limiting cases. The closest possible fit would occur when there were no residuals and all data coordinates lay exactly along the regression line. This would be a situation where knowing the value of one variable for any case allows us to predict perfectly the value of the other variable for that case.

The absence of any fit would be a situation where we could not identify any systematic relationship between X and Y at all: we could not fit a line because we could not see any pattern in the data points, because no consistent relationship between the values taken by X and the values taken by Y exists.

It would be good to have a summary measure that allowed us to describe the closeness of fit in our data, or the strength of the relationship between two variables. Such a measure is called the correlation coefficient, represented by the letter 'r'.

If we could calculate this measure in such a way that for any pair of variables it always took a value between 0 and 1, so that 1 represented a perfect relationship and 0 the absence of any relationship at all, we would have a useful way of comparing the strengths of relationships that might exist between different pairs of variables. We could also represent the direction of the relationship by allowing this coefficient to take a negative value when as the value of one variable increases, the value of the other variable decreases.

Such a coefficient exists for interval level variables. Let us see how to calculate it.  

Glossary Reminder:

Residual

This is the unexplained part of a statistical model. It is also known as the error.

The Pearson product moment correlation coefficient

When we examined how to calculate the standard deviation of an interval level variable in module 2, the first step we took was to calculate the variance: the sum of the squared residuals or of how much each case differed from the mean value of the variable. We needed to square the residuals as the total of the residuals themselves would always be equal to zero.

We can use a similar calculation for two variables to calculate their covariance. For each case the difference from the mean (or residual) on the first variable is multiplied by the difference from the mean on the second variable.


It is not too difficult to see why we might want to do this. If we have two variables where cases that tend to have values above the mean on one variable also tend to have values above the mean on the other variable, and cases with values below the mean on one variable also tend to be below the mean on the other, then when we multiply together the residuals for each case the results will always be positive, so that our final sum will be high and positive.

Alternatively if cases below the mean on one variable are above the mean on the other, and vice versa, our final sum will be high and negative. If values far from the mean on one variable are also far from the mean on the other variable, then the covariance will be large.

On the other hand if there is no clear relationship between the values of the two variables, we will have a mixture of small and large differences of different signs (positive and negative) that will give us a small covariance.

We can then standardise this calculation by dividing it by the product of the standard deviations of the two variables, to take account of the different units that they are measured in and the different total amount of their variation. If we do this we get a very useful result, because we will have produced a measure of the relationship between our two variables with exactly those properties that we set out as desirable on the previous page for a correlation coefficient.

The result of the calculation can never exceed one, nor fall below minus one, and will describe how closely related our two variables are. If the variables were perfectly related, such that their values moved 'in step', the result of our calculation would be exactly one. At the other end of the spectrum, if the variables are not related at all, so that there is no pattern in the relationship, the covariance, and hence the coefficient, will be zero. Even better, because we have allowed for different kinds of units of measurement and amounts of spread in different variables by using their standard deviations in the calculation, we have a value that we can use to compare the strength of the relationship between any pair of variables. The calculation we have described is that for the Pearson product moment correlation coefficient. However we will usually just refer to this term by its letter 'r'.
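As a check on the logic just described, here is a short Python sketch (illustrative only, reusing the GDP and cars data from the regression example) that computes the covariance, standardises it by the product of the standard deviations, and so obtains r:

gdp  = [1.6, 5.0, 10.0, 24.0, 33.0, 19.0]
cars = [139, 200, 248, 458, 516, 320]
n = len(gdp)
x_mean, y_mean = sum(gdp) / n, sum(cars) / n
covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(gdp, cars)) / (n - 1)
sd_x = (sum((x - x_mean) ** 2 for x in gdp) / (n - 1)) ** 0.5
sd_y = (sum((y - y_mean) ** 2 for y in cars) / (n - 1)) ** 0.5
r = covariance / (sd_x * sd_y)
print(round(r, 2))    # about 0.98, consistent with the r squared of 0.97 found earlier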

The formula for the Pearson product moment correlation coefficient is given on p. 173 of Fielding and Gilbert (2006). You may also find their discussion of the coefficient on the preceding pages a useful alternative explanation to the one given here. You can also refer to the description of the calculation given in the PowerPoint presentation for Lecture 2.

But how strong is strong and how weak is weak in a relationship between two variables?

The coefficient of determination

Often we are interested in how much of the variability of one variable is correlated with, and possibly (but by no means necessarily) explained by, the variability of another variable. This is called the coefficient of determination and, being obtained simply by squaring r, is often referred to as R squared or r². This is a statistic we will return to later when we look at relationships between more than one variable.

Reading

Fielding, J. and Gilbert, N. (2006) Understanding Social Statistics, London: Sage. (p. 173)

Strength of association - How strong is strong?

The answer to the obvious question 'how strong is strong?' is rather complex and we are not able to consider all its complexities in this introductory course. The first, and most important, answer to this question is that the relationship between a pair of variables with a higher correlation coefficient is stronger than that between a pair of variables with a lower coefficient, and it is usually such comparisons that we are interested in making: for example, the relationship between age and income is stronger in this group of cases than in another group of cases; or, in this group of cases the relationship between age and length of residence is stronger than that between age and percentage of monthly income devoted to housing costs.

A second, more general type of answer can be given using Rowntree's classification guidance, which gives us a very general, rule of thumb description of coefficient strengths.

Numerical coefficient size (positive or negative)    Verbal description

0.0 - 0.2    Very weak, negligible
0.2 - 0.4    Weak
0.4 - 0.7    Moderate
0.7 - 0.9    Strong
0.9 - 1.0    Very strong
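The classification can be expressed as a small helper function; this Python sketch (not part of Rowntree's text or the original module) simply applies the cut-points in the table above to the absolute size of a coefficient:

def describe_strength(r):
    size = abs(r)                  # direction (sign) does not affect strength
    if size < 0.2:
        return "very weak, negligible"
    elif size < 0.4:
        return "weak"
    elif size < 0.7:
        return "moderate"
    elif size < 0.9:
        return "strong"
    return "very strong"

print(describe_strength(0.5))      # 'moderate'
print(describe_strength(-0.85))    # 'strong'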

 

Perfect relationships are as rare in social science as in affairs of the heart. Often we may be interested in relationships that are weak or moderate, rather than confining our interest to those with very high correlations.

Using SPSS to calculate correlation coefficients for interval level variables is straightforward, as this animation shows.

 

Transforming variables


There is one important qualification to the usefulness of the correlation and regression measures that we have just looked at. They will only work properly if the relationship between our variables is linear, that is, if it is expressed well graphically by a straight line. That is to say that the nature of the relationship does not change as the values of the variables change. If, for example, a curved rather than a straight line best summarised their relationship in a scatterplot, then the change in Y for a unit change in X would not be the same for all values of X. For example it might be the case that for low values of X, changes in the value of X are associated with small changes in Y, but that for higher values of X the associated change in the value of Y gets larger and larger.

Actually this situation is quite common in the real world. The scatterplot below shows the relationship between GDP per capita in US dollars and life expectancy at birth in years, taking the countries of the world around the year 2000 as our cases. We can see that for low values of life expectancy, quite large changes in life expectancy are associated with very small changes in GDP. However for higher values, small changes in life expectancy are associated with very large changes in GDP. If we just calculate a Pearson correlation coefficient for these two variables we get only a moderate result: about 0.5. [Rocket scientists among you, and hopefully others too, may well have spotted and questioned the placing of life expectancy on the X axis: shouldn't it be the dependent variable? The answer is that it can be treated as either (some demographers claim that the reproductive efficiency represented by long lifespans feeds through into economic gains).]

GDP per capita in US dollars by life expectancy at birth in years

 Source: WDI 2003

We could deal with this situation in two ways. One would be to produce a non-linear equation for the relationship between the two variables, analogous to our linear Y = a + bX + ε formula, and search for non-linear equivalents for all our other measures. This can be a fruitful, but very complex, route and would take us well beyond the techniques examined in this course. However a simpler, and usually just as effective, solution is to transform our variables so that the relationship between the transformed variables becomes linear. Transformation is a procedure similar to standardisation, except that the mathematical procedure applied changes different values of the variable by different amounts. If we multiply the value of a variable by itself (that is, square it or raise it to the power 2) we will obviously change larger values more than smaller ones. Indeed values less than one will get smaller, while those greater than one will get larger. To obtain the opposite effect we can take the square root of a variable (the reverse procedure of squaring it: the square root of a number is the number that, multiplied by itself, would produce the original number). Another very useful transformation is to take the logarithm of a number. If you are not sure what a logarithm is, consult the explanation here.

 

Let us see what happens when we log GDP, and repeat our scatterplot.

[Figure: scatterplot of logged GDP per capita by life expectancy at birth in years]

Our line has become very much straighter. We can obtain a similar effect by taking the square root of GDP. The Pearson product moment correlation coefficient for our relationship is now 0.76, indicating a strong relationship. This is the good news. The bad news is that, having transformed our variable, our relationship has become more complex, as we have correlated life expectancy in years not with GDP itself but with the log of GDP. To calculate estimates of what value GDP might take for any value of life expectancy we therefore have to untransform our variable again so as to express it in its original units. If I say that the natural log of GDP equals 1.222 + 0.098 * (life expectancy in years), this is much less easily understandable than our original equation. I can express it in dollars rather than logged dollars by taking the exponential of each side of the equation. This is still a complex calculation, but in the days of computers and calculators, hardly impossible.
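A short Python sketch (illustrative only) shows what 'untransforming' involves, using the logged equation quoted above:

import math

# ln(GDP per capita) = 1.222 + 0.098 * (life expectancy in years), as stated above
def predicted_gdp_dollars(life_expectancy_years):
    log_gdp = 1.222 + 0.098 * life_expectancy_years
    return math.exp(log_gdp)      # exponentiate to convert logged dollars back into dollars

print(round(predicted_gdp_dollars(70)))   # roughly 3,200 dollars per head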


We transform variables in SPSS using the compute command in the transform menu.

View animation of how to transform variables using the 'Compute' command

Remember to label any new variables that you create and to save the new version of your dataset

Rocket Scientist reference:

Spearman's rank correlation coefficient 

This is used when the data are at the ordinal level of measurement, or when we are not sure of the reliability of the intervals in stronger data, or where the data is highly skewed (which would adversely affect Pearson's coefficient) and we are not able to make a linear transformation of the data. Note for rocket scientists: while this seems to be a better alternative to Pearson's correlation coefficient for these reasons, being a nonparametric statistic it has less power.
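For completeness, a Spearman coefficient can be obtained in Python with SciPy (a sketch reusing the cars and GDP figures from earlier; spearmanr works on the ranks of the values):

from scipy import stats

gdp  = [1600, 5000, 10000, 24000, 33000, 19000]
cars = [139, 200, 248, 458, 516, 320]
rho, p_value = stats.spearmanr(gdp, cars)
print(round(rho, 2))   # 1.0 here, because the two variables rank the six countries identically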

A Golden Rule: correlation is not causation

 

Just because we find correlation between two variables, we do not necessarily have any evidence about causation, let alone the direction of cause and effect (i.e. whether variable A causes variable B or the other way around). If we do have a cause-effect relationship between two variables then we must also have correlation. But the reverse is not the case. Moreover, any theory we might have about causation is just that: a theory. Our data can never give us information directly about this. Consider our example of the relationship between age and height of boys in Toronto. The data is equally compatible with the theory that changes in height cause boys to get older as it is with the theory that ageing causes growth. Our preference for the latter theory cannot come from anything the data in our scatterplot tells us: rather it comes from other sources, such as our ideas about the nature of time, the physiological development of the human body, etc. Marsh and Elliott (2008: 244) give a good example of the kind of logic we need to be careful to avoid.

Suppose we collected data on the number of fire engines sent to a fire and the amount of economic losses sustained as a result of the fire. We would almost certainly find the two variables to be closely correlated. But were we to assume that such a correlation also implied causation, we would reach the conclusion that a good way to reduce the cost of fire damage would be to stop sending fire engines to deal with fires! This nonsensical conclusion follows from our failure to take account of another variable that has more to do with the probable causal path: the size of the fire in the first place!

Reference

Marsh, C. and Elliott, J. (2008) Exploring Data, Cambridge: Polity.

Module 3 conclusion

In this module we have looked at two essential ways of examining the relationship between two variables: regression and correlation. Although they are related it is important to keep them distinct and get a clear idea of the difference between the two. Regression is a way of summarising the form of the relationship, which works by fitting a line to the points in a scatterplot, minimising the sum of the squared vertical distances between the actual values of Y and the line. It can help us to summarise the form of a relationship between two variables: does the value of one variable tend to increase or decrease as the value of the other increases, for example. Our regression equation, which our fitted line represents graphically, is a simple model of the relationship.

Correlation, on the other hand, tells us about the strength of the relationship: how closely our line fits the data we have and therefore how much of the variation of each variable is accounted for by the variation in the other. If our datapoints cluster close to our line then we have a strong relationship: knowing the value of one variable tells us a lot about the probable value of the other variable. Alternatively, if our variables are only weakly related, knowing the value of one variable may be a poor guide to the value of the other.

Finally, both these methods work only if we can sensibly use a straight line to approximate our data. However in situations where this is not the case, we can often use transformations to save the day. Transforming data so that the relationship approximates a straight line is much easier than trying to model different shapes of line directly.

After working through this module you should have learned:

What a scatterplot comprises: the X axis, Y axis and coordinates

What is meant by a linear relationship

What a regression line is, and the equation that describes it: Y = a + bX + ε

What it means to fit a regression line that minimises the sum of the squared residuals (OLS regression)

What is meant by the intercept (a) and the slope (b) of a regression line

How to calculate a regression equation, or use SPSS to do this for you

How to calculate the explained and unexplained variance in a regression model to examine its 'fit'

How a regression equation can only be used to describe a relationship for the range of values for which we have observations

What the coefficient of determination, r squared, means

What is meant by the strength of a relationship between two variables

How the strength of a relationship between two variables can be summarised by a correlation coefficient

How to calculate the covariance and the Pearson product moment correlation coefficient, r, for a pair of variables, and how to use SPSS to do this

What is meant by transforming variables, and why we should want to do this

How taking the square or the logarithm of a variable can change its relationship to another variable so that it more closely resembles a linear relationship

That without correlation there is no causation, but that the reverse is not the case. The existence of correlation or association is never in itself proof of causation.

End of module activities:

Self-test

Tutorial activities

Further Reading: Fielding and Gilbert Ch 8. Marsh and Elliott Ch. 9 and Ch. 10.

Reading from the Course Library: Field 2005.