Lies, damned lies & statistics Communication Research week 10.

Lies, damned lies & statistics

Communication Research

week 10

Communication Research Spring 2005 2

Basics of descriptive statistics

Statisticians use mathematical methods to analyse, summarise and interpret data that have been collected

Descriptive statistics describe the basic features of the data in a study and allow the researcher to get a feel for the data

The choice of statistical method of analysis depends on the data that have to be analysed

Descriptive vs inferential statistics

Descriptive statistics refer to methods used to obtain, from raw data, information that characterises or summarises the whole set of data

Inferential statistics allow us to generalise from the data collected to the general population they were taken from

Descriptive Statistics

Qualitative data

Summary measures: frequency, relative frequency, percentage

Displays: tables, pie charts, bar graphs

Quantitative data

Summary measures: measures of central tendency, measures of spread, five-number summary

Displays: tables, histograms, box plots, bar charts, line charts

Different statistical measures

Raw data is unorganised but can be tabulated to make it easier to understand and to interpret

It is usually presented as a frequency table or graph

A frequency chart will allow a researcher to see trends or groupings of data and how they are distributed

Some Basic Concepts Related to Statistics

Data: The raw material of statistics. Numbers that result from measurements or counting.

Statistics: The field of study concerned with the collection, organization, summarization and analysis of data and the drawing of inferences about a body of data when only a part of the data is observed.

Sources of Data

Routinely kept records

Surveys

Experiments

External Sources

Random Variable: A variable whose values arise as a result of chance factors and cannot be exactly predicted in advance.

Population: A population of entities is defined as the largest collection of entities for which we have an interest at a particular time.

Sample: A part of a population.

The Simple Random Sample

Statistical inference: the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population.

Simple random sample: if a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sample is called a simple random sample. For example, there are C(4, 2) = 6 possible samples of size 2 from a population of size 4.
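The count of possible samples can be checked by enumeration. The sketch below assumes a hypothetical population of four members labelled A to D (the labels are illustrative, not from the slides), lists every possible sample of size 2, and then draws one simple random sample.

```python
import itertools
import math
import random

# Hypothetical population of size N = 4 (labels are illustrative only).
population = ["A", "B", "C", "D"]
n = 2  # sample size

# Every possible sample of size n, ignoring order.
all_samples = list(itertools.combinations(population, n))
print(all_samples)
print(len(all_samples), math.comb(len(population), n))

# random.sample draws without replacement, so each possible sample
# of size n has the same chance of being selected.
sample = random.sample(population, n)
print(sample)
```

Because the draw is random, the printed sample will differ from run to run; only the count of possible samples is fixed.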

Characteristics of each distribution

Location – where on the axis is the distribution positioned?

Dispersion – how broad is the distribution?

Shape – what is the form (appearance, pattern) of the distribution?

The type of data you have to analyse will determine the statistical measure chosen

Statistics describing the location of the distribution are called measures of central tendency

Measures of central tendency – the mean

The mean is the sum of all observed data values divided by the sample size (the arithmetic average)

Describing data that are interval or ratio in nature (eg speed of response, age in years) calls for the mean

One of its main disadvantages is that it is strongly affected by extreme scores

Calculating a Mean Score

Scores: 79 81 82 86 86 88 91 93 95 97

Total = 878

Divide by n = 10 scores

Mean = 87.8
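The same calculation can be reproduced in a few lines. This is a minimal sketch using the ten scores from the slide; the variable names are illustrative, and the final list with an extreme score of 197 is a made-up value added only to show how strongly one outlier moves the mean.

```python
from statistics import mean

scores = [79, 81, 82, 86, 86, 88, 91, 93, 95, 97]

total = sum(scores)       # sum of all observed values
n = len(scores)           # sample size
mean_score = total / n    # arithmetic average

print(total, n, mean_score)
print(mean(scores))       # the standard library gives the same result

# Hypothetical: replace the top score (97) with an extreme score (197).
with_extreme_score = scores[:-1] + [197]
print(mean(with_extreme_score))
```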

Measures of central tendency – the median

The median is the score or point in the distribution above and below which one half of the scores lie, eg in a simple set of scores such as 1, 3 and 5 the median is 3

The median is best suited to data that are ordinal or ranked (eg birth order, rank in class)

To compute the median:

Order the scores from lowest to highest

Count the number of scores

Select the middle score

When the number of scores is even, find the mean of the two middle scores

eg 31 33 35 38 40 41 42 43 44 46 47 48 49 50: N = 14 (number of scores); Median = (42 + 43) ÷ 2 = 42.5
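The steps for computing the median translate directly into code. The function name below is illustrative; the two score lists are the examples from the slide.

```python
from statistics import median

def median_by_steps(scores):
    """Order the scores, count them, then take the middle score
    (or the mean of the two middle scores when the count is even)."""
    ordered = sorted(scores)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_scores = [1, 3, 5]
even_scores = [31, 33, 35, 38, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50]

print(median_by_steps(odd_scores))
print(median_by_steps(even_scores))
print(median(even_scores))  # standard-library check
```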

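Two distributions of scores

Distribution 1: 24 24 25 25 26 26

Mean = 25, Range = 3

Distribution 2: 16 19 22 25 28 30 35

Mean = 25, Range = 20

Note: these ranges count both end points (highest − lowest + 1); taking the simple difference between the highest and lowest scores gives 2 and 19.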
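The two distributions can be compared numerically. The sketch below computes the mean and two versions of the range: the simple difference (highest minus lowest) and the inclusive range (highest minus lowest plus one). The helper names are illustrative. The inclusive version reproduces the ranges of 3 and 20 given for these distributions.

```python
from statistics import mean

distribution_1 = [24, 24, 25, 25, 26, 26]
distribution_2 = [16, 19, 22, 25, 28, 30, 35]

def simple_range(scores):
    """Difference between the highest and lowest scores."""
    return max(scores) - min(scores)

def inclusive_range(scores):
    """Range counting both end points (highest - lowest + 1)."""
    return max(scores) - min(scores) + 1

for name, scores in [("Distribution 1", distribution_1),
                     ("Distribution 2", distribution_2)]:
    print(name, mean(scores), simple_range(scores), inclusive_range(scores))
```

Both distributions share the same mean, so only the spread distinguishes them.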

Measures of central tendency – the mode

The mode is the most frequently observed value in the frequency distribution ie it is the score that occurs most frequently

The mode is best used for nominal data and for data that are qualitative in nature such as gender, eye colour, ethnicity, school or group membership

In the following list of numbers: 58 27 24 41 27 26 41 53 24 29 41 53 47 28 56

The mode is 41 because it occurs 3 times

A common mistake is to identify the mode as how frequently the value occurs (3) not the value itself (41)
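A short tally makes the distinction between the mode and its frequency concrete. This sketch uses the list of numbers from the slide; the variable names are illustrative.

```python
from collections import Counter

numbers = [58, 27, 24, 41, 27, 26, 41, 53, 24, 29, 41, 53, 47, 28, 56]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(numbers)
mode_value, mode_frequency = counts.most_common(1)[0]

print(mode_value)       # the mode is the value itself
print(mode_frequency)   # how often it occurs, which is not the mode
```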

Which measure when?

Which measure of central tendency?

Measure | Level of measurement | Examples
Mode | Nominal or categorical, ie qualitative | Gender, hair or eye colour, group membership, ethnicity, school etc
Median | Ordinal or ranked | Rank in class, birth order
Mean | Interval and ratio | Speed of response, age in years

Three Measures of Variability

Range: the difference between the highest and lowest scores in a distribution of scores.

Variance: a measure of dispersion indicating the degree to which scores cluster around the mean score.

Standard deviation: index of the amount of variation in a distribution of scores.

Standard deviation

SD is a measure of variability indicating the degree to which all observed values deviate from the mean

SD can only be used for interval and ratio data

It is the most frequently used statistic as a measure of dispersion or variability

The larger the SD, the more variable the set of scores is

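COMPUTING DEVIATION SCORES

Raw score | Mean score | Deviation score | Squared deviation
4 | 10 | -6 | 36
8 | 10 | -2 | 4
9 | 10 | -1 | 1
10 | 10 | 0 | 0
10 | 10 | 0 | 0
10 | 10 | 0 | 0
12 | 10 | 2 | 4
13 | 10 | 3 | 9
14 | 10 | 4 | 16

Sum of raw scores = 90; 90/9 = 10.00 = MEAN

Sum of squared deviations = 70; 70/9 = 7.78 = VARIANCE

STANDARD DEVIATION (square root of variance) = 2.79

Note: here the sum of squared deviations is divided by N = 9; dividing by N − 1 = 8 instead gives a variance of 8.75 and a standard deviation of 2.96.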
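The deviation-score calculation can be followed step by step in code. This is a minimal sketch using the nine raw scores from the table; the variable names are illustrative. It computes the variance both ways, dividing by N as the table does and dividing by N − 1 as the sample variance does, so the two conventions can be compared.

```python
import math
from statistics import pvariance, variance

raw_scores = [4, 8, 9, 10, 10, 10, 12, 13, 14]

n = len(raw_scores)
mean_score = sum(raw_scores) / n                    # 90 / 9

# Deviation score = raw score minus the mean; then square each one.
deviations = [score - mean_score for score in raw_scores]
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

variance_n = sum_of_squares / n                     # divide by N (as in the table)
sd_n = math.sqrt(variance_n)

variance_n_minus_1 = sum_of_squares / (n - 1)       # divide by N - 1 (sample variance)
sd_n_minus_1 = math.sqrt(variance_n_minus_1)

print(mean_score, sum_of_squares)
print(round(variance_n, 2), round(sd_n, 2))
print(round(variance_n_minus_1, 2), round(sd_n_minus_1, 2))

# Standard-library equivalents of the two conventions.
print(pvariance(raw_scores), variance(raw_scores))
```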

Types of Variables

Variable

Element that is identified in the hypothesis or research question

Property or characteristic of people or things that varies in quality or magnitude

Must have two or more levels

Must be identified as independent or dependent

Independent Variables

Manipulation or variation of this variable is the cause of change in other variables

Technically, independent variable is the term reserved for experimental studies

Also called antecedent variable, experimental variable, treatment variable, causal variable, predictor variable

Dependent Variables

The variable of primary interest

Research question/hypothesis describes, explains, or predicts changes in it

The variable that is influenced or changed by the independent variable

In non-experimental research, also called criterion variable, outcome variable

Relationship Between Independent and Dependent Variables

Cannot specify independent variables without specifying dependent variables

Number of independent and dependent variables depends on the nature and complexity of the study

The number and type of variables dictate which statistical test will be used

Issues of Reliability and Validity

Reliability = consistency in procedures and in reactions of participants

Validity = truth: does the measure capture what it was intended to measure?

When reliability and validity are achieved, data are free from systematic errors

Threats to Reliability and Validity

If measuring device cannot make fine distinctions

If measuring device cannot capture people/things that differ

When attempting to measure something irrelevant or unknown to respondent

Can measuring device really capture the phenomenon?

Other Sources of Variation

Variation must represent true differences

Other sources of variation:

Factors not measured

Personal factors

Differences in situational factors

Differences in research administration

Number of items measured

Unclear measuring device

Mechanical or procedural issues

Statistical processing of data

Types of variables

Data are made up of variables, which are either:

Quantitative (numeric): discrete or continuous

Qualitative (categorical): nominal or ordinal

Definitions

Variable: a characteristic that changes or varies over time and/or across the different subjects under consideration.

Changing over time: blood pressure, height, weight

Changing across a population: gender, race/ethnicity

Definitions (cont'd)

Quantitative variables (numeric): measure a numerical quantity or amount on each experimental unit

Qualitative variables (categorical): measure a non-numeric quality or characteristic on each experimental unit by classifying each subject into a category

Categorical variables

Nominal: unordered categories

Race/ethnicity

Gender

Ordinal: ordered categories

Likert scales (disagree, neutral, agree)

Income categories

Univariate statistics (numerical variables)

Summary measures

Measures of location

Measures of spread

Overall pattern (distribution)

Unimodal (one major peak) vs. bimodal (two peaks)

Symmetric vs. skewed

Outliers: individual values that fall outside the overall pattern

Skewness

The skewness of a distribution is measured by comparing the relative positions of the mean, median and mode.

Distribution is symmetrical: Mean = Median = Mode

Distribution skewed right: Median lies between mode and mean, and mode is less than mean

Distribution skewed left: Median lies between mode and mean, and mode is greater than mean
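The comparison of mean, median and mode can be sketched as a small rule-of-thumb check. The function name and the three score lists below are hypothetical, chosen so that each case is clear-cut; real data often do not order the three measures this neatly, so the function returns "unclear" when none of the patterns applies.

```python
from statistics import mean, median, mode

def describe_skew(scores):
    """Classify skew from the relative positions of mean, median and mode."""
    m, md, mo = mean(scores), median(scores), mode(scores)
    if m == md == mo:
        return "symmetrical"
    if mo <= md <= m and mo < m:
        return "skewed right"
    if mo >= md >= m and mo > m:
        return "skewed left"
    return "unclear"

# Hypothetical score lists, one per case.
symmetric_scores = [1, 2, 3, 3, 3, 4, 5]
right_skewed_scores = [1, 2, 2, 3, 4, 12]
left_skewed_scores = [1, 9, 10, 11, 11, 12]

for scores in (symmetric_scores, right_skewed_scores, left_skewed_scores):
    print(mean(scores), median(scores), mode(scores), describe_skew(scores))
```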

Relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions

Note: The mean is most representative when the data are roughly symmetric, as in a normal distribution. If this is not the case it is better to report the median as the measure of location.

Summary statistics: measures of spread (scale)

Variance: The average of the squared deviations of each sample value from the sample mean, except that instead of dividing the sum of the squared deviations by the sample size n, the sum is divided by n − 1.

Standard deviation: The square root of the sample variance

Range: the difference between the maximum and minimum values in the sample.

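$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $x_i$ are the sample values, $\bar{x}$ is the sample mean and $n$ is the sample size.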

Normal curves: same mean but different standard deviations

Graphical display of numerical variables (histogram)

Class interval | Frequency
20-under 30 | 6
30-under 40 | 18
40-under 50 | 11
50-under 60 | 11
60-under 70 | 3
70-under 80 | 1

[Histogram of the frequency table: Years (0 to 80) on the horizontal axis, Frequency (0 to 20) on the vertical axis]
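A frequency table of this kind is built by assigning each value to a class interval and counting. The sketch below uses a short hypothetical list of ages (not the data behind the table above) and class intervals of width 10; a row of asterisks stands in for each histogram bar.

```python
from collections import Counter

# Hypothetical ages in years (illustrative only).
ages = [22, 25, 31, 34, 38, 39, 41, 47, 52, 58, 63, 71, 74, 79]
width = 10

# Map each value to the lower bound of its class interval, e.g. 34 -> 30.
counts = Counter((age // width) * width for age in ages)

for lower in sorted(counts):
    upper = lower + width
    bar = "*" * counts[lower]
    print(f"{lower}-under {upper}: {counts[lower]:2d} {bar}")
```

Each interval includes its lower bound and excludes its upper bound, which is what the "20-under 30" labelling expresses.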

Graphical display of numerical variables (stem and leaf plot)

Raw data

86 76 23 77 81 79 68
77 92 59 68 75 83 49
91 47 72 82 74 70 56
60 88 75 97 39 78 94
55 67 83 89 67 91 81

Stem | Leaf
2 | 3
3 | 9
4 | 7 9
5 | 5 6 9
6 | 0 7 7 8 8
7 | 0 2 4 5 5 6 7 7 8 9
8 | 1 1 2 3 3 6 8 9
9 | 1 1 2 4 7
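The stem and leaf plot can be rebuilt from the raw data by splitting each value into its tens digit (the stem) and its units digit (the leaf). This is a minimal sketch using the 35 raw values from the slide; the variable names are illustrative.

```python
from collections import defaultdict

raw_data = [
    86, 76, 23, 77, 81, 79, 68,
    77, 92, 59, 68, 75, 83, 49,
    91, 47, 72, 82, 74, 70, 56,
    60, 88, 75, 97, 39, 78, 94,
    55, 67, 83, 89, 67, 91, 81,
]

# Sorting first means the leaves on each stem come out in ascending order.
stems = defaultdict(list)
for value in sorted(raw_data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Unlike a histogram, the plot keeps every original value readable: stem 7 with leaf 6 is the score 76.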

Graphical display of numerical variables (box plot)

[Figure: box plots of three distributions: negatively skewed (S < 0), symmetric or not skewed (S = 0), and positively skewed (S > 0)]

Univariate statistics (categorical variables)

Summary measures

Count = frequency

Percent = (frequency / total sample) × 100

The distribution of a categorical variable lists the categories and gives either a count or a percent of individuals who fall in each category

Displaying categorical variables

Rank | Cause of death | Frequency (%)
1 | Heart disease | 710,760 (43%)
2 | Cancer | 553,091 (33%)
3 | Stroke | 167,661 (10%)
4 | CLRD | 122,009 (7%)
5 | Accidents | 97,900 (6%)
Total | All five causes | 1,651,421

[Bar chart of the five causes: heart, cancer, stroke, CLRD, accident; vertical axis from 0 to 60]
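Counts and percentages for a categorical variable can be computed directly from the frequencies. The sketch below uses the five frequencies from the cause-of-death table; the dictionary layout and variable names are illustrative. Because each percentage is rounded separately, the rounded figures need not sum to exactly 100.

```python
deaths = {
    "Heart disease": 710_760,
    "Cancer": 553_091,
    "Stroke": 167_661,
    "CLRD": 122_009,
    "Accidents": 97_900,
}

total = sum(deaths.values())

# Percent = frequency / total sample * 100
percents = {cause: 100 * count / total for cause, count in deaths.items()}

for cause, count in deaths.items():
    print(f"{cause:<14}{count:>10,}{percents[cause]:>7.1f}%")
print(f"{'Total':<14}{total:>10,}")
```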

Common Applications

T-tests – the independent t-test is used to test for a difference between two independent groups (like males and females) on the means of a continuous variable.

one sample – compare a group to a known value (for example, comparing the IQ of convicted felons to the known average of 100)

paired samples – compare one group at two points in time (for example, comparing pretest and posttest scores)

independent samples – compare two groups to each other
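To make the independent-samples case concrete, the sketch below computes Student's t statistic by hand using only the standard library. It assumes equal variances in the two groups (the pooled-variance form), and the function name and both score lists are hypothetical. It returns the t statistic and degrees of freedom only; judging significance still requires comparing t with a critical value or computing a p-value.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Student's t for two independent groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    degrees_of_freedom = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, degrees_of_freedom

# Hypothetical scores for two independent groups.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]

t, df = independent_t(group_a, group_b)
print(round(t, 3), df)
```

With 8 degrees of freedom the two-tailed critical value at the 0.05 level is about 2.306, so a t of this size would not count as significant despite the two-point difference in means.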

Common Applications

The Pearson's correlation is used to measure the strength and direction of the linear relationship between two continuous variables. The value for a Pearson's can fall between -1.00 (perfect negative correlation) and 1.00 (perfect positive correlation), with 0.00 indicating no linear correlation.

Other factors such as group size will determine if the correlation is significant. Generally, correlations above 0.80 in absolute value are considered high
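The coefficient itself is straightforward to compute. The sketch below implements the usual Pearson formula with the standard library only; the function name and all four lists are hypothetical. Note that the coefficient is negative when one variable falls as the other rises.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, and the two sums of squares.
    cross_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross_products / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # rises in step with x
falling = [10, 8, 6, 4, 2]   # falls as x rises
mixed = [2, 1, 4, 3, 5]      # rises with x, but not perfectly

print(pearson_r(x, rising))
print(pearson_r(x, falling))
print(round(pearson_r(x, mixed), 2))
```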

Common Applications

[Figure: non-significant t-test. Number of people (0 to 60) by days absent (0 to 18), shown for males and females]

Common Applications

[Figure: significant t-test. Number of people (0 to 60) by days absent (0 to 18), shown for males and females]