Basic Quantitative Methods in the Social Sciences (AKA Intro Stats) 02-250-01 Lecture 2.


Sign Up for Participant Pool!!

• See Psychology research first hand!

• Earn up to 2 bonus points

• HOW????

• Sign up on the web (takes less than 5 minutes):

• www.uwindsor.ca/psychology/signup

• or access through the psych homepage

• You MUST sign up by May 19 to be included

Major Points Today

• Types of Measurement

• Summation Notation

• Organizing Data

• Stem and Leaf Displays

• Graphs

• Measures of Central Tendency

Types of Measurement

• There are 4 types of measurement most often used in statistics:
  - Nominal
  - Ordinal
  - Interval
  - Ratio

Nominal Measurement

• Nominal Measurement: the classification of measurements into a set of categories

• The numbers produced by nominal measurement are frequencies of occurrence in the categories (e.g., 22 ducks, 12 chickens, 2 geese, etc.)

Nominal Measurement cont.

• A second example is gender – 2 categories, male and female

• Nominal measurement applies to qualitative variables - elements are assigned to a category because they possess one characteristic or another

• Nominal data is also termed qualitative data

Ordinal Measurement

• Ordinal Measurement: the rank ordering of elements on a continuum

• Ordinal measurement does not measure the amount of the variable - it represents the individual’s placement in a continuum (or ranking; e.g., the winner of a race is in “first place”)

Ordinal Measurement cont.

• It is important to note that the difference in the amount of the variable between adjacent rank positions is not constant - the difference in amount of talent between the 1st and 2nd place finishers in a race cannot be assumed to be the same as the difference in amount of talent between the 5th and 6th place finishers

• Ordinal data can tell you that the person in 1st place finished before the person in 3rd place, but not by how much

Interval Measurement

• Interval Measurement: the assignment of numerical quantity to the variable in such a way that:
  - the number assigned reflects the amount of the variable
  - the size of the measurement unit remains constant
  - the zero point is defined arbitrarily and does not represent an absence of the property being measured

Interval Measurement cont.

• The best example is temperature:
  - 40°C represents how hot something is (the amount of heat it has)
  - The unit of measurement (1°C) represents the same amount of heat regardless of where it occurs in the range of measurement (the amount of change in temperature is the same between 25°C - 26°C and 32°C - 33°C)
  - The zero point (0°C) is arbitrary - it represents the point at which water freezes, not the absence of temperature

Interval Measurement cont.

• Interval measurement can contain negative numbers, whereas Nominal and Ordinal Measurement do not

Ratio Measurement

• Ratio Measurement: the assignment of numerical quantity to the variable in such a way that:
  - the number assigned reflects the amount of the variable
  - the size of the measurement unit remains constant
  - the zero point represents an absence of the property being measured

Ratio Measurement

• Good examples are time and length

• A ratio scale cannot produce negative numbers

• Interval and ratio measurement are equivalent for statistical purposes and are often referred to as one thing (interval/ratio data)

Summation Notation

• We commonly use the letters “X” and “Y” to represent the variables we have measured

• The upper case Greek letter sigma (Σ) is known as the summation operator; it means “the sum of”

Example

• Suppose we keep a record for 6 days of every time someone slips in the CAW Student Centre Cafeteria (represented by X), the data may look like this:

Data Example

Day      X
Mon      10
Tues     5
Weds     12
Thurs    11
Fri      21
Sat      28

ΣX

• ΣX means the sum of all the X scores, so that:

ΣX = X1 + X2 + X3 + ... + XN
   = 10 + 5 + 12 + 11 + 21 + 28
   = 87

• Note: X1 means the first X score; XN means the last X score

(ΣX)²

• (ΣX)² means the square of the sum (total all numbers within parentheses and then square), so that:

(ΣX)² = (X1 + X2 + X3 + ... + XN)²
      = (10 + 5 + 12 + 11 + 21 + 28)²
      = (87)(87) = 7569

ΣX²

• ΣX² means the sum of the squares (square each number and then sum), so that:

ΣX² = X1² + X2² + X3² + ... + XN²
    = 10² + 5² + 12² + 11² + 21² + 28²
    = 100 + 25 + 144 + 121 + 441 + 784
    = 1615

More Summation Notation

• Suppose you also keep track of the number of pieces of garbage dropped on the floor of the CAW Student Centre for the same days as above (variable Y) and the data were as follows:

Example Data

Day      X     Y
Mon      10    210
Tues     5     160
Weds     12    245
Thurs    11    240
Fri      21    340
Sat      28    415

ΣXY

• ΣXY means the sum of the products:

ΣXY = (X1)(Y1) + (X2)(Y2) + (X3)(Y3) + ... + (XN)(YN)
    = (10)(210) + (5)(160) + (12)(245) + (11)(240) + (21)(340) + (28)(415)
    = 2100 + 800 + 2940 + 2640 + 7140 + 11620
    = 27240
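
The four summation quantities above can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the variable names (x, y, sum_x, and so on) are made up for illustration and are not part of the lecture.

```python
# Slip counts (X) and garbage counts (Y) from the lecture tables, Mon to Sat.
x = [10, 5, 12, 11, 21, 28]
y = [210, 160, 245, 240, 340, 415]

sum_x = sum(x)                                       # sum of X: add up the scores
square_of_sum = sum(x) ** 2                          # (sum of X) squared: sum first, then square
sum_of_squares = sum(v ** 2 for v in x)              # sum of X squared: square each, then sum
sum_of_products = sum(a * b for a, b in zip(x, y))   # sum of XY: multiply pairs, then sum

print(sum_x, square_of_sum, sum_of_squares, sum_of_products)
```

Note that the square of the sum and the sum of the squares are different quantities; the order of squaring and summing matters.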

Organizing Data

• Frequency Distributions: A frequency distribution is a table which shows the number of individuals or events that occurred at each measurement value

• this is the most common form of organizing data

Frequency Distributions

• The following hypothetical frequency distribution shows the number of women in different majors at the University of Windsor:

Major         # of Women
Art           15
Biology       35
Chemistry     34
Music         85
Psychology    97


• This frequency distribution organizes the data into nominal categories (by major)

• Frequency distributions can also organize data by points of measurement on a continuous variable, as follows:


Age of Students in 02-250:

Age    Frequency
18     14
19     85
20     58
21     40
22     35
23     16
24     10
25     6
26     4


• Frequency distributions should not exceed 15 to 20 lines, as the point is to summarize the data in a way that represents all the information concisely

• When there are more data than can be classified in 20 lines, the data can be grouped into score ranges known as class intervals, as in this example:

Class Interval Example

Canada Population Estimates for the Year 2016 (in millions)

Age        Pop     Age        Pop
0 - 4      2.05    50 - 54    2.79
5 - 9      2.07    55 - 59    2.69
10 - 14    2.12    60 - 64    2.31
15 - 19    2.19    65 - 69    1.97
20 - 24    2.38    70 - 74    1.42
25 - 29    2.48    75 - 79    0.99
30 - 34    2.54    80 - 84    0.71
35 - 39    2.53    85 - 89    0.47
40 - 44    2.51    90 +       0.33
45 - 49    2.57

Frequency Distributions cont.

• Looking at the frequency distribution tells us:

The most frequently occurring age is expected to be in the 50-54 age range (b/c this is the largest population estimate, 2.79 million).

The age frequencies are expected to be fairly evenly distributed from 0 to 70 years old and then fall off

The expected distribution of ages is not symmetrical: very low (young) and high (old) ages do not occur with equal likelihood

Frequency Distributions cont.

• Dividing the data into class intervals makes the data more accessible

• Data which has been divided into class intervals is sometimes referred to as grouped data

Cumulative Frequency Distributions

• Frequency distributions can be made to contain more information, as when a column of cumulative frequencies is added

• Cumulative Frequency Distribution: A table in which the frequency of individuals or events at each measurement value is added to previous frequencies so that each line reads as the total frequency of that and lower measurement values

Cumulative Frequency Ex.

Age of Students in 02-250:

Age    Frequency    Cumulative Frequency
18     14           14
19     85           99
20     58           157
21     40           197
22     35           232
23     16           248
24     10           258
25     6            264
26     4            268

More Frequency Distributions

• Frequency distributions can also contain information about the percentages and cumulative percentages of observations at the various scores:

More Frequency Distributions

Age of Students in 02-250:

Age    Frequency    Cumulative Frequency    %        Cumulative %
18     14           14                      5.22     5.22
19     85           99                      31.72    36.94
20     58           157                     21.64    58.58
21     40           197                     14.93    73.51
22     35           232                     13.06    86.57
23     16           248                     5.97     92.54
24     10           258                     3.73     96.27
25     6            264                     2.24     98.51
26     4            268                     1.49     100.00
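
The cumulative frequency, percentage, and cumulative percentage columns can all be derived from the frequency column alone. The sketch below shows one way to do it in Python, assuming the age data from the table; the variable names are illustrative only.

```python
from itertools import accumulate

# Age data for 02-250, taken from the frequency distribution in the lecture.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [14, 85, 58, 40, 35, 16, 10, 6, 4]

total = sum(freqs)                                   # number of students
cum_freqs = list(accumulate(freqs))                  # running total of frequencies
percents = [100 * f / total for f in freqs]          # each frequency as % of total
cum_percents = [100 * c / total for c in cum_freqs]  # running total as % of total

for row in zip(ages, freqs, cum_freqs, percents, cum_percents):
    print("{:>3} {:>5} {:>5} {:>7.2f} {:>7.2f}".format(*row))
```

Computing the cumulative percentage from the cumulative frequency (rather than adding up rounded percentages) avoids accumulating rounding error.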

Exact Limits

• All measurements are expressed in discrete units, such as seconds or centimeters

• No matter how small the unit of measurement, it is always possible to imagine finer measurement (e.g., 1 cm = 10 mm)

Exact Limits

• So, for continuous variables, any measure should be viewed as representing a range of values

• This range has a width equal to the unit of measurement used, and the boundaries of this range are the exact limits of the measure

Exact Limits

• E.g., If we say an event lasted 12 seconds, we mean it is closer to 12 seconds than to 11 or 13 seconds. A score of 12 represents a range of values. This range is one second wide (one unit of the measurement) and extends between 11.5 and 12.5 seconds

Exact Limits

• Exact limits identify the upper and lower ends of the range represented by the raw score and are the real boundaries of the measure in question

Exact Limits

• Exact Limits: Values one-half unit of measurement above and below the score or class interval. Exact limits are the boundaries of the range of values represented by the measure

• Some authors refer to exact limits as real limits

Exact Limits Examples

Measure    Exact Limits
52         51.5 - 52.5
51         50.5 - 51.5
52.2       52.15 - 52.25
52.1       52.05 - 52.15

Exact Limits Examples

Measure    Exact Limits
50.02      50.015 - 50.025
50.01      50.005 - 50.015

Class Interval    Exact Limits
50 - 54           49.5 - 54.5
55 - 59           54.5 - 59.5
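
The rule of "one-half unit above and below" can be sketched as a small Python function. The function name exact_limits is hypothetical, and the sketch assumes the unit of measurement can be read from the number of decimal places in the score as written (so "52.2" has a unit of 0.1); that assumption would fail for, say, scores rounded to the nearest 10.

```python
from decimal import Decimal

def exact_limits(low, high=None):
    """Return (lower, upper) exact limits for a score or a class interval.

    Scores are passed as strings so that the unit of measurement can be
    inferred from the decimal places as written (an assumption of this sketch).
    """
    low = Decimal(low)
    high = low if high is None else Decimal(high)
    # Unit of measurement: 1, 0.1, 0.01, ... based on the decimal places of low.
    unit = Decimal(1).scaleb(low.as_tuple().exponent)
    half = unit / 2
    return low - half, high + half

print(exact_limits("52"))        # single score, unit = 1
print(exact_limits("52.2"))      # single score, unit = 0.1
print(exact_limits("50", "54"))  # class interval 50 - 54
```

Decimal is used instead of float so that values such as 52.15 are represented exactly.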

Stem-and-Leaf Displays

• Stem-and-Leaf Display: partitions each score into a “stem” and a “leaf” and groups the scores according to common stems

• The “Leaf” is the rightmost digit

• The “Stem” is the digit (or digits) to the left of the leaf (the stem is 0 for 1 digit numbers)

Stem-and-Leaf

E.g.,

Score    Stem    Leaf
4        0       4
54       5       4
123      12      3
1234     123     4

The numbers 24 and 26 have different “leaves” (4 and 6) but the same stem (2)

Stem-and-Leaf

• Consider this raw data and their stem-and-leaf plot:

Stem-and-Leaf

stem    leaf
3       6
4       477
5       05899
6       01225788
7       24559
8       578
9       2

Data: 36, 44, 47, 47, 50, 55, 58, 59, 59, 60, 61, 62, 62, 65, 67, 68, 68, 72, 74, 75, 75, 79, 85, 87, 88, 92

Stem-and-Leaf

• Or this example:

• Data: 102, 104, 115, 116, 116, 125, 127, 128, 129, 129, 131, 136, 137, 145, 145

stem    leaf
10      24
11      566
12      57899
13      167
14      55

Stem-and-Leaf

• Unlike frequency distributions, stem-and-leaf plots give an indication of the overall distribution of the scores (e.g., evenly spread or bunched, symmetrical or nonsymmetrical)

• Note: Make sure you include every instance of a given value, e.g., if 57 occurs 3 times in the data set, this should be represented in the stem and leaf display with a stem of 5 and three 7s in the leaf.
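
Building a stem-and-leaf display by hand is mechanical enough to sketch in Python. The function name stem_and_leaf is made up for illustration, and the sketch assumes non-negative whole-number scores; stems with no scores are simply left out.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group non-negative integer scores by stem; the leaf is the rightmost digit."""
    groups = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)   # e.g. 47 -> stem 4, leaf 7
        groups[stem].append(str(leaf))   # every instance of a value gets its own leaf
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}

# First data set from the lecture.
data = [36, 44, 47, 47, 50, 55, 58, 59, 59, 60, 61, 62, 62, 65, 67, 68, 68,
        72, 74, 75, 75, 79, 85, 87, 88, 92]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem:>3} | {leaves}")
```

Because each score contributes one leaf, repeated values (such as the two 47s) appear as repeated digits, which is what the note above asks for.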

Graphs

• Graph refers to all manner of pictorial, or graphic, representation of data

• We will consider histograms and frequency polygons

Graphs

• The horizontal axis (X axis) is labeled with units representing points of measurement and the vertical axis (Y axis) is labeled with values representing frequency of occurrence

• Histograms and frequency polygons are like 2-dimensional representations of frequency distributions

Histogram

• Histogram: A graphic in which the horizontal axis identifies points of measurement, and the vertical axis represents frequency of occurrence

• Solid bars are used to represent the frequency at each point of measurement (a histogram is a bar graph)

Age Data Histogram Example

[Figure: histogram of the age data. X axis: Age (18 to 26); Y axis: Frequency (0 to 100)]
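
A rough text version of a histogram for the age data can be produced in Python without any plotting library. This is a sketch only: the function name text_histogram and the scale parameter are made up for illustration, and the bars run horizontally rather than vertically as on the slide.

```python
# Age data for 02-250, taken from the frequency distribution in the lecture.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [14, 85, 58, 40, 35, 16, 10, 6, 4]

def text_histogram(values, counts, scale=2):
    """Return one line per value, drawing one '#' per `scale` observations."""
    lines = []
    for value, count in zip(values, counts):
        bar = "#" * (count // scale)     # bar length is proportional to frequency
        lines.append(f"{value:>3} | {bar} {count}")
    return lines

lines = text_histogram(ages, freqs)
print("\n".join(lines))
```

The longest bar marks the most frequent age, and the overall shape of the bars shows how the scores are distributed.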

Frequency Polygon

• Frequency Polygon: A graphic in which the horizontal axis identifies points of measurement, and the vertical axis represents frequency of occurrence (a frequency polygon is a line graph)

Age Data Frequency Polygon

[Figure: frequency polygon of the age data. X axis: Age (18 to 26); Y axis: Frequency (0 to 100)]

Graphs cont.

• Both histograms and frequency polygons can be embellished by the simultaneous plotting of more than one variable, as shown next

Graphs cont.

[Figure: age data plotted separately for females (red) and males (green). X axis: Age (18 to 26); Y axis: Frequency (0 to 50)]

Graphs cont.

[Figure: two panels of age data plotted separately for females (red) and males (green). X axis: Age (18 to 26); Y axis: Frequency (0 to 50)]

Describing Data

• Averages: an average is a numerical value that indicates the middle point or central region of the raw data

• Averages are sometimes referred to as measures of central tendency

Averages

• 3 statistics are commonly termed averages:
  - Mode
  - Median
  - Mean

Mode

• Mode: The most frequently occurring score
  - A distribution with a single most frequently occurring score (one hump) is termed a unimodal (single mode) distribution
  - A distribution with 2 values that share the quality of being most frequently occurring (2 humps) is termed bimodal (2 modes)

Mode Example

Age of Students in 02-250:

Age    Frequency
18     14
19     85
20     58
21     40
22     35
23     16
24     10
25     6
26     4

In this example, the Mode is 19, as it has the highest frequency

A la Mode

• The mode does not take into account all of the data - only the one most frequently occurring score

• The mode is the score with the highest bar in a histogram, or the highest point in a frequency polygon

• When the data are combined into class intervals, the mode is the mid-point of the class interval that contains the most scores
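
Finding the mode from raw scores amounts to counting each value and keeping the one (or ones) with the highest count. The sketch below assumes a non-empty list of scores; the function name modes is made up for illustration, and the bimodal scores in the last line are invented purely to show the two-mode case.

```python
from collections import Counter

def modes(scores):
    """Return every score that shares the highest frequency (one or more modes)."""
    counts = Counter(scores)
    highest = max(counts.values())
    return sorted(score for score, count in counts.items() if count == highest)

# Rebuild the raw scores from the lecture's age frequency distribution.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [14, 85, 58, 40, 35, 16, 10, 6, 4]
raw_scores = [age for age, f in zip(ages, freqs) for _ in range(f)]

print(modes(raw_scores))       # unimodal: the lecture's age data
print(modes([1, 2, 2, 3, 3]))  # bimodal: made-up scores
```

Returning a list rather than a single number keeps the bimodal case visible instead of silently picking one of the two modes.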

Median

• Median: The middle point of the distribution, or the score which bisects the distribution (divides it into upper and lower halves)

Median

• If there are an ODD number of scores, the median is the middle score:

1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23

Median = 13

There are 5 scores above the median, and 5 below

Median

• If there are an EVEN number of scores, the median is the midpoint between the two middle scores:

1, 3, 6, 7, 8, 13, 15, 17, 18, 23

Median = (8 + 13)/2 = 10.5

Median Notes

NOTE!

• When determining the median, you must arrange the scores in ascending or descending order first!

Steps to Finding the Median

1. Arrange data in ascending or descending order

2. Count the number of scores (N)

3. If there are an odd number of scores, find the middle point (the score where there are the same number of scores above and below it) - this is the median

Steps to Finding the Median

4. If there are an even number of scores, find the 2 middle scores - add them, and divide by 2 - this is the median
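
The four steps for finding the median translate almost line for line into Python. This is a minimal sketch assuming a non-empty list of numeric scores; the function name median is chosen here for illustration.

```python
def median(scores):
    """Find the median by following the four steps from the lecture."""
    ordered = sorted(scores)     # step 1: arrange in ascending order
    n = len(ordered)             # step 2: count the number of scores
    mid = n // 2
    if n % 2 == 1:               # step 3: odd n, take the middle score
        return ordered[mid]
    # step 4: even n, average the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23]))  # odd example from the lecture
print(median([1, 3, 6, 7, 8, 13, 15, 17, 18, 23]))      # even example from the lecture
```

Sorting inside the function means the caller cannot forget step 1, which is the most common mistake when finding a median by hand.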

More Median

• When a distribution is viewed as area, the median divides the total area in half:

[Figure: a distribution drawn as an area, with the median dividing it into two halves of 50% each]

Median cont.

• The median is based on the value of one or two scores, and does not take into account all of the data

• When the data are grouped into class intervals, the median can be viewed as the midpoint of the class interval which contains the middle score (the 50th percentile). This is only a rough estimate

Arithmetic Mean

• Arithmetic Mean: the sum of the scores divided by the number of scores (what is generally thought of as the average)

Mean

• The mean of a sample of X scores is symbolized as X̄, which is said as “X bar”

• The mean of a population of X scores is symbolized by the Greek letter mu (µ)

• Greek letters tend to be used for parameters, while conventional letters are used for statistics

Mean

• The algebraic definition of the population mean is as follows:

µ = ΣX / N

N is used to refer to the number of scores in the data set (termed population size)

Sample Mean

• The algebraic definition of the sample mean is as follows:

X̄ = ΣX / n

n is used to refer to the number of scores in the data set (termed sample size)

Mean cont.

• The algebraic formula for the sample mean and the population mean is the same (although some other statistics have different formulae for samples and populations)

Mean cont.

• The mean is used as the measure of average almost exclusively (rather than the mode or median) because it is defined algebraically and considers all the raw scores in the data set

Mean cont.

• In any group of scores, the sum of the deviations from the mean equals zero:

n = 6
X̄ = ΣX/n = 33/6 = 5.50

X    X - X̄
3    3 - 5.50 = -2.50
5    5 - 5.50 = -0.50
9    9 - 5.50 = +3.50
2    2 - 5.50 = -3.50
8    8 - 5.50 = +2.50
6    6 - 5.50 = +0.50

ΣX = 33    Σ(X - X̄) = 0.00
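
The deviation table can be reproduced in a few lines of Python, using the six scores from the slide. The variable names are illustrative only. With other data sets the sum of deviations may come out as a tiny non-zero number because of floating-point rounding, so comparing against a small tolerance is safer than testing for exactly zero.

```python
# The six scores from the deviation table in the lecture.
scores = [3, 5, 9, 2, 8, 6]

n = len(scores)                              # sample size
mean = sum(scores) / n                       # sample mean: sum of X divided by n
deviations = [x - mean for x in scores]      # X minus the mean, for each score

for x, d in zip(scores, deviations):
    print(f"{x} - {mean:.2f} = {d:+.2f}")
print("sum of deviations:", sum(deviations))
```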

Relative Characteristics of Averages

• If the distribution is symmetrical and unimodal, the mean, median, and mode have the same value

• The longer tail of a non-symmetrical distribution “pulls” the mean more than the mode and median

• Therefore: the mean is more affected by outliers (very large or very small data points) than are the mode and median
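
The pull of an outlier on the mean can be seen with a quick Python check using the standard statistics module. The first list is the odd-numbered median example from earlier in the lecture; the second list is a made-up variation in which the largest score (23) is replaced by 230, purely to illustrate the point.

```python
from statistics import mean, median

# Median example from the lecture (11 scores).
scores = [1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23]

# Hypothetical variation: replace the largest score with an extreme value.
with_outlier = scores[:-1] + [230]

print("original:     mean =", mean(scores), " median =", median(scores))
print("with outlier: mean =", round(mean(with_outlier), 2), " median =", median(with_outlier))
```

The median stays where it was because the middle score has not changed, while the mean shifts toward the extreme value.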

Relative Characteristics of Averages

• Relative positions of the mean and median:

• Note: The mode is the highest point in the distribution
