Basic Quantitative Methods in the Social Sciences (AKA Intro Stats) 02-250-01 Lecture 2.
Basic Quantitative Methods in the Social Sciences
(AKA Intro Stats) 02-250-01
Lecture 2
Sign Up for Participant Pool!!
• see Psychology research first hand!
• earn up to 2 bonus points
• HOW????
• sign up on the web (takes less than 5 minutes): www.uwindsor.ca/psychology/signup
• or access through psych homepage
• You MUST sign up by May 19 to be included
Major Points Today
• Types of Measurement
• Summation Notation
• Organizing Data
• Stem and Leaf Displays
• Graphs
• Measures of Central Tendency
Types of Measurement
• There are 4 types of measurement most often used in statistics:
  - Nominal
  - Ordinal
  - Interval
  - Ratio
Nominal Measurement
• Nominal Measurement: the classification of measurements into a set of categories
• The numbers produced by nominal measurement are frequencies of occurrence in the categories (e.g., 22 ducks, 12 chickens, 2 geese, etc.)
Nominal Measurement cont.
• A second example is gender – 2 categories, male and female
• Nominal measurement applies to qualitative variables - elements are assigned to a category because they possess one characteristic or another
• Nominal data is also termed qualitative data
Ordinal Measurement
• Ordinal Measurement: the rank ordering of elements on a continuum
• Ordinal measurement does not measure the amount of the variable - it represents the individual’s placement in a continuum (or ranking; e.g., the winner of a race is in “first place”)
Ordinal Measurement cont.
• It is important to note that the difference in the amount of the variable between adjacent rank positions is not constant - the difference in amount of talent between the 1st and 2nd place finishers in a race cannot be assumed to be the same as the difference in amount of talent between the 5th and 6th place finishers
• Ordinal data can tell you that the person in 1st place finished before the person in 3rd place, but not by how much
Interval Measurement
• Interval Measurement: the assignment of numerical quantity to the variable in a way that:
  - the number assigned reflects the amount of the variable
  - the size of the measurement unit remains constant
  - the zero point is defined arbitrarily and does not represent an absence of the property being measured
Interval Measurement cont.
• The best example is temperature
  - 40°C represents how hot something is (the amount of heat it has)
  - The unit of measurement (1°C) represents the same amount of heat regardless of where it occurs in the range of measurement (the amount of change in temperature is the same between 25°C - 26°C and 32°C - 33°C)
  - The zero point (0°C) is arbitrary - it represents the point at which water freezes, not the absence of temperature
Interval Measurement cont.
• Interval measurement can contain negative numbers, whereas Nominal and Ordinal Measurement do not
Ratio Measurement
• Ratio Measurement: the assignment of numerical quantity to the variable in such a way that:
  - the number assigned reflects the amount of the variable
  - the size of the measurement unit remains constant
  - the zero point represents an absence of the property being measured
Ratio Measurement
• Good examples are time and length
• A ratio scale cannot produce negative numbers
• Interval and ratio measurement are equivalent for statistical purposes and are often referred to as one thing (interval/ratio data)
Summation Notation
• We commonly use the letters “X” and “Y” to represent the variables we have measured
• The upper case Greek letter sigma (Σ) is known as the summation operator; it means “the sum of”
Example
• Suppose we keep a record for 6 days of every time someone slips in the CAW Student Centre Cafeteria (represented by X). The data may look like this:
Data Example
Day     X
Mon     10
Tues    5
Weds    12
Thurs   11
Fri     21
Sat     28
ΣX
• ΣX means the sum of all the X scores, so that:
ΣX = X1 + X2 + X3 + ... + XN
   = 10 + 5 + 12 + 11 + 21 + 28
   = 87
• Note: X1 means the first X score; XN means the last X score
(ΣX)²
• (ΣX)² means the square of the sum (total all numbers within parentheses and then square), so that:
(ΣX)² = (X1 + X2 + X3 + ... + XN)²
      = (10 + 5 + 12 + 11 + 21 + 28)²
      = (87)(87) = 7569
ΣX²
• ΣX² means the sum of the squares (square each number and then sum), so that:
ΣX² = X1² + X2² + X3² + ... + XN²
    = 10² + 5² + 12² + 11² + 21² + 28²
    = 100 + 25 + 144 + 121 + 441 + 784
    = 1615
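The three quantities above are easy to confuse, so a short Python sketch may help. It uses the slip counts from the example; the variable names are illustrative only and are not part of the lecture.

```python
# Slips per day, Mon through Sat (the X scores from the example above).
x = [10, 5, 12, 11, 21, 28]

sum_x = sum(x)                           # sum of X: add the scores
square_of_sum = sum_x ** 2               # square of the sum: add first, then square
sum_of_squares = sum(v ** 2 for v in x)  # sum of squares: square first, then add

print(sum_x, square_of_sum, sum_of_squares)  # 87 7569 1615
```

The order of operations is the whole difference: squaring after summing gives 7569, while squaring before summing gives 1615.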
More Summation Notation
• Suppose you also keep track of the number of pieces of garbage dropped on the floor of the CAW Student Centre for the same days as above (variable Y) and the data were as follows:
Example Data
Day     X    Y
Mon     10   210
Tues    5    160
Weds    12   245
Thurs   11   240
Fri     21   340
Sat     28   415
ΣXY
• ΣXY means the sum of the products:
ΣXY = (X1)(Y1) + (X2)(Y2) + (X3)(Y3) + ... + (XN)(YN)
    = (10)(210) + (5)(160) + (12)(245) + (11)(240) + (21)(340) + (28)(415)
    = 2100 + 800 + 2940 + 2640 + 7140 + 11620
    = 27240
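As a rough check of the sum of products, the following Python sketch pairs each day's X score with its Y score before multiplying. The names `x`, `y`, `sum_xy`, and `product_of_sums` are illustrative assumptions, not lecture notation.

```python
# X = slips per day, Y = pieces of garbage per day (Mon through Sat).
x = [10, 5, 12, 11, 21, 28]
y = [210, 160, 245, 240, 340, 415]

# Sum of products: multiply each X by the Y from the same day, then add.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
print(sum_xy)  # 27240

# For contrast, the product of the two sums is a different quantity.
product_of_sums = sum(x) * sum(y)
print(product_of_sums == sum_xy)  # False
```

The contrast at the end is included because the sum of the products is, in general, not the same as the product of the sums.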
Organizing Data
• Frequency Distributions: A frequency distribution is a table which shows the number of individuals or events that occurred at each measurement value
• this is the most common form of organizing data
Frequency Distributions
• The following hypothetical frequency distribution shows the number of women in different majors at the University of Windsor:
Major        # of Women
Art          15
Biology      35
Chemistry    34
Music        85
Psychology   97
Frequency Distributions
• This frequency distribution organizes the data into nominal categories (by major)
• Frequency distributions can also organize data by points of measurement on a continuous variable, as follows:
Frequency Distributions
Age of Students in 02-250:
Age   Frequency
18    14
19    85
20    58
21    40
22    35
23    16
24    10
25    6
26    4
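A frequency distribution like this can be tallied from raw scores. The sketch below uses Python's `collections.Counter`; the list of ages is a small hypothetical sample invented for illustration, not the class data.

```python
from collections import Counter

# Hypothetical raw ages (made up for this sketch, not the 02-250 data).
ages = [19, 18, 20, 19, 21, 19, 20, 22, 19, 20]

# Count how many times each age occurs.
freq = Counter(ages)

# Print one line per measurement value, in ascending order.
for age in sorted(freq):
    print(age, freq[age])
```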
Frequency Distributions
• Frequency distributions should not exceed 15 to 20 lines, as the point is to summarize the data in a way that represents all the information concisely
• When there are more data than can be classified in 20 lines, the data can be grouped into score ranges known as class intervals, as in this example:
Class Interval Example
Canada Population Estimates for the Year 2016 (in millions)
Age       Pop     Age       Pop
0 - 4     2.05    50 - 54   2.79
5 - 9     2.07    55 - 59   2.69
10 - 14   2.12    60 - 64   2.31
15 - 19   2.19    65 - 69   1.97
20 - 24   2.38    70 - 74   1.42
25 - 29   2.48    75 - 79   0.99
30 - 34   2.54    80 - 84   0.71
35 - 39   2.53    85 - 89   0.47
40 - 44   2.51    90 +      0.33
45 - 49   2.57
Frequency Distributions cont.
• Looking at the frequency distribution tells us:
  - The most frequently occurring age is expected to be in the 50-54 age range (because this is the largest population estimate, 2.79 million)
  - The age frequencies are expected to be fairly evenly distributed from 0 to 70 years old and then fall off
  - The expected distribution of ages is not symmetrical: very low (young) and high (old) ages do not occur with equal likelihood
Frequency Distributions cont.
• Dividing the data into class intervals makes the data more accessible
• Data which have been divided into class intervals are sometimes referred to as grouped data
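Grouping raw scores into class intervals can be sketched as follows. The scores and the interval width of 5 are hypothetical choices made for illustration; integer division assigns each score to the lower limit of its interval.

```python
from collections import Counter

# Hypothetical whole-number scores (invented for this sketch).
scores = [52, 57, 50, 61, 64, 58, 55, 69, 53, 60]
width = 5  # assumed class interval width

# Map each score to the lower limit of its class interval, then count.
grouped = Counter((s // width) * width for s in scores)

for lower in sorted(grouped):
    print(f"{lower} - {lower + width - 1}: {grouped[lower]}")
```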
Cumulative Frequency Distributions
• Frequency distributions can be made to contain more information, as when a column of cumulative frequencies is added
• Cumulative Frequency Distribution: A table in which the frequency of individuals or events at each measurement value is added to previous frequencies so that each line reads as the total frequency of that and lower measurement values
Cumulative Frequency Ex.
Age of Students in 02-250:
Age   Frequency   Cumulative Frequency
18    14          14
19    85          99
20    58          157
21    40          197
22    35          232
23    16          248
24    10          258
25    6           264
26    4           268
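The cumulative column is a running total of the frequency column. A minimal Python sketch using `itertools.accumulate` on the age data (variable names are illustrative):

```python
from itertools import accumulate

ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [14, 85, 58, 40, 35, 16, 10, 6, 4]

# Running total: each entry is the frequency of that age plus all lower ages.
cum_freqs = list(accumulate(freqs))

for age, f, cf in zip(ages, freqs, cum_freqs):
    print(age, f, cf)
```

The last cumulative frequency (268) equals the total number of students in the table.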
More Frequency Distributions
• Frequency distributions can also contain information about the percentages and cumulative percentages of observations at the various scores:
More Frequency Distributions
Age of Students in 02-250:
Age   Frequency   Cumulative Frequency   %       Cumulative %
18    14          14                     5.22    5.22
19    85          99                     31.72   36.94
20    58          157                    21.64   58.58
21    40          197                    14.93   73.51
22    35          232                    13.06   86.57
23    16          248                    5.97    92.54
24    10          258                    3.73    96.27
25    6           264                    2.24    98.51
26    4           268                    1.49    100.00
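The percentage columns come from dividing each frequency (or cumulative frequency) by the total number of scores and multiplying by 100. A sketch under that assumption, rounding to two decimals as the table does:

```python
from itertools import accumulate

freqs = [14, 85, 58, 40, 35, 16, 10, 6, 4]  # ages 18 through 26
total = sum(freqs)                           # 268 students

# Percentage of students at each age.
pct = [round(100 * f / total, 2) for f in freqs]

# Percentage of students at each age or younger.
cum_pct = [round(100 * c / total, 2) for c in accumulate(freqs)]

for p, cp in zip(pct, cum_pct):
    print(f"{p:6.2f} {cp:7.2f}")
```

Computing the cumulative percentages from the cumulative frequencies, rather than adding up the rounded percentages, avoids accumulating rounding error.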
Exact Limits
• All measurements are expressed in discrete units, such as seconds or centimeters
• No matter how small the unit of measurement, it is always possible to imagine finer measurement (e.g., 1 cm = 10 mm)
Exact Limits
• So, for continuous variables, any measure should be viewed as representing a range of values
• This range has a width equal to the unit of measurement used, and the boundaries of this range are the exact limits of the measure
Exact Limits
• E.g., If we say an event lasted 12 seconds, we mean it is closer to 12 seconds than to 11 or 13 seconds. A score of 12 represents a range of values. This range is one second wide (one unit of the measurement) and extends between 11.5 and 12.5 seconds
Exact Limits
• Exact limits identify the upper and lower ends of the range represented by the raw score and are the real boundaries of the measure in question
Exact Limits
• Exact Limits: Values one-half unit of measurement above and below the score or class interval. Exact limits are the boundaries of the range of values represented by the measure
• Some authors refer to exact limits as real limits
Exact Limits Examples
Measure   Exact Limits
52        51.5 - 52.5
51        50.5 - 51.5
52.2      52.15 - 52.25
52.1      52.05 - 52.15
Exact Limits Examples
Measure   Exact Limits
50.02     50.015 - 50.025
50.01     50.005 - 50.015

Class Interval   Exact Limits
50 - 54          49.5 - 54.5
55 - 59          54.5 - 59.5
Stem-and-Leaf Displays
• Stem-and-Leaf Display: partitions each score into a “stem” and a “leaf” and groups the scores according to common stems
• The “Leaf” is the rightmost digit
• The “Stem” is the digit (or digits) to the left of the leaf (the stem is 0 for 1 digit numbers)
Stem-and-Leaf
E.g.,
Number   Stem   Leaf
4        0      4
54       5      4
123      12     3
1234     123    4

The numbers 24 and 26 have different “leaves” (4 and 6) but the same stem (2)
Stem-and-Leaf
• Consider this raw data and their stem-and-leaf plot:
Stem-and-Leaf
stem   leaf
3      6
4      477
5      05899
6      01225788
7      24559
8      578
9      2
Data: 36, 44, 47, 47, 50, 55, 58, 59, 59, 60, 61, 62, 62, 65, 67, 68, 68, 72, 74, 75, 75, 79, 85, 87, 88, 92
Stem-and-Leaf
• Or this example:
• Data: 102, 104, 115, 116, 116, 125, 127, 128, 129, 129, 131, 136, 137, 145, 145

stem   leaf
10     24
11     566
12     57899
13     167
14     55
Stem-and-Leaf
• Unlike frequency distributions, stem-and-leaf plots give an indication of the overall distribution of the scores (e.g., evenly spread or bunched, symmetrical or nonsymmetrical)
• Note: Make sure you include every instance of a given value, e.g., if 57 occurs 3 times in the data set, this should be represented in the stem and leaf display with a stem of 5 and three 7s in the leaf.
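A stem-and-leaf display can be built mechanically by splitting each score into its rightmost digit and everything to the left of it. The Python sketch below assumes non-negative whole-number scores; the function name `stem_and_leaf` is illustrative, and stems with no scores are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group sorted scores by stem; leaves are joined into one string."""
    groups = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # leaf = rightmost digit
        groups[stem].append(str(leaf))  # every instance is kept
    return {stem: "".join(leaves) for stem, leaves in groups.items()}

data = [102, 104, 115, 116, 116, 125, 127, 128,
        129, 129, 131, 136, 137, 145, 145]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem:>3} | {leaves}")
```

Because every score contributes one leaf, repeated values (such as the two 116s) appear as repeated digits, as the note above requires.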
Graphs
• Graph refers to all manner of pictorial, or graphic, representation of data
• We will consider histograms and frequency polygons
Graphs
• The horizontal axis (X axis) is labeled with units representing points of measurement and the vertical axis (Y axis) is labeled with values representing frequency of occurrence
• Histograms and frequency polygons are like 2-dimensional representations of frequency distributions
Histogram
• Histogram: A graphic in which the horizontal axis identifies points of measurement, and the vertical axis represents frequency of occurrence
• Solid bars are used to represent the frequency at each point of measurement (a histogram is a bar graph)
Age Data Histogram Example
[Figure: histogram of the age data. X axis: Age (18 to 26); Y axis: Frequency (0 to 100)]
Frequency Polygon
• Frequency Polygon: A graphic in which the horizontal axis identifies points of measurement, and the vertical axis represents frequency of occurrence (a frequency polygon is a line graph)
Age Data Frequency Polygon
[Figure: frequency polygon of the age data. X axis: Age (18 to 26); Y axis: Frequency (0 to 100)]
Graphs cont.
• Both histograms and frequency polygons can be embellished by the simultaneous plotting of more than one variable, as shown next
Graphs cont.
[Figure: frequency by age plotted for two groups, Female = red and Male = green. X axis: Age (18 to 26); Y axis: Frequency (0 to 50)]
Graphs cont.
[Figure: two panels of frequency by age, the second with the legend Female = red, Male = green. X axis: Age (18 to 26); Y axis: Frequency (0 to 50)]
Describing Data
• Averages: an average is a numerical value that indicates the middle point or central region of the raw data
• Averages are sometimes referred to as measures of central tendency
Averages
• 3 statistics are commonly termed averages:
  - Mode
  - Median
  - Mean
Mode
• Mode: The most frequently occurring score
  - A distribution with a single most frequently occurring score (one hump) is termed a unimodal (single mode) distribution
  - A distribution with 2 values that share the quality of being most frequently occurring (2 humps) is termed bimodal (2 modes)
Mode Example
Age of Students in 02-250:
Age   Frequency
18    14
19    85
20    58
21    40
22    35
23    16
24    10
25    6
26    4

In this example, the Mode is 19 as it has the highest frequency
A la Mode
• The mode does not take into account all of the data - only the one most frequently occurring score
• The mode is the score with the highest bar in a histogram, or the highest point in a frequency polygon
• When the data are combined into class intervals, the mode is the mid-point of the class interval that contains the most scores
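Finding the mode can be sketched in Python in two ways: from a frequency table (pick the value with the largest frequency) or from raw scores with `statistics.multimode`, which returns every value tied for most frequent. The raw scores below are hypothetical and chosen to be bimodal.

```python
from statistics import multimode

# Frequency table for the age data: age -> frequency.
freq = {18: 14, 19: 85, 20: 58, 21: 40, 22: 35,
        23: 16, 24: 10, 25: 6, 26: 4}

# The mode is the age whose frequency is largest.
mode_age = max(freq, key=freq.get)
print(mode_age)  # 19

# Hypothetical raw scores with two equally frequent values (bimodal).
raw_scores = [1, 2, 2, 3, 3]
modes = multimode(raw_scores)
print(modes)  # [2, 3]
```

`multimode` requires Python 3.8 or later.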
Median
• Median: The middle point of the distribution, or the score which bisects the distribution (divides it into upper and lower halves)
Median
• If there are an ODD number of scores, the median is the middle score:
1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23
Median = 13
There are 5 scores above the median, and 5 below
Median
• If there are an EVEN number of scores, the median is the midpoint between the two middle scores:
1, 3, 6, 7, 8, 13, 15, 17, 18, 23
Median = (8 + 13)/2 = 10.5
Median Notes
NOTE!
• When determining the median, you must arrange the scores in ascending or descending order first!
Steps to Finding the Median
1. Arrange data in ascending or descending order
2. Count the number of scores (N)
3. If there are an odd number of scores, find the middle point (the score where there are the same number of scores above and below it) - this is the median
Steps to Finding the Median
4. If there are an even number of scores, find the 2 middle scores - add them, and divide by 2 - this is the median
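The four steps above translate almost line for line into Python. This is a sketch that assumes at least one score; the function name `median` is illustrative (the standard library offers `statistics.median` for real use).

```python
def median(scores):
    ordered = sorted(scores)  # Step 1: arrange in ascending order
    n = len(ordered)          # Step 2: count the number of scores
    mid = n // 2
    if n % 2 == 1:            # Step 3: odd count, take the middle score
        return ordered[mid]
    # Step 4: even count, average the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23]))  # 13
print(median([1, 3, 6, 7, 8, 13, 15, 17, 18, 23]))      # 10.5
```

Both calls use the lecture's own examples. Sorting inside the function means unsorted input is handled correctly.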
More Median
• When a distribution is viewed as area, the median divides the total area in half:
[Figure: a distribution with the median marked; 50% of the area lies on each side of the median]
Median cont.
• The median is based on the value of one or two scores, and does not take into account all of the data
• When the data are grouped into class intervals, the median can be viewed as the midpoint of the class interval which contains the middle score (the 50th percentile). This is only a rough estimate
Arithmetic Mean
• Arithmetic Mean: the sum of the scores divided by the number of scores (what is generally thought of as the average)
Mean
• The mean of a sample of X scores is symbolized as X̄, which is said as “X bar”
• The mean of a population of X scores is symbolized by the Greek letter mu (µ)
• Greek letters tend to be used for parameters, while conventional letters are used for statistics
Mean
• The algebraic definition of the population mean is as follows:
µ = ΣX / N
N is used to refer to the number of scores in the data set (termed population size)
Sample Mean
• The algebraic definition of the sample mean is as follows:
X̄ = ΣX / n
n is used to refer to the number of scores in the data set (termed sample size)
Mean cont.
• The algebraic formula for the sample and population mean is the same (although some other statistics have different formulae for samples and populations)
Mean cont.
• The mean is used as the measure of average almost exclusively (rather than the mode or median) because it is defined algebraically and considers all the raw scores in the data set
Mean cont.
• In any group of scores, the sum of the deviations from the mean equals zero:

X    X - X̄
3    3 - 5.50 = -2.50
5    5 - 5.50 = -0.50
9    9 - 5.50 = +3.50
2    2 - 5.50 = -3.50
8    8 - 5.50 = +2.50
6    6 - 5.50 = +0.50

n = 6
X̄ = ΣX/n = 33/6 = 5.50
ΣX = 33
Σ(X - X̄) = 0.00
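The zero-sum property of deviations can be verified with the same six scores. This is a minimal Python sketch with illustrative variable names.

```python
x = [3, 5, 9, 2, 8, 6]

n = len(x)          # sample size
mean = sum(x) / n   # sum of the scores divided by the number of scores
print(mean)         # 5.5

# Deviation of each score from the mean.
deviations = [v - mean for v in x]
print(deviations)   # [-2.5, -0.5, 3.5, -3.5, 2.5, 0.5]

# The deviations cancel out exactly for these values.
print(sum(deviations))  # 0.0
```

With other data the floating-point sum may differ from zero by a tiny rounding error, so a tolerance-based comparison such as `math.isclose(total, 0.0, abs_tol=1e-9)` is the safer check.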
Relative Characteristics of Averages
• If the distribution is symmetrical and unimodal, the mean, median, and mode have the same value
• The longer tail of a non-symmetrical distribution “pulls” the mean more than the mode and median
• Therefore: the mean is more affected by outliers (very large or very small data points) than are the mode and median
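The pull of an outlier on the mean can be illustrated with a short Python sketch. The six scores are reused from the deviation example; the outlier value of 100 is a hypothetical addition chosen for illustration.

```python
from statistics import mean, median

scores = [3, 5, 9, 2, 8, 6]
print(mean(scores), median(scores))  # 5.5 5.5

# Add one very large score (hypothetical outlier).
with_outlier = scores + [100]
mean_out = mean(with_outlier)
median_out = median(with_outlier)
print(mean_out, median_out)  # 19 6
```

One extreme score moves the mean from 5.5 to 19, while the median only shifts from 5.5 to 6.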
Relative Characteristics of Averages
• Relative positions of the mean and median:
• Note: The mode is the highest point in the distribution