Biderman's Psychology 201 Handouts
PSY 2010 Lecture Notes Chapter 2 - Frequency Distributions
The Goals of Descriptive Statistics
Characteristics of people: kind, aloof, gregarious, tall, friendly, mean, spacy, etc.
Characteristics of cities: forward-looking, violent, progressive, outdoorsy
Characteristics of cars: fast, economical, hybrid, 8-seat, etc.
Just as there are certain characteristics which seem to "belong" to people or cities or cars, there are a few characteristics which "belong" to collections of numbers and which statisticians feel should be mentioned whenever an attempt is made to describe a collection.
The Big Three Characteristics of collections of numbers
1: Central Tendency (known loosely as “average value”)
Consider the following weights: 230, 260, 305, 195.
Compare them with the following: 115, 120, 105, 94, 110, 115, 100, 90, 85.
Central tendency refers to how big the individual numbers in the collection are.
The central tendency of the first collection is larger: its scores are generally larger than those in the second collection.
2: Variability
Consider the following collection: 150, 155, 158, 160, 153, 156, 152.
Compare it with: 85, 175, 305, 95, 130.
Variability refers to how different the numbers in the collection are from each other.
Small variability: All numbers are approximately the same value.
Large variability: All numbers are quite different from each other.
Note that the second collection is more variable than the first.
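The two comparisons above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original handout: it uses the mean as one measure of central tendency and the range (largest minus smallest) as one measure of variability. The variable and function names are made up for illustration.

```python
from statistics import mean

# Collections copied from the examples above.
weights_a = [230, 260, 305, 195]
weights_b = [115, 120, 105, 94, 110, 115, 100, 90, 85]
tight = [150, 155, 158, 160, 153, 156, 152]
spread = [85, 175, 305, 95, 130]


def describe(scores):
    """Return (mean, range): one measure of center and one of variability."""
    return mean(scores), max(scores) - min(scores)


for name, scores in [("weights_a", weights_a), ("weights_b", weights_b),
                     ("tight", tight), ("spread", spread)]:
    center, score_range = describe(scores)
    print(f"{name}: mean = {center:.1f}, range = {score_range}")
```

The first weight collection has the larger mean, and the second variability collection has the larger range, which matches the conclusions drawn by eye.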
3: Shape of the distribution of scores
Shape refers to the way score values are positioned or placed on the number line.
In some distributions, the scores are all piled up on one side or the other.
In others, the scores are piled up in the middle.
Shape will be considered in detail after graphical methods of description have been introduced.
Other Characteristics
We will consider the correlation between paired data later in the course.
Descriptive Techniques
To describe
1) central tendency
2) variability, and
3) shape
Overview (Boldfaced are most important for this class)
1. Tables
Regular frequency distribution
Grouped frequency distribution
Stem and leaf displays
Chapter 2
2. Graphs.
Bar graph
Histogram
Frequency polygon
Dot plot
Scatterplots
3. Numeric summaries
Mean, median, mode
Chapter 14
Chapter 3
Standard deviation, range
Measures of skewness and kurtosis
Correlation Coefficient
Ungrouped Frequency Distributions (Corty pp. 35-36)
Definition: A list of all possible score values from the largest down to the smallest along with the frequency of occurrence of each score value, even if it was 0.
Example . . .
Hypothetical responses to a survey item: "Please indicate how much you agree with the following statement on a scale from 1 = Strongly Disagree to 7 = Strongly Agree . . ."
“I think Taylor Swift is the best vocalist, male or female, of this decade.”
Hypothetical answers of 20 university undergraduates
7 4 6 3 3 4 6 4 6 4 5 4 5 1 5 4 6 6 7 4
Ungrouped Frequency Distribution of the Responses
Response                 Frequency   Percentage
7 (Strongly Agree)           2          10.00
6                            5          25.00
5                            3          15.00
4                            7          35.00
3                            2          10.00
2                            0           0.00
1 (Strongly Disagree)        1           5.00
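For reference, here is a minimal sketch of how such a table could be built by program rather than by hand. It is an illustration only (the class uses SPSS); the function name and its parameters are assumptions. Note that it walks through every possible score value from largest to smallest, so values with a frequency of 0 still get a row.

```python
from collections import Counter

# The 20 hypothetical responses listed above.
responses = [7, 4, 6, 3, 3, 4, 6, 4, 6, 4, 5, 4, 5, 1, 5, 4, 6, 6, 7, 4]


def ungrouped_frequency_distribution(scores, lowest, highest):
    """Return (value, frequency, percent) rows from highest down to lowest."""
    counts = Counter(scores)
    n = len(scores)
    rows = []
    for value in range(highest, lowest - 1, -1):
        frequency = counts[value]  # Counter gives 0 for values never observed
        rows.append((value, frequency, 100 * frequency / n))
    return rows


table = ungrouped_frequency_distribution(responses, lowest=1, highest=7)
print("Response Frequency Percentage")
for value, frequency, percent in table:
    print(f"{value:>8} {frequency:>9} {percent:>10.2f}")
```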
Same table annotated . . .
Response                 Frequency   Percentage
7 (Strongly Agree)           2          10.00     <- Largest score value is at the top of the table.
6                            5          25.00
5                            3          15.00
4                            7          35.00
3                            2          10.00
2                            0           0.00     <- Score values with 0 frequency are included in the table.
1 (Strongly Disagree)        1           5.00     <- Smallest score value is at the bottom of the table.
Frequency Distributions from the ATV Data
Whether wearing helmet at time of crash
Value         Frequency
Yes               63
No               344
Unavailable       93
OK, so from the above, we can easily see that most of the persons involved in accidents that ended in the hospital were NOT wearing helmets.
Whether drinking alcohol before time of crash (Start here on 8/27/15)
Value         Frequency
Yes              141
No               296
No info           39
No test           24
From this we can see that most people involved in hospitalization accidents on ATVs were not using alcohol.
Helmet usage by age group
Age 20 or Younger            Age 21 or Older
Value   Freq   Pct           Value   Freq   Pct
Yes       35   22%           Yes       27   12%
No       138   78%           No       202   88%
We can see that the percentage NOT wearing helmets was somewhat greater for older drivers (88% vs. 78%).
Alcohol usage by age group
Age 20 or Younger            Age 21 or Older
Value   Freq   Pct           Value   Freq   Pct
Yes       13    8%           Yes      125   48%
No       156   92%           No       138   52%
We can easily see that the older drivers were more likely to have been drinking.
The point of these examples is that the questions we asked were easily answered once we’d created frequency distributions. They’d have been much harder to answer with just the raw data.
Grouped Frequency Distribution
Definition: A list of equal-sized score groups ordered from the group with the largest scores in it down to the group with the smallest scores in it along with number of scores in each group.
Rules
1) Groups must be equal size.
2) The left label of each interval must be divisible without remainder by the interval width.
Example
ISS: Injury Severity Scores from the ATV data
ISS: A number representing how severe the injury is. Larger scores represent more severe injuries.
More than you ever wanted to know about the ISS . . .
From http://www.trauma.org/archive/scores/iss.html
Injury Severity Score
The Injury Severity Score (ISS) is an anatomical scoring system that provides an overall score for patients with multiple injuries.
Each injury is assigned an Abbreviated Injury Scale (AIS) score and is allocated to one of six body regions (Head, Face, Chest, Abdomen, Extremities (including Pelvis), External). Only the highest AIS score in each body region is used.
The 3 most severely injured body regions have their score squared and added together to produce the ISS score.
An example of the ISS calculation is shown here:
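As an illustration of the rule just described, here is a minimal sketch of the calculation. The patient's AIS scores are hypothetical (made up for this example, not taken from the ATV data), and the function name is an assumption. The sketch implements only the rule as stated above and does not handle any special cases the full scoring system may define.

```python
def injury_severity_score(highest_ais_by_region):
    """Sum the squares of the three largest region scores.

    `highest_ais_by_region` maps each body region to the highest AIS
    score recorded in that region.
    """
    top_three = sorted(highest_ais_by_region.values(), reverse=True)[:3]
    return sum(score ** 2 for score in top_three)


# Hypothetical patient: one (highest) AIS score per body region.
patient = {
    "Head": 3,
    "Face": 1,
    "Chest": 4,
    "Abdomen": 2,
    "Extremities": 3,
    "External": 1,
}

# The three most severely injured regions are Chest (4), Head (3), Extremities (3).
print(injury_severity_score(patient))  # 16 + 9 + 9 = 34
```

With AIS scores of 5 in three regions, the same rule gives 25 + 25 + 25 = 75.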
An attempt to create an Ungrouped Frequency Distribution of all the ISS scores in the ATV data. ARGH!!!!
iss  Frequency   Percent   Valid Percent   Cumulative Percent
75           1       .2           .2               .2
50           1       .2           .2               .4
45           1       .2           .2               .6
43           1       .2           .2               .8
35           1       .2           .2              1.0
34           1       .2           .2              1.2
33           1       .2           .2              1.4
30           1       .2           .2              1.6
29           5      1.0          1.0              2.6
27           2       .4           .4              3.0
26           7      1.4          1.4              4.4
25           8      1.6          1.6              6.0
24           5      1.0          1.0              7.0
22           7      1.4          1.4              8.4
21          13      2.6          2.6             11.0
20           2       .4           .4             11.4
19           6      1.2          1.2             12.6
18           4       .8           .8             13.4
17          25      5.0          5.0             18.4
16           9      1.8          1.8             20.2
14          26      5.2          5.2             25.4
13          30      6.0          6.0             31.4
12           6      1.2          1.2             32.6
11           5      1.0          1.0             33.6
10          36      7.2          7.2             40.8
9           67     13.4         13.4             54.2
8           22      4.4          4.4             58.6
6           13      2.6          2.6             61.2
5           79     15.8         15.8             77.0
4           72     14.4         14.4             91.4
3            1       .2           .2             91.6
2            9      1.8          1.8             93.4
1           33      6.6          6.6            100.0
Total      500    100.0        100.0
This just won’t work. The table will have to be too big.
Problem: The above is not an appropriate regular frequency distribution.
That’s because not all internal values are listed.
If ALL the possible score values were listed, here is what it would look like
ISS   Frequency
75        1
74        0
73        0
72        0
71        0
70        0
69        0
68        0
67        0
66        0
65        0
64        0
63        0
62        0
61        0
60        0
59        0
58        0
57        0
56        0
55        0
54        0
53        0
52        0
51        0
50        1
.
.
(Table continues)
The problem with this is that it’ll be much too tall.
Whenever a table would require more than about 10 lines, the data should be grouped.
A Grouped Frequency Distribution of the ISS Variable
ISS Interval   Frequency   Percent
70-79               1         0.2
60-69               0         0.0
50-59               1         0.2
40-49               2         0.4
30-39               4         0.8
20-29              49         9.8
10-19             147        29.4
 0- 9             296        59.2
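The grouping can be reproduced with a few lines of code. The sketch below is an illustration only, with made-up function and variable names. It takes the ISS frequencies from the SPSS output earlier in these notes, assigns each score to the interval whose left label is the score rounded down to a multiple of the width, and adds up the frequencies in each interval.

```python
# ISS value -> frequency, copied from the SPSS output earlier in these notes.
iss_frequencies = {
    75: 1, 50: 1, 45: 1, 43: 1, 35: 1, 34: 1, 33: 1, 30: 1,
    29: 5, 27: 2, 26: 7, 25: 8, 24: 5, 22: 7, 21: 13, 20: 2,
    19: 6, 18: 4, 17: 25, 16: 9, 14: 26, 13: 30, 12: 6, 11: 5, 10: 36,
    9: 67, 8: 22, 6: 13, 5: 79, 4: 72, 3: 1, 2: 9, 1: 33,
}


def grouped_frequency_distribution(frequencies, width):
    """Return (low, high, frequency, percent) rows, largest interval first."""
    lowest_label = (min(frequencies) // width) * width
    highest_label = (max(frequencies) // width) * width
    # Create every interval up front so empty intervals appear with frequency 0.
    groups = {low: 0 for low in range(highest_label, lowest_label - 1, -width)}
    for score, count in frequencies.items():
        groups[(score // width) * width] += count
    total = sum(frequencies.values())
    return [(low, low + width - 1, count, 100 * count / total)
            for low, count in groups.items()]


rows = grouped_frequency_distribution(iss_frequencies, width=10)
for low, high, count, percent in rows:
    print(f"{low:>2}-{high:<2} {count:>5} {percent:>6.1f}")
```

Because every left label is a multiple of the width, the intervals come out equal in size and contiguous.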
Extended rules for Grouped Frequency Distributions . . .
1. Number of groups: About 10 although you might have a few more or less.
2. Group width of 3, 5, 10, 20, 50, 100, etc.
I suggest that you try to use 3, 5, or 10 as the interval width. A large majority of data sets will fit one of those choices.
3. Group width is the same for ALL groups, including top group and bottom group.
4. Left-hand labels are divisible without remainder by width
Group size above is 10. 0, 10, 20, 30, 40, 50, 60, 70 are divisible without remainder by 10.
5. Groups are contiguous.
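One way to apply rules 1, 2, and 4 together is to count how many groups each candidate width would produce and pick the width that gives about 10. The following is a rough sketch under that reading of the rules; the function name is made up, and the score range 1 to 75 is the range of the ISS scores above.

```python
def number_of_groups(lowest_score, highest_score, width):
    """Count the intervals needed when left labels are multiples of width."""
    bottom_label = (lowest_score // width) * width
    top_label = (highest_score // width) * width
    return (top_label - bottom_label) // width + 1


# ISS scores in the ATV data run from 1 to 75.
for width in (3, 5, 10, 20):
    print(f"width {width:>2}: {number_of_groups(1, 75, width)} groups")
```

For the ISS data, widths of 3 and 5 need 26 and 16 groups, a width of 20 needs only 4, and a width of 10 needs 8, which is the closest to "about 10".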
Biderman's 201 Lecture Notes: Frequency Distributions - 8/27/2015
Some example frequency distributions
From the UTC Factbook: www.utc.edu -> About UTC -> Academic & Institutional Research -> Factbook (not Facebook) -> Fact Summary Sheet
Grades awarded in Psychology
Grade      #       %
A       1004    43.0
B        674    28.9
C        422    18.1
D        116     5.0
F        119     5.1
Stem & Leaf Displays (Not in Corty)
Stem & Leaf Display: An ordered representation of scores in which rows (the stems) represent score intervals and numbers within rows (the leaves) represent individual values.
The most straightforward such table is one representing two-digit scores. In this case, each row corresponds to the first digit of the numbers in it. Within each row, each number is represented by its last digit.
For example, consider the following two-digit values . . .
24 29 40 58 42 9 15 20 78 90 96 26 10 16 38 46 29 65 82 71 81 45 52 68 49 94
These would be represented in a stem & leaf display as follows . . .
Stems   Leaves
0       9
1       5 0 6
2       4 9 0 6 9
3       8
4       0 2 6 5 9
5       8 2
6       5 8
7       8 1
8       2 1
9       0 6 4
Usually, the leaves are ordered from smallest to largest within stems . . .
Stems   Ordered Leaves
0       9
1       0 5 6
2       0 4 6 9 9
3       8
4       0 2 5 6 9
5       2 8
6       5 8
7       1 8
8       1 2
9       0 4 6
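The construction can be written out as a short program. This is a minimal sketch for two-digit (or smaller) non-negative scores only, with a made-up function name. It splits each score into a stem (tens digit) and a leaf (ones digit) and lists the leaves in order within each stem.

```python
from collections import defaultdict

# The two-digit values from the example above.
scores = [24, 29, 40, 58, 42, 9, 15, 20, 78, 90, 96, 26, 10,
          16, 38, 46, 29, 65, 82, 71, 81, 45, 52, 68, 49, 94]


def stem_and_leaf(values):
    """Return display lines, one per stem, with leaves in increasing order."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 24 -> stem 2, leaf 4
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if it is empty.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


lines = stem_and_leaf(scores)
print("\n".join(lines))
```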
More than you ever wanted to know about stem and leaf displays. This page left in the notes for your reference but you will not be required to “split” the stems for this class.
For large samples, data analysts will “split” the stems so that the table won’t be too wide.
Consider the following scores
24 29 40 58 42 9 15 20 78 90 96 26 10 16 38 46 29 65 82 71 81 45 52 68 49 94
56 61 74 84 90 88 79 83 86 83 76 75 79 80 75 98 97 93 95 92 81 80 78 94 92 91
Here’s the stem and leaf display for them
Stems   Leaves
0       9
1       0 5 6
2       4 9 0 6 9
3       8
4       0 2 6 5 9
5       8 2 6
6       5 8 1
7       8 1 4 9 6 5 9 5 8
8       2 1 4 8 3 6 3 0 1 0
9       0 6 4 0 8 7 3 5 2 4 2 1
The “splitting” replaces each stem with two identical stem labels, one that contains all values that end in 0 thru 4 and the other that contains all values that end in 5 thru 9. So the above display, with leaves ordered, would be
Stems   Leaves
0
0       9
1       0
1       5 6
2       0 4
2       6 9 9
3
3       8
4       0 2
4       5 6 9
5       2
5       6 8
6       1
6       5 8
7       1 4
7       5 5 6 8 8 9 9
8       0 0 1 1 2 3 3 4
8       6 8
9       0 0 1 2 2 3 4 4
9       5 6 7 8
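For reference only (splitting is not required for this class), here is a sketch of how the split display could be produced by program from the 52 scores listed above. The function name is made up. Each stem gets two rows: the first holds leaves 0 thru 4 and the second holds leaves 5 thru 9.

```python
# The 52 scores listed above.
scores = [24, 29, 40, 58, 42, 9, 15, 20, 78, 90, 96, 26, 10,
          16, 38, 46, 29, 65, 82, 71, 81, 45, 52, 68, 49, 94,
          56, 61, 74, 84, 90, 88, 79, 83, 86, 83, 76, 75, 79,
          80, 75, 98, 97, 93, 95, 92, 81, 80, 78, 94, 92, 91]


def split_stem_and_leaf(values):
    """Return display lines with two rows per stem (leaves 0-4, then 5-9)."""
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        half = 0 if leaf <= 4 else 1  # 0 = lower half, 1 = upper half
        rows.setdefault((stem, half), []).append(leaf)
    lines = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for half in (0, 1):
            leaves = " ".join(str(leaf) for leaf in rows.get((stem, half), []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines


lines = split_stem_and_leaf(scores)
print("\n".join(lines))
```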
Graphic Representation
Bar graph / Bar charts (Corty p. 50)
A graph of bars in which each bar represents a different value and the length of the bar represents frequency of occurrence. Bars do not touch each other.
Used for nominal/categorical data – Gender, Handedness, College major, Religion, Letter grade
Gives same information as a Regular Frequency Distribution.
Mike – Open SPSS and demonstrate this.
Example from Employee Data.sav
Histograms (Corty p. 52)
A graph in which each value or interval of values is represented by a column whose length corresponds to frequency.
Columns may touch.
Mike - Demo this.
Used for Quantitative data: GPA, IQ, Height, Weight
Used for same data as a Grouped Frequency Distribution
Example from Employee Data.sav
Frequency Polygons (p. 54) (Start here on 9/2/15)
Classic Positively Skewed Distribution
A graph in which each value is represented by a point on an axis, and frequency of occurrence at each value is represented by the height of a line above the axis.
Note: SPSS’s Line graph does not create this display correctly. SPSS can only create an Idealized Frequency Polygon, created assuming that the data follow a normal distribution. (More on the normal distribution in a later chapter.)
Used for Quantitative data, typically for very large samples.
Idealized Frequency Polygon
Scores of 329 UTC students on a measure of Conscientiousness
Note that the idealized frequency polygon pretty much matches the observed histogram.
ACTComp scores of 4700+ UTC students.
Note that the idealized frequency polygon doesn’t match the histogram of actual scores very well.
This means that the observed data are not distributed as pictured by the idealized curve.
Idealized Frequency Polygon
Dot plots (Not covered in the text)
A graph in which each score is represented by a symbol placed at a location corresponding to the score’s value.
Used for quantitative data, often with paper and pencil.
If two or more scores have the same value, a “pile” of symbols at the location is created.
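The "pile of symbols" idea is easy to see in a plain-text version of a dot plot. The following is a small sketch with hypothetical scores (made up for illustration, not the PSY 1010 data) and a made-up function name. It prints one asterisk per score, stacked above the score's value.

```python
from collections import Counter

# Hypothetical scores, made up for illustration.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 8]


def text_dot_plot(values):
    """Return lines of a text dot plot: stacked '*' above an axis of values."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build the plot from the top row down; each column is 3 characters wide.
    for level in range(tallest, 0, -1):
        row = "".join(" * " if counts[value] >= level else "   "
                      for value in range(low, high + 1))
        lines.append(row.rstrip())
    axis = "".join(f"{value:^3}" for value in range(low, high + 1))
    lines.append(axis.rstrip())
    return lines


lines = text_dot_plot(scores)
print("\n".join(lines))
```

The value 5 occurs three times, so its pile is three symbols tall; the value 7 never occurs, so its column is empty.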
Following are dot plots of PSY 1010 test scores of 185 UTC students. (from Sebren data)
Menu sequence . . .
Scores on a Psych 1010 test of 185 students – Argh!! SPSS made the dots too big.
To edit the dot sizes, 1) double-click on the graph, 2) then right-click on a dot, and 3) change the size in the Properties Window. Following is the same plot with smaller dots . . .
Who did better overall on the test – females or males?
Answer: It seems as if males and females did about equally well.
The dots don’t even have to be dots - I use vertical lines for my final grade distributions . . .
Here’s an example of a paper and pencil dot plot.
Box and Whisker Plots (Not in Chapter 2)
A plot in the form of a rectangular box with whiskers at the top and bottom that represents 5 key quantities.
Example: Scores on PSY 1010 test. Labels on the plot:
Largest value
75th Percentile
50th Percentile
25th Percentile
Smallest non-outlying value
Possible outliers
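The quantities a box and whisker plot displays can be computed directly. The sketch below uses hypothetical scores (made up for illustration) and a made-up function name. Two assumptions are not from these notes: percentiles are computed with the "inclusive" method of Python's statistics module, and a score is flagged as a possible outlier if it lies more than 1.5 box lengths beyond the box, a common convention. Other software may use slightly different rules.

```python
from statistics import quantiles

# Hypothetical test scores, made up for illustration.
scores = [12, 35, 38, 40, 41, 42, 44, 45, 47, 48, 50]


def box_plot_quantities(values, fence_multiplier=1.5):
    """Return the quantities a box and whisker plot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # also called the interquartile range
    low_fence = q1 - fence_multiplier * box_length
    high_fence = q3 + fence_multiplier * box_length
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "smallest_non_outlier": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "largest_non_outlier": max(inside),
        "possible_outliers": sorted(v for v in values
                                    if v < low_fence or v > high_fence),
    }


summary = box_plot_quantities(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the box runs from 39 to 46 with the median at 42, the whiskers reach 35 and 50, and the score of 12 is flagged as a possible outlier.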
Box and whisker plots used to compare groups . . .
Eventually you’ll be able to see from such a comparison that
1) there is not a huge difference in “average” performance of males and females on this particular test.
2) Both distributions are slightly negatively skewed. (The bottom whiskers are longer than the top whiskers. See the next page.)
3) There were more females who did really poorly on the test than males who did really poorly. What’s that about?
Distribution Shapes
Symmetric Distribution
A distribution for which each tail is a mirror image of the other.
Unimodal and Symmetric (US) distributions
The US distribution is the statistician’s favorite. (Who says statisticians aren’t patriotic?)
Most statistical techniques have been developed for such distributions.
A roughly symmetric and unimodal distribution – conscientiousness scores of 300+ students.
Skewed Distributions
Chapter 9
Positively skewed distribution
A positively skewed distribution has a long tail pointing in the positive direction.
Example: Income or money-related scores
Dot plots of male and female salaries; box and whisker plots of the same salaries
From the text
Negatively skewed distributions
A negatively skewed distribution has a long tail pointing in the negative direction.
Examples:
Scores on an easy test.
Faked personality tests
A Dot plot of Conscientiousness scores of 166 students instructed to respond honestly.
Note that the distribution is symmetric.
Dot plot of conscientiousness scores of same 166 students, instructed to “fake good”.
When people are told to fake a personality test, they do a great job of increasing their scores from what they would be without faking.
Many of the students achieved the highest possible Conscientiousness score.
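The direction of skew can also be checked with a number rather than a picture. The following sketch computes one common measure of skewness (the average cubed deviation from the mean divided by the 1.5 power of the average squared deviation). This particular formula is an assumption, since the notes do not say which measure of skewness is used, and the three small data sets are hypothetical. Positive values indicate a long tail in the positive direction, negative values a long tail in the negative direction, and values near 0 a roughly symmetric distribution.

```python
from statistics import mean


def skewness(scores):
    """Moment-based skewness: positive = right tail, negative = left tail."""
    center = mean(scores)
    n = len(scores)
    second_moment = sum((x - center) ** 2 for x in scores) / n
    third_moment = sum((x - center) ** 3 for x in scores) / n
    return third_moment / second_moment ** 1.5


# Hypothetical data sets, made up for illustration.
incomes = [20, 22, 25, 27, 30, 90]       # one very large value
easy_test = [40, 90, 92, 95, 97, 100]    # one very small value
symmetric = [1, 2, 3, 4, 5]

for name, data in [("incomes", incomes), ("easy_test", easy_test),
                   ("symmetric", symmetric)]:
    print(f"{name}: skewness = {skewness(data):.2f}")
```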
Example of comparison of graphs - Effectiveness of playing Bridge as an enrichment activity
In a school system in the Midwest one hour per week was dedicated as an “enrichment” hour. Students were given a variety of activities in which they could participate.
A group of parents volunteered to teach students to play bridge during the hour.
The parents believed that playing bridge would lead to higher math scores than those of students who did other activities. Among other notables who have endorsed bridge as a worthwhile enrichment activity are Bill Gates, founder of Microsoft, and Warren Buffett, successful investor.
The results for the first measurement of math skills after about 6 months of activity
Inspection of the two dot plots suggests that there is no huge difference in math achievement test scores between the Bridge group and the Control group. Statistical tests we’ll learn about later in the semester confirmed that there is no statistically significant difference between the groups.
Conclusion: Playing bridge during the one-hour “enrichment” period did not result in higher math achievement scores than doing other activities.
Figures from the Employee Data.sav examples:
Bar graph: Count by Employment Category (Clerical, Custodial, Manager).
Histogram: Frequency by Beginning Salary. Mean = $17,016.09, Std. Dev. = $7,870.638, N = 474.