What Types Of Data Are Collected?
“Continuous” Data
“Categorical” Data
What Kinds Of Question Can Be Asked Of Those Data?
Questions That Require Us To Describe Single Features of the Participants:
How tall are class members, on average?
How many hours a week do class members report that they study?
…. ?
How many members of the class are women?
What proportion of the class is full-time?
…. ?
Questions That Require Us To Examine Relationships Between Features of the Participants:
Do people who say they study for more hours also think they’ll finish their doctorate earlier?
Are computer literates less anxious about statistics?
…. ?
Are men more likely to study part-time?
Are women more likely to enroll in CCE?
…. ?
Research Is A Partnership Of Questions And Data
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 1
S010Y: Answering Questions with Quantitative Data Class 8/III.1: Displaying and Summarizing Continuous Data
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 2
It is more difficult to summarize the sample distribution of a continuous variable, like MAT score, than it is to summarize the sample distribution of a categorical variable, because the sample distributions of continuous variables have so many interesting properties, including:
• The “center” or “location” of the batch.
• The “spread” of the batch.
• The “one-sidedness” of the batch.
• The “peakiness” of the batch.
We have distinguished two broad approaches for creating statistical summaries of these properties:
Approach #1, based on the ordering of data values: median, quartiles, percentiles, inter-quartile range, …
Approach #2, based on the arithmetic manipulation of data values: mean, standard deviation, skewness, kurtosis, …
Last time, I focused on generating such summaries using the ordering principle.
Today, I’ll focus on generating summaries using the arithmetic manipulation principle.
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 3
Let’s use the arithmetic principle to develop a statistic for describing the center of the distribution of the values of a continuous variable like MAT score … for the “Early, Elsewhere” batch, for instance …
[Figure: stem-and-leaf display of the MAT scores in the “Early, Elsewhere” batch, with stems 9 8 7 6 5 4 3 2 1]
A good summary statistic for describing the center of a distribution of the values of a continuous variable is the place where the distribution would need to be supported so that it could “balance.”
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 4
A good summary statistic for describing the center of the distribution of the values of a continuous variable, like MAT score, is the place where the distribution must be supported for it to balance.
[Figure: the stem-and-leaf display of the batch, with stems 9 8 7 6 5 4 3 2 1, supported at its balance point]
Known as the sample mean, or average.
\[
\text{Balance point} = \frac{\text{Add up all the values}}{\text{Number of values}} = \frac{83 + \dots + 50 + 47 + 39 + 21}{19} = \frac{1205}{19} = 63.4
\]
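As a minimal sketch of the balance-point idea, the Python snippet below adds up a batch of scores, divides by their number, and then checks that the distances above and below the result cancel out. It uses the first ten MAT scores from the data listing rather than the 19-case batch in the figure (whose individual values are not all legible in this transcript), so its mean is not 63.4. The variable names are illustrative assumptions.

```python
# First ten MAT scores from the data listing: an illustrative subset,
# not the 19-case "Early, Elsewhere" batch drawn on the slide.
scores = [64, 54, 93, 82, 75, 72, 59, 76, 38, 73]

# Balance point: add up all the values, then divide by the number of values.
mean = sum(scores) / len(scores)

# The batch "balances" at the mean: the signed distances from it cancel out.
deviations = [x - mean for x in scores]

print(f"mean = {mean:.1f}")
print(f"sum of deviations = {sum(deviations):.6f}")
```

Because the signed distances cancel, their total is zero up to floating-point rounding, whatever batch of scores is used.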
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 5
Let’s use the arithmetic principle to create a summary statistic for describing the spread of the distribution of values of a continuous variable … how about the “average distance from the center”?
[Figure: the stem-and-leaf display of the batch, with stems 9 8 7 6 5 4 3 2 1]
Why don’t we just find the average distance of all the “blocks” from the center?
\[
\text{Average distance of the “blocks” from the center} = \frac{\text{Add all the distances of the “blocks” from the center}}{\text{Number of blocks} - 1}
\]
\[
= \frac{(83 - 63.4) + (83 - 63.4) + \dots + (39 - 63.4) + (21 - 63.4)}{19 - 1} = \frac{(19.6) + (19.6) + \dots + (-24.4) + (-42.4)}{18} = \frac{0}{18} = 0
\]
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 6
When you sum, everything goes to zero, so what do we do now …?
[Figure: the stem-and-leaf display of the batch, with stems 9 8 7 6 5 4 3 2 1]
Let’s do what we’ve done before: square all the distances before averaging?
\[
\text{Average squared distance of the “blocks” from the center} = \frac{\text{Add the squared distances of all blocks from the center}}{\text{Number of blocks} - 1}
\]
\[
= \frac{(83 - 63.4)^2 + (83 - 63.4)^2 + \dots + (39 - 63.4)^2 + (21 - 63.4)^2}{19 - 1} = \frac{(19.6)^2 + (19.6)^2 + \dots + (-24.4)^2 + (-42.4)^2}{18} = \frac{5026.64}{18} = 279.26
\]
Now I guess we should take the square root, to reverse the squaring that we did to begin with?
Let’s call this the standard deviation.
\[
\sqrt{279.26} = 16.7
\]
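The same three steps (square the distances, average them over the number of blocks minus one, take the square root) can be sketched in Python. This is only an illustration under stated assumptions: it uses the first ten MAT scores from the data listing, not the 19-case batch above, so the result is not 16.7. The cross-check uses `statistics.stdev` from the standard library, which also divides by n − 1.

```python
import math
import statistics

# First ten MAT scores from the data listing (illustrative subset only).
scores = [64, 54, 93, 82, 75, 72, 59, 76, 38, 73]
n = len(scores)
mean = sum(scores) / n

# Step 1: square each distance from the center so that nothing cancels.
squared_distances = [(x - mean) ** 2 for x in scores]

# Step 2: average the squared distances, dividing by (number of blocks - 1).
variance = sum(squared_distances) / (n - 1)

# Step 3: take the square root to undo the squaring.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}, variance = {variance:.2f}, sd = {std_dev:.2f}")
print(f"library check: sd = {statistics.stdev(scores):.2f}")
```

Dividing by n − 1 rather than n follows the slide's formula, and it is the convention that makes the hand computation agree with the library value.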
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 7
And so, creating summary statistics based on the arithmetic principle, here’s the story so far …
[Figure: the stem-and-leaf display of the batch, with stems 9 8 7 6 5 4 3 2 1, marked at the mean (63.4) and at 1 standard deviation below (46.7) and above (80.1) the mean]
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 8
You don’t have to do all these computations by hand: SAS can do them for you.
Here are the MAT data you worked with, supplemented by data from the 1987 cohort, all in the MAT.txt dataset:

1 01 1 64 2
1 02 1 54 2
1 03 1 93 2
1 04 1 82 2
1 05 1 75 2
1 06 1 72 2
1 07 1 59 2
1 08 1 76 2
1 09 1 38 2
1 10 1 73 2
1 11 1 88 2
1 12 1 50 1
1 13 1 96 1
1 14 1 66 1
1 15 1 93 1
1 16 1 63 1
(74 cases omitted)

The columns, from left to right:
• Entering cohort: 1 = 1987, 2 = 1989
• ID label
• When the test was received in the Admissions Office: 1 = Early, 2 = Late
• Raw MAT score
• Location of test site: 1 = Harvard, 2 = Elsewhere
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 9
Here’s a PC-SAS program to provide descriptive univariate statistics on these data … Handout C08_1

OPTIONS Nodate Pageno=1;
TITLE1 'S010Y: Answering Questions with Quantitative Data';
TITLE2 'Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I';
TITLE3 'MAT Scores from 2 Years of Doctoral Applicants';
TITLE4 'Data in MAT.txt';

*-----------------------------------------------------------------------------*
 Input data, name and label variables in dataset
*-----------------------------------------------------------------------------*;
DATA MAT;
   INFILE 'C:\DATA\S010Y\MAT.txt';
   INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
   LABEL ID       = 'Case identification number'
         YEARTEST = 'Year test taken'
         WHENRECD = 'When application received'
         MATSCOR  = 'Millers Analogies Test Score'
         TESTSITE = 'Test site';

*-----------------------------------------------------------------------------*
 Format labels for values of categorical variables
*-----------------------------------------------------------------------------*;
PROC FORMAT;
   VALUE YEARFMT 1='1987' 2='1989';
   VALUE WHENFMT 1='Early' 2='Late';
   VALUE SITEFMT 1='Harvard' 2='Elsewhere';

DATA step: standard data input statements. Notice that there are several other variables in the dataset.
PROC FORMAT step: the usual process of formatting the categorical variables.
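For readers who want to see the input and formatting steps outside SAS, here is a rough Python sketch of the same two ideas: read whitespace-separated records in the column order given by the INPUT statement, then attach value labels to the coded categorical variables. The four records are copied from the MAT.txt excerpt; the function names `read_records` and `describe` are hypothetical, and this is not a substitute for the SAS program.

```python
# A few records copied from the MAT.txt excerpt. The column order follows the
# INPUT statement: YEARTEST ID WHENRECD MATSCOR TESTSITE.
RAW = """\
1 01 1 64 2
1 02 1 54 2
1 12 1 50 1
1 13 1 96 1
"""

COLUMNS = ["YEARTEST", "ID", "WHENRECD", "MATSCOR", "TESTSITE"]

# Value labels, mirroring the three VALUE statements in the PROC FORMAT step.
YEARFMT = {1: "1987", 2: "1989"}
WHENFMT = {1: "Early", 2: "Late"}
SITEFMT = {1: "Harvard", 2: "Elsewhere"}


def read_records(text):
    """Turn each non-blank line into a dict keyed by variable name."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = [int(field) for field in line.split()]
        records.append(dict(zip(COLUMNS, values)))
    return records


def describe(record):
    """Render one case with its coded values replaced by their labels."""
    return (
        f"case {record['ID']}: {YEARFMT[record['YEARTEST']]}, "
        f"{WHENFMT[record['WHENRECD']]}, {SITEFMT[record['TESTSITE']]}, "
        f"MAT = {record['MATSCOR']}"
    )


records = read_records(RAW)
for record in records:
    print(describe(record))
```

As in SAS, the stored values stay numeric; the labels are applied only when a case is displayed.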
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 10
And here’s the rest of the PC-SAS program … this part provides the requested univariate descriptive statistics ...

*--------------------------------------------------------------------------*
 Data Listing
*--------------------------------------------------------------------------*;
PROC PRINT LABEL DATA=MAT;
   TITLE5 'Listing of MAT Scores & Background Variables for all Applicants';
   VAR ID YEARTEST WHENRECD TESTSITE MATSCOR;
   FORMAT YEARTEST YEARFMT. WHENRECD WHENFMT. TESTSITE SITEFMT.;

*--------------------------------------------------------------------------*
 Displaying and summarizing the MAT scores for the whole sample
*--------------------------------------------------------------------------*;
PROC UNIVARIATE PLOT DATA=MAT;
   TITLE5 'Univariate Descriptive Summaries of MAT Score for all Applicants';
   VAR MATSCOR;
   ID ID;
RUN;

PROC PRINT: printing, titling and formatting a few cases for inspection.
PROC UNIVARIATE provides all kinds of univariate (“single variable”) descriptive statistics for continuous variables.
The PLOT option requests various data plots, including the stem-and-leaf plot.
The VAR statement specifies the continuous variable to be summarized.
The ID statement identifies a variable that contains respondent identifying information.
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 11
Here’s a listing of a few cases from the dataset …

S010Y: Answering Questions with Quantitative Data
Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I
MAT Scores from 2 Years of Doctoral Applicants
Data in MAT.txt
Listing of MAT Scores and Background Variables for all Applicants

         Case          Year   When                     Millers
       identification  test   application              Analogies
 Obs     number        taken  received     Test site   Test Score
   1        1          1987   Early        Elsewhere      64
   2        2          1987   Early        Elsewhere      54
   3        3          1987   Early        Elsewhere      93
   4        4          1987   Early        Elsewhere      82
   5        5          1987   Early        Elsewhere      75
   6        6          1987   Early        Elsewhere      72
   7        7          1987   Early        Elsewhere      59
   8        8          1987   Early        Elsewhere      76
   9        9          1987   Early        Elsewhere      38
  10       10          1987   Early        Elsewhere      73
   .
   .
  83       83          1989   Late         Elsewhere      55
  84       84          1989   Late         Harvard        72
  85       85          1989   Late         Elsewhere      32
  86       86          1989   Late         Elsewhere      53
  87       87          1989   Late         Elsewhere      76
  88       88          1989   Late         Elsewhere      62
  89       89          1989   Late         Elsewhere      78
  90       90          1989   Late         Elsewhere      54

Each row is a case, as usual.
[Photo: Harvard graduation, 1890. The six class day speakers, with W.E.B. Du Bois on the far right.]
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 12
And the “ordering” and “arithmetic manipulation” summary statistics for MATSCOR are …

Variable: MATSCOR (Millers Analogies Test Score)

                        Moments
N                        90    Sum Weights               90
Mean             63.3888889    Sum Observations        5705
Std Deviation    18.6924815    Variance          349.408864
Skewness         -0.5406701    Kurtosis           -0.320241

              Basic Statistical Measures
    Location                  Variability
Mean     63.38889     Std Deviation          18.69248
Median   65.00000     Variance              349.40886
Mode     62.00000     Range                  78.00000
                      Interquartile Range    24.00000

      Quantiles
Quantile      Estimate
100% Max          96.0
99%               96.0
95%               90.0
90%               85.5
75% Q3            77.0
50% Median        65.0
25% Q1            53.0
10%               35.0
5%                27.0
1%                18.0
0% Min            18.0

The sample mean of MATSCOR is 63.39.
The sample standard deviation of MATSCOR is 18.69.
The median (or 50th percentile) of MATSCOR is 65.
The inter-quartile range is the difference between the upper and lower quartiles:
• Lower quartile = 53
• Upper quartile = 77
• Inter-quartile range = (77-53) = 24
The range is the difference between the minimum and the maximum:
• Minimum = 18
• Maximum = 96
• Range = (96-18) = 78
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 13
Here’s SAS’s version of the stem-and-leaf plot for the values of MATSCOR …

Millers Analogies Test Score

 Stem Leaf                     #
    9 6                        1
    9 00333                    5
    8 5689                     4
    8 222334                   6
    7 556667788899            12
    7 0011122223344           13
    6 55669                    5
    6 000122223444            12
    5 556899                   6
    5 00333444                 8
    4 57                       2
    4 022                      3
    3 55889                    5
    3 124                      3
    2 7                        1
    2 114                      3
    1 8                        1
      ----+----+----+----+
 Multiply Stem.Leaf by 10**+1
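This is scientific notation:
\[
\begin{aligned}
10\text{**}1 &= 10^{1} = 10 \\
10\text{**}2 &= 10^{2} = 100 \\
10\text{**}3 &= 10^{3} = 1000 \\
10\text{**}4 &= 10^{4} = 10000
\end{aligned}
\]
And don’t forget the inverses …
\[
\begin{aligned}
10\text{**}{-1} &= 10^{-1} = \frac{1}{10^{1}} = \frac{1}{10} = 0.1 \\
10\text{**}{-2} &= 10^{-2} = \frac{1}{10^{2}} = \frac{1}{100} = 0.01 \\
10\text{**}{-3} &= 10^{-3} = \frac{1}{10^{3}} = \frac{1}{1000} = 0.001
\end{aligned}
\]
So, in the plot, a stem of 1 with a leaf of 8 means \(1.8 \times 10^{1} = 18\), etc.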
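Because every leaf in SAS's display is one case, the full set of 90 scores can be read back from the plot and the reported summaries recomputed. The Python sketch below does this. The stems and leaves are copied from the display; the `percentile` helper is a hypothetical function written on the assumption that the quantiles follow SAS's default percentile definition (PCTLDEF=5), which averages the two neighbouring values when n × p/100 is a whole number.

```python
import statistics

# Stems and leaves copied from the SAS display; each leaf digit is one case.
STEM_LEAF = [
    (9, "6"), (9, "00333"), (8, "5689"), (8, "222334"),
    (7, "556667788899"), (7, "0011122223344"), (6, "55669"),
    (6, "000122223444"), (5, "556899"), (5, "00333444"),
    (4, "57"), (4, "022"), (3, "55889"), (3, "124"),
    (2, "7"), (2, "114"), (1, "8"),
]

# "Multiply Stem.Leaf by 10**+1": stem 1 with leaf 8 is 1.8 x 10 = 18.
scores = sorted(
    10 * stem + int(leaf)
    for stem, leaves in STEM_LEAF
    for leaf in leaves
)


def percentile(sorted_values, pct):
    """Percentile for a whole-number pct, assuming SAS's default definition."""
    if not 0 < pct < 100:
        raise ValueError("pct must lie strictly between 0 and 100")
    j, remainder = divmod(len(sorted_values) * pct, 100)
    if remainder == 0:
        # n*pct/100 is a whole number: average the j-th and (j+1)-th values.
        return (sorted_values[j - 1] + sorted_values[j]) / 2
    # Otherwise take the (j+1)-th ordered value (index j, counting from 0).
    return sorted_values[j]


# Arithmetic-manipulation summaries.
print("N =", len(scores), " sum =", sum(scores))
print(f"mean = {statistics.mean(scores):.2f}")
print(f"sd = {statistics.stdev(scores):.2f}")

# Ordering summaries.
q1, median, q3 = (percentile(scores, p) for p in (25, 50, 75))
print("Q1 =", q1, " median =", median, " Q3 =", q3, " IQR =", q3 - q1)
print("range =", scores[-1] - scores[0])
```

If the leaves have been copied correctly, the count and sum should agree with the N and Sum Observations lines of the PROC UNIVARIATE output.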
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 14
We can bring several of these univariate descriptive statistics, both the “ordering” and “arithmetic manipulation” versions, together in a useful single summary figure called the “box and whisker” plot, or boxplot …
Recall that, for the full sample (n=90) …
Minimum, Maximum, & Range:
• Min = 18
• Max = 96
• Range = 78
Quartiles, Median & Inter-Quartile Range:
• 25th %ile (Q1) = 53
• Median = 65
• 75th %ile (Q3) = 77
• Interquartile Range = 24
Mean:
• Mean = 63.4
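These few numbers are all that a boxplot needs. As a hedged sketch, the Python snippet below derives the box and the whisker limits from them. It assumes the common convention that whiskers stop at the most extreme observation within 1.5 inter-quartile ranges of the box; the slides do not say which rule SAS applies, so treat the multiplier as an assumption.

```python
# Summary values quoted above for the full sample (n = 90).
minimum, q1, median, q3, maximum = 18, 53, 65, 77, 96
mean = 63.4

iqr = q3 - q1  # height of the box

# Assumed convention: points beyond 1.5 x IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# If no observation lies beyond a fence, the whisker reaches the extreme value.
has_low_outliers = minimum < lower_fence
has_high_outliers = maximum > upper_fence

print(f"box from {q1} to {q3}, median line at {median}, mean marked at {mean}")
print(f"fences at {lower_fence} and {upper_fence}")
print(f"outliers below: {has_low_outliers}, above: {has_high_outliers}")
```

Under that assumption neither extreme lies beyond a fence, so both whiskers would run all the way to the minimum and the maximum.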
[Figure: box-and-whisker plot of MATSCOR, drawn against a vertical axis running from 10 to 100]
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 15
And here’s the PROC UNIVARIATE version of the box-plot from the previous handout …

The UNIVARIATE Procedure
Variable: MATSCOR (Millers Analogies Test Score)

 Stem Leaf                     #  Boxplot
    9 6                        1     |
    9 00333                    5     |
    8 5689                     4     |
    8 222334                   6     |
    7 556667788899            12  +-----+
    7 0011122223344           13  |     |
    6 55669                    5  *-----*
    6 000122223444            12  |  +  |
    5 556899                   6  |     |
    5 00333444                 8  +-----+
    4 57                       2     |
    4 022                      3     |
    3 55889                    5     |
    3 124                      3     |
    2 7                        1     |
    2 114                      3     |
    1 8                        1     |
      ----+----+----+----+
 Multiply Stem.Leaf by 10**+1

What would the box-plot look like if the sample distribution of MATSCOR were perfectly symmetrical?
What would the box-plot look like if there were very little variability in MATSCOR in the sample?
What features of the sample distribution of MATSCOR account for the fact that the sample mean is smaller than the sample median?
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 16
An interesting aside on the normal distribution …
A considerable number of continuous variables that occur “naturally” turn out to be “normally distributed”: height, weight, test scores, opinions, etc.
If you were to plot a vertical histogram of the values of variables like these, you would get the familiar “bell-shaped curve” …
[Figure: ball-drop simulation]
There is a special relationship between percentiles and standard deviation in a normal distribution.
[Figure: normal distribution simulation, marked at Mean-2sd, Mean-1sd, Mean, Mean+1sd and Mean+2sd]
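The relationship between percentiles and standard deviations in a normal distribution can be made concrete with a short calculation. The Python sketch below uses `statistics.NormalDist` from the standard library; the helper name `share_within` is an illustrative assumption, and the numbers describe an idealized normal curve, not the MAT sample.

```python
from statistics import NormalDist

# A standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of a normal distribution within k sd of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2):
    print(f"within {k} sd of the mean: {share_within(k):.1%}")

# Percentile rank of each point marked on the figure.
for k in (-2, -1, 0, 1, 2):
    print(f"mean {k:+d} sd sits at percentile {100 * standard_normal.cdf(k):.1f}")
```

Roughly two-thirds of a normal distribution lies within one standard deviation of the mean and about 95% within two, which is why the standard deviation works as a yardstick for percentiles when a variable is close to normal.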
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 17
The boxplot is very useful if you want to compare sample distributions of a continuous variable like MATSCOR across different groups, as in Activity #1 (see Handout C08_2) …

OPTIONS Nodate Pageno=1;
TITLE1 'S010Y: Answering Questions with Quantitative Data';
TITLE2 'Class 8/Handout 2: Displaying and Summarizing Continuous Data, Part II';
TITLE3 'Using Boxplots To Compare MAT Scores of Doctoral Applicants to APSP';
TITLE4 'Data in MAT.txt';

*-----------------------------------------------------------------------------*
 Input data, name and label variables in dataset
*-----------------------------------------------------------------------------*;
DATA MAT;
   INFILE 'C:\DATA\S010Y\MAT.txt';
   INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
   IF YEARTEST = 2;  * Pick out 1989 Cohort for comparison with Activity #1;
   LABEL ID       = 'Case identification number'
         YEARTEST = 'Year test taken'
         WHENRECD = 'When application received'
         MATSCOR  = 'Millers Analogies Test Score'
         TESTSITE = 'Test site';

*-----------------------------------------------------------------------------*
 Format labels for the values of the categorical variables
*-----------------------------------------------------------------------------*;
PROC FORMAT;
   VALUE WHENFMT 1='Early' 2='Late';
   VALUE SITEFMT 1='Harvard' 2='Elsewhere';

Let’s use categorical variables WHENRECD and TESTSITE to sub-divide the sample, so that we can compare sub-sample distributions of MATSCOR using boxplots … like original Activity #1.
Here, I’ve picked out only applicants in the 1989 (YEARTEST = 2) cohort, so that the new analyses will match the analyses that you conducted in original Activity #1.
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 18
And here’s the rest of the PC-SAS program …

*-----------------------------------------------------------------------------*
 Comparing Distributions of MAT scores across groups of testees
*-----------------------------------------------------------------------------*;
PROC SORT DATA=MAT;
   BY TESTSITE WHENRECD;

PROC UNIVARIATE PLOT DATA=MAT;
   TITLE5 'Sample Distributions of MAT Scores, by Test Site and Week Received';
   VAR MATSCOR;
   BY TESTSITE WHENRECD;
   FORMAT TESTSITE SITEFMT. WHENRECD WHENFMT.;
RUN;

To split the sample, first you need to sort it by the categorical variables of interest. Here, I have sorted first by TESTSITE and then by WHENRECD.
So, the data will be ordered by “Early” and “Late” within an ordering by “Harvard” and “Elsewhere”. The new analyses should therefore have an ordering that matches the ordering in Activity #1.
Here’s the usual use of PROC UNIVARIATE to generate “single variable” summary statistics for MATSCOR, with the PLOT option exercised.
To obtain standard PROC UNIVARIATE analyses for the separate subgroups defined by TESTSITE and WHENRECD, use the “BY” statement (you’ve seen it used before in the categorical data-analysis part of the module). When the “BY” statement is used along with the “PLOT” option, an interesting “stacking” of the boxplots occurs (see later).
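The sort-then-BY pattern has a close analogue in Python's `itertools.groupby`, which, like a SAS BY statement, only groups rows that are already adjacent. The sketch below is an illustration, not the course's method: it takes the last eight cases shown in the data listing (1989 cohort, cases 83 to 90), sorts them by the coded TESTSITE and WHENRECD values, and summarizes MATSCOR within each subgroup. The names `by_key` and `summaries` are hypothetical.

```python
from itertools import groupby
from statistics import mean

# Cases 83-90 from the data listing, as (TESTSITE, WHENRECD, MATSCOR) codes.
# TESTSITE: 1 = Harvard, 2 = Elsewhere.  WHENRECD: 1 = Early, 2 = Late.
cases = [
    (2, 2, 55), (1, 2, 72), (2, 2, 32), (2, 2, 53),
    (2, 2, 76), (2, 2, 62), (2, 2, 78), (2, 2, 54),
]

SITEFMT = {1: "Harvard", 2: "Elsewhere"}
WHENFMT = {1: "Early", 2: "Late"}


def by_key(case):
    """Grouping key: test site first, then when the test was received."""
    site, when, _score = case
    return (site, when)


# Like PROC SORT: rows must be ordered by the BY variables before grouping.
cases.sort(key=by_key)

summaries = {}
for key, group in groupby(cases, key=by_key):
    scores = [score for _site, _when, score in group]
    summaries[key] = (len(scores), mean(scores))

for (site, when), (n, avg) in summaries.items():
    print(f"{SITEFMT[site]}, {WHENFMT[when]}: n = {n}, mean = {avg:.1f}")
```

Sorting on the numeric codes puts Harvard (1) before Elsewhere (2), which is the same ordering the SAS BY-group output would follow. With only eight cases these subgroup means are illustrative and should not be read as the results of the full analysis.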
© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 19
Conclusions?
Mean scores of those who took the MAT test at Harvard are generally higher than the mean scores of applicants who took the test elsewhere.
• Why? Perhaps applicants who took the test at Harvard were already Master’s students here, and were therefore already a highly selected sample.
• Perhaps the mean scores of those taking the test elsewhere were lower because the sample of folk taking the test was much more inclusive of all members of the general population?
The sample distribution of MAT scores is less spread out for those who took the test at Harvard:
• Perhaps this further indicates that Harvard test takers were a selected group, maybe the top tail of the general population.
The scores of applicants who took the test elsewhere are more spread out, in general, than those who took the test at Harvard:
• Interestingly, the sample distribution of the “early, elsewhere” group looks a little similar to that of those who took the test at Harvard, but the distribution has a long lower tail.
• Perhaps there is still some self-selection going on here, with more highly motivated (and therefore “self-selected”) folk tending to apply early.
• Perhaps the long lower tail is a few folk, like foreign students, who found the test difficult because it was in English?
Those who took the test elsewhere and applied late had a lower mean, a larger spread, and the distribution was very symmetric:
• Most like a sample drawn from the general population?
• Perhaps those who took the test elsewhere and submitted a late application were busy with work, like everyone else in the general population, and they just found it hard to get to the post office on time?