Chapter 1
Data Collection
Section 1.1 Introduction to the Practice of Statistics
Statistics
The science of statistics is
● Collecting
● Organizing
● Summarizing
● Analyzing
information to draw conclusions or answer questions
Statistics provides a measure of confidence in any conclusion
Data
● Solve 3x + 5 = 11
  ● Everyone (should) get the same answer
● How long was your drive (or walk) to class today?
  ● Different answers… this is why we need statistics!
  ● We can then break the data down into meaningful information
Statistics and mathematics have similarities but are different
● Mathematics
  ● Solves problems with 100% certainty
  ● Has only one correct answer
● Statistics, because of variability
  ● Does not solve problems with 100% certainty (95% certainty is much more common)
  ● Frequently has multiple reasonable answers
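Population vs. Sample
● A population (its numerical summaries are usually written with Greek letters, such as μ)
  ● Is the group to be studied
  ● Includes all of the individuals in the group
● A sample (its numerical summaries are usually written with Roman letters, such as x̄)
  ● Is a subset of the population
  ● Is often used in analyses because getting access to the entire population is impractical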
Population vs. Sample
● Population example: people 18 years and older
● Sample example: students at SHU 18 and older
Parameter vs. Statistic
● A statistic is a numerical summary of the sample
  ● Descriptive statistics organize and summarize the data in ways such as tables and graphs
  ● Inferential statistics take the sample results and extend them to the population, and measure the reliability of the results
● A parameter is a numerical summary of a population
Example
● Suppose the actual percentage of all students at SHU who own a car is 48.2%
  This is a ________________________
● We surveyed 100 students and found that 46% own a car
  This is a _________________________
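The difference between the two numbers in this example can be seen in a small simulation. This is only a sketch: the 48.2% ownership rate comes from the example, but the population size of 10,000 and the variable names are assumptions made for illustration.

```python
import random

# Hypothetical population: 10,000 students, 48.2% of whom own a car.
# The 48.2% comes from the example; the size 10,000 is an assumption.
N = 10_000
owners = 4_820
population = [True] * owners + [False] * (N - owners)

# Numerical summary of the WHOLE population (a parameter).
parameter = sum(population) / N

# Numerical summary of a SAMPLE of 100 students (a statistic).
sample = random.sample(population, 100)
statistic = sum(sample) / len(sample)

print(f"population proportion (parameter): {parameter:.3f}")
print(f"sample proportion (statistic):     {statistic:.2f}")
```

Running it several times should give the same population proportion every time, while the sample proportion changes from run to run. That variability is what inferential statistics has to account for.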
The Process of Statistics
● Identify the research objective: what do you want answered?
● Collect the data needed to answer the question: usually a sample (1.2 – 1.6)
● Describe the data: the descriptive statistics (ch. 2 – 4)
● Perform inference: use appropriate techniques to test reliability for the population (ch. 9 – 12)
Variables
● Characteristics of the individuals under study are called variables
● Some variables have values that are attributes or characteristics … those are called qualitative or categorical variables
● Some variables have values that are numeric measurements … those are called quantitative variables
Qualitative Variables
● Examples of qualitative variables
  ● Gender
  ● Zip code
  ● Blood type
  ● States in the United States
  ● Brands of televisions
● Qualitative variables have category values … those values cannot be added, subtracted, etc.
Quantitative Variables
● Examples of quantitative variables
  ● Temperature
  ● Height and weight
  ● Sales of a product
  ● Number of children in a family
  ● Points achieved playing a video game
● Quantitative variables have numeric values … those values can be added, subtracted, etc.
Discrete vs. Continuous
● Quantitative variables can be either discrete or continuous
● Discrete variables
  ● Variables that have a finite or a countable number of possibilities
  ● Frequently variables that are counts
● Continuous variables
  ● Variables that have an infinite but not countable number of possibilities
  ● Frequently variables that are measurements
Discrete Variables
● Examples of discrete variables
  ● The number of heads obtained in 5 coin flips
  ● The number of cars arriving at a McDonald’s between 12:00 and 1:00
  ● The number of students in class
  ● The number of points scored in a football game
● The possible values of discrete variables can be listed
Continuous Variables
● Examples of continuous variables
  ● The distance that a particular model car can drive on a full tank of gas
  ● Heights of college students
● Sometimes the variable is discrete but has so many close values that it could be considered continuous
  ● The number of DVDs rented per year at video stores
  ● The number of ants in an ant colony
Section 1.2 Observational Studies Versus Designed Experiments
Observational Study
● A survey sample is an example of an observational study
  ● An observational study is one where there is no attempt to influence the value of the variable
  ● An observational study is also called an ex post facto (after the fact) study
● Advantages
  ● It can detect associations between variables
● Disadvantages
  ● It cannot isolate causes to determine causation
Designed Experiment
● A designed experiment is an experiment
  ● That applies a treatment to individuals
  ● That often compares the treated group to a control (untreated) group
  ● Where the variables can be controlled
● Advantages
  ● Can analyze individual factors
● Disadvantages
  ● Cannot be done when the variables cannot be controlled
  ● Cannot be applied in some cases for moral / ethical reasons
Lurking & Confounding Variables
● A danger in observational studies is confounding and lurking variables
● Confounding variables: in an observational study, two explanatory variables can be linked, so the observed relation to the response may be due to another variable not accounted for
● Lurking variables: variables not initially considered in the study that affect the response variable
● Associated does not mean that one causes the other
● A simple observational study may find that smoking and cancer are associated
  ● From that study alone, we cannot conclude that smoking causes cancer
  ● From that study alone, we cannot conclude that cancer causes people to smoke
● What are some lurking variables with smoking and cancer?
Types of Observational Studies
● Cross-sectional
● Case-control
● Cohort
Cross-sectional Studies
Observational studies that collect information about individuals at a specific point in time, or over a very short period of time.
Case-control Studies
These studies are retrospective, meaning that they require individuals to look back in time or require the researcher to look at existing records. In case-control studies, individuals that have certain characteristics are matched with those that do not.
Cohort Studies
A cohort study first identifies a group of individuals to participate in the study (the cohort). The cohort is then observed over a period of time, and characteristics of the individuals are recorded during that period. Because the data are collected over time, cohort studies are prospective.
Census
● A census is a list
  ● Of all the individuals in a population
  ● That records the characteristics of the individuals
  ● An example is the US Census held every 10 years (this is only an example though)
● Advantages
  ● Answers have 100% certainty
● Disadvantages
  ● May be difficult or impossible to obtain
  ● Costs may be prohibitive
Section 1.3 Simple Random Sampling
Simple Random Sample
● A simple random sample is obtained when every possible sample of size n from a population of size N has an equal chance of being selected
Let’s Try It!
● 5 volunteers…
● A simple (but not foolproof) method
  ● Write each individual’s name on a separate piece of paper
  ● Put all the papers into a hat
  ● Draw 2 random papers from the hat
● Physical methods have some issues
  ● Are the papers sufficiently mixed?
  ● Are some of the papers folded?
  ● What else???
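The hat exercise can be imitated in code to see what "every possible sample is equally likely" means. The sketch below lists every possible sample of size 2 from 5 volunteers and then picks one of those samples at random. The volunteer labels A to E are placeholders, not names from the class.

```python
from itertools import combinations
import random

# Placeholder labels for the 5 volunteers (assumed for illustration).
volunteers = ["A", "B", "C", "D", "E"]

# Every possible sample of size n = 2 from the population of N = 5.
all_samples = list(combinations(volunteers, 2))
print(f"number of possible samples: {len(all_samples)}")

# Choosing one of these samples uniformly at random gives each sample
# the same chance of being drawn, which is the definition of a
# simple random sample.
chosen = random.choice(all_samples)
print(f"selected sample: {chosen}")
```

With 5 volunteers there are 10 possible pairs, so each pair has a 1-in-10 chance. A badly mixed hat would give some pairs a better chance than others.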
Random Numbers
● A method using a table of random numbers (back pages, Table 1)
  ● List and number the individuals
  ● Decide on a way to pick the random numbers (how to choose the starting point and what rule to use to select which digits to choose after that)
  ● Select the random numbers
  ● Match the numbers to the individuals
● With the technology available today, this method is almost silly
Calculator
● RandInt(start #, end #, how many)
  ● Leave the 3rd entry blank for 1 value
● Table 3, page 25: randomly survey 5 of their 30 clients
  ● Number the clients 1 – 30
  ● RandInt(1,30,5)
  ● Survey the clients corresponding to the generated values
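For readers without the calculator, the same client selection can be sketched in Python. This swaps in Python's standard `random.sample` for the calculator command. One difference worth noting: `random.sample` never repeats a value, whereas a list of random integers from a calculator may contain repeats, in which case extra values would need to be generated.

```python
import random

# Clients are numbered 1 through 30, as in the example.
client_numbers = range(1, 31)

# Draw 5 distinct client numbers; every set of 5 is equally likely.
selected = sorted(random.sample(client_numbers, 5))
print(f"survey clients numbered: {selected}")
```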
Section 1.4 Other Effective Sampling Methods
Collecting Data
● There are other effective ways to collect data
  ● Stratified sampling
  ● Systematic sampling
  ● Cluster sampling
● Each of these is particularly appropriate in certain specific circumstances
Stratified Sample
● A stratified sample is obtained when we choose a simple random sample from each subgroup of a population
  ● This is appropriate when the population is made up of nonoverlapping (distinct) groups called strata
  ● Within each stratum, the individuals are likely to have a common attribute
  ● Between the strata, the individuals are likely to have different common attributes
Stratified Sample
● Example – polling a population about a political issue
  ● It is reasonable to divide up the population into Democrats, Republicans, and Independents
  ● It is reasonable to believe that the opinions of individuals within each party are similar
  ● It is reasonable to believe that the opinions differ from group to group
  ● Therefore it makes sense to consider each stratum separately
● This method can help ensure all subgroups are represented, so our data are more reliable
Stratified Sample
● Example – a poll about safety within a university
● Three identified strata
  ● Resident students
  ● Commuter students
  ● Faculty and staff
● It is reasonable to assume that the opinions within each group are similar
● It is reasonable to assume that the opinions differ between groups
Stratified Sample
● Assume that the sizes of the strata are
  ● Resident students – 5,000
  ● Commuter students – 4,000
  ● Faculty and staff – 1,000
● If we wish to obtain a sample of size n = 100 that reflects the same relative proportions, we would want to choose
  ● 50 resident students
  ● 40 commuter students
  ● 10 faculty and staff
● Finally, conduct a simple random sample within each subgroup to obtain data
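The proportional allocation in this example is simple arithmetic: each stratum gets n × (stratum size ÷ population size) individuals. The sketch below reproduces the 50 / 40 / 10 split and then draws a simple random sample inside each stratum. The stratum sizes are from the example. Numbering the individuals 1 to the stratum size is an assumption standing in for a real list of names.

```python
import random

# Stratum sizes from the example.
strata_sizes = {"resident": 5000, "commuter": 4000, "faculty_staff": 1000}
n = 100
total = sum(strata_sizes.values())  # 10,000 individuals in all

# Proportional allocation. The division is exact for these numbers;
# in general the results would need rounding so that they sum to n.
allocation = {name: n * size // total for name, size in strata_sizes.items()}
print(allocation)

# Simple random sample within each stratum (individuals numbered 1..size).
stratified_sample = {
    name: random.sample(range(1, strata_sizes[name] + 1), k)
    for name, k in allocation.items()
}
for name, ids in stratified_sample.items():
    print(name, len(ids), "selected, e.g.", sorted(ids)[:3])
```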
Systematic Sample
● A systematic sample is obtained when we choose every kth individual in a population
  ● The first individual selected corresponds to a random number between 1 and k
● Systematic sampling is appropriate
  ● When we do not have a frame
  ● When we do not have a list of all the individuals in a population
Systematic Sampling
● Example – polling customers about satisfaction with service
  ● We do not have a list of customers arriving that day
  ● We do not even know how many customers will arrive that day
  ● Simple random sampling (and stratified sampling) cannot be implemented
Systematic Sampling
● Assume that
  ● We want to choose a sample of 40 customers
  ● We believe that there will be about 350 customers
● Values of k
  ● k = 7 is reasonable because it is likely that enough customers will arrive to reach the 40 target
  ● k = 2 is not reasonable because we will only interview the very early customers
  ● k = 20 is not reasonable because it is unlikely that enough customers will arrive to reach the 40 target
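The three choices of k can be checked with a short sketch. The function name `systematic_sample` is hypothetical, customers are assumed to be numbered in order of arrival, and 350 arrivals is the estimate from the example rather than a known count.

```python
import random

def systematic_sample(num_arrivals, k, target):
    """Pick every k-th arrival, starting at a random position from 1 to k,
    and stop once `target` customers have been selected."""
    start = random.randint(1, k)  # random start between 1 and k, inclusive
    picks = list(range(start, num_arrivals + 1, k))
    return picks[:target]

for k in (2, 7, 20):
    picks = systematic_sample(num_arrivals=350, k=k, target=40)
    print(f"k = {k:2d}: {len(picks)} customers, last one is arrival #{picks[-1]}")
```

With k = 7 the target of 40 is reached well before the last expected arrival. With k = 2 it is reached within roughly the first 80 arrivals, so later customers are never asked. With k = 20 only 17 or 18 customers are selected, short of the target.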
Cluster Sample
● A cluster sample is obtained when we choose a random set of groups and then select all individuals within those groups
● We can obtain a sample of size 50 by choosing 10 groups of 5
● Cluster sampling is appropriate when it is very time consuming or expensive to choose the individuals one at a time
Cluster Sample
● Example – testing the fill of bottles
  ● It is time consuming to pull individual bottles
  ● It is expensive to waste an entire carton of 12 bottles just to test one bottle
● If we would like to test 240 bottles, we could
  ● Randomly select 20 cartons
  ● Test all 12 bottles within each carton
● This reduces the time and expense required
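The bottle example can be sketched as follows. The 20 cartons and 12 bottles per carton come from the example. The total of 500 cartons available is an assumption, since the source does not say how many cartons there are.

```python
import random

num_cartons = 500        # assumed number of cartons available (not from the source)
bottles_per_carton = 12  # from the example

# Step 1: randomly select 20 whole cartons (the clusters).
chosen_cartons = sorted(random.sample(range(1, num_cartons + 1), 20))

# Step 2: test every bottle inside each chosen carton.
tested = [
    (carton, bottle)
    for carton in chosen_cartons
    for bottle in range(1, bottles_per_carton + 1)
]

print(f"cartons opened: {len(chosen_cartons)}")
print(f"bottles tested: {len(tested)}")
```

Only 20 cartons are opened to reach 240 bottles, whereas picking 240 bottles one at a time could mean opening up to 240 different cartons.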
Convenience Sample
● A convenience sample is obtained when we choose individuals in an easy, or convenient, way
● Self-selecting samples are examples of convenience sampling
  ● Individuals who respond to television or radio announcements
● “Just asking around” is an example of convenience sampling
  ● Individuals who are known to the pollster
Convenience Sample
● Convenience sampling has little statistical validity
  ● The design is poor
  ● The results are suspect
● However, there are times when convenience sampling could be useful as a rough guess
Multistage Sample
● A multistage sample is obtained using a combination of
  ● Simple random sampling
  ● Stratified sampling
  ● Systematic sampling
  ● Cluster sampling
● Many large-scale samples (such as the surveys the US Census Bureau conducts in noncensus years) use multistage sampling
Section 1.5 Errors in Sampling
Bias
If the results of the sample are not representative of the population, then the sample has bias.
Three Sources of Bias
1. Sampling Bias
2. Nonresponse Bias
3. Response Bias
Sampling Bias
● The technique used to obtain individuals tends to favor one part of the population over another
● Occurs often in convenience sampling
● Often results in undercoverage: the proportion of a subgroup of the population is lower in the sample than in the actual population
Nonresponse Bias
● Occurs when the “nonresponders” to a survey have different opinions than those who do respond
● Frequent with surveys
● Controlled using callbacks or incentives
Response Bias
Answers do not reflect the true feelings of the respondent
Types of Response Bias
1. Interviewer error – need a trained interviewer
2. Misrepresented answers – respondent gives inflated answers
3. Words used in survey question – wording can lead to misinterpretation or bias towards a specific position
4. Order of the questions or words within the question – be able to rearrange questions and choices to eliminate bias towards order
5. Type of question – open ended or closed
6. Data entry error – make sure to check for accuracy during entry
Types of Questions
● Open ended questions
  ● Allow the respondent to choose their own answer
  ● Give the flexibility to represent a variety of options
● Closed ended questions
  ● Limit the number of possible responses, making the analysis easier
  ● Give the respondents a structure explaining more of the meaning and purpose of the question
● A combination of open ended and closed ended questions could be effective
Errors
● Two questions for students
  ● Question 1 – “Do you feel that no final exams should not be required to not last more than 2½ hours?”
  ● Question 2 – “Do you feel that no final exams should not be required to not last less than 2½ hours?”
● What will students say?
  ● May tend to say “yes” to both questions, even though they are complete opposites
  ● May tend to say “yes” because both have the words “no” and “required” and “2½ hours”
Sampling Error
● One type of error, sampling error, occurs because we use only part of the population in our study
  ● Samples consist of only part of the total data
  ● Samples are usually more realistic to analyze
  ● Because there are individuals in the population that are not in our sample, sampling errors are difficult to control
Nonsampling Error
● Another type of error, nonsampling error, arises from the actual survey process
  ● Preference is given to selecting some individuals over others
  ● Individual answers are not accurate (for various reasons)
● Nonsampling errors can often be controlled or minimized with a well-designed survey and sampling technique
Nonsampling Errors
● Types of nonsampling error
  ● Individuals who respond have different characteristics than individuals who do not respond
  ● Interviewer errors
  ● Misrepresented answers
  ● Data checks
  ● Questionnaire design
  ● Wording of questions
  ● Order of questions, words, and responses
Presidential Election 2000
Section 1.6 The Design of Experiments
Experiment: Think Science!!!
● Controlled study between variables
  ● Explanatory variables or factors (independent variables)
  ● Response variables (dependent variables)
● Designed to determine the effect of explanatory variables on the response
● Requires a control group: serves as baseline
Other Controls
● Placebo – usually a sugar pill; looks, tastes, and smells like the actual medication but has no medical effects
● Blinding
  ● Single-blind – participant does not know which treatment they are receiving
  ● Double-blind – participant and researcher do not know which treatment the person is receiving
Designing an Experiment
● Have an overall plan for the experiment to eliminate bias and keep controls
● Think “Scientific Method”
● Table p. 47 – 48
Steps in Conducting an Experiment
Step 1: Identify the problem to be solved.
• Should be explicit
• Should provide the researcher direction
• Should identify the response variable and the population to be studied.
Steps in Conducting an Experiment
Step 2: Determine the factors that affect the response variable.
• Once the factors are identified, it must be determined which factors are to be fixed at some predetermined level (the control), which factors will be manipulated, and which factors will be uncontrolled.
Steps in Conducting an Experiment
Step 3: Determine the number of experimental units.
• Consider time & money
• What will a good sample size be?
Steps in Conducting an Experiment
Step 4: Determine the level of the predictor variables.
1. Control: There are two ways to control the factors.
(a) Fix their level at one predetermined value throughout the experiment. These are variables whose effect on the response variable is not of interest.
(b) Set them at predetermined levels. These are the factors whose effect on the response variable interests us. The combinations of the levels of these factors represent the treatments in the experiment.
2. Randomize: Randomly assign the experimental units to the treatment groups so that the effects of variables whose levels cannot be controlled are minimized.
Step 5: Conduct the experiment.
a) Replication occurs when each treatment is applied to more than one experimental unit. This helps to ensure that the effect of a treatment is not due to some characteristic of a single experimental unit. It is recommended that each treatment group have the same number of experimental units.
b) Collect and process the data by measuring the value of the response variable for each replication. Any difference in the value of the response variable can be attributed to differences in the level of the treatment.
Steps in Conducting an Experiment
Step 6: Test the claim.
• This is the subject of inferential statistics.
Completely Randomized Design
Each experimental unit is randomly assigned to a treatment
Example
The octane of fuel is a measure of its resistance to detonation, with a higher number indicating higher resistance. An engineer wants to know whether the level of octane in gasoline affects the gas mileage of an automobile. Assist the engineer in designing an experiment.
Step 1: The response variable is miles per gallon.
Step 2: Factors that affect miles per gallon: engine size, outside temperature, driving style, driving conditions, characteristics of the car
Step 3: We will use 12 cars all of the same model and year.
Step 4: We list the variables and their levels.
• Octane level - manipulated at 3 levels. Treatment A: 87 octane, Treatment B: 89 octane, Treatment C: 92 octane
• Engine size - fixed
• Temperature - uncontrolled, but will be the same for all 12 cars.
• Driving style/conditions - fixed. All 12 cars will be driven under the same conditions on a closed track.
• Other characteristics of car - all 12 cars will be the same model and year; however, there is probably variation from car to car. To account for this, we randomly assign the cars to the octane levels.
Step 5: Randomly assign 4 cars to the 87 octane, 4 cars to the 89 octane, and 4 cars to the 92 octane. Give each car 3 gallons of gasoline. Drive the cars until they run out of gas. Compute the miles per gallon.
Step 6: Determine whether any differences exist in miles per gallon.
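The random assignment in Step 5 can be sketched by shuffling the cars and dealing them out four at a time. Numbering the cars 1 to 12 is an assumption for illustration, and the `miles_per_gallon` helper is a hypothetical name for the Step 5 calculation (miles driven on 3 gallons, divided by 3).

```python
import random

# The 12 cars, numbered 1..12 for illustration.
cars = list(range(1, 13))
random.shuffle(cars)  # put the cars in a random order

# Deal the shuffled cars into the three treatment groups, 4 cars each.
treatments = {
    "A (87 octane)": sorted(cars[0:4]),
    "B (89 octane)": sorted(cars[4:8]),
    "C (92 octane)": sorted(cars[8:12]),
}
for name, group in treatments.items():
    print(f"Treatment {name}: cars {group}")

def miles_per_gallon(miles_driven, gallons=3):
    """Response variable: miles driven on the fuel given, per gallon."""
    return miles_driven / gallons
```

Because every car is equally likely to land in any group, car-to-car differences that cannot be controlled tend to be spread evenly across the three octane levels.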
Matched Pairs Design
● Experimental units are paired up so that the units in each pair match (before-after, twins, husband-wife)
● One member of each pair gets treatment A, the other gets B (chosen randomly!)
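The random choice within each pair amounts to a coin flip per pair. The sketch below assumes three pairs of twins with placeholder labels. Neither the labels nor the number of pairs comes from the source.

```python
import random

# Placeholder matched pairs (for example, twins); labels are assumed.
pairs = [("twin 1a", "twin 1b"), ("twin 2a", "twin 2b"), ("twin 3a", "twin 3b")]

assignment = []
for first, second in pairs:
    # Flip a fair coin to decide which member of the pair gets treatment A.
    if random.random() < 0.5:
        first, second = second, first
    assignment.append({"A": first, "B": second})

for row in assignment:
    print(f"treatment A: {row['A']:8s} treatment B: {row['B']}")
```

Each pair always contributes one unit to treatment A and one to treatment B, so the two treatment groups are matched by construction. Only the choice of which member gets which treatment is left to chance.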