Chapter 1

Data Collection

Section 1.1 Introduction to the Practice of Statistics

Statistics

The science of statistics is collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions

Statistics provides a measure of confidence in any conclusion

Data

Solve 3x + 5 = 11
Everyone should get the same answer (x = 2)

How long was your drive (or walk) to class today?
Different answers… this is why we need statistics!
We can then break the data down into meaningful information

Statistics and mathematics have similarities but are different

Mathematics
Solves problems with 100% certainty
Has only one correct answer

Statistics, because of variability
Does not solve problems with 100% certainty (95% certainty is much more common)
Frequently has multiple reasonable answers

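Population vs. Sample

● A population
Is the group to be studied
Includes all of the individuals in the group
Has numerical summaries usually written with Greek letters, such as μ for the mean
● A sample
Is a subset of the population
Is often used in analyses because getting access to the entire population is impractical
Has numerical summaries usually written with Latin letters, such as x̄ for the mean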

Population vs. Sample

Population example: people 18 years and older

Sample example: students at SHU 18 and older

Parameter vs. Statistic

A statistic is a numerical summary of the sample
Descriptive statistics organize and summarize the data in ways such as tables and graphs
Inferential statistics extend the sample results to the population and measure the reliability of those results

A parameter is a numerical summary of a population

Example

Suppose the actual percentage of all students at SHU who own a car is 48.2%
This is a ________________________

We surveyed 100 students and found that 46% own a car
This is a _________________________
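The difference between the two numbers in this example can be sketched in Python. This is a minimal illustration, not part of the original slides: the population size of 10,000 and the random seed are assumptions made so the code can run.

```python
import random

# Hypothetical population: 10,000 students, 48.2% of whom own a car.
# The population size and the seed below are assumptions for illustration.
POPULATION_SIZE = 10_000
owners = round(0.482 * POPULATION_SIZE)  # 4,820 car owners
population = [True] * owners + [False] * (POPULATION_SIZE - owners)

# Computed from every individual in the population.
population_proportion = sum(population) / len(population)

# Computed from a simple random sample of 100 students.
rng = random.Random(1)
sample = rng.sample(population, 100)
sample_proportion = sum(sample) / len(sample)

print(f"population proportion: {population_proportion:.3f}")
print(f"sample proportion:     {sample_proportion:.2f}")
```

The population value stays fixed at 0.482, while the sample value changes whenever a different sample of 100 is drawn.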

The Process of Statistics

1. Identify the research objective: what do you want answered?
2. Collect the data needed to answer the question: usually a sample (1.2 – 1.6)
3. Describe the data: the descriptive statistics (ch. 2 – 4)
4. Perform inference: use appropriate techniques to test reliability for the population (ch. 9 – 12)

Variables

Characteristics of the individuals under study are called variables
Some variables have values that are attributes or characteristics … those are called qualitative or categorical variables
Some variables have values that are numeric measurements … those are called quantitative variables

Qualitative Variables

Examples of qualitative variables
Gender
Zip code
Blood type
States in the United States
Brands of televisions

Qualitative variables have category values … those values cannot be added, subtracted, etc.

Quantitative Variables

Examples of quantitative variables
Temperature
Height and weight
Sales of a product
Number of children in a family
Points achieved playing a video game

Quantitative variables have numeric values … those values can be added, subtracted, etc.

Discrete vs. Continuous

● Quantitative variables can be either discrete or continuous
● Discrete variables
Variables that have a finite or a countable number of possibilities
Frequently variables that are counts
● Continuous variables
Variables that have an infinite but not countable number of possibilities
Frequently variables that are measurements

Discrete Variables

Examples of discrete variables
The number of heads obtained in 5 coin flips
The number of cars arriving at a McDonald’s between 12:00 and 1:00
The number of students in class
The number of points scored in a football game

The possible values of discrete variables can be listed

Continuous Variables

Examples of continuous variables
The distance that a particular model car can drive on a full tank of gas
Heights of college students

Sometimes the variable is discrete but has so many close values that it could be considered continuous
The number of DVDs rented per year at video stores
The number of ants in an ant colony

Section 1.2 Observational Studies Versus Designed Experiments

Observational Study

● A survey sample is an example of an observational study
An observational study is one where there is no attempt to influence the value of the variable
An observational study is also called an ex post facto (after the fact) study
● Advantages
It can detect associations between variables
● Disadvantages
It cannot isolate causes to determine causation

Designed Experiment

● A designed experiment is an experiment
That applies a treatment to individuals
That often compares the treated group to a control (untreated) group
Where the variables can be controlled
● Advantages
Can analyze individual factors
● Disadvantages
Cannot be done when the variables cannot be controlled
Cannot be applied in some cases for moral / ethical reasons

Lurking & Confounding Variables

Confounding and lurking variables are a danger in observational studies

Confounding variables: in an observational study, two explanatory variables can be linked so that their effects on the response cannot be separated. A relation that appears to involve one of them may be due to another variable not accounted for.

Lurking variables are variables not initially considered in the study that nevertheless affect the response variable.

Associated does not mean that one causes the other
A simple observational study may find that smoking and cancer are associated
Cannot conclude that smoking causes cancer
Cannot conclude that cancer causes people to smoke

What are some lurking variables with smoking and cancer?

Types of Observational Studies

Cross-sectional
Case-control
Cohort

Cross-sectional Studies

Observational studies that collect information about individuals at a specific point in time, or over a very short period of time.

Case-control Studies

These studies are retrospective, meaning that they require individuals to look back in time or require the researcher to look at existing records. In case-control studies, individuals that have certain characteristics are matched with those that do not.

Cohort Studies

A cohort study first identifies a group of individuals to participate in the study (cohort). The cohort is then observed over a period of time. Over this time period, characteristics about the individuals are recorded. Because the data is collected over time, cohort studies are prospective.

Census

● A census is a list
Of all the individuals in a population
That records the characteristics of the individuals
An example is the US Census held every 10 years (this is only an example though)
● Advantages
Answers have 100% certainty
● Disadvantages
May be difficult or impossible to obtain
Costs may be prohibitive

Section 1.3 Simple Random Sampling

Simple Random Sample

● A simple random sample is obtained when every possible sample of size n out of a population of size N is equally likely to occur

Let’s Try It!

● 5 Volunteers…
● A simple (but not foolproof) method
Write each individual’s name on a separate piece of paper
Put all the papers into a hat
Draw 2 random papers from the hat
● Physical methods have some issues
Are the papers sufficiently mixed?
Are some of the papers folded?
What else???
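The hat exercise can be mimicked in Python as a rough sketch. The five names are hypothetical placeholders for the volunteers, and the seed is fixed only so that the run is repeatable. The code lists every possible sample of size 2 out of 5 individuals, then draws one of them without replacement.

```python
from itertools import combinations
import random

# Hypothetical volunteer names; any five distinct labels would do.
volunteers = ["Ana", "Ben", "Cara", "Dev", "Eli"]

# Every possible sample of size n = 2 from N = 5 individuals.
all_samples = list(combinations(volunteers, 2))
print(len(all_samples))  # 10 possible samples

# random.sample draws without replacement, like pulling two papers from a hat.
rng = random.Random(7)  # seed fixed only to make the run repeatable
drawn = rng.sample(volunteers, 2)
print(drawn)
```

In a simple random sample, each of the 10 listed pairs should be equally likely to be the one drawn.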

Random Numbers

A method using a table of random numbers (Back pages Table 1)
List and number the individuals
Decide on a way to pick the random numbers (how to choose the starting point and what rule to use to select which digits to choose after that)
Select the random numbers
Match the numbers to the individuals

With the technology available today, this method is almost silly

Calculator

randInt(start #, end #, how many)
Leave the 3rd entry blank for 1 value

Table 3 Page 25: Randomly survey 5 of their 30 clients.
Number them 1 – 30
randInt(1,30,5)
Survey the clients corresponding to the generated values.
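For readers without a calculator, the same selection can be sketched in Python. This is an illustrative equivalent rather than the slide's method, and the seed is an assumption used only for repeatability. `random.sample` never returns the same client twice, so no repeated values need to be discarded.

```python
import random

rng = random.Random(2024)  # assumed seed, only for repeatability

# Clients are numbered 1 through 30; pick 5 distinct client numbers.
clients = range(1, 31)
chosen = sorted(rng.sample(clients, 5))
print(chosen)
```

The printed numbers identify which of the 30 clients to survey.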

Section 1.4 Other Effective Sampling Methods

Collecting Data

There are other effective ways to collect data
Stratified sampling
Systematic sampling
Cluster sampling

Each of these is particularly appropriate in certain specific circumstances

Stratified Sample

● A stratified sample is obtained when we choose a simple random sample from each subgroup of a population
This is appropriate when the population is made up of nonoverlapping (distinct) groups called strata
Within each stratum, the individuals are likely to have a common attribute
Between strata, the individuals are likely to have different common attributes

Stratified Sample

Example – polling a population about a political issue
It is reasonable to divide up the population into Democrats, Republicans, and Independents
It is reasonable to believe that the opinions of individuals within each party are similar
It is reasonable to believe that the opinions differ from group to group
Therefore it makes sense to consider each stratum separately

This method can help ensure all subgroups are represented, so our data is more reliable

Stratified Sample

● Example – a poll about safety within a university
● Three identified strata
Resident students
Commuter students
Faculty and staff
● It is reasonable to assume that the opinions within each group are similar
● It is reasonable to assume that the opinions between each group are different

Stratified Sample

Assume that the sizes of the strata are
Resident students – 5,000
Commuter students – 4,000
Faculty and staff – 1,000

If we wish to obtain a sample of size n = 100 that reflects the same relative proportions, we would want to choose
50 resident students
40 commuter students
10 faculty and staff

Finally, conduct a simple random sample within each subgroup to obtain data.
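The proportional allocation above can be checked with a short Python sketch. The stratum names and the ID numbers used to stand in for individuals are illustrative assumptions, and the seed is fixed only for repeatability.

```python
import random

strata_sizes = {"resident": 5000, "commuter": 4000, "faculty_staff": 1000}
n = 100
total = sum(strata_sizes.values())

# Proportional allocation: each stratum's share of the sample matches its
# share of the population. Integer division is exact for these numbers.
allocation = {name: n * size // total for name, size in strata_sizes.items()}
print(allocation)  # {'resident': 50, 'commuter': 40, 'faculty_staff': 10}

# Simple random sample within each stratum. Individuals are represented by
# hypothetical ID numbers 0..size-1.
rng = random.Random(3)
stratified_sample = {
    name: rng.sample(range(strata_sizes[name]), k)
    for name, k in allocation.items()
}
```

When the proportions do not divide evenly, the allocation has to be rounded and adjusted so the counts still add up to n.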

Systematic Sample

A systematic sample is obtained when we choose every kth individual in a population
The first individual selected corresponds to a random number between 1 and k

Systematic sampling is appropriate
When we do not have a frame
When we do not have a list of all the individuals in a population

Systematic Sampling

Example – polling customers about satisfaction with service
We do not have a list of customers arriving that day
We do not even know how many customers will arrive that day
Simple random sampling (and stratified sampling) cannot be implemented

Systematic Sampling

● Assume that
We want to choose a sample of 40 customers
We believe that there will be about 350 customers
● Values of k
k = 7 is reasonable because it is likely that enough customers will arrive to reach the 40 target
k = 2 is not reasonable because we will only interview the very early customers
k = 20 is not reasonable because it is unlikely that enough customers will arrive to reach the 40 target
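The reasoning about k can be sketched in Python. The function name `systematic_sample` is hypothetical, and the code assumes exactly 350 customers arrive, which the example only estimates. For each k it reports how many customers would be selected and, if the 40 target is reached, which arriving customer is the 40th interview.

```python
import random

def systematic_sample(population_size, k, rng):
    """Return the arrival positions selected when taking every k-th individual."""
    start = rng.randint(1, k)  # random start between 1 and k, inclusive
    return list(range(start, population_size + 1, k))

rng = random.Random(5)      # assumed seed, only for repeatability
expected_customers = 350    # assumption: the estimate turns out to be exact
target = 40

for k in (2, 7, 20):
    picked = systematic_sample(expected_customers, k, rng)
    # Position of the 40th interview, or None if the target is never reached.
    fortieth = picked[target - 1] if len(picked) >= target else None
    print(k, len(picked), fortieth)
```

With k = 2 the 40th interview happens by about the 80th customer, so later customers are never sampled; with k = 20 fewer than 40 customers are selected; k = 7 selects 50, leaving some margin if fewer customers arrive than expected.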

Cluster Sample

A cluster sample is obtained when we choose a random set of groups and then select all individuals within those groups

We can obtain a sample of size 50 by choosing 10 groups of 5

Cluster sampling is appropriate when it is very time consuming or expensive to choose the individuals one at a time

Cluster Sample

Example – testing the fill of bottles
It is time consuming to pull individual bottles
It is expensive to waste an entire carton of 12 bottles to just test one bottle

If we would like to test 240 bottles, we could
Randomly select 20 cartons
Test all 12 bottles within each carton

This reduces the time and expense required
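A small Python sketch of the carton example follows. The production run of 500 cartons is a hypothetical number chosen so there is something to sample from; the 20 cartons and 12 bottles per carton come from the example.

```python
import random

BOTTLES_PER_CARTON = 12
# Hypothetical production run of 500 cartons, labelled 0..499.
cartons = range(500)

rng = random.Random(11)  # assumed seed, only for repeatability
chosen_cartons = rng.sample(cartons, 20)

# Test every bottle in each chosen carton: (carton, position) pairs.
bottles_to_test = [
    (carton, position)
    for carton in chosen_cartons
    for position in range(BOTTLES_PER_CARTON)
]
print(len(bottles_to_test))  # 240
```

The random step happens only at the carton level; inside a chosen carton nothing is left to chance.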

Convenience Sample

● A convenience sample is obtained when we choose individuals in an easy, or convenient, way
● Self-selecting samples are examples of convenience sampling
Individuals who respond to television or radio announcements
● “Just asking around” is an example of convenience sampling
Individuals who are known to the pollster

Convenience Sample

Convenience sampling has little statistical validity
The design is poor
The results are suspect

However, there are times when convenience sampling could be useful as a rough guess

Multistage Sample

A multistage sample is obtained using a combination of
Simple random sampling
Stratified sampling
Systematic sampling
Cluster sampling

Many large scale samples (the US census in noncensus years) use multistage sampling

Section 1.5 Errors in Sampling

Bias

If the results of the sample are not representative of the population, then the sample has bias.

Three Sources of Bias

1. Sampling Bias

2. Nonresponse Bias

3. Response Bias

Sampling Bias

The technique used to obtain individuals tends to favor one part of the population over another.

Occurs often in convenience sampling
Often results in undercoverage: the proportion of a subgroup of the population is lower in the sample than in the actual population.

Nonresponse Bias

Occurs when the “nonresponders” to a survey have different opinions than those who respond.

Frequent with surveys
Controlled using callbacks or incentives

Response Bias

Answers do not reflect the true feelings of the respondent

Types of Response Bias

1. Interviewer error – need a trained interviewer

2. Misrepresented answers – responder gives inflated answers

3. Words used in survey question – wording can lead to misinterpretation or bias towards a specific position

4. Order of the questions or words within the question – be able to rearrange questions and choices to eliminate bias towards order.

5. Type of Question – open ended or closed

6. Data entry error – make sure to check for accuracy during entry

Types of Questions

● Open ended questions
Allow the respondent to choose their own answer
Give the flexibility to represent a variety of options
● Closed ended questions
Limit the number of possible responses, making the analysis easier
Give the respondents a structure explaining more of the meaning and purpose of the question
● A combination of open ended and closed ended questions could be effective

Errors

● Two questions for students
Question 1 – “Do you feel that no final exams should not be required to not last more than 2½ hours?”
Question 2 – “Do you feel that no final exams should not be required to not last less than 2½ hours?”
● What will students say?
May tend to say “yes” to both questions, even though they are complete opposites
May tend to say “yes” because both have the words “no” and “required” and “2½ hours”

Sampling Error

● One type of error, sampling error, occurs because we use only part of the population in our study
Samples consist of only part of the total data
Samples are usually more realistic to analyze
Because there are individuals in the population that are not in our sample, sampling errors are difficult to control
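Sampling error can be seen in a small simulation. This is only a sketch under stated assumptions: the population of 10,000 with a true proportion of 0.5, the sample size of 100, and the seed are all made up for illustration.

```python
import random

rng = random.Random(42)  # assumed seed, only for repeatability

# Hypothetical population of 10,000 individuals, half of whom answer "yes".
population = [1] * 5000 + [0] * 5000
true_proportion = sum(population) / len(population)

# Five independent simple random samples of 100 individuals each.
estimates = [sum(rng.sample(population, 100)) / 100 for _ in range(5)]

# Sampling error: the gap between each sample's estimate and the true value.
errors = [round(p - true_proportion, 2) for p in estimates]
print(estimates)
print(errors)
```

Each sample leaves out most of the population, so the estimates typically differ from one another and from the true value even though nothing went wrong in the survey process.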

Nonsampling Error

Another type of error, nonsampling error, arises from the actual survey process
Preference is given to selecting some individuals over others
Individual answers are not accurate (for various reasons)

Nonsampling errors can often be controlled or minimized with a well-designed survey and sampling technique

Nonsampling Errors

● Types of nonsampling error
Individuals who respond have different characteristics than individuals who do not respond
Interviewer errors
Misrepresented answers
Data checks
Questionnaire design
Wording of questions
Order of questions, words, and responses

Presidential Election 2000

Section 1.6 The Design of Experiments

Experiment: Think Science!!!

Controlled study between variables
Explanatory variables or factors (independent variable)
Response variables (dependent variable)

Designed to determine the effect of explanatory variables on the response

Requires
Control group: serves as baseline

Other Controls

Placebo – usually a sugar pill; looks, tastes, and smells like the actual medication but has no medical effects

Blinding
Single-blind – participant does not know which treatment they are receiving
Double-blind – participant and researcher do not know which treatment the person is receiving.

Designing an Experiment

Have an overall plan for the experiment to eliminate bias and keep controls

Think “Scientific Method”

Table p. 47 – 48

Steps in Conducting an Experiment

Step 1: Identify the problem to be solved.
• Should be explicit
• Should provide the researcher direction
• Should identify the response variable and the population to be studied.

Step 2: Determine the factors that affect the response variable.
• Once the factors are identified, it must be determined which factors are to be fixed at some predetermined level (the control), which factors will be manipulated, and which factors will be uncontrolled.

Step 3: Determine the number of experimental units.
• Consider time & money
• What will a good sample size be?

Step 4: Determine the level of the predictor variables

1. Control: There are two ways to control the factors.

(a) Fix their level at one predetermined value throughout the experiment. These are variables whose effect on the response variable is not of interest.

(b) Set them at predetermined levels. These are the factors whose effect on the response variable interests us. The combinations of the levels of these factors represent the treatments in the experiment.

2. Randomize: Randomize the experimental units to various treatment groups so that the effects of variables whose level cannot be controlled are minimized.

Step 5: Conduct the Experiment

a) Replication occurs when each treatment is applied to more than one experimental unit. This helps to assure that the effect of a treatment is not due to some characteristic of a single experimental unit. It is recommended that each treatment group have the same number of experimental units.

b) Collect and process the data by measuring the value of the response variable for each replication. Any difference in the value of the response variable can be attributed to differences in the level of the treatment.

Step 6: Test the claim.
• This is the subject of inferential statistics.

Completely Randomized Design

Each experimental unit is randomly assigned to a treatment

Example

The octane of fuel is a measure of its resistance to detonation, with a higher number indicating higher resistance. An engineer wants to know whether the level of octane in gasoline affects the gas mileage of an automobile. Assist the engineer in designing an experiment.

Step 1: The response variable is miles per gallon.

Step 2: Factors that affect miles per gallon:
Engine size, outside temperature, driving style, driving conditions, characteristics of car

Step 3: We will use 12 cars all of the same model and year.

Step 4: We list the variables and their level.
• Octane level - manipulated at 3 levels. Treatment A: 87 octane, Treatment B: 89 octane, Treatment C: 92 octane
• Engine size - fixed
• Temperature - uncontrolled, but will be the same for all 12 cars.
• Driving style/conditions - all 12 cars will be driven under the same conditions on a closed track - fixed.
• Other characteristics of car - all 12 cars will be the same model and year; however, there is probably variation from car to car. To account for this, we randomly assign the cars to the octane levels.

Step 5: Randomly assign 4 cars to the 87 octane, 4 cars to the 89 octane, and 4 cars to the 92 octane. Give each car 3 gallons of gasoline. Drive the cars until they run out of gas. Compute the miles per gallon.

Step 6: Determine whether any differences exist in miles per gallon.
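The random assignment in Step 5 can be sketched in Python. The car labels are hypothetical, the seed is fixed only for repeatability, and the helper `miles_per_gallon` is an illustrative name for the calculation described in the example (miles driven on 3 gallons).

```python
import random

rng = random.Random(8)  # assumed seed, only for repeatability
cars = [f"car_{i}" for i in range(1, 13)]   # 12 hypothetical car labels
treatments = {"A": 87, "B": 89, "C": 92}    # octane levels from the example

shuffled = cars[:]
rng.shuffle(shuffled)

# First 4 shuffled cars get Treatment A, the next 4 get B, the last 4 get C.
assignment = {
    label: shuffled[i * 4:(i + 1) * 4]
    for i, label in enumerate(treatments)
}

GALLONS = 3  # each car receives 3 gallons of gasoline

def miles_per_gallon(miles_driven):
    """Response variable: miles driven until empty, divided by gallons used."""
    return miles_driven / GALLONS

for label, group in assignment.items():
    print(label, treatments[label], group)
```

Shuffling the whole list and then slicing it gives every car the same chance of landing in each treatment group while keeping the groups at 4 cars each.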

Matched Pairs Design

Experimental units are paired up so that the members of each pair match (before-after, twins, husband-wife)

One member of each pair gets treatment A, the other gets B (chosen randomly!)
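The random choice within each pair can be sketched in Python. The pair labels are hypothetical and the seed is fixed only for repeatability; a simulated coin flip decides which member of each pair receives treatment A.

```python
import random

rng = random.Random(13)  # assumed seed, only for repeatability

# Hypothetical matched pairs (for example, three sets of twins).
pairs = [("twin_1a", "twin_1b"), ("twin_2a", "twin_2b"), ("twin_3a", "twin_3b")]

assignment = []
for first, second in pairs:
    # A fair coin flip decides which member of the pair receives treatment A.
    if rng.random() < 0.5:
        assignment.append({"A": first, "B": second})
    else:
        assignment.append({"A": second, "B": first})

for row in assignment:
    print(row)
```

Randomizing separately inside every pair keeps the two treatment groups matched on whatever the pairing was based on.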
