Producing Data - Introduction (fisher.utstat.toronto.edu/~hadas/STAB22/Lecture notes/week5.pdf)

week5 1

Producing Data - Introduction

• Statistics is a tool that helps data produce knowledge rather than confusion. As such, it must be concerned with producing data as well as interpreting already available data.

• Exploratory data analysis helps reveal information in data. However, alone it can rarely provide convincing evidence for its conclusions.

• We may also use data to provide clear answers to specific questions, such as: what is the average lifetime of humans?

• This lecture is devoted to developing the skills needed to produce trustworthy data and to judge the quality of data produced by others.

• The techniques for producing data are among the most important ideas in statistics; they are the basis for formal statistical inference.

week5 2

Collecting data

• Available data are data that were produced in the past for some other purpose but that may help answer a present question.

• Statistical designs for producing data rely on either sampling or experiments.

• A sample survey collects information about a population by selecting and measuring a sample from the population.

• Example: The General Social Survey (GSS) interviews about 3000 adult residents of the US every second year. That is, the GSS selects a sample of adults to represent the larger population of all adults living in the US.

• A census is an attempt to contact every individual in the population.

week5 3

Observation versus Experiment

• An observational study observes individuals and measures variables of interest but does not attempt to influence the response.

• An experiment imposes a treatment on individuals in order to observe their response.

• An observational study, even one based on a statistical sample, is a poor way to study the effect of a treatment. To see the effect of a treatment we must actually impose the treatment.

• When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

week5 4

Design of experiments

• The individuals on which the experiment is done are the experimental units.

• A specific experimental condition applied to the units is called a treatment.

• A placebo is a dummy treatment. The response to a dummy treatment is the placebo effect.

• The explanatory variables in an experiment are called factors.

• The values of a factor are called levels.

• Many experiments study the joint effect of several factors. In such an experiment, each treatment is formed by combining a specific value of each of the factors.

• In principle, experiments can give good evidence of causation.

week5 5

Example

We want to study the effects of aspirin and beta carotene on heart attacks and cancer.

Factors: Aspirin (levels: yes, no), Beta carotene (levels: yes, no).

Response variables: occurrence of heart attacks and cancer.

Treatments are the factor level combinations (4 treatments).

The example above is a factorial (two factor) experiment.
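As a small illustration of how treatments arise from factor levels, the sketch below enumerates every combination of one level per factor. The dictionary keys and variable names are chosen here for illustration only; the factors and levels are those of the example.

```python
from itertools import product

# Factors and their levels, as listed in the example.
factors = {
    "aspirin": ["yes", "no"],
    "beta_carotene": ["yes", "no"],
}

# Each treatment combines one level from every factor.
treatments = [
    dict(zip(factors, combo)) for combo in product(*factors.values())
]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
```

With two factors at two levels each, the list has 2 x 2 = 4 entries, matching the 4 treatments in the example.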

week5 6

Bias

• The design of a study is biased if it systematically favors certain outcomes.

• An uncontrolled study of a new medical therapy, for example, is biased in favor of finding the treatment effective because of the placebo effect.

• The group of patients who receive a dummy treatment is called a control group, because it enables us to control the effects of outside variables on the outcome.

• Control is the first basic principle of statistical design of experiments. Comparison of several treatments in the same environment is the simplest form of control.

• Example 3.9 page 180 in IPS.

week5 7

Randomization

• The design of an experiment first describes the response variable or variables, factors (explanatory variables), and the layout of the treatments, with comparison as the leading principle.

• The second aspect of design is the rule used to assign experimental units to the treatments. Comparison of the effects of treatments is valid only when all treatments are applied to similar groups of experimental units.

• Systematic differences among the groups of experimental units in a comparative experiment cause bias.

• The use of chance to divide experimental units into groups is called randomization.

• Randomization can be done by the hat method, random number tables, or software.

week5 8

Example

• A food company assesses the nutritional quality of a new “instant breakfast” product by feeding it to newly weaned male white rats and measuring their weight gain over a 28-day period. A control group of rats receives a standard diet for comparison. This experiment has a single factor (diet) with two levels. 30 rats were used for this experiment.

• The outline of the design is given in the following diagram

• The design in the above figure combines comparison and randomization to arrive at the simplest randomized comparative design.
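A randomized comparative design of this kind could be carried out in software roughly as follows. This is only a sketch: it assumes the 30 rats are labeled 1 to 30 and split into two equal groups of 15, and the function name and seed are illustrative.

```python
import random


def randomize_two_groups(units, group_size, seed=None):
    """Split units at random into a treatment group and a control group."""
    rng = random.Random(seed)  # seed only makes the sketch reproducible
    treatment = sorted(rng.sample(units, group_size))
    control = sorted(u for u in units if u not in treatment)
    return treatment, control


rats = list(range(1, 31))  # assumed labels 1..30
new_diet, standard_diet = randomize_two_groups(rats, 15, seed=2024)

print("New diet:     ", new_diet)
print("Standard diet:", standard_diet)
```

Every rat lands in exactly one group, and chance alone decides which.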

week5 9

Principles of experimental design

• Control the effects of lurking variables on the response, simply by comparing two or more treatments.

• Randomize - use impersonal chance to assign experimental units to treatments.

• Repeat each treatment on many units to reduce chance variation in the results.

Statistical Significance

• An observed effect so large that it would rarely occur by chance is called statistically significant.

week5 10

How to randomize

• The idea of randomization is to assign subjects to treatments by drawing names from a hat. In practice, experimenters use software to carry out randomization. We can randomize without software by using a table of random digits.

• A table of random digits is a list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:

The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.

week5 11

Completely randomized design (CRD)

• When all experimental units are allocated at random among all treatments, the experimental design is completely randomized.

• Example (rats example on slide 8)

- Label each rat with a numerical value from 01, …, 30.

- Start at line 164 in Table B and read two-digit groups. The first 10 two-digit groups in this line are 11 02 27 91 24 49 52 56 30 78. Groups outside 01 to 30, and any repeats, are skipped, so the rats labeled 11, 02, 27, 24, 30 go into the experimental group. Run your finger across line 164 (and continue to line 165 if needed) until you have chosen 15 rats. They are the rats labeled

11, 02, 27, 24, 30, 17, 22, 21, 01, 13, 23, 16, 28, 20, 08.
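The table-reading rule can be sketched in code. The function name `choose_labels` is hypothetical, and the input is only the ten two-digit groups quoted above from line 164, so the sketch reproduces just the first five picks rather than all 15.

```python
def choose_labels(two_digit_groups, max_label, how_many):
    """Read two-digit groups in order, keeping valid labels not yet chosen."""
    chosen = []
    for group in two_digit_groups:
        label = int(group)
        # Skip groups that are not labels (e.g. 91) and skip repeats.
        if 1 <= label <= max_label and label not in chosen:
            chosen.append(label)
        if len(chosen) == how_many:
            break
    return chosen


# The first ten two-digit groups of line 164 in Table B, as quoted above.
line_164_start = ["11", "02", "27", "91", "24", "49", "52", "56", "30", "78"]

first_picks = choose_labels(line_164_start, max_label=30, how_many=15)
print(first_picks)  # [11, 2, 27, 24, 30]
```

Feeding in the rest of line 164 (and line 165 if needed) would continue the same rule until 15 labels are chosen.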

week5 12

Cautions about experimentation

• The study of the effects of aspirin and beta carotene on heart attacks and cancer in the example on slide 5 was double-blind: neither the subjects nor the medical personnel who worked with them knew which treatment any subject had received. The double-blind method avoids unconscious bias, e.g. from a doctor who doesn’t think that “just a placebo” can benefit a patient.

• Lack of realism: The subjects or treatment or setting of an experiment may not realistically duplicate the conditions we really want to study.

• Example 3.16 page 188 in IPS.

week5 13

Matched pairs designs

• Matched pairs designs compare just two treatments. We choose blocks of two units that are as closely matched as possible. Alternatively, each block in a matched pairs design may consist of just one subject, who gets both treatments one after the other and serves as his or her own control.

• The idea is that matched subjects are more similar than unmatched ones, so that comparing responses within a number of pairs is more efficient than comparing the responses of groups of randomly assigned subjects.

• Randomization remains important: chance decides which member of a matched pair receives the first treatment.

• Example 3.17 page 189.
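The within-pair randomization can be sketched as a coin flip per pair. The subject labels, treatment names "A" and "B", and the function name are hypothetical, used only to show the mechanics.

```python
import random


def assign_within_pairs(pairs, treatments=("A", "B"), seed=None):
    """For each matched pair, decide at random who gets which treatment."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)  # the "coin flip" for this pair
        assignment[first] = order[0]
        assignment[second] = order[1]
    return assignment


# Hypothetical subjects, already matched into pairs.
pairs = [("s1", "s2"), ("s3", "s4"), ("s5", "s6")]
assignment = assign_within_pairs(pairs, seed=1)
print(assignment)
```

When each block is a single subject receiving both treatments, the same flip would decide the order of the two treatments instead.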

week5 14

Block design

• A block is a group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a randomized block design (RBD), the random assignment of units to treatments is carried out separately within each block.

• Example 3.18 page 190 in IPS

Progress of a type of cancer differs in women and men. We want to compare 3 therapies.

- Gender is a blocking variable.

- Two randomizations are done, one assigning female subjects to treatments, and the other assigning male subjects.

As described in the following diagram

week5 15
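A randomized block design along the lines of the cancer therapy example might be sketched as below. The subject labels, block sizes (12 women, 9 men), therapy names and seed are all assumptions made for illustration; only the structure (a separate randomization inside each gender block) comes from the example.

```python
import random


def randomized_block_design(blocks, treatments, seed=None):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)
    design = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # one randomization per block
        # Deal the shuffled units out to the treatments in turn.
        design[block_name] = {
            treatment: sorted(shuffled[i::len(treatments)])
            for i, treatment in enumerate(treatments)
        }
    return design


# Hypothetical subjects; the block sizes are illustrative.
blocks = {
    "women": [f"W{i}" for i in range(1, 13)],
    "men": [f"M{i}" for i in range(1, 10)],
}
therapies = ["therapy 1", "therapy 2", "therapy 3"]
design = randomized_block_design(blocks, therapies, seed=7)

for block_name, groups in design.items():
    print(block_name, groups)
```

Because the shuffle happens inside each block, every therapy group contains women only or men only, so gender cannot be confused with the therapy effect.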

week5 16

Sampling design

• A political scientist wants to know what percent of the voting-age population consider themselves conservatives. He needs to gather information about a large group of individuals.

• Time, cost and inconvenience forbid contacting every individual.

• We gather information about only part of the group in order to draw conclusions about the whole population.

• We will not, as in an experiment, impose treatment in order to observe the response.

week5 17

Population and sample

• The entire group of individuals that we want information about is called the population.

• A sample is a part of the population that we actually examine in order to gather information.

• Sample design: The design of a sample refers to the method used to choose the sample from the population.

Poor sample design can produce misleading conclusions.

week5 18

Example

• The ABC network program Nightline asked (in a call-in poll) whether the UN should continue to have its headquarters in the United States. More than 186000 callers responded (telephone companies charge for these calls) and 67% said “No”.

• People who spend time and money to respond to call-in polls are not representative of the entire adult population. In fact they tend to be the same people who call radio talk shows.

• People who feel strongly, especially those with strong negative opinions, are more likely to call.

• It is not surprising that a properly designed sample showed that 72% of adults want the UN to stay.

week5 19

Voluntary response sample

• A voluntary response sample consists of people who choose themselves by responding to a general appeal.

• Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.

• Random selection of a sample eliminates bias by giving all individuals an equal chance to be chosen.

week5 20

Simple Random Sample

• A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

• How to select an SRS? Hat method, random number tables, or software.

• Example 3.24 page 200 in IPS.
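To make the SRS definition concrete, the sketch below lists every possible sample of size 2 from a tiny, made-up population of 5 names and then draws one SRS in software. The names and the seed are hypothetical.

```python
import random
from itertools import combinations

# A tiny hypothetical population.
population = ["Ann", "Bob", "Cal", "Dee", "Eve"]
n = 2

# Every possible set of n individuals; an SRS gives each the same chance.
possible_samples = list(combinations(population, n))
print(len(possible_samples), "possible samples, each with chance",
      f"1/{len(possible_samples)}")

# Drawing one SRS in software (the seed only makes the run repeatable).
rng = random.Random(42)
srs = rng.sample(population, n)
print(srs)
```

`random.sample` draws without replacement, which is the software counterpart of pulling n names from a hat.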

week5 21

Stratified Random Sampling

• To select a stratified random sample, first divide the population into groups of similar individuals, called strata.

• Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

• Example 3.26 page 203 in IPS.
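The two steps of stratified sampling (split into strata, then take a separate SRS in each) can be sketched as follows. The strata, their sizes and the per-stratum sample sizes are invented for illustration.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))
    return sample


# Hypothetical strata: 100 undergraduates and 40 graduate students.
strata = {
    "undergraduate": [f"U{i}" for i in range(1, 101)],
    "graduate": [f"G{i}" for i in range(1, 41)],
}
sizes = {"undergraduate": 10, "graduate": 4}

sample = stratified_sample(strata, sizes, seed=3)
print(sample)
```

Unlike a single SRS of 14 students, this design guarantees that both groups appear in the sample in the chosen numbers.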

week5 22

Multistage sampling design - Example

• Data on employment/unemployment are gathered by the government’s Current Population Survey, which conducts interviews in about 55000 households each month.

• It is not practical to maintain a list of all US households from which to select an SRS. The cost of sending interviewers to the widely scattered households in an SRS would be too high. So a multistage design is used.

• The Current Population Survey sampling design is:

Stage 1. Divide the US into 2007 geographical areas called primary sampling units (PSUs). Select a sample of 754 PSUs.

Stage 2. Divide each selected PSU into smaller areas called “blocks”. Stratify the blocks using ethnic and other information and take a stratified sample of the blocks in each PSU.

Stage 3. Sort the housing units in each block into clusters of 4 nearby units. Interview the households in a random sample of these clusters.

week5 23

Systematic random samples – Example

We want to choose 4 addresses from a list of 100.

Divide the list into 4 smaller lists, each of 100/4 = 25 addresses.

Choose one of the first 25 at random (using random number tables) and then choose every 25th address.

E.g., if 13 is the random number selected, the sample consists of the addresses numbered 13, 38, 63, 88.
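The systematic sampling rule can be written as a short function. This is a sketch that assumes the list size is an exact multiple of the sample size, as in the example; the function and parameter names are illustrative.

```python
import random


def systematic_sample(list_size, sample_size, start=None, seed=None):
    """Pick a random start in the first sub-list, then step through the list."""
    step = list_size // sample_size  # 100 // 4 = 25 in the example
    if start is None:
        start = random.Random(seed).randint(1, step)  # random start in 1..step
    return [start + k * step for k in range(sample_size)]


# The worked example: random start 13, then every 25th address.
print(systematic_sample(100, 4, start=13))  # [13, 38, 63, 88]
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed.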

week5 24

Cautions about sample surveys

• Undercoverage: Sample surveys require an accurate and complete list of the population (sampling frame). Because such lists are rarely available, most samples suffer from some degree of undercoverage, which occurs when some groups in the population are left out of the process of choosing the sample.

• Examples:

(i) A sample survey of households will miss homeless people, prison inmates, and students in dormitories.

(ii) An opinion poll conducted by telephone will miss the 6% of American households without residential phones.

• Nonresponse occurs when an individual chosen for the sample can’t be contacted or doesn’t cooperate.

week5 25

Response bias

• The behavior of the respondent or the interviewer can cause response bias in sample results.

• Respondents may lie, especially if asked about illegal or unpopular behavior. The sample then underestimates the occurrences of such behavior in the population.

• Answers to questions that ask the respondent to recall past events are often inaccurate because of faulty memory.

• Wording of questions: Confusing or leading questions can introduce a strong bias in a sample survey, and even minor changes in wording can change a survey’s outcome.

week5 26

Statistical inference - Parameters and statistics

• A parameter is a number that describes the population. It is a fixed number, but in practice we do not know its value.

• A statistic is a number that describes a sample. The value of a statistic is known when we have taken a sample, but it can change from sample to sample.

• We often use a statistic to estimate an unknown parameter.

week5 27

Sampling distribution

• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

• Example 3.33 page 214 in IPS

We simulate drawing SRSs of size 100 from the population of all adult US residents. Suppose that in fact 60% of the population find shopping frustrating. Then the true value of the parameter we want to estimate is p = 0.6.

The following diagrams describe the sampling distribution of the statistic p̂ for different sample sizes.

week5 28
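A simulation in the spirit of this example can be sketched in a few lines. It is an approximation, not the textbook's procedure: each respondent is drawn independently with probability p = 0.6, which is a reasonable stand-in for an SRS only because the population is far larger than the sample. The number of simulated samples (1000) and the seed are arbitrary choices.

```python
import random
import statistics


def simulate_p_hat(p, n, num_samples, seed=None):
    """Simulate the sample proportion p-hat for many samples of size n."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]


# p = 0.6 and n = 100 come from the example; 1000 repetitions is an assumption.
p_hats = simulate_p_hat(p=0.6, n=100, num_samples=1000, seed=0)

print("mean of p-hat:  ", round(statistics.mean(p_hats), 3))
print("spread (st.dev):", round(statistics.stdev(p_hats), 3))
```

A histogram of `p_hats` approximates the sampling distribution: the values vary from sample to sample but cluster around the true p = 0.6.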

week5 29

Bias and Variability

• A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

• The variability of a statistic is described by the spread of its sampling distribution.

• The spread is determined by the sampling design and the sample size n. Statistics from larger probability samples have smaller spreads.

• Managing Bias and Variability: To reduce bias, use an SRS. To reduce the variability of a statistic from an SRS, use larger samples.
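For the sample proportion p̂ from an SRS, both ideas can be stated with standard formulas. These are not given on the slides; they hold approximately when the population is much larger than the sample. The value p = 0.6 is taken from the shopping example, while the sample sizes 100 and 2500 are chosen here purely for illustration.

```latex
\[
\mu_{\hat{p}} = p
\quad\text{(unbiased)},
\qquad
\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}
\quad\text{(spread shrinks as } n \text{ grows)}
\]
\[
p = 0.6:\qquad
n = 100 \;\Rightarrow\; \sigma_{\hat{p}} = \sqrt{\frac{0.24}{100}} \approx 0.049,
\qquad
n = 2500 \;\Rightarrow\; \sigma_{\hat{p}} = \sqrt{\frac{0.24}{2500}} \approx 0.0098
\]
```

A sample 25 times larger cuts the spread by a factor of 5, but it does nothing about bias, which depends on the sampling design.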

week5 30

Question - Final Dec 2001

• Two drugs, A and B, used in the treatment of glaucoma, were tested for effectiveness on 10 diseased dogs. Drug A was administered to one eye of each dog and drug B to the other eye. Pressure measurements were taken 1 hour later on both eyeballs of each dog. Which of the following statements are true?

(a) This is an example of a matched pairs design.
(b) This is an example of a CRD.
(c) This is an example of an RBD.

• Regarding the above study, which of the following is the most important?

(a) We need to randomize the assignment of dogs to drugs.
(b) We need to randomize the assignment of drugs to eyes.
(c) We need to select the dogs randomly from a bigger population.
(d) We need to stratify the dogs before assigning the drugs.
(e) We need to pair the dogs based on some relevant criteria related to the response.

week5 31

Question - Dec 2001

• A list of enumeration areas in Ontario is made. From this list we pick every 10th one after a random start. For the selected areas, we obtain maps. For each map we number the blocks from 1 to N (N = number of blocks in that area). Using a random number (RN) table, we select two distinct numbers between 1 and N and include the corresponding blocks in our sample. On each selected block, we start at the northeast corner and walk around the block, selecting every 5th household into our sample (from a random start). The types of sampling methods used here (in no particular order) are

(a) stratified, SRS, systematic
(b) systematic, multistage, stratified
(c) multistage, SRS, systematic
(d) multistage, SRS, stratified
(e) SRS, systematic

week5 32

Question - Summer 2001 test-2

a) In order to study various aspects of child abuse, 15 ‘Child Welfare Service Areas’ (CWSA) are randomly selected from all those across Canada. From each selected CWSA, 10% of cases are chosen by taking every 10th file from a cabinet.

i) Is this an observational study or an experiment?

ii) Describe the design (in statistical terminology).

b) You want to determine the best colour for attracting cereal leaf beetles to boards on which they will be trapped. You will compare three colours: blue, green, yellow. The response variable is the count of beetles trapped. You will mount one board on each of 9 poles evenly spaced in a square field, with 3 poles in each row as shown below. You will proceed with a completely randomized experiment in order to compare the colours. Randomly assign colours to poles, and mark on the field sketch the colours assigned to each pole. Indicate exactly how you assigned the colours to the poles.

week5 33

c) In the cigarette smoking and cancer video, there was one study in which smokers and non-smokers were matched up with respect to 30 different variables, making them ‘as like as possible’ in the words of the speaker. Cancer rates differed substantially between the smokers and non-smokers.

i) Is this an observational study or a randomized block design?

ii) Why does this or why does this not prove smoking causes cancer?

d) Increasing the sample size is one method for reducing bias. True or false?

week5 34

Question - Term test Summer 99

Suppose that we want to select a sample of students from the sta220 class (150 students in total).

a) If we assign each student in the class a number from 001-150, and then use an RN table to pick 2 distinct RNs from 001-150, and then take the corresponding students, what do we call this type of sample?

b) If we select the 5th student, after ordering the students in some fashion, what do we call this type of sampling design?

c) If we select randomly 4 students from the centre section, and then 2 at random from the section on the left side and finally 2 randomly from the section on the right side, what type of sampling design is this?

d) If we select randomly 5 rows in the classroom, then 2 students randomly from each selected row, what do we call this type of sampling design?

week5 35

Question - Term Test summer 2000

For each of the following studies,

i) Indicate whether it is an observational study or a controlled experiment.

ii) If an observational study:
(a) Describe precisely the sampling design utilized. Use appropriate statistical terminology.
(b) Indicate the sources of bias, if any are present.

iii) If an experiment, identify
(a) the experimental unit(s) and the response variable(s).
(b) the factors, treatments and the number of treatments.

week5 36

(A) A city has 2000 city-blocks in each of 4 geographical areas (NE, NW, SE, SW). Five blocks will be selected at random from each geographical area. For each selected block, 20% of households will be selected, by having the interviewer walk around the block, and take every 5th household, starting with the house at the Northwest corner. When the interviewer arrives at a household, one of the adults present is randomly selected to be interviewed.

week5 37

(B) In order to investigate the effect of repeated exposure to an advertising message, a number of undergraduate students viewed a 40 minute TV program that included ads for a digital camera. Some of the students saw a 30 second commercial; others saw a 90 second version. The same commercial was repeated either 1, 3 or 5 times during the program. After viewing, all of the subjects answered questions about their recall of the ad, their attitude toward the camera, and their intention to purchase it.