Applied Statistics I Lecture Notes (Outline View)


1/25/2012


Applied Statistics I

What is data?
  Facts that convey information from which conclusions can be drawn (pg. 2)
  {20, 32, 50, 52, 60}
  {72, 60, 62, 71, 73}
  The above two sets are not data; we need to add information
  First set: ages of people in a cancer study
  Second set: heights of students in my probability class
  With this additional information, each set of numbers is now data

What is statistics?
  Science of data: obtaining data, analyzing data, interpreting data
  Data must be properly obtained
  We want to make decisions based on data
  Many applications of statistics: genetics, engineering, medicine, government, physics, policy making, self-improvement, etc.

Variation in Data
  ELECE dataset: 174 daily electric usage measurements (KWH)
  KWH is not the same from day to day: there is variation in kilowatt hours
  Could be due to different appliances being turned on for different days; dishwasher on one day, but off the other day
  Or maybe one day you watched TV for 1 hour, another day for 3 hours; or the electric heater was on for 4 hours one day and 1 hour the other day
  We can assess the variation in a dataset by its data distribution
  Data distribution: a list of each data value and its associated frequency

Histograms
  Tool for displaying variation in data
  Manual creation, computer creation
  Manual: break the range of the data into a number of intervals, count the frequencies in each interval
  For the KWH dataset, we used the intervals: [10,15), [15,20), [20,25), [25,30), [30,35), [35,40), [40,45), [45,50), [50,55)
  A rule of thumb:
    Fewer than 50 data points: use 5 to 7 intervals
    50 to 99 data points: use 6 to 10 intervals
    100 to 250 data points: use 7 to 12 intervals
    More than 250 data points: use 10 to 20 intervals
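The manual histogram procedure above (break the range into intervals, then count the frequency in each) can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up here and are not the ELECE data; the interval edges and the rule of thumb are the ones stated on the slide.

```python
def interval_counts(values, edges):
    """Count how many values fall in each left-closed interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def suggested_interval_range(n):
    """Rule of thumb from the slide: (min, max) number of intervals for n data points."""
    if n < 50:
        return (5, 7)
    if n < 100:
        return (6, 10)
    if n <= 250:
        return (7, 12)
    return (10, 20)


# Hypothetical KWH-like values, invented for illustration only.
kwh_sample = [12.5, 15.0, 19.9, 22.0, 22.4, 31.0, 54.9]
edges = list(range(10, 60, 5))  # 10, 15, ..., 55 -> nine intervals

counts = interval_counts(kwh_sample, edges)
for left, right, c in zip(edges, edges[1:], counts):
    # A bar of '*' with length equal to the frequency of the interval.
    print(f"[{left},{right}) {'*' * c}")
```

For the 174 observations in the ELECE dataset, the rule of thumb gives 7 to 12 intervals, which is consistent with the nine intervals used on the slide.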


Histograms continued
  After you break up the range, count the frequency in each interval
  Draw a vertical bar with height proportional to the frequency of that interval
  Show on board Table 1.3 (pg. 6) and Figure 1.1 (pg. 5)

Histograms: SAS method
  Start SAS
  Go through SAS method

Stationarity
  Kilowatt data taken over time
  To make future predictions, we want future observations to have the same variation pattern as past and present data
  Pattern of variation must not change as we gather more measurements
  Stationary process: a data generating mechanism for which the distribution of the resulting data doesn't change appreciably as more data are generated
  How to assess stationarity?
  One method: make histograms of the KWH dataset for different time periods (Figure 1.2): Oct 17 - Dec 5, Dec 6 - Jan 24, Jan 25 - Mar 15, Mar 16 - May 6
  Show SAS
  Is it stationary? If stationary, the histograms should display the same pattern of variation

Stationarity continued
  Can also use scatterplots to assess stationarity
  Scatterplot of KWH vs Date (show)
  Talk about pattern in scatterplot (U)
  We see how KWH varies with time
  Key points:
    Data must be plotted vs time to assess stationarity: scatterplot method or series of histograms
    A single histogram can't assess stationarity (show)

Time Series and Moving Averages
  Hard to see trend in scatterplots
  Time series plot: connect consecutive times with a line
  Show time series plot in SAS
  General trend: rise and decline
  Jagged up and down lines: high-frequency oscillations, also called noise
  Signal: underlying pattern in the data
  Filtering method: suppress high-frequency oscillations/reduce noise

Filtering
  Moving average filter: replaces each data value by the average of itself and observations occurring immediately before it in time
  3-term moving averages, 5-term moving averages, etc.
  Show formulas on board
  Do example on board
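The moving average filter can be sketched as follows. The formulas were shown on the board and are not in these notes, so this sketch assumes the reading closest to the slide wording: a k-term moving average replaces each value by the mean of itself and the k - 1 observations immediately before it (a trailing window). The textbook's exact formula may differ, for example by centering the window. The function name and the sample values are invented for illustration.

```python
def moving_average(values, k):
    """k-term trailing moving average (assumed form; see the note above).

    Returns one average for every position that has k - 1 earlier observations,
    so the result has len(values) - k + 1 entries.
    """
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and the number of observations")
    smoothed = []
    for i in range(k - 1, len(values)):
        window = values[i - k + 1 : i + 1]  # the value itself plus k - 1 predecessors
        smoothed.append(sum(window) / k)
    return smoothed


# Hypothetical daily readings, invented for illustration only.
readings = [20.0, 26.0, 23.0, 29.0, 32.0, 35.0]
three_term = moving_average(readings, 3)
print(three_term)
```

Averaging over more terms (for example 7 for a weekly cycle) smooths more heavily, at the cost of losing the first k - 1 positions of the series.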


KWH data after filtering
  Applied 7-term MA to KWH dataset: reduction of noise
  Why 7? Weekly cycle of electric usage
  Show Figure 1.5 (reduction of noise, easier to see trends)
  Show Figure 1.6 (28-term MA, monthly cycle)
  Key points:
    Averages show less variability than raw data
    Averages with more terms show less variability than averages with fewer terms
    Both figures show KWH data is not stationary: the signal increases, then decreases
    If a process is not stationary, you should try to explain the pattern of nonstationarity

Stationary process
  Washer data (Figure 1.7): thickness of washers in mm coming off a production line; 100 washers measured
  5-term MA was used; no pattern versus time
  Stationary process

Causes of variation
  Talked about histograms to summarize variation in data
  But what are some causes of the variation in the data?
  Cause of variation: mechanism responsible for some of the observed variation in the process
  Two types of causes: common causes of variation and special causes of variation
  Common causes of variation: process is running as well as it can, noise inherent in the process; common causes include poor supervision, inadequate training, or substandard maintenance
  Special causes of variation: problems that arise periodically and unpredictably, for example broken tools, jammed machine, etc.
  Plots of data can identify causes of variation

Washer example
  Washer data and 5-term MA showed a stationary process
  But their thicknesses weren't matching competitors
  Their thickness variability was also greater than competitors
  What's going on?
  Need to look at details of the process
  Washers were produced by three different machines
  Plot thickness vs time for each machine (Figure 1.8)
  First machine is OK; second and third show abnormal behavior; machine 2 is not stationary (declining trend)
  Further investigation:
    Machine 1: only common causes
    Machine 2: overstretched spring, special causes
    Machine 3: worn cam, special causes
  Identified causes responsible

Stratified plots
  Graphical tool to identify causes of variation in a stationary process
  Data are split into smaller groups, or strata
  Look at the distribution of the data within each stratum, and see how the distributions change from stratum to stratum

Another washer example
  A company was having too many washers outside of their specification, so they wanted to investigate why

Washer example continued
  Step 1: They looked at washer thickness versus time for each machine (Figure 1.9)
  Patterns show some differences, but all three machines are stationary
  Stratified plot: plot thickness vs machine; we want to determine differences between machines (show plot)
  Look at the vertical spread and the center of the spread

Interpretation of stratified plot
  Within-machine variation of machine 1 is similar to machine 3; within-machine variation of machine 2 is much larger
  Centers: machine 1 is centered around 2.16 mm, machine 2 around 2.30 mm, and machine 3 around 2.05 mm
  Between-machine variation between machines 2 and 3 is large
  Investigated why machine 2 has excessive thickness
  Investigated why machine 3 has low thickness

Other graphical tools to identify causes of variation
  Process flow diagram: schematic diagram showing the process steps and the flow of material and information linking them during the process; how the process is supposed to work
  Ishikawa diagram: cause and effect diagram, useful for identifying possible causes of observed phenomena
  Mesh making process example:
    Process flow diagram: the way the process is supposed to work (show)
    Ishikawa diagram (show): complete list of possible causes of variation; one category system is called 5Ms + E (man, machine, methods, materials, measurement, and environment)
    Trunk: process being studied (high strength meshes)
    Branches: relevant causes of variation
    Smaller branches: possible causes of variation affecting the major causes of variation
  These two diagrams help us think about possible causes of variation and help to find ways of improving the manufacturing process
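What a stratified plot shows by eye (the center and the vertical spread within each stratum) can also be computed numerically. The sketch below groups measurements by stratum and reports the mean and standard deviation of each group. The function name and the thickness values are invented for illustration; they are not the washer data from the text, although the centers were chosen to echo the ones quoted on the slide.

```python
from collections import defaultdict
from statistics import mean, stdev


def stratum_summaries(pairs):
    """Group (stratum, value) pairs and return {stratum: (center, spread)}.

    Center is the mean and spread is the sample standard deviation
    within each stratum.
    """
    groups = defaultdict(list)
    for stratum, value in pairs:
        groups[stratum].append(value)
    return {s: (mean(v), stdev(v)) for s, v in groups.items()}


# Hypothetical washer thicknesses in mm, invented for illustration only.
measurements = [
    ("machine 1", 2.15), ("machine 1", 2.16), ("machine 1", 2.17),
    ("machine 2", 2.20), ("machine 2", 2.30), ("machine 2", 2.40),
    ("machine 3", 2.04), ("machine 3", 2.05), ("machine 3", 2.06),
]

summaries = stratum_summaries(measurements)
for machine, (center, spread) in summaries.items():
    print(f"{machine}: center {center:.2f} mm, spread {spread:.3f} mm")
```

Comparing the spreads gives the within-stratum variation; comparing the centers gives the between-stratum variation.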


Gage R&R procedure
  Assesses the quality of a measurement process
  Gage Repeatability and Reproducibility
  Focuses on a single measurement device and the operators that use the device
  In Gage R&R, we consider repeatability and reproducibility
  Repeatability: consistency of the device in repeated measurements by the same operator on the same part
  If repeatability is low, the cause may be a faulty measurement device, a flawed measuring methodology, or poor operator training
  Reproducibility: variation between operators; low reproducibility means measurements vary greatly from operator to operator, which could be due to poor operator training


Gage R&R Example
  Manufacturer is making roller bearings
  There are 3 operators, and each operator uses a gage to make 10 measurements on the bearing
  Measurement of the play in an assembled bearing
  We would like to know if the process is stationary, and whether the process is reproducible and repeatable
  To assess stationarity, a time series plot was made for each operator (show)
  Interpretation: all three operators show stationarity; the measurements of operators 1 and 2 are close to one another; operator 3's measurements show more variability than those of operator 1 or 2
  Next we looked at a stratified plot (show)
  Interpretation of stratified plot:
    Observations for operator 3 are higher than for operators 1 and 2
    More variability in operator 3 than in operators 1 and 2
    Measurements are not consistent between operators; measurements made by operator 3 are substantially different from those of operators 1 and 2: reproducibility problem
    Repeatability problem? Depends on whether the variation of the measurements for operators 1 and 2 is acceptable; if yes, no problem; if the variation is too big, then look at the measuring device and the process of taking measurements

Chapter 2: Summarizing Data
  Graphical and numerical tools to summarize information
  Bar charts and histograms
  Shapes of distribution
  Summary measures of location: mean, median, mode, quantiles, percentiles
  Summary measures of variation: mean absolute deviation, standard deviation, and interquartile range
  These methods summarize a single pattern of variation


Variables and Data
  Before we can summarize data, we need to know the types of variables under study
  Variable: name given to what is being measured, counted, or observed when data are collected
  Height, weight, gender, race, armspan, etc. are variables
  Variables can be quantitative or categorical
  Quantitative variable: results from a physical measurement or count, e.g., age or height
  Categorical variable: defines categories, e.g., ethnicity or race

Types of variables continued
  Discrete versus continuous
  Discrete set: each element of the set can be assigned a unique counting number 1, 2, 3, 4, ...
  Examples of discrete sets:
    Sets with a finite number of elements
    Subsets of the integers {0, ±1, ±2, ±3, ...}
  Any interval of real numbers is not a discrete set
  Discrete variable: can only take values in a discrete set
  Continuous variable: a variable that can take on any value in an interval of the real number line
  Discrete variable examples: gender {M, F}; number of heads in four coin flips {0, 1, 2, 3, 4}
  Continuous variable example: height, which in theory could take any decimal value between 5 ft and 7 ft
  Age: continuous and discrete; age could be continuous because in theory age could be any decimal value between 0 and 100, but usually people talk of ages in whole years, so age is treated as discrete


Displaying Data Distributions
  Is your data categorical or quantitative?
  For categorical data: bar chart
  Construction of a bar chart:
    Count the frequency of each category
    List the categories on the x-axis
    Draw a bar above each category; the height of each bar is the frequency of data in that category

Bar Chart example
  You fill in during class

Quantitative Data
  Frequency histogram summarizes quantitative data
  Is your variable of interest discrete or continuous?
  If discrete, label the x-axis with the variable values and the y-axis with the frequencies (show example)
  If the variable is continuous, or is a discrete variable with a large number of possible values:
    Determine the number of bars you want to use
    Find the smallest and largest values, ylow and yhigh
    Divide the interval ylow to yhigh into b equal subintervals
    Decide if data at a boundary goes to the upper or lower subinterval; either way is OK, just need to be consistent
    Count the frequency in each subinterval
    Draw a bar above each subinterval with height equal to the frequency of data in that subinterval

Frequency Histogram Example
  You fill in during class

Features of a histogram
  Figure 2.4: two peaks and a valley
  Two centers here
  Modal bar: a bar with height greater than or equal to adjacent bars
  A histogram with a single modal bar is called unimodal
  A histogram with more than one modal bar is called multimodal
  A histogram with two modal bars is called bimodal
  Figure 2.4 is considered bimodal
  Bimodality indicates two subgroups in the data

Investigate bimodality
  Idea: two subgroups in data
  Plot frequencies of female workers (pg. 45)
  Bimodality explained:
    Lower modal bar: mostly females
    Higher modal bar: mostly males

Types of frequency histograms
  Symmetric: the bars to the left of some point are a mirror image of the bars to the right of that point
  Skewed: not symmetric
  Skewed right: unimodal, and bars extend farther to the right of the modal bar than to the left
  Skewed left: unimodal, and bars extend farther to the left of the modal bar than to the right
  Symmetric: Figure 2.3
  Skewed right: Figure 2.7

Analyzing frequency histograms
  Does the histogram have a single modal bar or more than one modal bar? Where are they located?
  Is the histogram symmetric? If it is not symmetric, how is it skewed?
  Does it make sense to talk of the center of the distribution? If so, where is the center?
  Describe the spread of the distribution.
  Are there gaps in the ranges of values, or values that lie far away from the rest of the data?


Sources of histogram shapes
  Symmetric unimodal: test scores (IQ, SAT, GRE), weights and heights
  Right skewed: data that have a lower bound but no upper bound, e.g., salaries, house prices, breaking strength of a material
  Left skewed: ages at death
  Multimodal: presence of two or more subgroups in the data

Summary measures of quantitative data
  Measures of location: mean, median, mode, quartiles, percentiles
  Measures of spread: mean absolute deviation, standard deviation, variance, IQR
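The summary measures listed above can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions, since the formulas in these notes were done on the board: the mean absolute deviation is taken about the mean and divided by n, the variance and standard deviation use the n - 1 divisor, and the quartiles use the module's "inclusive" interpolation rule, which is only one of several common conventions and may not match the textbook's. The data values are invented for illustration.

```python
import statistics


def mean_absolute_deviation(values):
    """Average absolute distance from the mean (assumed divisor: n)."""
    center = statistics.mean(values)
    return sum(abs(v - center) for v in values) / len(values)


def quartiles(values):
    """Return (Q1, median, Q3) using the 'inclusive' interpolation convention."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


# Hypothetical observations y1..yn, invented for illustration only.
y = [2.0, 4.0, 4.0, 5.0, 10.0]

q1, med, q3 = quartiles(y)
location = {
    "mean": statistics.mean(y),
    "median": statistics.median(y),
    "mode": statistics.mode(y),
}
spread = {
    "mean absolute deviation": mean_absolute_deviation(y),
    "variance": statistics.variance(y),       # sample variance, divisor n - 1
    "standard deviation": statistics.stdev(y),
    "IQR": q3 - q1,
}
print(location)
print(spread)
```

Note how the single large value 10.0 pulls the mean above the median, a first hint of right skew.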


Measures of location
  Data have n observations, y1 to yn
  Measures of location try to describe the center of the data
  Mean example
  Median example
  Mode example
    Unimodal: the mode is the center point of the interval over which the modal bar lies
    Multimodal: the modes are the center points of the intervals over which the modal bars lie
  Skewed right example

Percentiles and Quantiles
  Q1, the first quartile, is a point at or below which lies one quarter of the observations
  Q3, the third quartile, is a point at or below which lie three quarters of the observations
  In general, the qth quantile is a value at or below which lies proportion q of the data
  The qth quantile is also called the 100qth percentile
  Q1 = 25th percentile = 0.25 quantile

Measures of spread
  How spread out a data distribution is
  Mean absolute deviation
  Standard deviation
  Variance
  IQR
  Do examples on board

Section 2.4: Outliers
  What is an outlier?
  An outlier is an extremely unrepresentative data point
  Data point that is far away from the body of the data
  Causes of outliers: typing errors, machine malfunction, unusual real data values, scientific discovery
  When you find an outlier, should you omit it? It may be a new discovery
  Process for handling outliers:
    Try to identify the cause of the outlier
    If a cause has been found, the outlier may be omitted or replaced; for example, a typing error may be corrected
    If no cause can be found, analyze the data with and without the outlier
    If the two analyses have similar results, the outlier is not having much effect
    If the two analyses have different results, report the presence of the outlier, present both analyses, make extra effort to identify the cause of the outlier, and obtain more data

How to identify outliers?
  Frequency histogram example:
    Frequency on the y-axis
    The x-axis is the difference between water depth measurements obtained by an ultrasonic measuring device and the actual depth
    Where are the outliers? 0.15 to 0.21

Method #2
  A second method to identify outliers is called a boxplot
  Also called a box and whisker plot
  Displays the main features of a data distribution and identifies possible outliers
  To create a boxplot we need the five number summary of the data distribution
  The five numbers are:
    The median
    The first quartile, Q1
    The third quartile, Q3
    The lower adjacent value, A-: the smallest data value greater than Q1 - 1.5 * IQR
    The upper adjacent value, A+: the largest data value smaller than Q3 + 1.5 * IQR

5 number summary to boxplot
  Construction process:
    Find the 5-number summary
    Draw a box ranging from Q1 to Q3
    Draw a line at the median location
    Draw a whisker from A- to the center of the lower side of the box
    Draw a whisker from A+ to the center of the upper side of the box
    Any data values outside of the whiskers are drawn as points
    These data values outside of the whiskers are called outliers
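The five number summary and the outlier rule behind the boxplot can be sketched as below. The adjacent values follow the slide's definitions literally (strictly greater than Q1 - 1.5 * IQR, strictly smaller than Q3 + 1.5 * IQR). The quartiles use the "inclusive" interpolation rule of Python's `statistics.quantiles`, which is an assumption: the textbook may compute quartiles differently. The function name and the data are invented for illustration and are not the ultrasonic measurements.

```python
import statistics


def boxplot_summary(values):
    """Return the five number summary plus the points flagged as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Adjacent values: the most extreme data values strictly inside the fences.
    a_minus = min(v for v in values if v > lower_fence)
    a_plus = max(v for v in values if v < upper_fence)
    # Anything beyond the whiskers is drawn as a separate point.
    outliers = sorted(v for v in values if v < a_minus or v > a_plus)
    return {
        "median": median,
        "Q1": q1,
        "Q3": q3,
        "A-": a_minus,
        "A+": a_plus,
        "outliers": outliers,
    }


# Hypothetical data, invented for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_summary(data)
print(summary)
```

Here the IQR is 4.5, so the upper fence is 7.75 + 6.75 = 14.5 and the value 30 falls outside the upper whisker.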


Boxplot Example
  5-number summary for the ultrasonic example
  Draw box and median line
  Draw 2 whiskers
  Draw points for outliers
  Show boxplot


More about boxplots
  Another type of boxplot is the side-by-side boxplot
  Allows you to compare multiple datasets
  Example: salaries of technical support workers, male versus female
  Disadvantages of boxplots:
    Only based on 5 numbers, so can miss important data features
    A boxplot cannot show bimodality or multimodality
    Should use frequency histograms in addition to boxplots, since they can show bimodality and multimodality

SAS Insight example
  Load ELECE
  Make boxplot
  Distribution (Y)
  Robust measures of scale
  Frequency counts (book calls it data distribution)

Section 2.5: Resistance of summary measures
  Resistant vs nonresistant summary measures
  Resistant measure: a summary measure that is not greatly changed by one or more outliers
  Nonresistant measure: a summary measure that is greatly changed by one or more outliers
  Typically, the mean, the mean absolute deviation, and the standard deviation are not resistant
  Median, mode, IQR, and quartiles are resistant measures

Section 2.6: Trimmed and Winsorized means
  People want a resistant measure that acts like the mean when there are no outliers
  Two measures with this property have been developed: the k-times trimmed mean and the k-times Winsorized mean
  K-times trimmed mean idea: we want to focus on the central data values, and the extreme values are causing trouble as outliers, so omit the k lowest and k highest observations and compute the mean of the rest
  Show formula
  K-times Winsorized mean:
    K smallest values are changed to y(k+1)
    K largest values are changed to y(n-k)
    The rest of the values are unaltered
    Take the average of this new data set consisting of n values
  Show formula

Section 2.8: Exploratory Data Analysis
  Process of looking at data in many different ways in order to get an initial understanding of the phenomenon under study
  Includes graphics: frequency histograms and boxplots
  Includes summary measures: means, medians, standard deviations, and IQR
  Want to get an idea of the patterns in the data and to use the data to suggest questions for further study
  Exploratory analysis doesn't provide conclusive evidence to answer specific questions
  Exploratory data analysis doesn't generalize beyond the data set being analyzed
  You need a formally designed study and the statistical inference methodology in order to establish evidence to answer specific questions and in order to generalize beyond the set of data being analyzed
  Next we are going to discuss study design
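Returning to Section 2.6, the k-times trimmed mean and the k-times Winsorized mean can be sketched directly from their verbal definitions, where y(1) <= ... <= y(n) are the sorted observations. The function names and the data are invented for illustration; the formulas themselves were shown on the board and are not reproduced in these notes.

```python
def trimmed_mean(values, k):
    """Omit the k lowest and k highest observations, then average the rest."""
    y = sorted(values)
    n = len(y)
    if k < 0 or 2 * k >= n:
        raise ValueError("need 0 <= k < n / 2")
    kept = y[k : n - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, k):
    """Replace the k smallest values by y(k+1) and the k largest by y(n-k),
    then average all n values."""
    y = sorted(values)
    n = len(y)
    if k < 0 or 2 * k >= n:
        raise ValueError("need 0 <= k < n / 2")
    # With 0-based indexing, y(k+1) is y[k] and y(n-k) is y[n - k - 1].
    adjusted = [y[k]] * k + y[k : n - k] + [y[n - k - 1]] * k
    return sum(adjusted) / n


# Hypothetical data with one extreme value, invented for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sum(data) / len(data))     # ordinary mean, pulled up by the value 100
print(trimmed_mean(data, 1))
print(winsorized_mean(data, 1))
```

With k = 1 both resistant measures give 5.5, while the ordinary mean is 14.5, which illustrates why the mean is called nonresistant in Section 2.5.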

Section 3.1: The role of statistics in producing and analyzing data
  You need to design a study properly in order to obtain statistically valid results
  Definition: a designed study is one that employs a systematic arrangement or pattern for collecting data
  Two types of designed studies: controlled experiments and observational studies
  Both share the idea of the sampling unit
  Definition: a sampling unit is an entity on which measurements or observations can be made
  How to select the sampling units to be used in the study? Section 3.2: Sampling

Section 3.2: Sampling
  Overview:
    Sampling definitions
    Example of definitions
    Purpose of sampling: why sample?
    Sampling designs: simple random sampling, stratified random sampling, cluster sampling, multistage sampling
    Errors in sampling
    Steps in designing sampling plans

Sampling definitions
  Target population: the collection of sampling units about which we want to draw conclusions
  Frame: a list of all of the sampling units in the target population
  Sample: subset of the target population from which observations are actually obtained; conclusions about the target population will be drawn from this sample
  Sampling design: the pattern, arrangement, or method used for selecting a sample of sampling units from the target population
  Sampling plan: the operational plan, including the sampling design, for actually obtaining or accessing the sampling units for the study


Sampling definition examples
  An employer is interested in knowing the effectiveness of a new employee training program on employee productivity, so 50 workers are selected from all of the hourly workers for participation in the study. 25 workers are given the training program and the rest are not given training. After training is complete, the productivity of the 50 workers is assessed over a 3-week period.
  To class:
    What is the sampling unit?
    What is the target population?
    What is the frame?
    What is the sampling design?
    What is the sampling plan?
  Answer:
    The sampling unit: a worker; the worker's productivity is the observation
    The target population: all of the hourly workers
    The sample: the 50 chosen workers
    The sampling design: the method used to obtain the 50 workers
    The sampling plan: the operational plan, which includes the sampling design and a method of accessing the worker productivity records

Why sample? Why bother?
  An alternative to sampling is called a census. A census is where you obtain observations from every sampling unit in the population.
  So why use a sample instead of conducting a census?
    Cost: conducting a census usually involves an enormous expense
    Time: a census may take too long to complete, so the value of the results may be diminished
    Precision: it might be difficult to get accurate information from each individual in the population
    Feasibility: sometimes a census is not feasible. One example is destructive testing, like time until a hard drive fails. If we tested all of the hard drives for time until failure, there would be no functional hard drives remaining.

Sampling designs
  Goal: to obtain conclusions from the sample that can be generalized to the target population
  Therefore, we need a sample that is representative of the target population
  We are going to select sampling units by probability sampling methods
  Probability sampling: a method of choosing a sample using a pre-specified chance mechanism
  Why use probability sampling?
    Helps assure the representativeness of the sample
    Allows us to characterize the accuracy and precision of the results
    Allows results of the study to be generalized to the target population


Common probability sampling methods Simple random sampling A probability sampling method in which each potential sample has the same chance of being chosen Stratified random sampling A sampling method in which a separate simple random sample is taken from each stratum Cluster sampling A sampling design in which sampling units are grouped into clusters and the clusters are sampled Multistage sampling A sampling design in which a sample is taken in stages Next: more details on these four methods of sampling Simple random sampling How to draw a simple random sample? Assign each member of the population a number Put each number on a piece of paper and put it in a hat Draw out one slip at a time until you have drawn the amount of slips equaling the desired sample size The sample consists of the members of the population corresponding to these numbers Take observations on these units Hat method not practical use table of random digits or use a computer Stratified random sampling Need homogenous subgroups of the population In the subgroups, the sampling units are similar in one or more ways These groups are called strata Stratified random sampling: Divide the population into a number of strata Take a separate simple random sample from each stratum Why stratified sampling instead of simple random sampling? If the sampling units within each stratum are more homogenous than those between different strata, stratified random sampling will provide gains in precision in the quantitative measures used to describe the population When we desire to analyze the strata separately, stratified random sampling ensures we have an adequate sample size for each Example The administration at a college wants to study the opinions of undergraduates on a number of issues, so they need to collect a sample of undergraduate and interview each member of sample. 
In this scenario a stratified random sample has the following advantages compared to simple random sampling:
Reason #1: The administration would like to know the opinions of all of the majors, and some majors have few students (say 1% of the students are humanities majors). If we took a simple random sample, we would likely have few or no humanities majors in our sample, and we would like to know the opinions of the humanities majors.
In a stratified random sample, take majors as strata. Sample reasonable numbers of each major so opinions of all majors are represented in the study.
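The college example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of 1000 students, the three majors, the sample size of 5 per major, and the function names are all made up for the sketch; they are not from the lecture.

```python
import random

def simple_random_sample(units, size, rng):
    """Draw `size` units without replacement; every subset of that size is equally likely."""
    return rng.sample(units, size)

def stratified_sample(strata, size_per_stratum, rng):
    """Take a separate simple random sample from each stratum."""
    return {
        name: simple_random_sample(units, min(size_per_stratum, len(units)), rng)
        for name, units in strata.items()
    }

rng = random.Random(2012)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 1000 students, of which 1% are humanities majors.
strata = {
    "humanities": [f"hum-{i}" for i in range(10)],
    "engineering": [f"eng-{i}" for i in range(590)],
    "science": [f"sci-{i}" for i in range(400)],
}

sample = stratified_sample(strata, size_per_stratum=5, rng=rng)
for major, chosen in sample.items():
    print(major, len(chosen))
```

Every major, including the small humanities stratum, contributes 5 students, which a simple random sample of 15 from the whole frame would not guarantee.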


Example continued
Reason #2: If we suspect that majors have widely different beliefs on the issues of interest, then a simple random sample would lead to results with large variability. If the opinions within a major are homogeneous, stratified random sampling provides gains in precision in the quantitative measures used to describe the population.
Take a simple random sample from each major (i.e., stratified random sampling with the majors as strata)
How to choose the sample size for each stratum?
Proportional allocation method: choose a sample size for each stratum proportional to the number of sampling units in that stratum

Other types of sampling designs
So far, we discussed simple random sampling and stratified random sampling
Cluster sampling: a sampling design in which sampling units are grouped into clusters and the clusters are sampled
One-stage cluster sampling: all of the sampling units within the selected clusters are included in the study
Two-stage cluster sampling: within the selected clusters, we choose a random selection of sampling units to include in the study
Multistage sampling: a sampling design in which a sample is taken in stages
Example of multistage sampling:
Set up a hierarchy of clusters: divide the country into counties, the counties into townships, the townships into blocks, and the blocks into households
Start with the largest clusters and take a sample at each stage

Example of a multistage sample
Stage 1: A sample of 3000 counties in the United States is chosen
Stage 2: A sample of townships within each of the selected counties is chosen
Stage 3: A sample of city blocks within each of the selected townships is chosen
Stage 4: A sample of households within each of the selected blocks is chosen. Each of the selected households is interviewed/surveyed.

When to use multistage cluster sampling?
Advantages of multistage cluster sampling compared to simple random sampling or stratified random sampling:
It may not be possible to obtain a frame of the sampling units. For example, a list of all US households is not available. Since lists of counties and townships are available, multistage cluster sampling is doable.
Traveling costs: a simple random sample could have higher traveling costs than a multistage cluster sample. Sampled households are on the same block, so we would save on traveling costs.
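The county, township, block, household hierarchy from the multistage example can be sketched as a nested structure that is sampled one stage at a time. The tiny hierarchy, the names, and the per-stage sample sizes below are hypothetical and chosen only to keep the example small.

```python
import random

def multistage_sample(cluster, sizes, rng):
    """Sample sizes[0] members at this stage, then recurse into each chosen member.

    `cluster` is a nested dict (cluster name -> sub-clusters); at the last
    stage it is a list of final sampling units (households).
    """
    if isinstance(cluster, list):
        return rng.sample(cluster, min(sizes[0], len(cluster)))
    chosen = rng.sample(sorted(cluster), min(sizes[0], len(cluster)))
    selected = []
    for name in chosen:
        selected.extend(multistage_sample(cluster[name], sizes[1:], rng))
    return selected

# Hypothetical hierarchy: 4 counties x 3 townships x 3 blocks x 5 households.
country = {
    f"county{c}": {
        f"township{c}.{t}": {
            f"block{c}.{t}.{b}": [f"household{c}.{t}.{b}.{h}" for h in range(5)]
            for b in range(3)
        }
        for t in range(3)
    }
    for c in range(4)
}

rng = random.Random(0)
households = multistage_sample(country, sizes=[2, 2, 2, 3], rng=rng)
print(len(households))  # 2 * 2 * 2 * 3 = 24 households
```

At each stage the code only needs a list of the clusters inside the clusters already selected, which mirrors the point that a full list of all households is never required.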


Errors in sampling
We want to obtain a sample of sampling units in order to draw conclusions about the target population
Conclusions from a sample won't exactly match the true state of the population
Example: an exit poll of 100 individuals shows that 55% favor Clinton for president, but in the target population, we won't have exactly 55% favoring Clinton for president
Sampling error: the error obtained when, due to chance, a sample quantity gives different results from the analogous population quantity
Probability sampling methods allow us to precisely quantify the sampling error
Sampling error is not the result of a mistake, but is due to natural variability in the sampling process

Other sources of error
Nonsampling errors: errors that result from an inability to carry out the sampling plan correctly
Bias: any systematic error in data collection or measurement
Samples could be unrepresentative of the population, or measurements could be obtained in a biased manner
Unrepresentative samples can occur due to selection bias
Selection bias: a systematic tendency on the part of a sampling procedure to underrepresent or exclude one or more kinds of sampling units from the sample
Examples: If our target population is all Americans, a sampling plan which samples from households will miss the homeless, inmates, and students in dorms. A sampling plan that randomly samples telephone numbers will miss people without phones.

Steps in designing sampling plans
Step 1: Identify the target population
Does the frame truly represent the target population?
If we ask for the opinions of the readers of a particular magazine, we can't generalize the responses beyond the population of readers of that magazine. The sampled population, the readers of the magazine, may differ in their opinions from the target population of the general public
Step 2: How are you going to conduct sample selection?
Probability sampling methods should always be used
Depending on the problem under consideration and the budget, choose a method of sampling
Step 3: Establish procedures to reduce nonsampling errors
Try to reduce selection bias, identify sources of selection bias, and put more resources into obtaining data from uncovered groups
Step 4: Do a pilot study
Allows us to obtain information about the target population in order to design the study properly
Pilot studies help us determine an appropriate sample size
Pilot studies allow us to correct any problems with the sampling plan


Section 3.4: Controlled experiments
Types of designed studies include controlled experiments and observational studies
First, controlled experiments

Controlled experiments
Definitions
Errors
Some principles of experimental design
How to assign treatments to experimental units
How to choose the experimental units
How to establish cause and effect
Considerations for human subjects
Steps for planning an experiment

Some definitions:
Experimental unit: the name given to any sampling unit that has been selected for use in a controlled experiment
Response: a measurement or observation of interest that is made on an experimental unit
Factor: a quantity that is thought to influence the response
Experimental factor: a factor purposely varied by the experimenter
Nuisance factor: a factor that cannot be controlled by the experimenter. Nuisance factors may or may not be known by the experimenter.
Level: each value assumed by a factor in an experiment
Treatments: the combinations of levels of factors for which the response will be observed
Effect: the change in the average response between two factor levels or between two combinations of factor levels
Controlled experiment: any study in which treatments are imposed on experimental units in order to observe responses
The experimenter can decide which experimental units receive which treatments
Controlled experiments can establish cause and effect relationships

Controlled experiment example
Experiment to characterize the hardness of a new type of plastic as a function of temperature and pressure at the time of molding. Pressure is set to 200, 300, or 400 pounds per square inch and temperature is set to 200 or 300 degrees Fahrenheit. Two pieces of plastic are molded at each of the six pressure-temperature combinations.
Experimental units: plastic pieces
Response: the hardness of the molded plastic
Experimental factors: pressure and temperature
Levels of pressure: 200, 300, and 400 psi
Levels of temperature: 200 and 300 degrees Fahrenheit
Treatments: the six pressure-temperature combinations
Controlled experiment: treatments are imposed on the experimental units and plastic hardness is measured
Nuisance factors: ambient temperature, humidity, etc
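Since treatments are the combinations of factor levels, the six treatments and twelve molding runs of the plastic example can be listed mechanically. The variable names below are illustrative only.

```python
from itertools import product

pressures_psi = [200, 300, 400]   # levels of the pressure factor
temperatures_f = [200, 300]       # levels of the temperature factor
replicates = 2                    # two pieces molded per treatment

# Each treatment is one (pressure, temperature) combination of levels.
treatments = list(product(pressures_psi, temperatures_f))

# One run per treatment per replicate.
runs = [(p, t, r) for (p, t) in treatments for r in range(1, replicates + 1)]

print(len(treatments))  # 6
print(len(runs))        # 12
```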


Errors in controlled experiments
Measurement bias:
Bias: any systematic error in data collection or measurement
Example: a miscalibrated measuring device
Known biases can be corrected
What are the potential sources of measurement bias?
Experimental error:
The difference in responses taken in exactly the same manner at the same treatment
Consists of measurement errors and errors due to nuisance factors

Plastic example
The difference between the two measurements taken at each pressure-temperature combination is experimental error
Sources of this error:
Measurement error: can't determine plastic hardness perfectly
Variation in experimental units: variation due to different suppliers of plastic pieces, different batches from the same supplier, or variation within a batch
Nuisance factors: ambient temperature, barometric pressure, or humidity

Confounding
Confounded factors: two or more factors are confounded if it is impossible to separate their individual effects
Plastic example: the plastic pieces molded at temperature 200 are from one supplier but the pieces molded at 300 are from another supplier
Supplier and temperature are confounded
If an effect of temperature is observed, then we don't know if the effect is due to temperature or supplier or both


Principles of experimental design
These principles lead to efficient and scientifically valid experimentation
Principle 1: Make sure the process is stationary
Principle 2: Block what you can
Principle 3: Randomize what you can't block
Principle 4: Replicate as time and budget permit
Principle 5: Confirm the results
Let's consider each of these principles in detail

Principle 1: Make sure the process is stationary
Variation comes from the experimental factors and from process non-stationarity
Process non-stationarity makes it more difficult to identify the effects of the experimental factors
Try to stabilize a process before conducting experiments on it

Principle 2: Block what you can
Blocking factor: a nuisance factor whose levels can be selected for each experimental unit. These levels define groups of experimental units called blocks that are treated similarly during the experiment
Blocking minimizes variation due to nuisance factors
Blocking can also minimize the effects of non-stationarity when data is taken from a non-stationary process
Two types of blocking factors:
Characteristics associated with an experimental unit: when people are experimental units, blocking is frequently done by gender, age, income, education, job experience, etc


Characteristics associated with the experimental setting: examples of blocking factors are the observer, the time of processing, machine, batch of material, measuring instrument, etc


More about blocking
Blocking by time: captures variability due to learning by an observer, or changes in equipment
Blocking by observers: eliminates interobserver variability
Blocking by batches: reduces batch-to-batch variability
Subjects as blocks: some or all of the treatments are given to every subject

Principle 3: Randomize what you can't block
Randomization: the chance assignment of treatments to experimental units in order to nullify the effects of unsuspected nuisance factors
Randomization and blocking example: A company has two shifts: the morning and the afternoon. All workers on the first shift were given the first incentive program and the second-shift workers were given the second incentive program. This study concluded that the first incentive program was better, so it was implemented. Unfortunately, after 6 months, the results of implementing this incentive program were not good. Upon further investigation, it was found that morning workers were full-time and afternoon workers were part-time. Since morning workers had one incentive program and the afternoon workers had another, the effect of the incentive program was confounded with worker type.

Example continued
How to fix this? Make type of worker a blocking factor. Make two blocks: one with just part-time workers and one with just full-time workers. Then within each block, randomly assign one incentive program to half of the workers and the other incentive program to the rest of the workers in the block.
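The fix described above, blocking on worker type and randomizing within each block, might look like the following sketch. The worker IDs, block sizes, and function name are hypothetical; the lecture gives no numbers.

```python
import random

def randomize_within_blocks(blocks, treatments, rng):
    """For each block, shuffle its units and deal the treatments out in rotation."""
    assignment = {}
    for units in blocks.values():
        shuffled = units[:]          # copy so the original block list is untouched
        rng.shuffle(shuffled)        # chance ordering inside the block
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical workers: 8 full-time (morning) and 6 part-time (afternoon).
blocks = {
    "full-time": [f"FT{i}" for i in range(1, 9)],
    "part-time": [f"PT{i}" for i in range(1, 7)],
}

rng = random.Random(42)
assignment = randomize_within_blocks(blocks, ["program 1", "program 2"], rng)

for block_name, units in blocks.items():
    counts = {}
    for unit in units:
        counts[assignment[unit]] = counts.get(assignment[unit], 0) + 1
    print(block_name, counts)
```

Because both programs appear equally often inside each block, a difference between programs can no longer be explained by full-time versus part-time status.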
Randomization guarded against the effects of unsuspected nuisance factors
Blocking here minimized the variation due to known nuisance factors

Principle 4: Replicate as time and budget permit
Definition of replication: repetition of each treatment in an experiment
Need to repeat the entire experiment
More data gives greater precision in conclusions
More replications make it easier to identify small differences in the response
Replications help estimate the amount of experimental error
Number of replications is decided by the precision desired, by time constraints, and by budget constraints


Principle 5: Confirm the results
You want to confirm your results by running additional trials
This gives you information about the reproducibility and persistence of your original conclusions

Review:
Principle 1: Make sure that the process is stationary
Principle 2: Block what you can
Principle 3: Randomize what you cannot block
Principle 4: Replicate as time and budget permit
Principle 5: Confirm the results


How to assign treatments to experimental units?
Two methods:
Completely randomized assignment: similar to simple random sampling
Randomized complete block assignment: similar to stratified random sampling

Completely randomized design
Definition: a design in which treatments are assigned to experimental units completely at random. Every unit has an equal chance to receive any of the treatments.
Procedure:
Assume there are k treatments and n experimental units
Decide how many experimental units are to be assigned to each treatment. Call these numbers n1, n2, ..., nk
Put the numbers 1 through n in a hat and shake
The first n1 numbers drawn will be the experimental units to receive the first treatment, the next n2 numbers drawn will be the experimental units to receive the 2nd treatment, and so on

Practical random numbers
We don't put numbers in a hat
For small experiments, can use the random number table on page 911
Another method is to use computer software to generate random numbers
For example, in Excel, RAND() generates a random number between 0 and 1

Completely randomized design using a computer
Assign each experimental unit a number from 1 to n
Generate n random numbers using the software's random number generator. The first random number is assigned to the first experimental unit.
The second random number is assigned to the 2nd experimental unit, and so on
Sort the random numbers smallest to largest
Assign treatment 1 to the units with the n1 smallest random numbers, assign treatment 2 to the units with the next n2 smallest random numbers, and so on
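The computer procedure for a completely randomized design (attach a random number to each unit, sort, then cut the sorted list into groups of sizes n1, n2, ..., nk) can be sketched as follows. The function name, the seed, and the choice of 12 units in three groups of 4 are assumptions made for illustration.

```python
import random

def completely_randomized_assignment(n_units, group_sizes, rng):
    """Assign units 1..n to treatments 1..k by sorting on random numbers."""
    if sum(group_sizes) != n_units:
        raise ValueError("group sizes must add up to the number of units")
    # Attach one random number in [0, 1) to each experimental unit.
    keyed = [(rng.random(), unit) for unit in range(1, n_units + 1)]
    # Sort by the random numbers, smallest to largest.
    keyed.sort()
    # The first n1 units get treatment 1, the next n2 get treatment 2, and so on.
    assignment = {}
    position = 0
    for treatment, size in enumerate(group_sizes, start=1):
        for _, unit in keyed[position:position + size]:
            assignment[unit] = treatment
        position += size
    return assignment

rng = random.Random(7)
assignment = completely_randomized_assignment(12, [4, 4, 4], rng)
for unit in sorted(assignment):
    print(unit, assignment[unit])
```

Sorting on independent random numbers puts the units in a random order, so this is the software counterpart of drawing numbered slips from a hat.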


Advantages and Disadvantages of a Completely Randomized Design
Definition of homogeneous: similar experimental units
Definition of heterogeneous: dissimilar experimental units
This design is useful when experimental units are homogeneous
This design is flexible and can accommodate any number of treatments and any number of observations per treatment
Missing observations create no problems in the analysis of single-factor studies
Missing observation examples: death of a subject, loss of a data sheet
This design is inefficient when experimental units are heterogeneous
If this is the case, try blocking (the next design)

Randomized Complete Block Design
Blocking reduces or eliminates known sources of data variation
Definition of randomized complete block design: a design in which the experimental units are divided into blocks, and, separately within each block, all of the treatments are assigned at random to the experimental units within that block
This design has better precision than a completely randomized design of similar size when the experimental units are heterogeneous
Missing observations are more difficult to handle in this design than in a completely randomized design
The idea of a randomized complete block design is similar to stratified random sampling

Cause and Effect
The goal of a controlled experiment is to establish a cause and effect relationship between the treatments and observed responses
In a controlled experiment, the experimenter ensures that the only possible differences in observed responses are due to differences in treatments
Need a properly designed and properly conducted controlled experiment


Human subject considerations
Ethical issues
Design issues
Treatment group: a treatment is assigned to a group of subjects
Control group: no treatment is given to a second group of human subjects
The control group allows us to determine the effect of the treatment and helps us establish a baseline for evaluating the effect of the treatment
The control group should be given a placebo (looks like a treatment but is inactive)
A placebo makes sure that the response is due to the treatment rather than to the idea of treatment
Definition of double blind: the subjects don't know if they are in the control or treatment group and the experimenter doesn't know either
In some cases, the people who evaluate the responses are also blind to the assignment of control vs treatment
Protects against bias from the subjects and from those who conduct the experiment


Observational studies
Definition: select sampling units, and then observe characteristics of those units. In contrast to a controlled experiment, you don't impose treatments on sampling units.
Three types of observational studies:
Prospective studies
Retrospective studies
Sample surveys

Prospective study
From each sampling unit, obtain data on hypothesized causal factors and responses
Observational studies can't show cause and effect
Prospective study example: Does smoking cause lung cancer?
Sampling units are persons
Hypothesized causal factor is smoking behavior
Response is the occurrence or not of lung cancer
Individuals are followed over time
Called an observational study since smoking behavior and response are observed, not controlled by the researchers

Lack of control is a weakness of all observational studies
If we see a higher incidence of lung cancer in the smoking group, then the observed difference could be due to smoking or it could be due to some other cause (genetics)
Observational studies don't show cause and effect; instead we say the increased incidence of lung cancer is associated with smoking
The hypothesized causal factors might be confounded with other factors that cause the response
Lack of controlled experimental conditions, lack of randomization

How to establish cause and effect with smoking?
Many observational studies show a strong association
Show that the association holds for various population subgroups
Animal experiments were done on animals closely related to humans (cause and effect holds)
Molecular mechanisms were identified to explain the cause and effect relationship
The large amount of evidence establishes cause and effect

Control in observational studies
Can't assign treatments to experimental units
But you can control for nuisance factors
You can compare sampling units having similar values of the nuisance factors or you can adjust for values of the nuisance factors
For example, Cochran studied death rates of smokers versus non-smokers.
The difference in death rates could be due to smoking or it could be due to an age effect. Maybe the mean age of one group is higher than that of the other?
Solution: stratify subjects by age and compare death rates for the two groups within each age stratum
Alternatively, you can use statistical methods to adjust for the age effect


Retrospective studies
If you are studying a chronic disease, the time between the hypothesized cause and the observed effect could be long
In this case, a prospective study would be long and costly
If an effect occurs rarely, then a prospective study would have to follow a very large number of individuals so that we can observe enough cases. Following large numbers of individuals would be very costly
In these two scenarios, a retrospective study might be conducted instead of a prospective study
Retrospective study: the end result is observed and the hypothesized causes are sought
Example: A retrospective study on the effect of smoking on the incidence of lung cancer begins by finding a group of individuals with lung cancer and a group without lung cancer. Next, study how the two groups differ in smoking behavior. Could also compare the two groups with respect to other potential causes of lung cancer.

Sample Surveys
Uses a sample to obtain information about the whole population
Goal of sample surveys is to describe the population or to compare subgroups of the population
Nonsampling errors in studies of human populations:
Selection bias: systematic tendency on the part of a sampling procedure to underrepresent or exclude one or more kinds of sampling units from the sample
Nonresponse bias: when a selected individual cannot be contacted or refuses to cooperate
Example: a telephone survey during daytime hours will fail to contact people who work during the day
Response bias: when questions are difficult to understand or are phrased in such a way as to make a particular answer more desirable to the respondent
Fixes for nonresponse bias and response bias?
Be careful when designing questions and be careful with nonresponse
Try to minimize the effects of these nonsampling errors

Steps in designing an observational study
Determine what information is required
Leaving out important questions: may need to repeat the study
Too many questions: more nonrespondents
Collect demographic data (gender, age, race, etc) to make sure that the sample is representative of the population
Design the sampling plan
Identify the target population
Decide on how you are going to sample
Try to reduce nonsampling errors (next slide)
Do a pilot study (next slide)
How is the data going to be gathered? Mailings, phone calls, face to face?


Nonsampling errors
Establish procedures to reduce nonsampling errors
Nonresponse bias
High nonresponse: the sample may not be representative of the population
Try to reduce the nonresponse bias as much as possible
Keep in mind the number of questions asked and how sample members are contacted
Telephone and in-person surveys tend to have higher response rates than mailed surveys
Try to follow up on nonrespondents to reduce the nonresponse rate
Check if the respondents and nonrespondents differ in age, gender, race, occupation, etc
If the two groups are similar then there is less effect on the study results
Response bias
Be careful in the wording of the questions
Test your questions
Randomize the order of the questions
The interviewers should be neutral when asking questions
Do a pilot study
To gather information about the target population in order to design the main study properly
To test your operational procedures for conducting the study

Chapter 4: Statistical Modeling
Models: mathematical curves that try to explain data-distribution patterns
Choose a model, test your model, refine your model
A model is only an approximation to reality
For which types of data is your model appropriate?
What are the assumptions of your model?
How are models related to one another?

Density histograms
Similar to frequency histograms
Similar in construction, but the height of each bar is chosen so that the area of the bar = the proportion of data in that subinterval
The height of the bar in the density histogram is called the density of the data in the subinterval
The height of the bar is calculated as f_i/(w_i * n), where f_i is the frequency of the ith subinterval, w_i is its width, and n is the total number of data values
The area of the bar is then f_i/(w_i * n) * w_i = f_i/n, which is the proportion of data that lie in that subinterval
If you sum up the areas of all of the bars, you will get 1.
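The density calculation can be checked numerically with a short sketch. The ten data values and the unequal-width interval edges below are made up for illustration; only the formula, frequency divided by (width times n), comes from the notes.

```python
def density_histogram(data, edges):
    """Return (frequencies, densities) for the intervals [edges[i], edges[i+1])."""
    n = len(data)
    freqs = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                freqs[i] += 1
                break
    # Bar height: frequency divided by (interval width * number of data values).
    densities = [f / ((edges[i + 1] - edges[i]) * n) for i, f in enumerate(freqs)]
    return freqs, densities

data = [12, 17, 18, 22, 23, 24, 26, 31, 33, 47]  # made-up values
edges = [10, 20, 25, 30, 50]                     # intervals of unequal width

freqs, densities = density_histogram(data, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]

print(freqs)       # [3, 3, 1, 3]
print(sum(areas))  # the bar areas add up to 1, up to floating-point rounding
```

With unequal widths the bar heights differ even where the frequencies match (the first and last intervals both hold 3 values), which is the situation where a density histogram and a frequency histogram look different.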
When the subintervals all have the same width, the frequency histogram and the density histogram look the same; only the scale of the y-axis changes


Notion of probability
Bernoulli trial: an occurrence in which exactly one of two possible outcomes can occur: heads/tails, success/failure, 6/not a 6
Often outcomes are denoted 1/0 or success/failure
Let the probability of success be called p
Then the probability of failure is 1-p
What is the interpretation of probability p?
It is the limit of the relative frequency of success in n trials: as n approaches infinity, the relative frequency converges to p (show picture)

Notions continued
Probabilities are between 0 and 1 inclusive
Random phenomenon: an occurrence that results in one of a known set of definite, identifiable outcomes, but whose actual outcome cannot be predicted with certainty in advance
Tossing a coin is a random phenomenon since there are two possible outcomes but before the coin is tossed we don't know which one will occur
Trial: a single occurrence of a random phenomenon
Event: any set of possible outcomes of a random phenomenon

Notions continued
Toss a coin 3 times:
Trial: one set of 3 tosses
Event: A = three heads, B = two heads
A = {HHH}, B = {HHT, HTH, THH}
An event occurs when one of the outcomes in the event occurs
N_n(E)/n approaches P(E) as n goes to infinity
N_n(E) is the number of trials, out of n trials, in which E occurs, so N_n(E)/n is the relative frequency of E occurring in those n trials
Example: what is the probability of two heads in 3 tosses?
Repeat many trials (each trial consists of 3 coin tosses)
Look at the relative frequency of occurrence in n trials
The limit of the relative frequency as n approaches infinity is P(E)

Coins continued
3 tosses, 8 possible outcomes
Since the coin is fair, we expect each of the possible outcomes to have the same probability, so each possible outcome has probability 1/8

How to calculate probability?
Relative frequency method
Addition rule of probability
Equally likely rule

Addition rule of probability
Let A and B be two events from the same random phenomenon
Definitions:
The intersection of A and B is the set of outcomes which are common to both A and B (show symbol)
The union of A and B is the set of outcomes that are in A or in B or in both A and B (show symbol)
Events A and B are called disjoint if they have no outcomes in common
Example:
Event A = three heads = {HHH}
Event B = two heads = {HHT, HTH, THH}


A and B are disjoint since they have no outcomes in common
The intersection here is called the null event (the event containing no outcomes) (show symbol)
The union of A and B is {HHH, HHT, HTH, THH}
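Events are sets of outcomes, so the intersection and union in this example map directly onto Python set operations. The sketch below also simulates the relative frequency method for the event "two heads"; the seed, the numbers of trials, and the helper name are arbitrary choices for illustration, and a fair coin is assumed.

```python
import random
from itertools import product

# All 8 outcomes of three tosses, written as strings such as "HTH".
outcomes = {"".join(t) for t in product("HT", repeat=3)}

A = {o for o in outcomes if o.count("H") == 3}  # three heads
B = {o for o in outcomes if o.count("H") == 2}  # two heads

print(sorted(A & B))  # [] : the null event, so A and B are disjoint
print(sorted(A | B))  # ['HHH', 'HHT', 'HTH', 'THH']

def relative_frequency(event, n_trials, rng):
    """Fraction of n_trials (each trial is 3 fair tosses) in which the event occurs."""
    hits = 0
    for _ in range(n_trials):
        trial = "".join(rng.choice("HT") for _ in range(3))
        if trial in event:
            hits += 1
    return hits / n_trials

rng = random.Random(1)
for n in (100, 10_000):
    print(n, relative_frequency(B, n, rng))
```

As n grows, the printed relative frequencies should settle near 3/8 = 0.375, the value the equally likely rule gives for two heads.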


Addition rule continued
If events A and B are disjoint, then the probability of their union is the sum of their individual probabilities
If you have a collection of disjoint events A, B, C, D, then the probability of their union is the sum of their individual probabilities
This addition rule can be extended to any number of events
Event one head = {HTT, THT, TTH}
Event two heads = {HHT, THH, HTH}
P(union of these events) = P(one head) + P(two heads)

Addition rule continued
What is the probability of one head?
The one head event = {HTT, THT, TTH}
It can be considered as the union of the three events {HTT}, {THT}, and {TTH}
These three events are disjoint, so P(one head) = P(HTT) + P(THT) + P(TTH) = 3/8
This idea is generalized in the equally likely outcomes rule: if a random phenomenon has m outcomes each with probability 1/m, and E is an event consisting of k of these outcomes, then P(E) = k/m
Example: P(two heads) = P({HHT, HTH, THH})
Three tosses have 8 equally likely outcomes, and the event consists of 3 of them
P(two heads) = 3/8 from the equally likely outcomes rule

Independence
Definition: two events are called independent if knowing whether or not one occurs does not change the probability that the other occurs
If two events are not independent, they are called dependent
Example: event two heads vs event two tails in 3 tosses. If we know that the event two heads occurred, then getting two tails is impossible (its probability becomes 0). So these events are dependent
Example #2: heads on the second toss vs tails on the first toss: these events are independent

Independence continued
Independence of two events: two events A and B are independent if and only if P(A intersect B) = P(A)P(B)
Also called the multiplication rule for two independent events
Trials are called independent if any event from the first trial is independent of any event from the second trial
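The equally likely outcomes rule and the multiplication check for independence can both be carried out by brute-force enumeration. The sketch below assumes a fair coin tossed three times and uses exact fractions; the helper names `prob` and `independent` are made up for the example.

```python
from fractions import Fraction
from itertools import product

# The 8 equally likely outcomes of three fair tosses.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

def prob(event):
    """Equally likely outcomes rule: P(E) = k/m."""
    return Fraction(len(event), len(outcomes))

def independent(a, b):
    """Multiplication rule: independent iff P(A intersect B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

two_heads = {o for o in outcomes if o.count("H") == 2}
two_tails = {o for o in outcomes if o.count("T") == 2}
heads_second = {o for o in outcomes if o[1] == "H"}
tails_first = {o for o in outcomes if o[0] == "T"}

print(prob(two_heads))                         # 3/8
print(independent(two_heads, two_tails))       # False
print(independent(heads_second, tails_first))  # True
```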


Example
Is getting two heads in three tosses independent of getting two tails in three tosses?
P(two heads) = P({HHT, HTH, THH}) = 3/8
P(two tails) = P({TTH, THT, HTT}) = 3/8
P(two heads and two tails) = 0
P(two heads) * P(two tails) = 9/64
Since these numbers are different, the events are dependent

Example #2
Event A = heads on second toss = {HHH, HHT, THH, THT}
Event B = tails on first toss = {THH, THT, TTH, TTT}
P(A) = 4/8 = 1/2
P(B) = 4/8 = 1/2
P(A and B) = P({THH, THT}) = 2/8 = 1/4
Is P(A and B) = P(A)P(B)? Yes: 1/2 * 1/2 = 1/4
A and B are independent events

Biased coin
Three coin tosses
Make the assumption that the flips are independent
Let P(heads) = p, so P(tails) = 1 - p
What is P(TTH)?
By independence: P(TTH) = P(T1) * P(T2) * P(H3) = (1 - p) * (1 - p) * p

Discrete Random Variables and their models
A random variable is a function that assigns a real number to each outcome of a random phenomenon
For example, toss a coin 10 times; we can define the random variable X as the number of heads that appear. The possible values of X are 0 through 10
For a Bernoulli trial, let Y be the number of successes; then Y can be only 0 or 1
Toss 4 dice and count the number of sixes that appear. This is an example of a random variable with possible values 0, 1, 2, 3, or 4
A random variable whose possible values form a discrete set is called a discrete random variable
The above examples are discrete random variables
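The two independence examples and the biased-coin calculation above can be reproduced by enumeration. This is an illustrative sketch; the function names `independent` and `p_tth` are assumptions made for the example, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

outcomes = {"".join(t) for t in product("HT", repeat=3)}

def prob(event):
    """Probability of an event when all 8 outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

def independent(a, b):
    """True when P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

two_heads = {o for o in outcomes if o.count("H") == 2}
two_tails = {o for o in outcomes if o.count("T") == 2}
heads_second = {o for o in outcomes if o[1] == "H"}
tails_first = {o for o in outcomes if o[0] == "T"}

print(independent(two_heads, two_tails))        # 0 vs 9/64
print(independent(heads_second, tails_first))   # 1/4 vs 1/4

def p_tth(p):
    """P(TTH) for a biased coin with P(heads) = p and independent flips."""
    return (1 - p) * (1 - p) * p
```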
Distribution Models
A distribution model for a random variable is a model that precisely quantifies the pattern of variation for a random variable

Bernoulli Distribution Model: The distribution model of a Bernoulli Random Variable
From a Bernoulli trial, the possible values of Y are 0 and 1
The probability of obtaining a 1 is p and the probability of obtaining a 0 is 1 - p
The probability that the random variable Y takes on the possible value y is denoted by P(Y = y) = pY(y)
Then the Bernoulli model is expressed by: pY(0) = 1 - p and pY(1) = p
Another way to express this model is pY(y) = p^y * (1 - p)^(1 - y)
The probability of success, p, is called a model parameter of the Bernoulli distribution


A model parameter is a quantity describing the model that may take on more than one value, and that must be specified to determine the model as completely as possible
Notation: Y ~ Bernoulli(p)
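The single-formula version of the Bernoulli mass function can be written as a short function. This is only a sketch; the function name and the value chosen for p are illustrative assumptions.

```python
def bernoulli_pmf(y, p):
    """p_Y(y) = p**y * (1 - p)**(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("a Bernoulli variable takes only the values 0 and 1")
    return p**y * (1 - p) ** (1 - y)

p = 0.25  # illustrative value of the model parameter
print(bernoulli_pmf(0, p), bernoulli_pmf(1, p))
```

Setting y = 1 leaves only the factor p, and setting y = 0 leaves only 1 - p, which is why the one formula reproduces both cases of the model.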


Discrete Distribution Models
In general, if we have a discrete random variable Y, the distribution model can be expressed as pY(y) = P(Y = y) for each y in the discrete set
pY(y) is called the probability mass function of Y
Y represents a random variable; y is a value that the random variable takes on
Definition: A discrete distribution model is a distribution model for a discrete random variable. It consists of a list of the possible values the random variable can assume along with the probability of assuming each possible value. The function describing these probabilities is called the probability mass function
Example: Let X be the number of heads in two tosses of a fair coin
P(X = 0) = P({TT}) = 1/4
P(X = 1) = P({TH, HT}) = 2/4 = 1/2
P(X = 2) = P({HH}) = 1/4
If the probabilities for the different possible values are all the same, the probabilities are called uniform, and we have a discrete uniform distribution model

Displaying Discrete Distribution Models
Probability histogram: has a bar of area pY(y) centered over Y = y for each possible value y that the random variable Y can take on
Let Y be the face of a single die toss; then P(Y = 1) = 1/6, P(Y = 2) = 1/6, P(Y = 3) = 1/6, ..., P(Y = 6) = 1/6
Show example
Probability histograms are limits of density histograms
After n trials, in a density histogram the area of the bar over the value y is the proportion of the first n trials for which Y = y
If the intervals under the bars are the same for the probability histogram and the density histogram, then as n goes to infinity the heights of the bars in the density histogram will converge to the heights of the corresponding bars in the probability histogram
Show example

The mean, variance, and standard deviation of a distribution model
Suppose we perform n trials of a random phenomenon and obtain observations Y1 through Yn
The mean of these observations is the sum of the Ys divided by n
As n goes to infinity, this mean will converge to the mean of the distribution model, which is called muY
This result is called the law of large numbers
The mean of the distribution model is also called the expected value or expectation of the distribution model. Another notation is E[Y]
To calculate: E[Y] = sum over y of [y * pY(y)]
Example: Let Y be the face that appears from a single die toss. What is E[Y]?
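The expected value formula can be applied directly to the die example. The sketch below assumes a fair six-sided die and represents the mass function as a dictionary from value to probability, which is a choice made for this example.

```python
from fractions import Fraction

def expected_value(pmf):
    """E[Y] = sum over y of y * p_Y(y); pmf maps each value to its probability."""
    return sum(y * p for y, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

The result, 7/2 = 3.5, is not itself a possible face of the die. The expected value is a long-run average, not a value the random variable must take.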


E[Y] is the average, or mean, of the random variable Y

Practice problems:
How many possible outcomes are there when rolling two dice? Assume order is important
What is the event that exactly one 6 appears? What is the probability of this event?
Let Z be a random variable representing the sum of the two dice. What are the possible values of the sum? Find the mass function for Z. What is the most likely sum?
Suppose we have a biased coin with probability p of heads. What is the mass function for the random variable X representing the number of heads in 3 tosses of the coin?
Jack has an 80% chance of no accident this year, a 10% chance of a small accident with a $500 payout from the insurance company, a 5% chance of a major accident ($2000 payout), and a 5% chance of totaling his car ($15000 payout). What is the expected payout from the insurance company? How should the insurance company price the premium in order to break even?
What is the probability of a 4 of a kind if we roll a die 5 times? What is the probability of a 5 of a kind? What is the probability of a full house?
Mean, Variance, and Standard Deviation
Last time, we talked about the mean of the distribution model as the sum over y of y * pY(y)
Law of Large Numbers: as n goes to infinity, the sample mean approaches the mean of the distribution model
Today: standard deviation and variance of the distribution model
For a sample, we can define the sample variance as 1/(n-1) * sum over i = 1 to n of (Yi - sample mean)^2; the sample standard deviation is its square root
As n goes to infinity, the sample variance approaches a number called the variance of the distribution model
To calculate the variance of the distribution model, you can use: sum over y of (y - mean of the distribution model)^2 * pY(y)
The variance of the distribution model represents the amount of spread in the probability histogram
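For reference, the sample quantities and the model quantities described in words above can be written side by side. The symbols \bar{Y}, S^2, and \sigma_Y^2 are notation assumed here for the sample mean, the sample variance, and the variance of the distribution model.

```latex
\begin{align*}
\bar{Y} &= \frac{1}{n}\sum_{i=1}^{n} Y_i, &
S^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2, \\
\mu_Y &= \sum_{y} y\,p_Y(y), &
\sigma_Y^2 &= \sum_{y}\left(y - \mu_Y\right)^2 p_Y(y), \qquad
\sigma_Y = \sqrt{\sigma_Y^2}.
\end{align*}
```

The top row is computed from data and changes from sample to sample. The bottom row is computed from the mass function and is fixed once the model is specified.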


Standard deviation
The standard deviation of the distribution model is the square root of the variance of the distribution model
Example: Flip two coins and let X represent the number of heads obtained. Here, X could be 0, 1, or 2. The associated probabilities are 1/4, 1/2, and 1/4. What is the variance of X?
First you need to calculate the expected value, so use the expected value formula for a discrete random variable: E[X] = 1/4 * 0 + 1/2 * 1 + 1/4 * 2 = 1
Var(X) = (0 - 1)^2 * 1/4 + (1 - 1)^2 * 1/2 + (2 - 1)^2 * 1/4 = 1/2

Linear transformation
Suppose we have a random variable Y. A linear transformation of Y is a new random variable X defined by X = aY + b
How is the expectation of X related to the expectation of Y? Show derivation
E[X] = a E[Y] + b
What about variance and standard deviation? If X = aY + b, then Var(X) = a^2 Var(Y) and STD(X) = |a| STD(Y)
Why?

Next, more than one random variable
Instead of a single random variable, X or Y, we are now going to consider a collection of random variables: R1, R2, R3, ..., Rn
Maybe the first random variable represents the face obtained on a fair 6-sided die
Maybe the second random variable represents the face obtained on a weighted 6-sided die
Maybe the third random variable represents the face obtained on a 6-sided die with a different weighting
Etc., up to n dice
If X = sum of all of the Rs, then what is the expected value of X?

Means of sums of random variables
E[X] = E[R1] + E[R2] + ... + E[Rn]
In words, the expected value of a sum of random variables = the sum of the expected values
What about the variance of a sum of random variables?
Theorems 4.14 and 4.15 in the book: If R1 and R2 are independent random variables, then Var(R1 + R2) = Var(R1) + Var(R2) and Var(R1 - R2) = Var(R1) + Var(R2)
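The two-coin variance example and the linear transformation rules can be checked numerically. This is a sketch: the helper names and the constants a = 3, b = 5 are arbitrary choices for illustration, and W stands for the transformed variable aX + b.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a mass function given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = sum over x of (x - E[X])^2 * p(x)."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# X = number of heads in two fair coin flips.
X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def linear_transform(pmf, a, b):
    """Mass function of a*X + b; assumes a != 0 so distinct values stay distinct."""
    return {a * x + b: p for x, p in pmf.items()}

a, b = 3, 5  # illustrative constants
W = linear_transform(X, a, b)
print(mean(X), variance(X))  # 1 1/2
print(mean(W) == a * mean(X) + b, variance(W) == a**2 * variance(X))
```

The shift b moves every value by the same amount, so it changes the mean but not the spread. That is why b does not appear in the variance rule.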


Independence of discrete random variables
Before, we discussed the definition of independence between two events A and B as P(A and B) = P(A)P(B)
Independence of discrete random variables is similar
Suppose we have two discrete random variables R and S
R and S are independent if P(R = r and S = s) = P(R = r)P(S = s) for all values of r and s
Joint mass function = product of the mass function of R with the mass function of S

Example
Suppose R represents the face obtained on the first roll of a fair die
Suppose S represents the face obtained on the second roll of a fair die
Both R and S are discrete random variables
Question: Are R and S independent?
P((r,s)) = 1/36: there are 6 possible outcomes for the first roll and 6 possible outcomes for the second roll, so there are 6 * 6 = 36 possible outcomes for the pair of rolls. I assume each of these outcomes is equally likely, so each has probability 1/36
P(R = r) = 1/6
P(S = s) = 1/6
P(R = r and S = s) = P(R = r)P(S = s) because 1/36 = 1/6 * 1/6
So R and S are independent discrete random variables

What is independence for a collection?
Suppose we have a collection of random variables: R1 through Rn
We say that R1 through Rn are independent if for all values of r1 through rn: P(R1 = r1 and R2 = r2 and R3 = r3 and ... and Rn = rn) = P(R1 = r1) * P(R2 = r2) * ... * P(Rn = rn)
With this definition, we can now discuss the variance of the sum of a collection of random variables, i.e., Var(R1 + R2 + R3 + ... + Rn)
If R1 through Rn are independent, then this variance equals Var(R1) + Var(R2) + ... + Var(Rn)
In words, the variance of a sum of random variables is the sum of the variances if R1 through Rn are independent

Mean, Variance, and Standard Deviation of the Sample Mean
With these ideas of independence, expectation of a sum of random variables, and variance of a sum of random variables, we can find the mean, variance, and standard deviation of the sample mean
We are going to perform n trials of a random phenomenon
Let Ri be the value of a random variable for the ith trial of the random phenomenon
The sample mean is then the sum of the Rs divided by n
We are going to assume the Rs are independent and follow the same distribution

Sample mean continued
Let mu be the common expectation, i.e., mu = E[R1] = E[R2] = ... = E[Rn]
Let sigma^2 be the common variance, i.e., sigma^2 = Var[R1] = Var[R2] = ... = Var[Rn]
We would like to know:
The expectation of the sample mean under these conditions
The variance of the sample mean under these conditions
The standard deviation of the sample mean under these conditions
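The two-dice example can be checked by building the joint mass function and comparing it with the product of the marginals. The same enumeration gives the exact mean and variance of the sample mean for n = 2 rolls, which can be compared with the standard results E[sample mean] = mu and Var[sample mean] = sigma^2/n. This is a sketch assuming two fair dice with all 36 pairs equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Joint mass function of (R, S), assuming all 36 pairs are equally likely.
joint = {(r, s): Fraction(1, 36) for r, s in product(faces, repeat=2)}

# Marginal mass functions, obtained by summing the joint mass function.
p_R = {r: sum(joint[r, s] for s in faces) for r in faces}
p_S = {s: sum(joint[r, s] for r in faces) for s in faces}

# Independence: the joint mass function equals the product of the marginals.
is_independent = all(joint[r, s] == p_R[r] * p_S[s] for r, s in joint)

# Common mean and variance of a single roll.
mu = sum(r * p for r, p in p_R.items())
sigma2 = sum((r - mu) ** 2 * p for r, p in p_R.items())

# Exact mean and variance of the sample mean (R + S) / 2.
mean_bar = sum(Fraction(r + s, 2) * p for (r, s), p in joint.items())
var_bar = sum((Fraction(r + s, 2) - mean_bar) ** 2 * p for (r, s), p in joint.items())

print(is_independent, mean_bar == mu, var_bar == sigma2 / 2)
```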


Sample mean
E[sample mean] = mu
Var[sample mean] = sigma^2/n
STD[sample mean] = sigma/sqrt(n)
Why? Show proof

Binomial distribution model
Binomial trial: consists of n independent Bernoulli trials
The probability of success is the same value p for each of the Bernoulli trials
Let X be the number of successes in the binomial trial
Notation: X ~ Bin(n,p), read as: X is distributed as a binomial distribution model
Example: Flip a fair coin 10 times and let X be the count of the number of heads; then X ~ Bin(10, 0.5)

Mass function
Suppose we flip a biased coin with probability p of heads n times
Let X be the number of heads in the n flips
Then X ~ Bin(n,p)
We would like to know the mass function of a binomial distribution model, i.e., P(X = x)
The number of successes is x and therefore the number of failures must be n - x
One particular sequence has probability p * p * p * (1 - p) * p * p * (1 - p) * p * ..., with x factors of p and n - x factors of (1 - p)

Orderings
Suppose we consider a specific case: 2 heads in 4 trials
There are many possible ways to get 2 heads in 4 trials: HHTT, HTHT, HTTH, THHT, THTH, TTHH
In general, if we have x successes in n flips, there will be many possible orderings that have x successes in n flips
The number of ways to choose x items out of n items when order is not important is a combination
We say the order is not important because having heads in positions 2 and 3 is the same as having heads in positions 3 and 2
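The orderings for the 2-heads-in-4-trials case can be listed by brute force and counted against the combination formula. This is a small sketch; `math.comb` requires Python 3.8 or later.

```python
from itertools import product
from math import comb

n, x = 4, 2
# Every sequence of n flips that contains exactly x heads.
orderings = ["".join(seq) for seq in product("HT", repeat=n) if seq.count("H") == x]

print(orderings)
print(len(orderings), comb(n, x))
```

Each ordering corresponds to one choice of which x of the n positions hold a head, which is why the count equals the number of combinations.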


Mass function
The formula for combinations is n!/(x!(n-x)!)
Combination notation: (n choose x)
Putting everything together, we have the mass function for the binomial distribution model:
Each sequence with x successes has probability p^x * (1 - p)^(n - x), from x factors of p and n - x factors of (1 - p)
There are n!/(x!(n-x)!) sequences of x successes in n flips
P(X = x) = n!/(x!(n-x)!) * p^x * (1 - p)^(n - x)

Example
Suppose we flip a biased coin 10 times and let the probability of heads be 70%
What is the probability that we will get 8 heads? What is the probability that we will get 9 heads?
Let X be the number of heads observed; then X ~ Bin(10, 0.7)
P(X = 8) = 45 * 0.7^8 * 0.3^2, approximately 0.2335
P(X = 9) = 10 * 0.7^9 * 0.3^1, approximately 0.1211

Expectation and Variance
Let X ~ Bin(n,p)
So X represents the number of successes in the n independent Bernoulli trials
We would like to know the expectation of X and the variance of X
E[X] = np
Var[X] = np(1-p)
STD[X] = sqrt(Var[X])
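The binomial mass function and the biased-coin example can be evaluated with a few lines of code. This is a sketch using floating-point arithmetic; `binomial_pmf` is a hypothetical helper name, not a library function.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): (n choose x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.7
print(round(binomial_pmf(8, n, p), 4))  # 0.2335
print(round(binomial_pmf(9, n, p), 4))  # 0.1211

# Expectation, variance, and standard deviation from the closed-form results.
mean = n * p
variance = n * p * (1 - p)
std = sqrt(variance)
print(mean, variance, std)
```

As a sanity check, the probabilities for x = 0 through n should add up to 1, up to floating-point rounding.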


Cumulative Distribution Function Definition: The cumulative distribution function of a random variable, Y, is defined as FY(y) = P(Y = 100, p