Source: cbmm.mit.edu/sites/default/files/documents/probability_handout.pdf (CBMM)
Crash Course on Basic Statistics

Marina Wahl, marina.w4hl@gmail.com, University of New York at Stony Brook

November 6, 2013


Contents

1 Basic Probability
1.1 Basic Definitions
1.2 Probability of Events
1.3 Bayes’ Theorem

2 Basic Definitions
2.1 Types of Data
2.2 Errors
2.3 Reliability
2.4 Validity
2.5 Probability Distributions
2.6 Population and Samples
2.7 Bias
2.8 Questions on Samples
2.9 Central Tendency

3 The Normal Distribution

4 The Binomial Distribution

5 Confidence Intervals

6 Hypothesis Testing

7 The t-Test

8 Regression

9 Logistic Regression

10 Other Topics

11 Some Related Questions


Chapter 1

Basic Probability

1.1 Basic Definitions

Trials

? Probability is concerned with the outcome of trials.

? Trials are also called experiments or observations (multiple trials).

? A trial refers to an event whose outcome is unknown.

Sample Space (S)

? Set of all possible elementary outcomes of a trial.

? If the trial consists of flipping a coin twice, the sample space is S = {(h, h), (h, t), (t, h), (t, t)}.

? The probability of the sample space is always 1.

Events (E)

? An event is the specification of the outcome of a trial.

? An event can consist of a single outcome or a set of outcomes.

? The complement of an event is everything in the sample space that is not that event (not E, or ∼E).

? The probability of an event is always between 0 and 1.

? The probabilities of an event and its complement always sum to 1.

Several Events

? The union of several simple events creates a compound event that occurs if one or more of the events occur.

? The intersection of two or more simple events creates a compound event that occurs only if all the simple events occur.

? If events cannot occur together, they are mutually exclusive.

? If two trials are independent, the outcome of one trial does not influence the outcome of the other.

Permutations

? Permutations are all the possible ways elements in a set can be arranged, where the order is important.

? The number of permutations of subsets of size k drawn from a set of size n is given by:

nPk = n! / (n − k)!


Combinations

? Combinations are similar to permutations withthe difference that the order of elements isnot significant.

? The number of combinations of subsets of size kdrawn from a set of size n is given by:

nPk =n!

k!(n− k)!
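Both counts are available in the Python standard library (math.perm and math.comb, Python 3.8+); a quick check of the two formulas with arbitrary n and k:

```python
import math

n, k = 5, 2

# Permutations: ordered arrangements, nPk = n!/(n-k)!
n_perm = math.factorial(n) // math.factorial(n - k)
assert n_perm == math.perm(n, k)  # 20 ordered pairs from 5 elements

# Combinations: unordered subsets, nCk = n!/(k!(n-k)!)
n_comb = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
assert n_comb == math.comb(n, k)  # 10 unordered pairs from 5 elements

print(n_perm, n_comb)  # 20 10
```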

1.2 Probability of Events

? If two events are independent, P(E|F) = P(E). The probability of both E and F occurring is:

P(E ∩ F) = P(E) × P(F)

? If two events are mutually exclusive, the probability of either E or F is:

P(E ∪ F) = P(E) + P(F)

? If the events are not mutually exclusive (you need to correct for the ‘overlap’):

P(E ∪ F) = P(E) + P(F) − P(E ∩ F),

where

P(E ∩ F) = P(E) × P(F|E)
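As a quick numeric check of these rules, a hypothetical two-dice example (not from the handout):

```python
# Hypothetical example: roll two fair dice independently.
p_e = 1 / 6   # P(E): first die shows a six
p_f = 1 / 6   # P(F): second die shows a six

# Independent events: P(E and F) = P(E) * P(F)
p_both = p_e * p_f             # 1/36

# E and F are not mutually exclusive, so correct for the overlap:
# P(E or F) = P(E) + P(F) - P(E and F)
p_either = p_e + p_f - p_both  # 11/36

print(round(p_both, 4), round(p_either, 4))
```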

1.3 Bayes’ Theorem

Bayes’ theorem for any two events:

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|∼A)P(∼A)]
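A small numeric sketch of the theorem; the prevalence, sensitivity, and false-positive figures below are made up purely for illustration:

```python
# Hypothetical screening-test numbers (illustration only):
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.99      # P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|~A): false-positive rate

# Bayes' theorem: P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A))
numerator = p_b_given_a * p_a
posterior = numerator / (numerator + p_b_given_not_a * (1 - p_a))

print(round(posterior, 3))  # ≈ 0.167: a positive result is far from certain
```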

? Frequentist:

– There are true, fixed parameters in a model (though they may be unknown at times).

– Data contain random errors which have a certain probability distribution (Gaussian, for example).

– Mathematical routines analyze the probability of getting certain data, given a particular model.

? Bayesian:

– There are no true model parameters. Instead, all parameters are treated as random variables with probability distributions.

– Random errors in data have no probability distribution; rather, the model parameters are random, with their own distributions.

– Mathematical routines analyze the probability of a model, given some data. The statistician makes a guess (prior distribution) and then updates that guess with the data.


Chapter 2

Basic Definitions

2.1 Types of Data

There are two types of measurements:

? Quantitative: Discrete data take on a finite or countable set of values. Continuous data can take any value within a range (an infinite number of steps).

? Categorical (nominal): the possible responses consist of a set of categories rather than numbers that measure an amount of something on a continuous scale.

2.2 Errors

? Random error: is due to chance, has no particular pattern, and is assumed to cancel itself out over repeated measurements.

? Systematic error: has an observable pattern, is not due to chance, and its causes can often be identified.

2.3 Reliability

How consistent or repeatable measurements are:

? Multiple-occasions reliability (test-retest, temporal): how similarly a test performs over repeated administrations.

? Multiple-forms reliability (parallel-forms): how similarly different versions of a test perform in measuring the same entity.

? Internal consistency reliability: how well the items that make up an instrument (a test) reflect the same construct.

2.4 Validity

How well a test or rating scale measures what it is supposed to measure:

? Content validity: how well the process of measurement reflects the important content of the domain of interest.

? Concurrent validity: how well inferences drawn from a measurement can be used to predict some other behaviour that is measured at approximately the same time.

? Predictive validity: the ability to draw inferences about some event in the future.

2.5 Probability Distributions

? Statistical inference relies on making assumptions about the way data are distributed, or on transforming data to make them fit some known distribution better.

? A theoretical probability distribution is defined by a formula that specifies what values can be taken by data points within the distribution and how common each value (or range) will be.


2.6 Population and Samples

? We rarely have access to the entire population ofusers. Instead we rely on a subset of the popu-lation to use as a proxy for the population.

? Sample statistics estimate unknown popu-lation parameters.

? Ideally you should select your sample ran-domly from the parent population, but in prac-tice this can be very difficult due to:

– issues establishing a truly random selectionscheme,

– problems getting the selected users to par-ticipate.

? Representativeness is more important than ran-domness.

Nonprobability Sampling

? Subject to sampling bias. Conclusions are of limited usefulness in generalizing to a larger population:

– Volunteer samples.

– Convenience samples: used to collect information in the early stages of a study.

– Quota sampling: the data collector is instructed to get responses from a certain number of subjects within classifications.

Probability Sampling

? Every member of the population has a known probability of being selected for the sample.

? The simplest type is simple random sampling (SRS).

? Systematic sampling: you need a list of your population, decide the size of the sample, and then compute a number n that dictates how you will select the sample:

– Calculate n by dividing the size of the population by the number of subjects you want in the sample.

– Useful when the population accrues over time and there is no predetermined list of population members.

– One caution: make sure the data are not cyclic.

? Stratified sample: the population of interest is divided into non-overlapping groups, or strata, based on common characteristics.

? Cluster sample: the population is sampled by using pre-existing groups. It can be combined with the technique of sampling proportional to size.

2.7 Bias

? The sample needs to be a good representation of the study population.

? If the sample is biased, it is not representative of the study population, and conclusions drawn from the study sample might not apply to the study population.

? A statistic used to estimate a parameter is unbiased if the expected value of its sampling distribution is equal to the value of the parameter being estimated.

? Bias is a source of systematic error and enters studies in two primary ways:

– During the selection and retention of the subjects of study.

– In the way information is collected about the subjects.

Sample Selection Bias

? Selection bias: occurs if some potential subjects are more likely than others to be selected for the study sample. The sample is selected in a way that systematically excludes part of the population.


? Volunteer bias: the fact that people who volunteer to be in studies are usually not representative of the population as a whole.

? Nonresponse bias: the other side of volunteer bias. Just as people who volunteer to take part in a study are likely to differ systematically from those who do not, so people who decline to participate in a study when invited to do so very likely differ from those who consent to participate.

? Informative censoring: can create bias in any longitudinal study (a study in which subjects are followed over a period of time). Losing subjects during a long-term study is common, but the real problem comes when subjects do not drop out at random, but for reasons related to the study’s purpose.

Information Bias

? Interviewer bias: when bias is introduced into the data collected because of the attitudes or behaviour of the interviewer.

? Recall bias: the fact that people with a life experience, such as suffering from a serious disease or injury, are more likely to remember events that they believe are related to that experience.

? Detection bias: the fact that certain characteristics may be more likely to be detected or reported in some people than in others.

? Social desirability bias: caused by people’s desire to present themselves in a favorable light.

2.8 Questions on Samples

Representative Sampling

? How was the sample selected?

? Was it truly randomly selected?

? Were there any biases in the selection process?

Bias

? Response bias: how were the questions worded and the responses collected?

? Conscious bias: are arguments presented in a disinterested, objective fashion?

? Missing data and refusals: how are missing data treated in the analysis? How is attrition (loss of subjects after a study begins) handled?

Sample Size

? Were the sample sizes selected large enough for a null hypothesis to be rejected?

? Were the sample sizes so large that almost any null hypothesis would be rejected?

? Was the sample size selected on the basis of a power calculation?

2.9 Central Tendency

Mean

? A good measure when the data set is roughly symmetrical:

µ = (1/n) ∑_{i=1}^n xi.

? Outliers: may be data errors, or may belong to another population.

Median

? The middle value when the values are ranked in ascending or descending order.

? When data are not symmetrical, the mean can be heavily influenced by outliers, and the median provides a better idea of the most typical value.

? For odd n, the median is the central value, the ((n + 1)/2)th. For even n, it is the average of the two central values, the (n/2)th and the (n/2 + 1)th.


? In a small sample of data (less than 25 or so), the sample median tends to do a poor job of estimating the population median.

? For task-time data, the geometric mean tends to provide a better estimate of the population’s middle value than the sample median.

Mode

? The most frequently occurring value.

? It is useful for describing ordinal or categorical data.
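The three measures behave quite differently on skewed data; a quick illustration with Python’s statistics module (the data are hypothetical):

```python
import statistics

# Hypothetical, right-skewed sample with one outlier:
data = [2, 3, 3, 4, 5, 6, 40]

print(statistics.mean(data))    # 9.0 — pulled up by the outlier 40
print(statistics.median(data))  # 4   — n is odd, so the ((n+1)/2)th value
print(statistics.mode(data))    # 3   — the most frequent value
```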

Dispersion

? Range: the simplest measure of dispersion, the difference between the highest and lowest values.

? Interquartile range: less influenced by extreme values.

? Variance:

– The most common way to measure dispersion for continuous data.

– Provides an estimate of the average squared difference of each value from the mean.

– For a population:

σ² = (1/n) ∑_{i=1}^n (xi − µ)²

– For a sample:

s² = (1/(n − 1)) ∑_{i=1}^n (xi − x̄)²

? Standard deviation:

– For a population: σ = √σ²

– For a sample: s = √s²
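These definitions map directly onto Python’s statistics module, which provides the population and sample versions separately (the data are hypothetical):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean = 5

# Population variance/SD: divide the squared deviations by n
print(statistics.pvariance(data))  # 4.0
print(statistics.pstdev(data))     # 2.0

# Sample variance/SD: divide by n - 1 (Bessel's correction)
print(statistics.variance(data))   # 32/7 ≈ 4.571
print(round(statistics.stdev(data), 3))
```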


Chapter 3

The Normal Distribution

Figure 3.1: (left) All normal distributions have the same shape but differ in their µ and σ: they are shifted by µ and stretched by σ. (right) Percent of data falling into specified ranges of the normal distribution.

? All normal distributions are symmetric, unimodal (a single most common value), and have a continuous range from negative infinity to positive infinity.

? The total area under the curve adds up to 1, or 100%.

? Empirical rule: for a population that follows a normal distribution, almost all the values will fall within three standard deviations above and below the mean (99.7%). Two standard deviations cover about 95%. One standard deviation covers 68%.

The Central Limit Theorem

? As the sample size approaches infinity, the distribution of sample means will follow a normal distribution, regardless of the parent population’s distribution (in practice, a sample size of 30 or larger is usually enough).

? The mean of this distribution of sample means will also be equal to the mean of the parent population. If X1, X2, ..., Xn all have mean µ and variance σ², the sampling distribution of X̄ = (∑ Xi)/n satisfies:

E(X̄) = µ, Var(X̄) = σ²/n,

and, using the central limit theorem, for large n, X̄ ∼ N(µ, σ²/n).

? If we are interested in p, a proportion or the probability of an event with two outcomes, we use the estimator p̂, the proportion of times we see the event in the data. The sampling distribution of p̂ (an unbiased estimator of p) has expected value p and standard deviation √(p(1 − p)/n). By the central limit theorem, for large n, the sampling distribution is p̂ ∼ N(p, p(1 − p)/n).
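A small simulation illustrates the theorem: sample means drawn from a decidedly non-normal (uniform) parent still cluster around the parent mean with standard deviation σ/√n. The sample sizes here are arbitrary choices:

```python
import random
import statistics

random.seed(0)

# Parent population: Uniform(0, 1) — mean 0.5, variance 1/12, not normal.
n = 50             # size of each sample
num_samples = 2000 # number of sample means to collect

means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(num_samples)]

# Theory: E(X̄) = µ = 0.5 and sd(X̄) = σ/√n = sqrt(1/12)/sqrt(50) ≈ 0.041
print(round(statistics.mean(means), 3))
print(round(statistics.pstdev(means), 3))
```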

The Z-Statistic

? A z-score is the distance of a data point from the mean, expressed in units of standard deviation. The z-score for a value from a population is:

Z = (x − µ)/σ

? Conversion to z-scores places distinct populations on the same metric.


? We are often interested in the probability of a particular sample mean. We can use the normal distribution even if we do not know the distribution of the population from which the sample was drawn, by calculating the z-statistic:

Z = (x̄ − µ)/(σ/√n).

? The standard error of the mean, σ/√n, is the standard deviation of the sampling distribution of the sample mean:

– It describes the variability of the means of multiple samples from a population.

– It is always smaller than the standard deviation (for n > 1).

– The larger our sample size, the smaller the standard error, and the less we would expect our sample mean to differ from the population mean.

? If we know the mean but not the standard deviation of the population, we can calculate the t-statistic instead (following sections).
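As a numeric sketch of the z-statistic (the population and sample figures are hypothetical):

```python
import math

mu, sigma = 100, 15   # hypothetical population mean and SD
n, xbar = 36, 104     # hypothetical sample size and sample mean

se = sigma / math.sqrt(n)  # standard error of the mean = 15/6 = 2.5
z = (xbar - mu) / se       # z-statistic = 4/2.5 = 1.6

print(se, z)  # 2.5 1.6
```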


Chapter 4

The Binomial Distribution

? An example of a discrete distribution.

? Events in a binomial distribution are generated by a Bernoulli process. A single trial within a Bernoulli process is called a Bernoulli trial.

? The data must meet four requirements:

– The outcome of each trial is one of two mutually exclusive outcomes.

– Each trial is independent.

– The probability of success, p, is constant for every trial.

– There is a fixed number of trials, denoted as n.

? The probability of a particular number of successes k in a particular number of trials n is:

P(X = k) = (n choose k) p^k (1 − p)^(n−k),

where

(n choose k) = nCk = n! / (k!(n − k)!)

? If both np and n(1 − p) are greater than 5, the binomial distribution can be approximated by the normal distribution.
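The formula translates directly into code; the coin-flip numbers below are arbitrary illustration values:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 fair coin flips:
print(binomial_pmf(3, 10, 0.5))  # 0.1171875 (= 120/1024)

# Sanity check: the pmf sums to 1 over k = 0..n
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))
```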


Chapter 5

Confidence Intervals

Point Estimate

? A point estimate is a single statistic, such as the mean, used to describe a sample.

? If we drew a different sample, the mean calculated from that sample would probably be different.

? Interval estimates state how much a point estimate is likely to vary by chance.

? One common interval estimate is the confidence level, calculated as (1 − α), where α is the significance level.

Confidence Interval

? The confidence interval is the range of values that we believe will have a specified chance of containing the unknown population parameter.

? The confidence interval is a range of values around the mean such that, if we drew an infinite number of samples of the same size from the same population, x% of the time the true population mean would be included in the confidence interval calculated from the samples. It gives us information about the precision of a point estimate such as the sample mean.

? A CI will tell you the most likely range of the unknown population mean or proportion.

? The confidence is in the method, not in any single interval.

? Any value inside the interval could be said to be a plausible value.

? CIs are twice the margin of error and provide both a measure of location and precision.

? Three things affect the width of a confidence interval:

– Confidence level: usually 95%.

– Variability: the more variation there is in a population, the more each sample will fluctuate, and the wider the confidence interval. The variability of the population is estimated using the standard deviation from the sample.

– Sample size: without lowering the confidence level, the sample size can control the width of a confidence interval, having an inverse square root relationship to it.

CI for Completion Rate (Binary Data)

? The completion rate is one of the most fundamental usability metrics, defining whether a user can complete a task, usually as a binary response.


? The first method to estimate binary success rates is given by the Wald interval:

p̂ ± z_{1−α/2} √(p̂(1 − p̂)/n),

where p̂ is the sample proportion, n is the sample size, and z_{1−α/2} is the critical value from the normal distribution for the level of confidence.

? It is very inaccurate for small sample sizes (less than 100) or for proportions close to 0 or 1.

? For 95% (z = 1.96) confidence intervals, we can add two successes and two failures to the observed number of successes and failures (the adjusted Wald interval):

p̂_adj = (x + z²/2)/(n + z²) = (x + 1.96²/2)/(n + 1.96²) ≈ (x + 2)/(n + 4),   (5.0.1)

where x is the number who successfully completed the task and n is the number who attempted the task (sample size).
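A sketch of both intervals; the 9-of-12 completion figures are hypothetical:

```python
import math

x, n = 9, 12   # hypothetical: 9 of 12 users completed the task
z = 1.96       # normal critical value for 95% confidence

# Wald interval: p ± z * sqrt(p(1-p)/n)
p = x / n
wald_half = z * math.sqrt(p * (1 - p) / n)
print(round(p - wald_half, 3), round(p + wald_half, 3))

# Adjusted Wald: replace p with (x + z^2/2)/(n + z^2), roughly (x+2)/(n+4)
p_adj = (x + z**2 / 2) / (n + z**2)
n_adj = n + z**2
adj_half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
print(round(p_adj - adj_half, 3), round(p_adj + adj_half, 3))
```

Note how the adjusted interval pulls the estimate away from the 0/1 boundary, where the plain Wald interval misbehaves at small n.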

CI for Task-time Data

? Measuring time on task is a good way to assess task performance. Task times tend to be positively skewed (a non-symmetrical distribution), so the mean is not a good measure of the center of the distribution.

? In this case, the median is a better measure of the center. However, there are two major drawbacks to the median: variability and bias.

? The median does not use all the information available in a sample; consequently, the medians of samples from a continuous distribution are more variable than their means.

? The increased variability of the median relative to the mean is amplified when the sample sizes are small.

? We also want our sample mean to be unbiased, where any sample mean is just as likely to overestimate as to underestimate the population mean. The median does not share this property: at small sample sizes, the sample median of completion times tends to consistently overestimate the population median.

? For small-sample task-time data, the geometric mean estimates the population median better than the sample median. As the sample sizes get larger (above 25), the median tends to be the best estimate of the middle value. We can use the binomial distribution to estimate the confidence intervals: the following formula constructs a confidence interval around any percentile. The median (p = 0.5) is the most common:

np ± z_{1−α/2} √(np(1 − p)),

where n is the sample size, p is the percentile expressed as a proportion, and √(np(1 − p)) is the standard error. The confidence interval around the median is given by the values at the resulting integer ranks in the ordered data set.
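A sketch of this rank-based interval under the normal approximation above; the task times are invented, and the floor/ceil rounding of the ranks is one common convention among several:

```python
import math

# Hypothetical ordered task times, in seconds (n = 25):
times = sorted([34, 36, 39, 40, 41, 43, 44, 46, 48, 50, 51, 53,
                55, 57, 59, 62, 64, 67, 70, 74, 78, 83, 89, 96, 104])

n = len(times)
p = 0.5   # percentile of interest: the median
z = 1.96  # 95% confidence

se = math.sqrt(n * p * (1 - p))               # standard error of the rank
lo_rank = max(1, math.floor(n * p - z * se))  # 1-based ranks
hi_rank = min(n, math.ceil(n * p + z * se))

# The CI is given by the data values at those ranks:
print(times[lo_rank - 1], times[hi_rank - 1])
```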


Chapter 6

Hypothesis Testing

? H0 is called the null hypothesis and H1 is the alternative hypothesis. They are mutually exclusive and exhaustive. A null hypothesis can never be proven to be true; it can only be shown to be plausible.

? The alternative hypothesis can be single-tailed (the statistic must exceed some value to reject the null hypothesis),

H0 : µ1 ≤ µ2, H1 : µ1 > µ2,

or two-tailed (the statistic must differ from a certain value to reject the null hypothesis),

H0 : µ1 = µ2, H1 : µ1 ≠ µ2.

? A result is statistically significant when it is unlikely to be due to chance.

? If we fail to reject the null hypothesis (do not find significance), this does not mean that the null hypothesis is true, only that our study did not find sufficient evidence to reject it.

? Tests are not reliable if the statement of the hypothesis is suggested by the data: data snooping.

p-value

? We choose the probability level, or p-value, that defines when sample results will be considered strong enough to support rejection of the null hypothesis.

? It expresses the probability that results as extreme as those obtained in an analysis of sample data are due to chance.

? A low p-value (for example, less than 0.05) means that the data would be unlikely if the null hypothesis were true.

? With null hypothesis testing, all it takes is sufficient evidence (instead of definitive proof) that there is at least some difference. The size of the difference is given by the confidence interval around the difference.

? A small p-value might occur:

– by chance,

– because of problems related to data collection,

– because of violations of the conditions necessary for the testing procedure,

– because H0 is false.

? If multiple tests are carried out, some are likely to be significant by chance alone! For α = 0.05, we expect significant results 5% of the time.

? Be suspicious when you see a few significant results when many tests have been carried out, or significant results on a few subgroups of the data.


Errors in Statistics

? If we wrongly say that there is a difference, we have a Type I error. If we wrongly say there is no difference, it is called a Type II error.

? Setting α = 0.05 means that we accept a 5% probability of a Type I error, i.e., we have a 5% chance of rejecting the null hypothesis when we should fail to reject it.

? Levels of acceptability for Type II errors are usually β = 0.1, meaning a 10% probability of a Type II error, or a 10% chance that the null hypothesis will be false but will fail to be rejected in the study.

? The complement of the Type II error rate is power, defined as 1 − β: the probability of rejecting the null hypothesis when you should reject it.

? Power of a test:

– The significance level α of a test shows how the testing method performs under repeated sampling.

– If H0 is true and α = 0.01, and you carry out the test repetitively, with different samples of the same size, you reject H0 (Type I error) 1 percent of the time!

– Choosing α to be very small means that you rarely reject H0, even if the true value is different from H0.

? The following four main factors affect power:

– α level (a higher probability of Type I error increases power).

– Difference in outcome between populations (a greater difference increases power).

– Variability (reduced variability increases power).

– Sample size (a larger sample size increases power).

? How to have higher power:

– The power is higher the further the alternative values are from H0.

– A higher significance level α gives higher power.

– Less variability gives higher power.

– The larger the sample size, the greater the power.


Chapter 7

The t-Test

t-Distribution

? The t-distribution adjusts for how good our estimate is by making the intervals wider as the sample sizes get smaller. It converges to the normal confidence intervals as the sample size increases (more than 30).

? There are two main reasons for using the t-distribution to test differences in means: when working with small samples from a population that is approximately normal, and when we do not know the standard deviation of a population and need to use the standard deviation of the sample as a substitute for the population deviation.

? t-distributions are continuous and symmetrical, and a bit fatter in the tails.

? Unlike the normal distribution, the shape of the t-distribution depends on the degrees of freedom for a sample.

? At smaller sample sizes, sample means fluctuate more around the population mean. For example, instead of 95% of values falling within 1.96 standard deviations of the mean, at a sample size of 15 they fall within 2.14 standard deviations.

Confidence Interval for t-Distributions

To construct the interval,

x̄ ± t_{1−α/2} (s/√n),

where x̄ is the sample mean, n is the sample size, s is the sample standard deviation, and t_{1−α/2} is the critical value from the t-distribution for n − 1 degrees of freedom and the specified level of confidence, we need:

1. the mean and standard deviation of the sample,

2. the standard error,

3. the sample size,

4. the critical value from the t-distribution for the desired confidence level.

The standard error is the estimate of how much the sample means will fluctuate around the true population mean:

se = s/√n.


The t-statistic is simply:

t = (x̄ − µ)/(s/√n),

where we need α (the level of significance, usually 0.05, one minus the confidence level) and the degrees of freedom. The degrees of freedom (df) for this type of confidence interval is the sample size minus 1.

For example, a sample size of 12 gives the expectation that 95% of sample means will fall within 2.2 standard errors of the population mean. We can also express this as the margin of error:

me = 2.2 s/√n,

and the confidence interval is given as twice the margin of error.
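Putting the pieces together for a hypothetical sample of 12 (the critical value 2.201 is t for 95% confidence and df = 11, which the text rounds to 2.2):

```python
import math
import statistics

# Hypothetical sample, n = 12:
data = [5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.3, 5.0, 5.4, 5.9, 4.7, 5.6]

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)  # sample SD (divides by n - 1)
se = s / math.sqrt(n)       # standard error of the mean

t_crit = 2.201              # t critical value, 95% confidence, df = 11
me = t_crit * se            # margin of error

print(round(xbar - me, 2), round(xbar + me, 2))  # the 95% CI
```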

t-test

Assumptions of t-tests:

? The samples are unrelated/independent; otherwise, the paired t-test should be used. You can test for independence.

? t-tests assume that the underlying population variances of the two groups are equal (variances are pooled), so you should test for homogeneity of variance.

? Normality of the distributions of both variables is also assumed (unless the sample sizes are large enough to apply the central limit theorem).

? Both samples are representative of their parent populations.

? t-tests are robust even for small sample sizes, except when there is an extremely skewed distribution or outliers.

1-Sample t-Test

One way the t-test is used is to compare the mean of a sample to a population with a known mean. The null hypothesis is that there is no significant difference between the mean of the population from which your sample was drawn and the mean of the known population.

The standard deviation of the sample is:

s = √( ∑_{i=1}^n (xi − x̄)² / (n − 1) )

The degrees of freedom for the one-sample t-test is n − 1.

CI for the One-Sample t-Test

The formula to compute a two-tailed confidence interval for the mean for the one-sample t-test is:

CI_{1−α} = x̄ ± t_{α/2, df} (s/√n)

If you want to calculate a one-sided CI, change the ± sign to either plus or minus and use the upper critical value and α rather than α/2.

2-Sample t-Test (Independent Samples)

The t-test for independent samples determines whether the means of the populations from which the samples were drawn are the same. The subjects in the two samples are assumed to be unrelated and independently selected from their populations. We assume that the populations are approximately normally distributed (or that the samples are large enough to invoke the central limit theorem) and that the populations have approximately equal variance:

t = ( (x̄₁ − x̄₂) − (μ₁ − μ₂) ) / √( s²ₚ (1/n₁ + 1/n₂) ),

where the pooled variance is

s²ₚ = ( (n₁ − 1)s₁² + (n₂ − 1)s₂² ) / (n₁ + n₂ − 2).
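The pooled formulas above can be sketched in Python with hypothetical data:

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Independent-samples t with pooled variance (H0: mu1 - mu2 = 0)."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))                 # standard error
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se
    return t, n1 + n2 - 2

# Hypothetical task times for two independent groups
t, df = pooled_t([12, 15, 11, 14], [16, 18, 17, 19])
print(round(t, 3), df)  # → -4.025 6
```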


CI for the Independent Samples t-Test

CI₁₋α = (x̄₁ − x̄₂) ± t_{α/2, df} · √( s²ₚ (1/n₁ + 1/n₂) )

Repeated Measures t-Test

Also known as the related-samples or dependent-samples t-test; here the samples are not independent. The measurements are treated as pairs, so the two samples must be of the same size. The formula is based on the difference scores calculated from each pair of observations:

t = ( d̄ − (μ₁ − μ₂) ) / ( s_d/√n ),

where d̄ is the mean of the difference scores and n is the number of pairs. The degrees of freedom is still n − 1.

The null hypothesis for the repeated measures t-test is usually that the mean of the difference scores, d̄, is 0.
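A minimal sketch of the difference-score calculation in Python, with hypothetical before/after data:

```python
import math
import statistics

def paired_t(before, after):
    """Repeated-measures t from difference scores (H0: mean difference = 0)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    dbar = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sd of the difference scores
    return dbar / (sd / math.sqrt(n)), n - 1

# Hypothetical before/after measurements for the same five subjects
t, df = paired_t([10, 12, 9, 11, 13], [8, 11, 9, 9, 12])
print(round(t, 3), df)  # → 3.207 4
```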

CI for the Repeated Measures t-Test

CI₁₋α = d̄ ± t_{α/2, df} · (s_d/√n)

Two Measurements

• H₀: μ = μ₀, H_A: μ ≠ μ₀

• observations: n, x̄, s, d = |x̄ − μ₀|

• Reject H₀ if

P(|X̄ − μ₀| ≥ d) = P( |X̄ − μ₀|/√(s²/n) ≥ d/√(s²/n) ) = P( |t_{n−1}| ≥ d/√(s²/n) ) ≤ α

Comparing Samples

The concept of the number of standard errors by which sample means differ from population means applies to both confidence intervals and significance tests.

The two-sample t statistic is found by weighting the average of the variances:

t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ ),

where x̄₁ and x̄₂ are the means of samples 1 and 2, s₁ and s₂ the standard deviations of samples 1 and 2, and n₁ and n₂ the sample sizes. The degrees of freedom for the t-value is approximately 2 less than the smaller of the two sample sizes. A p-value is just a percentile rank or point in the t-distribution.

The null hypothesis for an independent samples t-test is that the difference between the population means is 0, in which case (μ₁ − μ₂) can be dropped. The degrees of freedom is n₁ + n₂ − 2, fewer than the number of cases when both samples are combined.

Between-subjects Comparison (Two-sample t-test)

When a different set of users is tested on each prod-uct, there is variation both between users and be-tween designs. Any difference between the meansmust be tested to see whether it is greater thanthe variation between the different users. To deter-mine whether there is a significant difference betweenmeans of independent samples of users, we use thetwo-sample t-test:

t =x1 − x2√s21n1

+ s22n2

The degrees of freedom for one-sample t-test is n−1. For a two-sample t-test, a simple formula is n1 +1n2 − 2. However the correct formula is given byWelch-Satterthwaite procedure. It provides accurate


results even if the variances are unequal:

df = ( s₁²/n₁ + s₂²/n₂ )² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]
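The Welch procedure can be sketched in Python with hypothetical data of unequal group sizes; note the fractional degrees of freedom:

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Two-sample t without the equal-variance assumption, with
    Welch-Satterthwaite (possibly fractional) degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    a = statistics.variance(sample1) / n1     # s1^2 / n1
    b = statistics.variance(sample2) / n2     # s2^2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Hypothetical data with unequal group sizes
t, df = welch_t([12, 15, 11, 14], [16, 18, 17, 19, 21, 15])
print(round(t, 3), round(df, 2))  # → -3.677 7.36
```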

CI Around the Difference

There are several ways to report an effect size, but the most compelling and easiest to understand is the confidence interval. The following formula generates a confidence interval around the difference scores, to understand the likely range of the true difference between products:

(x̄₁ − x̄₂) ± t_α · √( s₁²/n₁ + s₂²/n₂ ).


Chapter 8

Regression

Linear Regression

• Linear regression calculates the equation that produces the line that is as close as possible to all the data points considered together. This is described as minimizing the squared deviations: the sum of the squared deviations between each data point and the regression line.

• The difference between each point and the regression line is also called the residual, because it represents the variability in the actual data points not accounted for by the equation generating the line.

• Minimizing the squared deviations can be expressed as minimizing the errors of prediction, or minimizing the residuals.

• The most important assumption of linear regression is the independence and normality of errors in the independent and dependent variables.

• Other assumptions for simple linear regression are: data appropriateness (outcome variable continuous and unbounded), linearity, distribution (the continuous variables are approximately normally distributed and do not have extreme outliers), homoscedasticity (the errors of prediction are constant over the entire data range), and independence.

• The sum of the squared deviations is the sum of squares of errors, or SSE:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (a xᵢ + b))²

• In this formula, yᵢ is an observed data value and ŷᵢ is the predicted value according to the regression equation.

• The corrected sum of squares of x is

S_xx = Σx² − (Σx)²/n

and the corrected sum of cross-products of x and y is

S_xy = Σxy − (Σx)(Σy)/n.

• The slope of a simple regression equation is

a = S_xy / S_xx,

and the intercept of a simple regression equation is

b = Σy/n − a · Σx/n.
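The S_xx / S_xy formulas above can be sketched in Python; the sample points are hypothetical, chosen to lie exactly on y = 2x + 1:

```python
def simple_regression(xs, ys):
    """Slope and intercept from the S_xx / S_xy sums above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    a = sxy / sxx                      # slope
    b = sum_y / n - a * sum_x / n      # intercept
    return a, b

# Points lying exactly on y = 2x + 1
a, b = simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # → 2.0 1.0
```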

Independent and Dependent Variable

Variables can be either dependent, if they represent an outcome of the study, or independent, if they are presumed to influence the value of the dependent variables. There is also a third category, control variables, which might influence the dependent variable but are not the main focus of interest.

In a standard linear model such as an OLS (ordinary least squares) regression equation, the outcome or dependent variable is indicated by Y, and the independent variables by X:

Y = β₀ + β₁X₁ + β₂X₂ + ... + ε,

where ε is the error term, reflecting the fact that we do not assume any regression equation will perfectly predict Y, and the βs are the regression coefficients.

Other terms for the dependent variable include the outcome variable, the response variable, and the explained variable. Other names for the independent variables are regressors, predictor variables, and explanatory variables.

Multiple Linear Regression

Multiple linear regression may be used to find the relationship between a single, continuous outcome variable and a set of predictor variables that might be continuous, dichotomous, or categorical; if categorical, the predictors must be recoded into a set of dichotomous dummy variables.

Multiple linear regression, in which two or more independent variables (predictors) are related to a single dependent variable, is much more common than simple linear regression. Two general principles apply to regression modelling. First, each variable included in the model should carry its own weight: it should explain unique variance in the outcome variable. Second, when you deal with multiple predictors, you have to expect that some of them will be correlated with one another as well as with the dependent variable.

Y = β₀ + β₁X₁ + β₂X₂ + ... + β_nX_n + ε,

where Y is the dependent variable and β₀ is the intercept. No variable can be a linear combination of the other variables.


Chapter 9

Logistic Regression

• We use logistic regression when the dependent variable is dichotomous rather than continuous.

• Outcome variables are conventionally coded as 0-1, with 0 representing the absence of a characteristic and 1 its presence.

• The outcome variable in logistic regression is a logit, which is a transformation of the probability of a case having the characteristic in question.

• The logit is also called the log odds. If p is the probability of a case having some characteristic, the logit is:

logit(p) = log( p/(1 − p) ) = log(p) − log(1 − p).

• The logistic regression equation with n predictors is:

logit(p) = β₀ + β₁X₁ + β₂X₂ + ... + β_nX_n + ε.

• The differences from multiple linear regression when using a categorical outcome are:

– The assumption of homoscedasticity (common variance) is not met with categorical variables.

– Multiple linear regression can return values outside the permissible range of 0-1.

– We assume independence of cases, linearity (there is a linear relationship between the logit of the outcome variable and any continuous predictor), no multicollinearity, and no complete separation (the value of one variable cannot be perfectly predicted by the values of another variable or set of variables).

• As with linear regression, we have measures of model fit for the entire equation (evaluating it against the null model with no predictor variables) and tests for each coefficient (evaluating each against the null hypothesis that the coefficient is not significantly different from 0).

• The interpretation of the coefficients is different: instead of interpreting them in terms of linear changes in the outcome, we interpret them in terms of odds ratios.

Ratio, Proportion, and Rate

Three related metrics:

• A ratio expresses the magnitude of one quantity in relation to the magnitude of another, without making further assumptions about the two numbers or requiring that they share the same units.

• A proportion is a particular type of ratio in which all cases in the numerator are also included in the denominator. Proportions are often expressed as percents.


• A rate is a proportion in which the denominator includes a measure of time.

Prevalence and Incidence

• Prevalence describes the number of cases that exist in a population at a particular point in time. It is defined as the proportion of individuals in a population who have the characteristic at that point in time.

• Incidence requires three elements to be defined: new cases, the population at risk, and a time interval.

Odds Ratio

• The odds ratio was developed for use in case-control studies.

• The odds ratio is the ratio of the odds of exposure for the case group to the odds of exposure for the control group.

• The odds of an event are another way to express the likelihood of an event, similar to probability. The difference is that while probability is computed by dividing the number of events by the total number of trials, odds are calculated by dividing the number of events by the number of non-events:

odds = probability/(1 − probability)

or

probability = odds/(1 + odds)

• The odds ratio is simply the ratio of two odds:

OR = odds₁/odds₂ = [ p₁/(1 − p₁) ] / [ p₂/(1 − p₂) ]
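The conversions above can be sketched in Python (the probabilities used below are hypothetical):

```python
def to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

def odds_ratio(p1, p2):
    """Ratio of the odds of an event in group 1 to the odds in group 2."""
    return to_odds(p1) / to_odds(p2)

print(to_odds(0.5))                     # → 1.0  (a 50% chance is "even odds")
print(to_probability(3))                # → 0.75
print(round(odds_ratio(0.4, 0.25), 3))  # → 2.0
```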


Chapter 10

Other Topics

ANOVA

• Analysis of variance is a statistical procedure used to compare the mean values of some variable between two or more independent groups.

• It is called analysis of variance because the procedure involves partitioning variance: attributing the variance observed in a data set to different causes or factors, including group membership.

• ANOVA is a useful technique when analysing data from designed experiments.

• Assumptions of ANOVA:

– Independence, normality, and equality of variances.

– It is more reliable when the study is balanced (sample sizes are approximately equal).

• The major test statistic for an ANOVA is the F-ratio, which can be used to determine whether statistically significant differences exist between groups.

• One-Way ANOVA: the simplest form of ANOVA, in which only one variable is used to form the groups to be compared. This variable is called a factor. A one-way ANOVA with two levels is equivalent to performing a t-test. The null hypothesis in this type of design is usually that the two groups have the same mean.

• Factorial ANOVA: ANOVA with more than one grouping variable or factor. We are often interested in the influence of several factors, and how they interact. We might be interested in both main effects (of each factor alone) and interaction effects.
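The F-ratio for a one-way ANOVA can be sketched in Python with hypothetical groups:

```python
import statistics

def one_way_anova_f(*groups):
    """F ratio: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical groups with clearly different means
f = one_way_anova_f([3, 4, 5], [6, 7, 8], [9, 10, 11])
print(f)  # → 27.0
```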

Factor Analysis

• Uses standardized variables to reduce data sets via principal component analysis (PCA), a data-reduction technique.

• It is based on an orthogonal decomposition of an input matrix to yield an output matrix consisting of a set of orthogonal components (or factors) that maximize the amount of variation in the variables from the input matrix.

• This process produces a smaller, more compact number of output components (a set of eigenvectors and eigenvalues).

• The components in the output matrix are linear combinations of the input variables. The components are created so that the first component maximizes the variance captured, and each subsequent component captures as much of the residual variance as possible while taking on an uncorrelated (orthogonal) direction in space.


Nonparametric Statistics

• Distribution-free statistics: make few or no assumptions about the underlying distribution of the data.

• Sign Test: the analogue of the one-sample t-test; used to test whether a sample has a hypothesized median.

– Data values in the sample are classified as above (+) or below (−) the hypothesized median.

– Under the null hypothesis that the sample is drawn from a population with the specified median, these classifications have a binomial distribution with π = 0.5 (the population probability of a value falling above the median).

Z = ( (X ± 0.5) − np ) / √( np(1 − p) ),

where X is the number of observed values greater than the median, 0.5 is the continuity correction (negative in this case, for a hypothesis π > 0.5), np is the mean of the binomial distribution (the expected value of X if the null hypothesis is true), and the denominator is the standard deviation of the binomial distribution.
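The normal-approximation sign test above can be sketched in Python with a hypothetical sample:

```python
import math

def sign_test_z(values, median0):
    """Normal-approximation z for the sign test of a hypothesized median."""
    above = sum(1 for v in values if v > median0)
    n = sum(1 for v in values if v != median0)   # ties are dropped
    p = 0.5
    # Continuity correction: move X half a unit toward the mean n*p.
    correction = -0.5 if above > n * p else 0.5
    return (above + correction - n * p) / math.sqrt(n * p * (1 - p))

# Hypothetical sample tested against a hypothesized median of 10
z = sign_test_z([12, 15, 9, 20, 18, 14, 16, 11, 17, 19], 10)
print(round(z, 3))  # → 2.214
```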


Chapter 11

Some Related Questions

Design an experiment to see whether a new feature is better

• Questions:

– How often does Y occur?

– Do X and Y co-vary? (requires measuring X and Y)

– Does X cause Y? (requires manipulating X, measuring Y, and accounting for other independent variables)

• We have:

– goals for the visitor (sign up, buy).

– metrics to improve (visit time, revenue, return).

• We manipulate independent variables (independent of what the user does) to measure dependent variables (what the user does).

• We can test two versions with an A/B test, or many full versions of a single page.

• To see how different web pages perform, use a random sample of visitors: define what percentage of visitors are included in the experiment, and choose which object to test.

Mean standard deviation of two blended datasets for which you know their means and standard deviations

The difference between the means of two samples, A and B, both randomly drawn from the same normally distributed source population, belongs to a normally distributed sampling distribution whose overall mean is equal to zero and whose standard deviation ("standard error") is equal to

√( sd²/n_a + sd²/n_b ),

where sd² is the variance of the source population.

If each of the two coefficient estimates in a regression model is statistically significant, do you expect the test of both together is still significant?

• The primary result of a regression analysis is a set of estimates of the regression coefficients.

• These estimates are made by finding values for the coefficients that make the average residual 0 and the standard deviation of the residual term as small as possible.

• In order to see whether there is evidence supporting the inclusion of a variable in the model, we start by hypothesizing that it does not belong, i.e., that its true regression coefficient is 0.

• Dividing the estimated coefficient by the standard error of the coefficient yields the t-ratio of the variable, which simply shows how many standard-deviations-worth of sampling error would have to have occurred in order to yield an estimated coefficient so different from the hypothesized true value of 0.

• What is the relative importance of variation in the explanatory variables in explaining observed variation in the dependent variable? The beta-weights of the explanatory variables can be compared.

• Beta weights are the regression coefficients for standardized data. Beta is the average amount by which the dependent variable increases when the independent variable increases by one standard deviation and the other independent variables are held constant. The ratio of the beta weights is the ratio of the predictive importance of the independent variables.

How to analyse non-normal distributions?

• When data is not normally distributed, the cause of the non-normality should be determined and appropriate remedial actions should be taken.

• Causes:

– Extreme values: too many extreme values in a data set will result in a skewed distribution. Find measurement errors, and remove outliers only with valid reasons.

– Overlap of two processes: see if the distribution looks bimodal, and stratify the data.

– Insufficient data discrimination: round-off errors or measurement devices with poor resolution make data look discrete and not normal. Use more accurate measurement systems or collect more data.

– Sorted data: the data may represent only a subset of the total output a process produced. This can happen if the data is collected and analysed after sorting.

– Values close to zero or a natural limit: the data will skew. The data can be raised to or transformed by a certain exponent (but we need to be sure that a normal distribution can then be assumed).

– Data follows a different distribution: Poisson distribution (rare events), exponential distribution (bacterial growth), log-normal distribution (length data), binomial distribution (proportion data such as percent defectives).

– Equivalent tools for non-normally distributed data:

∗ t-test, ANOVA: median test, Kruskal-Wallis test.

∗ paired t-test: one-sample sign test.

How do you test a data sample to see whether it is normally distributed?

• I would first plot the frequencies, comparing the histogram of the sample to a normal probability curve.

• This might be hard to see if the sample is small; in that case we can regress the data against the quantiles of a normal distribution with the same mean and variance as the sample. Lack of fit to the regression line suggests a departure from normality.

• Another thing you can do, as a back-of-the-envelope calculation, is take the sample's maximum and minimum and compute the z-score (or, more properly, the t-statistic), which is the number of sample standard deviations by which a sample value is above or below the mean:

z = (x − x̄) / (s/√N),

and then compare it to the 68-95-99.7 rule.

• If you have a 3-sigma event and fewer than 300 samples, or a 4-sigma event and fewer than 15k samples, then a normal distribution understates the maximum magnitude of deviations in the data.


• For example: 6-sigma events don't happen in normal distributions!

What’s a Hermitian matrix? What im-portant property does a Hermitian matrix’seigenvalues possess?

? It is equal to its conjugate transpose, for examplethe Paul matrices. A real matrix is Hermitian ifit is symmetric.

? The entries in the main diagonal are real.

? Any Hermitian matrix can be diagonalized by aunitary matrix (U*U=I)and all the eigenvaluesare real and the matrix has n linearly indepen-dent eigenvectors.

Random variable X is distributed as N(a, b), and random variable Y is distributed as N(c, d). What is the distribution of (1) X+Y, (2) X−Y, (3) X*Y, (4) X/Y?

• The normal density is

N(μ, σ²): f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)},

or, for the standard normal distribution, the probability density function is:

φ(x) = (1/√(2π)) e^{−x²/2}.

• With X ∼ N(μₓ, σₓ²) and Y ∼ N(μᵧ, σᵧ²), combining them by convolution (the Fourier transform of a Gaussian is again a Gaussian), we get:

Z = X + Y ∼ N(μₓ + μᵧ, σₓ² + σᵧ²).

Similarly, X − Y ∼ N(μₓ − μᵧ, σₓ² + σᵧ²).

• This requires the assumption of independence: if the random variables are correlated, the joint random variables are still normally distributed, but their variances are not additive and we need to account for the correlation.

• Any linear combination of independent normal deviates is a normal deviate!

• The product of two Gaussian PDFs is proportional to a Gaussian PDF with

σ²_XY = σₓ²σᵧ² / (σₓ² + σᵧ²)

and

μ_XY = (μₓσᵧ² + μᵧσₓ²) / (σₓ² + σᵧ²).

(Note this is about multiplying the density functions; the product X*Y of two normal random variables is not itself normal, and the ratio X/Y of two independent zero-mean normals follows a Cauchy distribution, which has no defined mean or variance.)

How to Write a Competitive Analysis?

? Your company’s competitors:

– list of your company’s competitors.

– companies that indirectly compete withyours, ones that offer products or servicesthat are aiming for the same customer cap-ital.

? Competitor product summaries:

– Analyse the competition’s products andservices in terms of features, value, and tar-gets.

– How do your competitor’s sell their wares?How do they market them?

– Customer satisfaction surveys conducted.How do customers see your competition?

? Competitor strengths and weaknesses:

– be objective, no bias.

– What makes their products so great?

– If they are growing rapidly, what is it abouttheir product or service that’s promotingthat growth?

? The market outlook:


– What is the market for your company's product like now?

– Is it growing? If so, then there are likely quite a few customers left to go around.

– If it is flat, then the competition for customers is likely to be fierce.

– Is the market splintering, i.e., breaking up into niches?

What is the yearly standard deviation of a stock given the monthly standard deviation?

SD_y = SD_m · √12

Given a dataset, how do you determine its sample distribution? Please provide at least two methods.

• Plot the data as a histogram: it should give you an idea as to what parametric family (exponential, normal, log-normal, ...) might provide a reasonable fit. If there is one, you can proceed with estimating the parameters.

• If you use a large enough statistical sample size, you can apply the Central Limit Theorem to a sample proportion for categorical data to find its sampling distribution.

• You may have 0/1 responses, so the distribution is binomial, maybe conditional on some other covariates; that's a logistic regression.

• You may have counts, so the distribution is Poisson.

• Fit a regression, and use a chi-square test for goodness of fit.

• Also, a small sample (say under 100) will be compatible with many possible distributions; there is simply no way of distinguishing between them on the basis of the data alone. On the other hand, large samples (say more than 10,000) won't be compatible with anything.

What’s the expectation of a uniform(a, b)distribution? What’s its variance?

? The PDF:f(x) =

1b− a

for a ≤ x ≤ b and 0 otherwise.

? Mean: E(X) = a+b2

? Variance: V (X) = (b−a)212
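The closed-form answers can be checked with a quick Monte Carlo sketch in Python (the endpoints a = 2, b = 10 are hypothetical):

```python
import random
import statistics

a, b = 2.0, 10.0
mean_exact = (a + b) / 2          # (a + b) / 2 = 6.0
var_exact = (b - a) ** 2 / 12     # (b - a)^2 / 12 ≈ 5.333

# Monte Carlo check of the closed-form answers
random.seed(1)
sample = [random.uniform(a, b) for _ in range(200_000)]
print(mean_exact, round(var_exact, 3))  # → 6.0 5.333
```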

What’s the difference between the t-stat andR2 in a regression? What does each measure?When you get a very large value in one but avery small value in the other, what does thattell you about the regression?

? Correlation is a measure of association betweentwo variables, the variables are not designatedas dependent or independent. The significance(probability) of the correlation coefficient is de-termined from the t-statistic. The test indicateswhether the observed correlation coefficient oc-curred by chance if the true correlation is zero.

? Regression is used to examine the relationshipbetween one dependent variable and one inde-pendent variable. The regression statistics canbe used to predict the dependent variable whenthe independent variables is known. Regressiongoes beyond correlation by adding prediction ca-pabilities.

? The significance of the slope of the regressionline is determined from the t-statistic. It is theprobability that the observed correlation coeffi-cient occurred by chance if the true correlationis zero. Some researchers prefer to report theF-ratio instead of the t-statistic. The F-ratio isequal to the t-statistic squared.

? The t-statistic for the significance of the slope isessentially a test to determine if the regression

Page 33: Crash Course on Basic Statistics - CBMMcbmm.mit.edu/sites/default/files/documents/probability_handout.pdf · Crash Course on Basic Statistics Marina Wahl, marina.w4hl@gmail.com University

33

model (equation) is usable. If the slope is signif-icantly different than zero, then we can use theregression model to predict the dependent vari-able for any value of the independent variable.

? The coefficient of determination (r-squared) isthe square of the correlation coefficient. Its valuemay vary from zero to one. It has the advantageover the correlation coefficient in that it may beinterpreted directly as the proportion of variancein the dependent variable that can be accountedfor by the regression equation. For example, anr-squared value of .49 means that 49% of thevariance in the dependent variable can be ex-plained by the regression equation.

? The R2 statistics is the amount of variance inthe dependent variable explained by your model.The F statistics is whether your model is signif-icant. The t statistics (if it’s a linear model ofmore than one predictor) is a measure of whetheror not the individual predictors are significantpredictors of the dependent variable. The rejec-tion of H0 depends on what it is. If it’s aboutyour overall model(s) you may look at the Fand/or R2 . If it is about the predictors youlook at the t statistics (but only if your overallmodel is significant).

? The t-values and R2 are used to judge very dif-ferent things. The t-values are used to judgethe accuracy of your estimate of the βi’s, butR2 measures the amount of variation in your re-sponse variable explained by your covariates.

? If you have a large enough dataset, you willalways have statistically significant (large) t-values. This does not mean necessarily meanyour covariates explain much of the variation inthe response variable.

What is bigger, the mean or the median?

If the distribution shows a positive skew, the mean is larger than the median. If it shows a negative skew, the mean is smaller than the median.

Pretend 1% of the population has a disease. You have a test that determines if you have that disease, but it's only 80% accurate, and 20% of the time you get a false positive. How likely is it that you have the disease given a positive test?

• Fact: 0.01 of the population has the disease (given).

• Data: the test is only 0.8 accurate (given), with a 0.2 false-positive rate.

• How likely is it that you have the disease?

• To conclude that you have the disease, you have to test positive and actually have the disease. Using Bayes:

P(disease | +) = (0.8 × 0.01) / (0.01 × 0.8 + 0.99 × 0.2) ≈ 0.04
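The Bayes calculation above can be sketched in Python:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# The numbers above: 1% prevalence, 80% sensitivity, 20% false positives
p = posterior(0.01, 0.80, 0.20)
print(round(p, 3))  # → 0.039
```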

SQL, what are the different types of table joins? What's the difference between a left join and a right join?

Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Cross Join.

• Simple example: let's say you have a Students table and a Lockers table.

• INNER JOIN is equivalent to "show me all students with lockers".

• LEFT OUTER JOIN would be "show me all students, with their corresponding locker if they have one".

• RIGHT OUTER JOIN would be "show me all lockers, and the students assigned to them if there are any".

• FULL OUTER JOIN would be silly and probably not much use. Something like "show me all students and all lockers, and match them up where you can".


• CROSS JOIN is also fairly silly in this scenario. It doesn't use the linked "locker number" field in the Students table, so you basically end up with a big giant list of every possible student-to-locker pairing, whether or not it actually exists.

How many gigabytes would you need to run Google Mail? How much do you think Gmail costs Google?

• If 16,000 GB costs about $8,000/month, then 1 GB is about $0.50/month, and each user has 15 GB, so about $7.50 per user per month.

• Gmail has around 400 million users, so roughly $3 billion per month.

• The marginal cost is the additional cost of adding one more user.

How much revenue does YouTube make in a day?

• Estimates put YouTube's revenues at $4 billion for 2013 and its operating income at $711 million. This puts its daily revenue at about $11 million and its daily income at about $1.9 million.

• Average of 133 page views per month per user.

With a data set from a normal distribution, suppose you can't get any observation greater than 5; how do you estimate the mean?

• Statistical analysis with small samples is like making astronomical observations with binoculars. You are limited to seeing big things: planets, stars, moons, and the occasional comet. But just because you don't have access to a high-powered telescope doesn't mean you cannot conduct astronomy.

• Comparing means: if your data is generally continuous (not binary), such as task time or rating scales, use the two-sample t-test. It has been shown to be accurate for small sample sizes.

• Comparing two proportions: if your data is binary (pass/fail, yes/no), then use the N-1 Two Proportion Test. This is a variation on the better-known chi-square test.

• Confidence intervals: while the confidence interval width will be rather wide (usually 20 to 30 percentage points), the upper or lower boundary of the interval can be very helpful in establishing how often something will occur in the total user population.

• Confidence interval around a mean: if your data is generally continuous (not binary), such as rating scales, order amounts in dollars, or the number of page views, the confidence interval is based on the t-distribution (which takes into account sample size).

• Confidence interval around task time: task-time data is positively skewed. There is a lower boundary of 0 seconds. It is not uncommon for some users to take 10 to 20 times longer than other users to complete the same task. To handle this skew, the time data needs to be log-transformed; the confidence interval is computed on the log data, then transformed back when reporting.


• Confidence interval around a binary measure: for an accurate confidence interval around binary measures like completion rate or yes/no questions, the Adjusted Wald interval performs well for all sample sizes.

• Completion rate: for small-sample completion rates, there are only a few possible values for each task. For example, with five users attempting a task, the only possible outcomes are 0%, 20%, 40%, 60%, 80%, and 100% success. It is not uncommon to have 100% completion rates with five users.

Derive the formula for the variance of OLS from scratch.

• Linear least squares is a method for estimating the unknown parameters in a linear regression model.

• This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation.

• The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side.

• Given a data set of n statistical units, a linear regression model assumes that the relationship between the dependent variable y_i and the p-vector of regressors x_i is linear. This relationship is modelled through a disturbance term or error variable ε_i, an unobserved random variable that adds noise to the linear relationship between the dependent variable and the regressors.

• Thus the model takes the form

y_i = β_1 x_i1 + ... + β_p x_ip + ε_i = x_i^T β + ε_i

where T denotes the transpose, so that x_i^T β is the inner product between the vectors x_i and β.
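To connect this to the variance question, here is a sketch of the single-regressor case with an intercept, on made-up data; it computes the least-squares slope and the variance formula that the derivation yields for simple regression, Var(b̂) = σ² / Σ(x_i − x̄)², with σ² estimated from the residuals:

```python
import statistics

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar = statistics.mean(xs)
ybar = statistics.mean(ys)

sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b = sxy / sxx          # least-squares slope estimate
a = ybar - b * xbar    # least-squares intercept estimate

# Estimate the error variance from the residuals (n - 2 degrees of
# freedom, since two parameters were fit).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sigma2 = sum(e * e for e in residuals) / (n - 2)

# Var(b_hat) = sigma^2 / sum((x_i - xbar)^2)
var_b = sigma2 / sxx
print(f"slope = {b:.3f}, Var(slope) = {var_b:.5f}")
```

The formula shows that spreading the x values out (larger Σ(x_i − x̄)²) makes the slope estimate more precise.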

Bayesian analysis of drug testing (determining the false positive rate).

• Say 0.5% of people are drug users and our test is 99% accurate (it correctly identifies 99% of users as positive and 99% of non-users as negative). What is the probability of being a drug user if you have tested positive?

• With A = "is a drug user" and B = "tests positive": P(B|A) = 0.99, P(B) = 0.01 × 0.995 + 0.99 × 0.005 = 0.0149, and P(A) = 0.005. By Bayes' theorem, P(A|B) = P(B|A)P(A)/P(B) = 0.99 × 0.005 / 0.0149 ≈ 0.33.
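The arithmetic above, written out:

```python
# Bayes' theorem for the drug-test example: A = "user", B = "positive test".
p_user = 0.005            # prior: 0.5% of people are users
p_pos_given_user = 0.99   # sensitivity: P(positive | user)
p_pos_given_clean = 0.01  # false positive rate: P(positive | non-user)

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_user * p_user + p_pos_given_clean * (1 - p_user)

# Posterior probability of being a user given a positive test.
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(f"P(user | positive) = {p_user_given_pos:.3f}")
```

Despite the "99% accurate" test, a positive result implies only about a one-in-three chance of actually being a user, because users are rare in the population.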

I have a coin; I tossed it 10 times and I want to test whether the coin is fair.

• A fair coin is an idealized randomizing device with two states (usually named "heads" and "tails") which are equally likely to occur.

• In more rigorous terminology, the problem is one of determining the parameters of a Bernoulli process, given only a limited sample of Bernoulli trials.

• If a maximum error of 0.01 is desired, how many times should the coin be tossed? At the 68.27% level of confidence (Z = 1), at the 95.4% level of confidence (Z = 2), and at the 99.90% level of confidence (Z = 3.3).
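A sketch of the computation behind those numbers, using the worst-case margin of error Z·√(p(1−p)/n) with p = 0.5:

```python
import math

def tosses_needed(z, max_error=0.01, p=0.5):
    """Smallest n such that the margin of error z*sqrt(p(1-p)/n)
    is at most max_error; p = 0.5 is the worst case for a coin."""
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)

for z in (1.0, 2.0, 3.3):
    print(f"Z = {z}: {tosses_needed(z)} tosses")
```

So a 0.01 margin requires 2500 tosses at Z = 1, 10000 at Z = 2, and 27225 at Z = 3.3: each extra digit of confidence is paid for quadratically in sample size.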

• Hypothesis testing lets you decide, at a certain level of significance, whether you have sufficient evidence to reject the underlying (null) hypothesis, or whether you lack sufficient evidence against the null hypothesis and hence accept it.

• The hypothesis test below assumes that you want to determine whether the coin comes up heads more often than tails. If instead you want to determine whether the coin is biased or unbiased, the same procedure holds; you just need to do a two-sided hypothesis test rather than a one-sided one.


• Your null hypothesis is p ≤ 0.5 and your alternate hypothesis is p > 0.5, where p is the probability that the coin shows a head. Say you want to perform your hypothesis test at the 10% level of significance. You proceed as follows:

• Let nH be the number of heads observed out of a total of n tosses of the coin.

• Take p = 0.5 (the extreme case of the null hypothesis). Let X ∼ B(n, 0.5).

• Solving P(X ≥ ncH) = 0.1 gives you the critical value ncH beyond which you have sufficient evidence to reject the null hypothesis at the 10% level of significance. That is, if you find nH ≥ ncH, you have sufficient evidence to reject the null hypothesis at the 10% level of significance and conclude that the coin comes up heads more often than tails.
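The critical value can be found numerically from the binomial tail. A sketch for n = 10 tosses; because the distribution is discrete, the tail probability cannot hit 0.1 exactly, so we take the smallest ncH whose tail is at most 0.1:

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 10
# Smallest critical value with P(X >= ncH) <= 0.10 under p = 0.5.
ncH = next(k for k in range(n + 1) if binom_sf(k, n) <= 0.10)
print(f"ncH = {ncH}, tail probability = {binom_sf(ncH, n):.4f}")
```

With 10 tosses you need 8 or more heads to reject at the 10% level (the tail at 8 is about 0.055, while at 7 it is about 0.172, too large).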

• If you want to determine whether the coin is unbiased, you need to do a two-sided hypothesis test as follows.

• Your null hypothesis is p = 0.5 while your alternate hypothesis is p ≠ 0.5, where p is the probability that the coin shows a head. Say you want to perform your hypothesis test at the 10% level of significance. You proceed as follows:

• Let nH be the number of heads observed out of a total of n tosses of the coin.

• Let X ∼ B(n, 0.5).

• Compute nc1H and nc2H from P(X ≤ nc1H) + P(X ≥ nc2H) = 0.1 (nc1H and nc2H are symmetric about n/2, i.e., nc1H + nc2H = n). nc1H gives you the left critical value and nc2H gives you the right critical value.

• If you find nH ∈ (nc1H, nc2H), then you do not have sufficient evidence against the null hypothesis, and hence you accept the null hypothesis at the 10% level of significance. That is, you accept that the coin is fair at the 10% level of significance.
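The two-sided critical values for n = 10 can be found the same way, putting at most 5% in each tail so the total significance stays within 10%:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """Lower tail P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

n = 10
# Largest left critical value with at most 5% in the lower tail;
# by symmetry about n/2, the right critical value is n - nc1H.
nc1H = max(k for k in range(n + 1) if binom_cdf(k, n) <= 0.05)
nc2H = n - nc1H
print(f"reject if nH <= {nc1H} or nH >= {nc2H}")
```

For 10 tosses this gives nc1H = 1 and nc2H = 9: only 0, 1, 9, or 10 heads let you call the coin biased at this level, which shows how little power 10 tosses provide.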

Why can we not use linear regression for a dependent variable with binary outcomes?

• Linear regression can produce predictions that are not binary, and hence "nonsense"; it can also miss interactions that are not accounted for in the regression and non-linear relationships between a predictor and the outcome.

• Inference based on the linear regression coefficients will be incorrect. The problem with a binary dependent variable is that the homoscedasticity assumption (similar variation in the dependent variable for units with different values of the independent variable) is not satisfied. Another concern is that the normality assumption is violated: the residuals from a regression model on a binary outcome will not look very bell-shaped. That said, with a sufficiently large sample the distribution does not make much difference, since the standard errors are so small anyway.
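A small demonstration of the "nonsense predictions" point: fitting ordinary least squares to made-up binary outcomes produces fitted "probabilities" outside [0, 1]:

```python
import statistics

# Hypothetical binary outcomes (0/1) against a continuous predictor.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

xbar = statistics.mean(xs)
ybar = statistics.mean(ys)
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar

# The fitted line ignores the [0, 1] range a probability must respect.
print(f"prediction at x = 20: {a + b * 20:.2f}")  # well above 1
print(f"prediction at x =  0: {a + b * 0:.2f}")   # below 0
```

Logistic regression avoids this by modeling the log-odds, which maps any linear prediction back into (0, 1).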

How to estimate the sample size required for an experiment?

Sample sizes may be chosen in several different ways:

• Expedience: for example, include those items readily available or convenient to collect. A choice of small sample sizes, though sometimes necessary, can result in wide confidence intervals or risks of errors in statistical hypothesis testing.

• Using a target variance for an estimate to be derived from the sample eventually obtained.

• Using a target for the power of a statistical test to be applied once the sample is collected.

• How to determine the sample size?

– for a study for which the goal is to get a significant result from a t-test:

– set the significance level α and the desired power 1 − β

– specify HA (the effect size you want to detect) and estimate σ
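Under those inputs, a common normal-approximation sketch for a one-sided one-sample z-test is n = ((z_α + z_β)·σ / δ)², where δ is the effect size under HA; the z values below are from a standard normal table for α = 0.05 and power = 0.80 and are assumptions of this example:

```python
import math

def sample_size_for_power(delta, sigma, z_alpha=1.645, z_beta=0.84):
    """Approximate n for a one-sided one-sample z-test:
    n = ((z_alpha + z_beta) * sigma / delta)^2,
    where delta is the effect size to detect under HA."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Example: detect a shift of 5 units when sigma = 10,
# at alpha = 0.05 with 80% power.
n = sample_size_for_power(delta=5, sigma=10)
print(f"required sample size: n = {n}")
```

For an actual t-test the exact answer is slightly larger (the t critical value exceeds z), so this serves as a quick lower-bound estimate.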

