Unit 8: The Normal Distribution. Probability distributions The probability of an outcome in an...

Unit 8: The Normal Distribution



Probability distributions

The probability of an outcome in an interval is shown in a histogram as the area above that interval.


(Continuous) uniform distribution

• all numbers (decimals) between 0 and 1 (say) equally likely

• in Excel: = RAND()

• distribution: [figure: flat histogram of constant height over the interval from 0 to 1]
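For readers without Excel, here is a minimal Python sketch of the same idea. The function names `uniform_draws` and `interval_fraction` are made up for this illustration; only `random.Random` comes from the standard library. It checks that the share of draws landing in an interval is roughly the width of that interval, which is the area above it under the flat distribution.

```python
import random


def uniform_draws(n, seed=0):
    """Return n pseudo-random decimals in [0, 1), much like n cells of =RAND()."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.random() for _ in range(n)]


def interval_fraction(draws, a, b):
    """Fraction of the draws that land in the interval [a, b)."""
    return sum(a <= x < b for x in draws) / len(draws)


draws = uniform_draws(100_000)
# For a uniform distribution this should be close to the interval width, 0.3.
print(interval_fraction(draws, 0.2, 0.5))
```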


The Law of Averages says ...

• No matter what distribution you start with, the average of many “draws” from it will be distributed more and more tightly around the average of the original distribution.

• (The sum of the draws will have more variability with more draws, but relative to the possible range of sums, they will cluster more tightly around the original average times the number of draws.)
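Both bullets can be checked with a small simulation. This is only a sketch: the uniform distribution on 0 to 1, the number of trials, and the function name `spread_of_sums_and_averages` are choices made for this illustration, not part of the original slides. The standard deviation of the sums should grow with n, while the standard deviation of the averages should shrink.

```python
import random
import statistics


def spread_of_sums_and_averages(n, trials=2000, seed=1):
    """Repeat 'take n uniform draws' many times; return (SD of sums, SD of averages)."""
    rng = random.Random(seed)
    sums = [sum(rng.random() for _ in range(n)) for _ in range(trials)]
    averages = [s / n for s in sums]
    return statistics.pstdev(sums), statistics.pstdev(averages)


for n in (1, 10, 100):
    sd_sum, sd_avg = spread_of_sums_and_averages(n)
    # Expect sd_sum to rise and sd_avg to fall as n increases.
    print(n, round(sd_sum, 3), round(sd_avg, 3))
```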


Ex 1: P(0) = .9, P(1) = .1

n = 1, 3, 10, 25, 400

After n draws, the possible sums are 0 to n, but the most probable sums group close to .1n.
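The histograms for this example can be recomputed exactly, since the sum of n draws counts how many 1's came up. The sketch below assumes the draws are independent and uses the standard binomial formula; the helper names `sum_distribution` and `most_probable_sum` are invented for this illustration.

```python
import math


def sum_distribution(n, p=0.1):
    """Probabilities of each possible sum 0..n after n independent draws with P(1) = p."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def most_probable_sum(n, p=0.1):
    """The sum with the highest probability."""
    probs = sum_distribution(n, p)
    return max(range(n + 1), key=lambda k: probs[k])


for n in (1, 3, 10, 25, 400):
    # Compare the most probable sum with .1n.
    print(n, most_probable_sum(n), 0.1 * n)

# Chance that the sum of 400 draws lands between 20 and 60, out of a possible 0 to 400.
print(sum(sum_distribution(400)[20:61]))
```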


Same initial distribution as last slide, but with horizontal scale adjusted so that the possible range of outcomes, 0 to n, is always the same width.

The next slide is similar, with two other initial distributions, both from the text.


Central Limit Theorem

• The last histogram in each of the lists above is showing you something else, which we can’t prove but we can see happening:

• If we take n draws from any distribution and find their sum or average, the probability distribution of the sums or averages will get closer and closer to a normal curve as n increases.
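One way to "see it happening" without histograms is to check a known property of the normal curve: about 68% of its area lies within one standard deviation of the mean. The sketch below is an illustration only. It assumes a deliberately lopsided starting distribution (the exponential distribution with mean 1 and SD 1, via the standard library's `expovariate`), and the function name `share_within_one_sd` is hypothetical.

```python
import math
import random


def share_within_one_sd(n, trials=10_000, seed=2):
    """Share of averages of n exponential draws that fall within one SD of the mean.

    The starting distribution has mean 1 and SD 1, so the average of n draws
    has mean 1 and SD 1 / sqrt(n).
    """
    rng = random.Random(seed)
    sd_of_average = 1 / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        average = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(average - 1.0) < sd_of_average:
            hits += 1
    return hits / trials


for n in (1, 5, 50):
    # For a normal curve the share would be about 0.68.
    print(n, share_within_one_sd(n))
```

With a single draw the share is well above 0.68 because the starting distribution is far from normal; it should move toward 0.68 as n grows.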


That’s why, ...

• if a variable (height, IQ, measurement or experimental error) is the sum of many small increments (genes controlling bone growth, components of intelligence, many small adjustments), it is probably normally distributed, ...

• while if the variable (salary, bulb life) comes from a few factors (one employer, weakest part of the filament), then it can have some other distribution.


Ex: Law of Averages + Central Limit Theorem

Roll a die n times, with n = 6, 12, 24, 48, 96.

What are the probabilities of getting exactly k 4’s?
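These probabilities can be computed directly. The sketch below assumes a fair die and independent rolls, so the number of 4's follows the binomial formula with p = 1/6. For comparison it also evaluates a normal curve with the same mean and standard deviation. The function names are made up for this illustration.

```python
import math


def prob_exactly_k_fours(n, k):
    """Probability of exactly k 4's in n rolls of a fair die (binomial, p = 1/6)."""
    p = 1 / 6
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_curve(n, k):
    """Height at k of the normal curve with mean n/6 and SD sqrt(n * 1/6 * 5/6)."""
    mean = n / 6
    sd = math.sqrt(n * (1 / 6) * (5 / 6))
    return math.exp(-((k - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))


def largest_gap(n):
    """Largest difference, over all k, between the exact probability and the normal curve."""
    return max(abs(prob_exactly_k_fours(n, k) - normal_curve(n, k)) for k in range(n + 1))


for n in (6, 12, 24, 48, 96):
    most_likely = max(range(n + 1), key=lambda k: prob_exactly_k_fours(n, k))
    # Columns: n, most likely number of 4's, largest gap to the normal curve.
    print(n, most_likely, round(largest_gap(n), 4))
```

The most likely count stays near n/6 (Law of Averages), and the gap to the normal curve should shrink as n grows (Central Limit Theorem).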


Chapter 19: Survey methods

• Very interesting, ...

• but not mathematical, so I can only emphasize a few things and add a little


Notoriously bad polls

• Literary Digest: FDR vs. Landon, 1936
  – mail poll based on telephone listings, country club memberships, car registrations
  – Gallup agency, then new, even predicted how far wrong LD would be

• Chicago Tribune: Truman vs. Dewey, 1948
  – Gallup, Roper and Crossley all wrong
  – all used “quota sampling”


Sample of “quota sampling”: St. Louis, 1948

• Interviewer had to get 13 subjects
  – 6 from suburbs, 7 from central city
  – 7 men, 6 women. Of men,
    • 3 under 40, 4 over 40
    • 1 black, 6 white. Of white men,
      – 1 pays > $44 rent
      – 3 pay between $18 and $44 rent
      – 2 pay < $18

• But within rules, they could choose anyone
  – and they chose Republicans (in ’36, ’40, ’44 and ’48)


Sample surveys: Beware of:

• volunteer sample
  – or any nonrandom sample, because it invites bias

• nonresponse bias: subjects don’t answer
  – especially a large part of chosen sample

• response bias: subjects answer wrongly
  – lying (to please interviewer?)
  – confused (question is leading or unclear)
  – affected by previous questions


On edstat-l usenet discussion group:

From: IN%"[email protected]" 20-JAN-1995 08:59:13.14

Subj: RE: Internet-Based Survey Research

If you mean "surveys" of the kind where someone puts an article in several newsgroups asking people to fill out an electronic questionnaire and return it, the answer has always seemed to me to be simple: Such surveys are ALWAYS a waste of time and resources since they can only yield non-random, indeed self-selected samples from an ill-defined population. The procedure is on exactly the same footing as TV telephone polls, newspaper advertisement polls and the like. The procedure does not stand up to the most basic scientific criteria of validity and thus the exercise is futile for any formal academic purpose (except, perhaps, journalism). They should be outlawed and attract very serious penalties on the grounds that they are a very blatant abuse of the internet. No offence intended...

Bill Venables, Department of Statistics, The University of Adelaide, South AUSTRALIA. 5005.


Sample survey: Randomize!

• Simple random sample!
  – every sample of given size has same chance of being chosen
  – theory is based on randomness

• Cluster random sample?
  – pick some “clusters” out of population and poll everyone in them
  – small samples (and there are fewer clusters than individuals) means less randomness

• Stratified random sample?
  – divide entire population into “strata” and take random sample of each stratum
  – if strata are not of equal size, small ones may be over-represented in sample
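The three designs can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: a made-up population of 1000 people split into a "large" stratum of 900 and a "small" stratum of 100, clusters of 50 consecutive people, and the function names themselves. Only `random.Random.sample` is a real standard-library call. The stratified version takes the same number from each stratum, which is one way the over-representation warned about above can arise.

```python
import random


def simple_random_sample(population, size, rng):
    """Every group of `size` people is equally likely to be the sample."""
    return rng.sample(population, size)


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and poll everyone in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [person for cluster in chosen for person in cluster]


def stratified_sample(strata, size_per_stratum, rng):
    """Take a random sample of the same size from each stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, size_per_stratum))
    return sample


def share_small(sample):
    """Fraction of a sample that comes from the 'small' stratum."""
    return sum(1 for label, _ in sample if label == "small") / len(sample)


rng = random.Random(0)
# Hypothetical population: each person is a (stratum label, id) pair.
strata = {
    "large": [("large", i) for i in range(900)],
    "small": [("small", i) for i in range(100)],
}
population = strata["large"] + strata["small"]
clusters = [population[i:i + 50] for i in range(0, len(population), 50)]

srs = simple_random_sample(population, 100, rng)
clus = cluster_sample(clusters, 2, rng)
strat = stratified_sample(strata, 50, rng)

print("population:", share_small(population))
print("simple random sample:", share_small(srs))
print("cluster sample:", share_small(clus))
print("stratified sample:", share_small(strat))
```

In this setup the small stratum is 10% of the population but half of the stratified sample, so results would need to be reweighted before being read as population figures.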