Stat 13, Intro. to Statistical Methods for the Life and Health Sciences.
1. Midterms. 2. Polls. 3. Two quantitative variables, correlation. 4. Linear regression.
No class Thu Nov 24, Thanksgiving. Read ch. 10. Hw4 is 10.1.8, 10.3.14, 10.3.21, 10.4.11 and is due Tue Nov 29.
The final, Fri Dec 9, 8am-11am, right here, will be on ch. 1-10. Bring a PENCIL and CALCULATOR and any books or notes you want. No computers. http://www.stat.ucla.edu/~frederic/13/F16.
1. Midterms. The scores are listed in midtermscores.txt. They are out of 20.
The mean was 16.628 = 83.14%. Median = 85%. SD = 15%. I do reward improvement on the final.
2. Polls. a. Why were they so far off?
Before we get into this, when it comes to the election, please make sure to respect freedom of speech and also to respect everyone's right to their beliefs even if they are far from your own.
Our dept's official statement: In the aftermath of the presidential election, in which passions ran high surrounding issues of tolerance for diversity, it is important to remember that harassment and discrimination based on such things as:
• race, ethnicity, ancestry, color
• sex, gender, gender identity, gender expression, sexual orientation
• national origin, citizenship status
• religion
are not acceptable at UCLA, and may have serious consequences. Information for how to obtain redress or counseling if you are subjected to such harassment or discrimination can be found at: https://equity.ucla.edu/report-an-incident/
In total this makes 17,104 likely voters in those Wisconsin polls put together. They averaged 40.3% for Trump and 46.8% for Clinton. The difference is 6.5%. Combined, the margin of error for a 95% confidence interval around Trump's percentage would be 0.735%. The standard error is 0.375% on the estimate of Trump's percentage of 40.3%, and he got 47.9%. So they were off by 7.6%, which is more than 20 standard errors. The probability is about 1 in 10^90 that the polls would be off by that much or more just by chance, if the answers to the polls were just a random sample of how people were actually going to vote.
Technically, there are undecided voters in the polls also. Taking the difference in percentages between Trump and Hillary Clinton into account, rather than just the percentage for Trump, the results were off by about 10 SEs, not 20, and this makes the probability of something this extreme or more extreme still astronomical, about 1.5 × 10^-23. The chance of a monkey typing 15 letters completely at random and happening to choose "hillaryrclinton" in order would be 6 × 10^-22, so it's about 40 times more likely. What do we conclude?
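The arithmetic above can be checked with a short sketch. It assumes the combined polls behave like one simple random sample of 17,104 voters and uses the normal approximation for the tail probability; the variable and function names are illustrative, not from the course materials.

```python
import math

n = 17104        # combined likely voters in the Wisconsin polls
p_hat = 0.403    # polled share for Trump
actual = 0.479   # share Trump actually received

# Standard error of a sample proportion under simple random sampling
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin_95 = 1.96 * se               # 95% margin of error
z = (actual - p_hat) / se           # how many SEs the polls were off by

def two_sided_normal_p(z_score):
    """Two-sided tail probability of a standard normal at |z_score|."""
    return math.erfc(abs(z_score) / math.sqrt(2))

p_at_10_se = two_sided_normal_p(10)   # the "about 10 SEs" case
monkey = 26 ** -15                    # 15 specific letters typed at random

print(f"SE = {se:.5f}, margin of error = {margin_95:.5f}, z = {z:.1f}")
print(f"two-sided p at 10 SEs = {p_at_10_se:.2e}")
print(f"monkey probability = {monkey:.2e}, ratio = {monkey / p_at_10_se:.0f}")
```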
What do we conclude? Either
* lots of people changed their minds,
* the polls were biased,
* the official results were incorrect, or
* the polls weren't independent of each other.
Those are really the only tenable explanations.
b. If Clinton had outperformed the polls, what would have been an explanation?
* fundraising advantage.
* more experienced campaign staff.
* more organized ground game.
* Latino populations had surged.
* A higher percentage of Latinos voted.
* Early voting had enabled more poor people and minorities to vote. Thus people who might be considered unlikely voters due to voting trends in previous elections might actually be voting, and the majority of these would be expected to be Democrats.
* Mostly good news for her right before the election. The FBI said they went through the emails and cleared her of charges. Lots of big stars were performing and getting people out to the polls and rallies. Obama, Michelle, Bill Clinton, and many others were campaigning hard for her in the final days.
* Meanwhile, many top Republicans were not even supporting Trump. His ground game and field offices were disorganized or nonexistent. Even some right wing radio hosts were criticizing Trump. He had fired his campaign manager midway through the campaign and had an inexperienced hodgepodge of supporters and staff.
c. Other possible pro-Clinton explanations.
* Hillary prepared very early for her run for office. Trump was a latecomer.
* The Democrats mostly coalesced around Hillary, allowing her to pile up numerous endorsements very early. Only a couple of people even ran against her, and none were really promising candidates. Even Sanders was not seen as a very viable candidate when the race began.
* On the Republican side, Trump had to fend off 16 other candidates while Hillary was raising money and holding on to it.
* After Trump got the nomination, barely, he had little convention bounce, and Hillary had a huge one and got a big lead in the polls.
* Obama's popularity has been high.
* The economy, while not too strong, is much much stronger than when Obama began in office, and he was able to campaign strongly for Hillary.
* She also had an incredibly charismatic and great speaker in her husband, and Michelle made great speeches as well.
* The father of a Muslim fallen soldier spoke eloquently against Trump.
* Melania got caught blatantly plagiarizing Michelle's speech from 2008.
* Trump got sued for fraud for Trump University, and criticized the judge as unfit because he was of Mexican descent.
* Trump made fun of a reporter in a wheelchair.
* Clinton won all 3 debates according to most polls and surveys.
c. Other possible pro-Clinton explanations (continued).
* Trump continued to refuse to release his tax returns despite prodding.
* The tape came out where Trump bragged about grabbing women. Clinton surged way ahead in the polls.
* Many important Republicans stopped supporting him.
* Down ballot Republicans tried to distance themselves from him.
* Clinton was raising more funds than Trump, and Trump was not spending much of his own money on his campaign.
d. Bayesian statistics, Nate Silver, and 538.com.
In his most recent article on 538.com, Nate Silver states very clearly that he expected Hillary to win, and his models favored her, but he also emphasizes that the polls are typically off by the amounts they were off this time and points out that his models gave Trump about a 30% chance of winning the day before the election.
He also admits that the polls typically miss by amounts greater than what we would expect due to sampling error alone.
As Nate Silver notes, the polls indicated a large proportion of undecided voters, which meant there was more uncertainty and therefore a larger chance for a major polling error.
Nate Silver and 538.com use a Bayesian model to forecast the election. How does Bayesian modeling work? The model starts with prior distributions which are supposed to reflect the researcher's beliefs about the probability of something before collecting data. For instance, you might have a prior distribution that the percentage µ of votes the Democratic candidate would get is spread uniformly between 40% and 60%. Then you collect data, from polls, economic data, etc., and gradually update your distribution, forming a posterior distribution for µ. Nate Silver's main idea was to weight the different polls by how accurate they were historically.
Bayesian modelers are often correctly criticized for not adequately validating their models. Nate Silver might be a good modeler, but it is hard to tell.
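As a rough illustration of the prior-to-posterior update described above, here is a minimal grid sketch. It is not Nate Silver's model. The uniform prior on 40% to 60% comes from the example above, while the poll (520 of 1000 respondents favoring the Democrat) and the binomial likelihood are assumptions made up for illustration.

```python
import math

# Grid of candidate values for mu, the Democratic vote share: 0.400, ..., 0.600
grid = [0.40 + 0.001 * i for i in range(201)]

# Uniform prior over the grid (the 40%-60% prior from the example)
prior = [1.0 / len(grid)] * len(grid)

# Hypothetical poll, invented for illustration: 520 of 1000 favor the Democrat
n, k = 1000, 520

def log_binomial_likelihood(mu, n, k):
    # The binomial coefficient does not depend on mu, so it is omitted.
    return k * math.log(mu) + (n - k) * math.log(1 - mu)

# Posterior is proportional to prior times likelihood; work on the log scale
log_post = [math.log(pr) + log_binomial_likelihood(mu, n, k)
            for mu, pr in zip(grid, prior)]
biggest = max(log_post)
weights = [math.exp(lp - biggest) for lp in log_post]
total = sum(weights)
posterior = [w / total for w in weights]

posterior_mean = sum(mu * w for mu, w in zip(grid, posterior))
prob_majority = sum(w for mu, w in zip(grid, posterior) if mu > 0.5)

print(f"posterior mean of mu: {posterior_mean:.3f}")
print(f"posterior probability that mu > 50%: {prob_majority:.3f}")
```

Changing the prior list (for example, putting more weight near 50%) shows how two modelers with different priors can reach different posteriors from the same poll.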
"Apart from calibration, are there other good methods to evaluate a probabilistic forecast? Not really, although sometimes it can be worthwhile to look for signs of whether an upset winner benefited from good luck or quirky, one-off circumstances." Nate Silver, May 18, 2016. http://fivethirtyeight.com/features/how-i-acted-like-a-pundit-and-screwed-up-on-donald-trump
Calibration. If we take all candidates where the model outputs a probability of 35%, do the candidates win 35% of the time?
I could just say every Democrat and Republican has a 50% chance every election. This model might be well calibrated, but it is not very informative.
Validation. Is the model accurate? Does it outperform competing models? e.g. Log-likelihood: $\sum_{i\ \mathrm{won}} \log(p_i) + \sum_{i\ \mathrm{lost}} \log(1 - p_i)$.
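The log-likelihood score can be sketched as follows. The outcomes and the forecasts of the second model are hypothetical numbers chosen only to show how the comparison works; a higher (less negative) log-likelihood indicates the better-performing model on these outcomes.

```python
import math

def log_likelihood(probs, outcomes):
    """Sum of log(p_i) over winners plus log(1 - p_i) over losers."""
    return sum(math.log(p) if won else math.log(1 - p)
               for p, won in zip(probs, outcomes))

# Hypothetical outcomes for six candidates (True = won), invented for illustration
outcomes = [True, False, True, True, False, False]

coin_flip_model = [0.5] * 6                      # everyone gets a 50% chance
sharper_model = [0.8, 0.3, 0.7, 0.6, 0.2, 0.4]   # hypothetical forecasts

ll_coin = log_likelihood(coin_flip_model, outcomes)
ll_sharp = log_likelihood(sharper_model, outcomes)

print(f"50% model:     {ll_coin:.3f}")
print(f"sharper model: {ll_sharp:.3f}")
```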
There are potential limitations of Bayesian statistics.
* Why might one want to know if Clinton's probability of winning is 67% or 72%? Perhaps if one is only very superficially interested?
* What if my prior is different from Nate Silver's?
* If you are deeply interested, why not look at the poll results themselves, directly?
* Bayesian models might be more difficult to validate, in some cases.
e. Are there other predictors of success in Presidential elections?
Here are the presidential elections that have occurred in my lifetime.
1972. Nixon vs. McGovern.
1976. Carter vs. Ford.
1980. Reagan vs. Carter.
1984. Reagan vs. Mondale.
1988. Bush vs. Dukakis.
1992. Clinton vs. Bush.
1996. Clinton vs. Dole.
2000. Bush vs. Gore.
2004. Bush vs. Kerry.
2008. Obama vs. McCain.
2012. Obama vs. Romney.
2016. Trump vs. Clinton.
Many would say the candidate with more humor, style, and charisma won 11/12 of these elections. The two-sided p-value = 0.63%. Experience? Hard to say, but perhaps the more experienced candidate won 3/12. The two-sided p-value = 14.6%.
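The two p-values quoted above are consistent with an exact binomial calculation, sketched here under the assumption that the null hypothesis gives each candidate a 50% chance of winning each election and that the two-sided p-value is twice the smaller tail. The function name is illustrative.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Two-sided p-value for k successes in n fair coin flips:
    double the smaller tail probability, capped at 1."""
    upper = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n   # P(X >= k)
    lower = sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n   # P(X <= k)
    return min(1.0, 2 * min(upper, lower))

p_charisma = two_sided_binomial_p(11, 12)    # more charismatic candidate won 11/12
p_experience = two_sided_binomial_p(3, 12)   # more experienced candidate won 3/12

print(f"charisma:   {100 * p_charisma:.2f}%")
print(f"experience: {100 * p_experience:.1f}%")
```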
3. Two Quantitative Variables. Chapter 10
Two Quantitative Variables: Scatterplots and Correlation. Section 10.1
Scatterplots and Correlation
Time 30 41 41 43 47 48 51 54 54 56 56 56 57 58
Score 100 84 94 90 88 99 85 84 94 100 65 64 65 89
Time 58 60 61 61 62 63 64 66 66 69 72 78 79
Score 83 85 86 92 74 73 75 53 91 85 62 68 72
Suppose we collected data on the relationship between the time it takes a student to take a test and the resulting score.
Scatterplot
Put explanatory variable on the horizontal axis.
Put response variable on the vertical axis.
Describing Scatterplots
• When we describe data in a scatterplot, we describe the
  • Direction (positive or negative)
  • Form (linear or not)
  • Strength (strong-moderate-weak; we will let correlation help us decide)
  • Unusual Observations
• How would you describe the time and test scatterplot?
Correlation
• Correlation measures the strength and direction of a linear association between two quantitative variables.
• Correlation is a number between -1 and 1.
• With positive correlation one variable increases, on average, as the other increases.
• With negative correlation one variable decreases, on average, as the other increases.
• The closer it is to either -1 or 1, the closer the points fit to a line.
• The correlation for the test data is -0.56.
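The correlation for the test data can be reproduced from the Time and Score table with a few lines of code. This is a minimal sketch using the usual sample correlation formula; the function and variable names are illustrative.

```python
import math

time_taken = [30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58,
              58, 60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79]
score = [100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89,
         83, 85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72]

def pearson_r(x, y):
    """Sample correlation: sum of cross-deviations over the root of the
    product of the sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(time_taken, score)
print(f"r = {r:.2f}")

# Swapping the roles of the two variables gives the same value,
# since correlation does not distinguish explanatory from response.
print(f"r with variables swapped = {pearson_r(score, time_taken):.2f}")
```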
Correlation Guidelines
Correlation Value | Strength of Association | What this means
0.7 to 1.0 | Strong | The points will appear to be nearly a straight line.
0.3 to 0.7 | Moderate | When looking at the graph the increasing/decreasing pattern will be clear, but there is considerable scatter.
0.1 to 0.3 | Weak | With some effort you will be able to see a slightly increasing/decreasing pattern.
0 to 0.1 | None | No discernible increasing/decreasing pattern.
Same strength results with negative correlations.
Back to the test data
Actually the last three people to finish the test had scores of 93, 93, and 97.
What does this do to the correlation?
Influential Observations
• The correlation changed from -0.56 (a fairly moderate negative correlation) to -0.12 (a weak negative correlation).
• Points that are far to the left or right and not in the overall direction of the scatterplot can greatly change the correlation (influential observations).
Correlation
• Correlation measures the strength and direction of a linear association between two quantitative variables.
• -1 ≤ r ≤ 1
• Correlation makes no distinction between explanatory and response variables.
• Correlation has no units.
• Correlation is not resistant to outliers. It is sensitive.
Learning Objectives for Section 10.1
• Summarize the characteristics of a scatterplot by describing its direction, form, strength and whether there are any unusual observations.
• Recognize that the correlation coefficient is appropriate only for summarizing the strength and direction of a scatterplot that has linear form.
• Recognize that a scatterplot is the appropriate graph for displaying the relationship between two quantitative variables and create a scatterplot from raw data.
• Recognize that a correlation coefficient of 0 means there is no linear association between the two variables and that a correlation coefficient of -1 or 1 means that the scatterplot is exactly a straight line.
• Understand that the correlation coefficient is influenced by extreme observations.
Inference for the Correlation Coefficient: Simulation-Based Approach. Section 10.2
We will look at a small sample example to see if body temperature is associated with heart rate.
Temperature and Heart Rate Hypotheses
• Null: There is no association between heart rate and body temperature. (ρ = 0)
• Alternative: There is a positive linear association between heart rate and body temperature. (ρ > 0)
ρ = rho
Inference for Correlation with Simulation (Section 10.2)
1. Compute the observed statistic. (Correlation)
2. Scramble the response variable, compute the simulated statistic, and repeat this process many times.
3. Reject the null hypothesis if the observed statistic is in the tail of the null distribution.
Temperature and Heart Rate
Tmp 98.3 98.2 98.7 98.5 97.0 98.8 98.5 98.7 99.3 97.8
HR 72 69 72 71 80 81 68 82 68 65
Tmp 98.2 99.9 98.6 98.6 97.8 98.4 98.7 97.4 96.7 98.0
HR 71 79 86 82 58 84 73 57 62 89
Collect the Data
Temperature and Heart Rate
r = 0.378
Explore the Data
Temperature and Heart Rate
• If there were no association between heart rate and body temperature, what is the probability we would get a correlation as high as 0.378 just by chance?
• If there is no association, we can break apart the temperatures and their corresponding heart rates. We will do this by shuffling one of the variables.
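The shuffling idea can be sketched in code using the temperature and heart rate data from this section. The number of shuffles and the random seed are arbitrary choices for illustration, so the simulated p-value will vary a little from run to run and from the values reported in the slides.

```python
import math
import random

temp = [98.3, 98.2, 98.7, 98.5, 97.0, 98.8, 98.5, 98.7, 99.3, 97.8,
        98.2, 99.9, 98.6, 98.6, 97.8, 98.4, 98.7, 97.4, 96.7, 98.0]
heart_rate = [72, 69, 72, 71, 80, 81, 68, 82, 68, 65,
              71, 79, 86, 82, 58, 84, 73, 57, 62, 89]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Step 1: observed statistic
observed = pearson_r(temp, heart_rate)

# Step 2: shuffle the response many times, recomputing the correlation
random.seed(13)        # arbitrary seed so the run is repeatable
n_shuffles = 5000      # arbitrary number of shuffles
shuffled = heart_rate[:]
count_as_extreme = 0
for _ in range(n_shuffles):
    random.shuffle(shuffled)
    if pearson_r(temp, shuffled) >= observed:
        count_as_extreme += 1

# Step 3: one-sided p-value, matching the alternative rho > 0
p_value = count_as_extreme / n_shuffles
print(f"observed r = {observed:.3f}, simulated p-value = {p_value:.3f}")
```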
Shuffling Cards
• Let's remind ourselves what we did with cards to find our simulated statistics.
• With two proportions, we wrote the response on the cards, shuffled the cards and placed them into two piles corresponding to the two categories of the explanatory variable.
• With two means we did the same thing except this time the responses were numbers instead of words.
[Figure: Dolphin Therapy vs. Control cards. Each of the 30 cards is labeled Improver or Non-improver (13 Improver, 17 Non-improver). The observed groups have 66.7% and 20.0% improvers; after one shuffle the two piles have 40.0% and 46.7% improvers. Difference in simulated proportions: 0.400 - 0.467 = -0.067.]
[Figure: Music vs. No music cards, each labeled with a numeric response. The observed group means are 3.90 and 19.82; after one shuffle the two piles have means 6.38 and 16.12. Difference in simulated means: 6.38 - 16.12 = -9.74.]
Shuffling Cards
• Now how will this shuffling be different when both the response and the explanatory variable are quantitative?
• We can't put things in two piles anymore.
• We still shuffle values of the response variable, but this time place them next to the values of the explanatory variable.
[Figure: Body Temperature and Heart Rate. The 20 body temperatures are laid out in order with the shuffled heart rates placed beneath them. Observed correlation r = 0.378; this shuffle gives a simulated correlation of r = 0.073, which is added to the dotplot of Simulated Correlations.]
More Simulations
[Figure: dotplot of Simulated Correlations. Simulated values shown include 0.054, -0.253, -0.345, 0.062, 0.259, 0.339, 0.447, -0.008, -0.229, -0.029, 0.059, -0.006, -0.034, -0.327, 0.100, 0.067, 0.212, 0.097, 0.034, 0.167, 0.329, 0.020, -0.042, 0.232, 0.200, 0.314.]
Only one simulated statistic out of 30 was as large or larger than our observed correlation of 0.378, hence our p-value for this null distribution is 1/30 ≈ 0.03.
Temperature and Heart Rate
• We can look at the output of 1000 shuffles with a distribution of 1000 simulated correlations.
Temperature and Heart Rate
• Notice our null distribution is centered at 0 and somewhat symmetric.
• We found that 530/10000 times we had a simulated correlation greater than or equal to 0.378.
Temperature and Heart Rate
• With a p-value of 0.053 = 5.3%, we almost but do not quite have statistical significance. This is moderate evidence of a positive linear association between body temperature and heart rate. Perhaps a larger sample would give a smaller p-value.
4. Least Squares Regression. Section 10.3
Introduction
• If we decide an association is linear, it is helpful to develop a mathematical model of that association.
• Helps make predictions about the response variable.
• The least-squares regression line is the most common way of doing this.
Introduction
• Unless the points are perfectly linearly aligned, there will not be a single line that goes through every point.
• We want a line that gets as close as possible to all the points.
Introduction
• We want a line that minimizes the vertical distances between the line and the points.
• These distances are called residuals.
• The line we will find actually minimizes the sum of the squares of the residuals.
• This is called a least-squares regression line.
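To make the least-squares idea concrete, the sketch below fits a line to the Time and Score data from Section 10.1 (used here only as a convenient example) with the standard closed-form formulas, then checks that nudging the intercept or slope increases the sum of squared residuals. Function names are illustrative.

```python
time_taken = [30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58,
              58, 60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79]
score = [100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89,
         83, 85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def sum_squared_residuals(intercept, slope, x, y):
    """Sum of (actual - predicted)^2 over all points."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares(time_taken, score)
best = sum_squared_residuals(a, b, time_taken, score)

print(f"intercept a = {a:.2f}, slope b = {b:.3f}, SSE = {best:.1f}")
# Any other line does worse, for example:
print(sum_squared_residuals(a + 0.5, b, time_taken, score) > best)
print(sum_squared_residuals(a, b + 0.01, time_taken, score) > best)
```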
Are Dinner Plates Getting Larger? Example 10.3
Growing Plates?
• There are many recent articles and TV reports about the obesity problem.
• One reason some have given is that the size of dinner plates is increasing.
• Are these black circles the same size, or is one larger than the other?
Growing Plates?
• They appear to be the same size for many, but the one on the right is about 20% larger than the left.
• This suggests that people will put more food on larger dinner plates without knowing it.
• There is a name for this phenomenon: Delboeuf illusion
Growing Plates?
• Researchers gathered data to investigate the claim that dinner plates are growing.
• American dinner plates sold on eBay on March 30, 2010 (Van Ittersum and Wansink, 2011)
• Year manufactured and diameter are given.
Growing Plates?
• Both year (explanatory variable) and diameter in inches (response variable) are quantitative.
• Each dot represents one plate in this scatterplot.
• Describe the association here.
Growing Plates?
• The association appears to be roughly linear.
• The least squares regression line is added.
• How can we describe this line?
Regression Line
The regression equation is $\hat{y} = a + bx$:
• a is the y-intercept
• b is the slope
• x is a value of the explanatory variable
• $\hat{y}$ is the predicted value for the response variable
• For a specific value of x, the corresponding distance $y - \hat{y}$ (or actual - predicted) is a residual
Regression Line
• The least squares line for the dinner plate data is $\hat{y} = -14.8 + 0.0128x$
• Or $\widehat{\text{diameter}} = -14.8 + 0.0128(\text{year})$
• This allows us to predict plate diameter for a particular year.
Slope
$\hat{y} = -14.8 + 0.0128x$
• What is the predicted diameter for a plate manufactured in 2000?
  • -14.8 + 0.0128(2000) = 10.8 in.
• What is the predicted diameter for a plate manufactured in 2001?
  • -14.8 + 0.0128(2001) = 10.8128 in.
• How does this compare to our prediction for the year 2000?
  • 0.0128 larger
• Slope b = 0.0128 means that diameters are predicted to increase by 0.0128 inches per year on average
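The two predictions above, and the fact that they differ by exactly the slope, can be checked with a small function. The intercept and slope are the values from the dinner plate equation; the function name is illustrative.

```python
def predicted_diameter(year, intercept=-14.8, slope=0.0128):
    """Predicted plate diameter in inches from the fitted line."""
    return intercept + slope * year

d2000 = predicted_diameter(2000)
d2001 = predicted_diameter(2001)

print(f"2000: {d2000:.4f} in.")
print(f"2001: {d2001:.4f} in.")
print(f"change for a one-year increase: {d2001 - d2000:.4f} in.")
```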
Slope
• Slope is the predicted change in the response variable for a one-unit change in the explanatory variable.
• Both the slope and the correlation coefficient for this study were positive.
  • The slope is 0.0128
  • The correlation is 0.604
• The slope and correlation coefficient will always have the same sign.
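One way to see why the signs must agree is the standard identity relating the least-squares slope to the correlation. The symbols $s_x$ and $s_y$ (the sample standard deviations of the explanatory and response variables) are introduced here for illustration and are not used elsewhere in these slides.

```latex
b = r \cdot \frac{s_y}{s_x}, \qquad s_x > 0, \; s_y > 0
```

Since $s_y / s_x$ is always positive, b is positive exactly when r is positive, negative exactly when r is negative, and zero exactly when r is zero.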
y-intercept
• The y-intercept is where the regression line crosses the y-axis, or the predicted response when the explanatory variable equals 0.
• We had a y-intercept of -14.8 in the dinner plate equation. What does this tell us about our dinner plate example?
  • The predicted diameter of dinner plates in year 0 is -14.8 inches.
• How can it be negative?
  • The equation works well within the range of values given for the explanatory variable, but fails outside that range.
• Our equation should only be used to predict the size of dinner plates from about 1950 to 2010.
Extrapolation
• Predicting values for the response variable for values of the explanatory variable that are outside of the range of the original data is called extrapolation.
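One simple way to guard against accidental extrapolation is to have the prediction function warn when the input falls outside the range of the data. This is only a sketch: the range 1950 to 2010 is the approximate range quoted above, and the function name and warning text are made up for illustration.

```python
import warnings

DATA_RANGE = (1950, 2010)   # approximate range of years in the plate data

def predict_diameter(year, intercept=-14.8, slope=0.0128, data_range=DATA_RANGE):
    """Predicted diameter in inches; warns if the year is outside the data range."""
    low, high = data_range
    if not (low <= year <= high):
        warnings.warn(f"year {year} is outside {low}-{high}: this is extrapolation")
    return intercept + slope * year

print(predict_diameter(1980))   # inside the range, no warning
print(predict_diameter(0))      # warns, and returns the nonsensical -14.8
```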
Coefficient of Determination
• While the intercept and slope have meaning in the context of year and diameter, remember that the correlation does not. It is just 0.604.
• However, the square of the correlation (coefficient of determination or $r^2$) does have meaning.
• $r^2 = 0.604^2 = 0.365$ or 36.5%
• 36.5% of the variation in plate size (the response variable) can be explained by its linear association with the year (the explanatory variable).
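The "variation explained" reading of $r^2$ can be checked numerically. Because the plate data are not listed in these slides, the sketch below uses the Time and Score data from Section 10.1 as a stand-in, and compares the squared correlation with 1 - SSE/SST computed from the least-squares line. The names SSE and SST (sum of squared residuals and total sum of squares) are standard terms introduced here for illustration.

```python
import math

time_taken = [30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58,
              58, 60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79]
score = [100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89,
         83, 85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72]

def r_squared_two_ways(x, y):
    """Return (r squared, 1 - SSE/SST) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    sst = sum((b - my) ** 2 for b in y)            # total variation in y
    r = sxy / math.sqrt(sxx * sst)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return r ** 2, 1 - sse / sst

r2_from_r, r2_from_ss = r_squared_two_ways(time_taken, score)
print(f"r^2 = {r2_from_r:.3f}, 1 - SSE/SST = {r2_from_ss:.3f}")

# The plate example only needs the squared correlation:
plate_r2 = 0.604 ** 2
print(f"plate data: r^2 = {plate_r2:.3f}")
```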
Learning Objectives for Section 10.3
• Understand that one way a scatterplot can be summarized is by fitting the best-fit (least squares regression) line.
• Be able to interpret both the slope and intercept of a best-fit line in the context of the two variables on the scatterplot.
• Find the predicted value of the response variable for a given value of the explanatory variable.
• Understand the concept of residual and find and interpret the residual for an observational unit given the raw data and the equation of the best-fit (regression) line.
• Understand the relationship between residuals and strength of association and that the best-fit (regression) line minimizes the sum of the squared residuals.
• Find and interpret the coefficient of determination ($r^2$) as the squared correlation and as the percent of total variation in the response variable that is accounted for by the linear association with the explanatory variable.
• Understand that extrapolation is when a regression line is used to predict values outside of the range of observed values for the explanatory variable.
• Understand that slope = 0 means no association, slope < 0 means negative association, slope > 0 means positive association, and that the sign of the slope will be the same as the sign of the correlation coefficient.
• Understand that influential points can substantially change the equation of the best-fit line.