Multivariate All

  • 7/24/2019 Multivariate All

    Applied Multivariate Statistics

    Ralf B. Schäfer

    University of Koblenz-Landau, 2011/12

    Short introduction

    Junior Professor for Quantitative Landscape Ecology

    Current teaching: Statistics (Master), GIS (Bachelor/Master), Environmental Modelling (Bachelor/Master)

    Current research projects:

    Effects of toxicants in freshwater ecosystems

    Trait-based aquatic ecology (focused on anthropogenic stressors)

    Identification of species vulnerable to climate change

    Primarily field studies/experiments and data analyses

    Course assistants: Katharina Peters (PhD student) & Eduard Szöcs (Master student)

    http://www.uni-koblenz-landau.de/landau/fb7/umweltwissenschaften/landscape-ecology/

    Organisation

    Monday -12:00 (break 10:15-10:30) in PC Room 1

    Lecture recording: enables following from home (e.g. in case of illness) and following from outside of our university. See http://tinyurl.com/br35tgl for all materials including scripts.

    Proficiency certificate: Active participation in exercises, successful exam after course (6th Feb 2012)

    Course structure: lecture, demonstration, own exercises

    Contact time: 3 hours per week. Own study time: approx. 6 hours per week

    Relation with other statistical courses

    Univariate statistics and exercises: 7 sessions by M. Bundschuh & J. Zubrod

    Multivariate statistics and exercises: 7 sessions - this course

    Introduction to probability theory and statistics

    B.Sc. of Environmental Sciences / Undergraduate Diploma

    Graduate Diploma

    M.Sc. of Environmental Sciences and Ecotoxicology

    Course exam (6th February 2012): Combined univariate and multivariate for master students, multivariate for diploma students

    Entry exam for Graduate Diploma (limited places and to ascertain prior knowledge)

    Exams can be obtained from M. Bundschuh. Diploma students performed very well; several M.Sc. Ecotox students would have failed

    Course objectives

    1. Understanding the different types of statistical approaches
    2. Choosing the appropriate statistical method for the research question
    3. Moderate level of statistical modelling skills
    4. Basic skills in R including development of scripts

    Course schedule

    Course literature

    http://tinyurl.com/buy4x7r

    Complementary use of a textbook is essential!!! And let me know if you are missing books in the library...

    http://tinyurl.com/c6vgg4r

    Course Software: R

    Why R?

    "R has really become the second language for people coming out of grad school now, and there's an amazing amount of code being written for it" (Max Kuhn)

    R is free, combines everything you need for data analysis (e.g. database, statistics, graphics) and easily communicates with other programs (e.g. GRASS GIS, Netlogo, PostgreSQL).

    ...and is probably the dominant statistical software

    http://blog.kaggle.com/2011/11/27/kagglers-favorite-tools/ http://r4stats.com/popularity

    R basics that you should be familiar with

    R is a calculator: +, -, /, *

    The R universe is populated by objects such as: c(), factor(), matrix(), data.frame(), list()

    The objects can be indexed to extract information, e.g. object[ ], object[ , ], object$variable

    How to read and write external data

    Functions follow a specific pattern: function(argument1=..., argument2=...)

    Where do I find help or introductory tutorials?

    http://dist.stat.tamu.edu/pub/rvideos/ (teaching videos)
    http://search.r-project.org/nmz.html (search R and mailing lists)
    http://www.r-project.org/ (mailing lists, documentation)
    and check the literature list provided for this course!

    R and statistics for beginners mailing list at our university: http://mailman.rz.uni-landau.de/mailman/listinfo/r-for-beginners

    Still R-phobic? Try RStudio!

    Helpful features:

    - GUI for loading and saving data

    - Help, package overview etc. easily accessible

    - Code completion and help (Tab key)

    - and much more: http://rstudio.org/docs/

    Using your own notebook

    - feel free to use your own WLAN-enabled notebook!

    - install R (http://mirrors.softliste.de/cran/) or RStudio (recommended for beginners - http://www.rstudio.org/)

    - run the function install.packages("package to be installed") if packages are used that are not installed on your machine

    Block I: (Briefly) Revisiting univariate statistics

    Contents

    1. Exploratory data analysis
    2. Statistical modelling
    3. Comparing central tendencies
    4. The linear model
    5. Model diagnostics and fitting
    6. The generalised linear model

    Exploratory data analysis

    The first step in any data analysis is to look at the data!

    GIGO: Garbage in - Garbage out

    Steps in exploratory data analysis - Check for:

    1. Outliers (e.g. boxplot)
    2. Homogeneity of variance (e.g. conditional boxplot)
    3. Normality (e.g. QQ-plot)
    4. Double zeros (e.g. frequency plot)
    5. Collinearity (e.g. pairwise scatterplots)
    6. Relationship of explanatory and response variable (e.g. scatterplots)
    7. Spatial or temporal autocorrelation (e.g. variograms)

    A must read is: Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010): A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3-14. http://www.biologia.ufrj.br/labs/labpoly/Zuur2010.pdf

    Exploratory data analysis

    Common plots for looking at the data

    Outliers? Asymmetry of distribution? Normality?

    Linearity? Collinearity?

    Statistical modelling

    1. Aim: Fit a model to the data that can be used for inference (point and interval estimation, classification, hypothesis testing etc.)

    Example: The arithmetic mean $\bar{x}$ is an estimate of the true population mean $\mu$ and $s^2$ is an estimate of the true variance $\sigma^2$

    2. All models rely on assumptions, e.g. normal distribution, independence of observations

    3. Most models incorporate a deterministic (fixed effect) and a random component (random effect)

    Example: $y = \alpha + \beta x + \varepsilon$

    4. In most situations several models will fit the data so that goodness of fit measures are required, e.g. AIC, $r^2$, RMSE

    Comparison of central tendency: t-test

    How likely is it that our sample means are drawn from populations with the same $\mu$?

    $H_0: \mu_1 = \mu_2$    $H_1: \mu_1 \neq \mu_2$

    t-test assumptions:

    Normal distribution (graphical inspection or tests)

    Independent samples

    Homogeneity of variances (graphical inspection or tests)

    F-test can be employed for checking the assumption of equal variances in the t-test:

    $F_{n_1,\,n_2} = \frac{s_1^2}{s_2^2}$

    Comparison of several means: analysis of variance

    $H_0: \mu_1 = \mu_2 = \mu_3$ (for 3 groups)

    This is tested by calculation of Sum of squares!

    How likely is it that all sample means are drawn from populations with the same $\mu$?

    Analysis of variance

    SSY = SST + SSE

    SSY: total variation, SST: explained variation, SSE: unexplained variation

    $F_{k-1,\,n-k} = \frac{MST}{MSE}$

    Analysis of variance

    Anova assumptions:

    Normal distribution (graphical diagnostics)

    Independent samples

    Homogeneity of variances (graphical diagnostics)

    Especially violations of the homogeneity of variance assumption may lead to an underestimation of the real p-value (i.e. real p-value can be several times higher than the nominal p-value of 0.05)

    Alternatives:

    Kruskal-Wallis test or permutational anova for non-normal distributed data (but equal variance!)

    Welch's anova for data with heterogeneous variances

    Exploratory data analysis

    There are several rules of thumb as to what can be regarded as an outlier, but it remains more or less a subjective decision. John Tukey suggested to define Y as an outlier if: Y < (Q1 - 1.5 IQR) or Y > (Q3 + 1.5 IQR), where Q1 denotes the lower quartile, Q3 denotes the upper quartile, and IQR = (Q3 - Q1) denotes the interquartile range. McGarigal (2000) recommends classifying observations more than 2.5 standard deviations from the mean as outliers. In practice, the type of data, number of samples and knowledge about the data should be taken into account when deciding whether a sample point is classified as outlier.
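    The two rules of thumb above can be sketched in a few lines. This is a minimal Python illustration using only the standard library, not the course's R material; the function names and the data are hypothetical, and the quartile definition (`method="inclusive"`) is an assumption, since different software computes quartiles slightly differently.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values outside Tukey's fences Q1 - k*IQR and Q3 + k*IQR."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def sd_outliers(values, folds=2.5):
    """Return the values more than `folds` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > folds * sd]


# Hypothetical measurements with one suspicious value.
data = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.5]
print("Tukey rule:", tukey_outliers(data))
print("2.5 SD rule:", sd_outliers(data))
```

    Both rules only flag candidates; as the note says, the decision whether a point is treated as an outlier remains a judgement call.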

    Beanplots represent an alternative to boxplots and have been introduced by Peter Kampstra: Kampstra P. 2008

    Statistical modelling

    In statistical terms any observation contains signal and noise. This relates to the fitted value and the residuals in the fitted model.

    Comparison of central tendency: t-test

    The t-test should be used when the standard deviation was estimated from the data. If the standard deviation of the population is known, the z-test can be used.

    Checking the normal distribution assumption will be discussed later (when discussing anova and multivariate normal distribution). The two sample t-test is relatively robust against violation of the assumption of normal distribution as long as the variances are equal. In case of strong violations of this assumption, a non-parametric test should be selected such as the Wilcoxon rank sum test. Welch's t-test should be selected when the variances are not equal.

    A one sample t-test tests the null hypothesis that $\mu = 0$ and is commonly employed in parameter estimates in linear models since a parameter value of 0 would mean that the variable can be removed from the model.
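    To make the difference between the classical two sample t-test and Welch's t-test concrete, the following Python sketch (standard library only, not the course's R code) computes both test statistics and the variance ratio used by the F-test. The function names and the data are hypothetical. P-values are left out on purpose because they require the t and F distributions, which a statistics package provides.

```python
import math
import statistics


def student_t(a, b):
    """Two sample t statistic with pooled variance (assumes equal variances)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic and degrees of freedom


def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    sa = statistics.variance(a) / na  # squared standard error of mean(a)
    sb = statistics.variance(b) / nb  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


def variance_ratio(a, b):
    """Ratio of the two sample variances, the statistic of the F-test."""
    return statistics.variance(a) / statistics.variance(b)


# Hypothetical samples from two sites.
site_a = [5.1, 4.9, 5.6, 5.8, 6.0]
site_b = [4.1, 4.4, 4.0, 4.6, 4.3]
print("Student:", student_t(site_a, site_b))
print("Welch:  ", welch_t(site_a, site_b))
print("F ratio:", variance_ratio(site_a, site_b))
```

    With equal sample sizes the two t statistics coincide and only the degrees of freedom differ; the tests diverge more strongly when both the sample sizes and the variances are unequal.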

    Comparison of several means: analysis of variance

    The anova is used for categorical explanatory variables. In the case of continuous explanatory variables, linear regression is used, which will be discussed later.

    Analysis of variance

    In textbooks, different abbreviations are used for the total sum of squares, treatment sum of squares and error sum of squares. Here we use SSY, SST and SSE, respectively (partly following Crawley 2005). In case of more than 1 factor in the model, SST is broken into several SSA, SSB etc. that denote the sum of squares attributable to differences between the means of the first factor, second factor etc. The treatment sum of squares can also be denominated as between-group variation and the error sum of squares can be denominated as within-group variation. We will encounter this denomination again when we discuss discriminant analysis.

    The overall aim is to minimize SSE or, better, maximise the ratio of the mean SST / mean SSE (here MST and MSE) in the model. MST is calculated by dividing SST by the degrees of freedom, which is k (number of treatments) - 1, and MSE is calculated by dividing SSE by the degrees of freedom, which is n (total sample size) - k.
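    The partitioning described above can be traced by hand on a small data set. The following Python sketch (standard library only, not the course's R code) uses the notation of these notes (SSY, SST, SSE, MST, MSE); the function name and the three groups are hypothetical.

```python
def one_way_anova(groups):
    """Partition the total sum of squares and return the F ratio MST / MSE."""
    n = sum(len(g) for g in groups)  # total sample size
    k = len(groups)                  # number of treatments
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Total variation: squared deviations of all values from the grand mean.
    ssy = sum((y - grand_mean) ** 2 for g in groups for y in g)
    # Explained (between-group) variation.
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Unexplained (within-group) variation.
    sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

    mst = sst / (k - 1)  # treatment mean square, k - 1 degrees of freedom
    mse = sse / (n - k)  # error mean square, n - k degrees of freedom
    return {"SSY": ssy, "SST": sst, "SSE": sse, "MST": mst, "MSE": mse, "F": mst / mse}


# Three hypothetical treatment groups with three replicates each.
result = one_way_anova([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(result)
```

    The F ratio is then compared with the F distribution with k - 1 and n - k degrees of freedom, which a statistics package provides.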

    Analysis of variance

    Graphical diagnostics should be preferred to hypothesis testing for the assumptions of normal distribution and homogeneity of variance, or used as a complement. Quinn and Keough (2002) p. 13 ff. discuss this in detail. The model diagnostics can be conducted using the residuals of the linear model and it is not necessary to test for the assumptions before the anova (or linear regression).

    The anova is more robust to violations of the first than of the third assumption outlined above (Glass, Peckham & Sanders 1972, Review of Educational Research 42(3): 237-288).

    ... p.61-63 for diagnostic tools to spot serial correlation.

    In case we know the measurement error in x and it accounts significantly for the unexplained variation, a model II regression, also called reduced major axis regression, can be conducted.

    We will briefly discuss generalised linear models later in this course. Variable transformation and robust regression are discussed in many textbooks (e.g. Maindonald & Braun 2010, Quinn & Keough 2002) and are beyond the scope of this course (but variable transformation has been extensively discussed in the preceding course of univariate statistics).

    Model diagnostics: Homogeneity of variance

    [Figure: residuals vs. fitted values plots; panels labelled "normal", "strong increase", "non-linear" and "slight increase"]

    Graphical diagnostics are the same for linear regression and anova (and other linear models). The test for normality is conducted using q-q plots.

    The displayed residuals-fitted values plots can be used to check whether the assumption of homoscedasticity (and linearity in the case of regression) holds. If the residuals are not randomly distributed (upper right), but display patterns, this may indicate heteroscedasticity (bottom and top left) or non-linearity (bottom right). For anova the x axis displays the factor levels and the plots do not describe a continuous pattern as on the plots shown. Faraway (2005), p.56 advocates transformation of variables or weighting in the case of regression to alleviate such departures from the assumption of constant variance. In addition, non-linear regression can represent a more suitable alternative for continuous data.

    Other diagnostics include checking for influential points and outliers but these are discussed in the section on regression models.

    Further model diagnostics

    Leverage points (predictor outlier)

    How to deal with leverage points?

    Check whether values are valid

    Should another model be fitted or should data be transformed?

    Are model results robust?

    [Figure: scatterplots showing a leverage point that exerts high influence and a non-influential leverage point]

    Beside checking for assumptions, model diagnostics are used to detect influential points, leverage points and outliers. Influential points exercise high influence on the model fit, but may not be outliers. Outliers do not fit the model, but are not necessarily influential.

    Leverage points (1) exercise high influence on the fitted y (but not necessarily on the model fit) and (2) are distant from the other x-values. The leverage is calculated in terms of so-called hat values and the average hat value is (k+1)/n, where k is the number of parameters in the model excluding the constant and n is the number of observations. Faraway (2005) and Sheather (2009) suggest looking more closely at points with hat values > 2 * average. However, the hat values are independent of the y values so that only graphical inspection can tell whether a high leverage point is really problematic. A nice illustration of leverage can be found here: http://www.rob-mcculloch.org/teachingApplets/Leverage/index.html.

    Outliers can be identified with studentized residuals. Here, points that are more than 2 standard deviations away from the regression line may be considered as outliers (see Sheather 2009: p.60). There are of course different rules of thumb as to what can be regarded as an outlier, but it remains more or less a subjective decision. John Tukey suggested to define Y as an outlier if: Y < (Q1 - 1.5 IQR) or Y > (Q3 + 1.5 IQR), where Q1 denotes the lower quartile, Q3 denotes the upper quartile, and IQR = (Q3 - Q1) denotes the interquartile range. Hence, you would use a boxplot to identify an outlier. McGarigal (2000) recommends classifying observations more than 2.5 standard deviations from the mean as outliers.

    Another important measure in diagnostic plots is Cook's distance. Cook's distance measures the influence of observations on the model fit by calculating the combined effect of leverage and of the magnitude of the residual. The higher Cook's distance, the larger the change in model fit when the point is removed from the model. A point with a high Cook's distance tends to be either an outlier or a leverage point or both. There are different rules of thumb when to consider a point as influential (e.g. Cook's D > 1 or Cook's D > 4/(n-2)), but in practice it is important to look for gaps in the values of Cook's distance (Sheather 2009: p.6...).
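    For a simple linear regression, the three diagnostics discussed above have closed forms, so they can be sketched without a statistics package. The Python code below (standard library only, not the course's R code) assumes a model with one predictor, so k = 1 and the average hat value is 2/n. The residuals are scaled as internally studentized (often called standardized) residuals, which is one of several conventions. The function name and the data are hypothetical.

```python
import math


def simple_regression_diagnostics(x, y):
    """Hat values, standardized residuals and Cook's distances for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - 2)  # residual variance

    # Leverage depends only on the x values, not on y.
    hat = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    std_resid = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(residuals, hat)]
    # Cook's distance combines residual size and leverage (2 fitted parameters).
    cooks = [r * r * h / (2 * (1 - h)) for r, h in zip(std_resid, hat)]
    return hat, std_resid, cooks


# Hypothetical data; the last observation lies far from the other x values.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 12.0]
hat, std_resid, cooks = simple_regression_diagnostics(x, y)

average_hat = 2 / len(x)
high_leverage = [i for i, h in enumerate(hat) if h > 2 * average_hat]
print("High leverage (index):", high_leverage)
print("Cook's D:", [round(d, 3) for d in cooks])
```

    In this example the distant point is flagged by the leverage rule and also has by far the largest Cook's distance, so it would deserve a closer look before the model is interpreted.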

    Flowchart for simple linear regression

    Taken from Sheather 2009: p.103

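    Multiple linear regression model

    Relationship between several explanatory variables and a response variable with:

    $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$

    The aim is to minimise the error term $\varepsilon$, measured as the error sum of squares (SSE) in the linear regression model:

    $SSE = \sum \left( y - \alpha - \beta_1 x_1 - \beta_2 x_2 - \dots - \beta_n x_n \right)^2$

    [Figure: fitted regression plane for n = 2, with axes x1, x2 and y]

    Note that as a rule of thumb there should be at least 10 observations per variable included in the multiple regression.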

    Multiple regression model: Diagnostics

    1. Check all model assumptions of the simple linear regression model (normality of residuals, independence of residuals, homogeneity of residual variance, linearity)

    2. Check for leverage points, outliers and influential points

    3. Check for multicollinearity:

    Strong correlation between explanatory variables (graphical inspection or correlation matrix)

    Can lead to wrong estimates of the regression coefficient and non-significant terms in the model, while the overall F-test indicates a highly significant model

    Calculate variance inflation factors (VIF):

    $VIF_j = \frac{1}{1 - R_j^2}$

    $R_j^2$ is the explained variance for the linear model where $x_j$ is explained by all other x in the model
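    As a rough illustration of the VIF, the Python sketch below (standard library only, not the course's R code) uses the standard identity that the VIFs equal the diagonal elements of the inverse of the correlation matrix of the explanatory variables, which gives the same values as 1 / (1 - R_j^2) without fitting each auxiliary regression separately. All function names and the data are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(predictors):
    """Variance inflation factor for each explanatory variable."""
    p = len(predictors)
    corr = [[correlation(predictors[i], predictors[j]) for j in range(p)]
            for i in range(p)]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(p)]


# Hypothetical explanatory variables; x1 and x2 are strongly correlated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
x3 = [3, 1, 2, 6, 4, 5]
print([round(v, 2) for v in vif([x1, x2, x3])])
```

    A VIF of 1 means that a variable is uncorrelated with the other explanatory variables; the larger the value, the more of its variance is explained by them.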

    As mentioned before (slide 21), the independence of residuals may not hold true for time series or spatial data, but this data is beyond the scope of this course.

    One possibility to cope with non-linearity or non-normality is to transform the response variable (see Sheather 2009: 167 ff). The Box-Cox transformation is a widely used technique that can transform most variables to normal distribution. In more complicated cases, both the response and explanatory variables should be transformed.

    Transformation of the explanatory variables is generally necessary if they are highly skewed. Especially chemical data often exhibit a skewed distribution (due to detection limits) that should be transformed before the multiple regression analysis.

    Regarding the VIF there are different rules of thumb as to when one should worry about collinearity. Most textbooks suggest that for VIF values > 4 (Fox 2008...

    Multiple regression model: Diagnostics II

    Dealing with multicollinearity

    Explanatory variables in linear models should be selected based on scientific knowledge

    Scatterplots and VIFs can aid in identifying variables with high multicollinearity, but cannot suggest what to do

    Strategies to deal with multicollinearity: Omission of variables from the model or other types of regression (e.g. Ridge regression, principal component regression)

    If you omit variables, check which variable is more important based on current scientific understanding. Do not automatically remove the variable with the highest VIF!

    ... Model selection and inference: facts and fiction. Econometric Theory 21, 21-59, accessible in our university at: http://www.jstor.org/stable/3533623

    One way of dealing with this issue in statistical modelling is cross-validation: The data is divided randomly into m subsets and then each of the m subsets is predicted using the remaining data. Then the predictive accuracy can be assessed based on the predictions of the omitted values.
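    The procedure can be sketched for a simple linear regression as follows. This is a minimal Python illustration using only the standard library, not the course's R code; the function names, the choice of the root mean squared error as accuracy measure, the fixed random seed and the data are assumptions made for the example.

```python
import math
import random


def fit_line(x, y):
    """Least squares estimates (intercept, slope) for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def cross_validate(x, y, m=5, seed=1):
    """m-fold cross-validation; returns the RMSE of the held-out predictions."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)            # random division of the data
    folds = [indices[i::m] for i in range(m)]       # m disjoint subsets

    squared_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        # Predict the omitted values with the model fitted to the remaining data.
        squared_errors.extend((y[i] - (a + b * x[i])) ** 2 for i in fold)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: a noisy linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 4.1, 6.2, 7.9, 10.4, 11.8, 14.1, 16.2, 17.7, 20.3]
print("Cross-validated RMSE:", round(cross_validate(x, y, m=5), 3))
```

    Because every observation is predicted by a model that never saw it, the resulting error is a more honest estimate of predictive accuracy than the residual error of the model fitted to all data.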

    Recent developments include the lasso approach (least absolute shrinkage and selection operator) that "effectively performs variable selection and regression coefficient estimation simultaneously." (Sheather 2009: 250).

    Analysis of fitted models

    Relative importance of variables

    Different measures have been suggested, either focusing on the regression coefficients, explained variance or both

    Standardized betas are scaled regression coefficients:

    $\beta_{k,\,standardized} = \beta_k \frac{\sqrt{s_{kk}}}{\sqrt{s_{yy}}}$

    where $s_{kk}$ and $s_{yy}$ denote the variances of $x_k$ and $y$

    widely used AND criticised as they do not account for the direct effect of a variable (for example: a high direct effect of a predictor may be assigned to correlating predictors and result in a low beta)

    Hierarchical partitioning (Chevan & Sutherland 1991) and PMVD (Feldman 2005) are recommended methods
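    The scaling of a regression coefficient into a standardized beta amounts to multiplying it by the ratio of the standard deviations of the explanatory and the response variable. The Python sketch below (standard library only, not the course's R code) shows this for a coefficient that is assumed to come from an already fitted model; the function names and the data are hypothetical. For a simple linear regression the result equals the Pearson correlation, which serves as a plausibility check here.

```python
import statistics


def slope(x, y):
    """Least squares slope of a simple linear regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def standardized_beta(beta, x, y):
    """Scale a regression coefficient by sd(x) / sd(y)."""
    return beta * statistics.stdev(x) / statistics.stdev(y)


# Hypothetical explanatory variable (large units) and response.
x = [120, 150, 180, 210, 260, 300]
y = [1.1, 1.9, 2.0, 2.8, 3.1, 4.0]
beta = slope(x, y)
print("raw slope:        ", beta)
print("standardized beta:", standardized_beta(beta, x, y))
```

    The raw slope depends on the measurement units of x and y, whereas the standardized beta is unit-free, which is why it is used to compare predictors measured on different scales.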

    Standardized betas are contested as they are not related to the partitioning of $r^2$ (unless the explanatory variables are non-correlated) and tend to neglect the direct effect of a variable in the model. Nevertheless, they give information on the influence of a predictor on the response variable.

    For a deeper discussion of this issue refer to: Johnson JW, LeBreton JM (2004). "History and Use of Relative Importance Indices in Organizational Research." Organizational Research Methods, 7, 237 or, with a special emphasis on R: Grömping, U. (2006) Relative importance for linear regression in R: The package relaimpo. Journal of Statistical Software, 17. A more extensive literature list on this topic is provided on the website of Ulrike Grömping: http://prof.beuth-hochschule.de/groemping/relaimpo/

    Brief tutorial for multiple regression

    1. Transform variables if necessary (check range, distribution etc.)
    2. Check explanatory variables for multicollinearity: Omit variables or adjust regression method
    3. Rank variables according to their scientific relevance and build model with most important variables
    4. Search for best-fit model using different strategies
    5. Check best-fit model with model diagnostics
    6. Validate model using cross-validation or a validation sample
    7. Determine relevance of individual explanatory variables

    For further details on which aspects to consider see Maindonald (2010): 18 ff. Cross-validation is explained on the pages 153 ff and you can find information on the implementation of different multiple regression related techniques here: http://www.statmethods.net/stats/regression.html.