Multiple Correspondence Analysisfactominer.free.fr/course/doc/MCA_course_slides.pdf · Data -...

40
Data - issues Studying the individuals Studying the categories Interpretation aids Multiple Correspondence Analysis François Husson Applied Mathematics Department - AGROCAMPUS OUEST [email protected] 1 / 38

Transcript of Multiple Correspondence Analysisfactominer.free.fr/course/doc/MCA_course_slides.pdf · Data -...

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Multiple Correspondence Analysis

    François Husson

    Applied Mathematics Department - AGROCAMPUS OUEST

    [email protected]

    1 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Plan

    1 Data - issues

    2 Studying the individuals

    3 Studying the categories

    4 Interpretation aids

    2 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    The data

    21

    Categorical variables

    I

    1

    i

    j

    Individuals

    1 J

    vij

    I individualsJ qualitative variables

    vij : category of the j-th variablepossessed by the i-th individual

    Example : survey where I peoplereply to J multiple-choice questions

    3 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    The data

    17

    i’ 0 1 0 0 0

    Kj = 5

    I

    1

    i

    j

    Ind

    ivid

    ua

    ls

    1 J

    i’

    I

    1

    i

    j

    Ind

    ivid

    ua

    ls

    1 J

    yik

    Indicator Matrix

    1 k Kj K Σ

    J

    IJIpk

    I

    Categorical variablesCategories

    vij

    married

    4 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Goals

    1 Studying the individualsOne individual = one row of the CDT = set of categoriesSimilarity of individuals – Inter-individual variabilityPrincipal axes of the inter-individual variability

    (in relation to the categories)2 Studying the variables

    Links between qualitative variables(in relation to the categories)

    Visualization of the set of associations between categoriesSynthetic variables

    (quantitative indicators based on the qualitative variables)

    ⇒ Similar problem to PCA

    5 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Leisure activity data

    • Extract from 2003 INSEE survey on identity construction,called the “history of life” survey

    • 8403 individuals• 2 sorts of variables :

    • Which of the following leisure activities do you practiceregularly : Reading, Listening to music, Cinema, Shows,Exhibitions, Computer, Sport, Walking, Travel, Playing amusical instrument, Collecting, Voluntary work, Homeimprovement, Gardening, Knitting, Cooking, Fishing, Numberof hours of TV per day on average

    • supplementary variables (4 questions) : sex, gender, profession,marital status

    6 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Leisure activity data

    Hobbies Number Sex Female 4616

    Listening music 5947 Male 3787

    Reading 5646 Age [15,25] 857

    Walking 4175 (25,35] 1302

    Cooking 3686 (35,45] 1646

    Mechanic 3539 (45,55] 1837

    Travelling 3363 (55,65] 1257

    Cinema 3359 (65,75] 937

    Gardening 3356 (75,85] 482

    Computer 3158 (85,100] 85

    Sport 3095 Marital Divorcee 792

    Exhibition 2595 status Married 4333

    Show 2425 Remarried 404

    Playing music 1460 Single 2140

    Knitting 1413 Widower 734

    Volunteering 1285 Profession

    foreman 735Fishing 945

    management 1052Collecting 862

    employee 2552

    Number of hours watching TV 0 1017

    unskilled worker 792

    1 1223

    manual labourer 1161

    2 2156

    technician 401

    3 1775 other 212

    4 2232 No answer 1498

    Sociodemographic

    variablesHobbies

    7 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Leisure activity data

    Respondent

    1

    1

    8403

    18 1 4

    Hobbies Sociodemographic

    = yes

    or no

    MCA 1 : active = leisure activity, then use supplementary data forinterpretation• 1 individual = vector of leisure activities• Principal axes of variability of leisure vectors• Links between these axes and the supplementary variables

    MCA 2 : active = supplementary variables, leisure activities assupplementary informationMCA 3 : active = BOTH 8 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Transforming the complete disjunctive table

    An individual’s weight is1I

    yik = 1 if the i-th individual is in k-th category of the j-th variable(for each pk)

    = 0 otherwise

    Idea : xik = yik/pk

    ∑Ii=1 xikI

    =1I

    ∑Ii=1 yikpk

    =1I

    I × pkpk

    = 1

    Centering : xik = yik/pk − 1

    9 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Plan

    1 Data - issues

    2 Studying the individuals

    3 Studying the categories

    4 Interpretation aids

    10 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Point cloud of individuals

    category k

    IN

    )',( iid

    kix '

    KRI

    iM

    OGI =

    'iM (weight = 1/I)

    (weight = )Jpk

    ikx

    1

    i

    I

    k1 K

    Ind

    ivid

    ua

    lsCategories

    Indicator matrix

    ikx

    d2i,i ′ =K∑

    k=1

    pkJ

    (xik − xi ′k)2 =K∑

    k=1

    pkJ

    (yikpk− yi

    ′k

    pk

    )2=

    1J

    K∑k=1

    1pk

    (yik−yi ′k)2

    • 2 individuals with same categories : distance = 0• 2 individuals with many shared categories : small distance• 2 individuals, only 1 with a rare category : large distance to indicate this• 2 individuals share rare category : small distance to indicate this shared

    specificity

    d(i ,GI )2 =

    K∑k=1

    pkJ(xik)

    2 =K∑

    k=1

    pkJ

    (yikpk− 1)2

    =1J

    K∑k=1

    yijpk− 1

    Inertia(NI ) =I∑

    i=1

    1Id2(i ,O)︸ ︷︷ ︸

    inertia of i

    =I∑

    i=1

    (1IJ

    K∑k=1

    yikpk− 1

    I

    )=

    K

    J− 1

    11 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Point cloud of individuals

    category k

    IN

    )',( iid

    kix '

    KRI

    iM

    OGI =

    'iM (weight = 1/I)

    (weight = )Jpk

    ikx

    1

    i

    I

    k1 K

    Ind

    ivid

    ua

    lsCategories

    Indicator matrix

    ikx

    d2i,i ′ =K∑

    k=1

    pkJ

    (xik − xi ′k)2 =K∑

    k=1

    pkJ

    (yikpk− yi

    ′k

    pk

    )2=

    1J

    K∑k=1

    1pk

    (yik−yi ′k)2

    d(i ,GI )2 =

    K∑k=1

    pkJ(xik)

    2 =K∑

    k=1

    pkJ

    (yikpk− 1)2

    =1J

    K∑k=1

    yijpk− 1

    Inertia(NI ) =I∑

    i=1

    1Id2(i ,O)︸ ︷︷ ︸

    inertia of i

    =I∑

    i=1

    (1IJ

    K∑k=1

    yikpk− 1

    I

    )=

    K

    J− 1

    11 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Building the point cloud of individuals

    Getting factor axes, as usual, like for all factor analysis methods

    Sequential construction : look for the axis maximizing the inertiaand orthogonal to previous axes

    12 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Leisure activity data

    • Extract from 2003 INSEE survey on identity construction,called the “history of life” survey

    • 8403 individuals• 2 sorts of variables :

    • Which of the following leisure activities do you practiceregularly : Reading, Listening to music, Cinema, Shows,Exhibitions, Computer, Sport, Walking, Travel, Playing amusical instrument, Collecting, Voluntary work, Homeimprovement, Gardening, Knitting, Cooking, Fishing, Numberof hours of TV per day on average

    • supplementary variables (4 questions) : sex, gender, profession,marital status

    13 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Diagram showing the inertia

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

    Eigenvalues

    Dimension

    Eig

    enva

    lues

    0.00

    0.05

    0.10

    0.15

    14 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Representation of the point cloud of individuals

    -1.0 -0.5 0.0 0.5 1.0

    -1.0

    -0.5

    0.0

    0.5

    1.0

    Dim 1 (16.95%)

    Dim

    2 (6

    .91%

    )

    15 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Representation of the point cloud of individuals

    What kind of pattern might we see ?

    The Guttman effect

    16 / 38

  • Data - issues Studying the individuals Studying the categories Interpretation aids

    Individuals shown in terms of the gardening variable

    Idea : use the catego-ries and variables tointerpret the plot of theindividuals

    Put a category atthe barycenter of theindividuals in it

    −1.0 −0.5 0.0 0.5 1.0

    −1.

    0−

    0.5

    0.0

    0.5

    1.0

    Dim 1 (16.95%)

    Dim

    2 (

    6.91

    %)

    Gardening_0Gardening_1

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ● ●

    ●●

    ● ●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ● ●

    ● ●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●●

    ● ●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●