Chapter13 Slides


  • 7/31/2019 Chapter13 Slides

    Welcome to PowerPoint slides for

    Chapter 13

    Cluster Analysis

    for

    Market Segmentation

    Marketing Research

    Text and Cases

    by

    Rajendra Nargundkar

    1. A cluster, by definition, is a group of similar objects

    2. There could be clusters of people, brands or other

    objects

    3. If clusters are formed of customers similar to one

    another, then cluster analysis can help marketers

    identify segments (clusters)

    4. If clusters of brands are formed, this can be used to

    gain insights into brands that are perceived as similar to

    each other on a set of attributes

    5. This chapter explains the use of cluster analysis for

    customer segmentation

    6. Cluster analysis is best performed when the variables

    are interval or ratio-scaled

    Slide 1

    1. There are two major classes of cluster analysis

    techniques: hierarchical and non-hierarchical.

    2. In hierarchical clustering, some measure of

    distance is used to identify distances between all pairs

    of objects to be clustered. One of the popular distance

    measures used is Euclidean Distance. Another is the Squared Euclidean Distance.

    3. We begin with all objects in separate clusters. Say,

    we have ten objects in separate clusters. The two closest

    objects are joined to form a cluster. The remaining 8

    objects remain separate. This is stage 1 of

    hierarchical clustering.

    Slide 2

    4. In stage 2, again the two closest objects form another

    cluster. Now, we have two clusters, and 6 unclustered

    objects. This means a total of eight clusters, two with

    two objects each, and six with one object each.

    5. This process continues: points join existing

    clusters (because they are closest to an existing cluster),

    and clusters join other clusters, based on the shortest distance criterion.

    6. In this way, a range of possible solutions is formed,

    from a 10-cluster solution in the beginning, to a single

    cluster solution at the end.

    7. We have to decide how many clusters the data seems

    to have, using either the agglomeration schedule or

    the dendrogram. Both of these are computer outputs that

    describe, in numbers or visually, the sequence of cluster

    formation. This decision is somewhat subjective, but

    there are some guidelines one can follow, as illustrated

    in the worked example

    Slide 2 contd...
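
The stages described on Slide 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the routine used to produce the outputs in these slides: the toy data, the function names and the choice of average linkage with squared Euclidean distance are assumptions made for the example.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(squared_euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Repeatedly merge the two closest clusters and record each merge.

    Each history entry is (stage, members_a, members_b, coefficient),
    loosely mirroring the rows of an agglomeration schedule.
    """
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    history = []
    stage = 1
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        coeff = average_linkage(clusters[a], clusters[b], points)
        history.append((stage, list(clusters[a]), list(clusters[b]), coeff))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        stage += 1
    return history


# Hypothetical toy data: five objects measured on two variables.
toy = [(1, 1), (1, 2), (5, 5), (6, 5), (9, 1)]
schedule = agglomerate(toy)
for row in schedule:
    print(row)
```

With n objects the loop always records n - 1 merges, running from an n-cluster solution down to a single cluster, which is the range of solutions described in point 6.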

    Slide 3

    1. In non-hierarchical clustering methods (of which

    k-means clustering is the most widely used), we need to

    specify the number of clusters we want the objects to

    be clustered into.

    2. This can be done if we have a hypothesis that the

    objects will group into a certain number of clusters.

    Alternatively, we can first do a hierarchical clustering

    on the data, find the approximate number of clusters,

    and then perform a k-means clustering

    3. In our illustration, we have used both hierarchical

    and non-hierarchical methods in combination with one

    another

    4. Let us move on to our worked example
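
As a brief aside before the worked example, the k-means idea in points 1 and 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the data are hypothetical, the initial centres are supplied by hand, and the stopping rule (stop when memberships no longer change) is one common choice rather than the exact procedure of any particular package.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def k_means(points, initial_centres, max_iter=100):
    """Assign each point to its nearest centre, then recompute each
    centre as the mean of its members, until memberships are stable."""
    centres = [tuple(c) for c in initial_centres]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)), key=lambda k: squared_euclidean(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing
        labels = new_labels
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = mean_vector(members)
    return labels, centres


# Hypothetical data with two visibly separate groups; k = 2 is specified
# in advance through the number of initial centres.
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centres = k_means(toy, initial_centres=[toy[0], toy[3]])
print(labels)
print(centres)
```

The number of clusters is fixed by the number of initial centres passed in, which is why a prior hypothesis or a preliminary hierarchical run is needed first.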

    Worked Out Example

    Problem: A major Indian FMCG company wants to map

    the profile of its target market in terms of lifestyle, attitudes and perceptions. The company's managers

    prepare, with the help of their marketing research team, a

    set of 15 statements, which they feel measure many of the

    variables of interest. These 15 statements are given below.

    The respondent had to agree or disagree (1 = Strongly Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 = Disagree, 5 = Strongly Disagree) with each statement.

    1. I prefer to use e-mail rather than write a letter.
    2. I feel that quality products are always priced high.
    3. I think twice before I buy anything.
    4. Television is a major source of entertainment.
    5. A car is a necessity rather than a luxury.
    6. I prefer fast food and ready to use products.
    7. People are more health conscious today.
    8. Entry of foreign companies has increased the efficiency of Indian companies.
    9. Women are active participants in purchase decisions.
    10. I believe politicians can play a positive role.
    11. I enjoy watching movies.
    12. If I get a chance, I would like to settle abroad.
    13. I always buy branded products.
    14. I frequently go out on weekends.
    15. I prefer to pay by credit card rather than in cash.

    Slide 4

    Slide 5

    For the purpose of this illustration, we will assume

    that 20 respondents answered the questionnaire above

    (In a real life situation, the sample size would be

    higher). The input data matrix of 20 respondents x 15 variables is shown in Fig. 1.

    Fig. 1

    Case  var00001  var00002  var00003  var00004  var00005  var00006  var00007  var00008
    1.        1.00      3.00      5.00      4.00      3.00      5.00      3.00      2.00
    2.        2.00      3.00      2.00      3.00      4.00      4.00      3.00      2.00
    3.        3.00      2.00      3.00      4.00      3.00      5.00      3.00      3.00
    4.        3.00      2.00      4.00      2.00      2.00      4.00      3.00      4.00
    5.        2.00      2.00      4.00      2.00      2.00      5.00      3.00      3.00
    6.        2.00      4.00      3.00      3.00      5.00      4.00      4.00      2.00
    7.        1.00      1.00      2.00      4.00      4.00      1.00      2.00      4.00
    8.        4.00      5.00      1.00      4.00      5.00      4.00      5.00      1.00
    9.        2.00      1.00      5.00      3.00      4.00      4.00      2.00      1.00
    10.       5.00      2.00      4.00      3.00      2.00      5.00      1.00      5.00
    11.       4.00      3.00      3.00      2.00      1.00      2.00      1.00      5.00
    12.       3.00      4.00      4.00      4.00      3.00      2.00      5.00      1.00
    13.       4.00      3.00      2.00      2.00      3.00      3.00      4.00      2.00
    14.       1.00      2.00      2.00      4.00      2.00      5.00      1.00      3.00
    15.       2.00      3.00      4.00      1.00      5.00      4.00      2.00      4.00
    16.       3.00      2.00      1.00      3.00      4.00      3.00      2.00      3.00
    17.       5.00      1.00      1.00      5.00      1.00      2.00      4.00      2.00
    18.       3.00      5.00      5.00      3.00      5.00      5.00      5.00      1.00
    19.       3.00      2.00      4.00      2.00      4.00      4.00      1.00      4.00
    20.       1.00      3.00      3.00      2.00      2.00      5.00      2.00      5.00

    Case  var00009  var00010  var00011  var00012  var00013  var00014  var00015
    1.        3.00      2.00      4.00      1.00      1.00      1.00      5.00
    2.        2.00      2.00      4.00      2.00      2.00      2.00      4.00
    3.        4.00      2.00      4.00      3.00      4.00      4.00      3.00
    4.        5.00      4.00      5.00      4.00      5.00      5.00      5.00
    5.        4.00      4.00      5.00      5.00      3.00      3.00      4.00
    6.        3.00      4.00      5.00      4.00      3.00      3.00      3.00
    7.        2.00      5.00      4.00      3.00      3.00      3.00      1.00
    8.        1.00      5.00      3.00      3.00      5.00      5.00      2.00
    9.        2.00      1.00      2.00      2.00      4.00      4.00      3.00
    10.       3.00      2.00      5.00      1.00      2.00      2.00      1.00
    11.       2.00      2.00      4.00      5.00      1.00      1.00      2.00
    12.       5.00      3.00      2.00      4.00      4.00      4.00      3.00
    13.       2.00      3.00      4.00      3.00      5.00      5.00      4.00
    14.       5.00      4.00      3.00      2.00      2.00      2.00      5.00
    15.       4.00      5.00      2.00      1.00      1.00      1.00      4.00
    16.       2.00      5.00      1.00      2.00      5.00      5.00      3.00
    17.       2.00      4.00      4.00      3.00      3.00      3.00      2.00
    18.       2.00      3.00      4.00      4.00      2.00      2.00      1.00
    19.       1.00      3.00      4.00      5.00      3.00      3.00      2.00
    20.       1.00      3.00      4.00      4.00      3.00      3.00      3.00

    Slide 5 contd...

    Fig 1 contd...

    Slide 6

    The computer output is obtained by first doing a

    hierarchical cluster analysis to find the number of

    clusters that exist in the data. These outputs are in

    figs 2 to 4 (Agglomeration schedule, vertical Icicle

    Plot and Dendrogram using Average Linkage,

    respectively).

    The second stage is a K-means (quick cluster)

    output with a pre-determined number of clusters to

    be specified. In this case, the output is for 4 clusters. We will look at both stage 1 and stage 2 outputs to

    understand the interpretation of both stages.

    Slide 7

    Fig. 2 : Agglomeration Schedule

             Clusters Combined                  Stage Cluster 1st Appears
    Stage  Cluster 1  Cluster 2  Coefficient  Cluster 1  Cluster 2  Next Stage
        1          4          5    14.000000          0          0           5
        2         19         20    16.000000          0          0           7
        3          6         18    17.000000          0          0           9
        4          1          2    17.000000          0          0           8
        5          3          4    20.000000          0          1          11
        6         13         16    25.000000          0          0          13
        7         11         19    28.000000          0          2          11
        8          1         14    28.500000          0          0          10
        9          6          8    32.500000          0          0          12
       10          1         15    34.666668          0          0          14
       11          3         11    36.444443          0          7          15
       12          6         12    36.666668          0          0          19
       13          7         13    39.500000          0          6          17
       14          1          9    41.000000         10          0          16
       15          3         10    41.666668         11          0          16
       16          1          3    46.342857         14         15          18
       17          7         17    47.000000         13          0          18
       18          1          7    51.791668         16         17          19
       19          1          6    58.156250         18         12           0
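
The stage 1 coefficient can be checked by hand from the input data. The sketch below assumes that the coefficient for a merge of two single cases is simply their squared Euclidean distance over the 15 variables, and it uses the responses of cases 4 and 5 as printed in Fig. 1. It is offered only as a check on that one row, not as a reproduction of the whole schedule.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Responses of cases 4 and 5 on the 15 statements, as printed in Fig. 1.
case_4 = [3, 2, 4, 2, 2, 4, 3, 4, 5, 4, 5, 4, 5, 5, 5]
case_5 = [2, 2, 4, 2, 2, 5, 3, 3, 4, 4, 5, 5, 3, 3, 4]

d_squared = squared_euclidean(case_4, case_5)
d = d_squared ** 0.5  # ordinary Euclidean distance

print(d_squared)    # 14, the stage 1 coefficient in Fig. 2
print(round(d, 3))  # 3.742
```

The squared form is just the Euclidean distance without the final square root, so both measures rank pairs of cases in the same order.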

    Slide 8

    1. A look at fig 2, the agglomeration schedule,

    can help us to identify large differences in the coefficient (4th column). The agglomeration

    schedule from top to bottom (stage 1 to 19)

    indicates the sequence in which cases get

    combined with others (or one cluster combines

    with another), until all 20 cases are combined

    together in one cluster at the last stage (stage 19).

    2. Therefore, stage 19 represents a 1 cluster

    solution, stage 18 represents a 2 cluster solution,

    stage 17 represents a 3 cluster solution, and so

    on, going up from the last row to the first row.

    We have to identify how many clusters are in the

    data. We use the difference between rows in a

    measure called coefficient (also known as fusion

    coefficient) in column 4 to identify the number

    of clusters in the data.

    3. We will look at this figure from the last row upwards, because we would like to have the lowest possible number of clusters, for reasons of economy and ease of interpretation. We see that there is a difference of (58.15 - 51.79) in the coefficients between the 1 cluster solution (stage 19) and the 2 cluster solution (stage 18). This is a difference of 6.36. The next difference is (51.79 - 47.00), which is equal to 4.79 (between stage 18, the 2 cluster solution, and stage 17, the 3 cluster solution). The next one after that is (47.00 - 46.34), only 0.66, between stage 17 and stage 16. After this, there is again a large difference between the 4 cluster and 5 cluster solutions (stages 16 and 15), of (46.34 - 41.66) or 4.68. Thereafter, the differences are smaller between subsequent rows of coefficients.

    4. A large difference in the coefficient values between any two rows indicates a solution pertaining to the number of clusters which the lower row represents. Ignoring the first difference of 6.36, which would indicate only 1 cluster in the data, we look at the next largest differences. 4.79 is the difference between row 2 from the bottom and row 3 from the bottom, indicating a 2 cluster solution. But the difference between stage 16 and stage 15 is almost the same (4.68), indicating a 4 cluster solution. At this point, it is the judgement of the researcher that should decide whether to go for a 2 cluster or a 4 cluster solution. Just for illustration, we will choose the 4 cluster solution.

    Slide 8 Contd.
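
The row-to-row differences discussed on Slide 8 can be recomputed directly from the coefficient column. The snippet below is a small sketch using the last five coefficients as printed in Fig. 2; the variable names are chosen only for this illustration.

```python
# Fusion coefficients for the last five stages, as printed in Fig. 2.
coefficients = {
    15: 41.666668,
    16: 46.342857,
    17: 47.000000,
    18: 51.791668,
    19: 58.156250,
}
n_cases = 20

jumps = []
for stage in sorted(coefficients, reverse=True):
    if stage - 1 not in coefficients:
        continue  # no earlier row to compare with
    diff = round(coefficients[stage] - coefficients[stage - 1], 2)
    clusters_at_stage = n_cases - stage  # stage 19 -> 1 cluster, 18 -> 2, ...
    jumps.append((clusters_at_stage, diff))

for clusters, diff in jumps:
    print(clusters, "cluster solution: difference", diff)
```

Read from the bottom of the schedule upwards, the differences come out as 6.36, 4.79, 0.66 and 4.68, which is why the 2 cluster and 4 cluster solutions are the two candidates.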

    Slide 9

    Now, in stage 2, a k-means clustering is run with a 4

    cluster solution requested (as identified from the

    hierarchical clustering done above). In the given problem, figs 5, 6, 7 and 8 show the outputs of K-means

    clustering for a 4 cluster solution. These

    outputs give us the initial cluster centres, the case

    listing of cluster membership (i.e., which case

    belongs to which of the clusters), the final cluster

    centres (the solution) and an ANOVA table.

    Fig. 7 : Final Cluster Centers

    Cluster  VAR00001  VAR00002  VAR00003  VAR00004
    1          2.2000    2.2000    3.8000    3.2000
    2          3.5000    3.6667    2.6667    3.5000
    3          1.7500    2.0000    2.2500    3.0000
    4          3.0000    2.4000    3.6000    2.2000

    Cluster  VAR00005  VAR00006  VAR00007  VAR00008
    1          3.2000    4.4000    2.8000    2.4000
    2          3.6667    3.3333    4.5000    1.5000
    3          3.7500    3.2500    1.7500    3.5000
    4          2.2000    4.2000    1.6000    4.4000

    Cluster  VAR00009  VAR00010  VAR00011  VAR00012
    1          3.2000    2.2000    3.8000    2.4000
    2          2.5000    3.6667    3.6667    3.5000
    3          3.2500    4.7500    2.5000    2.0000
    4          2.2000    2.8000    4.4000    4.0000

    Cluster  VAR00013  VAR00014  VAR00015
    1          2.4000    3.2000    4.0000
    2          4.1667    3.6667    2.5000
    3          1.2500    2.7500    3.2500
    4          3.0000    2.4000    2.4000

    Slide 9 Contd.

    Fig 7 contd...

    Slide 10

    1. The final cluster centres (above) describe the mean value of each variable for each of the 4 clusters. For example, cluster 1 is described by the mean values of variable 1 = 2.2, variable 2 = 2.2, variable 3 = 3.8, variable 4 = 3.2, and so on. Similarly, cluster 3 is described by variable 1 = 1.75, variable 2 = 2.0, variable 3 = 2.25, variable 4 = 3.0, and so on.

    2. We now go back to the original variables (in this case the 15 statements in our questionnaire), and interpret the clusters in terms of the 15 variables. For example, cluster 3 consists of people who prefer e-mail to writing conventional letters (variable 1 value = 1.75, which is equivalent to agree on the scale of 1 to 5). Similarly, they are also people who tend to think twice before buying anything (variable 3 value = 2.25), in other words, careful spenders. They also agree (variable 2 value = 2.00) that quality products are always priced high; that is, they have a positive correlation in their minds between a product's quality and price.

    3. On these same variables, cluster 2 shows people who prefer conventional mail to e-mail (variable 1 value = 3.5, or close to disagree), people who do not necessarily associate high price with good quality (variable 2 value = 3.67), and tend to be neutral about care in spending (variable 3 value = 2.67). In this way, when we compare final cluster centre values on each of the 15 variables, for 1 cluster at a time, a complete picture of the clusters emerges.
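
One way to make this reading of the centres mechanical is to map each mean to the nearest point on the 1 to 5 agreement scale. The helper below is a hypothetical convenience for illustration (the rule of rounding to the nearest scale point is an assumption, not something prescribed in these slides); the centre values are those printed in Fig. 7 for variables 1 to 3.

```python
import math

SCALE = {
    1: "Strongly Agree",
    2: "Agree",
    3: "Neither Agree nor Disagree",
    4: "Disagree",
    5: "Strongly Disagree",
}


def nearest_label(mean_score):
    """Map a cluster-centre mean to the nearest point on the 1-5 scale."""
    point = math.floor(mean_score + 0.5)  # round halves up (3.5 -> 4)
    point = max(1, min(5, point))         # stay inside the scale
    return SCALE[point]


# Final cluster centres for variables 1 to 3, as printed in Fig. 7.
centres = {
    2: {"var1": 3.5, "var2": 3.6667, "var3": 2.6667},
    3: {"var1": 1.75, "var2": 2.0, "var3": 2.25},
}

for cluster, means in centres.items():
    for var, value in means.items():
        print(cluster, var, value, nearest_label(value))
```

Under this rule cluster 3 comes out as "Agree" on all three statements, while cluster 2 comes out as "Disagree" on the first two and neutral on the third, in line with the interpretation above.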

    Slide 11

    In this case, we will briefly describe each of the 4 clusters

    as follows:

    Cluster 1

    E-mail users, feel quality comes at a price, not careful spenders, do not like television much, do not think a car is

    a necessity, do not like fast food and ready to use products,

    are not sure whether people are more health-conscious

    today, think foreign companies have increased somewhat

    the efficiency of Indian companies, disagree that women

    are active purchasing decision makers, feel that politicians

    can play an active role, do not enjoy watching movies,

    might consider settling abroad, tend to buy branded

    products, do not go out much on weekends and like to pay

    cash, rather than charging to their credit cards (if they have

    one).

    It is thus a cluster exhibiting many traditional values,

    except that they have adapted to email use. They are also

    beginning to loosen their purse strings, and are probably in

    transition in some other factors like acceptance of women

    as decision makers and the advent of credit cards.

    Cluster 2

    Regular mail writers, bargain hunters or aggressive buyers,

    not too particular about thinking before spending, not great

    valuers of TV, believe the car is a luxury, not too fond of fast food and convenience products, do not think people are very

    health conscious, feel foreign companies have done us good,

    think women are active purchasing decision makers, do not

    believe in politicians, do not like movies, do not want to

    settle abroad, do not stress on branded products, do not go out

    on weekends, but do prefer credit cards for payments.

    It is a group which likes to use credit, spends more freely,

    believes in woman power, believes in economics rather than

    politics, and feels quality products can be cheap. Also, they

    seem to have a patriotic streak, as they do not want to settle abroad.

    Slide 11 contd...

    Cluster 3

    E-mail users, quality measured by price, think twice before

    buying, indifferent to TV, car is a luxury to them, not too

    fond of fast food, agree that people are health conscious, do

    not think foreign companies have made us efficient, do not

    believe in woman power, detest politicians, enjoy watching movies, willing to settle abroad, always buy branded

    products, go out on weekends, slightly prefer cash to credit

    cards.

    This group is not a free spending one, but is health conscious,

    more patriarchal, more loyal to branded products, and more outgoing than the other groups, even willing to go

    abroad to settle.

    Slide 12

    Cluster 4

    Not too particular about e-mail, measure quality by

    price, free spending, enjoy watching TV, think a car is

    necessary, not fond of fast food, think people are health

    conscious, do not think foreign companies have made

    us efficient, believe in woman power, somewhat positive about politicians, not movie watchers, do not

    want to settle abroad, indifferent to branding,

    moderately outgoing and moderately in favour of credit

    cards rather than cash.

    This group is optimistic, free spending and a good

    target for TV advertising, particularly consumer

    durables and entertainment. But they are not necessarily

    influenced by brands. They may want value for money,

    but if they see value, they may spend a lot.

    In summary, the cluster analysis of this sample of

    respondents tells us a lot about the possible segments

    which exist in the target population.

    Slide 12 contd...

    Slide 13

    ANOVA:

    Fig. 8 : Analysis of Variance

    Variable  Cluster MS  DF  Error MS    DF        F   Prob
    VAR00001      3.0500   3     1.315  16.0   2.3183   .114
    VAR00002      3.0722   3     1.083  16.0   2.8359   .071
    VAR00003      2.5722   3     1.630  16.0   1.5778   .234
    VAR00004      1.6333   3      .943  16.0   1.7307   .201
    VAR00005      2.5056   3     1.605  16.0   1.5609   .238
    VAR00006      1.7056   3     1.505  16.0   1.1331   .365
    VAR00007      9.6500   3      .390  16.0  24.7040   .000
    VAR00008      8.5500   3      .681  16.0  12.5505   .000
    VAR00009      1.3000   3     1.865  16.0    .6968   .567
    VAR00010      5.5056   3      .730  16.0   7.5397   .002
    VAR00011      2.7389   3     1.020  16.0   2.6830   .082
    VAR00012      4.0833   3     1.293  16.0   3.1562   .054
    VAR00013      7.2556   3      .799  16.0   9.0813   .001
    VAR00014      1.6222   3     1.880  16.0    .8628   .480
    VAR00015      2.8500   3     1.465  16.0   1.9446   .163

    The ANOVA table (fig. 8) tells us which of the 15 variables are significantly different across the 4 clusters. The last column indicates that variables 2, 7, 8, 10, 11, 12 and 13 are significant at the 0.10 level (equivalent to a 90% confidence level), as they have prob. values less than 0.10. The other variables are not statistically significant, as they all have prob. values greater than 0.10. But there is divided opinion about the utility of statistical testing for cluster analysis. Most established writers seem to feel that these tests (ANOVA or other tests) are not valid. Therefore, it is left to the researcher's judgement whether to use these in determining which variables are significant. If the tests are used, then the interpretation of clusters and differences across clusters should be only on the basis of those variables which are (statistically) significantly different across clusters at 0.10 or 0.05 or some other level.

    Slide 13 contd...
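
The screening step described above amounts to comparing each prob. value with the chosen level. The sketch below uses the Prob column as printed in Fig. 8; the function names are made up for this illustration. It also shows that each F value is the ratio of the Cluster MS to the Error MS, with small discrepancies from the printed figures expected because the printed mean squares are rounded.

```python
# "Prob" column of Fig. 8, keyed by variable number.
prob = {
    1: 0.114, 2: 0.071, 3: 0.234, 4: 0.201, 5: 0.238,
    6: 0.365, 7: 0.000, 8: 0.000, 9: 0.567, 10: 0.002,
    11: 0.082, 12: 0.054, 13: 0.001, 14: 0.480, 15: 0.163,
}


def significant(prob_by_var, alpha):
    """Variables whose prob. value is below the chosen level."""
    return sorted(v for v, p in prob_by_var.items() if p < alpha)


def f_ratio(cluster_ms, error_ms):
    """F statistic as between-cluster MS over within-cluster (error) MS."""
    return cluster_ms / error_ms


print(significant(prob, 0.10))            # [2, 7, 8, 10, 11, 12, 13]
print(significant(prob, 0.05))            # [7, 8, 10, 13]
print(round(f_ratio(3.0500, 1.315), 2))   # 2.32, close to the printed 2.3183
```

Tightening the level from 0.10 to 0.05 drops variables 2, 11 and 12 from the list, which illustrates how much the choice of level can change the set of variables used for interpretation.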

    Slide 14

    Additional Comments on Cluster Analysis

    Objects

    We have looked at an example of classifying people, with interval-scaled data. It is possible to classify objects such as brands, products, cities, etc. with cluster analysis. For example, which brands are clustered together in terms of consumer perceptions for a positioning exercise, or which cities are clustered together in terms of the income, education and age profile of their residents.

    Number of Clusters

    One of the main decisions of a researcher is to decide how many clusters are present in the data. In certain cases, if for example we have a prior hypothesis about how many clusters ought to be present, this decision may already be made. But otherwise, it tends to be a subjective decision. One of the criteria that can be used in addition to the ones we have described in the chapter is that every cluster must have a reasonable or minimum number of objects. This means that if a cluster comes out with only one or two objects in it, we should look for another solution. It may be useful to experiment with two or three possible solutions before deciding on the number of clusters.
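
The minimum-size criterion is easy to check once cluster memberships are available. The snippet below is a sketch with hypothetical membership labels and a hypothetical threshold of 3 objects; neither comes from the worked example.

```python
from collections import Counter


def small_clusters(labels, min_size):
    """Return {cluster: size} for clusters with fewer than min_size objects."""
    sizes = Counter(labels)
    return {cluster: n for cluster, n in sizes.items() if n < min_size}


# Hypothetical cluster memberships for ten objects in a 3 cluster solution.
labels = [1, 1, 2, 1, 2, 3, 2, 1, 2, 2]

print(small_clusters(labels, min_size=3))  # {3: 1}
```

A non-empty result, as here, would be a signal to try a solution with a different number of clusters.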

    Slide 15

    Variables

    Once the reader is aware of the basics of cluster analysis, he can begin to use it creatively. For example,

    a cluster analysis can be done on some of the measured

    variables, and then other variables can be checked to

    see if they also exhibit differences across clusters. In

    the worked out example discussed earlier, only

    psychographic or behavioural variables were used to get the 4 clusters. We could then see if the cluster members belonged to

    different places, had different education levels, or

    whether one gender figured predominantly in any one

    of the clusters.

    Scale

    Cluster analysis is ideally suited to interval scaled

    variables, because Euclidean distance is a commonly

    used distance measure in the clustering process.

    But nominal and ordinal level data can be used after standardisation if appropriate. This may also necessitate

    the use of other measures of distance, more appropriate

    with the scales of variables being used. But this should

    be done with care. In general, it is a good idea to

    standardise the variables before clustering, if the units

    of measurement are radically different.
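
Standardising before clustering can be sketched as a z-score transformation of each variable. The example below is illustrative only: the income and rating columns are hypothetical, and the population standard deviation is used here as an assumption (statistical packages may divide by n - 1 instead).

```python
from statistics import mean, pstdev


def standardise(column):
    """Convert a column to z-scores: subtract the mean, divide by the SD."""
    m = mean(column)
    s = pstdev(column)
    if s == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(x - m) / s for x in column]


# Hypothetical variables measured in radically different units.
income = [12000, 18000, 25000, 40000, 55000]  # currency units
rating = [1, 3, 2, 5, 4]                      # 1-5 agreement scale

z_income = standardise(income)
z_rating = standardise(rating)

print([round(z, 2) for z in z_income])
print([round(z, 2) for z in z_rating])
```

Without this step, a distance measure such as Euclidean distance would be dominated by the income column simply because its numbers are larger.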

    Statistical Tests

    As mentioned briefly earlier, some statistical tests

    for cluster analysis are available. But their validity

    being questionable, caution is recommended in

    using either ANOVA or any other tests.

    A general caution about cluster analysis itself is

    that it tends to produce different results with

    different methods and some methods are quite

    vulnerable to errors in data. So, the stability of the

    clusters can be checked through splitting the

    sample and repeating the cluster analysis.

    Slide 15 Contd...