From User Needs to Community Health: Mining User Behaviour to Analyse Online Communities

Dr. Matthew Rowe, School of Computing and Communications. @mrowebot

Invited Talk @ 1st Workshop on Quality, Motivation and Coordination, International Conference on Social Informatics 2013. Kyoto, Japan

About Me


Timeline:

• Undergrad, 2002-2006: M.Eng Software Engineering
• Postgrad, 2006-2010: Ph.D. Computer Science
• Postdoc, 2010-2012: Postdoc Research Associate
• Lecturing, 2012-now: Lecturer in Social Computing

Research Interests


• Data: Semantics, Social networks, Digital Identity
• Prediction: Forecasting + Classification, Data Mining, Disambiguation
• Machines: Automating Processes, Modelling Social Systems, Artificial Intelligence

http://scholar.google.com/citations?user=rhyR4_kAAAAJ

Collaborators



Harith Alani. Senior Lecturer, Knowledge Media Institute, The Open University, UK. http://people.kmi.open.ac.uk/harith/

Miriam Fernandez. Research Associate, Knowledge Media Institute, The Open University, UK. http://kmi.open.ac.uk/people/member/miriam-fernandez

Conor Hayes. Senior Research Fellow, Digital Enterprise Research Institute, Galway, Ireland. http://www.deri.ie/users/conor-hayes

Marcel Karnstedt. Senior Postdoctoral Researcher, Digital Enterprise Research Institute, Galway, Ireland. http://www.marcel.karnstedt.com/

Outline


• Part I: Online Communities and User Behaviour
  - Define: online communities, user behaviour
  - The potential for examining user behaviour
• Part II: Comparing User Behaviour and User Needs
  - Collecting users’ needs in online communities
  - Linking needs to behaviour
• Part III: Predicting Community Health from User Behaviour
  - Mining roles from user behaviour
  - Community health forecasting from collective behaviour

Part I: Online Communities and User Behaviour



Defining Online Communities


a) Distinct user containers in which users discuss a given topic
  - E.g. message board forums
  - E.g. question-answering systems

b) Latent grouping of users by some common attribute
  - E.g. semantic web community
  - E.g. social network clusters with high social homophily

• This talk focuses on: a) User containers


BT (a British telecommunications firm) uses online communities to enable consumers to provide support to other consumers

The BBC News website provides comments sections to encourage user engagement with the news

Question-answering systems allow communities of ‘knowledgeable’ users to ask questions and provide answers

Why Provide Online Communities?


• Increase Customer Loyalty
• Raising Brand Awareness
• Spreading through Word of Mouth
• Facilitating Idea Generation
• Understanding Product Issues

Managing Online Communities


• Online communities incur significant investments:
  - Hosting and bandwidth:
    - Cost (time + money) grows linearly with popularity
  - Community management:
    - Settling disputes
    - Encouraging engagement within the communities
• Common questions arise:
  - ‘How do I know if my community is healthy?’
  - ‘What changes in the community lead to it becoming unhealthy?’

How do I know if my community is ‘healthy’?

• Approach 1: Needs Satisfaction
  - Identify users’ needs for the community
  - Analyse users to see if their needs have been met
• Approach 2: Numerical Health Measures
  - Determine suitable measures for community health (e.g. churn rate)
  - Analyse these measures over time to see if the community is remaining healthy, or not

Analysing User Behaviour


• Online communities are behavioural ecosystems
  - Prevalent user behaviour can impact the behaviour of other users (Preece, 2000)
• Defining behaviour: ‘the way’, ‘tangible measures derived from actions performed by and upon a user’


Behaviour Features

‘tangible measures of actions performed by and upon a user’

• Initiation
  - The extent to which users begin discussions in a community
• Contribution
  - The extent to which the user is providing content
• Popularity
  - Proportion of the community that responds to the user
• Engagement
  - Proportion of the community that a user responds to
• Focus Dispersion
  - Variance of the user’s interests across topics
• Quality
  - Reception of the user’s content by other users

[Diagram: User, Post, Forum]

Part II: Comparing User Behaviour and User Needs



Maslow’s Hierarchy of Needs



How does this hierarchy resonate with online community users?

User Needs in Online Communities


• Users have different needs for participating in an online community:
  - To create content and share information
  - To communicate with other users
  - To ask questions
  - To collaborate with other users
  - To help other users resolve problems and issues
  - To discuss ideas
• We wanted to find out how important the above needs were to community users…


Dataset 1: IBM Connections

• Enterprise social software suite
  - Communities within enterprises
• Anonymised dataset (Jan 2010 to April 2011)
  - #Communities of Practice (CoP): 100
  - #Team Communities (Team): 72
  - #Technical Support (Tech): 14
• Community type labels provided by Muller et al. (2012)

Understanding Users’ Needs on IBM Connections


• Surveyed 186 users about their needs
  - Spanning the aforementioned community types
  - 150 responses
• Likert scale (1-5) for agreement with statements
• Examples included:
  - How often do you do the following?
    - Browse for information, Search for information, etc.
  - Rate how important the community features are to you
    - Receiving recommendations, ability to filter information, etc.

Users’ Needs on IBM Connections

Ranked Community Features:

D3.1: Report on Social, Technical and Corporate Needs in Online Communities. M Rowe, H Alani, S Angeletou and G Burel. ROBUST Deliverable 3.1. (2012)

User Behaviour on IBM Connections


• Measured the behaviour of users across the three IBM Connections community types

feature in our case, as the value limit is iteratively increased. In essence, it allows us to see how a feature is distributed across its values and whether there is a skew towards the feature being lower or higher in different community types. Let $I_{x_i \le t}$ define an indicator function that returns 1 if the value of $x_i$ is less than or equal to $t$ and 0 otherwise, then the ECDF is defined, using an increasing value range for $t$ between the feature's minimum and maximum value, as:

\[ \frac{1}{n} \sum_{i=1}^{n} I_{x_i \le t} \qquad (2) \]

We omit the ECDF plots for the macro features, as they show no clear differences between the distributions. Indeed, we found the behaviour of the different community types to appear consistent when observing the macro-level attributes of a community. To assess the quantitative differences between the distributions, we used the Kolmogorov-Smirnov two-sample test to compare, in a pairwise fashion, the induced ECDFs - e.g. comparing the distribution of Seeds between CoP, Team and Tech. This test returns the maximum deviation between the distributions and the p-value of the divergence, thereby allowing us to gauge the significance of the divergence.

Table 1. Mean and Standard Deviation (in parentheses) of macro-features within the different community types

Feature      CoP               Team              Tech
Seeds        7.094 (15.601)    7.128 (15.622)    6.680 (13.076)
Non-seeds    3.298 (9.418)     3.397 (9.594)     3.390 (8.896)
Users        4.041 (6.669)     4.024 (6.616)     4.172 (6.767)

The differences between the feature distributions are presented in Figure 2, where we compare the ECDF of each of the macro features. The bar charts indicate the lack of deviance between the distributions. The largest appears to be where Seeds are concerned, as there is a marked difference between the Tech communities and the two other types - this difference is found to be significant at α = 0.05. We also find the difference between Tech communities and the other types to be significant, again at a significance level of α = 0.05, when assessing the Non-seeds distribution. As we have demonstrated, the differences between the communities in terms of their macro features are minimal, in particular when considering the empirical cumulative distribution functions. In the next section, we extend this analysis to the behaviour exhibited by community users and how that differs between the types of communities, thereby delving deeper into the implicit dynamics of the communities.

[Figure 2. Maximum Deviation (D) between ECDFs from disparate community types and macro features, measured using the Kolmogorov-Smirnov test. Panels: Seeds, Non-Seeds, Users; pairwise comparisons of cop, team and tech.]

Micro Features

We now inspect the differences between community types in terms of the micro features. Table 2 contains the mean and standard deviation for each community type and feature. For Focus Dispersion we find that CoP has the highest value - significant at α < 0.001 - indicating that users of that type of community tend to disperse their activity across many different communities. Conversely, for Tech communities this value is lowest, where users are focussed on just participating in a selection of communities. For Initiation, Table 2 indicates that Team communities have a much higher mean (and standard deviation) than the other community types - also significant at α < 0.001. This could be due to such communities requiring users to work together, often on a shared goal, such as developing a product for a client, therefore more ideas are shared through forum posts and blog entries.

The mean of the third micro feature, Contribution, is highest for CoP (but not significantly higher than the others) indicating that more initiated content is interacted with than in the other communities. Popularity is higher in Team and Tech communities, but not significantly, than in CoP, suggesting that although users of the latter community provide more contributions, it is with content published by fewer users. For Engagement the mean is significantly highest - at α < 0.001 - for Team indicating that users tend to participate with more users in these communities than the others.

Table 2. Mean and Standard Deviation (in parentheses) of the distribution of micro features within the different community types

Feature        CoP                Team               Tech
Focus Dis’     1.682 (1.680)      1.391 (1.581)      1.382 (1.534)
Initiation     7.788 (21.525)     13.235 (23.361)    3.088 (6.676)
Contribution   26.084 (77.607)    21.130 (72.298)    11.753 (17.182)
Popularity     1.660 (3.647)      2.302 (2.900)      2.286 (3.920)
Engagement     1.016 (1.556)      1.948 (2.324)      1.036 (1.575)

We induce an empirical cumulative distribution function (ECDF) for each micro feature within each community and then qualitatively analyse how the curves of the functions differ across communities. For instance, in the case of Figure 3 we see that for Focus Dispersion Tech communities have the highest proportion of focussed users (i.e. where entropy is 0). This indicates that users are interested in concentrating in those communities alone for discussing support requests and asking/answering questions to specific topics. For CoP the users are more dispersed, indicated by the low proportion of users who have an entropy of 0 and the low curve of this

Table annotations: each cell gives the mean of the behaviour feature, with the standard deviation in parentheses.

Behaviour analysis across different types of Enterprise Online Communities. M Rowe, M Fernandez, H Alani, I Ronen, C Hayes and M Karnstedt. In the proceedings of the Web Science Conference. Evanston, US. (2012)
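The ECDF and the Kolmogorov-Smirnov maximum deviation (D) described in the excerpt above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the sample values are hypothetical, and only the D statistic is computed, not the p-value.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function t -> proportion of values in sample that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(t):
        # bisect_right counts how many sorted values are <= t
        return bisect_right(ordered, t) / n

    return cdf


def ks_statistic(sample_a, sample_b):
    """Maximum absolute deviation D between the ECDFs of two samples."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    points = set(sample_a) | set(sample_b)
    return max(abs(cdf_a(t) - cdf_b(t)) for t in points)


# Hypothetical per-user Initiation values for two community types.
team = [0, 2, 5, 13, 40]
tech = [0, 1, 1, 3, 6]
print(ks_statistic(team, tech))
```

A D close to 0 means the two distributions are nearly identical; a D close to 1 means they barely overlap.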

User Behaviour on IBM Connections


[Figure: empirical cumulative distribution functions, CDF(x), of the micro features (Focus Dispersion, Contribution, Initiation, Popularity, Engagement) for the cop, team and tech community types]

Linking Users’ Needs to User Behaviour

• Questionnaire questions related to different behaviour aspects (initiation, contribution, etc.)
• Mapped questions to these aspects:
  - E.g. Initiation questions included:
    - How often do you ask a question?
    - How often do you create content?
    - How often do you announce work events and news?
• Resulted in an average Likert-scale response value per behaviour aspect for each community type
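The mapping-and-averaging step can be sketched as follows. The question identifiers, the contribution questions and the response values are hypothetical; only the idea of averaging Likert values (1-5) over the questions mapped to each behaviour aspect, per community type, comes from the slide.

```python
from statistics import mean

# Hypothetical mapping from behaviour aspect to questionnaire item ids.
ASPECT_QUESTIONS = {
    "initiation": ["ask_question", "create_content", "announce_news"],
    "contribution": ["reply_to_post", "comment_on_content"],
}


def aspect_scores(responses, aspect_questions=ASPECT_QUESTIONS):
    """Average Likert values per (community type, aspect).

    responses: iterable of (community_type, answers), where answers maps
    a question id to a Likert value between 1 and 5.
    """
    pooled = {}
    for community_type, answers in responses:
        for aspect, questions in aspect_questions.items():
            values = [answers[q] for q in questions if q in answers]
            pooled.setdefault((community_type, aspect), []).extend(values)
    # Skip aspects for which a community type gave no answers at all.
    return {key: mean(values) for key, values in pooled.items() if values}


# Hypothetical responses from three users.
responses = [
    ("Team", {"ask_question": 3, "create_content": 4, "announce_news": 2, "reply_to_post": 4}),
    ("Team", {"ask_question": 1, "create_content": 2, "reply_to_post": 2, "comment_on_content": 3}),
    ("Tech", {"ask_question": 5, "reply_to_post": 1}),
]
scores = aspect_scores(responses)
```

Pooling all answers before averaging weights each answered question equally; averaging per respondent first would be an equally defensible choice.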

Linking Users’ Needs to User Behaviour

nities and 58 for Tech communities. We did not circulate the questionnaire to any Idea lab or Recreation communities as they were extremely sparse in our dataset.

To contrast the results of the previous empirical analysis with the user needs extracted from the questionnaire, we first performed a mapping between each micro feature - e.g. Initiation, Contribution, etc. - and a subset of questions describing the feature. We chose to omit macro features due to the limited insights that such features provided. We have placed the mapping online for the reader’s benefit (see footnote 10). For example, Initiation was described using questions like: How often do you ask a question? How often do you create content? How often do you announce work news and events?, etc. Given this mapping we could then derive an average score for each micro feature based on the questionnaire responses. Due to our use of the Likert-type scale such averaging was feasible by taking the response values (given that these ranged from 1 to 5) and taking the mean over those for all community type responses - e.g. taking the mean of the responses for all Initiation questions for the 95 CoPs. The set of results can be seen in Table 4.

Table 4. Mean and standard deviation (in parentheses) values of micro-features obtained using the questionnaires for the different community types

Feature        CoP              Team             Tech
Focus Dis’     4.019 (0.093)    3.055 (0.426)    4.070 (0.070)
Initiation     2.483 (0.838)    2.587 (0.838)    2.243 (0.873)
Contribution   3.239 (0.926)    3.202 (1.016)    3.158 (0.945)
Popularity     2.875 (0.070)    3.084 (0.168)    2.104 (0.173)
Engagement     2.844 (0.539)    3.027 (0.588)    2.406 (0.522)

As Table 4 demonstrates, the findings from the analysis highly correlate with what users expressed to be relevant for each community type. We previously found that high levels of Initiation and Contribution are discriminative factors of Team and CoP communities with respect to Tech communities. Additionally, by looking at the behaviour distributions of these features, we find that higher levels of Initiation are more common for Team communities, while higher levels of Contribution are more common for CoP communities. Collaboration is a strong element for both community types, either for sharing common interests as in the case of CoPs or for sharing a common task or goal in the case of Teams. However, in Team communities the collaboration is driven by the task, and this may require frequent uploads of pieces of work to the community (in the form of wiki pages, blog entries or forum announcements) given the higher level of Initiation. On the other hand, CoPs are driven by the need to share common interests or practices and therefore discussions about the content posted in a blog, wiki, or forum thread constitute a more relevant factor. This correlates with our findings from the macro features analysis, where Team communities have the highest levels of seed and non-seed posts (i.e. posts that do not generate a reply). As mentioned before, in these communities content initiations may be done as part of the task, but not with the aim of generating a discussion. As Table 4 describes, user needs corroborate these facts. Average numbers for Initiation and Contribution are higher for CoPs and Team communities than for Tech communities. Additionally, we also see that users consider Initiation a more relevant factor for Team communities, while Contribution is considered a more relevant factor for CoP communities.

Another insight that emerged from the analysis, and is corroborated by the user questionnaires, is the fact that, over the three different community types, Team communities show the highest levels of Initiation, Popularity and Engagement. By intuition, in Team Communities each member needs to interact with other members of the team in order to achieve their common goal, a key collaborative property that is missing from the two remaining community types. These interactions across team members make Popularity and Engagement discriminative factors of Team communities. Moreover, as shown in Figure 4, while Contribution and Initiation discriminate CoPs and Team communities from Tech communities, Popularity is the factor that better discriminates CoPs from Team communities.

For Focus Dispersion the findings from the analysis and the questionnaire differ slightly. Our analysis and the users’ opinions agree on the fact that Focus Dispersion is a discriminative factor for CoP communities, i.e. users of CoP communities tend to disperse their activity across many different topics. In the Tech communities the diversity of expertise was found to be a valuable community attribute in the questionnaire responses; one would therefore have anticipated that the mean of Focus Dispersion would be higher than for other community types in our previous empirical analysis task - see Table 2. However, this was not the case. The reason for this is the derivation of Focus Dispersion, given that this feature was engineered by using all posts by a user such that the content she initiated - e.g. creating a wiki page - and contributed to - e.g. editing a wiki page - was pooled together. As a consequence, initiations could bias the mean of the distribution for Tech communities. For example, it is common that users who initiate a forum thread are asking for information, but do not share the knowledge of the community - i.e. users who are novices for the particular community topic. As future work we plan to divide the distributions explored previously into technology-dependent micro features, thereby yielding a Focus Dispersion measure for forum replies that captures the diversity of topics that users responding to forum threads have - such replies often denote answers in Tech communities.

Footnote 10: http://socsem.open.ac.uk/WebScience2012/Association-of-microfeatures-with-questions.html

CONCLUSIONS AND FUTURE WORK

Enterprise communities are provided to support a variety of purposes with the common ground of economic benefit. Previous work by Muller et al. [10] divided enterprise communities on IBM Connections into distinct types, finding that each community type had a specific intention and pattern of social media tool usage. In this paper, we explored the question: How do enterprise community types differ from one another? We performed both quantitative and qualitative analyses and in doing so have provided insights into the differences between community types and how those are re-


User Needs from Questionnaire Responses: Table 4

Observed User Behaviour: Table 2

Understanding Needs Satisfaction


• Agreement between users’ needs and how users behave
  - Reflected by the different needs values across the different community types
• Limitations of this approach:
  1. Expensive to collect survey responses
     - Took around 6 months between questionnaire publication and results compilation
     - Required contacting many users
  2. Implicit biases in reporting across community types
     - Team communities had the lowest % of responses

Part III: Predicting Community Health from User Behaviour



Community Health and User Behaviour


• Management of communities is helped by:
  - Understanding how behaviour and health are related
    - How user behaviour changes are associated with health
  - Predicting health changes
    - Enables early decision making on community policy
• Can we accurately detect changes in community health from the behaviour of its users?

Dataset 2: SAP Community Network


• Collection of SAP forums in which users discuss:
  - Software development, SAP Products, Usage of SAP tools
• Points system for awarding best answers
• Provided with a dataset covering 33 communities:
  - Spanning 2004 - 2011
  - 95,200 threads, 421,098 messages, 32,942 users

[Figure: Post Count over time, 2004 to 2011]

User Behaviour Features on SAP


• Focus Dispersion
  - Measure: Forum entropy of the user
• Engagement
  - Measure: Out-degree proportioned by potential maximal out-degree
• Popularity
  - Measure: In-degree proportioned by potential maximal in-degree
• Contribution
  - Measure: Proportion of thread replies created by the user
• Initiation
  - Measure: Proportion of threads that were initiated by the user
• Quality
  - Measure: Average points per post awarded to the user
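The six measures above could be computed from a flat list of posts roughly as follows. This is a sketch under stated assumptions: the `Post` record and its fields are hypothetical, the "potential maximal" degree is taken to be the number of other users, and the entropy uses log base 2 (the slides do not state the base).

```python
from collections import Counter
from dataclasses import dataclass
from math import log2
from typing import Optional


@dataclass
class Post:
    author: str
    forum: str
    reply_to: Optional[str]  # author being replied to; None for a thread starter
    points: int = 0


def behaviour_features(posts, user):
    users = {p.author for p in posts}
    others = len(users) - 1  # assumed potential maximal in/out-degree
    mine = [p for p in posts if p.author == user]
    starters = [p for p in posts if p.reply_to is None]
    replies = [p for p in posts if p.reply_to is not None]

    # Focus dispersion: entropy of the user's posts over forums.
    counts = Counter(p.forum for p in mine)
    total = sum(counts.values())
    dispersion = sum((c / total) * log2(total / c) for c in counts.values())

    # Out-neighbours: users this user replied to; in-neighbours: users who replied to them.
    out_neighbours = {p.reply_to for p in mine if p.reply_to not in (None, user)}
    in_neighbours = {p.author for p in replies if p.reply_to == user and p.author != user}

    return {
        "focus_dispersion": dispersion,
        "engagement": len(out_neighbours) / others if others else 0.0,
        "popularity": len(in_neighbours) / others if others else 0.0,
        "contribution": sum(p.author == user for p in replies) / len(replies) if replies else 0.0,
        "initiation": sum(p.author == user for p in starters) / len(starters) if starters else 0.0,
        "quality": sum(p.points for p in mine) / len(mine) if mine else 0.0,
    }


# Hypothetical posts across two forums.
posts = [
    Post("ann", "abap", None),
    Post("bob", "abap", "ann", points=10),
    Post("cat", "abap", "ann"),
    Post("bob", "hana", None),
    Post("ann", "hana", "bob", points=2),
]
features = behaviour_features(posts, "ann")
```

In practice the features would be computed per time window, so that changes in a user's behaviour can be tracked from one window to the next.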

Inferring Roles from User Behaviour

1. Construct features for community users at a given time step
2. Derive bins using equal frequency binning
   - Popularity-low cutoff = 0.5, Initiation-high cutoff = 0.4
3. Use skeleton rule base to construct rules using bin levels
   - Popularity = low, Initiation = high -> roleA
   - Popularity < 0.5, Initiation > 0.4 -> roleA
4. Apply rules to infer user roles and community composition
5. Repeat 1-4 for following time steps

Community Analysis through Semantic Rules and Role Composition Derivation. M Rowe, M Fernandez, S Angeletou and H Alani. In the Journal of Web Semantics (2012)
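Steps 2 to 4 of the role inference procedure can be sketched as below. The helper names, the single-rule skeleton base and the feature values are hypothetical; the cited paper expresses its rules semantically, which this plain-Python sketch does not attempt to reproduce.

```python
from collections import Counter


def equal_frequency_cutoffs(values, bins=3):
    """Return bins-1 cutoffs so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def level(value, cutoffs, names=("low", "mid", "high")):
    """Map a value to a level: below the first cutoff is low, and so on."""
    for name, cutoff in zip(names, cutoffs):
        if value < cutoff:
            return name
    return names[-1]


# Hypothetical skeleton rule base: required feature levels -> role label.
SKELETON_RULES = [
    ({"popularity": "low", "initiation": "high"}, "roleA"),
]


def infer_role(features, cutoffs, rules=SKELETON_RULES, default="unlabelled"):
    levels = {name: level(value, cutoffs[name]) for name, value in features.items()}
    for conditions, role in rules:
        if all(levels.get(feature) == wanted for feature, wanted in conditions.items()):
            return role
    return default


# Hypothetical features for six users at one time step.
users = {
    "u1": {"popularity": 0.1, "initiation": 0.9},
    "u2": {"popularity": 0.5, "initiation": 0.4},
    "u3": {"popularity": 0.9, "initiation": 0.1},
    "u4": {"popularity": 0.2, "initiation": 0.8},
    "u5": {"popularity": 0.6, "initiation": 0.5},
    "u6": {"popularity": 0.7, "initiation": 0.2},
}
cutoffs = {
    name: equal_frequency_cutoffs([f[name] for f in users.values()])
    for name in ("popularity", "initiation")
}
roles = {uid: infer_role(f, cutoffs) for uid, f in users.items()}
composition = Counter(roles.values())  # community composition for this time step
```

Because the cutoffs are re-derived at every time step, the same rule (Popularity = low, Initiation = high) translates into different numeric thresholds as the community changes.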

Mining Roles (Skeleton rule base compilation)


1. Select the tuning segment
2. Discover correlated behaviour dimensions
   - Removed Engagement and Contribution, kept Popularity (Pearson r > 0.75, p < 0.01)
3. Cluster users into behavioural groups
4. Derive role labels for clusters
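For step 3, the paper excerpt that follows selects the clustering by its silhouette coefficient. A minimal sketch of that score is given here, assuming Euclidean distance and treating singleton clusters as scoring 0 (both are assumptions); the clustering algorithms themselves are not reproduced, and the points and labels are hypothetical.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean of s_i = (b_i - a_i) / max(a_i, b_i); needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a_i: average distance to the other members of the same cluster
        # (the point's zero distance to itself is included in the sum only).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b_i: smallest average distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return mean(scores)


# Hypothetical two-dimensional behaviour vectors and two candidate labelings.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
print(silhouette(points, good), silhouette(points, bad))
```

Model selection then amounts to computing this score for every candidate algorithm and value of k, and keeping the combination with the highest score.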

thereby separating users based on their behaviour and discov-ering distinct roles on the platform. We ran three differentunsupervised clustering algorithms: Expectation-Maximization(EM), K-means and Hierarchical Clustering, over the 6-months’ tuning segment. The model selection phase not onlyrequires choosing the correct clustering method but also se-lecting the optimum number of clusters to use - providing thisvalue as a parameter k. To judge the best model - i.e. clustermethod and number of clusters - we measure the cohesion andseparation of a given clustering as follows: For each clusteringalgorithm (!) we iteratively increase the number of clusters(k) to use where 2 ! k ! 30. At each increment of k werecord the silhouette coefficient produced by !, this is definedfor a given element (i) in a given cluster as:

si =bi ! ai

max(ai, bi)(3)

Where ai denotes the average distance to all other itemsin the same cluster and bi is given by calculating the averagedistance with all other items in each other distinct cluster andthen taking the minimum distance. The value of s i rangesbetween "1 and 1 where the former indicates a poor cluster-ing where distinct items are grouped together and the latterindicates perfect cluster cohesion and separation. To derivethe silhouette coefficient (s(!(k)) for the entire clusteringwe take the average silhouette coefficient of all items. Wefind that the best clustering model and number of clusters touse is K-means with 11 clusters. We found that for smallercluster numbers (k = [3, 8]) each clustering algorithm achievescomparable performance, however as we begin to increase thecluster numbers K-means improves while the two remainingalgorithms produce worse cohesion and separation.3) Deriving Role Labels: Provided with the most cohesive

and separated clustering of users we then derive role labelsfor each cluster. Role label derivation first involves inspectingthe dimension distribution in each cluster and aligning thedistribution with a level mapping (i.e. low, mid, high). Thisenables the conversion of continuous dimension ranges intodiscrete values which our rule-based approach requires in theSkeleton Rule Base. To perform this alignment we assess thedistribution of each dimension and derive boundary points forthe three feature levels using an equal-frequency binning ap-proach. The distribution of each dimension is shown in Figure2 for each of the 11 induced clusters together with the levelboundaries. We assess the distribution of each feature for eachcluster against the levels derived from the equal-frequencybinning of each feature, thereby generating a feature-to-levelmapping. This mapping is shown in Table II where certainclusters are combined together as they have the same feature-to-level mapping patterns - i.e. 2,5 and 8,9.In order to derive the role labels for each cluster we use

a maximum-entropy decision tree to divide the clusters intobranches that maximise the dispersion of dimension levels.Figure 3 shows the separation of the clusters from a completegrouping into a single cluster, or merged clusters in the case of2,5 and 8,9, in each leaf. To perform the separation at a given

Fig. 2. Boxplots of the feature distributions in each of the 11 clusters.Feature distributions are matched against the feature levels derived from equal-frequency binning

TABLE IIMAPPING OF CLUSTER DIMENSIONS TO LEVELS. THE CLUSTERS ARE

ORDERED FROM LOW PATTERNS TO HIGH PATTERNS TO AID LEGIBILITY.

Cluster Dispersion Initiation Quality Popularity1 L L L L0 L M H L6 L H M M10 L H M H4 L H H M2,5 M H L H8,9 M H H H7 H H L H3 H H H H

decision node, we measure the entropy of the dimensions and their levels across the clusters; we then choose the dimension with the largest entropy. This is defined formally as:

$$H(dim) = -\sum_{level}^{|levels|} p(level \mid dim) \log p(level \mid dim) \tag{4}$$
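Equation (4) can be applied directly to the rows of Table II. The sketch below assumes that p(level|dim) is estimated as the relative frequency of each level among the clusters at a node, and that merged clusters (2,5 and 8,9) count as a single row; neither detail is stated explicitly in the text, and the helper names are hypothetical.

```python
import math
from collections import Counter

DIMS = ("Dispersion", "Initiation", "Quality", "Popularity")

# Levels per cluster in DIMS order, transcribed from Table II.
TABLE_II = {
    "1": "LLLL", "0": "LMHL", "6": "LHMM", "10": "LHMH", "4": "LHHM",
    "2,5": "MHLH", "8,9": "MHHH", "7": "HHLH", "3": "HHHH",
}


def level_entropy(clusters, dim):
    """H(dim) = -sum over levels of p(level|dim) * log p(level|dim)."""
    idx = DIMS.index(dim)
    counts = Counter(TABLE_II[c][idx] for c in clusters)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def best_dimension(clusters, dims=DIMS):
    """Pick the dimension whose levels are most evenly spread over the clusters."""
    return max(dims, key=lambda d: level_entropy(clusters, d))


root = best_dimension(list(TABLE_II))
```

Under these assumptions the root split falls on Quality, which agrees with the cluster 0 path reported in the text, where the first test is quality=high.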

Fig. 3. Maximum-entropy decision tree used to segment the clusters into minimal-distance paths. The paths are used to generate the role labels for each respective cluster.

We perform this process until single clusters, or the previously merged clusters, are in each leaf node and then use the path to the root node to derive the label. For instance, for cluster 0 the path from the root node to the leaf node is quality=high, dispersion=low, initiation=medium, thereby deriving the role label Focussed Expert Participant for the cluster. In the label, focussed describes the focus dispersion of the role - i.e. it is low and therefore not distributed, expert

[Figure: boxplots of Dispersion, Initiation, Quality and Popularity for each cluster (x-axis: Cluster)]

•  1 - Focussed Novice
•  2,5 - Mixed Novice
•  7 - Distributed Novice
•  3 - Distributed Expert
•  8,9 - Mixed Expert
•  0 - Focussed Expert Participant
•  4 - Focussed Expert Initiator
•  6 - Knowledgeable Member
•  10 - Knowledgeable Sink

Community Health Indicators

¨  From the literature there is no single agreed measure of ‘community health’
    ¤  Emergent dimensions: loyalty, participation, activity, social capital

¨  Indicator 1: Churn Rate (loyalty)
    ¤  Proportion of users that leave rather than remain

¨  Indicator 2: User Count (participation)
    ¤  Number of active contributors

¨  Indicator 3: Seeds-to-Non-Seeds Posts Proportion (activity)
    ¤  Ratio of thread starters that received replies (seeds) to those that did not (non-seeds)

¨  Indicator 4: Clustering Coefficient (social capital)
    ¤  Average of users’ clustering coefficients
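Three of the four indicators can be computed from very little data, as the sketch below suggests. It is illustrative only: the input shapes (a list of post records, a map from thread starter to reply count, an undirected reply graph) and the convention that users with fewer than two neighbours have a coefficient of 0 are assumptions, and churn rate is left out because it depends on how a user's final post is defined.

```python
from itertools import combinations


def user_count(posts):
    """Number of distinct authors who posted in the time window."""
    return len({post["author"] for post in posts})


def seeds_to_non_seeds(reply_counts):
    """Ratio of thread starters with at least one reply to those with none."""
    seeds = sum(1 for replies in reply_counts.values() if replies > 0)
    non_seeds = len(reply_counts) - seeds
    return seeds / non_seeds if non_seeds else float("inf")


def average_clustering(graph):
    """Mean local clustering coefficient of an undirected graph given as
    {node: set of neighbours}."""
    coefficients = []
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            coefficients.append(0.0)  # assumed convention for low-degree users
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


# Hypothetical toy community: a triangle of users plus one peripheral user.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
posts = [{"author": "a"}, {"author": "b"}, {"author": "a"}, {"author": "d"}]
reply_counts = {"thread1": 3, "thread2": 0, "thread3": 1}

clustering = average_clustering(graph)
participation = user_count(posts)
activity = seeds_to_non_seeds(reply_counts)
```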



Experiment 1: Health Indicator Regression

¨  Community management is helped by understanding the relation between behaviour and health

¨  Experimental Setup:
    ¤  Health Indicator Linear Regression Models (per community)
        ▪  Independent vars: 9 roles with composition proportions as values @ t
            E.g. @ t = k: Mixed Expert = 0.05, Distributed Novice = 0.51, etc.
        ▪  Dependent var: health indicator (e.g. churn rate) @ t
            E.g. @ t = k: Churn Rate = 0.21
    ¤  PCA of each community model using the model’s coefficients
        ▪  Look for a common health composition pattern
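The per-community regression step can be sketched with ordinary least squares. The example below is a simplified stand-in rather than the experiment itself: it uses two roles instead of nine, synthetic values built from known coefficients so that the fit can be checked, and hypothetical helper names. The PCA step over the fitted coefficients is not shown.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(features, target):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic role compositions per time step: (Mixed Expert, Distributed Novice).
roles = [(0.05, 0.51), (0.10, 0.45), (0.08, 0.60), (0.12, 0.40), (0.03, 0.55)]
# Synthetic churn rate generated from known coefficients, purely for checking.
churn = [0.10 + 0.50 * me + 0.20 * dn for me, dn in roles]

intercept, b_mixed_expert, b_distributed_novice = fit_linear(roles, churn)
```

One caveat when scaling this to all nine roles: if the proportions sum to 1 at every time step they are collinear with the intercept, so either one role or the intercept would need to be dropped. The slides do not say how this was handled.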



Experiment 1: Health Indicator Regression Results

¨  Idiosyncratic Health Composition Patterns
    ¤  Divergence patterns between outlier communities

¨  No general pattern exists that describes the relation between roles and health


[Figure: four PCA scatter plots (PC1 vs PC2) of the per-community regression coefficients, one each for Churn Rate, User Count, Seeds / Non-seeds Prop and Clustering Coefficient; points are labelled with numeric community IDs (e.g. 101, 161, 470)]

Experiment 2: Health Change Detection

¨  Can we accurately and effectively detect positive and negative changes in community health from its composition of behavioural roles?

¨  Experimental Setup
    ¤  Binary classification of indicator change using logistic regression
    ¤  At t=k+1: predict increase or decrease in health indicator from t=k
    ¤  Time-ordered dataset:
        ▪  Features @ t=k+1: 9 roles with composition proportions as values
        ▪  Class @ t=k+1: positive (if increase from t=k), negative (if decrease)
        ▪  Divide dataset into 80/20 split maintaining time-ordering
    ¤  Evaluated using Area under the ROC Curve (AUC)
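The setup above can be sketched end to end in a few functions. This is an illustration under stated assumptions, not the experimental code: it uses one synthetic role feature instead of nine, a hand-rolled logistic regression trained by stochastic gradient descent, and hypothetical function names and data values.

```python
import math


def change_labels(indicator):
    """Label step k+1 as 1 if the indicator rose from step k, else 0."""
    return [1 if nxt > prev else 0 for prev, nxt in zip(indicator, indicator[1:])]


def time_ordered_split(rows, train_fraction=0.8):
    """Earliest rows train the model, latest rows test it; no shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Logistic regression fitted by plain stochastic gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
            b -= lr * (p - y)
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w, b


def score(model, x):
    """Linear score; monotone in the predicted probability, so enough for AUC."""
    w, b = model
    return b + sum(wi * xi for wi, xi in zip(w, x))


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic indicator series (11 steps) and one role proportion per step k+1.
churn = [0.20, 0.25, 0.22, 0.30, 0.28, 0.35, 0.31, 0.40, 0.38, 0.45, 0.41]
role_share = [[0.8], [0.2], [0.7], [0.3], [0.9], [0.1], [0.8], [0.2], [0.7], [0.3]]

labels = change_labels(churn)
train, test = time_ordered_split(list(zip(role_share, labels)))
model = train_logistic([x for x, _ in train], [y for _, y in train])
test_auc = auc([y for _, y in test], [score(model, x) for x, _ in test])
```

Splitting by position rather than at random keeps every test step later than every training step, which is what "maintaining time-ordering" requires.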



Experiment 2: Health Change Detection Results


[Figure: ROC curves (x-axis: FPR, y-axis: TPR) for Churn Rate, User Count, Seeds / Non-seeds Prop and Clustering Coefficient]

¨  ROC Curves surpass baseline for:
    ¤  Churn Rate: 20/25 forums
    ¤  User Count: 20/25 forums
    ¤  Seeds-to-Non-Seeds: 19/25 forums
    ¤  Clustering Coefficient: 17/25 forums

What makes Communities Tick? Community Health Analysis using Role Compositions. M Rowe and H Alani. In the proceedings of the Fourth IEEE International Conference on Social Computing. Amsterdam, The Netherlands. (2012)

To Summarise


Findings



¨  User Behaviour is closely aligned with users’ needs
    ¤  Although this is expensive to collect and analyse

¨  Accurate predictions of community health from behaviour
    ¤  Inferring roles from collective behaviour
    ¤  Forecasting from role compositions

¨  Community Managers can understand how their community will develop from user behaviour
    ¤  Requires model tuning per-community

Community Analysis through Semantic Rules and Role Composition Derivation. M Rowe, M Fernandez, S Angeletou and H Alani. In the Journal of Web Semantics (2012)

Current/Future Work: Lifecycles



¨  Limitation of role-composition approach is the use of platform-wide windowing:
    ¤  Lack of high-fidelity behaviour inspection per-user

¨  Lifecycle periods: user-specific stages of development

[Diagram: a user's lifetime, from First Post to Last Post, divided into equal activity periods 1, 2, 3, …, n, so that #posts is the same in each period]
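Dividing a lifetime into equal activity periods means cutting by post count rather than by elapsed time. A minimal sketch, assuming posts are represented by sortable timestamps and that any remainder is absorbed by the earliest periods (a choice the slide does not specify):

```python
def lifecycle_periods(post_times, n_periods):
    """Split a user's posts, in chronological order, into n_periods chunks
    holding the same number of posts; earlier chunks take any remainder."""
    ordered = sorted(post_times)
    size, extra = divmod(len(ordered), n_periods)
    periods, start = [], 0
    for i in range(n_periods):
        end = start + size + (1 if i < extra else 0)
        periods.append(ordered[start:end])
        start = end
    return periods


# Hypothetical timestamps (e.g. days since a user's first post).
posts = [0, 1, 2, 10, 11, 40, 41, 42, 90, 200]
periods = lifecycle_periods(posts, 5)
```

Each period here holds two posts, although the periods span very different lengths of time: a burst of early activity and a long quiet tail are treated as comparable stages.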

that we eschew time-comparative assessments of how a user is changing relative to earlier properties. To inform such cross-period assessment we examined the users’ in-degree, out-degree and term distributions across lifecycle periods by computing the cross-entropy of one probability distribution with respect to another distribution from an earlier lifecycle period, and then selecting the distribution that minimises cross-entropy. Assuming we have a probability distribution ($P$) formed from a given lifecycle period ($[t, t']$), and a probability distribution ($Q$) from an earlier lifecycle period, then we define the cross-entropy between the distributions as follows:

$$H(P, Q) = -\sum_{x} p(x) \log q(x) \tag{5}$$
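The period cross-entropy can be sketched as Equation (5) evaluated against each earlier period, keeping the minimum. The small constant `epsilon` used when an item never appeared in the earlier period is an assumption added here so that the logarithm is defined; the excerpt does not say how unseen items are handled, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def distribution(items):
    """Relative frequency of each item, e.g. of each user who replied."""
    counts = Counter(items)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def cross_entropy(p, q, epsilon=1e-9):
    """H(P, Q) = -sum_x p(x) log q(x); epsilon stands in for q(x) = 0."""
    return -sum(px * math.log(q.get(x, epsilon)) for x, px in p.items())


def period_cross_entropy(current, earlier_periods):
    """Minimum cross-entropy of the current period against each earlier one."""
    p = distribution(current)
    return min(cross_entropy(p, distribution(e)) for e in earlier_periods)


# Hypothetical in-degree data: who replied to a user in each lifecycle period.
earlier = [["ann", "bob"], ["cat", "dan"]]
current = ["ann", "bob", "ann"]
h = period_cross_entropy(current, earlier)
```

A low value means the current repliers were already seen in some earlier period; a period made up entirely of new repliers scores close to `-log(epsilon)`.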

In the same vein as the earlier entropy analysis, we derived the period cross-entropy for each platform’s users throughout their lifecycles and then derived the mean cross-entropy for the 20 lifecycle periods. Figure 2 presents the cross-entropies derived for the different platforms and user properties. We observe that for each distribution and each platform cross-entropies reduce throughout users’ lifecycles, suggesting that users do not tend to exhibit behaviour that has not been seen previously. For instance, for the in-degree distribution the cross-entropy gauges the extent to which the users who contact a given user at a given lifecycle stage differ from those who have contacted him previously, where a larger value indicates greater divergence. We find that consistently across the platforms, users are contacted by people who have contacted them before and that fewer novel users appear. The same is also true for the out-degree distributions: users contact fewer new people than they did before. This is symptomatic of community platforms where, despite new users arriving within the platform, users form sub-communities in which they interact and communicate with the same individuals. Figure 2(c) also demonstrates that users tend to reuse language over time and thus produce a gradually decaying cross-entropy curve.

[Figure panels: (a) In-degree, (b) Out-degree, (c) Lexical; x-axis: Lifecycle Stages (0 to 1); y-axis: Cross Entropy; series: Facebook, SAP, Server Fault]

Figure 2. Cross-entropies derived from comparing users’ in-degree, out-degree and lexical term distributions with previous lifecycle periods. We see a consistent reduction in the cross-entropies over time.

3) Community Contrasts (Community Cross-Entropy):

For the third inspection of user lifecycles and how user properties change, we examined how users compare with the platform in which they are interacting over the same time interval. We used the in-degree, out-degree and term distributions and compared them with the same distributions derived globally over the same time periods. For the global probability distributions we used the same means as for forming user-specific distributions, but rather than using the set of posts that a given user had authored ($P_{u_i}$) to derive the probability distribution, we instead used all posts. For instance, for the global in-degree distribution we used the frequencies of received messages for all users. Given the discrete probability distribution of a user from a time interval ($P_{[t,t']}$), and the global probability distribution over the same time interval ($Q_{[t,t']}$), we derived the cross-entropy as above between the distributions: $H(P_{[t,t']}, Q_{[t,t']})$.
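For the community variant, the reference distribution is built from all posts in the interval, so every item a user produced also appears in the global distribution and no smoothing is needed. The sketch below assumes the user's items are included in the global collection; the function names and the sample terms are hypothetical.

```python
import math
from collections import Counter


def distribution(items):
    """Relative frequency of each item (term, replier or recipient)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def community_cross_entropy(user_items, all_items):
    """H(P, Q) with P from one user's items and Q from everyone's items
    over the same time interval (user_items must be part of all_items)."""
    p = distribution(user_items)
    q = distribution(all_items)
    return -sum(px * math.log(q[x]) for x, px in p.items())


# Hypothetical term data for one interval.
user_terms = ["raid", "disk", "raid"]
other_terms = ["hello", "thanks", "disk", "hello"]
h = community_cross_entropy(user_terms, user_terms + other_terms)
```

The value grows as the user concentrates on terms that are rare in the community as a whole, which is the divergence the analysis tracks over lifecycle periods.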

As before we derived the community cross-entropy for each platform’s users over their lifetimes and then calculated the mean community cross-entropy for the lifecycle periods. Figure 3 presents the plots of the cross-entropies for the in-degree, out-degree and term distributions over the lifecycle periods. We find that for all platforms the community cross-entropy of users’ in-degree increases over time, indicating that a given user tends to diverge in his properties from users of the platform. For instance, for the community cross-entropy of the in-degree distribution the divergence towards later parts of the lifecycle indicates that users who reply to a given user differ from the repliers in the entire community. This complements cross-period findings from above where we see a reduction in cross-entropy, thus suggesting that users form sub-communities within which interaction is consistently performed (i.e. reduction in new users joining). We find a similar effect for the out-degree of the users, where divergence from the community is evident towards the latter stages of users’ lifecycles. The term distribution demonstrates differing effects however: for Facebook and SAP we find that the community cross-entropy reduces initially before rising again towards the end of the lifecycle, while for Server Fault there is a clear increase in community cross-entropy towards the latter portions of users’ lifecycles, suggesting that the language used by the users actually tends to diverge from that of the community in a linear manner. This effect is consistent with the findings of Danescu et al. [2] where users adapt their language to the community to begin with, before then diverging towards the end.

V. MINING LIFECYCLE TRAJECTORIES

Inspecting how communities of users develop, we have concentrated on assessments at the macro-level on each platform, examining how the social dynamics and lexical dynamics of communities of users have changed over time. We now turn to examining how individual users evolve throughout their lifecycle periods. Understanding how individual users develop over time in online community platforms allows for churners to be predicted, as we shall demonstrate in the following section through our experiments, and also

User Development



¨  Capture period-specific user properties (in period s):
    ¤  In-degree distribution
    ¤  Out-degree distribution
    ¤  Term distribution

Enabling: Churn prediction, stage-based recommendation

Mining User Lifecycles from Online Community Platforms and their Application to Churn Prediction. M Rowe. To appear in the proceedings of the International Conference on Data Mining. Dallas, US. (2013)

@mrowebot [email protected]

http://www.lancaster.ac.uk/staff/rowem/

Questions?

