Analysing Event Stream Dynamics in Two Mode Networks

Christoph Stadtfeld
IME Graduate School
KIT Karlsruhe Institute of Technology
Karlsruhe, Germany

Email: [email protected]

Andreas Geyer-Schulz
Information Services and Electronic Markets
KIT Karlsruhe Institute of Technology
Karlsruhe, Germany

Email: [email protected]

Abstract-Exponential Random Graph Models (ERGMs) are widely used to describe structural network dynamics using panel data. In this paper the private communication behaviour within a question and answer community is analysed. Using this example, it is shown how event based extensions of the standard ERGMs can be applied to combined communication and affiliation networks. It is tested to what extent communication patterns and affiliation patterns influence the choice of message receivers. To visualize change and evolution in event based networks, a sliding window approach is introduced and discussed.

Keywords-Two Mode networks, ERGM, event streams, question and answer communities

I. INTRODUCTION

Question and answer (Q&A) communities (like Yahoo! Answers, Lycos IQ, answers.com) have become very popular on the internet. People can easily (often even without registration) pose arbitrary questions. Members of these communities try to answer these questions quickly. Often, the only obvious incentives to answer questions are virtual points given to people who answer many questions. The more points someone has, the higher his or her virtual ranking (e.g. ranging from Newbie to Albert Einstein). But are there any other effects that make people stay in these groups? Are there, for example, community structures that can be revealed by looking at how actors write private messages to others? Or is most of this private communication just related to questions, such as further explanations or saying thank you?

Exponential random graph models (ERGMs) are a good way to model the decision processes of social actors in networks ([1], [2], [3]). Several recent developments may help to describe the dynamics of the given online community. First, there is new research on analysing event streams instead of working with panel data ([4], [5], [6]). Furthermore, ERGMs can be applied to multi-mode networks ([7], Snijders2009). The combination of ERGMs and event stream analysis makes it possible to visualize the evolution of parameter estimates over time ([8]). Finally, faster computer hardware makes it possible to estimate ERGM parameters for big networks, especially if they are very sparse.

In this paper an event based exponential random graph model is used to analyse what drives the dynamics of personal messages (PMs) sent between actors in a Q&A community. It is tested whether two-mode affiliation structures and community structures have an influence on how people interact.

Section II introduces the case study, a dataset of the Q&A community Lycos IQ, and identifies four phases of the community development. Section III explains the Markov process model with an embedded, adapted ERGM that models the decision processes of actors regarding message receivers. In section IV parameter estimates for the tested effects are given for different phases of the community development and interpreted. In section V a new approach is presented and discussed which makes it possible to visualize the evolution of decision patterns. Section VI summarizes and gives an outlook on further research.

II. CASE STUDY

A dataset of the Q&A community Lycos IQ is analysed to demonstrate the potential of an analysis with event based ERGMs. The question addressed is what drives the dynamics of private message (PM) writing. Some characteristics of the dataset are presented in section II-A. The event stream is explained in section II-B.

Actors in the community pose questions that get answered and are deleted later. Both actors and questions are modes in a two mode network. This affiliation network is combined with a communication network, which only connects nodes of type actor. Posing and answering questions and private communications between actors are modelled as different types of ties in this combined affiliation and communication network.

A. The dataset

The dataset describes a time span from December 2005 to June 2008. Regarding their communication behaviour, people in the Lycos IQ Q&A community behave very differently from those in other online social communities. First, the total number of members is big, but only a small subset of all actors is "active" at the same time because a lot of people only pose one question and leave quickly. There are 416,879 activated user accounts but 329,055 (79%) of them are "light accounts" that are just used to pose questions and cannot be used to write or receive private messages. Secondly, the communication within the community is assumed to be influenced by questions. A virtual rank in the community is only based on how often and how well a member answers questions.

There are 946,603 questions in the dataset with 2,996,446 answers. Although the dataset starts in December 2005, PMs are only logged from August 2006 on. Figure 1 shows how the activity concerning questions, answers and PMs changes over time.

The x-axis shows the values for each month, beginning with December 2005. The dotted line represents the number of answers, the dashed line the number of questions and the solid line the number of sent PMs. The y-axis gives the total amount divided by 1,000. Generally, the activity in the Q&A community increases. From this first visualisation, four different phases could be identified as shown in figure 1.

In the first phase (from December 2005 to the end of January 2007) there is only little activity in the dataset with a rather low growth rate. The number of private messages is low: they are only logged from August 2006 onwards, with a very low total number of a few hundred messages in the first months. This first phase of the Q&A community is called Initialisation.

The second phase was identified between February 2007 and October 2007 and is characterised by a rapidly increasing number of questions and answers and a slow increase in the number of PMs. Therefore it is called Growth.

In the third phase, the numbers of questions, answers and PMs seem to have reached a more or less stable level. Although there is a lot of variance between the months, the total number is always about 65,000 for questions and 210,000 for answers. The number of PMs is stable at a level of about 5,000 per month. Phase III ranges from November 2007 to February 2008 and is called Stabilisation.

The fourth phase is probably the most interesting one regarding the dynamics of private communication because the number of PMs rapidly increases, while the number of questions and answers is relatively stable. Phase IV ends, like the whole dataset, at the beginning of June 2008. The values for this last month are extrapolated as it was not completely logged. It will be tested whether community effects are the reason for this sudden and significant increase of messages from an average of about 5,000 per month to more than 30,000 messages in the last completely observed months. This phase is therefore called Community Growth. Whether this name is suitable (because the increasing number of messages is based on "community structures") will be tested in this paper.

| ω.time | ω.sender | ω.receiver | ω.type |
|---|---|---|---|
| 2007-07-07 14:10:47 | Anke | 283613 | question opened |
| 2007-07-07 14:10:51 | doc 44 | 283604 | answer |
| 2007-07-07 14:11:00 | larumoren | 283604 | answer |
| 2007-07-07 14:11:16 | Snooker01 | 283600 | answer |
| 2007-07-07 14:11:19 | mrs incredible | 270053 | question closed |
| 2007-07-07 14:11:31 | Nekoy | larumoren | message |
| 2007-07-07 14:11:42 | Anke | 283614 | question opened |

Table I. PART OF THE ANALYSED EVENT STREAM

B. Event stream

Events are any kind of dyadic interaction between two nodes in a network for which at least a timestamp is defined. Events may include more information, like an event type or an event intensity. The Lycos IQ dataset includes events of different types that can be used to describe the change in different layered networks. The change in these networks is well defined by the event stream and a set of change rules. As the state of these networks is known for each point in time, this approach uses a lot more information than, for example, aggregated panel data.

To analyse the PM dynamics in the database, four different event types were selected and transformed into an event stream with more than 5 million entries. Each of these entries (a row in the resulting database table) describes one event and consists of a timestamp, a sender, a receiver and an event type. An exemplary snapshot of the event stream is given in table I.

Senders are always actors, while receivers may either be actors or questions. The first event type is question opened, which indicates that an actor poses a question (which is identified by a unique number). Event type answer shows that an actor answers a question, while question closed indicates that a question is closed by the question opener, by an administrator or because the maximum question lifetime of seven days has been reached. Though the different event streams are not independent, this paper focuses on the dynamics of the fourth event type message, which shows that one actor wrote a private message (PM) to another actor.

III. MODELLING THE CASE STUDY

The decision making process of actors regarding private message (PM) sending can be modelled using exponential random graph models (ERGMs) which are embedded in a Markov process. Firstly, this process has to decide which actor writes a personal message. Secondly, it is decided whether the receiver is active or non-active. If the receiving actor is active, then thirdly, the chosen sender decides to whom of all active actors he or she sends the PM. This decision depends on the network structures that surround sender and receiver. Whether certain structures are relevant for actors' decisions can be tested with a regression model based on the observed behaviour in the dataset. The network structures are a result of all events observed in the past and a set of change rules. These rules are shown in section III-A. Section III-B shows the regression statistics that are tested for influence on the decision process of actors. Section III-C explains how the Markov process that describes the decision process is modelled.

Figure 1. Number of three different event types (questions, answers and private messages) per month over the whole observed period. Four phases were identified: I) Initialisation, II) Growth, III) Stabilisation and IV) Community Growth.

A. Transforming events into graphs

The state of the whole process is named $x$. $x$ is a realisation of a random variable $X$ and is defined by the state of three graphs at a certain point in time. These graphs are defined on two sets of nodes (two modes), which are, as mentioned before, the set of actors $A$ and the set of questions $Q$. The three graphs are named $X^m$, $X^q$ and $X^a$, and a realisation is $x = (x^m, x^q, x^a)$. $X^m$ reflects the recent activity of message writing between actors and has directed, weighted ties $\in (0, \infty)$. $X^q$ shows which actors have posed a question which has not been closed yet. $X^a$ is a similar graph that connects actors with active questions they have answered. The last two of these graphs have directed, binary ties and are bipartite two-mode graphs (affiliation networks).

How these graphs connect the two different modes is shown in figure 2. Graph $X^m$ only connects nodes of type $A$ with directed ties. Graphs $X^q$ and $X^a$ are bipartite and connect single actors with questions.

Figure 2. Three different graphs are defined on the two node sets A (actors) and Q (questions).

The event stream is an ordered sequence $\Omega$ with elements $\omega_1, \omega_2, \ldots$. If the position within the sequence is not of interest, the elements are just named $\omega$. $\omega.\text{time}$, $\omega.\text{sender}$, $\omega.\text{receiver}$ and $\omega.\text{type}$ indicate the four attributes of events as introduced in section II-B and shown in table I. Depending on these characteristics, events change certain ties of the graphs that define the Markov state $X$. But even if no event takes place, the values of ties change due to deterministic, time dependent processes. The event driven and, therefore, probabilistic changes are defined by a rule function $R_\Omega : (X, \Omega) \mapsto X$, which changes a Markov state into a new one, depending on the last observed event $\omega$. The deterministic, time dependent changes are defined by a rule function $R_\Delta : (X, \Delta) \mapsto X$, where $\Delta$ is the set of all time spans $\delta$, $\delta \in \mathbb{R}^+$. If there are two subsequent events, the process state after the first one is known and the state after the second one is to be calculated, then first all time dependent changes are applied. The result is the process state just before the second event. To obtain the process state after this event, all probabilistic changes are applied afterwards. Figure 3 shows how a directed communication tie (representing the recent PM activity) between two actors changes over time. Events $\omega$ with $\omega.\text{sender} = i$ and $\omega.\text{receiver} = j$ increase the tie value by one. Eight different events are shown. Between each pair of events there is a decay which depends on the time span $\delta$. The time spans $\delta_1$ and $\delta_4$, which follow the first and the fourth event, are shown as examples.

As $R_\Omega$ and $R_\Delta$ can be distinguished by their parameters, they will both just be named $R$ in the following. Therefore, $R_\Omega(x, \omega)$ can also be written as $R(x, i, j, \gamma)$ with $i = \omega.\text{sender}$, $j = \omega.\text{receiver}$ and $\gamma = \omega.\text{type}$. $R(x, \delta)$ is defined for time spans $\delta$ between two consecutive events $\omega_1$ and $\omega_2$ that are calculated as

$$\delta = \omega_2.\text{time} - \omega_1.\text{time}. \tag{1}$$

For each subgraph of $x$ (which are $x^m$, $x^q$ and $x^a$) there are specific probabilistic change rules. If a PM is sent (events with type message), the message graph $x^m$ is updated in the following way:

$$R(x^m, i, j, \text{message}) = (x^{m\prime}_{kl}) = \begin{cases} x^m_{kl} & \text{if } i \neq k \lor j \neq l \\ x^m_{kl} + 1 & \text{if } i = k \land j = l \end{cases} \tag{2}$$

This change rule was already illustrated in figure 3. If a new question is opened (event type question open), graph $x^q$ gets a new binary tie between the actor who asks ($i$) and the new question node $j$:

$$R(x^q, i, j, \text{question open}) = (x^{q\prime}_{kl}) = \begin{cases} x^q_{kl} & \text{if } i \neq k \lor j \neq l \\ 1 & \text{if } i = k \land j = l \end{cases} \tag{3}$$

This is very similar to what happens if an actor answers a question (event type answer). Answering the same question several times does not change the binary tie:

$$R(x^a, i, j, \text{answer}) = (x^{a\prime}_{kl}) = \begin{cases} x^a_{kl} & \text{if } i \neq k \lor j \neq l \\ 1 & \text{if } i = k \land j = l \end{cases} \tag{4}$$

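Event type question close changes two different graphs ($x^q$ and $x^a$) by removing all those binary ties that are connected with the deleted question:

$$R(x^q, i, j, \text{question close}) = (x^{q\prime}_{kl}) = \begin{cases} x^q_{kl} & \text{if } j \neq l \\ 0 & \text{if } j = l \end{cases} \tag{5}$$

$$R(x^a, i, j, \text{question close}) = (x^{a\prime}_{kl}) = \begin{cases} x^a_{kl} & \text{if } j \neq l \\ 0 & \text{if } j = l \end{cases} \tag{6}$$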
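The probabilistic change rules of equations 2 to 6 can be illustrated with a minimal Python sketch. The data layout is an assumption made here for illustration and is not taken from the original implementation: the message graph is a dictionary from (sender, receiver) pairs to tie values, the two affiliation graphs are sets of (actor, question) pairs, and the function names `empty_state` and `apply_event` are hypothetical. The event type labels follow the wording of the equations.

```python
from collections import defaultdict


def empty_state():
    """Return an empty process state x = (x^m, x^q, x^a) (assumed layout)."""
    return {"m": defaultdict(float), "q": set(), "a": set()}


def apply_event(state, sender, receiver, event_type):
    """Apply one event to the state in place, following equations 2 to 6."""
    if event_type == "message":
        # Equation 2: only the tie from sender to receiver grows by one.
        state["m"][(sender, receiver)] += 1.0
    elif event_type == "question open":
        # Equation 3: new binary tie between the asker and the question.
        state["q"].add((sender, receiver))
    elif event_type == "answer":
        # Equation 4: binary tie, so repeated answers change nothing.
        state["a"].add((sender, receiver))
    elif event_type == "question close":
        # Equations 5 and 6: drop every tie attached to the closed question.
        state["q"] = {tie for tie in state["q"] if tie[1] != receiver}
        state["a"] = {tie for tie in state["a"] if tie[1] != receiver}
    else:
        raise ValueError(f"unknown event type: {event_type!r}")
    return state


# Hypothetical actors "i", "j" and question "q1", for illustration only.
state = empty_state()
apply_event(state, "i", "q1", "question open")
apply_event(state, "j", "q1", "answer")
apply_event(state, "j", "q1", "answer")
apply_event(state, "j", "i", "message")
apply_event(state, "j", "i", "message")
```

Storing only the existing ties keeps the sketch compact for sparse graphs; a dense matrix would implement the same rules.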

| graphs \ event types | question open | question close | answer | message |
|---|---|---|---|---|
| $X^m$ | – | – | – | tie value ↑ |
| $X^q$ | add tie | remove tie | – | – |
| $X^a$ | – | remove ties | add tie | – |

Table II. OVERVIEW OF PROBABILISTIC GRAPH CHANGE RULES

An overview of the probabilistic change rules used is given in table II.

There is only one deterministic (time dependent) change rule, which is an exponential decay of the communication level between actors. This level is represented by graph $X^m$. Introducing such a natural decay function seems reasonable, as otherwise the communication intensity between actors could only increase: even if there was no communication between actors for very long periods, this value would remain stable. The exponential decay function is defined as follows:

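$$R(x^m, \delta) = (x^{m\prime}_{kl}) = \begin{cases} x^m_{kl} e^{-\theta\delta} & \text{if } x^m_{kl} e^{-\theta\delta} > \varepsilon \\ 0 & \text{else} \end{cases} \tag{7}$$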

It holds that $\theta = \frac{\ln 2}{t_{1/2}}$; therefore the only parameter that needs to be specified is a half-life $t_{1/2}$, which is the time after which each tie value has decreased by 50%. Ties that have a value $\leq \varepsilon$ are reset to zero. $\varepsilon$ is a very small value like 0.1 or 0.01. This is done for computational reasons to reduce the set of active actors. Details are explained in section III-C.
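As a rough illustration of equation 7, the following Python sketch applies the decay to a dictionary of message ties. The dictionary layout and the function name `decay` are assumptions made for this example. Time spans and the half-life must be given in the same unit.

```python
import math


def decay(x_m, delta, half_life, eps=0.01):
    """Return the message ties after a time span delta (equation 7).

    x_m maps (sender, receiver) pairs to tie values (assumed layout).
    Ties whose decayed value is not above eps are dropped, i.e. reset to zero.
    """
    theta = math.log(2) / half_life
    factor = math.exp(-theta * delta)
    decayed = {}
    for tie, value in x_m.items():
        new_value = value * factor
        if new_value > eps:
            decayed[tie] = new_value
    return decayed


# Hypothetical tie of value 1 and a half-life of 12 hours.
ties = {("i", "j"): 1.0}
after_one_half_life = decay(ties, 12.0, 12.0)
after_two_half_lives = decay(ties, 24.0, 12.0)
```

With these rules a tie of value 1 falls to $\varepsilon = 0.1$ after $\log_2 10 \approx 3.32$ half-lives, which is roughly 39.9 hours for a half-life of 12 hours.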

B. Decision statistics

Most actors that are connected with positive message ties in $X^m$ are embedded in other network structures, and it can be tested whether the existence of these structures influences the decisions of senders of PMs. This can be estimated with a generalised linear regression model whose probability distribution is that of a common exponential random graph model (ERGM). In this case, the independent variables are the structures that are assumed to influence decisions about recipients of PMs. The observed decision behaviour in the dataset is the dependent variable.

These decision statistics are similar to what is usually tested in the models implemented in software packages like PNet (http://www.sna.unimelb.edu.au/pnet/) or SIENA (http://stat.gamma.rug.nl/siena.html). But as this event based approach tries to model every single change in the process states (based on the event stream), the independent variable structures are only evaluated locally. This means that only those structures that include the currently emerged or strengthened message tie are assumed to influence actors' decisions. This makes sense, as all other structures in the network do not change with the last event anyway. Also, the local evaluation of structures decreases the computational and memory related overhead.

Figure 3. Change of a directed PM tie from actor i to j. Events ω with ω.sender = i and ω.receiver = j increase the tie value by 1 (probabilistic rule). Between events there is a time decay which depends on time spans δ (deterministic rule).

The tested structures are shown in figure 4. The counts of these statistics are part of a vector $s(x, k, l)$ with elements $s_1(x, k, l), s_2(x, k, l), \ldots$ that weight structures of a Markov state $x$. As only local statistics are counted, only the sending and the receiving actor of the evaluated message event are part of the evaluation (in the case of figure 3, only those structures that include the message tie from $i$ to $j$). Both active nodes are actors: $k$ is the sender and $l$ is the receiver, $k, l \in A$. The formal definitions of the structures are given in the following.

The single arc statistic of figure 4(a) evaluates the value of the new tie. This value is at least 1 (as this is the result of the change rule in equation 2), but may be higher if there was communication before. A negative weight for this statistic would show a tendency towards many "weak" outgoing message ties; the statistic is therefore similar to what is often called the density effect in binary ERGMs. The decision statistic Single Message Tie is given in equation 8:

$$s_1(x, i, j) = x^m_{ij} \tag{8}$$

The structures of figures 4(b) and 4(c) are similar. The number of questions that the receiver of the message is connected to, either as asker or as responder, is counted. Afterwards, this value is normalized by taking the square root. This was done because the activity of actors regarding questions is far from being uniformly distributed; the normalization smooths the skewed distribution. Equation 9 shows this for answer ties in $X^a$, equation 10 for question ties in $X^q$. Note that ties in $X^a$ and $X^q$ are either 0 or 1 as these graphs are binary.

$$s_2(x, i, j) = \sqrt{\sum_{q \in Q} x^a_{jq}} \tag{9}$$

$$s_3(x, i, j) = \sqrt{\sum_{q \in Q} x^q_{jq}} \tag{10}$$

Figures 4(d), 4(e) and 4(f) have one more tie. They evaluate whether there is a tendency for communication between those actors that are connected with the same questions. Equation 11 shows this for communication between responders of the same question, equation 12 for communication from responders to the question asker, and equation 13 from askers to responders. Once again the counts are normalized by taking the square root.

$$s_4(x, i, j) = \sqrt{\sum_{q \in Q} x^a_{iq} x^a_{jq}} \tag{11}$$

$$s_5(x, i, j) = \sqrt{\sum_{q \in Q} x^a_{iq} x^q_{jq}} \tag{12}$$

$$s_6(x, i, j) = \sqrt{\sum_{q \in Q} x^q_{iq} x^a_{jq}} \tag{13}$$
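The affiliation statistics of equations 9 to 13 can be sketched in a few lines of Python. Because the ties in the two affiliation graphs are binary, each sum over questions is just the size of a set or of a set intersection. The representation of the graphs as sets of (actor, question) pairs and the function names are assumptions made for this illustration.

```python
import math


def questions_of(graph, actor):
    """Questions that an actor is tied to in a binary affiliation graph."""
    return {question for (a, question) in graph if a == actor}


def affiliation_statistics(x_q, x_a, i, j):
    """Return (s2, s3, s4, s5, s6) for a message from sender i to receiver j."""
    asked_i, asked_j = questions_of(x_q, i), questions_of(x_q, j)
    answered_i, answered_j = questions_of(x_a, i), questions_of(x_a, j)
    s2 = math.sqrt(len(answered_j))               # equation 9
    s3 = math.sqrt(len(asked_j))                  # equation 10
    s4 = math.sqrt(len(answered_i & answered_j))  # equation 11
    s5 = math.sqrt(len(answered_i & asked_j))     # equation 12
    s6 = math.sqrt(len(asked_i & answered_j))     # equation 13
    return s2, s3, s4, s5, s6


# Hypothetical affiliation graphs with actors "i", "j" and questions q1..q5.
x_q = {("i", "q1"), ("j", "q2")}
x_a = {("j", "q1"), ("i", "q2"), ("i", "q3"),
       ("j", "q3"), ("j", "q4"), ("j", "q5")}
stats = affiliation_statistics(x_q, x_a, "i", "j")
```

In this toy example the receiver has answered four questions, so the square root reduces the count of 4 to a statistic of 2.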

Finally, the relevance of two structures on the message graph $X^m$ (which only connects actors) is tested. These structures, a dyad and a transitive triad, are shown in figures 4(g) and 4(h). They test whether there is significant PM communication that cannot directly be assigned to an open question but rather seems to be either responsive (dyad) or cluster building (transitive triad). Therefore these effects are called community effects in this paper. A positive weight of these effects would indicate that people tend to communicate within smaller and dense clusters, which are part of a community structure. The dyad effect is described in equation 14, the transitivity effect in equation 15. Once again these effects are only measured locally, meaning that these structures are only counted if the emerged or changed tie is part of them.

Figure 4. Network structures on a one mode and a two mode graph that might influence PM receiver decisions: (a) single message tie; (b) PM to question responder; (c) PM to question opener (asker); (d) PM between responders of the same question; (e) PM from responder to question opener; (f) PM from question opener to responder; (g) PM is part of a communication dyad; (h) PM is part of a communication transitive triad.

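Each of these structures is assumed to be as strong as its weakest tie.

$$s_7(x, i, j) = \min(x^m_{ij}, x^m_{ji}) \tag{14}$$

$$s_8(x, i, j) = \sum_{k \in A \setminus \{i, j\}} \Big( \min(x^m_{ij}, x^m_{jk}, x^m_{ik}) + \min(x^m_{ij}, x^m_{ki}, x^m_{kj}) + \min(x^m_{ij}, x^m_{ik}, x^m_{kj}) \Big) \tag{15}$$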

Equation 15 has three terms because there are three different transitive triads of which the directed tie $x^m_{ij}$ may be part with an additional node $k \in A$.
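The three statistics on the message graph (equations 8, 14 and 15) can be sketched as follows. The sketch assumes that each minimum in equation 15 is taken over the three ties of one transitive triad, in line with the statement that a structure is as strong as its weakest tie. The dictionary layout of the message graph and the function name are likewise assumptions. The graph passed in is meant to be the state after the evaluated message has been added.

```python
def message_statistics(x_m, actors, i, j):
    """Return (s1, s7, s8) for a message tie from sender i to receiver j."""

    def w(a, b):
        # Tie value from a to b; missing ties count as zero.
        return x_m.get((a, b), 0.0)

    s1 = w(i, j)                 # equation 8: single message tie
    s7 = min(w(i, j), w(j, i))   # equation 14: communication dyad
    s8 = 0.0                     # equation 15: transitive triads
    for k in actors:
        if k == i or k == j:
            continue
        s8 += min(w(i, j), w(j, k), w(i, k))  # i -> j -> k, closed by i -> k
        s8 += min(w(i, j), w(k, i), w(k, j))  # k -> i -> j, closed by k -> j
        s8 += min(w(i, j), w(i, k), w(k, j))  # i -> k -> j, closed by i -> j
    return s1, s7, s8


# Hypothetical message graph on three actors.
x_m = {("i", "j"): 2.0, ("j", "i"): 0.5, ("j", "k"): 1.0, ("i", "k"): 1.5}
stats = message_statistics(x_m, ["i", "j", "k"], "i", "j")
```

Only triads that contain the tie from $i$ to $j$ are visited, which reflects the local evaluation described above.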

C. Markov process

A Markov process (or continuous time Markov chain) is a process without memory and can therefore be defined just by a set of process states and transition rates between them. The random variable $X$, which represents the state of the three underlying graphs, is the state of the Markov process. The transition rates are only defined for those probabilistic changes of this state that are triggered by a message event. These rates are defined as Poisson rates as explained in the following.

For computational reasons a distinction is made between active and non-active actors. Active actors are those that are connected to a non-closed question (as asker or responder) or have at least one incoming or outgoing message tie with a value $> 0$. The set of actors $A$ is split into two subsets: $A^+$, the set of active actors, and $A^-$, the set of non-active actors, with $A^+ \cup A^- = A$. Formally, the following holds for all actors $i \in A^+$:

$$\forall i \in A^+ : (\exists k \in Q : x^q_{ik} = 1 \lor x^a_{ik} = 1) \lor (\exists j \in A : x^m_{ij} > 0 \lor x^m_{ji} > 0) \tag{16}$$
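The condition in equation 16 translates directly into a set construction. The following Python sketch assumes the same illustrative layout as before (a dictionary of message ties and two sets of (actor, question) pairs); the function name `active_actors` is hypothetical.

```python
def active_actors(x_m, x_q, x_a):
    """Return the set A+ of active actors according to equation 16."""
    # Actors tied to an open question, as asker or as responder.
    active = {actor for (actor, _question) in x_q | x_a}
    # Actors with at least one incoming or outgoing positive message tie.
    for (sender, receiver), value in x_m.items():
        if value > 0:
            active.add(sender)
            active.add(receiver)
    return active


# Hypothetical state: the tie between "k" and "l" has decayed to zero.
x_m = {("i", "j"): 0.3, ("k", "l"): 0.0}
x_q = {("m", "q1")}
x_a = {("n", "q1")}
a_plus = active_actors(x_m, x_q, x_a)
```

All remaining actors form $A^-$; because decayed ties are reset to zero, actors drop out of $A^+$ once their last question is closed and their message ties have faded.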

Differentiating between active and non-active actors seems reasonable, as a lot of accounts are only used for short time spans, e.g. just to pose one question.

The Markov process is assumed to be homogeneous. This is reasonable at least for phases II to IV (see figure 1) as the distribution of possible states would only marginally depend on the initial state with three empty graphs. A concrete state of the process is defined by realisations $x$ of $X$ as introduced in section III-A. Transition rates are of interest for those points in time when a PM is sent from one actor to another. These transition rates are described by a Poisson parameter $\lambda_{ij}(x)$ which is defined for each state and each tuple of sender $i$ and receiver $j$; $i, j \in A$.

$$\lambda_{ij}(x) = \begin{cases} \rho_i \, p^+ \, p^\star_{ij}(x) & j \in A^+ \\ \rho_i \, (1 - p^+) \, \frac{1}{|A^-|} & j \in A^- \end{cases} \tag{17}$$

$\rho_i$ is the parameter of a Poisson process and describes the general activity of actor $i$ regarding the sending of PMs. The phase of interest starts with the creation of the user account. The value of $\rho_i$ is 0 for most actors, as only a small subset writes PMs at all. However, these Poisson rates are not analysed, as this paper focuses on the decision patterns. The Poisson rate is split into several sub Poisson rates using different probability distributions (note that this assumes independence). A case distinction is made depending on whether the receiver of the PM is active ($j \in A^+$) or not ($j \in A^-$). The probability $p^+$ is equal to $P(\omega.\text{receiver} \in A^+)$. $p^+$ is assumed to be considerably higher than $1 - p^+$, so most receivers of PMs are actually active. Among the actors in $A^-$, the probability of being the receiver is uniformly distributed. In the case of an active receiver, this probability depends on the network structures in which sender $i$ and the actual receiver $j$ are embedded. $p^\star_{ij}(x)$ is the probability distribution of an ERGM and gives the probability that $i$ writes a PM to active actor $j$ given that $i$ writes a PM to an active actor at all. It is defined in equation 18:

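$$p^\star_{ij}(x; \beta) = \frac{1}{c^+} \exp\!\left(\beta^T s(R(x, i, j, \text{message}), i, j)\right), \qquad c^+ = \sum_{k \in A^+} \exp\!\left(\beta^T s(R(x, i, k, \text{message}), i, k)\right) \tag{18}$$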

The statistics $s \in S$ have been introduced in section III-B and are applied to the transformed Markov state $R(x, i, j, \text{message})$ (see equation 2). Each statistic of type $s$ is weighted with a corresponding parameter $\beta \in \mathbb{R}$. Vectors $\beta$ and $s$ have the same dimension. This linear characteristic of the newly observed graphs is transformed with an exponential function. The resulting value, giving a "weight" for the structures surrounding the actually observed receiver, is normalised with the weights of all those statistics that might have occurred, given that $i$ decided to write the message to another active actor. This ensures that $p^\star_{ij}(x)$ is a proper probability distribution. The normalising constant $c^+$ ("+" indicates that only active actors are evaluated) is also given in equation 18.
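Equations 17 and 18 can be illustrated with a small Python sketch. It assumes that the statistic vector $s(R(x, i, k, \text{message}), i, k)$ has already been computed for every active candidate receiver $k$ and is passed in as a dictionary; the function names and argument names are hypothetical. Subtracting the largest score before exponentiating is a numerical safeguard added here, not part of the paper; it cancels in the ratio and leaves the probabilities unchanged.

```python
import math


def choice_probabilities(beta, stats_by_receiver):
    """Return p*_ij for every active candidate receiver (equation 18).

    stats_by_receiver maps each candidate k in A+ to its statistic vector.
    """
    scores = {
        k: sum(b * s for b, s in zip(beta, stats))
        for k, stats in stats_by_receiver.items()
    }
    top = max(scores.values())  # numerical stabilisation only
    weights = {k: math.exp(score - top) for k, score in scores.items()}
    c_plus = sum(weights.values())  # normalising constant c+
    return {k: weight / c_plus for k, weight in weights.items()}


def message_rate(rho_i, p_plus, receiver_active, p_star=0.0, n_inactive=0):
    """Return the Poisson rate lambda_ij of equation 17."""
    if receiver_active:
        return rho_i * p_plus * p_star
    return rho_i * (1.0 - p_plus) / n_inactive


# Hypothetical one-statistic example with two active candidates.
probs = choice_probabilities([1.0], {"j": [1.0], "k": [0.0]})
```

With $\beta = 0$ every candidate gets the same weight, so the distribution is uniform over the active actors.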

IV. ESTIMATION AND RESULTS

The best fitting parameters $\beta$ are calculated by applying a Maximum Likelihood (ML) estimation. The log likelihood function shown in equation 19 is maximized using a Newton-Raphson method.

$$\max_\beta \log L = \sum_{\omega \in \Omega} \log p^\star_{ij}(x; \beta) \tag{19}$$

For each event $\omega \in \Omega$ the decision probabilities $p^\star_{ij}$ are assumed to be independent. For each of these probabilities it holds that $i = \omega.\text{sender}$ and $j = \omega.\text{receiver}$. $x$ is the state just before event $\omega$. This means that the deterministic exponential decay has been applied, but not the probabilistic changes, as introduced in section III-A. Standard errors are estimated using a bootstrapping approach.
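To make the estimation step concrete, the following Python sketch maximizes the log likelihood of equation 19 with Newton-Raphson for the special case of a single statistic, so that $\beta$ is a scalar. This simplification, the event layout (the statistic of the chosen receiver plus the statistics of all candidates) and the function names are assumptions made for illustration; the original implementation handles a vector $\beta$. For one parameter the first derivative is the sum over events of the observed statistic minus its expected value under $p^\star$, and the second derivative is the negative sum of the variances.

```python
import math


def log_likelihood(beta, events):
    """Equation 19 for a single statistic.

    events is a list of (s_chosen, candidates), where candidates holds the
    statistic of every active candidate receiver, including the chosen one.
    """
    total = 0.0
    for s_chosen, candidates in events:
        top = max(beta * s for s in candidates)
        log_c = top + math.log(sum(math.exp(beta * s - top) for s in candidates))
        total += beta * s_chosen - log_c
    return total


def newton_raphson(events, beta=0.0, max_steps=50, tol=1e-10):
    """Maximize the log likelihood over a scalar beta."""
    for _ in range(max_steps):
        grad, hess = 0.0, 0.0
        for s_chosen, candidates in events:
            top = max(beta * s for s in candidates)
            weights = [math.exp(beta * s - top) for s in candidates]
            total = sum(weights)
            mean = sum(w * s for w, s in zip(weights, candidates)) / total
            second = sum(w * s * s for w, s in zip(weights, candidates)) / total
            grad += s_chosen - mean          # observed minus expected
            hess -= second - mean * mean     # minus the variance
        if hess == 0.0:
            break
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta


# Hypothetical data: two candidates with statistics 1 and 0; the candidate
# with statistic 1 is chosen in three out of four decisions.
events = [(1.0, [1.0, 0.0])] * 3 + [(0.0, [1.0, 0.0])]
beta_hat = newton_raphson(events)
```

In this toy example the estimate converges to $\ln 3$, the value at which the model assigns probability 0.75 to the candidate with statistic 1. For $\beta = 0$ the log likelihood reduces to $-\sum_\omega \log |A^+|$, the value for a uniform choice among the active actors.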

The rate of messages sent to non-active actors in $A^-$ is 16.9% over the whole event stream (this is quite a lot because the decay half-life was only set to 12 hours). These messages are not included in the following evaluation. As the four phases of the Q&A community seem to have different characteristics (see figure 1), a subsequence of each of the phases II to IV has been evaluated separately. As the whole process includes millions of events, it is already sufficient to analyse smaller samples of the stream to get statistically significant results. Although analysing the whole stream only increases memory and computational complexity linearly for each simulation run (due to a preprocessing of network statistics), it is still a big challenge to process it at once. Also, it is questionable whether analysing the whole data stream would be very helpful, as all characteristics of the different phases would get blurred. Therefore, subsequences of one day to two weeks were chosen for estimation.

| figure: name | set 1 β | set 1 s.e. | set 2 β | set 2 s.e. | set 3 β | set 3 s.e. |
|---|---|---|---|---|---|---|
| 4(a): Single message tie | 2.61 | 0.21 | 2.50 | 0.12 | 2.74 | 0.02 |
| 4(f): from asker to responder | 0.34 | 0.06 | 0.35 | 0.09 | | |
| 4(g): part of a comm. dyad | 3.00 | 0.91 | 3.12 | 0.64 | | |
| 4(d): between responders | 0.05 | 0.01 | 0.05 | 0.01 | | |
| 4(e): from responder to asker | 0.13 | 0.02 | 0.13 | 0.01 | | |
| 4(h): part of a comm. triad | 0.96 | 0.22 | | | | |
| 4(c): to question opener | 0.00 | 0.01 | | | | |
| 4(b): to question responder | -0.00 | 0.00 | | | | |
| logL | -3,424 | | -3,427 | | -3,668 | |

Table III. BEST FITTING ESTIMATORS FOR PARAMETERS β IN THREE DIFFERENT TEST SETS OF A SUB SEQUENCE OF PHASE II.

The estimation results for phase II are shown in table III. The figure references and the names of the statistics are given in the first column. The sub stream lasted from 1st August 2007 to 14th August 2007 and included 1,081 PMs with an average of 314.9 active actors at the points in time when the events took place. The log likelihood without including any parameter ($\beta = 0$) is −6,084. This is equal to the outcome for a uniform distribution over all active actors. The statistics are ordered by their additional explanatory value, beginning with the structure that explains most. The order was found using a stepwise inclusion of parameters. However, some questions about goodness of fit are still open (see section VI).

Three different sets of decisions have been estimated. Each of the tested statistics explains a different amount of the whole process. In the first estimation, shown in columns 2 and 3, all parameters were included, while in the second and third set (columns 4 and 5, and 6 and 7) only the five best explaining statistics and only the best explaining statistic (Single Arc), respectively, were estimated.

Interestingly, the community structures dyad and transitivity are quite strong, although there is much less communication in this phase than in the later phase IV. From the strong value of Single Arc it follows that there is a tendency towards strong message ties. People seem to interact not with many others but rather with a few. The structures Communication Dyad (3.00) and Communication Triad (0.96) are also strong, so people seem to tend to interact dyadically and in small transitive clusters. The question affiliation structure effects of figures 4(d), 4(e) and 4(f) explain a lot of the probability (even more than transitivity) and are also positive. So there is obviously communication about questions which cannot be explained by community structures. The tendency to write PMs to question askers or responders without being affiliated to the same question is either not statistically significant or has a very low absolute value close to 0. These effects (see figures 4(b) and 4(c)) are not important or at least very weak.

| figure: name | set 1 β | set 1 s.e. | set 2 β | set 2 s.e. | set 3 β | set 3 s.e. |
|---|---|---|---|---|---|---|
| 4(a): Single message tie | 2.84 | 0.25 | 2.69 | 0.10 | 2.73 | 0.02 |
| 4(g): part of a comm. dyad | 4.09 | 0.62 | 4.18 | 0.45 | | |
| 4(h): part of a comm. triad | 0.97 | 0.28 | | | | |
| 4(f): from asker to responder | 0.17 | 0.06 | | | | |
| 4(d): between responders | 0.19 | 0.04 | | | | |
| 4(e): from responder to asker | 0.30 | 0.10 | | | | |
| 4(c): to question opener | -0.04 | 0.02 | | | | |
| 4(b): to question responder | 0.00 | 0.00 | | | | |
| logL | -1,192 | | -1,196 | | -1,245 | |

Table IV. BEST FITTING ESTIMATORS FOR PARAMETERS β IN THREE DIFFERENT TEST SETS OF A SUB SEQUENCE OF PHASE III.

As the event density increases in the data stream, the following tested time spans are much shorter than the two weeks tested in phase II. First, a three day sub stream of phase III (Stabilisation) is tested. It lasts from 3rd December to 5th December 2007 (Monday to Wednesday). Table IV shows the results. Once again the parameters are ordered by their explanatory weight. The sub stream included only 378 PMs but with a much higher average number of 2,588 active actors for each decision.

When comparing tables III and IV, the estimators seem to be quite similar but the order (the explanatory weight) is different. Single Arc still explains the majority, but now the community structures Communication Dyad and Communication Triad are second and third. However, Single Arc and Communication Dyad already account for nearly all of the likelihood improvement. Together they reach a log likelihood of −1,196 (see set 2); using all parameters is only marginally better (−1,192 in set 1).

The sample stream of phase IV was chosen from the middle of March 2008, as in this month the number of messages increases significantly for the first time. Table V shows the results for this subsequence, which covers 10th March 2008 from 00:00 to 24:00, with 545 PM events and an average of 2,853 active actors. The number of active actors is quite high because the general activity in the Q&A community increased. Although the analysed time span lasts only one day, there is a high PM activity in these 24 hours compared to the two other time spans. Assuming a uniform distribution on the set of active actors ($\beta = 0$), the log likelihood was −3,171.

| figure: name | set 1 β | set 1 s.e. | set 2 β | set 2 s.e. | set 3 β | set 3 s.e. |
|---|---|---|---|---|---|---|
| 4(a): Single message tie | 0.72 | 0.15 | 0.72 | 0.20 | 0.73 | 0.17 |
| 4(g): part of a comm. dyad | 0.57 | 0.09 | 0.57 | 0.14 | 0.66 | 0.16 |
| 4(h): part of a comm. triad | 0.43 | 0.03 | 0.39 | 0.05 | | |
| 4(f): from asker to responder | 0.31 | 0.05 | 0.28 | 0.06 | | |
| 4(e): from responder to asker | 0.37 | 0.06 | 0.32 | 0.03 | | |
| 4(d): between responders | -0.10 | 0.04 | | | | |
| 4(c): to question opener | 0.00 | 0.02 | | | | |
| 4(b): to question responder | 0.00 | 0.00 | | | | |
| logL | -1,810 | | -1,812 | | -1,867 | |

Table V. BEST FITTING ESTIMATORS FOR PARAMETERS β IN THREE DIFFERENT TEST SETS OF A SUB SEQUENCE OF PHASE IV.

Although the explanatory order is similar to that in table IV, the parameter values are very different now. First, Single Arc is much lower. Apparently, people now interact with a lot more actors but in less dense clusters. This is underlined by the lower values for the community structures Communication Dyad and Communication Triad. The structures including two different affiliation ties changed as well: PM from asker to responder increased from 0.17 to 0.31 and PM from responder to asker from 0.30 to 0.37 compared with phase III. On the other hand, PM between responders turned negative (from 0.19 to −0.10), but it now explains less. However, contrary to what figure 1 initially suggested, it is probably not an increase of community structures that drives the higher PM activity, but an increasing number of PMs about questions (affiliation structures). In section V this change of parameters will be analysed in detail.

V. EVOLUTION

The estimators shown in section IV are based on small subsets of a very big sample with more than 5 million events. As follows from the previous results, structural parameters are not stable over time. Event streams include a lot more information than panel data and they allow testing for structural effects in arbitrary time spans. Still, there are two problems. First, comparing the effects of different time spans may help to find structural changes, but what if the time spans with different structural characteristics are not known a priori? And second, it cannot be answered how the behaviour really changed: Can sudden structural breaks be observed or are slower evolutionary processes the reason?

Therefore, it is proposed to use a sliding window approach to visualize structural changes in decision behaviour. The decisions about message receivers in a small subset of the data stream are analysed, for example within one week. Then, this one week “window” is moved step by step, e.g. by one day. For each window, the estimation results can be compared. The general idea is shown in figure 5. Each of the rectangles stands for an observed event of type message. The grey surface is the sliding window, which is moved along the time line in steps with a fixed time span.

Figure 5. A sliding window is moved over the event stream. For each subsample, parameters can be estimated.
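The windowing step can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used for this paper: the function names, the tuple layout of an event and the placeholder estimator (a plain event count instead of the parameter estimation described in the earlier sections) are assumptions made for illustration, and the four events are toy data.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

Event = TypeVar("Event")
Result = TypeVar("Result")


def sliding_windows(
    start: datetime, end: datetime, width: timedelta, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open windows [w_start, w_end) that fit into [start, end]."""
    if width <= timedelta(0) or step <= timedelta(0):
        raise ValueError("width and step must be positive")
    window_start = start
    while window_start + width <= end:
        yield window_start, window_start + width
        window_start += step


def estimate_per_window(
    events: Sequence[Event],
    timestamp: Callable[[Event], datetime],
    estimator: Callable[[List[Event]], Result],
    start: datetime,
    end: datetime,
    width: timedelta,
    step: timedelta,
) -> List[Tuple[datetime, Result]]:
    """Apply `estimator` to the events that fall into each window."""
    results = []
    for w_start, w_end in sliding_windows(start, end, width, step):
        sample = [e for e in events if w_start <= timestamp(e) < w_end]
        results.append((w_start, estimator(sample)))
    return results


# Toy events (hypothetical): (time stamp, sender, receiver).
events = [
    (datetime(2008, 3, 1, 9, 0), "a", "b"),
    (datetime(2008, 3, 1, 17, 30), "b", "a"),
    (datetime(2008, 3, 2, 8, 15), "c", "a"),
    (datetime(2008, 3, 4, 12, 0), "a", "c"),
]

# Placeholder estimator: number of events per two-day window, moved daily.
counts = estimate_per_window(
    events,
    timestamp=lambda e: e[0],
    estimator=len,
    start=datetime(2008, 3, 1),
    end=datetime(2008, 3, 5),
    width=timedelta(days=2),
    step=timedelta(days=1),
)
```

Setting the step equal to the width gives non-overlapping windows. A smaller step gives overlapping windows, so that consecutive estimates share part of their events.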

Still, this approach has some critical problems that have to be addressed.

First, the width of the sliding window is important: A window which is too big may decrease the expressiveness of the results, while a window that is too small might return chaotically changing non-significant estimators. Also, a wrong window size may reveal uninteresting periodic effects. For example, a window size of one day might just show the different behaviour of actors on week days and on the weekend but blur longer periodic evolution. Other periodic effects, in this case the average active time of questions or the average answer time for private messages, have to be considered when defining the size of a window. These problems also hold for the sub stream samples in section IV.

Secondly, for computational reasons it is not yet possible to provide standard errors for sliding window estimates, as they are estimated using bootstrapping. A combination of different window sizes plus standard errors might help to provide an early indicator for evolutionary changes in the observed data stream.
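To illustrate why bootstrapped standard errors are expensive in a sliding window setting, the following sketch shows a generic nonparametric bootstrap in the sense of [11]. It is an assumption-laden illustration and not the procedure used in this paper: the function name, the default number of resamples and the toy sample are hypothetical, and observations are resampled independently, which ignores the dependence between events in a stream.

```python
import random
import statistics
from typing import Callable, List, Sequence, TypeVar

Obs = TypeVar("Obs")


def bootstrap_standard_error(
    sample: Sequence[Obs],
    estimator: Callable[[List[Obs]], float],
    n_resamples: int = 200,
    seed: int = 0,
) -> float:
    """Standard deviation of the estimator over resamples drawn with replacement."""
    if len(sample) < 2 or n_resamples < 2:
        raise ValueError("need at least two observations and two resamples")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)


# Toy sample (hypothetical): 1 if a message was answered, 0 otherwise.
replies = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
se = bootstrap_standard_error(replies, statistics.fmean, n_resamples=500, seed=42)
```

The estimator is called once per resample, so a window whose parameters take a long time to estimate costs `n_resamples` times as much when a standard error is required, and this cost is incurred again for every position of the window.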

Finally, the whole process needs some time for stabilisation, as empty graphs are assumed at the beginning of the simulation. Also, sudden changes like structural breaks will not change the networks directly. Once again, a stabilisation phase will probably be observed. Depending on the parameters of deterministic time dependent changes (see section III-A) this may need some time. Before the process has stabilised, it is difficult to draw conclusions about evolutionary effects.

However, as the beginning of phase IV seems especially interesting (decreasing community effects and increasing affiliation effects were observed), a sliding window with a width of one day is moved from 7th March 2008 in one day steps (no overlapping in this case). Parameters for the week before (beginning on 29th February) were estimated at once. Although all eight effects were tested together, only the changes of some effects are plotted in the following. 7th March was chosen as it is the first day in the whole dataset on which the number of sent PMs strongly increases. Therefore this day will be marked as the beginning of phase IV in the following figures. Figure 6 shows this: Before, this number was always below 100 per day; from 7th March on it is always above 400.

Figure 6. Number of private messages per day from February 29 to March 14, 2008.
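The daily counts behind such a plot, and the first day on which a threshold is exceeded, could be derived from the event time stamps roughly as follows. This is only a sketch: the function names and the threshold rule are assumptions for illustration, and the time stamps are synthetic toy data, not values from the Lycos IQ data set.

```python
from collections import Counter
from datetime import date, datetime
from typing import Iterable, Mapping, Optional


def messages_per_day(timestamps: Iterable[datetime]) -> Counter:
    """Count events per calendar day; days without events do not appear."""
    return Counter(ts.date() for ts in timestamps)


def first_day_above(daily_counts: Mapping[date, int], threshold: int) -> Optional[date]:
    """Return the earliest day whose count exceeds the threshold, if any."""
    days = sorted(day for day, count in daily_counts.items() if count > threshold)
    return days[0] if days else None


# Synthetic time stamps: 2, 3 and 6 messages on three consecutive days.
timestamps = (
    [datetime(2008, 3, 5, 10 + i) for i in range(2)]
    + [datetime(2008, 3, 6, 10 + i) for i in range(3)]
    + [datetime(2008, 3, 7, 10 + i) for i in range(6)]
)
daily = messages_per_day(timestamps)
jump_day = first_day_above(daily, threshold=4)
```

A single threshold only marks the first day above a chosen level. Whether the level stays high afterwards, as described for phase IV, has to be checked separately on the following days.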

But what happened on these days? Figure 7 shows a plot of the community parameters Communication Dyad, Communication Triad and Single Arc. With the beginning of phase IV the parameters decrease. It seems as if most of the new communication cannot be explained with community effects. People rather tend to communicate with a lot of others now and they communicate more outside of dense clusters.

Besides the change of these structures, nearly all others stay similar (or change chaotically) after the beginning of phase IV. This can be explained with a necessary stabilisation. Only the tendency for private messages to question openers (see figure 4(c)) seems to increase and even slowly seems to become positive, as figure 8 indicates. Still, on the first three days its value seems to change chaotically.

Using this exploratory approach, it is difficult to say what actually caused the sudden increase in PM communication. Probably it was just a change in functionality. Maybe sending private messages to a question opener was simply made easier by including a direct link to this function on the question site.

Figure 7. Change of parameters Single Arc, Communication Dyad and Communication Triad at the beginning of phase IV.

Figure 8. Increase of the tendency for messages to question openers at the beginning of phase IV.

If more of the mentioned problems with sliding windows can be solved in the future, there are some possible business applications. For example, this approach could be used for “monitoring” social communities in a business context. Especially when combined with additional statistical descriptions, like a standard error, these plots may give early hints on structural changes in actor behaviour. They might, for example, help to evaluate activities that aim at strengthening group formation.

VI. CONCLUSION AND FURTHER RESEARCH

In this paper the structural dynamics of private message communication in the Q&A community Lycos IQ were analysed. First, four general phases of the community could be identified and it was tested whether the decisions of actors about private message receivers have different structural characteristics in these phases. An event based exponential random graph model was introduced and estimated using both community and question affiliation statistics. The last phase, which is characterized by an increasing number of private messages, turned out to be different from the phases before, as private communication suddenly depended more on how actors were connected to questions. A sliding window approach was used to visualize how the importance of community structures suddenly decreased at the beginning of this phase.

The model seems to be suitable for analysing event based ERGM dynamics in combined communication and affiliation networks. However, there are some open questions that will be addressed in future research. First, the goodness of fit of the current model cannot be evaluated. Although all parameters were tested for their relative explanatory value, there is not yet a measure which gives the optimal number of parameters. Second, the sliding window approach of section V will need to be specified more precisely. As already discussed, there are open questions concerning the best window size or the inclusion of further statistical measures. Finally, the current model only describes a small part of the overall dynamics in the Q&A community. It could be extended to measure, for example, co-evolutionary dynamics of private messages, questions and answers in all directions.

ACKNOWLEDGEMENT

The first author acknowledges support from the Deutsche Forschungsgemeinschaft (DFG), Graduate School IME at Karlsruhe Institute of Technology. The research leading to these results has received funding from the European Community’s Seventh Framework Programme FP7/2007-2013 under grant agreement n° 215453 - WeKnowIt. The authors thank Michael Ovelgönne and Otto Allmendinger for supporting the data preprocessing.

REFERENCES

[1] S. Wasserman and P. Pattison, “Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*,” Psychometrika, vol. 61, no. 3, pp. 401–425, 1996.

[2] T. A. Snijders, P. E. Pattison, G. L. Robins, and M. S. Handcock, “New specifications for exponential random graph models,” Sociological Methodology, pp. 99–153, 2006.

[3] G. Robins, P. Pattison, Y. Kalish, and D. Lusher, “An introduction to exponential random graph (p*) models for social networks,” Social Networks, vol. 29, no. 2, pp. 173–191, 2007.

[4] C. T. Butts, “A relational event framework for social action,” Sociological Methodology, vol. 38, no. 1, pp. 155–200, 2008.

[5] U. Brandes, J. Lerner, and T. A. B. Snijders, “Networks evolving step by step: Statistical analysis of dyadic event data,” in Proceedings of the 2009 International Conference on Advances in Social Network Analysis and Mining, 2009, to appear.

[6] C. Stadtfeld, “A framework for the analysis of social network event streams,” forthcoming, 2009.

[7] P. Wang, K. Sharpe, G. L. Robins, and P. E. Pattison, “Exponential random graph (p*) models for affiliation networks,” Social Networks, vol. 31, pp. 12–25, 2009.

[8] L. Zenk and C. Stadtfeld, “Dynamic organizations. How to measure evolution and change in organizations by analyzing email communication,” in Proceedings of Applications of Social Network Analysis (ASNA), to appear, 2009.

[9] A. C. Davison, Statistical Models, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003.

[10] P. Deuflhard, Newton Methods for Nonlinear Problems, ser. Springer Series in Computational Mathematics. Springer-Verlag Berlin Heidelberg, 2004, no. 35.

[11] B. Efron and R. Tibshirani, “Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy,” Statistical Science, vol. 1, no. 1, pp. 54–77, 1986.

[12] O. Frank and D. Strauss, “Markov graphs,” Journal of the American Statistical Association, vol. 81, no. 395, Sep. 1986.

[13] W. Greiner, L. Neise, and H. Stöcker, “Theoretische Physik,” in Thermodynamik und Statistische Mechanik, 2nd ed., W. Greiner, Ed. Verlag Harri Deutsch, 1993, vol. 9.

[14] R. H. Myers, D. C. Montgomery, and G. G. Vining, Generalized Linear Models, ser. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 2002.

[15] G. Robins, T. Snijders, P. Wang, M. Handcock, and P. Pattison, “Recent developments in exponential random graph (p*) models for social networks,” Social Networks, vol. 29, pp. 192–215, 2007.

[16] T. A. Snijders, “Statistical methods for network dynamics,” in Proceedings of the XLIII Scientific Meeting, Italian Statistical Society, S. L. et al., Ed., 2006, pp. 281–296.

[17] ——, “Models and methods in social network analysis,” in Models for Longitudinal Network Data, J. S. P. Carrington and S. Wasserman, Eds. New York: Cambridge University Press, 2005.

[18] T. A. B. Snijders, “The statistical evaluation of social network dynamics,” Sociological Methodology, vol. 31, pp. 361–395, 2001.

[19] S. Wasserman and K. Faust, Social Network Analysis, 1st ed., ser. Structural Analysis in the Social Sciences. Cambridge: Cambridge University Press, 1994, vol. 8.

[20] G. A. Young and R. L. Smith, Essentials of Statistical Inference, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2005.
