
Scalable Learning of Collective Behavior

Lei Tang, Member, IEEE, Xufei Wang, Student Member, IEEE, and Huan Liu, Fellow, IEEE

L. Tang is with Yahoo! Labs, 4301 Great America Pkwy, Santa Clara, CA 95054. E-mail: [email protected].

X. Wang and H. Liu are with Computer Science and Engineering, Arizona State University, PO Box 878809, Tempe, AZ 85287-8809. E-mail: {Xufei.Wang, Huan.Liu}@asu.edu.

Manuscript received 31 May 2010; revised 18 Nov. 2010; accepted 9 Jan. 2011; published online 7 Feb. 2011. Recommended for acceptance by Y. Chen. Digital Object Identifier no. 10.1109/TKDE.2011.38.

Abstract: This study of collective behavior aims to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges for studying collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating prediction performance comparable to other nonscalable methods.

Index Terms: Classification with network data, collective behavior, community detection, social dimensions.

    1 INTRODUCTION

The advancement in computing and communication technologies enables people to get together and share information in innovative ways. Social networking sites (a recent phenomenon) empower people of different ages and backgrounds with new forms of collaboration, communication, and collective intelligence. Prodigious numbers of online volunteers collaboratively write encyclopedia articles of unprecedented scope and scale; online marketplaces recommend products by investigating user shopping behavior and interactions; and political movements also exploit new forms of engagement and collective action. In the same process, social media provides ample opportunities to study human interactions and collective behavior on an unprecedented scale. In this work, we study how networks in social media can help predict some human behaviors and individual preferences. In particular, given the behavior of some individuals in a network, how can we infer the behavior of other individuals in the same social network [1]? This study can help better understand behavioral patterns of users in social media for applications like social advertising and recommendation.

Typically, the connections in social media networks are not homogeneous. Different connections are associated with distinctive relations. For example, one user might maintain connections simultaneously to his friends, family, college classmates, and colleagues. This relationship information, however, is not always fully available in reality. Mostly, we have access to the connectivity information between users, but we have no idea why they are connected to each other. This heterogeneity of connections limits the effectiveness of a commonly used technique for network classification: collective inference.

A recent framework based on social dimensions [2] is shown to be effective in addressing this heterogeneity. The framework suggests a novel way of network classification: first, capture the latent affiliations of actors by extracting social dimensions based on network connectivity, and next, apply extant data mining techniques to classification based on the extracted dimensions. In the initial study, modularity maximization [3] was employed to extract social dimensions. The superiority of this framework over other representative relational learning methods has been verified with social media data in [2].

The original framework, however, is not scalable to handle networks of colossal sizes because the extracted social dimensions are rather dense. In social media, a network of millions of actors is very common. With a huge number of actors, extracted dense social dimensions cannot even be held in memory, causing a serious computational problem. Sparsifying social dimensions can be effective in eliminating the scalability bottleneck. In this work, we propose an effective edge-centric approach to extract sparse social dimensions [4]. We prove that with our proposed approach, sparsity of social dimensions is guaranteed. Extensive experiments are then conducted with social media data. The framework based on sparse social dimensions, without sacrificing the prediction performance, is capable of efficiently handling real-world networks of millions of actors.

    2 COLLECTIVE BEHAVIOR

Collective behavior refers to the behaviors of individuals in a social networking environment, but it is not simply the aggregation of individual behaviors. In a connected environment, individuals' behaviors tend to be interdependent, influenced by the behavior of friends. This naturally leads to behavior correlation between connected users [5]. Take marketing as an example: if our friends buy something, there is a better than average chance that we will buy it, too.

This behavior correlation can also be explained by homophily [6]. Homophily is a term coined in the 1950s to explain our tendency to link with one another in ways that confirm, rather than test, our core beliefs. Essentially, we are more likely to connect to others who share certain similarities with us. This phenomenon has been observed not only in the many processes of a physical world, but also in online systems [7], [8]. Homophily results in behavior correlations between connected friends. In other words, friends in a social network tend to behave similarly.

The recent boom of social media enables us to study collective behavior on a large scale. Here, behaviors include a broad range of actions: joining a group, connecting to a person, clicking on an ad, becoming interested in certain topics, dating people of a certain type, etc. In this work, we attempt to leverage the behavior correlation presented in a social network in order to predict collective behavior in social media. Given a network with the behavioral information of some actors, how can we infer the behavioral outcome of the remaining actors within the same network? Here, we assume the studied behavior of one actor can be described with K class labels {c_1, ..., c_K}. Each label, c_i, can be 0 or 1. For instance, one user might join multiple groups of interest, so c_i = 1 denotes that the user subscribes to group i, and c_i = 0 otherwise. Likewise, a user can be interested in several topics simultaneously, or click on multiple types of ads. One special case is K = 1, indicating that the studied behavior can be described by a single label with 1 and 0. For example, if the event is the presidential election, 1 or 0 indicates whether or not a voter voted for Barack Obama. The problem we study can be described formally as follows.

Suppose there are K class labels Y = {c_1, ..., c_K}. Given a network G = (V, E, Y), where V is the vertex set, E is the edge set, and Y_i ⊆ Y are the class labels of a vertex v_i ∈ V, and given known values of Y_i for some subset of vertices V^L, how can we infer the values of Y_i (or an estimated probability over each label) for the remaining vertices V^U = V \ V^L?

It should be noted that this problem shares the same spirit as within-network classification [9]. It can also be considered as a special case of semi-supervised learning [10] or relational learning [11] where objects are connected within a network. Some of these methods, if applied directly to social media, yield only limited success [2]. This is because connections in social media are rather noisy and heterogeneous. In the next section, we will discuss the connection heterogeneity in social media, review the concept of social dimension, and anatomize the scalability limitations of the earlier model proposed in [2], which provides a compelling motivation for this work.

    3 SOCIAL DIMENSIONS

Connections in social media are not homogeneous. People can connect to their family, colleagues, college classmates, or buddies met online. Some relations are helpful in determining a targeted behavior (category) while others are not. This relation-type information, however, is often not readily available in social media. A direct application of collective inference [9] or label propagation [12] would treat connections in a social network as if they were homogeneous. To address the heterogeneity present in connections, a framework (SocioDim) [2] has been proposed for collective behavior learning.

The framework SocioDim is composed of two steps: 1) social dimension extraction, and 2) discriminative learning. In the first step, latent social dimensions are extracted based on network topology to capture the potential affiliations of actors. These extracted social dimensions represent how each actor is involved in diverse affiliations. One example of the social dimension representation is shown in Table 1. The entries in this table denote the degree to which one user is involved in an affiliation. These social dimensions can be treated as features of actors for subsequent discriminative learning. Since a network is converted into features, typical classifiers such as support vector machines and logistic regression can be employed. The discriminative learning procedure will determine which social dimensions correlate with the targeted behavior and then assign proper weights.

TABLE 1: Social Dimension Representation
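To make the two-step pipeline concrete, here is a minimal sketch in Python. The helper name extract_social_dimensions is a placeholder for any soft community detection that returns an actors-by-dimensions matrix, and scikit-learn's LinearSVC stands in for the "SVM or logistic regression" step; neither name is prescribed by the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sociodim_pipeline(adjacency, labels, labeled_idx, unlabeled_idx,
                      extract_social_dimensions, k=500):
    """Two-step SocioDim pipeline: (1) extract social dimensions from the
    network topology, (2) train a discriminative classifier on them.

    adjacency : (n, n) adjacency matrix of the network
    labels    : (n,) array of behavior labels (single-label case for brevity)
    extract_social_dimensions : callable returning an (n, k) feature matrix
    """
    # Step 1: latent affiliations of all actors, observed or not.
    dims = extract_social_dimensions(adjacency, k)

    # Step 2: discriminative learning on the labeled actors only.
    clf = LinearSVC()
    clf.fit(dims[labeled_idx], labels[labeled_idx])

    # Predict the behavior of the unobserved actors in the same network.
    return clf.predict(dims[unlabeled_idx])
```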

A key observation is that actors of the same affiliation tend to connect with each other. For instance, it is reasonable to expect people of the same department to interact with each other more frequently. Hence, to infer actors' latent affiliations, we need to find out groups of people who interact with each other more frequently than at random. This boils down to a classic community detection problem. Since each actor can get involved in more than one affiliation, a soft clustering scheme is preferred.

In the initial instantiation of the framework SocioDim, a spectral variant of modularity maximization [3] is adopted to extract social dimensions. The social dimensions correspond to the top eigenvectors of a modularity matrix. It has been empirically shown that this framework outperforms other representative relational learning methods on social media data. However, there are several concerns about the scalability of SocioDim with modularity maximization:

• Social dimensions extracted according to soft clustering, such as modularity maximization and probabilistic methods, are dense. Suppose there are 1 million actors in a network and 1,000 dimensions are extracted. If standard double precision numbers are used, holding the full matrix alone requires 1M x 1K x 8 bytes = 8 GB of memory. This large-size dense matrix poses thorny challenges for the extraction of social dimensions as well as subsequent discriminative learning.

• Modularity maximization requires us to compute the top eigenvectors of a modularity matrix, which is the same size as a given network. In order to extract k communities, typically k - 1 eigenvectors are computed. For a sparse or structured matrix, the eigenvector computation costs O(h(mk + nk^2 + k^3)) time [13], where h, m, and n are the number of iterations, the number of edges in the network, and the number of nodes, respectively. Though computing the top single eigenvector (i.e., k = 1), such as PageRank scores, can be done very efficiently, computing thousands of eigenvectors or even more for a mega-scale network becomes a daunting task.

• Networks in social media tend to evolve, with new members joining and new connections occurring between existing members each day. This dynamic nature of networks entails an efficient update of the model for collective behavior prediction. Efficient online updates of eigenvectors with expanding matrices remain a challenge.

Consequently, it is imperative to develop scalable methods that can handle large-scale networks efficiently without extensive memory requirements. Next, we elucidate an edge-centric clustering scheme to extract sparse social dimensions. With such a scheme, we can also update the social dimensions efficiently when new nodes or new edges arrive.
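For reference, a minimal sketch of the spectral ModMax extraction discussed above, using the usual definition of the modularity matrix B = A - d d^T / (2m) [3]; the exact number of eigenvectors and any post-processing are simplified here, and the dense n-by-k output it returns is precisely the memory bottleneck noted in the first concern.

```python
import numpy as np
from scipy.sparse.linalg import eigsh, LinearOperator

def modmax_dimensions(A, k):
    """Dense social dimensions via spectral modularity maximization.

    A : (n, n) symmetric sparse adjacency matrix (scipy.sparse)
    k : number of social dimensions to extract
    Returns an (n, k) dense matrix, i.e., n * k * 8 bytes of memory.
    """
    d = np.asarray(A.sum(axis=1)).ravel()   # node degrees
    two_m = d.sum()                          # 2 * number of edges
    n = A.shape[0]

    # Modularity matrix B = A - d d^T / (2m), applied implicitly so that
    # the dense n x n matrix itself is never materialized.
    def matvec(x):
        return A @ x - d * (d @ x) / two_m
    B = LinearOperator((n, n), matvec=matvec, dtype=float)

    # Top-k eigenvectors of B serve as the social dimensions.
    _, vecs = eigsh(B, k=k, which='LA')
    return vecs
```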

    4 SPARSE SOCIAL DIMENSIONS

In this section, we first show one toy example to illustrate the intuition of communities in an edge view and then present potential solutions to extract sparse social dimensions.

    4.1 Communities in an Edge-Centric View

Though SocioDim with soft clustering for social dimension extraction demonstrated promising results, its scalability is limited. A network may be sparse (i.e., the density of connectivity is very low), whereas the extracted social dimensions are not sparse. Let's look at the toy network with two communities in Fig. 1. Its social dimensions following modularity maximization are shown in Table 2. Clearly, none of the entries is zero. When a network expands into millions of actors, a reasonably large number of social dimensions need to be extracted. The corresponding memory requirement hinders both the extraction of social dimensions and the subsequent discriminative learning. Hence, it is imperative to develop some other approach so that the extracted social dimensions are sparse.

It seems reasonable to state that the number of affiliations one user can participate in is upper bounded by his connections. Consider one extreme case that an actor has only one connection. It is expected that he is probably active in only one affiliation. It is not necessary to assign a nonzero score for each of the many other affiliations. Assuming each connection represents one involved affiliation, we can expect that the number of affiliations an actor has is no more than that of his connections. Rather than defining a community as a set of nodes, we redefine it as a set of edges. Thus, communities can be identified by partitioning the edges of a network into disjoint sets.

An actor is considered associated with one affiliation if one of his connections is assigned to that affiliation. For instance, the two communities in Fig. 1 can be represented by two edge sets in Fig. 2, where the dashed edges represent one affiliation, and the remaining edges denote the second affiliation. The disjoint edge clusters in Fig. 2 can be converted into the representation of social dimensions as shown in the last two columns of Table 2, where an entry is 1 (0) if an actor is (not) involved in the corresponding social dimension. Node 1 is affiliated with both communities because it has edges in both sets. By contrast, node 3 is assigned to only one community, as all its connections are in the dashed edge set.

To extract sparse social dimensions, we partition edges rather than nodes into disjoint sets. The edges of those actors with multiple affiliations (e.g., actor 1 in the toy network) are separated into different clusters. Though the partition in the edge view is disjoint, the affiliations in the node-centric view can overlap. Each node can engage in multiple affiliations.

In addition, the extracted social dimensions following edge partition are guaranteed to be sparse. This is because the number of one's affiliations is no more than that of her connections. Given a network with m edges and n nodes, if k social dimensions are extracted, then each node v_i has no more than min(d_i, k) nonzero entries in her social dimensions, where d_i is the degree of node v_i. We have the following theorem about the density of extracted social dimensions.

Theorem 1. Suppose k social dimensions are extracted from a network with m edges and n nodes. The density (proportion of nonzero entries) of the social dimensions based on edge partition is bounded by the following:

  \text{density} \le \frac{\sum_{i=1}^{n} \min(d_i, k)}{nk} = \frac{\sum_{\{i \mid d_i \le k\}} d_i + \sum_{\{i \mid d_i > k\}} k}{nk}.    (1)

Moreover, for many real-world networks whose node degree follows a power law distribution, the upper bound in (1) can be approximated as follows:

  \frac{\alpha - 1}{\alpha - 2} \cdot \frac{1}{k} - \frac{1}{\alpha - 2} \cdot \frac{1}{k^{\alpha - 1}},    (2)

where α > 2 is the exponent of the power law distribution.


Fig. 1. A toy example.

TABLE 2: Social Dimension(s) of the Toy Example

    Fig. 2. Edge clusters.


The proof is given in the appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.38. To give a concrete example, we examine a YouTube network with 1 million actors and verify the upper bound of the density. The YouTube network has 1,128,499 nodes and 2,990,443 edges. Suppose we want to extract 1,000 dimensions from the network. Since 232 nodes in the network have a degree larger than 1,000, the density is upper bounded by (5,472,909 + 232 x 1,000) / (1,128,499 x 1,000) = 0.51% following (1). The node degree distribution in the network follows a power law with exponent α = 2.14 based on maximum likelihood estimation [14]. Thus, the approximate upper bound in (2) for this specific case is 0.54 percent.
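As a quick sanity check of the two bounds, the snippet below recomputes them; the inputs are only the aggregate statistics quoted above (not the actual YouTube degree list), and the helper names are ours.

```python
import numpy as np

def density_upper_bound(degrees, k):
    """Exact bound (1): sum_i min(d_i, k) / (n * k), given a degree sequence."""
    degrees = np.asarray(degrees)
    return np.minimum(degrees, k).sum() / (degrees.size * k)

def density_powerlaw_bound(alpha, k):
    """Approximate bound (2) for a power-law degree distribution."""
    return (alpha - 1) / (alpha - 2) / k - 1 / (alpha - 2) / k ** (alpha - 1)

# Reproducing the YouTube figures quoted in the text (aggregates only):
n, k, alpha = 1_128_499, 1_000, 2.14
low_degree_sum, high_degree_nodes = 5_472_909, 232
exact = (low_degree_sum + high_degree_nodes * k) / (n * k)
print(f"bound (1): {exact:.2%}")                               # ~0.51%
print(f"bound (2): {density_powerlaw_bound(alpha, k):.2%}")    # ~0.54%
```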

Note that the upper bound in (1) is network specific, whereas (2) gives an approximate upper bound for a family of networks. It is observed that most power law distributions occurring in nature have 2 ≤ α ≤ 3 [14]. Hence, the bound in (2) is valid most of the time. Fig. 3 shows the function in terms of α and k. Note that when k is huge (close to 10,000), the social dimensions become extremely sparse.

Hence, the line graph is not suitable for practical use. Next, we will present an alternative approach to edge partition, which is much more efficient and scalable.

    4.3 Edge Partition via Clustering Edge Instances

In order to partition edges into disjoint sets, we treat edges as data instances with their terminal nodes as features. For instance, we can treat each edge in the toy network in Fig. 1 as one instance, and the nodes that define edges as features. This results in a typical feature-based data format as in Table 3. Then, a typical clustering algorithm like k-means clustering can be applied to find disjoint partitions.
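A small sketch of this edge-instance construction (our own helper, assuming nodes are indexed 0..n-1): each edge becomes one row with exactly two nonzero features, mirroring Table 3.

```python
import numpy as np
from scipy.sparse import csr_matrix

def edge_instances(edges, n_nodes):
    """Represent each edge (u, v) as a sparse instance with two nonzero
    features, one per terminal node."""
    rows = np.repeat(np.arange(len(edges)), 2)   # one row per edge
    cols = np.array(edges).ravel()               # the two node ids of each edge
    vals = np.ones(2 * len(edges))
    return csr_matrix((vals, (rows, cols)), shape=(len(edges), n_nodes))

# Toy usage: a triangle plus a pendant node.
X = edge_instances([(0, 1), (1, 2), (0, 2), (2, 3)], n_nodes=4)
print(X.toarray())   # each row has exactly two ones
```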

One concern with this scheme is that the total number of edges might be too huge. Owing to the power law distribution of node degrees present in social networks, the total number of edges is normally linear, rather than quadratic, with respect to the number of nodes in the network. That is, m = O(n), as stated in the following theorem.

Theorem 4. The total number of edges is usually linear, rather than quadratic, with respect to the number of nodes in a network with a power law degree distribution. In particular, the expected number of edges is given as

  E[m] = \frac{n}{2} \cdot \frac{\alpha - 1}{\alpha - 2},    (5)

where α is the exponent of the power law distribution. Please see the appendix, available in the online supplemental material, for the proof.
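The intuition behind Theorem 4 can be sketched with a continuous approximation (an informal derivation assuming a power-law degree density p(d) = (α - 1) d^{-α} for d ≥ 1 and α > 2; the formal proof is in the appendix):

  \mathbb{E}[d] = \int_{1}^{\infty} d \,(\alpha - 1)\, d^{-\alpha}\, \mathrm{d}d = \frac{\alpha - 1}{\alpha - 2},
  \qquad
  \mathbb{E}[m] = \frac{n\,\mathbb{E}[d]}{2} = \frac{n}{2} \cdot \frac{\alpha - 1}{\alpha - 2} = O(n),

so the expected number of edges grows only linearly with the number of nodes.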

Still, millions of edges are the norm in a large-scale network. Direct application of some existing k-means implementations cannot handle the problem. For example, the k-means code provided in the Matlab package requires the computation of the similarity matrix between all pairs of data instances, which would exhaust the memory of normal PCs in seconds. Therefore, an implementation with online computation is preferred.

On the other hand, the data of edge instances are quite sparse and structured. As each edge connects two nodes, the corresponding edge instance has exactly two nonzero features, as shown in Table 3. This sparsity can help accelerate the clustering process if exploited wisely. We conjecture that the centroids of k-means should also be feature sparse. Often, only a small portion of the data instances share features with a centroid. Hence, we just need to compute the similarity of the centroids with their relevant instances. In order to efficiently identify instances relevant to one centroid, we build a mapping from features (nodes) to instances (edges) beforehand. Once we have the mapping, we can easily identify the relevant instances by checking the nonzero features of the centroid.

By taking into account the two concerns above, we devise a k-means variant as shown in Fig. 5. Similar to k-means, this algorithm also maximizes within-cluster similarity, as shown in

  \arg\max_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \frac{x_j \cdot \mu_i}{\|x_j\|\,\|\mu_i\|},    (6)

where k is the number of clusters, S = {S_1, S_2, ..., S_k} is the set of clusters, and μ_i is the centroid of cluster S_i. In Fig. 5, we keep only a vector MaxSim to represent the maximum similarity between one data instance and a centroid. In each iteration, we first identify the instances relevant to a centroid, and then compute similarities of these instances with the centroid. This avoids iterating over each instance and each centroid, which would cost O(mk) otherwise. Note that a centroid contains a feature (node) if and only if some edge of that node is assigned to the cluster. In effect, most data instances (edges) are associated with few (much less than k) centroids. By taking advantage of the feature-instance mapping, the cluster assignment for all instances (lines 5-11 in Fig. 5) can be fulfilled in O(m) time. Computing the new centroids (lines 12-13) costs O(m) time as well. Hence, each iteration costs only O(m) time. Moreover, the algorithm requires only the feature-instance mapping and the network data to reside in main memory, which costs O(m + n) space. Thus, as long as the network data can be held in memory, this clustering algorithm is able to partition its edges into disjoint sets.
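A compact sketch of such a sparsity-aware k-means variant follows. It is our own illustration of the idea rather than the authors' exact Fig. 5 procedure: it clusters the sparse edge instances with cosine similarity, keeps centroids as sparse node-to-weight maps, and scores each edge only against centroids that share at least one node with it via the feature-instance mapping.

```python
import numpy as np
from collections import defaultdict

def edge_cluster(edges, k, n_iter=10, seed=0):
    """Partition edges into k disjoint clusters, exploiting the fact that
    every edge instance has only two nonzero features (its endpoints)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(edges))            # random initial assignment

    for _ in range(n_iter):
        # Build sparse centroids {node: weight} from the current assignment.
        centroids = [defaultdict(float) for _ in range(k)]
        for (u, v), c in zip(edges, assign):
            centroids[c][u] += 1.0
            centroids[c][v] += 1.0
        norms = [np.sqrt(sum(w * w for w in cen.values())) or 1.0 for cen in centroids]

        # Feature-to-centroid mapping: which clusters currently touch each node.
        node_clusters = defaultdict(set)
        for c, cen in enumerate(centroids):
            for node in cen:
                node_clusters[node].add(c)

        # Assignment step: score each edge only against relevant centroids.
        for j, (u, v) in enumerate(edges):
            best, best_sim = assign[j], -1.0
            for c in node_clusters[u] | node_clusters[v]:
                sim = (centroids[c].get(u, 0.0) + centroids[c].get(v, 0.0)) \
                      / (np.sqrt(2.0) * norms[c])          # cosine similarity
                if sim > best_sim:
                    best, best_sim = c, sim
            assign[j] = best
    return assign
```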

As a simple k-means is adopted to extract social dimensions, it is easy to update the social dimensions if a given network changes. If a new member joins the network and a new connection emerges, we can simply assign the new edge to the corresponding clusters. The update of centroids with the new arrival of connections is also straightforward. This k-means scheme is especially applicable to dynamic large-scale networks.

TABLE 3: Edge Instances of the Toy Network in Fig. 1

Fig. 5. Algorithm of scalable k-means variant.

    4.4 Regularization on Communities

The extracted social dimensions are treated as features of nodes, and conventional supervised learning can then be conducted. In order to handle large-scale data with high dimensionality and vast numbers of instances, we adopt a linear SVM, which can be trained in linear time [18]. Generally, the larger a community is, the weaker the connections within the community are. Hence, we would like to build an SVM relying more on communities of smaller sizes by modifying the typical SVM objective function as follows:

  \min \; \lambda \sum_{i=1}^{n} \left| 1 - y_i (x_i^T w + b) \right| + \frac{1}{2} w^T \Sigma w,    (7)

where λ is a regularization parameter (we use a notation different from the standard SVM formulation to avoid confusion with the community notation C), |z| = max(0, z) is the hinge loss function, and Sigma is a diagonal matrix used to regularize the weights assigned to different communities. In particular, Sigma_jj = h(|C_j|) is the penalty coefficient associated with community C_j. The penalty function h should be monotonically increasing with respect to the community size. In other words, a larger weight assigned to a large community results in a higher cost.

Interestingly, with a little manipulation, the formula in (7) can be solved using a standard SVM with modified input. Let X represent the input data with each row being an instance (e.g., Table 2), and let w~ = Sigma^{1/2} w. Then, the formula can be rewritten as

  \min \; \lambda \sum_i |1 - y_i(x_i^T w + b)| + \tfrac{1}{2} w^T \Sigma^{1/2} \Sigma^{1/2} w
  = \min \; \lambda \sum_i |1 - y_i(x_i^T \Sigma^{-1/2} \Sigma^{1/2} w + b)| + \tfrac{1}{2} \|\tilde{w}\|_2^2
  = \min \; \lambda \sum_i |1 - y_i(\tilde{x}_i^T \tilde{w} + b)| + \tfrac{1}{2} \|\tilde{w}\|_2^2,

where x~_i^T = x_i^T Sigma^{-1/2}. Hence, given an input data matrix X, we only need to left multiply X by Sigma^{-1/2}. It is observed that community sizes of a network tend to follow a power law distribution. Hence, we recommend h(|C_j|) = log|C_j| or (log|C_j|)^2.

Meanwhile, one node is likely to engage in multiple communities. Intuitively, a node which is devoted to a few select communities has more influence on other group members than those who participate in many communities. For effective classification learning, we regularize node affiliations by normalizing each node's social dimensions to sum to one. Later, we will study which regularization approach has a greater effect on classification.
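A minimal sketch of both regularizations on a social-dimension matrix X (rows = nodes, columns = communities), assuming the penalty h(|C_j|) = (log|C_j|)^2 used in the experiments and measuring community size by column sums; the column scaling by Sigma^{-1/2} implements the left-multiplication described above, and the row normalization implements the node-affiliation regularization.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def regularize_dimensions(X, size_penalty=True, node_affiliation=True):
    """Apply community-size and/or node-affiliation regularization to a
    sparse node-by-community social-dimension matrix X (0/1 entries)."""
    X = csr_matrix(X, dtype=float)

    if size_penalty:
        # Sigma_jj = h(|C_j|) = (log|C_j|)^2, so scale column j by 1 / log|C_j|.
        # Clamping sizes at 2 (an illustration choice) avoids division by log(1) = 0.
        sizes = np.maximum(np.asarray(X.sum(axis=0)).ravel(), 2.0)
        X = X @ diags(1.0 / np.log(sizes))

    if node_affiliation:
        # Normalize each node's affiliations to sum to one.
        row_sums = np.asarray(X.sum(axis=1)).ravel()
        row_sums[row_sums == 0] = 1.0
        X = diags(1.0 / row_sums) @ X
    return X
```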

In summary, we conduct the following steps to learn a model for collective behavior. Given a network (say, Fig. 1), we take the edge-centric view of the network data (Table 3) and partition the edges into disjoint sets (Fig. 2). Based on the edge clustering, social dimensions can be constructed (Table 2). Certain regularizations can be applied. Then, discriminative learning and prediction can be accomplished by considering these social dimensions as features. The detailed algorithm is summarized in Fig. 6.

Fig. 6. Algorithm for learning of collective behavior.

    5 EXPERIMENT SETUP

In this section, we present the data collected from social media for evaluation and the baseline methods for comparison.

    5.1 Social Media Data

Two data sets reported in [2] are used to examine our proposed model for collective behavior learning. The first data set is acquired from BlogCatalog (http://www.blogcatalog.com/), the second from the popular photo sharing site Flickr (http://www.flickr.com/). Concerning behavior, following [2], we study whether or not a user joins a group of interest. Since the BlogCatalog data do not have this group information, we use blogger interests as the behavior labels. Both data sets are publicly available at the first author's homepage. To examine scalability, we also include a mega-scale network (http://socialnetworks.mpi-sws.org/data-imc2007.html) crawled from YouTube (http://www.youtube.com/). We remove those nodes without connections and select the interest groups with at least 500 subscribers. Some statistics of the three data sets can be found in Table 4.

TABLE 4: Statistics of Social Media Data

    5.2 Baseline Methods

As we discussed in Section 4.2, constructing a line graph is prohibitive for large-scale networks. Hence, the line-graph approach is not included for comparison. Alternatively, the edge-centric clustering (or EdgeCluster) of Section 4.3 is used to extract social dimensions on all data sets. We adopt cosine similarity while performing the clustering. Based on cross validation, the dimensionality is set to 5,000, 10,000, and 1,000 for BlogCatalog, Flickr, and YouTube, respectively. A linear SVM classifier [18] is then exploited for discriminative learning.

Another related approach to finding edge partitions is bi-connected components [19]. Bi-connected components of a graph are the maximal subsets of vertices such that the removal of a vertex from a particular component will not disconnect the component. Essentially, any two nodes in a bi-connected component are connected by at least two paths. It is highly related to cut vertices (a.k.a. articulation points) in a graph, whose removal will result in an increase in the number of connected components. Those cut vertices are the bridges connecting different bi-connected components. Thus, searching for bi-connected components boils down to searching for articulation points in the graph, which can be solved efficiently in O(n + m) time. Here n and m represent the number of vertices and edges in a graph, respectively. Each bi-connected component is considered a community and converted into one social dimension for learning.
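For illustration, this BiComponents baseline can be reproduced with an off-the-shelf routine; the sketch below uses networkx's biconnected_components (a linear-time, Hopcroft-Tarjan-style algorithm [19]) and turns each component into one binary social dimension. This is an assumed implementation route, not necessarily the one used by the authors.

```python
import networkx as nx
import numpy as np

def bicomponent_dimensions(edges, n_nodes):
    """One binary social dimension per bi-connected component."""
    G = nx.Graph()
    G.add_nodes_from(range(n_nodes))
    G.add_edges_from(edges)

    components = list(nx.biconnected_components(G))   # sets of node ids
    dims = np.zeros((n_nodes, len(components)))
    for j, comp in enumerate(components):
        dims[list(comp), j] = 1.0     # cut vertices appear in several columns
    return dims
```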

We also compare our proposed sparse social dimension approach with classification performance based on dense representations. In particular, we extract social dimensions according to modularity maximization (denoted as ModMax) [2]. ModMax has been shown to outperform other representative relational learning methods based on collective inference. We study how the sparsity in social dimensions affects the prediction performance as well as the scalability.

Note that social dimensions allow one actor to be involved in multiple affiliations. As a proof of concept, we also examine the case when each actor is associated with only one affiliation. Essentially, we construct social dimensions based on node partition. A similar idea has been adopted in a latent group model [20] for efficient inference. To be fair, we adopt k-means clustering to partition the nodes of a network into disjoint sets, and convert the node clustering result into a set of social dimensions. Then, SVM is utilized for discriminative learning. For convenience, we denote this method as NodeCluster.

Note that our prediction problem is essentially multilabel. It is empirically shown that thresholding can affect the final prediction performance drastically [21], [22]. For evaluation purposes, we assume the number of labels of unobserved nodes is already known, and check whether the top-ranking predicted labels match the actual labels. Such a scheme has been adopted in other multilabel evaluation works [23]. We randomly sample a portion of nodes as labeled and report the average performance of 10 runs in terms of Micro-F1 and Macro-F1 [22].
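A small sketch of this evaluation protocol, assuming real-valued decision scores per label (e.g., from a one-vs-rest linear SVM) and using scikit-learn's f1_score; the per-node top-k thresholding mirrors the "number of labels is known" assumption above, and the helper names are ours.

```python
import numpy as np
from sklearn.metrics import f1_score

def topk_predict(scores, true_labels):
    """For each node, keep the top-k scored labels, where k is that node's
    true number of labels (assumed known for evaluation)."""
    pred = np.zeros_like(true_labels)
    for i, (s, y) in enumerate(zip(scores, true_labels)):
        k = int(y.sum())
        if k > 0:
            pred[i, np.argsort(s)[-k:]] = 1
    return pred

def evaluate(scores, true_labels):
    """Return (Micro-F1, Macro-F1) for the top-k predictions."""
    pred = topk_predict(scores, true_labels)
    return (f1_score(true_labels, pred, average='micro'),
            f1_score(true_labels, pred, average='macro'))
```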

    6 EXPERIMENT RESULTS

In this section, we first examine how prediction performance varies with social dimensions extracted following different approaches. Then, we verify the sparsity of social dimensions and its implication for scalability. We also study how the performance varies with dimensionality. Finally, concrete examples of extracted social dimensions are given.

    6.1 Prediction Performance

The prediction performance on all data sets is shown in Tables 5, 6, and 7. The entries in bold face denote the best performance in each column. Obviously, EdgeCluster is the winner most of the time. Edge-centric clustering shows comparable performance to modularity maximization on the BlogCatalog network, yet it outperforms ModMax on Flickr. ModMax on YouTube is not applicable due to the scalability constraint. Clearly, with sparse social dimensions, we are able to achieve performance comparable to that of dense social dimensions. But the benefit in terms of scalability will be tremendous, as discussed in the next section.

TABLE 5: Performance on BlogCatalog Network

TABLE 6: Performance on Flickr Network

TABLE 7: Performance on YouTube Network

The NodeCluster scheme forces each actor to be involved in only one affiliation, yielding inferior performance compared with EdgeCluster. BiComponents, similar to EdgeCluster, also separates edges into disjoint sets, which in turn deliver a sparse representation of social dimensions. However, BiComponents yields poor performance. This is because BiComponents outputs highly imbalanced communities. For example, BiComponents extracts 271 bi-connected components in the BlogCatalog network. Among these 271 components, a dominant one contains 10,042 nodes, while all others are of size 2. Note that BlogCatalog contains in total 10,312 nodes. As a network's connectivity increases, BiComponents performs even worse. For instance, only 10 bi-connected components are found in the Flickr data, and thus its Macro-F1 is close to 0. In short, BiComponents is very efficient and scalable. However, it fails to extract informative social dimensions for classification.

We note that the prediction performance on the studied social media data is around 20-30 percent for the F1 measure. This is partly due to the large number of distinctive labels in the data. Another reason is that only the network information is exploited here. Since SocioDim converts a network into attributes, other behavioral features (if available) can be combined with social dimensions for behavior learning.

    6.2 Scalability Study

As shown in Theorem 1, the social dimensions constructed according to edge-centric clustering are guaranteed to be sparse because the density is upper bounded by a small value. Here, we examine how sparse the social dimensions are in practice. We also study how the computation time (with a Core2Duo E8400 CPU and 4 GB memory) varies with the number of edge clusters. The computation time, the memory footprint of social dimensions, their density, and other related statistics on all three data sets are reported in Tables 8, 9, and 10.

Concerning the time complexity, it is interesting that computing the top eigenvectors of a modularity matrix is actually quite efficient as long as there is no memory concern. This is observed on the Flickr data. However, when the network scales to millions of nodes (YouTube), modularity maximization becomes difficult (though an iterative method or distributed computation can be used) due to its excessive memory requirement. On the contrary, the EdgeCluster method can still work efficiently, as shown in Table 10. The computation time of EdgeCluster for YouTube is much smaller than for Flickr, because the YouTube network is extremely sparse. The number of edges and the average degree in YouTube are smaller than those in Flickr, as shown in Table 4.

TABLE 8: Sparsity Comparison on BlogCatalog Data with 10,312 Nodes

TABLE 9: Sparsity Comparison on Flickr Data with 80,513 Nodes

TABLE 10: Sparsity Comparison on YouTube Data with 1,138,499 Nodes

In Tables 8-10, ModMax-500 corresponds to modularity maximization selecting 500 social dimensions, and EdgeCluster-x denotes edge-centric clustering constructing x dimensions. Time denotes the total time (seconds) to extract the social dimensions; Space represents the memory footprint (megabytes) of the extracted social dimensions; Density is the proportion of nonzero entries in the dimensions; Upper bound is the density upper bound computed following (1); Max-Aff and Ave-Aff denote the maximum and average number of affiliations one user is involved in.

Another observation is that the computation time of EdgeCluster does not change much with varying numbers of clusters. No matter how many clusters exist, the computation time of EdgeCluster is of the same order. This is due to the efficacy of the proposed k-means variant in Fig. 5. In the algorithm, we do not iterate over each cluster and each centroid to do the cluster assignment, but exploit the sparsity of edge-centric data to compute only the similarity of a centroid and those relevant instances. This, in effect, makes the computational cost independent of the number of edge clusters.

As for the memory footprint reduction, sparse social dimensions do an excellent job. On Flickr, with only 500 dimensions, ModMax requires 322.1 MB, whereas EdgeCluster requires less than 100 MB. This effect is stronger on the mega-scale YouTube network, where ModMax becomes impractical to compute directly. It is expected that the social dimensions of ModMax would occupy 4.6 GB of memory. On the contrary, the sparse social dimensions based on EdgeCluster require only 30-50 MB.

The steep reduction of memory footprint can be explained by the density of the extracted dimensions. For instance, in Table 10, when we have 50,000 dimensions, the density is only 5.2 x 10^-5. Consequently, even if the network has more than 1 million nodes, the extracted social dimensions still occupy only a tiny memory space. The upper bound of the density is not tight when the number of clusters k is small. As k increases, the bound becomes tight. In general, the true density is roughly half of the estimated bound.

    6.3 Sensitivity Study

Our proposed EdgeCluster model requires users to specify the number of social dimensions (edge clusters). One question which remains to be answered is how sensitive the performance is with respect to this parameter. We examine all three data sets, but find no strong pattern to determine the optimal dimensionality. Due to the space limit, we include only one case here. Fig. 7 shows the Macro-F1 performance change on the YouTube data. The performance, unfortunately, is sensitive to the number of edge clusters. It thus remains a challenge to determine the parameter automatically.

Fig. 7. Sensitivity to dimensionality.

However, a general trend across all three data sets is observed. The optimal dimensionality increases as the portion of labeled nodes expands. For instance, when 1 percent of nodes in the network are labeled, 500 dimensions seem optimal. But when the labeled nodes increase to 10 percent, 2,000-5,000 dimensions become a better choice. In other words, when label information is scarce, a coarse extraction of latent affiliations is better for behavior prediction. But when the label information multiplies, the affiliations should be zoomed to a more granular level.

    6.4 Regularization Effect

In this section, we study how EdgeCluster performance varies with different regularization schemes. Three different regularizations are implemented: regularization based on community size, on node affiliation, and on both (we first apply community-size regularization, then node-affiliation regularization). The results without any regularization are also included as a baseline for comparison. In our experiments, the community-size regularization penalty function is set to (log|C_i|)^2, and node-affiliation regularization normalizes the summation of each node's community membership to 1. As shown in Fig. 8, regularization on both community size and node affiliation consistently boosts the performance on the Flickr data. We also observe that node-affiliation regularization significantly improves performance. It seems that community-size regularization is not as important as node-affiliation regularization. Similar trends are observed in the other two data sets. We must point out that, when both community-size and node-affiliation regularization are applied, the result is quite similar to the tf-idf weighting scheme for the representation of documents [24]. Such a weighting scheme can actually lead to quite different performance in dealing with social dimensions extracted from a network.

Fig. 8. Regularization effect on Flickr.

    6.5 Visualization of Extracted Social Dimensions

As shown in previous sections, EdgeCluster yields an outstanding performance. But are the extracted social dimensions really sensible? To understand what the extracted social dimensions are, we investigate the tag clouds associated with different dimensions. Tag clouds, with font size denoting a tag's relative frequency, are widely used in social media websites to summarize the most popular ongoing topics. In particular, we aggregate the tags of individuals in one social dimension as the tags of that dimension. However, it is impossible to showcase all the extracted dimensions. Hence, we pick the dimension with the maximum SVM weight given a category.

Due to the space limit, we show only two examples from BlogCatalog. To make the figures legible, we include only those tags whose frequency is greater than 2. The dimension in Fig. 9 is about cars, autos, automobile, pontiac, ferrari, etc. It is highly relevant to the category Autos. Similarly, the dimension in Fig. 10 is about baseball, mlb, basketball, and football. It is informative for classification of the category Sports. There are a few less frequent tags, such as movies and ipod, associated with the selected dimensions as well, suggesting the diverse interests of each person. EdgeCluster, by analyzing connections in a network, is able to extract social dimensions that are meaningful for classification.

Fig. 9. Dimension selected by Autos.

Fig. 10. Dimension selected by Sports.

    7 RELATED WORK

Classification with networked instances is known as within-network classification [9], or a special case of relational learning [11]. The data instances in a network are not independently and identically distributed (i.i.d.) as in conventional data mining. To capture the correlation between labels of neighboring data objects, a Markov dependency assumption is typically made. That is, the label of one node depends on the labels (or attributes) of its neighbors. Normally, a relational classifier is constructed based on the relational features of labeled data, and then an iterative process is required to determine the class labels for the unlabeled data. The class label or the class membership is updated for each node while the labels of its neighbors are fixed. This process is repeated until the label inconsistency between neighboring nodes is minimized. It is shown in [9] that a simple weighted-vote relational neighborhood classifier [25] works reasonably well on some benchmark relational data and is recommended as a baseline for comparison.

However, a network tends to present heterogeneous relations, and the Markov assumption can only capture the local dependency. Hence, researchers propose to model network connections or class labels based on latent groups [20], [26]. A similar idea is also adopted in [2] to differentiate heterogeneous relations in a network by extracting social dimensions to represent the potential affiliations of actors in a network. The authors suggest using the community membership of a soft clustering scheme as social dimensions. The extracted social dimensions are treated as features, and a support vector machine based on them can be constructed for classification. It has been shown that the proposed social dimension approach significantly outperforms representative methods based on collective inference.

There are various approaches to conducting soft clustering on a graph. Some are based on matrix factorization, like spectral clustering [27] and modularity maximization [3]. Probabilistic methods have also been developed [28], [29]. Please refer to [30] for a comprehensive survey. A disadvantage of soft clustering is that the resultant social dimensions are dense, posing thorny computational challenges.

Another line of research closely related to the method proposed in this work is finding overlapping communities.

Palla et al. propose a clique percolation method to discover overlapping dense communities [31]. It consists of two steps: first, find all the cliques of size k in a graph. Two k-cliques are connected if they share k - 1 nodes. Based on the connections between cliques, we can find the connected components with respect to k-cliques. Each component then corresponds to one community. Since a node can be involved in multiple different k-cliques, the resultant community structure allows one node to be associated with multiple different communities. A similar idea is presented in [32], in which the authors propose to find all the maximal cliques of a network and then perform hierarchical clustering.

Gregory [33] extends the Newman-Girvan method [34] to handle overlapping communities. The original Newman-Girvan method recursively removes edges with the highest betweenness until a network is separated into a prespecified number of disconnected components. It outputs nonoverlapping communities only. Therefore, Gregory proposes to add one more action (node splitting) besides edge removal. The algorithm recursively splits nodes that are likely to reside in multiple communities into two, or removes edges that seem to bridge two different communities. This process is repeated until the network is disconnected into the desired number of communities. The aforementioned methods enumerate all the possible cliques or shortest paths within a network, whose computational cost is daunting for large-scale networks.

Recently, a simple scheme proposed to detect overlapping communities is to construct a line graph and then apply graph partition algorithms [16], [17]. However, the construction of the line graph alone, as we discussed, is prohibitive for a network of a reasonable size. In order to detect overlapping communities, scalable approaches have to be developed.

In this work, the k-means clustering algorithm is used to partition the edges of a network into disjoint sets. We also propose a k-means variant that takes advantage of the special sparsity structure of edge instances, which can handle the clustering of millions of edges efficiently. More complicated data structures such as kd-trees [35], [36] can be exploited to accelerate the process. If a network is too huge to reside in memory, other k-means variants can be considered to handle extremely large data sets, like online k-means [37], scalable k-means [38], and distributed k-means [39].

    8 CONCLUSIONS AND FUTURE WORK

It is well known that actors in a network demonstrate correlated behaviors. In this work, we aim to predict the outcome of collective behavior given a social network and the behavioral information of some actors. In particular, we explore scalable learning of collective behavior when millions of actors are involved in the network. Our approach follows a social-dimension-based learning framework. Social dimensions are extracted to represent the potential affiliations of actors before discriminative learning occurs. As existing approaches to extracting social dimensions suffer from scalability, it is imperative to address the scalability issue. We propose an edge-centric clustering scheme to extract social dimensions and a scalable k-means variant to handle edge clustering. Essentially, each edge is treated as one data instance, and the connected nodes are the corresponding features. Then, the proposed k-means clustering algorithm can be applied to partition the edges into disjoint sets, with each set representing one possible affiliation. With this edge-centric view, we show that the extracted social dimensions are guaranteed to be sparse. This model, based on the sparse social dimensions, shows prediction performance comparable to earlier social dimension approaches. A distinctive advantage of our model is that it easily scales to handle networks with millions of actors while the earlier models fail. This scalable approach offers a viable solution to effective learning of online collective behavior on a large scale.

In social media, multiple modes of actors can be involved in the same network, resulting in a multimode network [40]. For instance, in YouTube, users, videos, tags, and comments are intertwined with each other in coexistence. Extending the edge-centric clustering scheme to address this object heterogeneity can be a promising future direction. Since the proposed EdgeCluster model is sensitive to the number of social dimensions, as shown in the experiments, further research is needed to determine a suitable dimensionality automatically. It is also interesting to mine other behavioral features (e.g., user activities and temporal-spatial information) from social media and integrate them with social networking information to improve prediction performance.

    ACKNOWLEDGMENTS

This research is, in part, sponsored by the Air Force Office of Scientific Research Grant FA9550-08-1-0132.

REFERENCES

[1] L. Tang and H. Liu, Toward Predicting Collective Behavior via Social Dimension Extraction, IEEE Intelligent Systems, vol. 25, no. 4, pp. 19-25, July/Aug. 2010.

[2] L. Tang and H. Liu, Relational Learning via Latent Social Dimensions, KDD '09: Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 817-826, 2009.

[3] M. Newman, Finding Community Structure in Networks Using the Eigenvectors of Matrices, Physical Rev. E (Statistical, Nonlinear, and Soft Matter Physics), vol. 74, no. 3, p. 036104, http://dx.doi.org/10.1103/PhysRevE.74.036104, 2006.

[4] L. Tang and H. Liu, Scalable Learning of Collective Behavior Based on Sparse Social Dimensions, CIKM '09: Proc. 18th ACM Conf. Information and Knowledge Management, pp. 1107-1116, 2009.

[5] P. Singla and M. Richardson, Yes, There Is a Correlation: From Social Networks to Personal Behavior on the Web, WWW '08: Proc. 17th Int'l Conf. World Wide Web, pp. 655-664, 2008.

[6] M. McPherson, L. Smith-Lovin, and J.M. Cook, Birds of a Feather: Homophily in Social Networks, Ann. Rev. of Sociology, vol. 27, pp. 415-444, 2001.

[7] A.T. Fiore and J.S. Donath, Homophily in Online Dating: When Do You Like Someone Like Yourself?, CHI '05: Proc. CHI '05 Extended Abstracts on Human Factors in Computing Systems, pp. 1371-1374, 2005.

[8] H.W. Lauw, J.C. Shafer, R. Agrawal, and A. Ntoulas, Homophily in the Digital World: A LiveJournal Case Study, IEEE Internet Computing, vol. 14, no. 2, pp. 15-23, Mar./Apr. 2010.

[9] S.A. Macskassy and F. Provost, Classification in Networked Data: A Toolkit and a Univariate Case Study, J. Machine Learning Research, vol. 8, pp. 935-983, 2007.

[10] X. Zhu, Semi-Supervised Learning Literature Survey, technical report, http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey_12_9_2006.pdf, 2006.

[11] Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, eds. The MIT Press, 2007.

[12] X. Zhu, Z. Ghahramani, and J. Lafferty, Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions, Proc. Int'l Conf. Machine Learning (ICML), 2003.

[13] S. White and P. Smyth, A Spectral Clustering Approach to Finding Communities in Graphs, Proc. SIAM Data Mining Conf. (SDM), 2005.

[14] M. Newman, Power Laws, Pareto Distributions and Zipf's Law, Contemporary Physics, vol. 46, no. 5, pp. 323-352, 2005.

[15] F. Harary and R. Norman, Some Properties of Line Digraphs, Rendiconti del Circolo Matematico di Palermo, vol. 9, no. 2, pp. 161-168, 1960.

[16] T. Evans and R. Lambiotte, Line Graphs, Link Partitions, and Overlapping Communities, Physical Rev. E, vol. 80, no. 1, p. 16105, 2009.

[17] Y.-Y. Ahn, J.P. Bagrow, and S. Lehmann, Link Communities Reveal Multi-Scale Complexity in Networks, http://www.citebase.org/abstract?id=oai:arXiv.org:0903.3178, 2009.

[18] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, LIBLINEAR: A Library for Large Linear Classification, J. Machine Learning Research, vol. 9, pp. 1871-1874, 2008.

[19] J. Hopcroft and R. Tarjan, Algorithm 447: Efficient Algorithms for Graph Manipulation, Comm. ACM, vol. 16, no. 6, pp. 372-378, 1973.

[20] J. Neville and D. Jensen, Leveraging Relational Autocorrelation with Latent Group Models, MRDM '05: Proc. Fourth Int'l Workshop Multi-Relational Mining, pp. 49-55, 2005.

[21] R.-E. Fan and C.-J. Lin, A Study on Threshold Selection for Multi-Label Classification, technical report, 2007.

[22] L. Tang, S. Rajan, and V.K. Narayanan, Large Scale Multi-Label Classification via Metalabeler, WWW '09: Proc. 18th Int'l Conf. World Wide Web, pp. 211-220, 2009.

[23] Y. Liu, R. Jin, and L. Yang, Semi-Supervised Multi-Label Learning by Constrained Non-Negative Matrix Factorization, Proc. Nat'l Conf. Artificial Intelligence (AAAI), 2006.

[24] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[25] S.A. Macskassy and F. Provost, A Simple Relational Classifier, Proc. Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2003.

[26] Z. Xu, V. Tresp, S. Yu, and K. Yu, Nonparametric Relational Learning for Social Network Analysis, KDD '08: Proc. Workshop Social Network Mining and Analysis, 2008.

[27] U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.

[28] K. Yu, S. Yu, and V. Tresp, Soft Clustering on Graphs, Proc. Advances in Neural Information Processing Systems (NIPS), 2005.

[29] E. Airoldi, D. Blei, S. Fienberg, and E.P. Xing, Mixed Membership Stochastic Blockmodels, J. Machine Learning Research, vol. 9, pp. 1981-2014, 2008.

[30] S. Fortunato, Community Detection in Graphs, Physics Reports, vol. 486, nos. 3-5, pp. 75-174, 2010.

[31] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society, Nature, vol. 435, pp. 814-818, 2005.

[32] H. Shen, X. Cheng, K. Cai, and M. Hu, Detect Overlapping and Hierarchical Community Structure in Networks, Physica A: Statistical Mechanics and Its Applications, vol. 388, no. 8, pp. 1706-1712, 2009.

[33] S. Gregory, An Algorithm to Find Overlapping Community Structure in Networks, Proc. European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 91-102, http://www.cs.bris.ac.uk/Publications/pub_master.jsp?id=2000712, 2007.

[34] M. Newman and M. Girvan, Finding and Evaluating Community Structure in Networks, Physical Rev. E, vol. 69, p. 026113, http://www.citebase.org/abstract?id=oai:arXiv.org:cond-mat/0308217, 2004.


[35] J. Bentley, Multidimensional Binary Search Trees Used for Associative Searching, Comm. ACM, vol. 18, pp. 509-517, 1975.

[36] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y. Wu, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002.

[37] M. Sato and S. Ishii, On-Line EM Algorithm for the Normalized Gaussian Network, Neural Computation, vol. 12, pp. 407-432, 2000.

[38] P. Bradley, U. Fayyad, and C. Reina, Scaling Clustering Algorithms to Large Databases, Proc. ACM Knowledge Discovery and Data Mining (KDD) Conf., 1998.

[39] R. Jin, A. Goswami, and G. Agrawal, Fast and Exact Out-of-Core and Distributed K-Means Clustering, Knowledge and Information Systems, vol. 10, no. 1, pp. 17-40, 2006.

[40] L. Tang, H. Liu, J. Zhang, and Z. Nazeri, Community Evolution in Dynamic Multi-Mode Networks, KDD '08: Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 677-685, 2008.

[41] Encyclopaedia of Mathematics, M. Hazewinkel, ed. Springer, 2001.

Lei Tang received the BS degree from Fudan University, China, and the PhD degree in computer science and engineering from Arizona State University. He is a scientist at Yahoo! Labs. His research interests include social computing, data mining, and computational advertising. His book Community Detection and Mining in Social Media was published by Morgan and Claypool Publishers in 2010. He is a professional member of the AAAI, the ACM, and the IEEE.

Xufei Wang received the bachelor of science degree from Zhejiang University, China, and the master's degree from Tsinghua University. He is currently working toward the PhD degree in computer science and engineering at Arizona State University. His research interests are in social computing and data mining. Specifically, he is interested in collective wisdom, familiar strangers, tag networks, crowdsourcing, and data crawling. He is a student member of the IEEE.

Huan Liu received the bachelor of engineering degree from Shanghai Jiao Tong University and the PhD degree from the University of Southern California. He is a professor of computer science and engineering at Arizona State University (ASU). He has worked at Telecom Research Labs in Australia, and taught at the National University of Singapore before joining ASU in 2000. He has been recognized for excellence in teaching and research in computer science and engineering at ASU. His research interests are in data/web mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in many real-world applications with high-dimensional data of disparate forms such as social media, modeling group interaction, text categorization, bioinformatics, and text/web mining. His professional memberships include AAAI, ACM, ASEE, SIAM, and IEEE. He is a fellow of the IEEE. He can be contacted via http://www.public.asu.edu/~huanliu.


