DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or...

30
28 DT14 Abstracts IP1 Tensor Factorizations for Multi-way Data: Past, Present, and Future In the past decade, tensor factorizations have exploded onto the data mining and graph analysis scene. In this talk, I’ll provide some historical context about the use of multi- way analysis, survey some recent advances in the field, and close with some challenging questions for moving forward. I’ll focus mainly on factorizations that result in sums of rank-1 components, known variously as canonical decom- position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa- tions. Assumptions about the data and the model have a major impact on the effectiveness of the approach, espe- cially in the case of sparse data. I’ll talk the importance of understanding the underlying data distribution, incorpo- rating constraints for the tensor models (symmetries, non- negativity, sparsity, etc.), ways to validate algorithms for fitting these models, and challenges for the future. Tamara G. Kolda Sandia National Laboratories [email protected] IP2 The Battle for the Future of Data Mining Deep learning has catapulted to the front page of the New York Times, formed the core of the so-called Google brain, and achieved impressive results in vision, speech recogni- tion, and elsewhere. Yet researchers have offered simple conundrums that deep learning doesnt address. For exam- ple: The large ball crashed right through the table because it was made of styrofoam. What was made of styrofoam? - the large ball - the table The answer is obviously the table, but if we change the word Styrofoam to steel, the answer is clearly the large ball. To automatically answer this type of question, our computers require an extensive body of common-sense knowledge. We believe that text mining can provide the requisite body of knowledge. My talk will describe work at the new Allen Institute for AI to- wards building the next-generation of text-mining systems. Oren Etzioni Allen Institute for Artificial Intelligence, USA [email protected] IP3 Beyond Big Data: Combining Cognitive Comput- ing and Analytics In 2011 a computer named Watson demonstrated super human question answering ability by defeating two Jeop- ardy! grand champions. This marked the beginning of a new era of computing, the cognitive computing era. Wat- son was built upon a strong foundation in natural language understanding and machine learning, and excels at ques- tion answering. In parallel, significant advanced have been made in data analytics, both in the scale of data that can be analyzed, and in the methods that can be applied to ex- tract signal from data. We have an opportunity to combine the technologies to deal with large volumes of both struc- tured and unstructured data, to answer questions using both written facts and signals extracted from data. This talk will explore examples and identify research opportu- nities. Brenda L. Dietrich IBM Research [email protected] IP4 Why Most Published Studies Are Wrong But Can Be Fixed Observational healthcare data, such as administrative claims and electronic health records, play an increas- ingly prominent role in healthcare. Pharmacoepidemio- logic studies in particular routinely estimate temporal as- sociations between medical product exposure and subse- quent health outcomes of interest and such studies influ- ence prescribing patterns and healthcare policy more gen- erally. Some authors have questioned the reliability and accuracy of such studies, but few previous efforts have at- tempted to measure their performance. The Observational Medical Outcomes Partnership (OMOP, http://omop.org) has conducted a series of experiments to empirically mea- sure the performance of various observational study designs with regard to predictive accuracy for discriminating be- tween true drug effects and negative controls. In this talk, I describe the past work of the Partnership, explore oppor- tunities to expand the use of observational data to further our understanding of medical products, and highlight areas for future research and development. David Madigan Columbia University [email protected] CP1 Class Augmented Active Learning Traditional active learning encounters a cold start issue when few labelled examples are present for learning a de- cent initial classifier. Its poor quality subsequently affects selection of the next query and stability of the iterative learning, resulting in more human annotation effort. This paper presents a novel class augmentation technique, which enhances each class’s representation which initially consists of only limited set of labelled examples. Our augmentation employs a connectivity-based influence computation algo- rithm with a decaying mechanism for the unlabelled sam- ples. Besides augmentation, our method also introduces structure preserving oversampling to correct class imbal- ance. Extensive experiments on ten publicly available data sets demonstrate the effectiveness of our proposed method over existing state-of-the-art methods. Moreover, our pro- posed modules perform at the fundamental data level with- out any requirement to modify the standard machine learn- ing tools. Hong Cao , Chunyu Bao Institute for Infocomm Research, A*STAR [email protected], [email protected] Xiaoli Li Institute for Infocomm Research [email protected] Yew-Kwong Wong EADS Innovation Works South Asia

Transcript of DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or...

Page 1: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

28 DT14 Abstracts

IP1

Tensor Factorizations for Multi-way Data: Past,Present, and Future

In the past decade, tensor factorizations have explodedonto the data mining and graph analysis scene. In this talk,I’ll provide some historical context about the use of multi-way analysis, survey some recent advances in the field, andclose with some challenging questions for moving forward.I’ll focus mainly on factorizations that result in sums ofrank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data orgraph task, we are seeking to model real-world observa-tions. Assumptions about the data and the model have amajor impact on the effectiveness of the approach, espe-cially in the case of sparse data. I’ll talk the importance ofunderstanding the underlying data distribution, incorpo-rating constraints for the tensor models (symmetries, non-negativity, sparsity, etc.), ways to validate algorithms forfitting these models, and challenges for the future.

Tamara G. KoldaSandia National [email protected]

IP2

The Battle for the Future of Data Mining

Deep learning has catapulted to the front page of the NewYork Times, formed the core of the so-called Google brain,and achieved impressive results in vision, speech recogni-tion, and elsewhere. Yet researchers have offered simpleconundrums that deep learning doesnt address. For exam-ple: The large ball crashed right through the table becauseit was made of styrofoam. What was made of styrofoam?- the large ball - the table The answer is obviously thetable, but if we change the word Styrofoam to steel, theanswer is clearly the large ball. To automatically answerthis type of question, our computers require an extensivebody of common-sense knowledge. We believe that textmining can provide the requisite body of knowledge. Mytalk will describe work at the new Allen Institute for AI to-wards building the next-generation of text-mining systems.

Oren EtzioniAllen Institute for Artificial Intelligence, [email protected]

IP3

Beyond Big Data: Combining Cognitive Comput-ing and Analytics

In 2011 a computer named Watson demonstrated superhuman question answering ability by defeating two Jeop-ardy! grand champions. This marked the beginning of anew era of computing, the cognitive computing era. Wat-son was built upon a strong foundation in natural languageunderstanding and machine learning, and excels at ques-tion answering. In parallel, significant advanced have beenmade in data analytics, both in the scale of data that canbe analyzed, and in the methods that can be applied to ex-tract signal from data. We have an opportunity to combinethe technologies to deal with large volumes of both struc-tured and unstructured data, to answer questions usingboth written facts and signals extracted from data. Thistalk will explore examples and identify research opportu-

nities.

Brenda L. DietrichIBM [email protected]

IP4

Why Most Published Studies Are Wrong But CanBe Fixed

Observational healthcare data, such as administrativeclaims and electronic health records, play an increas-ingly prominent role in healthcare. Pharmacoepidemio-logic studies in particular routinely estimate temporal as-sociations between medical product exposure and subse-quent health outcomes of interest and such studies influ-ence prescribing patterns and healthcare policy more gen-erally. Some authors have questioned the reliability andaccuracy of such studies, but few previous efforts have at-tempted to measure their performance. The ObservationalMedical Outcomes Partnership (OMOP, http://omop.org)has conducted a series of experiments to empirically mea-sure the performance of various observational study designswith regard to predictive accuracy for discriminating be-tween true drug effects and negative controls. In this talk,I describe the past work of the Partnership, explore oppor-tunities to expand the use of observational data to furtherour understanding of medical products, and highlight areasfor future research and development.

David MadiganColumbia [email protected]

CP1

Class Augmented Active Learning

Traditional active learning encounters a cold start issuewhen few labelled examples are present for learning a de-cent initial classifier. Its poor quality subsequently affectsselection of the next query and stability of the iterativelearning, resulting in more human annotation effort. Thispaper presents a novel class augmentation technique, whichenhances each class’s representation which initially consistsof only limited set of labelled examples. Our augmentationemploys a connectivity-based influence computation algo-rithm with a decaying mechanism for the unlabelled sam-ples. Besides augmentation, our method also introducesstructure preserving oversampling to correct class imbal-ance. Extensive experiments on ten publicly available datasets demonstrate the effectiveness of our proposed methodover existing state-of-the-art methods. Moreover, our pro-posed modules perform at the fundamental data level with-out any requirement to modify the standard machine learn-ing tools.

Hong Cao, Chunyu Bao

Institute for Infocomm Research, A*[email protected], [email protected]

Xiaoli LiInstitute for Infocomm [email protected]

Yew-Kwong WongEADS Innovation Works South Asia

Page 2: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 29

[email protected]

CP1

Batch Mode Active Learning with Hierarchical-Structured Embedded Variance

We consider active learning when the categories are repre-sented as a tree. Recent work has improved the traditionaltechniques by moving beyond “flat” structure through in-corporation of the label hierarchy into the uncertainty mea-sure. However, these methods have two major limitationswhen used. First, these methods roughly use the informa-tion in the label structure but do not take into accountthe training samples. Second, none of these methods canwork in a batch mode to reduce the computational time oftraining. We propose a batch mode active learning schemethat exploits both the hierarchical structure of the labelsand the characteristics of the training data. In addition,the selection criterion is designed to construct batches andincorporate a diversity measure. Experimental results in-dicate that our technique achieves a notable improvementin performance over the state-of-the-art approaches.

Yu ChengNorthwestern UniversityIBM T.J. Watson Research [email protected]

ZhengZhang ChenNorthwestern [email protected]

Hongliang Fei, Fei WangIBM T.J. Watson Research [email protected], [email protected]

Alok ChoudharyDept. of Electrical Engineering and Computer ScienceNorthwestern University, Evanston, [email protected]

CP1

Selective Sampling on Probabilistic Data

In the literature of supervised learning, most existing stud-ies assume that the labels provided by the labelers aredeterministic, which may introduce noise easily in manyreal-world applications. In many applications like crowd-sourcing, however, many labelers may simultaneously labelthe same group of instances and thus the label of each in-stance is associated with a probability. Motivated by thisobservation, we propose a new framework where each labelis enriched with a probability. In this paper, we study aninteractive sampling strategy, namely, selective sampling,in which each selected instance is labeled with a probabil-ity. Specifically, we flip a coin every time when we read anew instance and decide whether it should be labeled ac-cording to the flipping result. We prove that in our settingthe label complexity can be reduced dramatically. Finally,we conducted comprehensive experiments in order to verifythe effectiveness of our proposed labeling framework.

Peng Peng, RaymondChi-Wing WongDepartment of Computer Science and EngineeringHKUST

[email protected], [email protected]

CP1

Learning from Multi-User Multi-Attribute Anno-tations

There are cases in numerous annotation tasks wherein de-spite of the presence of multiple users, each user shouldclassify or rate multiple attributes for each sample. Thissituation is referred to as multi-user multi-attribute anno-tations in this paper. This work deals with the learningproblem under multi-user multi-attribute annotations. Agenerative model is introduced to describe the human la-beling for multi-user multi-attribute annotations. A maxi-mum likelihood approach is leveraged to infer the parame-ters in the generative model, namely, ground-truth labels,user expertise, and annotation difficulties. The classifiersfor each attribute are learned simultaneously. Further-more, the correlations among attributes are taken into ac-count during inference and learning using conditional ran-dom field. The experimental results reveal that our ap-proach can obtain better estimation of the ground truthlabels, user experts, annotation difficulties as well as at-tribute classifiers.

Ou Wu, Shuxiao LiNLPR, Institute of Automation, Chinese Academy [email protected], [email protected]

Honghui DongBeijing Jiaotong [email protected]

Ying Chen, Weiming HuNLPR, Institute of Automation, Chinese Academy [email protected], [email protected]

CP1

Disambiguation-Free Partial Label Learning

Partial label learning deals with the problem where eachtraining example is associated with a set of candidate la-bels, among which only one is correct. The common strat-egy is to try to disambiguate their candidate labels, suchas by identifying the ground-truth label iteratively or bytreating each candidate label equally. Nevertheless, theabove disambiguation strategy is prone to be misled bythe false positive label(s) within candidate label set. Inthis paper, a new disambiguation-free approach to partiallabel learning is proposed by employing the well-knownerror-correcting output codes (ECOC) techniques. Specif-ically, to build the binary classifier with respect to eachcolumn coding, any partially labeled example will be re-garded as a positive or negative training example only if itscandidate label set entirely falls into the coding dichotomy.Experiments on controlled and real-world data sets clearlyvalidate the effectiveness of the proposed approach.

Min-Ling ZhangSoutheast [email protected]

CP2

Influence Maximization with Viral Product Design

Product design and viral marketing are two popular con-

Page 3: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

30 DT14 Abstracts

cepts in the marketing literature that aim at the same goal:maximizing the adoption of a new product. While the ef-fect of the social network is nowadays kept in great con-sideration in any marketing-related activity, the interplaybetween product design and social influence is surprisinglystill largely unexplored. In this paper we move a first stepin this direction and study the problem of designing thefeatures of a novel product such that its adoption, fueledby peer influence, is maximized. Our experimental eval-uation on real-world data from the domain of social mu-sic consumption and social movie consumption confirmsthe effectiveness of the proposed framework in integratingproduct design in viral marketing.

Nicola BarbieriYahoo [email protected]

Francesco BonchiYahoo! [email protected]

CP2

Local Learning for Mining Outlier Subgraphs fromNetwork Datasets

Various systems can be modeled using entity-relationshipgraphs. Given such a graph, one may be interested in iden-tifying suspicious or anomalous subgraphs. A user maywant to identify suspicious subgraphs matching a querytemplate. A subgraph can be defined as anomalous basedon the connectivity structure within itself as well as withits neighborhood. Existence of low-probability links andabsence of high-probability links can be a good indicatorof subgraph outlierness. Probability of an edge can in turnbe modeled based on the weighted similarity between theattribute values of the nodes linked by the edge. We claimthat the attribute weights must be learned locally for ac-curate link existence probability computations. We designa system that finds subgraph outliers given a graph anda query by modeling the problem as a linear optimization.Experimental results on several synthetic and real datasetsshow the effectiveness of the proposed approach in comput-ing interesting outliers.

Manish Gupta, Arun Mallya, Subhro Roy, Jason ChoUniv of Illinois at [email protected], [email protected],[email protected], [email protected]

Jiawei [email protected]

CP2

A Probabilistic Approach to Uncovering At-tributed Graph Anomalies

We introduce a probabilistic model to identify anomaloussubgraphs containing a significantly different percentageof a certain vertex attribute. Our framework, gAnomaly,models generative processes of vertex attributes and di-vides the graph into regions that are governed by such pro-cesses. Two types of regularizers are employed to smoothenthe regions and facilitate vertex assignment. An iterativeprocedure is further proposed to find fine-grained anoma-lies. Experiments show gAnomaly outperforms a state-of-

the-art algorithm at uncovering anomalous subgraphs.

Nan Li, Huan SunComputer Science DepartmentUniversity of California, Santa [email protected], [email protected]

Kyle ChipmanNeural Science Research InstituteUniversity of California, Santa [email protected]

Jemin GeorgeU.S. Army Research [email protected]

Xifeng YanDepartment of Computer ScienceUniversity of California at Santa [email protected]

CP2

A Generative Model with Network Regularizationfor Semi-Supervised Collective Classification

In recent years much effort has been devoted to CollectiveClassification (CC) techniques to predict labels of linked in-stances given a large number of labeled data. However, inmany real-world applications labeled data are limited andvery expensive to obtain. Recently, Semi-Supervised Col-lective Classification (SSCC) has been examined to lever-age unlabeled data for enhancing the classification per-formance of CC. In this paper we propose a probabilis-tic generative model with network regularization (GMNR)for SSCC. Our main idea is to compute label probabilitydistributions for unlabeled instances by maximizing log-likelihood in the generative model and the label smoothnesson the network topology of data. We develop an effectiveEM algorithm to compute label probability distributionsfor label prediction. Experimental results on three realsparsely-labeled network datasets show that GMNR out-performs state-of-the-art CC algorithms and other SSCCalgorithms at 0.10 significance level.

Ruichao ShiShenzhen Key Laboratory of Internet InformationCollaboratioShenzhen Graduate School, Harbin Institute [email protected]

Qingyao WuSchool of Computer EngineeringNanyang Technological [email protected]

Yunming YeHarbin Institute of Technology, [email protected]

Shen-Shyang HoSchool of Computer EngineeringNanyang Technological [email protected]

CP2

Dava: Distributing Vaccines over Networks under

Page 4: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 31

Prior Information

In this paper, we study how to immunize healthy nodes, inpresence of already infected nodes. Efficient algorithms forsuch a problem can help public-health experts make moreinformed choices. First we formulate the Data-Aware Vac-cination problem, and prove it is NP-hard and also that itis hard to approximate. Secondly, we propose two effectivepolynomial-time heuristics DAVA and DAVA-fast. Finally,we also demonstrate the scalability and effectiveness of ouralgorithms through extensive experiments on multiple realnetworks including epidemiology datasets, which show sub-stantial gains of up to 10 times more healthy nodes at theend.

Yao ZhangVirginia [email protected]

B. Aditya PrakashCS, [email protected]

CP3

FlexiFaCT: Scalable Flexible Factorization of Cou-pled Tensors on Hadoop

Given multiple datasets of relational data, how can we ef-ficiently decompose our data into latent factors? Factor-ization of a single matrix or tensor has attracted muchattention, e.g. in the Netflix challenge, with users ratingmovies. However, we often have additional side informa-tion, e.g. demographic data. Incorporating this informa-tion leads to the coupled factorization problem. So far,it has been solved for small datasets. We provide a dis-tributed, scalable method for matrix, tensor, and coupledfactorization through stochastic gradient descent. We offerthe following contributions: (1) Versatility: Our algorithmcan perform factorizations with flexible objective functions,e.g. the Frobenius norm, sparse factorization, and non-negative factorization. (2) Scalability: FlexiFaCT scalesto unprecedented sizes in both the data and model, withup to billions of parameters, and runs on standard Hadoop.(3) Convergence proofs showing that FlexiFaCT converges,even with projections.

Alex Beutel, Abhimanu KumarCarnegie Mellon [email protected], [email protected]

Evangelos [email protected]

Partha Talukdar, Christos FaloutsosCarnegie Mellon [email protected], [email protected]

Eric XingSchool of Computer [email protected]

CP3

Context-Preserving Hashing for Fast Text Classifi-cation

There have been a number of approximate algorithms fortext similarity computation, such as min-wise hashing, ran-

dom projection, and feature hashing, which are based onthe bag-of-words representation. A limitation of their “flat-set’ representation is that context information and seman-tic hierarchy cannot be preserved. In this paper, we aim tofast compute similarities between texts while also preserv-ing context information. To take into account semantichierarchy, we consider a notion of “multi-level exchange-ability’ which can be applied at word-level, sentence-level,paragraph-level, etc. We employ a nested-set to representa multi-level exchangeable object. To fingerprint nested-sets for fast comparison, we propose a Recursive Min-wiseHashing (RMH) algorithm at the same computational costof the standard min-wise hashing algorithm. The empiricalstudies show that the proposed context-preserving hashingmethod can significantly outperform min-wise hashing.

Bin Li, Lianhua ChiUniversity of Technology, [email protected], [email protected]

Xingquan ZhuFlorida Atlantic [email protected]

CP3

Dusk: A Dual Structure-Preserving Kernel forSupervised Tensor Learning with Applications toNeuroimages

We introduce a new scheme to design structure-preservingkernels for supervised tensor learning. Specifically, wedemonstrate how to leverage the naturally available struc-ture within the tensorial representation to encode priorknowledge in the kernel. Our approach is an extensionof the conventional kernels in the vector space to tensorspace. We applied our novel kernel in conjunction withSVM to real-world tensor classification problems includingbrain fMRI classification for three different diseases.

Lifang HeSouth China University of [email protected]

Xiangnan Kong, Philip YuUniversity of Illinois at [email protected], [email protected]

Ann RaginNorthwestern [email protected]

Zhifeng HaoFaculty of Computer, Guangdong University [email protected]

Xiaowei YangSouth China University of Technology, [email protected]

CP3

Vog: Summarizing and Understanding LargeGraphs

How can we succinctly describe a million-node graph with afew simple sentences? How can we measure the importanceof a set of discovered subgraphs in a large graph? Theseare exactly the problems we focus on. Our main ideas are

Page 5: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

32 DT14 Abstracts

to construct a ’vocabulary’ of subgraph-types that oftenoccur in real graphs (e.g., stars, cliques, chains), and froma set of subgraphs, find the most succinct description ofa graph in terms of this vocabulary. We measure successin a well-founded way by means of the Minimum Descrip-tion Length (MDL) principle: a subgraph is included inthe summary if it decreases the total description length ofthe graph. Our contributions are three-fold: (a) formula-tion: we provide a principled encoding scheme to choosevocabulary subgraphs; (b) algorithm: we develop VOG,an efficient method to minimize the description cost, and(c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the NotreDame web graph.

Danai KoutraCarnegie Mellon [email protected]

U [email protected]

Jilles VreekenMax Planck Institute for InformaticsSaarland [email protected]

Christos FaloutsosCarnegie Mellon [email protected]

CP3

Turbo-Smt: Accelerating Coupled Sparse Matrix-Tensor Factorizations by 200x

How can we correlate the neural activity in the humanbrain as it responds to typed words, with properties ofthese terms (like edible, fits in hand)? This is one ofmany settings of the Coupled Matrix- Tensor Factoriza-tion (CMTF) problem. We introduce Turbo-SMT, a meta-method capable of boosting the performance of any CMTFalgorithm, by up to 200x, along with an up to 65 fold in-crease in sparsity, with comparable accuracy to the base-line. We apply Turbo-SMT to BrainQ, a dataset consistingof a (nouns, brain voxels, human subjects) tensor and a(nouns, properties) matrix, with coupling along the nounsdimension. Turbo-SMT is able to find meaningful latentvariables, as well as to predict brain activity with compet-itive accuracy.

Evangelos [email protected]

Tom MitchellCarnegie Mellon [email protected]

Nicholas SidiropoulosUniversity of [email protected]

Christos FaloutsosCarnegie Mellon [email protected]

Partha TalukdarCMU

[email protected]

Brian M. MurphyQueens University of Belfast

CP4

Unveiling Variables in Systems of Linear Equations

Motivated by emerging applications including vehiculartraffic networks and crowdsourcing systems, we introducethe UNVEIL problem defined as follows: given a system oflinear equations with less equations than unknowns iden-tify a set of k variables to query their values so that thenumber of variables that can be uniquely deduced in thenew system is maximized. We will discuss the complexityof this problem and algorithms for solving it. Finally, wewill present experimental evidence of the utility of thesealgorithms in real datasets.

Behzad Golshan, Evimaria TerziBoston [email protected], [email protected]

CP4

Transductive Hsic Lasso

Sparse regression methods such as Lasso, are com-monly used for analysis of high-dimensional, small-sampledatasets, due to their good generalization and feature-selection properties. In this work, we develop a novelmethod, called Transductive HSIC Lasso, that incorporatestransduction into a nonlinear sparse regression approachknown as HSIC Lasso. Our method exploits the structureof the HSIC Lasso, which maximizes the relevance betweenthe selected features and the label, while minimizing theredundancy between the selected features; transduction isachieved by including unlabeled samples into the redun-dancy computation. Our experiments demonstrate advan-tages of the proposed method over the state-of-the-art ap-proaches, both on simulated and real-life data.

Dan He, Irina Rish, Laxmi ParidaIBM T.J. Watson [email protected], [email protected], [email protected]

CP4

How Can I Index My Thousands of Photos Effec-tively and Automatically? An Unsupervised Fea-ture Selection Approach

It is not easy for a person to organize a large collection ofphotos. To automatically index a large photo collection,we first propose a simple strategy to extract meaningfulsemantics from original images as candidate dimensions,and then propose an efficient unsupervised feature selec-tion method to select a dimension subset that is small andsufficient to index every image uniquely. Our empiricalstudy demonstrates both the efficiency and the effective-ness of our method.

Juhua HuSimon Fraser [email protected]

Jian PeiSchool of Computing ScienceSimon Fraser [email protected]

Page 6: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 33

Jie TangTsinghua [email protected]

CP4

Robust Subspace Discovery Through SupervisedLow-Rank Constraints

In this paper, we propose a Supervised Regularizationbased Robust Subspace (SRRS) approach via low-ranklearning. Unlike existing subspace methods, our approachjointly learns low-rank representations and a robust sub-space from noisy observations. To improve the classifica-tion performance, class label information is incorporatedas supervised regularization. The problem can be formu-lated as a constrained rank minimization objective func-tion. Experimental results on four datasets demonstratethat our approach outperforms the state-of-the-art sub-space and low-rank learning methods.

Sheng Li, Yun FuNortheastern [email protected], [email protected]

CP4

Adaptive Quantization for Hashing: AnInformation-Based Approach to Learning Bi-nary Codes

Large-scale data mining and retrieval applications have in-creasingly turned to compact binary data representationsas a way to achieve both fast queries and efficient datastorage. Most of existing algorithms focus on learning aset of projection hyperplanes for the data and simply bi-narizing the result from each hyperplane, but this neglectsthe fact that informativeness may not be uniformly dis-tributed across the projections. In this paper, we addressthis issue by proposing a novel adaptive quantization (AQ)strategy that adaptively assigns varying numbers of bits todifferent hyperplanes based on their information content.Our method provides an information-based schema thatpreserves the neighborhood structure of data points, andwe jointly find the globally optimal bit-allocation for allhyperplanes.

Caiming Xiong, Wei Chen, Gang Chen, David Johnson,Jason CorsoSUNY [email protected], [email protected],[email protected], [email protected],[email protected]

CP5

Active Multitask Learning Using Both Latent andSupervised Shared Topics

Multitask learning (MTL) via a shared representation hasbeen adopted to alleviate problems with sparsity of labeleddata across different learning tasks. Active learning, onthe other hand, reduces the cost of labeling examples bymaking informative queries over an unlabeled pool of data.A unification of both of these approaches can potentiallybe useful in settings where labeled information is expen-sive to obtain but the learning tasks have some commoncharacteristics. This paper introduces two such models –Active Doubly Supervised Latent Dirichlet Allocation andits non-parametric variation that integrate MTL and activelearning in the same framework. These models make use

of both latent and supervised shared topics to accomplishmultitask learning. Experimental results on both docu-ment and image classification show that integrating MTLand active learning along with shared latent and supervisedtopics is superior to other methods which do not employall of these components.

Ayan AcharyaUniversity of Texas at AustinDepartment of [email protected]

CP5

Linking Heterogeneous Input Spaces with Pivotsfor Multi-Task Learning

Most existing works on multi-task learning (MTL) assumethe same input space for different tasks. In this paper, weaddress a general setting where different tasks have hetero-geneous input spaces. Our key observation is that in manyreal applications, there might exist some correspondenceamong the inputs of different tasks, which is referred toas pivots. For such applications, we first propose a learn-ing scheme for multiple tasks and analyze its generaliza-tion performance. Then we focus on the problems whereonly a limited number of the pivots are available, and pro-pose a general framework to leverage the pivot information.We further propose an effective optimization algorithm tofind both the mappings and the prediction model. Experi-mental results demonstrate its effectiveness, especially withvery limited number of pivots.

Jingrui HeIBM T.J. Watson Research [email protected]

Yan LiuUniversity of Southern [email protected]

Qiang YangHong Kong University of Science and TechnologyDepartment of Computer Science and [email protected]

CP5

Multi-Task Feature Selection on Multiple Networksvia Maximum Flows

We propose a new formulation of multi-task feature selec-tion coupled with multiple network regularizers, and showthat the problem can be exactly and efficiently solved bymaximum flow algorithms. This method contributes toone of the central topics in data mining: How to exploitstructural information in multivariate data analysis, whichhas numerous applications, such as gene regulatory andsocial network analysis. On simulated data, we show thatthe proposed method leads to higher accuracy in discov-ering causal features by solving multiple tasks simultane-ously using networks over features. Moreover, we apply themethod to multi-locus association mapping with Arabidop-sis thaliana genotypes and flowering time phenotypes, anddemonstrate its ability to recover more known phenotype-related genes than other state-of-the-art methods.

Mahito Sugiyama, Chloe-Agathe AzencottMax Planck Institutes [email protected],[email protected]

Page 7: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

34 DT14 Abstracts

Dominik GrimmMax Planck Institutes TubingenEberhard Karls Universitat [email protected]

Yoshinobu KawaharaISIR, Osaka [email protected]

Karsten BorgwardtMax Planck Institute for Intelligent SystemsEberhard Karls Universitat [email protected]

CP5

Mixed-Transfer: Transfer Learning over MixedGraphs

Heterogeneous transfer learning has been proposed as anew learning strategy to improve performance in a tar-get domain by leveraging data from other heterogeneoussource domains where feature spaces can be different acrossdifferent domains. In this paper, we propose a novel algo-rithm named Mixed- Transfer to solve heterogeneous trans-fer learning with noise co-occurrence data from the Web.It is composed of three components, that is, a cross domainharmonic function to avoid personal biases, a joint transi-tion probability graph of mixed instances and features tomodel the heterogeneous transfer learning problem, a ran-dom walk process to simulate the label propagation on thegraph and avoid the data sparsity problem.

Ben TanHong Kong University of Science and Technology,Clear Water Bay, Hong [email protected]

Erheng ZhongHong Kong University of Science and TechnologyDepartment of Computer Science and [email protected]

Michael K. NgDepartment of Mathematics, Hong Kong [email protected]

Qiang YangHong Kong University of Science and TechnologyDepartment of Computer Science and [email protected]

CP5

Multi-Graph Learning with Positive and UnlabeledBags

In this paper, we formulate a new multi-graph learningtask with only positive and unlabeled bags, where labelsare only available for bags but not for individual graphsinside the bag. This problem setting raises significant chal-lenges because bag-of-graph setting does not have featuresto directly represent graph data, and no negative bags ex-its for deriving discriminative classification models. Tosolve the challenge, we propose a puMGL learning frame-work which relies on two iteratively combined processesfor multi-graph learning: (1) deriving features to representgraphs for learning; and (2) deriving discriminative modelswith only positive and unlabeled graph bags. Experiments

and comparisons on real-world multi-graph data demon-strate the algorithm performance.

Jia WuCentre for Quantum Computation and IntelligentSystems, FEITUniversity of Technology Sydney, [email protected]

Zhibin Hong, Shirui PanUniversity of Technology Sydney, [email protected],[email protected]

Xingquan ZhuFlorida Atlantic University, [email protected]

Chengqi ZhangUniversity of Technology, [email protected]

Zhihua CaiChina University of Geosciences,Wuhan, [email protected]

CP6

Forecasting a Moving Target: Ensemble Models forIli Case Count Predictions

Modern epidemiological forecasts of common illnesses, suchas the flu, rely on both traditional surveillance sources aswell as digital surveillance data. However, most publishedstudies have been retrospective. The reports about flu arealso lagged by several weeks and revised over several weeks.We posit that effectively handling this uncertainty is one ofthe key challenges for a real-time prediction system in thissphere. In this paper, we present a detailed prospectiveanalysis on the generation of robust quantitative predic-tions about temporal trends of flu activity, using severalsurrogate data sources for 15 Latin American countries.We present our findings about the effects of correcting theuncertainty associated with official flu estimates and com-pare accuracy at model and data level fusion. In the pro-cess we present a a novel matrix factorization approachusing neighborhood embedding and compare the methodto several baseline methods.

Prithwish ChakrabortyDept. of Computer Science, Virgnia Tech, Blacksburg,VA, [email protected]

Pejman KhadiviDept. of Computer Science, Virginia Tech, Blacksburg,VA, [email protected]

Bryan LewisVirigina Bioinformatics Institute, VA [email protected]

Aravindan MahendiranDept. of Computer Science, Virginia Tech, Blacksburg,VA, [email protected]

Page 8: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 35

Jiangzhou ChenVirginia [email protected]

Patrick BulterDept. of Computer Science, Virgnia Tech, Blacksburg,VA, [email protected]

Elaine O. NsoesieVirginia TechBoston Children’s [email protected]

Sumiko MekaruChildren’s Hopsital Boston, MA, [email protected]

John BrownsteinChildren’s Hospital Boston, MA, [email protected]

Madhav MaratheVirginia Bioinformatics Institute, VA [email protected]

Naren RamakrishnanComputer ScienceVirginia [email protected]

CP6

Keeping Up with Innovation: A Predictive Frame-work for Modeling Healthcare Data with EvolvingClinical Interventions

Medical outcomes are inexorably linked to patient illnessand clinical interventions. Interventions change the courseof disease, crucially determining outcome. Traditional out-come prediction models build a single classifier by aug-menting interventions with disease information. Interven-tions, however, differentially affect prognosis, thus a sin-gle prediction rule may not suffice to capture variations.Interventions also evolve over time as more advanced in-terventions replace older ones. To this end, we propose aBayesian nonparametric, supervised framework that mod-els a set of intervention groups through a mixture distri-bution building a separate prediction rule for each group,and allows the mixture distribution to change with time.Experiments on synthetic and medical cohorts for 30-dayreadmission prediction demonstrate the superiority of theproposed model over clinical and data mining baselines.

Sunil K. Gupta, Santu Rana, Dinh Phung, SvethaVenkateshDeakin University, Geelong Waurn Ponds CampusVictoria, [email protected], [email protected],[email protected],[email protected]

CP6

Predictive Learning in the Presence of Heterogene-ity and Limited Training Data

Many real-world applications possess heterogeneity in theirdata, and further lack sufficient training data, making pre-dictive learning difficult. However, there often exists a

structure among the data instances, which can be lever-aged in predictive learning. We present an approach thatreduces the model complexity while addressing heterogene-ity, and demonstrate its application for forest cover esti-mation. We show that our framework captures meaningfulinformation about heterogeneity in data, improves predic-tion performance, and is robust to over-fitting.

Anuj KarpatneUniversity of [email protected]

Ankush KhandelwalUniversity of Minnesota- Twin [email protected]

Shyam BoriahDepartment of Computer ScienceUniversity of [email protected]

Vipin KumarUniversity of [email protected]

CP6

Turning the Tide: Curbing Deceptive Yelp Behav-iors

The popularity and influence of reviews, make sites likeYelp ideal targets for malicious behaviors. We presentMarco, a novel system that exploits the unique combina-tion of social, spatial and temporal signals gleaned fromYelp, to detect venues whose ratings are impacted by fraud-ulent reviews. Marco increases the cost and complexityof attacks, by imposing a tradeoff on fraudsters, betweentheir ability to impact venue ratings and their ability toremain undetected. We contribute a new dataset to thecommunity, which consists of both ground truth and goldstandard data. We show that Marco significantly outper-forms state-of-the-art approaches, by achieving 94% accu-racy in classifying reviews as fraudulent or genuine, and95.8% accuracy in classifying venues as deceptive or le-gitimate. Marco successfully flagged 244 deceptive venuesfrom our large dataset with 7,435 venues, 270,121 reviewsand 195,417 users.

Bogdan Carbunar, Mahmudur [email protected], [email protected]

Jaime [email protected]

George BurriJive [email protected]

Duen Horng ChauGeorgia [email protected]

CP6

Consumer Segmentation and Knowledge Extrac-tion from Smart Meter and Survey Data

Many electricity suppliers around the world are deploying

Page 9: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

36 DT14 Abstracts

smart meters to gather fine-grained spatiotemporal con-sumption data and to effectively manage the collective de-mand of their consumer base. In this paper, we introduce astructured framework and a discriminative index that canbe used to segment the consumption data along multiplecontextual dimensions such as locations, communities, sea-sons, weather patterns, holidays, etc. The generated seg-ments can enable various higher-level applications such asusage-specific tariff structures, theft detection, consumer-specific demand response programs, etc. Our framework isalso able to track consumers’ behavioral changes, evaluatedifferent temporal aggregations, and identify main charac-teristics which define a cluster.

Tri Kurniawan [email protected]

Tanuja GanuIBM Research, [email protected]

Dipanjan ChakrabortyIBM Research [email protected]

Karl [email protected]

Deva SeetharamIBM Research [email protected]

CP7

A Marginalized Denoising Method for Link Predic-tion in Relational Data

We propose a novel link prediction algorithm, where theproblem of predicting missing links is cast as a problem ofmatrix denoising. We train a mapping function by recover-ing the originally observed matrix from conceptually “infi-nite’ corrupted matrices where some links are randomlymasked from the observed matrix. By re-applying thelearned function to the observed relational matrix, we aimto “denoise’ the observed matrix and thus to recover theunobserved links.

Zheng Chen, Weixiong ZhangWashington University in St. [email protected], [email protected]

CP7

A Deep Learning Approach to Link Prediction inDynamic Networks

Time varying problems usually have complex underlyingstructures represented as dynamic networks where enti-ties and relationships appear and disappear over time. Inthis paper, we propose a novel deep learning framework,i.e., Conditional Temporal Restricted Boltzmann Machine(ctRBM), which predicts links based on individual transi-tion variance as well as influence introduced by local neigh-bors.

Xiaoyi Li, Nan Du, Hui LiState University of New York at [email protected], [email protected],[email protected]

Kang LiThe State University of New York at [email protected]

Jing GaoUniversity at [email protected]

Aidong ZhangDepartment of Computer ScienceState University of New York at [email protected]

CP7

Convex Optimization for Binary Classifier Aggre-gation in Multiclass Problems

We present a convex optimization method for an optimalaggregation of binary classifiers to estimate class member-ship probabilities in multiclass problems.

Sunho Park, Tae Hyun HwangDepartment of Clinical Sciences, UTSW Medical [email protected],[email protected]

Seungjin ChoiDepartment of Computer Science and Engineering,Pohang University of Science and [email protected]

CP7

Learning on Probabilistic Labels

Probabilistic information becomes more prevalent nowa-days and can be found easily in many applications likecrowdsourcing and pattern recognition. In this paper, wefocus on a dataset which contains probabilistic informationfor classification. Based on this probabilistic dataset, wepropose a classifier and give a theoretical bound linkingthe error rate of our classifier and the number of instancesneeded for training. Interestingly, we find that our theo-retical bound is asymptotically at least no worse than thepreviously best-known bounds developed based on the tra-ditional dataset. Furthermore, our classifier guarantees afast rate of convergence compared with traditional classi-fiers. Experimental results show that our proposed algo-rithm has a higher accuracy than the traditional algorithm.

Peng Peng, Raymond ChiWing WongDepartment of Computer Science and [email protected], [email protected]

Philip S. YuDepartment of Computer ScienceUniversity of Illinois at [email protected]

CP7

AUC Dominant Unsupervised Ensemble of BinaryClassifiers

Ensemble methods are widely used in practice with thehope of obtaining better predictive performance than couldbe obtained from any of the constituent classifiers in theensemble. Most of the existing literature is concerned withlearning ensembles in a supervised setting. In this paper

Page 10: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 37

we propose an unsupervised iterative algorithm to combinethe discriminant scores from different binary classifiers. Weprove that (under certain assumptions) the Area Underthe ROC Curve (AUC) of the resulting ensemble is greaterthan or equal to the AUC of the best classifier (with maxi-mum AUC). We also experimentally validate this claim ona number of datasets and also show that the performanceis better than the supervised ensembles.

Priyanka Agrawal, Vikas C. Raykar, Amrita SahaIBM Research, [email protected], [email protected], [email protected]

CP8

Make It Or Break It: Manipulating Robustness inLarge Networks

The function and performance of networks rely on theirrobustness, defined as their ability to continue functioningin the face of damage (targeted attacks or random fail-ures) to parts of the network. Prior research has proposeda variety of measures to quantify robustness and variousmanipulation strategies to alter it. In this paper, our con-tributions are two-fold. First, we critically analyze variousrobustness measures and identify their strengths and weak-nesses. Our analysis suggests natural connectivity, basedon the weighted count of loops in a network, to be a reliablemeasure. Second, we propose the first principled manip-ulation algorithms that directly optimize this robustnessmeasure, which lead to significant performance improve-ment over existing, ad-hoc heuristic solutions. Extensiveexperiments on real-world datasets demonstrate the effec-tiveness and scalability of our methods against a long listof competitor strategies.

Hau ChanStony Brook [email protected]

Leman AkogluStonybrook [email protected]

Hanghang TongCity Collge, [email protected]

CP8

Triangle Counting in Streamed Graphs Via SmallVertex Covers

We present a new randomized algorithm for estimatingthe number of triangles in massive graphs revealed as astream of edges in arbitrary order. It exploits the fact thatreal-world graphs often have small vertex covers, which al-lows to reduce the space and sample complexity of trianglecounting algorithms. Experiments on real-world graphsvalidate our theoretical analysis and show that our algo-rithm yields more accurate estimates than state-of-the-artapproaches using the same amount of space.

David Garcia-SorianoYahoo! Research [email protected]

Konstantin KutzkovIT University of Copenhagen, Denmark

[email protected]

CP8

Laplacian Spectral Properties of Graphs from Ran-dom Local Samples

In this paper, we study the relationship between the eigen-value spectrum of normalized Laplacian matrix and thestructure of ‘local’ subgraphs. We propose techniques toestimate the spectral properties from a random collectionof local subgraphs. Particularly, an algorithm is providedto estimate the spectral moments of the normalized Lapla-cian matrix. Moreover, we propose to compute bounds onthe spectral radius based on convex optimization. Numer-ical results on a large-scale e-mail network are illustrated.

Zhengwei Wu, Victor PreciadoUniversity of [email protected], [email protected]

CP8

It Takes Two to Tango: Exploring Social Tie Devel-opment with Both Online and Offline Interactions

Understanding social tie development is crucial for userengagement in social networking services. In this paper,we analyze the social interactions, both online and offline,and investigate the development of their social ties. In thisstudy, we aim to answer three key questions: 1) is there acorrelation between online and offline interactions? 2) howis the social tie developed via heterogeneous interactionchannels? 3) would the development of social tie betweentwo users be affected by their common friends? To achieveour goal, we develop a Social-aware Hidden Markov Model(SaHMM) that explicitly takes into account the factor ofcommon friends in measure of the social tie development.Our experiments show that the social tie development cap-tured by our SaHMM is significantly more consistent tolifetime profiles of users.

Peifeng YinPennsylvania State [email protected]

Qi [email protected]

Xingjie [email protected]

Wang-Chien LeePennsylvania State [email protected]

CP8

Adaptive User Distance Modeling in Social Media

One important challenge in social network analysis is howto model users’ distance as a single measure. We proposeto model this distance by simultaneously exploring users’profile attributes and local network structures. Due to thesparsity of data, where each user may interact with just afew people and only a few users provide their profile in-formation, it is typically difficult to learn effective distancemeasures for any individual network. One important obser-vation is that, people nowadays engage in multiple social

Page 11: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

38 DT14 Abstracts

networks, such as Facebook, Twitter, etc., where auxil-iary knowledge from related networks can help alleviatethe data sparsity problem. Nonetheless, due to the net-work differences, borrowing knowledge directly does notwork well. Instead, we propose an adaptive metric learn-ing framework. The basic idea is to exploit knowledge fromrelated networks collectively through embedding and em-ploy boosting-based techniques to eliminate irrelevant at-tributes.

Erheng ZhongHong Kong University of Science and TechnologyDepartment of Computer Science and [email protected]

Wei FanHuawei Noah’s Ark [email protected]

Qiang YangHong Kong University of Science and TechnologyDepartment of Computer Science and [email protected]

CP9

Automatic Construction and Ranking of TopicalKeyphrases on Collections of Short Documents

We introduce a framework for topical keyphrase genera-tion and ranking, based on the output of a topic model runon a collection of short documents. By shifting from theunigram-centric traditional methods of keyphrase extrac-tion and ranking to a phrase-centric approach, we are ableto directly compare and rank phrases of different lengths.Our method defines a function to rank topical keyphrasesso that more highly ranked keyphrases are considered to bemore representative phrases for that topic. We study theperformance of our framework on multiple real world doc-ument collections, and also show that it is more scalablethan comparable phrase-generating models.

Marina DanilevskyUniversity of Illinois [email protected]

Chi WangUniversity of Illinois at [email protected]

Nihit Desai, Xiang RenUniversity of Illinois [email protected], [email protected]

Jingyi GuoUniversity of Massachusetts [email protected]

Jiawei [email protected]

CP9

Recurrent Chinese Restaurant Process with aDuration-Based Discount for Event Identi?cationfrom Twitter

Event identi?cation and analysis on Twitter has become animportant task. Recently the Recurrent Chinese Restau-

rant Process (RCRP) has been successfully used for eventidenti?cation from news streams and news-centric socialmedia streams. However, these models cannot be directlyapplied to Twitter for two reasons:(1)Events emerge anddie out fast on Twitter, (2)Most Twitter posts are per-sonal interest oriented while only a small fraction is eventrelated. Motivated by these challenges, we propose a newnonparametric model which considers burstiness. We fur-ther combine this model with traditional topic models toidentify both events and topics simultaneously.

Qiming Diao, Jing JiangSingapore Management [email protected], [email protected]

CP9

A Dynamic Nonparametric Model for Characteriz-ing the Topical Communities in Social Streams

The data in social networks often come as streams, i.e.,both text content (e.g., emails, user postings) and net-work structure (e.g., user friendship) evolve over time.To capture the time-evolving latent structures in such so-cial streams, we propose a fully nonparametric DynamicTopical Community Model (nDTCM), where infinite la-tent community variables coupled with infinite latent topicvariables in each epoch, and the temporal dependencies be-tween variables across epochs are modeled via the rich-gets-richer scheme. nDTCM can effectively characterize threedynamic aspects in social streams: the number of commu-nities or topics changes (e.g., new communities or topicsare born and old ones die out); the popularity of commu-nities or topics evolves; the semantics such as communitytopic distribution, community participant distribution andtopic word distribution drift.

Ziqi LiuMOEKLINNS Lab, Department of Computer ScienceXi’an Jiaotong University, [email protected]

Qinghua ZhengDepartment of Computer ScienceXi’an Jiaotong University, [email protected]

Fei WangIBM T J Watson Research [email protected]

Zhenhua Tian, Bo LiDepartment of Computer ScienceXi’an Jiaotong University, [email protected], [email protected]

CP9

Joint Author Sentiment Topic Model

Traditional works in sentiment analysis and aspect ratingprediction do not take author preferences and writing styleinto account during rating prediction of reviews. In thiswork, we introduce Joint Author Sentiment Topic Model(JAST), a generative process of writing a review by an au-thor. Authors have different topic preferences, ‘emotional’attachment to topics, writing style based on the distribu-tion of semantic (topic) and syntactic (background) wordsand tendency to switch topics. JAST uses Latent Dirich-let Allocation to learn the distribution of author-specifictopic preferences and emotional attachment to topics. It

Page 12: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 39

uses a Hidden Markov Model to capture short range syn-tactic and long range semantic dependencies in reviews tocapture coherence in author writing style. JAST jointlydiscovers topics in a review, author preferences for topics,topic ratings as well as overall review rating from the pointof view of an author.

Subhabrata MukherjeeMax-Planck-Institut fur [email protected]

Gaurab Basu, Sachindra JoshiIBM Research [email protected], [email protected]

CP9

A Constrained Hidden Markov Model Approachfor Non-Explicit Citation Context Extraction

In this paper we present a constrained hidden markovmodel based approach for extracting non-explicit citingsentences in research articles. Our method involves firstindependently training a separate HMM for each citationin the article being processed and then performing a con-strained joint inference to label non-explicit citing sen-tences. Results on a standard test collection show thatour method significantly outperforms the baselines and iscomparable to the state of the art approaches.

Parikshit SondhiUniversity of Illinois Urbana ChampaignWalmart [email protected]

ChengXiang ZhaiUniversity of Illinois at Urbana [email protected]

CP10

A Constructive Setting for the Problem of DensityRatio Estimation

We introduce a constructive setting for the problem ofdensity ratio estimation through the solution of a multi-dimensional integral equation. In this equation, not onlyits right hand side is approximately known, but also theintegral operator is approximately defined. We show thatthis ill-posed problem has a rigorous solution and obtainthe solution in a closed form. The key element of this solu-tion is the novel V -matrix, which captures the geometry ofthe observed samples. We compare our method with pre-viously proposed ones, using both synthetic and real data.Our experimental results demonstrate the good potentialof the new approach.

Vladimir VapnikNEC Laboratories [email protected]

Igor BragaUniversity of Sao [email protected]

Rauf IzmailovApplied Communication Sciences

[email protected]

CP10

Embed and Conquer: Scalable Embeddings forKernel k-Means on MapReduce

The kernel k-means is an effective method for data cluster-ing which extends the commonly-used k-means algorithmto work on a similarity matrix over complex data struc-tures. It is, however, computationally very complex as itrequires the complete kernel matrix to be calculated andstored. Further, its kernelized nature hinders the paral-lelization of its computations on modern scalable infras-tructures for distributed computing. In this paper, we aredefining a family of low-dimensional embeddings that al-lows for scaling kernel k-means on MapReduce via an ef-ficient and unified parallelization strategy. Afterwards, wepropose two practical methods for low-dimensional embed-ding that adhere to our definition of the embeddings fam-ily. Exploiting the proposed parallelization strategy, wepresent two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency ofthe proposed algorithms through an empirical evaluationon benchmark datasets.

Ahmed Elgohary, Ahmed Farahat, Mohamed Kamel,Fakhri KarrayUniversity of [email protected], [email protected],[email protected], [email protected]

CP10

Density Estimation with Adaptive Sparse Grids forLarge Data Sets

Even though kernel density estimation is widely used, itsperformance highly depends on the kernel bandwidth, andit can become computationally expensive for large datasets. We present a sparse-grid-based density estimationmethod which discretizes the density on basis functionscentered at grid points rather than on kernels centered atthe data points. Thus, the costs of evaluating the estimateddensity function are independent from the number of datapoints. Numerical results confirm that our method is com-petitive to current kernel-based approaches with respect toaccuracy and runtime.

Benjamin PeherstorferSCCS, Department of InformaticsTechnische Universitat [email protected]

Dirk PflugerUniversitat Stuttgart, SimTech-IPVSSimulation of Large [email protected]

Hans-Joachim BungartzTechnische Universitat Munchen, Department ofInformaticsChair of Scientific Computing in Computer [email protected]

CP10

On Randomly Projected Hierarchical Clusteringwith Guarantees

Hierarchical clustering algorithms are very popular owing

Page 13: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

40 DT14 Abstracts

to their simplicity of implementation and ease of interpre-tation of the results. However, they impose a large runtimecost, thus limiting their applicability to typically only smalldata instances. Here we mitigate this shortcoming and ex-plore fast hierarchical clustering algorithms based on ran-dom projections. We present adaptations for well-knownHC variants, such as single (SLC) and average (ALC) link-age clustering. The algorithms maintain, with arbitraryhigh probability, the outcome of hierarchical clustering aswell as the worst-case running-time guarantees.

Johannes SchneiderZurich [email protected]

CP10

Self-Taught Spectral Clustering Via ConstraintAugmentation

Although constrained spectral clustering has been used ex-tensively for the past few years, all work assumes the guid-ance (constraints) are given by humans. Original formu-lations of the problem assumed the constraints are givenpassively whilst later work allowed actively polling an Or-acle (human experts). In this paper, for the first time toour knowledge, we explore the problem of augmenting thegiven constraint set for constrained spectral clustering al-gorithms. This moves spectral clustering towards the direc-tion of self-teaching as has occurred in the supervised learn-ing literature. We present a formulation for self-taughtspectral clustering and show that the self-teaching processcan drastically improve performance without further hu-man guidance.

Xiang WangIBM [email protected]

Jun WangIBM Thomas J. Watson Research CenterBusiness Analytics and Mathematical [email protected]

Buyue QianIBM [email protected]

Fei WangIBM T.J. Watson Research [email protected]

Ian DavidsonUniversity of California, [email protected]

CP11

User Preference Learning with Multiple Informa-tion Fusion for Restaurant Recommendation

In this paper, we develop a generative probabilistic modelto exploit the multi-aspect ratings of restaurants for restau-rant recommendation. Also, the geographic proximityis integrated into the probabilistic model to capture thegeographic influence. Moreover, the profile information,which contains customer/restaurant-independent featuresand the shared features, is also integrated into the model.Finally, we conduct a comprehensive experimental study

on a real-world data set. The experimental results clearlydemonstrate the benefit of exploiting multi-aspect ratingsand the improvement of the developed generative proba-bilistic model.

Hui XiongRutgers, the State University of New [email protected]

Yanjie Fu, Bin LiuRutgers [email protected], [email protected]

Yong GeUNC [email protected]

Zijun YaoRutgers [email protected]

CP11

On Modeling Community Behaviors and Senti-ments in Microblogging

In this paper, we propose the CBS topic model, a proba-bilistic graphical model, to derive the user communities inmicroblogging networks based on the sentiments they ex-press on their generated content and behaviors they adopt.As a topic model, CBS can uncover hidden topics andderive user topic distribution. In addition, our model as-sociates topic-specific sentiments and behaviors with eachuser community. Notably, CBS has a general frameworkthat accommodates multiple types of behaviors simulta-neously. Our experiments on two Twitter datasets showthat the CBS model can effectively mine the representa-tive behaviors and emotional topics for each community.We also demonstrate that CBS model perform as well asother state-of-the-art models in modeling topics, but out-performs the rest in mining user communities.

Tuan-Anh HoangLiving Analytics Research CentreSingapore Management [email protected]

William W. CohenCarnegie Mellon [email protected]

Ee-Peng LimSingapore Management [email protected]

CP11

Contextual Combinatorial Bandit and Its Applica-tion on Diversified Online Recommendation

we propose a principled framework called contextual com-binatorial bandit and introduce its application on diversi-fied online recommendation task. Specifically, we formu-late the recommendation task as a diversity promoting ex-ploration/exploitation problem. On each of n rounds, thelearning algorithm sequentially selects a set of items ac-cording to the item-selection strategy that balances explo-ration and exploitation, and collects the user feedback onthese selected items. For the above bandit problem, we fur-ther provide a algorithm that achieves O(

√n) regret after

Page 14: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 41

playing n rounds.

Lijing QinTsinghua [email protected]

Shouyuan ChenThe Chinese University of Hong [email protected]

Xiaoyan ZhuTsinghua [email protected]

CP11

Selecting a Representative Set of Diverse QualityReviews Automatically

In this paper, we study how to find a representative setof high quality reviews to cover diversified aspects of useropinions. Existing work cannot solve this problem well.To overcome the drawbacks of existing methods, we de-fine a new problem of finding the minimum set of reviewsto cover all of features with different sentiment polarityand high quality without user-defined parameters. To solvethe problem efficiently, we define potential objective func-tion and develop greedy algorithm to find the solution inpolynomial-time with approximation guarantee. We alsopropose two strategies to further reduce the number of re-views and to prune the search space respectively. Compre-hensive experiments conducted on real review sets showthat the proposed methods are effective and outperformexisting methods.

Nana [email protected]

Hongyan LiuTsinghua [email protected]

Jiawei Chen, Jun HeRenmin University of [email protected], [email protected]

CP11

Latent Factor Transition for Dynamic Collabora-tive Filtering

User preferences change over time and capturing suchchanges is essential for developing accurate recommendersystems. Despite its importance, only a few works in col-laborative filtering have addressed this issue. In this paper,we consider evolving preferences and we model user dy-namics by introducing and learning a transition matrix foreach user’s latent vectors between consecutive time win-dows. Intuitively, the transition matrix for a user sum-marizes the time-invariant pattern of the evolution for theuser. We first extend the conventional probabilistic matrixfactorization and then improve upon this solution throughits fully Bayesian model. These solutions take advantageof the model complexity and scalability of conventionalBayesian matrix factorization, yet adapt dynamically touser’s evolving preferences. We evaluate the effectivenessof these solutions through empirical studies on six large-

scale real life data sets.

Chenyi ZhangZhejiang [email protected]

Ke WangSimon Fraser University, [email protected]

Hongkun YuSimon Fraser [email protected]

Jianling SunZhejiang [email protected]

Ee-Peng LimSingapore Management [email protected]

CP12

A Statistical Learning Theory Framework for Su-pervised Pattern Discovery

This paper formalizes a latent variable inference problemwe call supervised pattern discovery, the goal of which isto find sets of observations that belong to a single “pat-tern.’ We discuss two versions of the problem and proveuniform risk bounds for both. In the first version, collec-tions of patterns can be generated in an arbitrary mannerand the data consist of multiple labeled collections. In thesecond version, the patterns are assumed to be generatedindependently by identically distributed processes. Theseprocesses are allowed to take an arbitrary form, so obser-vations within a pattern are not in general independent ofeach other. The bounds for the second version of the prob-lem are stated in terms of a new complexity measure, thequasi-Rademacher complexity.

Jonathan H. [email protected]

Cynthia RudinMassachusetts Institute of [email protected]

CP12

Ensembles of Elastic Distance Measures for TimeSeries Classification

Many alternative distance measures for comparing timeseries have been proposed for Time Series Classification(TSC) problems. These include variants of Dynamic TimeWarping (DTW), such as weighted and derivative DTW,and edit distance-based measures, including Longest Com-mon Subsequence and Time Warp Edit Distance. Our aimis to test two hypotheses related to these measures. Firstly,we test whether there is any significant difference whentraining nearest neighbour classifiers with these measuresfor TSC problems. Secondly, we test whether combiningthese measures through simple ensemble schemes gives sig-nificantly better accuracy. We have carried out one of thelargest studies ever conducted into TSC. Our first findingis that there is no significant difference in accuracy betweenthe distance measures on our data sets. Our second find-

Page 15: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

42 DT14 Abstracts

ing, and the major contribution of this work, is to definean ensemble classifier that significantly outperforms the in-dividual classifiers.

Jason Lines, Anthony BagnallUniversity of East [email protected], [email protected]

CP12

Finding the True Frequent Itemsets

Frequent Itemsets (FIs) mining from a dataset D is a fun-damental primitive in data mining. In many applicationsD is a collection of samples obtained from an unknownprobability distribution π on transactions. By extractingthe FIs in D one attempts to infer itemsets that are gen-erated by π with probability at least θ, which we call theTrue Frequent Itemsets (TFIs). The set of FIs often con-tains a huge number of false positives w.r.t. the TFIs. We

design and analyze an algorithm to identify a threshold θsuch that the collection of itemsets with frequency at least

θ in D contains only TFIs with probability at least 1 − δ,for some user-specified δ. Our method uses the (empiri-cal) VC-dimension of the problem at hand. This allows usto identify almost all the TFIs without including any falsepositive.

Matteo RiondatoDepartment of Computer ScienceBrown [email protected]

Fabio VandinDepartment of Mathematics and Computer ScienceUniversity of Southern [email protected]

CP12

When and Where: Predicting Human MovementsBased on Social Spatial-Temporal Events

Predicting both the time and the location of human move-ments is valuable but challenging for a variety of applica-tions. To address this problem, we propose an approachconsidering both the periodicity and the sociality of hu-man movements. We first define a new concept, SocialSpatial-Temporal Event (SSTE), to represent social inter-actions among people. For the time prediction, we char-acterise the temporal dynamics of SSTEs with an ARMA(AutoRegressive Moving Average) model. To dynamicallycapture the SSTE kinetics, we propose a Kalman Filterbased learning algorithm to learn and incrementally updatethe ARMA model as a new observation becomes available.For the location prediction, we propose a ranking modelthe periodicity and the sociality of human movements aresimultaneously taken into consideration for improving theprediction accuracy. Extensive experiments conducted onreal data sets validate our proposed approach.

Ning YangCollege of Computer ScienceSichuan [email protected]

Xiangnan Kong, Fengjiao Wang, Philip YuUniversity of Illinois at Chicago

[email protected], [email protected], [email protected]

CP12

Discovery of Rare Sequential Topic Patterns inDocument Stream

In this paper, we propose the novel mining problem of rareSequential Topic Patterns (STPs) from Internet documentstreams, which are rare on the whole but relatively often forspecific users. It can be applied in many practical fields,such as personalized context-aware recommendation andreal-time monitoring on abnormal user behaviors. We de-sign a group of effective algorithms to discover these user-related rare STPs based on the temporal and probabilisticinformation of extracted topics.

Zhongyi Hu, Hongan WangInstitute of Software,Chinese Academy of [email protected], [email protected]

Jiaqi ZhuInstitute of Software, Chinese Academy of [email protected]

Maozhen LiSchool of Engineering and Design,Brunel [email protected]

Ying Qiao, Changzhi DengInstitute of Software,Chinese Academy of [email protected], [email protected]

MS1

Clustering with Linear Programming

Arindam BanerjeeUniversity of [email protected]

MS1

Clustering Evaluation and Validation: Some Re-sults, Challenges, and Research Questions

Ricardo J. CampelloDepartment of Computer SciencesUniversity of So Paulo at So [email protected]

MS1

Utilizing Multiple Clusterings: Beyond ConsensusClustering

Xiaoli Z. FernOregon State [email protected]

MS1

Title Not Available - Karypis

George Karypis

Page 16: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 43

University of Minnesota / [email protected]

PP1

Discovering Groups of Time Series with Similar Be-havior in Multiple Small Intervals of Time

The focus of this paper is to address the problem of dis-covering groups of time series that share similar behaviorin multiple small intervals of time. This problem has twocharacteristics: i) There are exponentially many combina-tions of time series that needs to be explored to find thesegroups, ii) The groups of time series of interest need tohave similar behavior only in some subsets of the time di-mension. We present an Apriori based approach to addressthis problem. We evaluate it on a synthetic dataset anddemonstrate that our approach can directly find all groupsof intermittently correlated time series without finding spu-rious groups unlike other alternative approaches that findmany spurious groups. We also demonstrate, using a neu-roimaging dataset, that groups of intermittently coherenttime series discovered by our approach are reproducibleon independent sets of time series data. In addition, wedemonstrate the utility of our approach on an S&P 500stocks data set.

Gowtham AtluriDept. of Computer ScienceUniversity of [email protected]

Michael SteinbachUniversity of [email protected]

Kelvin LimDept. of PsychiatryUniversity of [email protected]

Angus MacDonald IiiDept. of PsychologyUniversity of [email protected]

Vipin KumarUniversity of [email protected]

PP1

Large Building Associations Between Markersof Environmental Stressors and Adverse HumanHealth Impacts Using Frequent Itemset Mining

Health effects are unknown for most of the >83,000 chem-icals in commerce, creating challenges in balancing indus-trial needs with health protection. Here we describe the useof frequent itemset mining to identify exposure ⇒ healthassociations from the NHANES, a large-scale survey mea-suring biomarkers of environmental exposure and humanhealth. Our approach is designed to enable more effectiveknowledge discovery of potential health impacts of envi-ronmental chemicals by facilitating the comprehensive datamining and meta-analysis of the NHANES dataset.

Shannon M. BellOak Ridge Institute for Science and EducationU.S. Environmental Protection Agency

[email protected]

Stephen EdwardsUS Environmental [email protected]

PP1

Characterising Seismic Data

When a seismologist analyses a new seismogram it is oftenuseful to have access to a set of similar seismograms. Forexample if she tries to determine the event, if any, thatcaused the particular readings on her seismogram. So, thequestion is: when are two seismograms similar? We pro-pose a framework based on wavelets and MDL for this.Using MuLTi-Krimp we are able to extract succinct setsof characteristics from seismograms on which the similarityis based.

Roel Bertens, Arno SiebesUniversiteit UtrechtDepartment of Information and Computing [email protected], [email protected]

PP1

Density-Based Clustering Validation

One of the most challenging aspects of clustering is valida-tion, which is the objective and quantitative assessment ofclustering results. A number of different relative validitycriteria have been proposed for the validation of globular,clusters. Not all data, however, are composed of globularclusters. Density-based clustering algorithms seek parti-tions with high density areas of points (clusters, not nec-essarily globular) separated by low density areas, possiblycontaining noise objects. In these cases relative validityindices proposed for globular cluster validation may fail.In this paper we propose a relative validation index fordensity-based, arbitrarily shaped clusters. The index as-sesses clustering quality based on the relative density con-nection between pairs of objects. Our index is formulatedon the basis of a new kernel density function, which isused to compute the density of objects and to evaluate thewithin- and between-cluster density connectedness of clus-tering results. Experiments on synthetic and real worlddata show the effectiveness of our approach for the eval-uation and selection of clustering algorithms and their re-spective appropriate parameters.

Davoud MoulaviDepartment of Computing Science, University of [email protected]

Pablo JaskowiakDepartment of Computing Sciences, University of [email protected]

Ricardo J. CampelloDepartment of Computer SciencesUniversity of So Paulo at So [email protected]

Arthur ZimekLMU [email protected]

Joerg SanderDepartment of Computing Science, University of Alberta

Page 17: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

44 DT14 Abstracts

[email protected]

PP1

Online Discovery of Group Level Events in TimeSeries

Recent advances in high throughput data collection andstorage technologies have led to a dramatic increase in theavailability of high-resolution time series datasets in var-ious domains. These time series reflect the dynamics ofthe underlying physical processes. Detecting changes in atime series over time or changes in the relationships amongthe time series in a dataset containing multiple contempo-raneous time series can be useful to detect events in thesephysical processes. Contextual events detection algorithmsdetect changes in the relationships between multiple re-lated time series. In this work, we introduce a new typeof contextual events, called group level contextual changeevents. In contrast to individual contextual change eventsthat reflect the change in behavior of one target time se-ries against a context, group level events reflect the changein behavior of a target group of time series relative to acontext group of time series.

XI ChenUniversity of MinnesotaDepartment of Computer Science and [email protected]

Abdullah MueenUniversity of New [email protected]

Vijay Narayanan, Nikos Karampatziakis, Gagan BansalMicrosoft [email protected], [email protected],[email protected]

Vipin KumarUniversity of [email protected]

PP1

Frugal Traffic Monitoring with Autonomous Par-ticipatory Sensing

In this paper we propose a strategy for frugal sensing inwhich the participants send only a fraction of the observedtraffic information to reduce costs while achieving high ac-curacy. The strategy is based on autonomous sensing, inwhich participants make decisions to send traffic informa-tion without guidance from the central server, thus reduc-ing the communication overhead and improving privacy.We propose to use traffic flow theory in deciding whether ornot to send an observation to the server. To provide accu-rate and computationally efficient estimation of the currenttraffic, we propose to use a budgeted version of the Gaus-sian Process model on the server side. The experimentson real-life traffic data set indicate that the proposed ap-proach can use up to two orders of magnitude less samplesthan a baseline approach when estimating traffic speed ona highway network, with only a negligible loss in accuracy.

Vladimir Coric, Nemanja Djuric, Slobodan VuceticTemple [email protected], [email protected],

[email protected]

PP1

Improving Credit Card Fraud Detection with Cal-ibrated Probabilities

In this paper, two different methods for calibrating proba-bilities are evaluated and analyzed in the context of creditcard fraud detection, with the objective of finding themodel that minimizes the real losses due to fraud. It isshown that by calibrating the probabilities and then usingBayes minimum Risk the losses due to fraud are reduced.Furthermore, because of the good overall results, the afore-mentioned card processing company is currently incorpo-rating the methodology proposed in this paper into theirfraud detection system. Finally, the methodology has beentested on a different application, namely, direct marketing.

Alejandro Correa Bahnsen, Aleksandar Stokanovic,Djamila Aouada, Bjorn OtterstenLuxembourg [email protected], [email protected],[email protected], [email protected]

PP1

Iterative Framework Radiation Hybrid Mapping

Building comprehensive radiation hybrid maps for largesets of markers is a computationally expensive process,since the basic mapping problem is equivalent to the trav-eling salesman problem. The mapping problem is alsosusceptible to noise, and as a result, it is often beneficialto remove markers that are not trustworthy. The result-ing framework maps are typically more reliable but don’tprovide information about as many markers. We presentan approach to mapping most markers by first creatinga framework map and then incrementally adding the re-maining markers. We consider chromosomes of the humangenome, for which the correct ordering is known, and com-pare the performance of our two-stage algorithm with theCarthagene radiation hybrid mapping software. We showthat our approach is not only much faster than mappingthe complete genome in one step, but that the quality ofthe resulting maps is also much higher.

Raed SeetanNorth Dakota State [email protected]

Anne M. DentonDepartment of Computer ScienceNorth Dakota State [email protected]

Omar Al-AzzamUniversity of Minnesota [email protected]

Ajay KumarNorth Dakota State [email protected]

Jazarai SturdivantVirginia State [email protected]

Shahryar Kianian

Page 18: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 45

University of [email protected]. gov

PP1

Mining Interpretable and Predictive Diagno-sis Codes from Multi-Source Electronic HealthRecords

Mining patterns from multi-source electronic health-carerecords (EHR) can potentially lead to better and morecost-effective treatments. We aim to ?nd the groups ofICD-9 diagnosis codes from EHRs that can predict the im-provement of urinary incontinence of patients and are alsointerpretable to domain experts. In particular, we incor-porate two additional information from EHRs: the clinicalinformation such as demographic, behavioral, physiologi-cal, and psycho-social variables available for same set ofsamples and the prior information available from clinicaldomain knowledge such as the clinical classification sys-tem (CCS) while grouping the ICD-9 codes to enhancetheir interpretability. Our results obtained from a large-scale EHR data set show that the proposed integrativeframework enhances clinical interpretability substantiallyas compared to the baseline model obtained from ICD-9codes only, while achieving almost the same predictive ca-pability.

Sanjoy DeyUniversity of [email protected]

Gyorgy SimonInstitute of Health Informatics,University of [email protected]

Bonnie WestraSchool of Nursing,University of [email protected]

Michael Steinbach, Vipin KumarUniversity of [email protected], [email protected]

PP1

Online Matrix Completion Through Nuclear NormRegularisation

It is the main goal of this paper to propose a novel methodto perform matrix completion on-line. Motivated by awide variety of applications, ranging from the design of rec-ommender systems to sensor network localization throughseismic data reconstruction, we consider the matrix com-pletion problem when entries of the matrix of interest areobserved gradually. The algorithm promoted in this arti-cle builds upon the Soft Impute approach introduced inMazumder et al. (2010). The major novelty essentiallyarises from the use of a randomised technique for bothcomputing and updating the Singular Value Decomposi-tion (SVD) involved in the algorithm. Though of disarm-ing simplicity, the method proposed turns out to be veryefficient, while requiring reduced computations. Severalnumerical experiments based on real datasets illustratingits performance are displayed, together with preliminaryresults giving it a theoretical basis.

Charanpal DhanjalTelecom ParisTech

[email protected]

Romaric GaudelUniversity of [email protected]

S ClemenconTelecom [email protected]

PP1

Classifying Imbalanced Data Streams Via DynamicFeature Group Weighting with Importance Sam-pling

Data stream classification and imbalanced data learningare two important areas of data mining research. Each hasbeen well studied to date with many interesting algorithmsdeveloped. However, only a few approaches reported in lit-erature address the intersection of these two fields due totheir complex interplay. In this work, we proposed an im-portance sampling driven, dynamic feature group weight-ing framework for classifying data streams of imbalanceddistribution. We derived the theoretical upper bound forthe generalization error of the proposed algorithm. We alsostudied the empirical performance of our method on a setof benchmark synthetic and real world data, and significantimprovement has been achieved over the competing algo-rithms in terms of standard evaluation metrics and parallelrunning time.

Ke Wu, Andrea EdwardsXavier Univ. Of [email protected], [email protected]

Wei FanHuawei Noah’s Ark [email protected]

Jing GaoUniversity at [email protected]

Kun ZhangXavier University of [email protected]

PP1

Exploiting Monotonicity Constraints in ActiveLearning for Ordinal Classification

We consider ordinal classification and instance rankingproblems where each attribute is known to have an increas-ing or decreasing relation with the class label. We aim toexploit such monotonicity constraints by using labeled at-tribute vectors to draw conclusions about the class labelsof order related unlabeled ones. Assuming we have a poolof unlabeled attribute vectors, and an oracle that can bequeried for class labels, the central problem is to choose aquery point whose label is expected to provide the mostinformation. We evaluate several query strategies on arti-ficial and real data.

Ad Feelders, Pieter SoonsUniversiteit Utrecht

Page 19: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

46 DT14 Abstracts

[email protected], [email protected]

PP1

Classifier-Adjusted Density Estimation forAnomaly Detection and One-Class Classifica-tion

Density estimation methods are often regarded as unsuit-able for anomaly detection in high-dimensional data dueto the difficulty of estimating multivariate probability dis-tributions. Instead, the scores from popular distance- andlocal-density-based methods, such as local outlier factor(LOF), are used as surrogates for probability densities. Wequestion this infeasibility assumption and explore a familyof simple statistically-based density estimates constructedby combining a probabilistic classifier with a naive densityestimate. Across a number of semi-supervised and unsu-pervised problems formed from real-world data sets, weshow that these methods are competitive with LOF andthat even simple density estimates that assume attributeindependence can perform strongly. We show that thesedensity estimation methods scale well to data with highdimensionality and that they are robust to the problem ofirrelevant attributes that plagues methods based on localestimates.

Lisa Friedland, Amanda GentzelUniversity of Massachusetts [email protected], [email protected]

David JensenUMass [email protected]

PP1

Online Anomaly Detection by Improved GrammarCompression of Log Sequences

Log sequences mining is widely used in detecting anoma-lies. The state-of-the-art detection methods either needsignificant computation, or require specific assumptions onthe logs distribution. Our proposed CADM needs no as-sumptions and can generate alerts online with only O(n)computation. It exploits the relative entropy between testand normal logs for anomalous levels discovery and an im-proved grammar-based compression method for relative en-tropy estimation. Experiments prove CADM can achievehigher detection accuracy with minimal computation.

Yun Gao, Wei Zhou, Zhang Zhang, Jizhong Han, DanMengInstitute of Information EngineeringChinese Academy of [email protected], [email protected],[email protected], [email protected],[email protected]

Zhiyong XuDepartment of Math and Computer ScienceSuffolk University [email protected]

PP1

Submodularity in Team Formation Problem

The team formation problem is about finding a team ofexperts for a project. Several formulations have been pro-posed for this problem but each of them focuses only on

a subset of design criteria such as skill coverage, socialcompatibility, economy, skill redundancy, etc. In this pa-per, for the first time, we show that most of these cri-teria can be encapsulated within one single formulation,namely unconstrained submodular function maximization.In our formulation, the submodular function turns out tobe non-negative and non-monotone. The maximization ofthis function is much less explored than its monotone con-strained counterpart. Recently, for this problem, a simu-lated annealing based randomized scheme having 0.41 ap-proximation ratio has been proposed. We customize thisalgorithm to our setting and conduct an extensive set ofexperiments to show its efficacy. Our formulation offersmany advantages such as skill cover softening, better teamcommunication, and connectivity relaxation.

Dinesh GargIBM India Research LabNew [email protected]

Avradeep BhowmikUniversity of Texas, [email protected]

Vivek BorkarIIT [email protected]

Madhavan PallanIBM India Research Lab,New [email protected]

PP1

Quantifying Trust Dynamics in Signed Graphs, theS-Cores Approach

This paper focuses on the issue of defining models and met-rics for reciprocity in signed graphs. In unsigned networks,reciprocity quantifies the predisposition of network mem-bers in creating mutual connections. On the other hand,this concept has not yet been investigated in the case ofsigned graphs. We capitalize on the graph degeneracy con-cept to identify subgraphs of the signed network in whichreciprocity is more likely to occur. This enables us to as-sess reciprocity at a global level, rather than at a local oneas in existing approaches. The large scale experiments weperform on real world trust networks lead to both interest-ing and intuitive results. These reciprocity measures canbe used in various social applications such as trust manage-ment, community detection and evaluation of individuals.The global reciprocity we define in this paper is closelycorrelated to the clustering structure of the graph, morethan the local reciprocity, indicated by our experimentalevaluation.

Christos Giatsidis, Michalis VazirgiannisEcole [email protected], [email protected]

Dimitrios ThilikosCNRS, LIRMM, [email protected]

Silviu ManiuUniversity of Hong [email protected]

Page 20: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 47

Bogdan CautisUniversite [email protected]

PP1

On Finding the Point Where There Is No Return:Turning Point Mining on Game Data

Gaming expertise is usually accumulated through playingor watching many game instances, and identifying criti-cal moments in these game instances called turning points.Turning Point Rules (TPRs) are game patterns that almostalways lead to some irreversible outcomes. In this paper,we formulate the notion of irreversible outcome propertywhich can be combined with pattern mining so as to au-tomatically extract TPRs from any given game datasets.To show the usefulness of TPRs, we apply them to Tetris,a popular game. We mine TPRs from Tetris games andgenerate challenging game sequences so as to help trainingan intelligent Tetris algorithm.

Wei GongSchool of Information SystemsSingapore Management [email protected]

Ee-Peng Lim, Feida ZhuSingapore Management [email protected], [email protected]

Palakorn Achananuparp, David LoSchool of Information SystemsSingapore Management [email protected], [email protected]

PP1

Memory-Efficient Query-Driven Community De-tection with Application to Complex Disease As-sociations

Rather than enumerating all communities in a network,mining for communities containing user-specified querynodes produces output more relevant to the users prefer-ence. In this paper, we propose and systematically comparetwo memory efficient approaches: out-of-core and index-based. The achieved scalability of our methods enablesthe discovery of diseases that are known to be or likely as-sociated with Alzheimer’s, when a genome-scale network ismined with Alzheimer’s biomarker genes as query nodes.

Steve Harenberg, Ramona Seay, Stephen Ranshous,Kanchana Padmanabhan, Jitendra Harlalka, EricSchendel, Michael OBrien, Rada ChirkovaNorth Carolina State [email protected], [email protected],[email protected], [email protected],[email protected], [email protected],[email protected], [email protected]

William HendrixNorthwestern [email protected]

Alok ChoudharyDept. of Electrical Engineering and Computer ScienceNorthwestern University, Evanston, [email protected]

Vipin KumarUniversity of [email protected]

Murali DoraiswamyDuke [email protected]

Nagiza SamatovaNorth Carolina State UniversityOak Ridge National [email protected]

PP1

Unsupervised Feature Learning by Deep SparseCoding

In this paper, we propose a new unsupervised feature learn-ing framework, namely Deep Sparse Coding (DeepSC),that extends sparse coding to a multi-layer architecturefor visual object recognition tasks. The main innovation ofthe framework is that it connects the sparse-encoders fromdifferent layers by a sparse-to-dense module. The sparse-to-dense module is a composition of a local spatial poolingstep and a low-dimensional embedding process, which takesadvantage of the spatial smoothness information in the im-age. As a result, the new method is able to learn multiplelayers of sparse representations of the image which capturefeatures at a variety of abstraction levels and simultane-ously preserve the spatial smoothness between the neigh-boring image patches. Combining the feature representa-tions from multiple layers, DeepSC achieves the state-of-the-art performance on multiple object recognition tasks.

Yunlong HeGeorgia Institute of [email protected]

Koray KavukcuogluDeepMind [email protected]

Yun WangPrinceton [email protected]

Arthur SzlamThe City College of New [email protected]

Yanjun QiUniversity of [email protected]

PP1

Recovering Missing Labels of CrowdsourcingWorkers

Labels collected from crowdsourcing platforms may be in-correct or missing. While most existing work focuses onmodeling the labeling errors, this paper proposes an algo-rithm to predict the missing labels of crowd workers. Weadopt thoughts from semi-supervised learning and utilizethe particular consistency between crowd workers. Exper-iments show that our algorithm can recover such missinglabels properly, which are useful in predicting the ground

Page 21: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

48 DT14 Abstracts

truth and discovering properties of crowd workers.

Qingyang HuCollege of Computer Science and Technology, [email protected]

Kevin ChiewProvident Technology Pte. Ltd., [email protected]

Hao HuangSchool of Computing,National University of Singapore,[email protected]

Qinming HeCollege of Computer Science and Technology,Zhejiang [email protected]

PP1

Detecting Influence Relationships from Graphs

Graphs have been used to represent objects and object con-nections in many applications. Mining influence relation-ships from graphs has gained increasing interests becauseproviding influence information about object connectionscan facilitate graph exploration, connection recommenda-tions, etc. In this paper, we study the problem of detectinginfluence aspects, on which objects are connected, and in-fluence degrees, with which one graph node influences othernodes on given aspects. Existing techniques focus on in-ferring either the influence degrees or influence types fromgraphs. We propose two generative Aspect Influence Mod-els, OAIM and LAIM, to detect both influence aspects andinfluence degrees at the aspect level. We compare thesemodels with one baseline approach which considers onlythe text content of objects. The empirical studies on ci-tation graphs and networks of users from Twitter showthat our models can discover more effective results thanthe baseline approach.

Chuan HuNew Mexico State [email protected]

Huiping CaoComputer ScienceNew Mexico State [email protected]

Chaomin KeNew Mexico State [email protected]

PP1

Less Is More: Similarity of Time Series under Lin-ear Transformations

When comparing time series, z-normalization preprocess-ing and dynamic time warping (DTW) distance becamealmost standard procedure. This paper makes a pointagainst carelessly using this setup by discussing implica-tions and alternatives. A (conceptually) simpler distancemeasure is proposed that allows for a linear transformationof amplitude and time only, but is also open for other nor-malizations (unachievable by z-normalization preprocess-

ing). Lower bounding techniques are presented for thismeasure that apply directly to raw series.

Frank HoppnerOstfalia University of Applied [email protected]

PP1

Meta: Multi-Resolution Framework for EventSummarization

Event summarization is an effective process that mines andorganizes event patterns to represent the original events. Itallows the analysts to quickly gain the general idea of theevents. In recent years, several event summarization al-gorithms have been proposed, but they all focus on howto find out the optimal summarization results, and are de-signed for one-time analysis. As event summarization is acomprehensive analysis work, merely handling this prob-lem with a single optimal algorithm is not enough. In theabsence of an integrated summarization solution, we pro-pose an extensible framework META to enable analysts toeasily and selectively extract and summarize events fromdifferent views with different resolutions. In this frame-work, we store the original events in a carefully-designeddata structure that enables an efficient storage and multi-resolution analysis.

Yexi JiangFlorida International [email protected]

Chang-Shing PerngIBM [email protected]

Tao LiFlorida International [email protected]

PP1

Beating Human Analysts in Nowcasting CorporateEarnings by Using Publicly Available Stock Priceand Correlation Features

Corporate earnings are a crucial indicator for investmentand business valuation. Despite their importance and thefact that classic econometric approaches fail to match an-alyst forecasts by orders of magnitude, the automatic pre-diction of corporate earnings from public data is not in thefocus of current machine learning research. In this paper,we present for the first time a fully automatized machinelearning method for earnings prediction that at the sametime a) only relies on publicly available data and b) canoutperform human analysts. The latter is shown empiri-cally in an experiment involving all S&P 100 companies ina test period from 2008 to 2012. The approach employsa simple linear regression model based on a novel featurespace of stock market prices and their pairwise correla-tions. With this work we follow the recent trend of now-casting, i.e., of creating accurate contemporary forecastsof undisclosed target values based on publicly observableproxy variables.

Michael Kamp, Mario Boley, Thomas GartnerFraunhofer [email protected],[email protected],

Page 22: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 49

[email protected]

PP1

Adversarial Learning with Bayesian HierarchicalMixtures of Experts

Many data mining applications operate in adversarial envi-ronment, for example, webpage ranking in the presence ofweb spam. A growing number of adversarial data miningtechniques are recently developed, providing robust solu-tions under specific defense-attack models. Existing tech-niques are tied to distributional assumptions geared to-wards minimizing the undesirable impact of given attackmodels. However, the large variety of attack strategies ren-ders the adversarial learning problem multimodal. There-fore, it calls for a more flexible modeling ideology for equiv-ocal input. In this paper we present a Bayesian hierarchicalmixtures of experts for adversarial learning. Optimal at-tacks minimizing the likelihood of malicious data are mod-eled interactively at both expert and gating levels in thelearning hierarchy. We demonstrate that our adversarialhierarchical-mixtures-of-experts learning model is robustagainst adversarial attacks on both artificial and real data.

Yan Zhou, Murat KantarciogluUniversity of Texas at [email protected], [email protected]

PP1

Large-Scale Multi-Label Learning with IncompleteLabel Assignments

Multi-label learning deals with the classification problemswhere each instance can be assigned with multiple la-bels simultaneously. Conventional multi-label learning ap-proaches mainly focus on exploiting label correlations. Thelabel sets for training instances are usually assumed tobe fully labeled without any missing labels. However, inmany real-world cases, the label assignments for traininginstances can be incomplete. This problem is typical whenthe number instances is very large, and the labeling cost isvery high, which makes it almost impossible to get a fullylabeled training set. In this paper, we study the problemof large-scale multi-label learning with incomplete labelassignments. We propose an approach based upon posi-tive and unlabeled stochastic gradient descent and stackedmodels, which can effectively and efficiently consider miss-ing labels and label correlations simultaneously, and is veryscalable, that has linear time complexities over the size ofthe data.

Xiangnan KongUniversity of Illinois at [email protected]

Zhaoming WuTsinghua University, [email protected]

Li-Jia LiYahoo! Research, [email protected]

Ruofei ZhangMicrosoft, [email protected]

Philip Yu

University of Illinois at [email protected]

Hang WuTsinghua University, [email protected]

Wei FanIBM T.J.Watson [email protected]

PP1

Large-Scale Kernel Ranksvm

Learning to rank is an important task for recommenda-tion systems, online advertisement and web search. Amongthose learning to rank methods, rankSVM is a widely usedmodel. Both linear and nonlinear (kernel) rankSVM havebeen extensively studied, but the lengthy training time ofkernel rankSVM remains a challenging issue. In this paper,after discussing difficulties of training kernel rankSVM, wepropose an efficient method to handle these problems. Theidea is to reduce the number of variables from quadraticto linear with respect to the number of training instances,and efficiently evaluate the pairwise losses. Our settingis applicable to a variety of loss functions. Further, gen-eral optimization methods can be easily applied to solvethe reformulated problem. Implementation issues are alsocarefully considered. Experiments show that our methodis faster than state-of-the-art methods for training kernelrankSVM.

Tzu-Ming Kuo, Ching-Pei Lee, Chih-Jen LinNational Taiwan [email protected], [email protected],[email protected]

PP1

Directed Interpretable Discovery in Tensors withSparse Projection

Tensors or multiway arrays are useful constructs capableof representing complex graphs and multi-source data, etc.Analyzing such complex data requires enforcing simplicityso as to give meaningful insights. Sparse tensor decom-position typically requires appending penalty term(s) tothe optimization, which provides a number of challenges.Not only is tuning the penalty weights time consuming,but the resulting decomposition often has a non-uniformdistribution of sparsity. This is undesirable for some datasets where we want each factor to have similar sparsity foreasier interpretation. We propose an alternative methodthat allows the user to specify the exact amount of spar-sity and where the sparsity should be. This is achieved byaugmenting the alternating non-negative least squares al-gorithm with a projection step. We demonstrate our worksusefulness in finding interpretable features for real worldproblems in fMRI scan data and face image analysis with-out any pre-processing.

Chia-Tung KuoDepartment of Computer ScienceUniversity of California, [email protected]

Ian DavidsonUniversity of California, Davis

Page 23: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

50 DT14 Abstracts

[email protected]

PP1

Efficient Matching of Substrings in Uncertain Se-quences

Substring matching is fundamental to data mining meth-ods for sequential data. It involves checking the existenceof a short subsequence within a longer sequence, ensuringno gaps within a match. Whilst a large amount of existingwork has focused on substring matching and mining tech-niques for certain sequences, there are only a few resultsfor uncertain sequences. Uncertain sequences provide pow-erful representations for modelling sequence behaviouralcharacteristics in emerging domains, such as bioinformat-ics, sensor streams and trajectory analysis. In this pa-per, we focus on the core problem of computing substringmatching probability in uncertain sequences and proposean efficient dynamic programming algorithm for this task.We demonstrate our approach is both competitive theoret-ically, as well as effective and scalable experimentally. Ourresults contribute towards a foundation for adapting classicsequence mining methods to deal with uncertain data.

Yuxuan LiUniversity of [email protected]

James BaileyThe University of [email protected]

Lars KulikUniversity of [email protected]

Jian PeiSchool of Computing ScienceSimon Fraser [email protected]

PP1

Result Integrity Verification of OutsourcedBayesian Network Structure Learning

There has been considerable recent interest in the data-mining-as-a-service paradigm: the client that lacks compu-tational resources outsources his/her data and data miningneeds to a third-party service provider. One of the securityissues of this outsourcing paradigm is how the client canverify that the service provider indeed has returned correctdata mining results. In this paper, we focus on the prob-lem of result verification of outsourced Bayesian network(BN) structure learning. We consider the untrusted serviceprovider that intends to return wrong BN structures. Wedevelop three efficient probabilistic verification approachesto catch the incorrect BN structure with high probabilityand cheap overhead. Our experimental results demonstratethat our verification methods can capture wrong BN struc-ture effectively and efficiently.

Wendy Hui WangStevens Institute of [email protected]

Ruilin LiuDepartment of Computer ScienceStevens Institute of Technology

[email protected]

Changhe YuanQueens College, [email protected]

PP1

The Benefits of Personalized Smartphone-BasedActivity Recognition Models

Abstract not available at time of publication.

Jeff LockhartFordham [email protected]

PP1

A New Framework for Traffic Anomaly Detection

Trajectory data is becoming more and more popular nowa-days and extensive studies have been conducted on trajec-tory data. One important research direction about tra-jectory data is the anomaly detection which is to find allanomalies based on trajectory patterns in a road network.In this paper, we introduce a road segment-based anomalydetection problem, which is to detect the abnormal roadsegments each of which has its “real’ traffic deviatingfrom its “expected’ traffic and to infer the major causesof anomalies on the road network. First, a deviation-based method is proposed to quantify the anomaly of reachroad segment. Second, based on the observation that oneanomaly from a road segment can trigger other anomaliesfrom the road segments nearby, a diffusion-based methodbased on a heat diffusion model is proposed to infer themajor causes of anomalies on the whole road network. Tovalidate our methods, we conduct intensive experimentson a large real-world GPS dataset of about 23,000 taxis inShenzhen, China to demonstrate the performance of ouralgorithms.

Jinsong LanComputer Network Information Center,Chinese Academy of [email protected]

Cheng LongDepartment of Computer Science and Engineering,The Hong Kong University of Science and [email protected]

Raymond C.-W. WongDepartment of Computer Science and EngineeringThe Hong Kong University of Science and [email protected]

Youyang ChenSchool of Computer Science,Beijing Institute of [email protected]

Yanjie FuRutgers [email protected]

Danhuai GuoComputer Network Information Center,Chinese Academy of [email protected]

Page 24: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 51

Shuguang LiuChongqing Institutes of Green and Intelligent TechnologyChinese Academy of [email protected]

Yong GeUNC [email protected]

Yuanchun Zhou, Jianhui LiComputer Network Information Center,Chinese Academy of [email protected], [email protected]

PP1

A Graphical Model Approach to Atlas-Free Miningof Mri Images

We propose a graphical model framework based on condi-tional random fields (CRFs) to mine MRI brain images. Asa proof-of-concept, we apply CRFs to the problem of braintissue segmentation. Experimental results show robust andaccurate performance on tissue segmentation comparableto other state-of-the-art segmentation methods. In addi-tion, results show that our algorithm generalizes well acrossdata sets and is less susceptible to outliers. Our methodrelies on minimal prior knowledge unlike atlas-based tech-niques, which assume images map to a normal template.Our results show that CRFs are a promising model for tis-sue segmentation, as well as other MRI data mining prob-lems such as anatomical segmentation and disease diag-nosis where atlas assumptions are unreliable in abnormalbrain images.

Chris Magnano, Ameet SoniSwarthmore CollegeComputer Science [email protected], [email protected]

Sriraam NatarajanIndiana UniversitySchool of Informatics and [email protected]

Gautam KunapuliUtopiaCompression [email protected]

PP1

ROCsearch — An ROC-guided Search Strategy forSubgroup Discovery

ROCsearch, an ROC-based beam search variant, automat-ically adapts its search behavior to the properties and re-sulting search landscape of a given dataset. A sensiblesearch width for each search level is automatically deter-mined by analyzing previous level results in ROC space.While producing equivalent results, ROCsearch is an orderof magnitude more efficient than traditional beam search,and domain experts can now eschew the hard task of as-certaining a proper beam width parameter.

Marvin MeengLIACS, Leiden [email protected]

Wouter DuivesteijnFakultat fur Informatik , Technische Universitat

[email protected]

Arno KnobbeLIACS, Leiden [email protected]

PP1

Community Discovery in Social Networks Via Het-erogeneous Link Association and Fusion

In this paper, we adapt a heterogeneous data clusteringalgorithm, called Generalized Heterogeneous Fusion Adap-tive Resonance Theory (GHF-ART), for discovering usercommunities in heterogeneous social networks. Compar-ing with existing algorithms, GHF-ART has several ad-vantages, including 1) performing real-time matching ofpatterns and one-pass learning which guarantee low com-putational cost; 2) Do not need the number of clusters aprior and 3) employing a weighting function to incremen-tally assess the importance of all the feature channels.

Lei Meng, Ah Hwee TanNanyang Technological [email protected], [email protected]

PP1

Discriminative Density-Ratio Estimation

In order to deal with covariate shift in classification prob-lems, we propose a novel method to discriminativelyreweight the training samples through estimating the den-sity ratio between the joint distributions of the trainingand test data for each class. The proposed algorithm isan iterative procedure that alternates between estimatingthe class information for the test data and estimating thenew class-wise density ratios between joint distributions.Experiments on synthetic and benchmark datasets demon-strate the superiority of the proposed method.

Yun-Qian Miao, Ahmed Farahat, Mohamed KamelUniversity of [email protected], [email protected],[email protected]

PP1

Accelerating Graph Adjacency Matrix Multiplica-tions with Adjacency Forest

We propose a method for accelerating matrix multiplica-tions that are iteratively performed with a sparse adjacencymatrix. We exploit the fact that the intermediate compu-tational results for the matrix multiplication of equivalentpartial row vectors of a matrix are the same. Our new datastructure, the adjacency forest, uses this property and rep-resents an adjacency matrix as a rooted forest that is madeby sharing the common suffixes of the row vectors of thematrix. We confirmed experimentally that our approachcan speed up the computation of Personalized PageRankand Non-negative matrix factorization up to 300%.

Masaaki NishinoNTT Communication Science LaboratoriesNTT [email protected]

Norihito YasudaJapan Science and Technology Agency

Page 25: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

52 DT14 Abstracts

[email protected]

Shin-Ichi MinatoHokkaido [email protected]

Masaaki NagataNTT Communication Science [email protected]

PP1

Graphical Models for Identifying Fraud and Wastein Healthcare Claims

We describe graphical model based methods for analyz-ing prescription and medical claims data in order to iden-tify fraud and waste. Our approach draws on ideas fromspeech recognition and language modeling to identify pa-tients, doctors and pharmacies whose prescription encoun-ters show significant departure from normative behavior.We have analyzed claims data from a large healthcareprovider, consisting of over 53 million individual prescrip-tion claims in the calendar year 2011.

Peder A. OlsenIBM, TJ Watson Research [email protected]

Ramesh NatarajanIBM Thomas J. Watson Research [email protected]

Sholom WeissIBM, TJ Watson Research [email protected]

PP1

An Optimization-Based Framework to Learn Con-ditional Random Fields for Multi-Label Classifica-tion

This paper studies multi-label classification problem inwhich data instances are associated with multiple, possiblyhigh-dimensional, label vectors. This problem is especiallychallenging when labels are dependent and one cannot de-compose the problem into a set of independent classifica-tion problems. To address the problem and properly rep-resent label dependencies we propose and study a pairwiseconditional random Field (CRF) model. We develop a newapproach for learning the structure and parameters of theCRF from data. The approach maximizes the pseudo likeli-hood of observed labels and relies on the fast proximal gra-dient descend for learning the structure and limited mem-ory BFGS for learning the parameters of the model. Em-pirical results on several datasets show that our approachoutperforms several multi-label classification baselines, in-cluding recently published state-of-the-art methods.

Mahdi Pakdaman NaeiniIntelligent Systems ProgramUniversity of [email protected]

Iyad BatalGeneral Electric Global Research [email protected]

Zitao Liu

University of [email protected]

CharmGil Hong, Milos HauskrechtComputer Science DepartmentUniversity of [email protected], [email protected]

PP1

Extracting Researcher Metadata with Labeled Fea-tures

Due to inherent diversity in values for certain metadatafields on researcher homepages (e.g., affiliation) supervisedalgorithms require a large number of labeled examples foraccurately identifying values for these fields. We addressthis issue with feature labeling, a recent semi-supervisedmachine learning technique. We apply feature labeling toresearcher metadata extraction from homepages by com-bining a small set of expert-provided feature distributionswith few fully-labeled examples. We study “dictionary”and “proximity” labeled features for our task in two stages.We experimentally show that this two-stage approach pro-vides significant improvements in the tagging performance.In one experiment with only ten labeled homepages and 22expert-specified labeled features, we obtained a 45% rela-tive increase in the F1 value for the affiliation field, whilethe overall F1 improves by 9%.

Sujatha Das GollapalliThe Pennsylvania State [email protected]

Yanjun QiUniversity of [email protected]

Prasenjit Mitra, Lee GilesCollege of Information Sciences and TechnologyThe Pennsylvania State [email protected], [email protected]

PP1

Multi-Task Clustering Using Constrained Symmet-ric Non-Negative Matrix Factorization

A promising new approach to improve clustering quality isto combine data from multiple related datasets (tasks) andapply multi-task clustering. We present a novel frameworkthat can simultaneously cluster multiple tasks through bal-anced Intra-Task (within-task) and Inter-Task (between-task) knowledge sharing. We propose an effective and flex-ible geometric affine transformation (contraction or expan-sion) of the distances between Inter-Task and Intra-Taskinstances for an improved Intra-Task clustering withoutoverwhelming the individual tasks with the bias accumu-lated from other tasks. We impose an Intra-Task soft or-thogonality constraint to a Symmetric Non-Negative Ma-trix Factorization (NMF) based formulation to generatebasis vectors that are near orthogonal within each task. In-ducing orthogonal basis vectors within each task imposesthe prior knowledge that a task should have orthogonal(independent) clusters.

Samir Al-Stouhi, Chandan ReddyWayne State University

Page 26: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 53

[email protected], [email protected]

PP1

A Weighted Adaptive Mean Shift Clustering Algo-rithm

The performance of the mean shift algorithm significantlydeteriorates with high dimensional data due to the spar-sity of the input space. Noisy features can also disguise themean shift procedure. In this paper we extend the meanshift algorithm to overcome these limitations, while main-taining its desirable properties. To achieve this goal, wefirst estimate the relevant subspace for each data point,and then embed such information within the mean shiftalgorithm, thus avoiding computing distances in the fulldimensional input space. The resulting approach achievesthe best-of-two-worlds: effective management of high di-mensional data and noisy features, while preserving a non-parametric nature. Our approach can also be combinedwith random sampling to speedup the clustering processwith large scale data, without sacrificing accuracy. Exten-sive experimental results on both synthetic and real-worlddata demonstrate the effectiveness of the proposed method.

Yazhou Ren, Carlotta DomeniconiGeorge Mason [email protected], [email protected]

Guoji ZhangSouth China University of [email protected]

Guoxian YuSouthwest [email protected]

PP1

Generalized Outlier Detection with Flexible KernelDensity Estimates

We analyse the interplay of density estimation and out-lier detection in density-based outlier detection. By clearand principled decoupling of both steps, we formulate ageneralization of density-based outlier detection methodsbased on kernel density estimation. Embedded in a broaderframework for outlier detection, the resulting method canbe easily adapted to detect novel types of outliers: whilecommon outlier detection methods are designed for detect-ing objects in sparse areas of the data set, our method canbe modified to also detect unusual local concentrations ortrends in the data set if desired. It allows for the integra-tion of domain knowledge and specific requirements. Wedemonstrate the flexible applicability and scalability of themethod on large real world data sets.

Arthur Zimek, Erich SchubertLMU [email protected], [email protected]

Hans-Peter KriegelLudwig-Maximilians University [email protected]

PP1

Membership Detection Using Cooperative DataMining Algorithms

More and more companies are providing data mining and

analytics solutions to customers using social media data.Unfortunately, given the exponential increase in the vol-ume of social media data, building local database snapshotsand running computationally expensive algorithms is notalways plausible. As an alternative to the centralized ap-proach, we study the feasibility of cooperative algorithmswhere data never leaves the mined social media network,and instead the network users themselves work together,using only the communication primitives provided by thesocial media site, to solve data mining problems. We focuson the task of group membership detection. After validat-ing the potential of cooperative solutions on Twitter, weempirically evaluate a collection of cooperative strategieson a snapshot of the Twitter network containing over 50million users. Our best solution, brokered token passing,can reliably and efficiently detect group membership.

Lisa Singh, Calvin Newport, Yiqing RenGeorgetown [email protected], [email protected],[email protected]

PP1

A Guide to Selecting a Network Similarity Method

We consider the problem of determining how similar twonetworks are. Many network-similarity methods exist; andit is unclear how one can select a method. We provide thefirst empirical study on the relationships between differentnetwork-similarity methods. We demonstrate that (1) dif-ferent network-similarity methods are well correlated, (2)some complex network-similarity methods can be closelyapproximated by a much simpler method, and (3) a fewnetwork-similarity methods produce rankings that are veryclose to the consensus ranking.

Sucheta Soundarajan, Tina Eliassi-RadDepartment of Computer ScienceRutgers [email protected], [email protected]

Brian GallagherLawrence Livermore National [email protected]

PP1

Discriminant Analysis for Unsupervised FeatureSelection

Feature selection has been proven to be efficient in handlinghigh-dimensional data. As most data is unlabeled, unsu-pervised feature selection has attracted more and more at-tention in recent years. Discriminant analysis has beenproven to be a powerful technique to select discrimina-tive features. To apply discriminant analysis, we usuallyneed label information which is absent for unlabeled data.This gap makes it challenging to apply discriminant anal-ysis for unsupervised feature selection. In this paper, weinvestigate how to exploit discriminant analysis in unsu-pervised scenarios to select discriminative features. Weintroduce the concept of pseudo labels, which enable dis-criminant analysis on unlabeled data, propose a novel un-supervised feature selection framework DisUFS which in-corporates learning discriminative features with generatingpseudo labels, and develop an effective algorithm for Dis-UFS. Experimental results demonstrate the effectiveness ofDisUFS.

Jiliang Tang

Page 27: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

54 DT14 Abstracts

Arizona State UniversityARIZONA STATE [email protected]

Xia Hu, Huiji Gao, Huan LiuArizona State [email protected], [email protected], [email protected]

PP1

Factor Matrix Trace Norm Minimization for Low-Rank Tensor Completion

Most existing low-n-rank minimization algorithms for ten-sor completion suffer from high computational cost. Toaddress this issue, we propose a novel factor matrix tracenorm minimization method. We introduce a tractable re-laxation of our rank function, which leads to a convexcombination problem of much smaller scale matrix nuclearnorm minimization. Then we develop an efficient ADMMscheme to solve the proposed problem. Experimental re-sults on both synthetic and real-world data validate theeffectiveness of our approach.

Yuanyuan LiuDepartment of Systems Engineering and EngineeringManagementThe Chinese University of Hong [email protected]

Fanhua ShangDepartment of Computer Science and EngineeringThe Chinese University of Hong [email protected]

Hong ChengDepartment of Systems Engineering and EngineeringManagementThe Chinese University of Hong [email protected]

James ChengThe Chinese University of Hong [email protected]

Hanghang TongCity Collge, [email protected]

PP1

Auto-Play: A Data Mining Approach to OdiCricket Simulation and Prediction

Cricket is a popular sport played by 16 countries, is thesecond most watched sport in the world after soccer, andenjoys a multi-million dollar industry. There is tremendousinterest in simulating games and predicting their outcome.In this paper, we build a prediction system that takes inhistorical match data as well as the instantaneous state ofa match, and predicts future match events culminating ina victory or loss.

Vignesh Veppur Sankaranarayanan, Junaed Sattar, LaksV.S. LakshmananUniversity of British Columbia

[email protected], [email protected], [email protected]

PP1

Relational Regularization and Feature Ranking

We propose a regularization and feature selection techniquefor logical and relational learning. To this end, we intro-duce a notion of locality that ties together features accord-ing to their proximity in a transformed representation ofthe relational learning problem obtained via a procedurethat we call “graphicalization’. We present a wrapper andan embedded approach to identify the most relevant sets ofpredicates which yields more readily interpretable resultsthan selecting low-level propositionalized features.

Mathias VerbekeDepartment of Computer ScienceKU [email protected]

Fabrizio CostaInstitut fur InformatikAlbert-Ludwigs-Universitatcosta@informatik.uni-freiburg.de

Luc De RaedtKatholieke Universiteit [email protected]

PP1

Future Influence Ranking of Scientific Literature

A new model called MRFRank is proposed to rank the fu-ture popularity of new publications and young researchers.First, words and words co-occurrence are extracted to char-acterize innovative papers and authors. Then, time-awareweighted graphs are constructed to distinguish the vari-ous importance of links. Finally, by leveraging above twofeatures, MRFRank ranks the future importance of papersand authors simultaneously. Experimental results on theArnetMiner dataset demonstrate the effectiveness of MR-FRank.

Senzhang WangBeihang UniversityUniversity of Illinois at [email protected]

Sihong XieUniversity of Illinois at [email protected]

Xiaoming ZhangBeihang [email protected]

Zhoujun LiBeihang [email protected]

Philip. S YuUniversity of Illinois at [email protected]

Xinyu ShuChina Agricultural University

Page 28: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 55

[email protected]

PP1

Kaleido: Network Traffic Attribution Using Multi-faceted Footprinting

Network traffic attribution, namely, inferring users re-sponsible for activities observed on network interfaces, isone fundamental yet challenging task in security foren-sics. Compared with other user-system interaction records,network traces are inherently coarse-grained, context-sensitive, and detached from user ends. This paperpresents KALEIDO, a new network traffic attribution toolwith a series of key features: a) it adopts a new class of in-ductive discriminant models to capture user- and context-specific patterns (“footprints’) from different aspects ofnetwork traffic; b) it applies efficient learning methods toextracting and aggregating such footprints from noisy his-torical traces; c) with the help of novel indexing structures,it performs efficient, runtime traffic attribution over high-volume network traces. The efficacy of KALEIDO is eval-uated using the real network traces collected over threemonths in a large enterprise network.

Ting WangIBM T.J. Watson Research [email protected]

Fei WangIBM T J Watson Research [email protected]

Reiner Sailer, Douglas SchalesIBM [email protected], [email protected]

PP1

Modeling Asymmetry and Tail Dependence amongMultiple Variables by Using Partial Regular Vine

Modelling high-dimensional dependence is widely studiedto explore deep relations in multiple variables particularlyuseful for financial risk assessment. Very often, strong re-strictions are applied on a dependence structure by existinghigh-dimensional dependence models. These restrictionsdisabled the detection of sophisticated structures such asasymmetry between multiple variables. The paper pro-poses a partial regular vine copula model to relax theserestrictions. The new model employs partial correlation toconstruct the regular vine structure, which is algebraicallyindependent. Our method is tested on a cross-countrystock market data set to construct and analyse the asym-metry and tail dependence.

Wei WeiAdvanced Analytic Institution, FEIT,University of Technology, [email protected]

Junfu Yin, Jinyan LiAdvanced Analytic Institution, FEITUniversity of Technology, [email protected], [email protected]

Longbing CaoUniversity of Technology, Sydney

[email protected]

PP1

WCAMiner: A Novel Knowledge Discovery Sys-tem for Mining Concept Associations UsingWikipedia

This paper presents WCAMiner, a system focusing ondetecting how concepts are associated by incorporatingWikipedia knowledge. We propose to combine contentanalysis and link analysis techniques over Wikipedia re-sources, and define various association mining models tointerpret such queries. Specifically, our algorithm can auto-matically build a Concept Association Graph (CAG) fromWikipedia for two given topics of interest, and generate aranked list of concept chains as potential associations be-tween the two given topics. In comparison to traditionalcross-document mining models where documents are usu-ally domain-specific, the system proposed here is capable ofhandling different query scenarios across domains withoutbeing limited to the given documents. We highlight theimportance of this problem in various domains, present ex-periments on different datasets and compare the mining re-sults with two competitive baseline models to demonstratethe improved performance of our system.

Peng Yan, Wei JinNorth Dakota State [email protected], [email protected]

PP1

Subgraph Search in Large Graphs with Result Di-versification

The problem of subgraph search in large graphs has wideapplications in both nature and social science. The sub-graph search results are typically ordered based on graphsimilarity score. In this paper, we study the problem ofranking the subgraph search results based on diversifica-tion. We design two ranking measures based on bothsimilarity and diversity, and formalize the problem as anoptimization problem. We give two efficient algorithms,the greedy selection and the swapping selection with prov-able performance guarantee. We also propose a novel localsearch heuristic with at least 100 times speedup and a sim-ilar solution quality. We demonstrate the efficiency andeffectiveness of our approaches via extensive experiments.

Huiwen Yu, Dayu YuanDepartment of Computer Science and EngineeringThe Pennsylvania State [email protected], [email protected]

PP1

To Sample Or To Smash? Estimating Reachabilityin Large Time-Varying Graphs

Time-varying graphs (T-graph) consist of a time-evolvingset of graph snapshots (or graphlets). A T-graph prop-erty with potential applications in both computer and so-cial network forensics is T-reachability, which identifies thenodes reachable from a source node using the T-graphedges over time period T. In this paper, we consider theproblem of estimating the T-reachable set of a source nodein two different settings – when a time-evolution of a T-graph is specified by a probabilistic model, and when theactual T-graph snapshots are known and given to us offline

Page 29: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

56 DT14 Abstracts

(”data aware” setting).

Feng YuGraduate Center of [email protected]

Prithwish BasuRaytheon BBN [email protected]

Amotz Bar-Noy, Dror RawitzGraduate Center of [email protected], [email protected]

PP1

Finding Friends on a New Site Using Minimum In-formation

With the emergence of numerous social media sites, indi-viduals, with their limited time, often face a dilemma ofchoosing a few sites over others. Users prefer more engag-ing sites, where they can find familiar faces such as friends,relatives, or colleagues. Link prediction methods help findfriends using link or content information. Unfortunately,whenever users join any site, they have no friends or anycontent generated. In this case, sites have no chance otherthan recommending random influential users to individu-als hoping that users by befriending them create sufficientinformation for link prediction techniques to recommendmeaningful friends. In this study, by considering socialforces that form friendships, namely, influence, homophily,and confounding, and by employing minimum informationavailable for users, we demonstrate how one can signifi-cantly improve random predictions without link or contentinformation.

Reza Zafarani, Huan LiuArizona State [email protected], [email protected]

PP1

Unsupervised Selection of Robust Audio FeatureSubsets

We propose an unsupervised, filter-based feature selectionapproach that preserves the natural assignment of featurecomponents to semantically meaningful features. Experi-ments on different tasks in the audio domain show that theproposed approach outperforms well-established feature se-lection methods in terms of retrieval performance and run-time. Results achieved on different audio datasets for thesame retrieval task indicate that the proposed method ismore robust in selecting consistent feature sets across dif-ferent datasets than compared approaches.

Maia ZaharievaTU Wien, [email protected]

Gerhard SagederUniversity of Vienna, [email protected]

Matthias ZeppelzauerVienna University of Technology, Austria

[email protected]

PP1

Towards Breaking the Curse of Dimensionality forHigh-Dimensional Privacy

We present a general approach to cope the problem ofcurse of dimensionality in privacy models k-anonymity, �-diversity, and t-closeness. In the presence of inter-attributecorrelations, such an approach continues to be much morerobust in higher dimensionality, without losing accuracy.We present experimental results illustrating the effective-ness of the approach. This approach is resilient enoughto prevent identity, attribute, and membership disclosureattack.

Hessam ZakerzadehUniversity of [email protected]

Charu C. AggarwalIBM T. J. Watson Research [email protected]

Ken BarkerUniversity of [email protected]

PP1

Towards Accurate Histogram Publication underDifferential Privacy

This paper considers the problem of publishing histogramsunder differential privacy, one of the strongest privacymodels. Existing differentially private histogram publica-tion schemes have shown that clustering is a promising ideato improve the accuracy of sanitized histograms. However,none of them fully exploits the benefit of clustering. Weintroduce a new clustering framework, which features thetrade-off between the approximation error due to cluster-ing and the Laplace error due to Laplace noise injected.

Xiaojian ZhangSchool of Information, Renmin University of [email protected]

Rui Chen, Jianliang XuDepartment of Computer Science, Hong Kong [email protected], [email protected]

Xiaofeng MengSchool of Information, Renmin University of [email protected]

Yingtao XieExperiment Center, China West Normal [email protected]

PP1

Addressing Human Subjectivity Via TransferLearning: An Application to Predicting DiseaseOutcome in Multiple Sclerosis Patients

Predicting disease course in chronic progressive diseasessuch as multiple sclerosis (MS) is complicated by the im-

Page 30: DT14 Abstracts · rank-1 components, known variously as canonical decom-position, CANDECOMP, or PARAFAC. In any data or graph task, we are seeking to model real-world observa-tions.

DT14 Abstracts 57

pact of physician subjectivity in the data and because ofpatients biases in their choice of physician. We introducea new transfer learning approach to address both issues.For a longitudinal MS dataset, our transfer learning ap-proach performs significantly better than either forming aclassifier for the entire dataset or from a single physician’sdataset alone.

Yijun ZhaoTufts [email protected]

Carla BrodleyDepartment of Computer Science, Tufts [email protected]

Tanuja Chitnis, Brian HealyHarvard Medical SchoolPartners MS Center, Brigham and Womens [email protected], [email protected]