
Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 244-252, Suntec, Singapore, 2-7 August 2009. © 2009 ACL and AFNLP

A Non-negative Matrix Tri-factorization Approach to Sentiment Classification with Lexical Prior Knowledge

Tao Li, Yi Zhang
School of Computer Science
Florida International University
{taoli,yzhan004}@cs.fiu.edu

Vikas Sindhwani
Mathematical Sciences
IBM T.J. Watson Research Center
[email protected]

    Abstract

Sentiment classification refers to the task of automatically identifying whether a given piece of text expresses positive or negative opinion towards a subject at hand. The proliferation of user-generated web content such as blogs, discussion forums and online review sites has made it possible to perform large-scale mining of public opinion. Sentiment modeling is thus becoming a critical component of market intelligence and social media technologies that aim to tap into the collective wisdom of crowds. In this paper, we consider the problem of learning high-quality sentiment models with minimal manual supervision. We propose a novel approach to learn from lexical prior knowledge in the form of domain-independent sentiment-laden terms, in conjunction with domain-dependent unlabeled data and a few labeled documents. Our model is based on a constrained non-negative tri-factorization of the term-document matrix which can be implemented using simple update rules. Extensive experimental studies demonstrate the effectiveness of our approach on a variety of real-world sentiment prediction tasks.

    1 Introduction

Web 2.0 platforms such as blogs, discussion forums and other such social media have now given a public voice to every consumer. Recent surveys have estimated that a massive number of internet users turn to such forums to collect recommendations for products and services, guiding their own choices and decisions by the opinions that other consumers have publicly expressed. Gleaning insights by monitoring and analyzing large amounts of such user-generated data is thus becoming a key competitive differentiator for many companies. While tracking brand perceptions in traditional media is hardly a new challenge, handling the unprecedented scale of unstructured user-generated web content requires new methodologies. These methodologies are likely to be rooted in natural language processing and machine learning techniques.

Automatically classifying the sentiment expressed in a blog around selected topics of interest is a canonical machine learning task in this discussion. A standard approach would be to manually label documents with their sentiment orientation and then apply off-the-shelf text classification techniques. However, sentiment is often conveyed with subtle linguistic mechanisms such as the use of sarcasm and highly domain-specific contextual cues. This makes manual annotation of sentiment time consuming and error-prone, presenting a bottleneck in learning high quality models. Moreover, products and services of current focus, and the associated community of bloggers with their idiosyncratic expressions, may rapidly evolve over time, causing models to potentially lose performance and become stale. This motivates the problem of learning robust sentiment models from minimal supervision.

In their seminal work, (Pang et al., 2002) demonstrated that supervised learning significantly outperformed a competing body of work where hand-crafted dictionaries are used to assign sentiment labels based on relative frequencies of positive and negative terms. As observed by (Ng et al., 2006), most semi-automated dictionary-based approaches yield unsatisfactory lexicons, with either high coverage and low precision or vice versa. However, the treatment of such dictionaries as forms of prior knowledge that can be incorporated in machine learning models is a relatively less explored topic; even less so in conjunction with semi-supervised models that attempt to utilize unlabeled data. This is the focus of the current paper.


Our models are based on a constrained non-negative tri-factorization of the term-document matrix, which can be implemented using simple update rules. Treated as a set of labeled features, the sentiment lexicon is incorporated as one set of constraints that enforce domain-independent prior knowledge. A second set of constraints introduces domain-specific supervision via a few document labels. Together these constraints enable learning from partial supervision along both dimensions of the term-document matrix, in what may be viewed more broadly as a framework for incorporating dual supervision in matrix factorization models. We provide empirical comparisons with several competing methodologies on four very different domains: blogs discussing enterprise software products, political blogs discussing US presidential candidates, amazon.com product reviews, and IMDB movie reviews. Results demonstrate the effectiveness and generality of our approach.

The rest of the paper is organized as follows. We begin by discussing related work in Section 2. Section 3 gives a quick background on Non-negative Matrix Tri-factorization models. In Section 4, we present a constrained model and a computational algorithm for incorporating lexical knowledge in sentiment analysis. In Section 5, we enhance this model by introducing document labels as additional constraints. Section 6 presents an empirical study on four datasets. Finally, Section 7 concludes the paper.

    2 Related Work

We point the reader to a recent book (Pang and Lee, 2008) for an in-depth survey of the literature on sentiment analysis. In this section, we briskly cover related work to position our contributions appropriately in the sentiment analysis and machine learning literature.

Methods focusing on the use and generation of dictionaries capturing the sentiment of words have ranged from manual approaches of developing domain-dependent lexicons (Das and Chen, 2001) to semi-automated approaches (Hu and Liu, 2004; Zhuang et al., 2006; Kim and Hovy, 2004), and even an almost fully automated approach (Turney, 2002). Most semi-automated approaches have met with limited success (Ng et al., 2006) and supervised learning models have tended to outperform dictionary-based classification schemes (Pang et al., 2002). A two-tier scheme (Pang and Lee, 2004), where sentences are first classified as subjective versus objective and the sentiment classifier is then applied only to the subjective sentences, further improves performance. Results in these papers also suggest that using more sophisticated linguistic models, incorporating parts-of-speech and n-gram language models, does not improve over the simple unigram bag-of-words representation. In keeping with these findings, we also adopt a unigram text model. A subjectivity classification phase before our models are applied may further improve the results reported in this paper, but our focus is on driving the polarity prediction stage with minimal manual effort.

In this regard, our model brings two inter-related but distinct themes from machine learning to bear on this problem: semi-supervised learning and learning from labeled features. The goal of the former theme is to learn from few labeled examples by making use of unlabeled data, while the goal of the latter theme is to utilize weak prior knowledge about term-class affinities (e.g., the term "awful" indicates negative sentiment and therefore may be considered as a negatively labeled feature). Empirical results in this paper demonstrate that simultaneously attempting both these goals in a single model leads to improvements over models that focus on a single goal. (Goldberg and Zhu, 2006) adapt semi-supervised graph-based methods for sentiment analysis but do not incorporate lexical prior knowledge in the form of labeled features. Most work in the machine learning literature on utilizing labeled features has focused on using them to generate weakly labeled examples that are then used for standard supervised learning: (Schapire et al., 2002) propose one such framework for boosting logistic regression; (Wu and Srihari, 2004) build a modified SVM; and (Liu et al., 2004) use a combination of clustering and EM based methods to instantiate similar frameworks. By contrast, we incorporate lexical knowledge directly as constraints on our matrix factorization model. In recent work, Druck et al. (2008) constrain the predictions of a multinomial logistic regression model on unlabeled instances in a Generalized Expectation formulation for learning from labeled features. Unlike their approach, which uses only unlabeled instances, our method uses both labeled and unlabeled documents in conjunction with labeled and unlabeled words.


terms after stemming. We eliminated terms that were ambiguous and dependent on context, such as "dear" and "fine". It should be noted that this list was constructed without a specific domain in mind, which is further motivation for using training examples and unlabeled data to learn domain-specific connotations.

Lexical knowledge in the form of the polarity of terms in this lexicon can be introduced into the matrix factorization model. By partially specifying term polarities via F, the lexicon influences the sentiment predictions G over documents.

    4.1 Representing Knowledge in Word Space

Let F_0 represent prior knowledge about sentiment-laden words in the lexicon, i.e., if word i is a positive word then (F_0)_{i1} = 1, while if it is negative then (F_0)_{i2} = 1. Note that one may also use soft sentiment polarities, though our experiments are conducted with hard assignments. This information is incorporated in the tri-factorization model via a squared loss term,

    min_{F,G,S} ||X - F S G^T||^2 + α Tr[(F - F_0)^T C_1 (F - F_0)]        (4)

where the notation Tr(A) means the trace of the matrix A. Here, α > 0 is a parameter which determines the extent to which we enforce F ≈ F_0, and C_1 is an m × m diagonal matrix whose entry (C_1)_{ii} = 1 if the category of the i-th word is known (i.e., specified by the i-th row of F_0) and (C_1)_{ii} = 0 otherwise. The squared loss term ensures that the solution for F in the otherwise unsupervised learning problem is close to the prior knowledge F_0. Note that if C_1 = I, then we know the class orientation of all the words and thus have a full specification of F_0; Eq. (4) then reduces to

    min_{F,G,S} ||X - F S G^T||^2 + α ||F - F_0||^2        (5)

The above model is generic and allows a certain flexibility. For example, in some cases our prior knowledge of F_0 is not very accurate and we use a smaller α so that the final results do not depend on F_0 very much, i.e., the results are mostly unsupervised learning results. In addition, the introduction of C_1 allows us to incorporate partial knowledge of word polarity information.
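As an illustration of how F_0 and the diagonal of C_1 can be assembled from a sentiment lexicon, here is a minimal NumPy sketch; it is not code from the paper, and the vocabulary and word sets in the usage comment are hypothetical placeholders.

```python
import numpy as np

def build_lexicon_prior(vocab, pos_words, neg_words):
    """Assemble F0 (m x 2) and the diagonal of C1 (length m) from a sentiment lexicon.

    vocab     : list of the m vocabulary terms (rows of the term-document matrix X)
    pos_words : set of lexicon terms with positive polarity
    neg_words : set of lexicon terms with negative polarity
    """
    m = len(vocab)
    F0 = np.zeros((m, 2))      # column 0 <-> (F0)_{i1} (positive), column 1 <-> (F0)_{i2} (negative)
    c1_diag = np.zeros(m)      # (C1)_{ii} = 1 only where the word's polarity is known
    for i, word in enumerate(vocab):
        if word in pos_words:
            F0[i, 0] = 1.0
            c1_diag[i] = 1.0
        elif word in neg_words:
            F0[i, 1] = 1.0
            c1_diag[i] = 1.0
    return F0, c1_diag

# Hypothetical usage with a toy vocabulary:
# F0, c1_diag = build_lexicon_prior(["awful", "great", "kitchen"], {"great"}, {"awful"})
```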

    4.2 Computational Algorithm

The optimization problem in Eq. (4) can be solved using the following update rules:

    G_{jk} ← G_{jk} · (X^T F S)_{jk} / (G G^T X^T F S)_{jk},        (6)

    S_{ik} ← S_{ik} · (F^T X G)_{ik} / (F^T F S G^T G)_{ik},        (7)

    F_{ik} ← F_{ik} · (X G S^T + α C_1 F_0)_{ik} / (F F^T X G S^T + α C_1 F)_{ik}.        (8)

The algorithm consists of an iterative procedure using the above three rules until convergence. We call this approach Matrix Factorization with Lexical Knowledge (MFLK) and outline the precise steps in the table below.

Algorithm 1 Matrix Factorization with Lexical Knowledge (MFLK)

begin
1. Initialization:
   Initialize F = F_0,
   G to the K-means clustering results,
   S = (F^T F)^{-1} F^T X G (G^T G)^{-1}.
2. Iteration:
   Update G: fixing F, S, update G
   Update F: fixing S, G, update F
   Update S: fixing F, G, update S
end
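For concreteness, the following is a minimal dense NumPy sketch of the MFLK updates in Eqs. (6)-(8), not the authors' implementation: X is the m x n term-document matrix, F0 and c1_diag come from the lexicon as in the earlier sketch, G_init is a k-means-based initialization, and alpha plays the role of α.

```python
import numpy as np

def mflk(X, F0, c1_diag, G_init, alpha=0.5, n_iter=100, eps=1e-9):
    """Multiplicative updates for Matrix Factorization with Lexical Knowledge (sketch).

    X       : (m, n) nonnegative term-document matrix
    F0      : (m, k) prior word-polarity matrix (k = 2 for positive/negative)
    c1_diag : (m,)  diagonal of C1 (1 where a word's polarity is known, else 0)
    G_init  : (n, k) initial document-cluster matrix, e.g. from k-means
    alpha   : weight of the lexical-prior penalty enforcing F close to F0
    """
    F = F0.astype(float) + eps            # Initialization: F = F0 (eps keeps entries positive)
    G = G_init.astype(float) + eps
    C1 = np.diag(c1_diag)
    # S = (F^T F)^{-1} F^T X G (G^T G)^{-1}, clipped to stay nonnegative
    S = np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G)
    S = np.maximum(S, eps)

    for _ in range(n_iter):
        XtFS = X.T @ F @ S                                    # shared (n, k) intermediate
        G *= XtFS / (G @ (G.T @ XtFS) + eps)                  # Eq. (6)
        XGSt = X @ G @ S.T                                    # shared (m, k) intermediate
        F *= (XGSt + alpha * (C1 @ F0)) / (F @ (F.T @ XGSt) + alpha * (C1 @ F) + eps)  # Eq. (8)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ (G.T @ G) + eps)  # Eq. (7)
    return F, S, G
```

Word and document polarities can then be read off from the larger of the two entries in each row of F and G respectively.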

    4.3 Algorithm Correctness and Convergence

Updating F, G, S using the rules above leads to asymptotic convergence to a local minimum. This can be proved using arguments similar to (Ding et al., 2006). We outline the proof of correctness for updating F, since the squared loss term that involves F is a new component in our models.

Theorem 1 The above iterative algorithm converges.

Theorem 2 At convergence, the solution satisfies the Karush-Kuhn-Tucker optimality condition, i.e., the algorithm converges correctly to a local optimum.

Theorem 1 can be proved using the standard auxiliary function approach used in (Lee and Seung, 2001).

Proof of Theorem 2. Following the theory of constrained optimization (Nocedal and Wright, 1999),


    we minimize the following function

    L(F) = ||X - F S G^T||^2 + α Tr[(F - F_0)^T C_1 (F - F_0)]

Note that the gradient of L is

    ∂L/∂F = -2 X G S^T + 2 F S G^T G S^T + 2 α C_1 (F - F_0).        (9)

The KKT complementarity condition for the nonnegativity of F_{ik} gives

    [-2 X G S^T + 2 F S G^T G S^T + 2 α C_1 (F - F_0)]_{ik} F_{ik} = 0.        (10)

This is the fixed point relation that a local minimum for F must satisfy. Given an initial guess of F, the successive updates of F using Eq. (8) will converge to a local minimum. At convergence, we have

    F_{ik} = F_{ik} · (X G S^T + α C_1 F_0)_{ik} / (F F^T X G S^T + α C_1 F)_{ik},

which is equivalent to the KKT condition of Eq. (10). The correctness of the updating rules for G in Eq. (6) and S in Eq. (7) has been proved in (Ding et al., 2006).

Note that we do not enforce exact orthogonality in our updating rules, since this often implies softer class assignments.

5 Semi-Supervised Learning with Lexical Knowledge

So far our models have made no demands on human effort, other than unsupervised collection of the term-document matrix and a one-time effort in compiling a domain-independent sentiment lexicon. We now assume that a few documents are manually labeled for the purposes of capturing some domain-specific connotations, leading to a more domain-adapted model. The partial labels on documents can be described using G_0, where (G_0)_{i1} = 1 if the document expresses positive sentiment and (G_0)_{i2} = 1 for negative sentiment. As with F_0, one can also use soft sentiment labeling for documents, though our experiments are conducted with hard assignments.

Therefore, semi-supervised learning with lexical knowledge can be described as

    min_{F,G,S} ||X - F S G^T||^2 + α Tr[(F - F_0)^T C_1 (F - F_0)] + β Tr[(G - G_0)^T C_2 (G - G_0)]

where α > 0, β > 0 are parameters which determine the extent to which we enforce F ≈ F_0 and G ≈ G_0 respectively, and C_1 and C_2 are diagonal matrices indicating the entries of F_0 and G_0 that correspond to labeled entities. The squared loss terms ensure that the solutions for F and G in the otherwise unsupervised learning problem stay close to the prior knowledge F_0 and G_0.

    5.1 Computational Algorithm

The resulting optimization problem can be solved using the following update rules:

    G_{jk} ← G_{jk} · (X^T F S + β C_2 G_0)_{jk} / (G G^T X^T F S + β G G^T C_2 G_0)_{jk},        (11)

    S_{ik} ← S_{ik} · (F^T X G)_{ik} / (F^T F S G^T G)_{ik},        (12)

    F_{ik} ← F_{ik} · (X G S^T + α C_1 F_0)_{ik} / (F F^T X G S^T + α C_1 F)_{ik}.        (13)

Thus the algorithm for semi-supervised learning with lexical knowledge based on our matrix factorization framework, referred to as SSMFLK, consists of an iterative procedure using the above three rules until convergence. The correctness and convergence of the algorithm can also be proved using arguments similar to those we outlined earlier for MFLK in Section 4.3.
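As a sketch (assuming the mflk() routine from the earlier example), only the G update changes for SSMFLK; G0 and c2_diag are built from the labeled documents exactly as F0 and c1_diag were built from the lexicon, and beta plays the role of β.

```python
import numpy as np

def ssmflk_g_update(G, X, F, S, G0, c2_diag, beta, eps=1e-9):
    """One SSMFLK update of G (Eq. 11); the S and F updates are unchanged from MFLK.

    G0      : (n, k) prior document-label matrix ((G0)_{i1} = 1 positive, (G0)_{i2} = 1 negative)
    c2_diag : (n,)  diagonal of C2 (1 where a document's label is known, else 0)
    beta    : weight of the document-label penalty enforcing G close to G0
    """
    C2G0 = c2_diag[:, None] * G0          # C2 @ G0 for a diagonal C2
    XtFS = X.T @ F @ S
    numer = XtFS + beta * C2G0
    denom = G @ (G.T @ XtFS) + beta * (G @ (G.T @ C2G0)) + eps
    return G * numer / denom
```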

    A quick word about computational complexity.

The term-document matrix is typically very sparse, with z ≪ nm non-zero entries, while k is typically also much smaller than n, m. By using sparse matrix multiplications and avoiding dense intermediate matrices, the updates can be very efficiently and easily implemented. In particular, updating F, S, G each takes O(k^2(m + n) + kz) time per iteration, which scales linearly with the dimensions and density of the data matrix. Empirically, the number of iterations before practical convergence is usually very small (less than 100). Thus, computationally our approach scales to large datasets even though our experiments are run on relatively small-sized datasets.
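To illustrate the complexity argument, here is a sketch of the G update of Eq. (6) written against scipy.sparse so that no dense m x n or n x n intermediate is ever formed; X_sparse is assumed to be a CSR term-document matrix, and the function name is illustrative.

```python
import scipy.sparse as sp

def sparse_g_update(G, X_sparse, F, S, eps=1e-9):
    """G update of Eq. (6) using only k-column dense intermediates.

    Cost per call: O(kz) for the sparse product X^T (F S) plus O(k^2 n) for the
    small dense products, in line with the O(k^2(m + n) + kz) per-iteration bound.
    """
    assert sp.issparse(X_sparse), "expects a scipy.sparse term-document matrix"
    FS = F @ S                       # (m, k) dense, small
    XtFS = X_sparse.T @ FS           # (n, k): k sparse matrix-vector products
    denom = G @ (G.T @ XtFS) + eps   # (n, k): avoids materializing the n x n matrix G G^T
    return G * XtFS / denom
```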

    6 Experiments

    6.1 Datasets Description

Four different datasets are used in our experiments.

Movies Reviews: This is a popular dataset in the sentiment analysis literature (Pang et al., 2002). It consists of 1000 positive and 1000 negative movie reviews drawn from the IMDB archive of the rec.arts.movies.reviews newsgroup.


Lotus blogs: This dataset is targeted at detecting sentiment around enterprise software, specifically pertaining to the IBM Lotus brand (Sindhwani and Melville, 2008). An unlabeled set of blog posts was created by randomly sampling 2000 posts from a universe of 14,258 blogs that discuss issues relevant to Lotus software. In addition to this unlabeled set, 145 posts were chosen for manual labeling. These posts came from 14 individual blogs, 4 of which are actively posting negative content on the brand, with the rest tending to write more positive or neutral posts. The data was collected by downloading the latest posts from each blogger's RSS feed, or by accessing the blog's archives. Manual labeling resulted in 34 positive and 111 negative examples.

Political candidate blogs: For our second blog domain, we used data gathered from 16,742 political blogs, which contain over 500,000 posts. As with the Lotus dataset, an unlabeled set was created by randomly sampling 2000 posts. 107 posts were chosen for labeling. A post was labeled as having positive or negative sentiment about a specific candidate (Barack Obama or Hillary Clinton) if it explicitly mentioned the candidate in positive or negative terms. This resulted in 49 positively and 58 negatively labeled posts.

Amazon Reviews: This dataset contains product reviews taken from Amazon.com for 4 product types: Kitchen, Books, DVDs, and Electronics (Blitzer et al., 2007). It contains about 4000 positive reviews and 4000 negative reviews and can be obtained from http://www.cis.upenn.edu/~mdredze/datasets/sentiment/.

For all datasets, we picked the 5000 words with highest document frequency to generate the vocabulary. Stopwords were removed and a normalized term-frequency representation was used. Genuinely unlabeled posts for Political and Lotus were used for the semi-supervised learning experiments in Section 6.3; they were not used in Section 6.2 on the effect of lexical prior knowledge. In the experiments, we set α, the parameter determining the extent to which to enforce the feature labels, to be 1/2, and β, the corresponding parameter for enforcing document labels, to be 1.
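A sketch of this preprocessing, assuming a recent scikit-learn is available: the 5000-term document-frequency cutoff, stopword removal, and L1-normalized term frequencies follow the text, while the exact tokenization the authors used is not specified, so CountVectorizer's defaults stand in.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def build_term_document_matrix(documents, vocab_size=5000):
    """Return the (m, n) term-document matrix X and its vocabulary list.

    documents : list of n raw text strings (one per post/review)
    """
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)                 # (n, m_all) raw counts
    df = np.asarray((counts > 0).sum(axis=0)).ravel()            # document frequency per term
    keep = np.argsort(df)[::-1][:vocab_size]                     # top terms by document frequency
    counts = counts[:, keep]
    tf = normalize(counts, norm="l1", axis=1)                    # normalized term frequencies
    vocab = vectorizer.get_feature_names_out()[keep].tolist()
    return tf.T.tocsr(), vocab                                   # terms x documents
```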

6.2 Sentiment Analysis with Lexical Knowledge

Of course, one can remove all burden on human effort by simply using unsupervised techniques. Our interest in the first set of experiments is to explore the benefits of incorporating a sentiment lexicon over unsupervised approaches. Does a one-time effort in compiling a domain-independent dictionary and using it for different sentiment tasks pay off in comparison to simply using unsupervised methods? In our case, matrix tri-factorization and other co-clustering methods form the obvious unsupervised baseline for comparison, and so we start by comparing our method (MFLK) with the following methods:

Four document clustering methods: K-means, Tri-Factor Nonnegative Matrix Factorization (TNMF) (Ding et al., 2006), Information-Theoretic Co-clustering (ITCC) (Dhillon et al., 2003), and the Euclidean Co-clustering algorithm (ECC) (Cho et al., 2004). These methods do not make use of the sentiment lexicon.

Feature Centroid (FC): This is a simple dictionary-based baseline method, sketched in code below. Recall that each word can be expressed as a bag-of-documents vector. In this approach, we compute the centroids of these vectors, one corresponding to positive words and another corresponding to negative words. This yields a two-dimensional representation for documents, on which we then perform K-means clustering.
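A minimal sketch of the FC baseline under the same assumptions as before (terms-by-documents matrix X, lexicon row indices); KMeans from scikit-learn stands in for the final clustering step, and the function name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def feature_centroid_baseline(X, pos_idx, neg_idx, random_state=0):
    """Feature Centroid (FC) baseline: cluster documents in a 2-D centroid space.

    X       : (m, n) term-document matrix; each row is a word's bag-of-documents vector
    pos_idx : row indices of positive lexicon words present in the vocabulary
    neg_idx : row indices of negative lexicon words present in the vocabulary
    Returns a cluster label in {0, 1} for each of the n documents.
    """
    pos_centroid = np.asarray(X[pos_idx].mean(axis=0)).ravel()   # (n,) centroid of positive-word rows
    neg_centroid = np.asarray(X[neg_idx].mean(axis=0)).ravel()   # (n,) centroid of negative-word rows
    docs_2d = np.column_stack([pos_centroid, neg_centroid])      # 2-D representation of each document
    return KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(docs_2d)
```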

Performance Comparison. Figure 1 shows the experimental results on the four datasets using accuracy as the performance measure. The results are obtained by averaging 20 runs. It can be observed that our MFLK method can effectively utilize the lexical knowledge to improve the quality of sentiment prediction.

[Figure 1: Accuracy results on the four datasets (Movies, Lotus, Political, Amazon) for MFLK, FC, TNMF, ECC, ITCC, and K-means.]


Size of Sentiment Lexicon. We also investigate the effect of the size of the sentiment lexicon on the performance of our model. Figure 2 shows results with random subsets of the lexicon of increasing size. We observe that the performance generally increases as more and more lexical supervision is provided.

[Figure 2: MFLK accuracy on the four datasets as the size of the sentiment lexicon (i.e., the number of words in the lexicon) increases; accuracy is plotted against the fraction of sentiment words labeled.]

Robustness to Vocabulary Size. High dimensionality and noise can have a profound impact on the comparative performance of clustering and semi-supervised learning algorithms. We simulate scenarios with different vocabulary sizes by selecting words based on information gain. It should, however, be kept in mind that in a truly unsupervised setting document labels are unavailable and therefore information gain cannot be practically computed. Figure 3 and Figure 4 show results for the Lotus and Amazon datasets respectively and are representative of performance on the other datasets. MFLK tends to retain its position as the best performing method even at different vocabulary sizes. ITCC performance is also noteworthy given that it is a completely unsupervised method.
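As a sketch of how such vocabulary subsets might be simulated (using document labels only for the simulation, as cautioned above), information gain can be computed as the mutual information between a term's presence/absence and the document label, for example with scikit-learn; the function name and interface are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_vocabulary_by_information_gain(X, doc_labels, fraction):
    """Keep the top `fraction` of terms ranked by information gain w.r.t. doc_labels.

    X          : (m, n) term-document matrix
    doc_labels : (n,) sentiment labels, used here only to simulate smaller vocabularies
    """
    presence = (X.T > 0).astype(np.float64)            # documents x terms, presence/absence
    ig = mutual_info_classif(presence, doc_labels, discrete_features=True, random_state=0)
    n_keep = max(1, int(fraction * X.shape[0]))
    keep = np.argsort(ig)[::-1][:n_keep]
    return X[keep], keep
```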

[Figure 3: Accuracy results on the Lotus dataset with increasing vocabulary size (fraction of original vocabulary), for MFLK, FC, TNMF, KMeans, ITCC, and ECC.]

[Figure 4: Accuracy results on the Amazon dataset with increasing vocabulary size (fraction of original vocabulary), for MFLK, FC, TNMF, KMeans, ITCC, and ECC.]

6.3 Sentiment Analysis with Dual Supervision

We now assume that, together with labeled features from the sentiment lexicon, we also have access to a few labeled documents. The natural question is whether the presence of lexical constraints leads to better semi-supervised models. In this section, we compare our method (SSMFLK) with the following three semi-supervised approaches: (1) the algorithm proposed in (Zhou et al., 2003), which conducts semi-supervised learning with local and global consistency (Consistency Method); (2) Zhu et al.'s harmonic Gaussian field method coupled with Class Mass Normalization (Harmonic-CMN) (Zhu et al., 2003); and (3) the Green's function learning algorithm (Green's Function) proposed in (Ding et al., 2007).

We also compare the results of SSMFLK with those of two supervised classification methods: Support Vector Machines (SVM) and Naive Bayes. Both of these methods have been widely used in sentiment analysis. In particular, the use of SVMs in (Pang et al., 2002) initially sparked interest in using machine learning methods for sentiment classification. Note that none of these competing methods utilizes lexical knowledge.

The results are presented in Figure 5, Figure 6, Figure 7, and Figure 8. We note that our SSMFLK method either outperforms all other methods over the entire range of numbers of labeled documents (Movies, Political), or ultimately outpaces the other methods (Lotus, Amazon) as a few document labels come in.

Learning Domain-Specific Connotations. In our first set of experiments, we incorporated the sentiment lexicon in our models and learnt the sentiment orientation of words and documents via the F and G factors respectively. In the second set of
