Ankit presentation

Outline: Introduction, Related work, Methodology, Proposed Approach, Conclusion & Future Work, References

Classification of Sentimental Reviews Using Machine Learning Techniques

ICRTC-2015: 3rd International Conference on Recent Trends in Computing
Presented at SRM University, Delhi-NCR Campus, Ghaziabad (U.P.)

Ankit Agrawal
Department of Computer Science and Engineering,
National Institute of Technology Rourkela, Rourkela - 769008, Odisha, India

March 9, 2015



Sentiment Analysis

Sentiment mainly refers to feelings, emotions, opinions, or attitudes (Argamon et al., 2009).

With the rapid growth of the World Wide Web, people often express their sentiments over the internet through social media, blogs, ratings, and reviews.

Business owners and advertising companies often employ sentiment analysis to discover new business strategies and advertising campaigns.


Machine Learning Techniques

Machine learning algorithms are used to classify and predict whether a document expresses positive or negative sentiment. The different types of machine learning algorithms are:

Supervised algorithms use a labeled dataset in which each document of the training set is labeled with the appropriate sentiment.

Unsupervised algorithms work on an unlabeled dataset (Singh et al., 2007).

This study is mainly concerned with supervised learning techniques on a labeled dataset.


Types of Sentiment Analysis

The different types of sentiment analysis are as follows:

Document Level: Document-level sentiment classification aims at classifying the entire document or review as either positive or negative.

Sentence Level: Sentence-level sentiment classification considers the polarity of each individual sentence of a document.

Aspect Level: Aspect-level sentiment classification first identifies the different aspects of a corpus and then, for each document, calculates the polarity with respect to the obtained aspects.

Document-level sentiment analysis is considered in this study.


Review of Related work

Author(s): Sentiment Analysis Approach

(Pang et al., 2002): They considered sentiment classification based on the categorization aspect, with positive and negative sentiments. They used three different machine learning algorithms, i.e., Naive Bayes, SVM, and Maximum Entropy classification, applied with the n-gram technique.

(Turney, 2002): He presented an unsupervised algorithm to classify a review as either recommended (positive) or not recommended (negative). The author used a Part of Speech (POS) tagger to identify phrases that contain adjectives or adverbs.

(Read, 2005): He used emoticons for labeling the dataset because they are independent of time, topic, and domain. He applied machine learning classifiers on this labeled dataset.


(Dave et al., 2003): They used structured reviews for testing and training, identifying features and scoring methods to determine whether the reviews are positive or negative. They used a classifier to classify sentences obtained from web searches that used the product name as the search condition.

(Whitelaw et al., 2005): They presented a sentiment classification technique based on the analysis and extraction of appraisal groups. An appraisal group represents a set of attribute values in task-independent semantic taxonomies.

(Li et al., 2011): They proposed various semi-supervised techniques to address the shortage of labeled data for sentiment classification. They used an under-sampling technique to deal with the imbalance problem in sentiment classification.


Types of Classification

Binary sentiment classification: Each document or review of the corpus is classified into one of two classes, either positive or negative.

Multi-class sentiment classification: Each review can be classified into more than two classes (strong positive, positive, neutral, negative, strong negative).

Generally, binary classification is useful when two products need to be compared. In this study, the implementation uses binary sentiment classification.


Preparation of Data

The unstructured textual review data needs to be converted into meaningful data before machine learning algorithms can be applied. The following methods have been used to transform textual data into numerical vectors.

CountVectorizer: Based on the number of occurrences of a feature in the review, a sparse matrix is created (Garreta and Moncecchi, 2013).

Term Frequency - Inverse Document Frequency (TF-IDF): The TF-IDF score helps balance the weight between the most frequent or general words and less commonly used words. Term frequency measures the frequency of each token in the review, but this frequency is offset by the frequency of that token in the whole corpus (Garreta and Moncecchi, 2013). The TF-IDF value shows the importance of a token to a document in the corpus.
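As a rough illustration of the count-based representation, the sketch below builds a vocabulary and a document-term count matrix in plain Python. The two sample reviews, the lowercase-and-split tokenizer, and the function name `count_vectorize` are assumptions made for this example; scikit-learn's CountVectorizer uses its own tokenizer and stores the result as a sparse matrix.

```python
from collections import Counter

def count_vectorize(reviews):
    """Return (vocabulary, matrix) where matrix[i][j] counts token j in review i."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [review.lower().split() for review in reviews]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for tokens that do not occur in this review.
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

# Hypothetical two-review corpus.
reviews = ["good movie good plot", "bad movie"]
vocabulary, matrix = count_vectorize(reviews)
print(vocabulary)
print(matrix)
```

Each row corresponds to one review and each column to one vocabulary token, so most entries are zero for a realistic corpus, which is why a sparse representation is used in practice.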


Machine Learning Algorithms Used

Naive Bayes Algorithm: It is a probabilistic classifier that applies Bayes' theorem under a strong independence assumption between the features (McCallum et al., 1998). For a given textual review ‘d’ and a class ‘c’ (positive, negative), the conditional probability of the class given the review is P(c | d). According to Bayes' theorem, this quantity can be computed using equation 1:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \tag{1}$$
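A minimal sketch of how equation 1 can be turned into a classifier follows. It assumes a multinomial event model with add-one smoothing and works in log space; the toy training set and the names `train_naive_bayes` and `predict` are hypothetical and are not taken from the presentation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Estimate log P(c) and log P(token | c) with add-one smoothing."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + 1) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Pick the class maximizing log P(c) + sum of log P(token | c).

    P(d) is identical for every class, so it is dropped from the comparison.
    Tokens never seen in training are ignored.
    """
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set of tokenized reviews.
train_docs = [["great", "film"], ["awful", "film"], ["great", "acting"], ["awful", "plot"]]
train_labels = ["positive", "negative", "positive", "negative"]
log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels)
prediction = predict(["great", "plot"], log_prior, log_likelihood)
print(prediction)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long reviews.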


Support Vector Machine Algorithm: SVM is a non-probabilistic binary linear classifier (Turney, 2002). The SVM model represents each review in vectorized form as a data point in space. The method analyzes the complete vectorized data, and the key idea behind training the model is to find a separating hyperplane. The set of textual data vectors is said to be optimally separated by a hyperplane only when it is separated without error and the distance between the closest points of each class and the hyperplane is maximum.
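The margin criterion can be made concrete with a small sketch that does not train an SVM but only measures the margin of a given hyperplane w.x + b = 0. The 2-D points, the two candidate hyperplanes, and the function name `geometric_margin` are assumptions chosen for illustration.

```python
import math

def geometric_margin(weights, bias, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A positive result means every point lies on its
    correct side, i.e. the data is separated without error.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    distances = [
        label * (sum(w * x for w, x in zip(weights, point)) + bias) / norm
        for point, label in zip(points, labels)
    ]
    return min(distances)

# Hypothetical 2-D review vectors: two positive (+1) and two negative (-1).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes that both separate the data without error.
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)  # x + y = 2
margin_b = geometric_margin((1.0, 1.0), -3.0, points, labels)  # x + y = 3
print(margin_a, margin_b)
```

Both hyperplanes classify every point correctly, but the first leaves a larger gap to the closest points, so it is the one the maximum-margin criterion prefers.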


Confusion Matrix

A confusion matrix is generated to tabulate the performance of any classifier.

            Correct Labels
            Positive               Negative
Positive    True Positive (TP)     False Positive (FP)
Negative    False Negative (FN)    True Negative (TN)

Table: Confusion Matrix

TP (True Positive) is the number of positive reviews that are correctly predicted as positive, and FP (False Positive) is the number of negative reviews incorrectly predicted as positive.

TN (True Negative) is the number of negative reviews correctly predicted as negative, and FN (False Negative) is the number of positive reviews incorrectly predicted as negative.
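The four cells can be counted directly from paired lists of correct and predicted labels. The sketch below uses the standard convention that matches the table layout: a false positive is a negative review predicted as positive, and a false negative is a positive review predicted as negative. The label lists and the function name `confusion_counts` are hypothetical.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FP, FN, TN for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # negative review predicted as positive
        elif truth == positive:
            fn += 1  # positive review predicted as negative
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels for five reviews.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
counts = confusion_counts(actual, predicted)
print(counts)
```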


Evaluation Parameters

Precision: It gives the exactness of the classifier. It is the ratio of the number of correctly predicted positive reviews to the total number of reviews predicted as positive.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall: It measures the completeness of the classifier. It is the ratio of the number of correctly predicted positive reviews to the actual number of positive reviews present in the corpus.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$


F-measure: It is the harmonic mean of precision and recall. Its best value is 1 and its worst value is 0. The formula for calculating the F-measure is given in equation 4:

$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Accuracy: It is one of the most common performance evaluation parameters. It is calculated as the ratio of the number of correctly predicted reviews to the total number of reviews present in the corpus. The formula for calculating accuracy is given in equation 5:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{5}$$
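Equations 2 to 5 translate directly into a few lines of code. The sketch below is illustrative only: the counts are hypothetical, the function name `evaluation_metrics` is an assumption, and no guard against zero denominators is included.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Return precision, recall, F-measure, and accuracy (equations 2 to 5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts for 100 reviews.
precision, recall, f_measure, accuracy = evaluation_metrics(tp=30, fp=10, fn=20, tn=40)
print(precision, recall, f_measure, accuracy)
```

With these counts the precision is 30/40, the recall is 30/50, and the accuracy is 70/100; the F-measure falls between precision and recall, closer to the smaller of the two.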


Proposed Approach

Dataset
-> Preprocessing: stopword, numerical, and special character removal
-> Vectorization
-> Training using a machine learning algorithm
-> Classification
-> Result

Figure: Steps to obtain the required output


Dataset

In this study, the labeled aclIMDb movie dataset (IMDb, 2006) is considered.

The dataset contains 12,500 positive and 12,500 negative labeled reviews for training the model.

It also contains 12,500 positive and 12,500 negative reviews for testing the model.


Preprocessing

The reviews contain a large amount of vague information that needs to be eliminated.

In the preprocessing step, first, all special characters (such as ! and @) and unnecessary blank spaces are removed.

It is observed that reviewers often repeat a particular character of a word to give more emphasis to an expression or to make the review trendy (Amir et al., 2014).

The second step in preprocessing involves the removal of all English stopwords.
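The preprocessing steps can be sketched as follows. This is an assumption-laden illustration, not the presenter's code: the stopword list is a tiny hypothetical sample, lowercasing is added so that stopword matching works, and digits are removed together with special characters.

```python
import re

# Assumption: a tiny illustrative stopword list; the slides do not name one.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}

def preprocess(review):
    """Remove digits, special characters, extra spaces, and stopwords."""
    text = review.lower()                  # assumption: case is not significant
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits and special characters
    tokens = text.split()                  # split() also collapses repeated spaces
    return [token for token in tokens if token not in STOPWORDS]

# Hypothetical raw review.
cleaned = preprocess("This movie was   AWESOME!!! 10/10, a must-see @home")
print(cleaned)
```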


Vectorization

CountVectorizer: It transforms the review into a token count matrix. First, it tokenizes the review, and then a sparse matrix is created according to the number of occurrences of each token.

TF-IDF: Its value represents the importance of a word to a document in a corpus. The TF-IDF value is proportional to the frequency of a word in a document, but it is offset by the frequency of the word in the corpus.

Calculation of the TF-IDF value: Suppose a movie review contains 100 words, in which the word Awesome appears 5 times. The term frequency (TF) for Awesome is then 5 / 100 = 0.05. Suppose further that there are 1 million reviews in the corpus and the word Awesome appears in 1,000 of them. Then the inverse document frequency (IDF), using a base-10 logarithm, is log(1,000,000 / 1,000) = 3. Thus, the TF-IDF value is 0.05 * 3 = 0.15.
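The worked example can be reproduced with a short sketch. It treats the figure of 1,000 as the number of reviews containing the word, which is the usual definition of document frequency, and it uses a base-10 logarithm so that the numbers match the slide. The function name `tf_idf` and its parameter names are assumptions; library implementations such as scikit-learn's use a different logarithm and smoothing, so their values will differ.

```python
import math

def tf_idf(term_count, words_in_review, total_reviews, reviews_with_term):
    """Return (TF, IDF, TF-IDF) using a base-10 logarithm for the IDF."""
    tf = term_count / words_in_review
    idf = math.log10(total_reviews / reviews_with_term)
    return tf, idf, tf * idf

# Numbers from the worked example: "Awesome" occurs 5 times in a 100-word review.
tf, idf, score = tf_idf(term_count=5, words_in_review=100,
                        total_reviews=1_000_000, reviews_with_term=1_000)
print(tf, idf, score)
```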


Training and Classification

Naive Bayes (NB) algorithm: Using probabilistic analysis, features are extracted from the numeric vectors. These features help in training the Naive Bayes classifier model (McCallum et al., 1998).

Support Vector Machine (SVM) algorithm: SVM plots all the numeric vectors in space and defines decision boundaries by hyperplanes. The hyperplane separates the vectors into two categories such that the distance from the closest point of each category to the hyperplane is maximum (Turney, 2002).

After the model is trained using the above algorithms, the 12,500 positive and 12,500 negative reviews provided for testing are classified by the trained model.


Result NB

The confusion matrix obtained after Naive Bayes classification is shown in table 2, and the evaluation parameters Precision, Recall, and F-Measure are shown in table 3 as follows:

Table: Confusion matrix for Naive Bayes classifier

            Correct Labels
            Positive    Negative
Positive    11025       1475
Negative    2612        9888

Table: Evaluation parameters for Naive Bayes classifier

            Precision   Recall   F-Measure
Negative    0.81        0.88     0.84
Positive    0.87        0.79     0.83

The accuracy for Naive Bayes Classifier is 0.83652


Result SVM

The confusion matrix obtained after Support Vector Machine classification is shown in table 4, and the evaluation parameters Precision, Recall, and F-Measure are shown in table 5 as follows:

Table: Confusion matrix for Support Vector Machine classifier

            Correct Labels
            Positive    Negative
Positive    10993       1507
Negative    1749        10751

Table: Evaluation parameters for Support Vector Machine classifier

            Precision   Recall   F-Measure
Negative    0.86        0.88     0.87
Positive    0.88        0.86     0.87

The accuracy for Support Vector Machine classifier for unigram is 0.86976
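The reported accuracies can be recomputed from the two confusion matrices, since accuracy depends only on the diagonal cells and the total. The sketch below copies the matrix values from the result tables; the function name `accuracy_from_matrix` is an assumption.

```python
def accuracy_from_matrix(matrix):
    """Accuracy = sum of the diagonal cells divided by the sum of all cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Cell values copied from the Naive Bayes and SVM confusion matrices.
naive_bayes_matrix = [[11025, 1475], [2612, 9888]]
svm_matrix = [[10993, 1507], [1749, 10751]]

nb_accuracy = accuracy_from_matrix(naive_bayes_matrix)
svm_accuracy = accuracy_from_matrix(svm_matrix)
print(nb_accuracy, svm_accuracy)
```

Both matrices sum to 25,000 test reviews, and the recomputed values agree with the accuracies stated for the two classifiers.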


Comparison of Work

Table: Comparative results obtained in different literature using the IMDb dataset (classification accuracy)

Paper                        Naive Bayes    Support Vector Machine
Pang et al. (2002)           0.815          0.659
Salvetti et al. (2004)       0.796          x
Mullen and Collier (2004)    x              0.86
Beineke et al. (2004)        0.659          x
Matsumoto et al. (2005)      x              0.883
Proposed approach            0.83           0.884

The ‘x’ mark indicates that the algorithm is not considered by the authors in their respective paper.


Conclusion

In this study, an attempt has been made to classify sentimental movie reviews using machine learning techniques.

Two different algorithms, namely Naive Bayes (NB) and Support Vector Machine (SVM), are implemented.

It is observed that the SVM classifier outperforms the Naive Bayes classifier in predicting the sentiment of a review.

The results obtained in this study compare favorably with those reported in other literature on the same dataset.


Future Work

In this study, only two different classifiers have been implemented.

In the future, other supervised classification strategies such as the Maximum Entropy classifier, Stochastic Gradient Classifier, K Nearest Neighbor, and others can be considered for implementation.

Finally, their results can be compared with those of SVM, which is the best-performing classifier for sentiment analysis in this study.


Reference I

Amir, S., Almeida, M., Martins, B., Filgueiras, J., and Silva, M. J. (2014). Tugas: Exploiting unlabelled data for Twitter sentiment analysis. SemEval 2014, page 673.

Argamon, S., Bloom, K., Esuli, A., and Sebastiani, F. (2009). Automatically determining attitude type and force for sentiment analysis. In Human Language Technology. Challenges of the Information Society, pages 218–231. Springer.

Beineke, P., Hastie, T., and Vaithyanathan, S. (2004). The sentimental factor: Improving review classification via human-provided information. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 263. Association for Computational Linguistics.

Dave, K., Lawrence, S., and Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web, pages 519–528. ACM.

Garreta, R. and Moncecchi, G. (2013). Learning scikit-learn: Machine Learning in Python. Packt Publishing Ltd.

IMDb (2006). Imdb, internet movie database sentiment analysis dataset.

Li, S., Wang, Z., Zhou, G., and Lee, S. Y. M. (2011). Semi-supervised learning for imbalanced sentiment classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1826.

Matsumoto, S., Takamura, H., and Okumura, M. (2005). Sentiment classification using word sub-sequences and dependency sub-trees. In Advances in Knowledge Discovery and Data Mining, pages 301–311. Springer.

McCallum, A., Nigam, K., et al. (1998). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer.

Mullen, T. and Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In EMNLP, volume 4, pages 412–418.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics.


Reference II

Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pages 43–48. Association for Computational Linguistics.

Salvetti, F., Lewis, S., and Reichenbach, C. (2004). Automatic opinion polarity classification of movie. Colorado research in linguistics, 17:2.

Singh, Y., Bhatia, P. K., and Sangwan, O. (2007). A review of studies on machine learning techniques. International Journal of Computer Science and Security, 1(1):70–84.

Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 417–424. Association for Computational Linguistics.

Whitelaw, C., Garg, N., and Argamon, S. (2005). Using appraisal groups for sentiment analysis. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 625–631. ACM.


Thank You!
