FLIPKART SAMSUNG

TEXT ANALYSIS TECHNIQUES TO ANALYZE REVIEWS OF SAMSUNG GALAXY MEGA 5.8 I9152

SUBMITTED BY

KOUSHIK RAKSHIT

ROLL NO: A14034


CONTENTS

1. Introduction

2. Problem Statement

3. Key Features

4. Research Design

5. Research Methodology

A. Insights from Web Crawling & Word Cloud

B. Latent Semantic Analysis (LSA) and Cluster Analysis

C. Reviews Ratings Analysis

D. Classification using Support Vector Machine (SVM)

E. Reviews Sentiment Analysis

6. Business Perspective

7. Appendix


1. INTRODUCTION: TEXT ANALYTICS

Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by detecting patterns and trends through means such as statistical pattern learning.

2. PROBLEM STATEMENT: ANALYZING REVIEWS FOR SAMSUNG GALAXY MEGA 5.8 I9152 (BLACK, WITH BLACK)

At least 100 reviews for the Samsung Galaxy Mega 5.8 I9152 were downloaded from flipkart.com, and a thorough analysis was carried out using text analysis techniques.

3. KEY FEATURES OF SAMSUNG GALAXY MEGA 5.8 I9152

Wi-Fi Enabled

Expandable Storage Capacity of 64 GB

5.8-inch TFT Capacitive Touchscreen

Android v4.2.2 (Jelly Bean) OS

8 MP Primary Camera

1.9 MP Secondary Camera

1.4 GHz Dual Core Processor

Full HD Recording


4. RESEARCH DESIGN

To analyze the users' responses we had to collect primary and secondary information from users' mobile reviews on the website http://www.flipkart.com. To analyze the users' perception of the phone, we took 100 reviews from the review section on Flipkart.

5. RESEARCH METHODOLOGY

To analyze the user reviews, the following research analysis procedures were undertaken:

A. Web Crawling & Word Cloud

B. Latent Semantic Analysis (LSA) and Clustering Analysis

C. Rating Analysis

D. Classification Analysis using Support Vector Machine(SVM)

E. Reviews Sentiment Analysis

A. INSIGHTS FROM WEB CRAWLING & WORD CLOUD

A tag cloud (word cloud, or weighted list in visual design) is a visual representation

for text data, typically used to depict keyword metadata (tags) on websites, or to

visualize free form text. Tags are usually single words, and the importance of each tag

is shown with font size or color. This format is useful for quickly perceiving the most

prominent terms and for locating a term alphabetically to determine its relative

prominence. When used as website navigation aids, the terms are hyperlinked to items

associated with the tag.

R packages used for the word cloud: RCurl, XML, rvest, wordcloud, tm


1. Fetching reviews from FLIPKART.COM

FLIPKART<-"http://www.flipkart.com/samsung-galaxy-mega-5-8-i9152/product-reviews/ITMEYFRTWAXZXTUT?pid=MOBDZSDJAPQXGAWN&type=all"

2. Word Cloud creation

wordcloud(d$words,d$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)

INFERENCE DRAWN: The words that take prominence in this word cloud give a clear idea that the mobile under discussion is good and may be known for its screen size, display, camera and battery. But it does not give a proper idea of whether the product is worth buying or whether users of this mobile are satisfied. So, to gain more insight into our data, we had to analyze the ratings (out of 5).

B. LATENT SEMANTIC ANALYSIS AND CLUSTER ANALYSIS

For Latent Semantic Analysis we break the term-document matrix into 3 matrices:


Word-dimension matrix

Document-dimension matrix

Diagonal matrix (of singular values)
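This three-matrix decomposition is the singular value decomposition (SVD) that LSA applies to the term-document matrix. A minimal base-R sketch on a toy matrix (the counts below are made up purely for illustration):

```r
# Toy 4-term x 3-document matrix with hypothetical term counts
m <- matrix(c(2,0,1, 0,3,1, 1,1,0, 0,1,2), nrow = 4, byrow = TRUE)
s <- svd(m)
# s$u : word-dimension matrix, s$d : singular values (the diagonal matrix),
# s$v : document-dimension matrix
recon <- s$u %*% diag(s$d) %*% t(s$v)
stopifnot(all(abs(recon - m) < 1e-8))  # the three factors reproduce the matrix
```

Truncating the small singular values in s$d gives the reduced "semantic" dimensions that the plots below are drawn from.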

Word-Dimension Matrix:

PLOTTING X1 vs X2

Inferences:

When we break the term-document matrix into a dimension-word vector space chart, it is clearly visible that positive words like "good" and feature words like "screen", "battery", etc. occur mostly along dimension 1.

"Grand", "Mega" and "phone" occur mostly along dimension 2.

"Display", "price", "money" and words about the quality of the phone occur more or less in both dimensions.


PLOTTING X1 vs X3

Inferences:

"Grand", "Mega" and "phone" occur mostly along dimension 1.

Quality-related words occur more or less in both dimensions.

PLOTTING X2 vs X3


Inferences:

"Grand" and "Mega" occur mostly along dimension 1.

"Camera" and "samsung" occur more or less in both dimensions.

PLOTTING FOR DOCUMENT MATRIX

Inference:

Documents 71, 67 and 49 are close to dimension 1.

Document 68 is close to dimension 2.

HIERARCHICAL CLUSTERING TO DETERMINE THE OPTIMUM NUMBER OF CLUSTERS

In data mining, hierarchical clustering (also called hierarchical cluster analysis or

HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.

Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its

own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster,

and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The

results of hierarchical clustering are usually presented in a dendrogram.
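The agglomerative ("bottom up") approach described above is what base R's hclust() implements. A minimal sketch on made-up 2-D points (two well-separated groups, chosen only for illustration):

```r
set.seed(1)
# Two hypothetical groups of 5 points each, centered at 0 and at 5
pts <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
             matrix(rnorm(10, mean = 5), ncol = 2))
hc <- hclust(dist(pts), method = "ward.D")  # greedy bottom-up merging, Ward linkage
plot(hc, hang = -1)                         # dendrogram of the merge hierarchy
groups <- cutree(hc, k = 2)                 # cut the tree into 2 clusters
```

Cutting the dendrogram at different heights, as done with rect.hclust() in the appendix code, is how the candidate cluster counts of 3, 4 and 5 were read off.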


As per the plot, the optimum number of clusters could be 3, 4 or 5.

CLUSTER ANALYSIS:

From the above LSA analysis we settled on 5 as the optimum number of clusters, which suggests that there are 5 categories of reviews among the total of 100 reviews. For now we will concentrate on these 5 review clusters, which help to club together the different types of reviews from the users.
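The 5 clusters were obtained by running k-means on the LSA coordinates (see the appendix code). A minimal base-R sketch on a made-up stand-in matrix, purely for illustration:

```r
set.seed(2)
pts <- matrix(rnorm(60), ncol = 3)           # hypothetical stand-in for the LSA dimensions
k5 <- kmeans(scale(pts), centers = 5, nstart = 20)
k5$size                                       # number of observations per cluster
k5$cluster                                    # cluster assignment of each row
```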

CLUSTER-1


There are a total of 57 observations in this cluster.

This cluster consists of words related to the price and the look of the phone.

CLUSTER-2

There are 32 observations in this cluster.

This cluster consists of words from reviews by customers who have had a good experience with this mobile.

WORD CLOUD FOR CLUSTER-1


CLUSTER-3

There are 38 observations in this cluster.

This cluster consists of words from reviews by customers who have good faith in the company.

WORD CLOUD FOR CLUSTER-2


CLUSTER-4

There are only 2 observations in this cluster.

This cluster does not throw any light on the nature of the reviews.

WORD CLOUD FOR CLUSTER-4

WORD CLOUD FOR CLUSTER-3


CLUSTER-5

This cluster has 1449 observations.

It consists of words related to product features and quality.

WORD CLOUD FOR CLUSTER-5


INFERENCE DRAWN FROM CLUSTERING

Apart from Cluster 1, the clusters do not give sufficient information about the customer base/type. Moreover, Clusters 2, 3, 4 and 5 are substantially smaller than Cluster 1, and no constructive storyline can be carved out of them.

Cluster 1, by contrast, reflects almost everything about the various features of the phone that the customers might have liked.

C. REVIEWS RATINGS ANALYSIS

Total reviews: 100

Satisfied reviews: 73

Dissatisfied reviews: 27

Checking the ratings gives a better idea that most users are satisfied with this mobile.
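The satisfied/dissatisfied split comes from thresholding the star ratings at 3, as in the appendix code. A minimal sketch with a short hypothetical ratings vector (the real analysis uses the 100 scraped ratings):

```r
ratings <- c(5, 4, 2, 5, 3, 4, 1, 5)   # hypothetical star ratings, for illustration only
satisfaction <- ifelse(ratings > 3, "satisfied", "dissatisfied")
table(satisfaction)                     # counts per category
```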

D. CLASSIFICATION USING SUPPORT VECTOR MACHINE (SVM)

In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

The generalization properties of an SVM do not depend on the dimensionality of the space. The generalization error can be bounded by a term depending on the quotient of the radius of a ball containing all the data and the margin realized on that data, but not on the dimensionality of the space. Many extensions exist, but the answer is essentially the same: the generalization does not depend on the dimensionality. An extended explanation is that one can generalize well even in high-dimensional spaces because the data occupies only a low-dimensional subspace of the feature space, and regularization results in the learner dealing only with that subspace. This can be seen in the eigenvalues of the kernel matrix, which typically decay quickly, meaning the data can be projected to a low-dimensional subspace with negligible error. So even with, for example, a Gaussian kernel, whose feature space is infinite-dimensional, one is actually dealing with an essentially finite-dimensional kernel feature space in which a linear decision function is learned, which is statistically tractable. Note that regularization is still needed, though.

From the above figure we can infer that out of 100 data points, 95 contribute to the formation of the margin, i.e. they are support vectors.

The 6 words displayed at the head have negative coefficients.

The 6 words displayed at the tail have positive coefficients.
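The appendix fits this classifier with svm() from the e1071 package. A minimal sketch of the same workflow on R's built-in iris data reduced to two classes (not the review data itself), just to illustrate the fit, the support-vector count and the coefficient extraction:

```r
library(e1071)
two <- subset(iris, Species != "setosa")
two$Species <- factor(two$Species)            # drop the unused factor level
fit <- svm(Species ~ ., data = two)           # default RBF kernel, as in the appendix
sum(fit$nSV)                                  # total number of support vectors
w <- t(fit$coefs) %*% fit$SV                  # per-feature weights, as in the appendix code
mean(predict(fit) == two$Species)             # training accuracy
```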


Using SVM, we have classified the reviews into two categories. Since "dissatisfied" is the first level, the words with negative coefficients have a positive impact, and vice versa.

Snapshot of the data frame containing the list of words and their frequency counts

E. REVIEWS SENTIMENT ANALYSIS

Sentiment essentially relates to feelings: attitudes, emotions and opinions. Sentiment Analysis refers to the practice of applying Natural Language Processing and text analysis techniques to identify and extract subjective information from a piece of text. A person's opinions or feelings are for the most part subjective and not facts, which means that accurately analyzing an individual's opinion or mood from a piece of text can be extremely difficult. With Sentiment Analysis, from a text analytics point of view, we are essentially looking to understand the attitude of a writer with respect to a topic in a piece of text and its polarity: whether it is positive, negative or neutral.

In recent years there has been a steady increase in interest from brands, companies and researchers in Sentiment Analysis and its application to business analytics. The business world today, as is the case in many data analytics streams, is looking for "business insight."

● Installing the 'qdap' package.

● We decided the threshold value of polarity used to classify between satisfied and dissatisfied on the basis of the plot on the next page.

● The tree plot was produced using the "party" library.

● The output of the polarity check, i.e. the sentiment analysis, gives the clear message that 65% of the buyers are satisfied with the purchase of the mobile.
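The polarity scores come from qdap's polarity() function, which the appendix applies review by review. A minimal sketch on a single made-up review sentence:

```r
library(qdap)
pol <- polarity("The screen and battery are really good")
pol$group$ave.polarity   # average polarity of the text; > 0 indicates positive sentiment
```

Scores above the chosen threshold are labeled "Satisfied", the rest "Dissatisfied", mirroring the cutoff read off the ctree plot.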


CONCLUSION & BUSINESS PERSPECTIVE

• The output of our text analytics techniques brings out the fact that the Samsung Galaxy Mega 5.8 I9152 is a mobile worth buying.

• Most of the customers who bought it are extremely satisfied with the various features it offers.

• Customer segmentation is possible, but a very clear classification is not, as there are many features that are equally liked across all the clusters.

• Buyers can, of course, be classified in terms of their satisfaction level.


APPENDIX

CODES

#WEB CRAWLING & WORD CLOUD

#install.packages("RCurl")

library(RCurl)

#install.packages("XML")

library(XML)

#install.packages("rvest")

library(rvest)

library(wordcloud)

library(tm)

FLIPKART<-"http://www.flipkart.com/samsung-galaxy-core-18262/product-reviews/ITMDV6F6KYTTPGU4"

d = getURL(FLIPKART)

doc=htmlParse(d)

list=getNodeSet(doc,"//a")

list_href=sapply(list,function(x)xmlGetAttr(x,"href"))

page_link=grep("start=",list_href)

page_links<-list_href[page_link]

page_links<-unique(page_links)

crawl_candidate<-"start="

base="http://www.flipkart.com"

num<-10

doclist=list()

anchorlist=vector()

j=0

while(j<num)

{

if(j==0)

{

doclist[j+1]<-getURL(FLIPKART)


}

else

{

doclist[j+1]=getURL(paste(base,anchorlist[j+1],sep=""))

}

doc<-htmlParse(doclist[[j+1]])

anchor<-getNodeSet(doc,"//a")

anchor<-sapply(anchor,function(x)xmlGetAttr(x,"href"))

anchor<-anchor[grep(crawl_candidate,anchor)]

anchorlist=c(anchorlist,anchor)

anchorlist=unique(anchorlist)

j=j+1

}

reviews=c()

for(i in 1:10)

{

doc=htmlParse(doclist[[i]])

l=getNodeSet(doc,"//div/p/span[@class='review-text']")

l1=html_text(l)

#r=l1[nchar(l1)>200]

reviews=c(reviews,l1)

}

save(reviews,file="C:\\Users\\Koushik\\Desktop\\New folder\\Areviews.RData")

#install.packages("wordcloud")

library(wordcloud)

corpus=Corpus(VectorSource(reviews[1:100]))

corpus=tm_map(corpus,tolower)

corpus=tm_map(corpus,removePunctuation)


corpus=tm_map(corpus,removeNumbers)

corpus=tm_map(corpus,removeWords,stopwords("en"))

corpus=Corpus(VectorSource(corpus))

tdm=TermDocumentMatrix(corpus)

m=as.matrix(tdm)

v=sort(rowSums(m),decreasing=T)

d=data.frame(words=names(v),freq=v)

wordcloud(d$words,d$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)

#REVIEW RATINGS

reviews=c()

ratings=c()

missingRating=data.frame(Page=0,missing=0)

for(i in 1:10){

doc=htmlParse(doclist[[i]])

l=getNodeSet(doc,"//div/p/span")

rateNodes=getNodeSet(doc,"//div[@class='fk-stars']")

rates=sapply(rateNodes,function(x)xmlGetAttr(x,"title"))

ratings=c(ratings,rates)

l1=html_text(l)

reviews=c(reviews,l1)

}

View(reviews)

View(ratings)

reviews100=reviews[1:100]

reviews100

ratings

rating=gsub(" star[s]?","",ratings)

rating=as.numeric(rating)

satisfaction=ifelse(rating>3,"satisfied","dissatisfied")


satisfaction

library(RTextTools) #provides create_matrix()

library(e1071) #provides svm()

dtmmobile=create_matrix(reviews100,removePunctuation=T,removeNumbers=T,weighting=weightTfIdf,stemWords=TRUE)

dtmmobile=as.matrix(dtmmobile)

data=as.data.frame(dtmmobile)

data=cbind(data,satisfaction)

#data1=na.omit(data)

data=data[,colSums(data[,-length(data)])>0]

View(data)

table(data$satisfaction)

svm=svm(satisfaction~.,data=data)

svm

#To get variable importance in prediction, SVM weights are evaluated as shown below

coef_imp=as.data.frame(t(svm$coefs)%*%svm$SV)

coef_imp1=data.frame(words=names(coef_imp),Importance=t(coef_imp))

coef_imp1=coef_imp1[order(coef_imp1$Importance),]

head(coef_imp1)

tail(coef_imp1)

View(coef_imp1)

#LSA & CLUSTERING

library(vegan)

#Note: "Rtools" is a Windows build toolchain, not a loadable R package, so

#install.packages("RTools")/library(RTools) would fail; the functions used

#below come from RTextTools and the other packages loaded here.

library(RTextTools)

library(mclust)

library(lsa)

library(cluster)

tdm=create_matrix(reviews,removeNumbers=T)

tdm_tfidf=weightTfIdf(tdm)

m=as.matrix(tdm)

m_tfidf=as.matrix(tdm_tfidf)


lsa_m=lsa(t(m),dimcalc_share(share=0.8))

lsa_m_tk=as.data.frame(lsa_m$tk)

lsa_m_dk=as.data.frame(lsa_m$dk)

lsa_mtfidf=lsa(t(m_tfidf),dimcalc_share(share=0.8))

k50=kmeans(scale(lsa_m$dk),centers=50,nstart=20)

centers50=aggregate(cbind(V1,V2,V3)~k50$cluster,data=as.data.frame(lsa_m$dk),FUN=mean)

d=dist(centers50[,-1])

hc=hclust(d,method="ward.D")

plot(hc,hang=-1)

rect.hclust(hc,h=0.3)

rect.hclust(hc,h=0.4,border="blue")

rect.hclust(hc,h=1.0,border="cyan")

rect.hclust(hc,h=1.25,border="green")

rect.hclust(hc,h=1.7,border="black")

#As per the plot, optimum values could be either 3,4 or 5

k3=kmeans(scale(lsa_m$tk),centers=3,nstart=20)

centers3=aggregate(cbind(V1,V2,V3)~k3$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)

k4=kmeans(scale(lsa_m$tk),centers=4,nstart=20)

centers4=aggregate(cbind(V1,V2,V3)~k4$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)

k5=kmeans(scale(lsa_m$tk),centers=5,nstart=20)

centers5=aggregate(cbind(V1,V2,V3)~k5$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)

lsa_tk=lsa_m$tk

v=sort(colSums(m),decreasing=T)

wordFreq=data.frame(words=names(v),freq=v)

k5_1=wordFreq[k5$cluster==1,]

k5_2=wordFreq[k5$cluster==2,]

k5_3=wordFreq[k5$cluster==3,]

k5_4=wordFreq[k5$cluster==4,]

k5_5=wordFreq[k5$cluster==5,]


lsa_dk=as.data.frame(lsa_m$dk)

lsa_dk3=data.frame(words=rownames(lsa_dk),lsa_dk[,1:3])

plot(lsa_dk3$V1,lsa_dk3$V2)

text(lsa_dk3$V1,lsa_dk3$V2,label=lsa_dk3$words)

k50=kmeans(scale(lsa_m$tk),centers=50,nstart=20)

centers50=aggregate(cbind(V1,V2,V3)~k50$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)

lsa_tk3=data.frame(words=rownames(lsa_tk),lsa_tk[,1:3])

plot(lsa_tk3$X1,lsa_tk3$X2)

text(lsa_tk3$X1,lsa_tk3$X2,label=lsa_tk3$words)

plot(lsa_tk3$X2,lsa_tk3$X3)

text(lsa_tk3$X2,lsa_tk3$X3,label=lsa_tk3$words)

plot(lsa_tk3$X3,lsa_tk3$X1)

text(lsa_tk3$X3,lsa_tk3$X1,label=lsa_tk3$words)

#SENTIMENT ANALYSIS

library(qdap) #provides sent_detect() and polarity()

data1=data

satisfaction1=as.data.frame(satisfaction)

for(i in 1:100)

{

sent=sent_detect(reviews[i])

pol=polarity(sent)

data1$polarity[i]=pol$group$stan.mean.polarity

satisfaction1$polarity_val[i]=pol$group$stan.mean.polarity

if(is.na(satisfaction1$polarity_val[i]))

{satisfaction1$polarity_val[i]=pol$group$ave.polarity

data1$polarity[i]=pol$group$ave.polarity}


}

new_rate=cbind(rating,satisfaction1)

aggregate(polarity_val~rating,data=new_rate,FUN=mean)

tree=party::ctree(satisfaction~polarity_val, data=new_rate)

plot(tree)

new_rate$status=ifelse(new_rate$polarity_val>0.385,"Satisfied","Dissatisfied")

count_status1=as.data.frame(table(new_rate$status))

View(count_status1)