
Introduction

The Internet holds a wealth of new information, which is increasingly taking the form of user-generated content. Many people want to know the general opinion on a given topic without having to read thousands of reviews and articles. Our goal is to find a technique that identifies the topics discussed in a group of documents, so that they can later be rated as positive or negative opinions. We chose to experiment on a dataset of digital camera reviews about a particular Leica camera. The dataset contains 146 reviews collected from various websites.

Methods

We created a three-phase process called Aldteran to extract features from a dataset of reviews.

1. Find Keywords

• Keep the largest entry from each column of W.

• Create a pool of the next 9 highest entries from each column and return a certain number of them as keywords.

• From the pool, select the words with the highest frequency ratio relative to general English.
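The keyword step can be sketched as follows. This is a minimal illustration, not the poster's implementation: W is taken to be the term-by-topic factor of the matrix factorization, and the function name `select_keywords`, the `n_extra` cutoff, the `floor` value, and both frequency tables are hypothetical, since the poster does not say how many pool words are returned or how general English frequency is measured.

```python
def select_keywords(W, vocab, corpus_freq, english_freq,
                    pool_size=9, n_extra=5, floor=1e-6):
    """Pick keywords from W (rows = terms, columns = topics).

    corpus_freq / english_freq map a word to its relative frequency in the
    reviews and in general English (both tables are assumptions here).
    """
    top, pool = [], []
    for j in range(len(W[0])):
        # Rank term indices by their weight in column j, largest first.
        ranked = sorted(range(len(W)), key=lambda i: W[i][j], reverse=True)
        top.append(vocab[ranked[0]])                           # always kept
        pool.extend(vocab[i] for i in ranked[1:1 + pool_size])  # candidates
    kept = set(top)

    def ratio(word):
        # How much more frequent the word is in the reviews than in English.
        return corpus_freq.get(word, 0.0) / english_freq.get(word, floor)

    candidates = sorted(set(pool) - kept, key=lambda w: (-ratio(w), w))
    # dict.fromkeys drops duplicate column winners while keeping their order.
    return list(dict.fromkeys(top)) + candidates[:n_extra]
```

Ranking the pool by a frequency ratio favors words such as "zoom" that are common in the reviews but rare in everyday English, and pushes down function words such as "the".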

2. Graph Keywords

• Each node is a keyword.

• Distances are determined by the WordNet Distance and Word Proximity.

• WordNet is a hyperlinked dictionary connecting related words.

• WordNet distance considers the shortest path and the number of short paths between words.

• Word Proximity is a measure of how far apart two words are, on average, in the dataset.
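As a rough illustration of the Word Proximity measure, the sketch below averages the smallest token gap between two words over the documents that contain both. The exact definition used in the poster is not given, so this formula, the whitespace tokenization, and the name `word_proximity` are assumptions. WordNet distance is not sketched here because it depends on the WordNet database.

```python
def word_proximity(docs, a, b):
    """Average, over documents containing both words, of the smallest
    token distance between an occurrence of `a` and an occurrence of `b`.
    Returns infinity when the two words never share a document.
    (Assumed definition; the poster does not spell one out.)
    """
    gaps = []
    for doc in docs:
        tokens = doc.lower().split()
        pos_a = [i for i, t in enumerate(tokens) if t == a]
        pos_b = [i for i, t in enumerate(tokens) if t == b]
        if pos_a and pos_b:
            gaps.append(min(abs(i - j) for i in pos_a for j in pos_b))
    return sum(gaps) / len(gaps) if gaps else float("inf")
```

For example, `word_proximity(["zoom lens", "great battery"], "lens", "zoom")` considers only the first document, where the two words are adjacent.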

These results were obtained by setting several parameters.

To create the graph weights, we combine the WordNet distance and the word proximity with equal weight.
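One way to combine two distance measures with equal weight is sketched below. Rescaling each measure to the range 0 to 1 before averaging is an assumption made here so that neither measure dominates; the poster does not say whether or how the measures were normalized. The name `combine_distances` and the dictionary layout are hypothetical.

```python
def combine_distances(wordnet_dist, proximity, w_wordnet=0.5):
    """Blend two distance tables keyed by (word, word) pairs.

    Both tables must cover the same pairs and hold finite, non-negative
    values. w_wordnet=0.5 gives the equal weighting described above.
    """
    def normalise(table):
        # Scale so that the largest distance becomes 1 (assumed step).
        largest = max(table.values())
        return {pair: (d / largest if largest else 0.0)
                for pair, d in table.items()}

    wn, px = normalise(wordnet_dist), normalise(proximity)
    return {pair: w_wordnet * wn[pair] + (1.0 - w_wordnet) * px[pair]
            for pair in wn}
```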

We set the number of clusters extracted from the graph to 15.

• Currently, the number of clusters is chosen by observing the outputs for various values, but we are optimistic this can be automated.

Conclusions

• Replace the Word Proximity measure with more sophisticated natural language processing.

• Automate selection of number of clusters.

• Apply dictionary-based stemming.

• Improve measure of general use of English.

• Automatically tag each cluster as one feature.

• Move to the next level: rate each cluster as positive or negative.

Acknowledgments

• Carl Meyer, Shaina Race, and Ralph Abbey for mentoring us

• North Carolina State University for organizing

• Funding by the National Science Foundation.

Patrick Moran, Jeffrey Salter, Bethany Herwaldt

College of Charleston, Albany State University, University of Notre Dame

3. Cluster Graph

• Each cluster should represent a topic.

• Any graph clustering algorithm may be used.

• We chose an SVD-based method with Principal Direction Divisive Partitioning and k-means.
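The divisive part of such a method can be sketched as below. This is only an illustration of Principal Direction Divisive Partitioning under stated assumptions: each keyword is represented by a numeric vector (for instance its row of the graph's weight matrix), the principal direction is found by power iteration rather than a full SVD, and the k-means stage mentioned above is omitted. All function names are hypothetical.

```python
import math
import random

def principal_direction(rows, iters=100, seed=0):
    """Centre the rows and return (centred rows, leading right singular
    vector), the latter approximated by power iteration on X^T X."""
    dim = len(rows[0])
    mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    centred = [[r[d] - mean[d] for d in range(dim)] for r in rows]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(iters):
        proj = [sum(c[d] * v[d] for d in range(dim)) for c in centred]  # X v
        v_new = [sum(p * c[d] for p, c in zip(proj, centred))
                 for d in range(dim)]                                   # X^T X v
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:
            break  # all points identical: no direction to find
        v = [x / norm for x in v_new]
    return centred, v

def scatter(rows, idx):
    """Sum of squared distances of the indexed rows to their centroid."""
    dim = len(rows[0])
    mean = [sum(rows[i][d] for i in idx) / len(idx) for d in range(dim)]
    return sum((rows[i][d] - mean[d]) ** 2 for i in idx for d in range(dim))

def pddp(rows, k):
    """Repeatedly split the cluster with the largest scatter by the sign of
    its projection onto the principal direction, until k clusters exist."""
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        target = max(clusters, key=lambda idx: scatter(rows, idx))
        centred, v = principal_direction([rows[i] for i in target])
        left = [i for i, c in zip(target, centred)
                if sum(a * b for a, b in zip(c, v)) <= 0]
        right = [i for i in target if i not in left]
        if not left or not right:
            break  # nothing left to separate
        clusters.remove(target)
        clusters += [left, right]
    return clusters
```

Choosing the cluster with the largest scatter to split next is one common heuristic; other selection rules would fit the same loop.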

Results

15 clusters were identified, and we interpret each non-singleton cluster as a review topic.

The most optimistic entropy for this clustering is 0.79782, and the most pessimistic entropy is 1.2687. Both are clearly better than random placement, which has an expected entropy of 1.718.
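For readers unfamiliar with entropy as a clustering score, the sketch below computes one common variant: the size-weighted average of each cluster's label entropy, where lower values mean purer clusters. The poster does not state which definition, reference labels, or logarithm base it used, so the natural logarithm, the label dictionary, and the name `clustering_entropy` are all assumptions, and this code is not expected to reproduce the figures above.

```python
import math
from collections import Counter

def clustering_entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy (natural log).

    clusters: list of lists of items; labels: dict mapping each item to a
    reference topic (hypothetical ground truth, not part of the poster).
    """
    total = sum(len(members) for members in clusters)
    result = 0.0
    for members in clusters:
        counts = Counter(labels[m] for m in members)
        h = -sum((n / len(members)) * math.log(n / len(members))
                 for n in counts.values())
        result += len(members) / total * h
    return result
```

A clustering whose clusters each contain a single reference topic scores 0, and mixing topics within clusters raises the score.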

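Extracting topics from product reviews

[Figure: term-by-document matrix with one row per term (camera, lens, ...) and one column per review (Review #1, Review #2, ...), holding sample entries such as 5, 2, and 3, shown as approximately equal to the product W H.]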

• Create a term-by-document matrix and apply non-negative matrix factorization (NMF) to it.

• Under certain conditions, large entries in W tend to correspond to words that pertain to topics of our dataset.
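The two bullets above can be illustrated with a small pure-Python sketch that builds a term-by-document count matrix and factors it into non-negative W and H. The multiplicative update rule used here is one standard NMF algorithm and is an assumption, since the poster does not name the algorithm it used; the function names, whitespace tokenization, and raw counts (no weighting or stemming) are likewise illustrative choices.

```python
import random
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, A) where A[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    A = [[float(c[term]) for c in counts] for term in vocab]
    return vocab, A

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor A (m x n) into W (m x k) and H (k x n), all non-negative,
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H
```

Each column of W can then be read as a candidate topic, with its largest entries pointing at the terms most associated with that topic.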

It is possible for a machine to detect the topics of textual documents, and this detection can be done without human or domain-specific input. The methods we have presented are far from perfect, and the results they produce can be improved.

• image, images, color, quality, clarity

• lens, optics, decent, sturdy

• use, its

• delicate, shipping, raw, mode, ratio

• size, post, noise, flash, screen

• cannon, nikon, sony, packaging, mp

• feature, format, shoot, lightweight

• pictures, candid, landscapes

• digital, compact, complicate, swears

• options, menu, item, manual, settings, sensor, photos, photographer, worlds, shoots

• camera, cameras

• love, very, great, also, excellent, expensive

• everyday

• grandchildren

• aspect