Post on 06-May-2015
Tweet Classification
Mentor: Romil Bansal
Group No. 37: Manish Jindal (201305578), Trilok Sharma (201206527), Yash Shah (201101127)
Guided by : Dr. Vasudeva Varma
Problem Statement: To automatically classify tweets from Twitter into genres based on predefined Wikipedia categories.
Motivation:
o Twitter is a major social networking service with over 200 million tweets made every day.
o Twitter provides a list of Trending Topics in real time, but it is often hard to understand what these trending topics are about.
o It is therefore important to classify these topics into general categories with high accuracy for better information retrieval.
Data
Dataset:
o Input data is static or real-time data consisting of user tweets.
o Training dataset: fetched from Twitter with the twitter4j API.
Final Deliverable:
o It returns the list of all categories to which the input tweet belongs.
o It also reports the accuracy of the algorithm used for classifying tweets.
Categories
We considered the following categories for classifying Twitter data.
1) Business        5) Law         9) Politics
2) Education       6) Lifestyle   10) Sports
3) Entertainment   7) Nature      11) Technology
4) Health          8) Places
Concepts used for better performance
Outlier removal
To remove low-frequency and high-frequency words using the bag-of-words approach.
Stop word removal
To remove the most common words, such as the, is, at, which, and on.
Keyword stemming
To reduce inflected words to their stem, base or root form using Porter stemming.
Cleaning crawled data
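The cleaning, stop word removal and outlier removal steps above can be sketched in Python as follows. This is a minimal illustration, not the project's code: the stop-word list, the thresholds `min_count` and `max_fraction`, and the sample tweets are all assumptions. Porter stemming is left out because it needs an external library.

```python
import re
from collections import Counter

# Assumed stop-word list: the five words named on the slide plus a few extras.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "and", "for", "who"}

def clean(tweet):
    """Lowercase a tweet, drop URLs and @mentions, return alphabetic tokens."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z]+", tweet)

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def remove_outliers(docs, min_count=2, max_fraction=0.5):
    """Drop words found in fewer than min_count tweets or in more than
    max_fraction of all tweets (both thresholds are illustrative)."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    limit = max_fraction * len(docs)
    keep = {w for w, n in doc_freq.items() if min_count <= n <= limit}
    return [[w for w in doc if w in keep] for doc in docs]

# Hypothetical sample tweets.
tweets = [
    "The match is on at http://example.com @fan",
    "Cricket match today",
    "Cricket score and match report",
    "Budget vote today",
]
docs = [remove_stop_words(clean(t)) for t in tweets]
filtered = remove_outliers(docs)
```

Here "match" is dropped as a high-frequency word (3 of 4 tweets) and "score", "report", "budget" and "vote" as low-frequency words, leaving only "cricket" and "today".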
Other concepts used
Spelling correction
To correct spellings using the edit distance method.
Named Entity Recognition
For ranking the result categories and finding the most appropriate one.
Synonym form
If a feature (word) of the test query is not found as one of the dimensions in the feature space, then replace that word with its synonym. Done using WordNet.
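A rough sketch of how spelling correction and synonym replacement might map an unknown word onto a known feature. The edit distance is the standard Levenshtein distance. The `max_distance` cutoff, the function names, and the hand-written `synonyms` dictionary (a stand-in for a WordNet lookup) are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is
    within max_distance edits (the cutoff is an illustrative choice)."""
    if word in vocabulary or not vocabulary:
        return word
    # Ties on distance are broken alphabetically to keep the result stable.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word

def to_known_feature(word, vocabulary, synonyms):
    """Try the word itself, then its synonyms, then spelling correction.
    `synonyms` is a hypothetical dict standing in for WordNet."""
    if word in vocabulary:
        return word
    for candidate in synonyms.get(word, []):
        if candidate in vocabulary:
            return candidate
    return correct(word, vocabulary)
```

For example, `correct("helth", {"health", "wealth", "sports"})` picks "health", which is one edit away, while a word with no close match is returned unchanged.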
Tweet Classification Algorithms
We used 3 algorithms for classification:
1) Naïve Bayes based
2) SVM based (supervised)
3) Rule based
[Flow diagram: training and testing pipeline]
Training: Crawl Twitter data (tweets) -> Cleaning, stop word removal -> Remove outliers -> Extract features (unique word list) -> Create feature vector for each tweet -> Create index file of feature vectors -> Create model files
Testing: Test query/tweet -> Tweet cleaning, stop word removal -> Edit distance, WordNet (synonyms) -> Create feature vector for test tweet -> Create index file of feature vector -> Apply model files -> Apply Named Entity Recognition -> Rank result category -> Output category
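The "extract features" and "create feature vector" steps of the pipeline can be illustrated with a small bag-of-words sketch. The function names and sample tokens are hypothetical, and the project's actual index file format is not shown in the slides.

```python
def build_index(docs):
    """Map each unique word in the training tweets to a column index."""
    vocabulary = sorted({word for doc in docs for word in doc})
    return {word: i for i, word in enumerate(vocabulary)}

def to_vector(tokens, index):
    """Count vector over the indexed words; unseen words are ignored."""
    vector = [0] * len(index)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector

# Hypothetical, already cleaned training tweets.
training = [["cricket", "match", "score"], ["budget", "vote"], ["match", "report"]]
index = build_index(training)
test_vector = to_vector(["match", "match", "vote", "weather"], index)
```

The word "weather" has no dimension in the feature space, so it is ignored here. That is the case the edit distance and WordNet steps of the testing pipeline are meant to handle.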
Main idea for Supervised Learning
Assumption: the training set consists of instances of different classes cj, described as conjunctions of attribute values.
Task: classify a new instance d, based on a tuple of attribute values, into one of the classes cj ∈ C.
Key idea: assign the most probable class using a supervised learning algorithm.
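Method 1 : Bayes Classifier
Bayes rule states:
$P(c_j \mid d) = \dfrac{P(d \mid c_j)\,P(c_j)}{P(d)}$
where $P(d \mid c_j)$ is the likelihood, $P(c_j)$ is the prior and $P(d)$ is the normalization constant.
We used the “WEKA” library for machine learning in the Bayes classifier for our project.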
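To make the Bayes rule concrete, here is a small multinomial Naïve Bayes sketch in plain Python. It is not WEKA's implementation. The word-independence assumption, the add-one smoothing, the function names and the tiny training set are all assumptions made for illustration. The normalization constant P(d) is the same for every class, so it is dropped when comparing classes.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (tokens, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in labelled_docs:
        class_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the category maximizing log prior + sum of log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(class_counts):
        score = math.log(class_counts[category] / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Likelihood with add-one (Laplace) smoothing.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical training tweets, already cleaned and tokenized.
model = train_naive_bayes([
    (["match", "score", "team"], "Sports"),
    (["team", "win", "match"], "Sports"),
    (["election", "vote"], "Politics"),
    (["vote", "minister"], "Politics"),
])
```

With this toy model, `classify(["match", "win"], model)` gives "Sports" and `classify(["vote"], model)` gives "Politics".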
Method 2 : SVM Classifier (Support Vector Machine)
Given a new point x, we can score its projection onto the hyperplane normal, i.e., compute the score:
$w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
Decide the class based on whether the score is < or > 0.
Can set a confidence threshold t.
[Figure: score axis marked at -1, 0 and 1]
Score > t: yes
Score < -t: no
Else: don’t know
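The scoring rule and the three-way decision above can be written out directly. This sketch assumes a linear kernel and an already trained model. The support vectors, multipliers, labels and threshold below are made-up values, not output from the project.

```python
def svm_score(x, support_vectors, alphas, labels, b):
    """score = sum_i alpha_i * y_i * (x_i . x) + b, with a linear kernel."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))
    return sum(a * y * dot(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def decide(score, t=0.5):
    """Three-way decision with confidence threshold t."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"

# Hypothetical trained model: these values give w = [1, 1] and b = 0.
support_vectors = [[1, 1], [-1, -1]]
alphas = [0.5, 0.5]
labels = [1, -1]
b = 0.0
```

A point such as `[2, 1]` scores well above the threshold and is labelled "yes", while `[0.2, 0.1]` falls inside the band between -t and t and is left as "don't know".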
Multi-class SVM
Multi-class SVM Approaches
1-against-all
Each of the SVMs separates a single class from all remaining classes (Cortes and Vapnik, 1995).
1-against-1
Pair-wise. k(k-1)/2 SVMs are trained, where k = |Y| is the number of classes. Each SVM separates a pair of classes (Friedman, 1996).
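For the eleven categories used here, the two approaches need 11 and 11·10/2 = 55 binary SVMs respectively. The sketch below enumerates both sets of sub-problems. Combining the pairwise classifiers by majority vote is a common choice but an assumption here, since the slides do not say how the pairwise outputs are merged, and `pairwise_winner` is a hypothetical callable standing in for a trained binary SVM.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ["Business", "Education", "Entertainment", "Health", "Law",
              "Lifestyle", "Nature", "Places", "Politics", "Sports",
              "Technology"]

# 1-against-all: one binary problem per class (that class vs. all others).
one_vs_all = [(c, [o for o in CATEGORIES if o != c]) for c in CATEGORIES]

# 1-against-1: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(CATEGORIES, 2))

def predict_one_vs_one(pairwise_winner, categories):
    """Majority vote over all pairwise classifiers.
    pairwise_winner(a, b) must return either a or b."""
    votes = Counter(pairwise_winner(a, b)
                    for a, b in combinations(categories, 2))
    return votes.most_common(1)[0][0]
```

1-against-1 trains more classifiers, but each one sees only the tweets of two categories.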
Advantages of SVM
o High-dimensional input space
o Few irrelevant features (dense concept)
o Sparse document vectors (sparse instances)
o Text categorization problems are linearly separable
For linearly inseparable data we can use kernels to map the data into a high-dimensional space, so that it becomes linearly separable by a hyperplane.
Method 3 : Rule Based
We defined a set of rules to classify a tweet based on term frequency.
a. Extract the features of a tweet.
b. Count the term frequency of each feature; the category of the feature with the maximum term frequency across all the categories mentioned above is our first classification.
c. As this cannot be right every time, we also maintain a count of the categories into which the tweet's features fall; the category containing the most features is our next classification.
Example
Tweet = sachin is a good player, who eats apple and banana which is good for health.
Features: sachin, player, eats, apple, health, banana
Stop words: is, a, good, he, was, for, which, and, who
Classification:
Feature-category    Term frequency
sachin-sports       2000
player-sports       900
eating-health       500
apple-technology    1000
health-health       800
banana-health       700
Max term frequency: sachin
So our category is sports.
2nd approximation:
Most features lie in health (3 features), so our second approximation is health.
If both of these are the same category then we have only one category, i.e. if most features here had been in sports, we would have only one result, sports.
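The two rules can be checked against the worked example with a short sketch. The table values are copied from the example above. The function name and the dictionary layout are assumptions, and the real system would look term frequencies up in its training index instead of using a hard-coded table.

```python
from collections import Counter

# feature -> (category, term frequency), as listed in the example's table.
FEATURE_TABLE = {
    "sachin": ("sports", 2000),
    "player": ("sports", 900),
    "eating": ("health", 500),
    "apple": ("technology", 1000),
    "health": ("health", 800),
    "banana": ("health", 700),
}

def rule_based(features, table):
    """Return [first] or [first, second] classification for a tweet."""
    known = [f for f in features if f in table]
    if not known:
        return []
    # Rule b: category of the feature with the maximum term frequency.
    top_feature = max(known, key=lambda f: table[f][1])
    first = table[top_feature][0]
    # Rule c: category into which most of the tweet's features fall.
    votes = Counter(table[f][0] for f in known)
    second = votes.most_common(1)[0][0]
    return [first] if first == second else [first, second]

result = rule_based(
    ["sachin", "player", "eating", "apple", "health", "banana"],
    FEATURE_TABLE,
)
```

For the example tweet this yields sports first (from "sachin", term frequency 2000) and health second (three of the six features).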
Cross-validation (Accuracy)
Steps for k-fold cross-validation:
Step 1: split the data into k subsets of equal size.
Step 2: use each subset in turn for testing and the remainder for training.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
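The two steps and the final averaging can be sketched as below. This version does not shuffle or stratify the data and assumes k is no larger than the number of items. `train_fn` and `predict_fn` are hypothetical placeholders for whichever classifier is being evaluated.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k folds of near-equal size."""
    indices = list(range(n_items))
    # The first n_items % k folds get one extra item each.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(items, labels, k, train_fn, predict_fn):
    """Average the per-fold error rates into one overall error estimate."""
    errors = []
    for train, test in k_fold_splits(len(items), k):
        model = train_fn([items[i] for i in train],
                         [labels[i] for i in train])
        wrong = sum(predict_fn(model, items[i]) != labels[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / k
```

Accuracy, as reported in the results, is then one minus the averaged error.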
Accuracy Results (10 folds)
Accuracy of each algorithm in %
Category        SVM     Naïve Bayes   Rule
Business        86.6    81.44         98.30
Education       85.71   76.07         81.8
Entertainment   86.8    79.1          87.49
Health          95.67   84.62         90.93
Law             81.17   73.38         75.25
Lifestyle       93.27   89.71         82.42
Nature          87.0    78.64         84.24
Places          81.01   75.35         80.73
Politics        81.91   81.88         76.31
Sports          87.11   83.57         81.87
Technology      83.64   82.44         77.05
Unique features
o Worked on the latest crawled Twitter data using the twitter4j API.
o Worked on eleven different categories.
o Applied three different methods of supervised learning to classify tweets into different categories.
o Achieved high speed with accuracy in the range of 85 to 95%.
o Performed tweet cleaning, stemming and stop word removal.
o Used edit distance for spelling correction.
o Used Named Entity Recognition for ranking.
o Used WordNet for query expansion and synonym finding.
o Validated using 10-fold cross-validation.
Snapshot
Result
Accuracy
Thank You!