Team : Priya Iyer Vaidy Venkat Sonali Sharma Mentor: Andy Schlaikjer
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
Twist: User Timeline Tweets Classifier
Goal
Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto classified tweets
Rationale
Twitter allows users to create custom Friend Lists based on the user handles.
Rationale (contd.)
Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.
Approach
Step 1: Data Collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification
Text Mining Process
Remove special characters
Tokenize
Remove redundant letters in words
Spell Check
Stemming
Language Identification
Remove Stop Words
Generate bigrams and change to lower case
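The last step in the list above, lower-casing and bigram generation, could look like the minimal sketch below. The function name `with_bigrams` and the underscore used to join the two words of a bigram are assumptions made for illustration; the slides do not say how the team represented bigrams.

```python
def with_bigrams(tokens):
    """Return lower-cased unigrams followed by adjacent-pair bigrams."""
    lowered = [t.lower() for t in tokens]
    # Pair every token with its right-hand neighbour.
    pairs = [f"{a}_{b}" for a, b in zip(lowered, lowered[1:])]
    return lowered + pairs


features = with_bigrams(["SF", "Giants", "amazing", "feel"])
print(features)
```

Keeping the unigrams alongside the bigrams is a design choice of this sketch: short tweets yield few bigrams, so dropping single terms would leave very sparse feature vectors.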
Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
Stopwords: SF Giants! amaazzzing feelin’!!!! \/ :D
Special chars: SF Giants amaazzzing feelin
Spell check: SF Giants amazing feeling
Stemming: SF Giants amazing feel me
Stopwords: SF Giants amazing feel
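The walkthrough above can be approximated with the standard library alone. The sketch below is not the team's code: the stop-word set and the spelling table are tiny stand-ins invented for this one tweet, the step order is simplified, and stemming is left out because it would normally be delegated to a real stemmer such as Porter's.

```python
import re

# Toy resources invented for this example; a real run would use full lists.
STOP_WORDS = {"go", "such", "an", "me"}
SPELLING = {"amaazing": "amazing", "feelin": "feeling"}


def clean_tweet(text):
    # Remove special characters: keep ASCII letters and whitespace only.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Tokenize on whitespace and drop one-letter leftovers such as "m" or "D".
    tokens = [t for t in letters_only.split() if len(t) > 1]
    # Remove redundant letters: squeeze runs of 3+ identical letters to one.
    tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]
    # Spell check via a lookup table (a stand-in for a real spell checker).
    tokens = [SPELLING.get(t.lower(), t) for t in tokens]
    # Remove stop words, comparing case-insensitively.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(clean_tweet("Go SF Giants! Such an amaazzzing feelin’!!!! \\m/ :D"))
# ['SF', 'Giants', 'amazing', 'feeling']
```

The result matches the "Spell check" stage of the slide; applying a stemmer afterwards would reduce "feeling" to "feel".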
Choice of ML technique
Logistic Regression Classifier
Reasons:
Most popular linear classification technique for text classification
Ability to handle multiple categories with ease
Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
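To illustrate how a logistic regression classifier can cover several categories, here is a small pure-Python sketch that trains one binary model per category (one-vs-rest) and picks the highest-scoring one. It does not reproduce LIBLINEAR's solver or API; the function names, learning rate, epoch count and toy data are all assumptions made for this example. Each tweet is given as a list of active feature indices, which corresponds to boolean term features.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(rows, targets, n_features, epochs=200, lr=0.5):
    """Fit one binary logistic regression with stochastic gradient ascent."""
    weights = [0.0] * (n_features + 1)  # weights[0] is the bias term
    for _ in range(epochs):
        for features, target in zip(rows, targets):
            z = weights[0] + sum(weights[i] for i in features)
            error = target - sigmoid(z)
            weights[0] += lr * error
            for i in features:  # boolean features: every active value is 1
                weights[i] += lr * error
    return weights


def train_one_vs_rest(rows, labels, n_features):
    """Train one 'this category vs. all others' model per category."""
    return {
        c: train_binary(rows, [1 if y == c else 0 for y in labels], n_features)
        for c in sorted(set(labels))
    }


def predict(models, features):
    """Return the category whose model gives the highest linear score."""
    def score(w):
        return w[0] + sum(w[i] for i in features if i < len(w))
    return max(models, key=lambda c: score(models[c]))


# Hypothetical data: feature indices 1-3 co-occur with category 1, 4-6 with 2.
rows = [[1, 2], [2, 3], [4, 5], [5, 6]]
labels = [1, 1, 2, 2]
models = train_one_vs_rest(rows, labels, n_features=6)
print(predict(models, [1, 3]), predict(models, [4, 6]))
```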
Creation of LIBLINEAR training input
SF Giants amazing feel
Indexing: SF-1 Giants-2 amazing-3 feel-4
Boolean: SF-1 (1) Giants-2 (1) amazing-3 (1) feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
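The indexing and boolean encoding shown above can be sketched in a few lines of Python. The helper names `build_vocabulary` and `to_liblinear_line` are hypothetical; the output follows the sparse `label index:value` layout of the slide, with indices written in ascending order.

```python
def build_vocabulary(documents):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for term in tokens:
            vocab.setdefault(term, len(vocab) + 1)
    return vocab


def to_liblinear_line(label, tokens, vocab):
    """Format one tweet as 'label index:1 ...' using boolean term features."""
    # A set removes repeated terms; sorting keeps the indices ascending.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


docs = [["SF", "Giants", "amazing", "feel"]]
vocab = build_vocabulary(docs)
line = to_liblinear_line(1, docs[0], vocab)
print(line)  # 1 1:1 2:1 3:1 4:1
```

Terms that are missing from the vocabulary are skipped, so the same function can encode test tweets against the vocabulary built from the training set.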
Demo
THANK YOU
Andy,
Marti &
The Twitter Team
Questions?
Data Collection Challenges – Backup Slides
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
Text Mining Challenges
Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit
Different words, same root
Playing, plays, playful - play
Solution: Stemming
Sample LIBLINEAR input format (Train)
LIBLINEAR output for a test file of 20 tweets
Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma separated values of the categories that each tweet belongs to
Accuracy here is 94%. Precision: 0.89 Recall: 0.89
Experiment with different kernels for a better accuracy
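For reference, accuracy, precision and recall of the kind reported above can be computed from the predicted and true category codes as in the sketch below. The slides do not say how precision and recall were averaged across the four categories, so macro-averaging is an assumption here, and the label lists are made-up data rather than the team's test file.

```python
def evaluate(gold, predicted):
    """Return accuracy plus macro-averaged precision and recall."""
    classes = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(gold)
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(g == c and p == c for g, p in pairs)
        predicted_c = sum(p == c for p in predicted)
        gold_c = sum(g == c for g in gold)
        # Guard against categories that were never predicted or never occur.
        precisions.append(true_pos / predicted_c if predicted_c else 0.0)
        recalls.append(true_pos / gold_c if gold_c else 0.0)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


# Hypothetical labels: 1=sports, 2=finance, 3=entertainment, 4=technology.
gold = [1, 1, 2, 2, 3, 3, 4, 4]
pred = [1, 1, 2, 3, 3, 3, 4, 4]
accuracy, precision, recall = evaluate(gold, pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```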
Summary: Data Source/Software/Tools
Category based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – lib SVM
Stemming – Porter’s Stemming
NLP Tool kit