Viral or Bust! Popularity Classification on News and Entertainment Media
The Data:
-40,000+ articles scraped from mashable.com
-Scraped and pre-processed with attention to linguistic features of each article
-56 resulting features to consider
The Data:
The 56 features fall into these groups:
-Words
-NLP
-Publication Time
-Digital Media Aspects
Goal: Create a model that distinguishes between popular and unpopular news articles
Exploring the Data:
Exploring the Data: Rate of +/- Words
Exploring the Data: +/- Polarity
Exploring the Data: Global Subjectivity
Exploring the Data: Self-reference Links
Exploring the Data: LDA Rank
Initial Analysis:
Model         Accuracy  Precision  Recall    F1
kNN           0.566000  0.594047   0.590866  0.592452
Naive Bayes   0.479654  0.623277   0.064094  0.116236
RandomForest  0.608804  0.640564   0.694331  0.666364
LogisticReg   0.591984  0.617579   0.668346  0.641960
SVC           0.533967  0.533928   1.000000  0.697104
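The four columns above can all be derived from the counts of true/false positives and negatives. The sketch below is illustrative only: the labels are toy values, not the Mashable data, and `classification_metrics` is a hypothetical helper name. It also shows that a classifier which labels every article "popular" gets a recall of 1.0 with precision equal to accuracy, which is the pattern the SVC row is consistent with.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = popular)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (assumption, not the real data): 6 popular, 4 unpopular articles.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# A degenerate classifier that predicts "popular" for everything.
always_popular = [1] * len(y_true)

metrics = classification_metrics(y_true, always_popular)
print(metrics)
```

With recall pinned at 1.0, F1 can look competitive even though the model separates nothing, so accuracy and precision are the more informative columns for that row.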
Feature Reduction
● Principal Component Analysis to find distribution of variance in the data
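A minimal sketch of how PCA exposes the distribution of variance, assuming standardized features. The matrix here is synthetic stand-in data, not the article features, and PCA is computed directly with NumPy's SVD rather than whatever library the original analysis used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (assumption): 200 rows, 5 features.
X = rng.normal(size=(200, 5))
# Make one feature strongly correlated with another so variance concentrates.
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Standardize so each feature contributes on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: squared singular values are proportional to component variances.
_, singular_values, _ = np.linalg.svd(X_std, full_matrices=False)
explained_variance = singular_values ** 2 / (X_std.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(explained_variance_ratio)

for i, (ratio, total) in enumerate(zip(explained_variance_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.3f} of variance, {total:.3f} cumulative")
```

A cumulative curve that climbs slowly suggests the variance is spread across many components, so few features can be dropped without losing information.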
Feature Reduction
● Eliminated features below variance threshold (.8)
● Ran randomized search and grid search cross-validation on Random Forest to find the ideal parameters
● Ran grid search CV again with additional specified parameters and graphed the features by importance
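The steps above could be sketched with scikit-learn roughly as follows. This is an assumption-laden illustration, not the original pipeline: the data is synthetic, the parameter grids are made up, and the .8 threshold is assumed to be passed directly to `VarianceThreshold`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
n = 300

# Synthetic data (assumption): five high-variance columns, one low-variance column.
X = rng.normal(scale=2.0, size=(n, 6))
X[:, 5] = rng.normal(scale=0.1, size=n)
# The label depends mostly on column 0, partly on column 1.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Step 1: drop features whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.8)
X_reduced = selector.fit_transform(X)

# Step 2: coarse randomized search over illustrative Random Forest parameters.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [20, 50, 100], "max_depth": [2, 4, 8, None]},
    n_iter=4,
    cv=3,
    random_state=0,
)
random_search.fit(X_reduced, y)

# Step 3: finer grid search around the best coarse setting, with one extra parameter.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [random_search.best_params_["n_estimators"]],
        "max_depth": [random_search.best_params_["max_depth"]],
        "min_samples_leaf": [1, 3, 5],
    },
    cv=3,
)
grid_search.fit(X_reduced, y)

# Step 4: rank the surviving features by importance (indices refer to X_reduced).
importances = grid_search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]
print("best params:", grid_search.best_params_)
print("features by importance:", ranking.tolist())
```

Running the cheap randomized search first and the exhaustive grid second keeps the total number of fits small while still refining the best region.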
Most Important Features
Rank Feature
1 Average Keyword Score
2 Data Channel is Entertainment
3 Closeness to LDA topic 2
4 Average Token Length
5 Published on Weekend
6 Closeness to LDA topic 4
7 Data Channel is Technology
8 Max Keyword Score
9 Data Channel is World
Final Results:
Model         Accuracy  Precision  Recall    F1
kNN           0.562236  0.581848   0.591262  0.566142
Naive Bayes   0.523288  0.660920   0.140122  0.231222
RandomForest  0.662240  0.662520   0.695117  0.668421
LogisticReg   0.614035  0.638523   0.566057  0.600111
SVC           0.531645  0.532263   1.000000  0.697104
Final Results and Findings:
Small but consistent gain in accuracy:
● Data well-processed
● Correlation between features is minimal
Conclusion and Next Steps
● Despite the difficulty of separating the two classes, the selected model (Random Forest) performed fairly well, reaching about 0.66 accuracy
● In the future, we would like to rely less on sentiment analysis and focus more on word vector correlations