Project McNulty

Viral or Bust! Popularity Classification on News and Entertainment Media

The Data:

-40,000+ articles scraped from mashable.com

-Scraped and pre-processed with attention to linguistic features of each article

-56 resulting features to consider

The Data:

The 56 features fall into four groups:

-Words

-NLP

-Publication Time

-Digital Media Aspects

Goal: Create a model that distinguishes between popular and unpopular news articles

Exploring the Data:

Exploring the Data: Rate of +/- Words

Exploring the Data: +/- Polarity

Exploring the Data: Global Subjectivity

Exploring the Data: Self-reference Links

Exploring the Data: LDA Rank

Initial Analysis:

Model          Accuracy   Precision  Recall     F1
kNN            0.566000   0.594047   0.590866   0.592452
Naive Bayes    0.479654   0.623277   0.064094   0.116236
RandomForest   0.608804   0.640564   0.694331   0.666364
LogisticReg    0.591984   0.617579   0.668346   0.641960
SVC            0.533967   0.533928   1.000000   0.697104
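A comparison like the one above could be produced along the following lines. This is a minimal sketch, not the code behind the slides: it assumes scikit-learn, default hyperparameters, a single train/test split, and a synthetic dataset from `make_classification` standing in for the Mashable features. The helper name `score_models` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the scraped article features).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The five model families named in the table, with default settings (assumption).
models = {
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "LogisticReg": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}


def score_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return accuracy, precision, recall and F1 per model."""
    rows = {}
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        rows[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        }
    return rows


results = score_models(models, X_train, X_test, y_train, y_test)
print(f"{'Model':<14}{'Accuracy':>10}{'Precision':>11}{'Recall':>9}{'F1':>9}")
for name, m in results.items():
    print(f"{name:<14}{m['accuracy']:>10.4f}{m['precision']:>11.4f}"
          f"{m['recall']:>9.4f}{m['f1']:>9.4f}")
```

When reading such a table, a recall of exactly 1.0 combined with precision almost equal to accuracy (as in the SVC row) suggests the model is predicting the positive class for nearly every article, so its F1 score is not comparable to the others at face value.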

Feature Reduction

● Principal Component Analysis to find distribution of variance in the data
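One way to inspect how variance is distributed is to fit PCA and look at the cumulative explained-variance ratio. The sketch below assumes scikit-learn and standardized inputs, and uses synthetic data in place of the article features. The 90% cutoff is an illustrative choice, not a value taken from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the scraped article features).
X, _ = make_classification(n_samples=500, n_features=20, n_informative=6,
                           random_state=0)

# PCA is scale-sensitive, so standardize each feature first.
X_scaled = StandardScaler().fit_transform(X)

# Keep all components so the full variance distribution is visible.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90% (illustrative cutoff).
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

for i, share in enumerate(cumulative, start=1):
    print(f"first {i:2d} components: {share:.3f} of total variance")
print("components needed for 90%:", n_for_90)
```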

Feature Reduction

● Eliminated features below the variance threshold (0.8)

● Ran randomized search and grid search CV on Random Forest to find the ideal parameters

● Ran grid search CV with additional specified parameters and graphed the features by importance
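The steps above can be sketched with scikit-learn's `VarianceThreshold`, `RandomizedSearchCV` and `GridSearchCV`. The slides do not say how the 0.8 value was applied, so passing it directly as `threshold` is an assumption. The parameter ranges, the synthetic data and the two artificial low-variance columns are also illustrative only.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data plus two near-constant columns (assumption).
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(scale=0.1, size=(400, 2))])

# Step 1: drop features whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.8)
X_reduced = selector.fit_transform(X)
print("features kept:", X_reduced.shape[1], "of", X.shape[1])

# Step 2: a coarse randomized search over illustrative ranges.
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(20, 80),
                         "max_depth": randint(2, 10)},
    n_iter=5, cv=3, random_state=0,
)
rand_search.fit(X_reduced, y)
best = rand_search.best_params_

# Step 3: a finer grid search around the randomized-search result.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [best["n_estimators"]],
        "max_depth": [max(2, best["max_depth"] - 1),
                      best["max_depth"],
                      best["max_depth"] + 1],
    },
    cv=3,
)
grid.fit(X_reduced, y)
print("best parameters:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 4))
```

Running the randomized search first keeps the number of fits small, and the grid search then only has to cover a narrow neighborhood of the best candidate.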

Most Important Features

Rank  Feature
1     Average Keyword Score
2     Data Channel is Entertainment
3     Closeness to LDA topic 2
4     Average Token Length
5     Published on Weekend
6     Closeness to LDA topic 4
7     Data Channel is Technology
8     Max Keyword Score
9     Data Channel is World
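A ranking of this kind can be read off a fitted Random Forest through its `feature_importances_` attribute. The sketch below is an illustration under assumptions: the data is synthetic and the `feature_0` ... `feature_9` names are hypothetical placeholders, not the real column names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names and synthetic data (assumption).
feature_names = [f"feature_{i}" for i in range(10)]
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort feature indices from most to least important.
order = np.argsort(importances)[::-1]
ranking = [(rank, feature_names[i], float(importances[i]))
           for rank, i in enumerate(order, start=1)]

print(f"{'Rank':<6}{'Feature':<12}{'Importance':>10}")
for rank, name, importance in ranking[:9]:
    print(f"{rank:<6}{name:<12}{importance:>10.4f}")
```

These impurity-based importances sum to 1 across all features, so each value can be read as a share of the total.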

Final Results:

Model          Accuracy   Precision  Recall     F1
kNN            0.562236   0.581848   0.591262   0.566142
Naive Bayes    0.523288   0.660920   0.140122   0.231222
RandomForest   0.662240   0.662520   0.695117   0.668421
LogisticReg    0.614035   0.638523   0.566057   0.600111
SVC            0.531645   0.532263   1.000000   0.697104

Final Results and Findings:

Small but consistent gain in accuracy:

● Data well-processed

● Correlation between features is minimal
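A claim of minimal feature correlation can be checked by looking at the largest absolute off-diagonal entry of the Pearson correlation matrix. The following is a minimal sketch with NumPy on randomly generated, independent columns. It does not use the article features, and the variable names are hypothetical.

```python
import numpy as np

# Independent synthetic columns stand in for the real features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(X, rowvar=False)

# Ignore the diagonal, which is always 1, and take the largest absolute value.
off_diag = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
max_abs_corr = float(off_diag.max())

print("largest absolute pairwise correlation:", round(max_abs_corr, 4))
```

A small maximum indicates that no pair of features is strongly linearly related. It does not rule out nonlinear dependence.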

Conclusion and Next Steps

● Despite the difficulty of separating the two classes, the selected Random Forest model performed fairly well (accuracy 0.66)

● Future work: rely less on sentiment analysis and focus more on word-vector correlations