Page 1: Project McNulty

Viral or Bust! Popularity Classification on News and Entertainment Media

Page 2: Project McNulty

The Data:

- 40,000+ articles scraped from mashable.com

- Scraped and pre-processed with attention to the linguistic features of each article

- 56 resulting features to consider

Page 3: Project McNulty

The Data:

The 56 features fall into these groups:

- Words

- NLP

- Publication Time

- Digital Media Aspects

Page 4: Project McNulty

Goal: Create a model that will distinguish between popular and unpopular news

Page 5: Project McNulty

Exploring the Data:

Page 6: Project McNulty

Exploring the Data: Rate of +/- Words

Page 7: Project McNulty

Exploring the Data: +/- Polarity

Page 8: Project McNulty

Exploring the Data: Global Subjectivity

Page 9: Project McNulty

Exploring the Data: Self-reference Links

Page 10: Project McNulty

Exploring the Data: LDA Rank

Page 11: Project McNulty

Initial Analysis:

Model         Accuracy   Precision  Recall     F1
kNN           0.566000   0.594047   0.590866   0.592452
Naive Bayes   0.479654   0.623277   0.064094   0.116236
RandomForest  0.608804   0.640564   0.694331   0.666364
LogisticReg   0.591984   0.617579   0.668346   0.641960
SVC           0.533967   0.533928   1.000000   0.697104
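
For reference, the four columns can all be derived from confusion-matrix counts. The sketch below is not from the slides: the function name and the counts are hypothetical, chosen to show what a classifier that labels every article as popular would score. A recall of exactly 1.0 with accuracy roughly equal to precision, as in the SVC row, is consistent with that behaviour, though the slides do not confirm it.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (not from the slides): 1000 articles, 534 of them
# popular, and a classifier that predicts "popular" for every article.
always_positive = classification_metrics(tp=534, fp=466, fn=0, tn=0)

for name, value in always_positive.items():
    print(f"{name:10s} {value:.6f}")
```

In this degenerate case accuracy and precision both equal the share of popular articles, so a high F1 on its own says little about whether the model separates the classes.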

Page 12: Project McNulty

Feature Reduction

● Principal Component Analysis to find distribution of variance in the data
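
A minimal sketch of how the distribution of variance might be inspected with PCA, assuming scikit-learn. The data here is a synthetic stand-in (six features driven by two hidden factors), not the Mashable features, and standardizing before PCA is an assumption rather than something the slides state.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the article features: 6 observed columns that are
# mixtures of 2 hidden factors plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Standardize so that no feature dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_   # share of variance per component
cumulative = np.cumsum(ratios)           # running total across components

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

A steep early rise in the cumulative column suggests that a few directions carry most of the variance; a slow, even rise suggests the features are not very redundant.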

Page 13: Project McNulty

Feature Reduction

● Eliminated features below the variance threshold (0.8)

● Ran randomized search CV and grid search CV on Random Forest to find the ideal parameters

● Ran grid search CV with additional specified parameters and graphed the features by importance
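
The three steps above could be sketched as follows, assuming scikit-learn. The data is synthetic, the parameter grids are hypothetical, and reading ".8" as a raw variance cutoff for `VarianceThreshold` is an assumption; the slides do not give the actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data: four high-variance columns, two near-constant ones.
scales = np.array([2.0, 2.0, 2.0, 2.0, 0.1, 0.1])
X = rng.normal(size=(300, 6)) * scales
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on columns 0 and 1

# Step 1: drop features whose variance is below 0.8 (assumed cutoff meaning).
selector = VarianceThreshold(threshold=0.8)
X_kept = selector.fit_transform(X)
kept_idx = selector.get_support(indices=True)

# Step 2: coarse randomized search over a hypothetical parameter space.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [20, 40, 80],
        "max_depth": [3, 6, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5, cv=3, random_state=0,
)
coarse.fit(X_kept, y)

# Step 3: grid search over additional parameters, keeping the coarse winners.
fine = GridSearchCV(
    RandomForestClassifier(random_state=0, **coarse.best_params_),
    param_grid={"max_features": ["sqrt", None],
                "criterion": ["gini", "entropy"]},
    cv=3,
)
fine.fit(X_kept, y)

# Rank the surviving features by importance, using original column indices.
importances = fine.best_estimator_.feature_importances_
ranking = [int(kept_idx[i]) for i in np.argsort(importances)[::-1]]
print("kept columns:", list(kept_idx))
print("ranked by importance:", ranking)
```

Running the randomized search first keeps the exhaustive grid small, which matters when each candidate is refit once per cross-validation fold.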

Page 14: Project McNulty

Most Important Features

Rank Feature

1 Average Keyword Score

2 Data Channel is Entertainment

3 Closeness to LDA topic 2

4 Average Token Length

5 Published on Weekend

6 Closeness to LDA topic 4

7 Data Channel is Technology

8 Max Keyword Score

9 Data Channel is World

Page 15: Project McNulty

Final Results:

Model         Accuracy   Precision  Recall     F1
kNN           0.562236   0.581848   0.591262   0.566142
Naive Bayes   0.523288   0.660920   0.140122   0.231222
RandomForest  0.662240   0.662520   0.695117   0.668421
LogisticReg   0.614035   0.638523   0.566057   0.600111
SVC           0.531645   0.532263   1.000000   0.697104

Page 16: Project McNulty

Final Results and Findings:

Small but consistent gain in accuracy:

● Data well-processed

● Correlation between features is minimal

Page 17: Project McNulty

Conclusion and Next Steps

● Despite the difficulty of separating the two classes, the selected model (Random Forest) performed fairly well, reaching about 0.66 accuracy

● In the future, would like to rely less on sentiment analysis and focus on word vector correlations