Project McNulty


  • Date posted: 22-Jan-2017

  • Viral or Bust! Popularity Classification on News and Entertainment Media

  • The Data:

    -40,000+ articles scraped from mashable.com

    -Scraped and pre-processed with attention to linguistic features of each article

    -56 resulting features to consider

  • The Data:

    The 56 features fall into four groups:

    -Words

    -NLP

    -Publication Time

    -Digital Media Aspects

  • Goal: Create a model that distinguishes between popular and unpopular news articles

  • Exploring the Data:

  • Exploring the Data: Rate of +/- Words

  • Exploring the Data: +/- Polarity

  • Exploring the Data: Global Subjectivity

  • Exploring the Data: Self-reference Links

  • Exploring the Data: LDA Rank

  • Initial Analysis:

    Model          Accuracy   Precision  Recall     F1
    kNN            0.566000   0.594047   0.590866   0.592452
    Naive Bayes    0.479654   0.623277   0.064094   0.116236
    RandomForest   0.608804   0.640564   0.694331   0.666364
    LogisticReg    0.591984   0.617579   0.668346   0.641960
    SVC            0.533967   0.533928   1.000000   0.697104
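The four columns follow the standard definitions of accuracy, precision, recall and F1. As a minimal sketch (the confusion counts below are made up for illustration and are not taken from the slides), this shows what happens when a model labels every article as popular: recall is 1.0 while accuracy and precision both collapse to the share of popular articles. The SVC row is consistent with a model that labels almost every article as popular.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set: 534 popular and 466 unpopular articles.
# A model that predicts "popular" for every article has no negatives at all,
# so fn = 0 and tn = 0.
always_popular = classification_metrics(tp=534, fp=466, fn=0, tn=0)

for name, value in always_popular.items():
    print(f"{name:<10}{value:.6f}")
```

A recall of 1.0 on its own says little about model quality; it has to be read together with precision and accuracy.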

  • Feature Reduction

    Principal Component Analysis to find distribution of variance in the data
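One way to inspect how variance is distributed is the cumulative explained-variance ratio of a fitted PCA. The sketch below uses scikit-learn on synthetic data as a stand-in for the article features; the data, the standardization step and the 0.80 cutoff are assumptions for illustration (the cutoff mirrors the .8 threshold mentioned on the next slide, though the slides do not say how it was applied).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the article feature matrix (500 rows, 10 features).
X = rng.normal(size=(500, 10))
# Make one column nearly a copy of another so the variance is unevenly spread.
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.80.
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)

for i, share in enumerate(cumulative, start=1):
    print(f"{i:>2} components: {share:.3f} of variance")
print("components needed for 80%:", n_components_80)
```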

  • Feature Reduction

    -Eliminated features below the variance threshold (0.8)

    -Ran randomized search and grid search CV on Random Forest to find the ideal parameters

    -Ran grid search CV with additional specified parameters and graphed the results by feature importance
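The search-then-rank workflow described on this slide can be sketched with scikit-learn's `RandomizedSearchCV`, `GridSearchCV` and the `feature_importances_` attribute of a fitted random forest. The dataset, feature names and parameter ranges below are hypothetical placeholders; the slides do not list the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic binary-classification data standing in for the article features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Step 1: randomized search samples a few combinations from a broad space.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [20, 50, 100],
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5, cv=3, scoring="accuracy", random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Step 2: grid search exhaustively checks a narrow grid around that result.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [best["n_estimators"]],
        "max_depth": [best["max_depth"]],
        "min_samples_leaf": sorted({1, best["min_samples_leaf"]}),
    },
    cv=3, scoring="accuracy",
)
grid_search.fit(X, y)

# Step 3: rank features by the importance scores of the refit best model.
importances = grid_search.best_estimator_.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank} {name} {score:.3f}")
```

Randomized search is cheap over a wide space, while grid search is exhaustive, so running them in this order keeps the total number of fits small.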

  • Most Important Features

    Rank Feature

    1 Average Keyword Score

    2 Data Channel is Entertainment

    3 Closeness to LDA topic 2

    4 Average Token Length

    5 Published on Weekend

    6 Closeness to LDA topic 4

    7 Data Channel is Technology

    8 Max Keyword Score

    9 Data Channel is World

  • Final Results:

    Model          Accuracy   Precision  Recall     F1
    kNN            0.562236   0.581848   0.591262   0.566142
    Naive Bayes    0.523288   0.660920   0.140122   0.231222
    RandomForest   0.662240   0.662520   0.695117   0.668421
    LogisticReg    0.614035   0.638523   0.566057   0.600111
    SVC            0.531645   0.532263   1.000000   0.697104

  • Final Results and Findings:

    Small but consistent gain in accuracy:

    -Data was well processed

    -Correlation between features is minimal

  • Conclusion and Next Steps

    Despite the difficulty of separating popular from unpopular articles, the selected model performed fairly well

    Future work: rely less on sentiment analysis and focus on word vector correlations