Paper Abstracts Matter... But How much?


Transcript of Paper Abstracts Matter... But How much?

Page 1: Paper Abstracts Matter... But How much?

Fletcher Series. 2016 Aug 26;1(1-10)

Abstracts Matter. But... How much so?

Rascon CA¹

¹ [email protected], San Francisco CA, 94105, USA.

Abstract

The number of times a scientific paper is cited (citation count) has emerged as a proxy for a paper’s success within its field. Here, I aim to address how relevant an abstract is to a scientific publication, and furthermore which features of an abstract have the largest impact on a paper’s success (as estimated by citation count). The data set comprised all abstracts of scientific papers from 22 top biotech journals published in the period 1995-2016, a total of 310,175 papers. Journal names and the affiliations of the heads of laboratories were not incorporated in this model, which aimed to be based solely on the title and content of each abstract. Data cleaning and feature engineering, largely relying on NLP metrics (LSA, Tf-idf, POS tagger), gave good insight into what best predicts citation count across the

Page 2: Paper Abstracts Matter... But How much?

Biotech papers have a steady trending curve

Figure 1. Number of citations per paper by year of publication. After cleaning, the corpus comprises 202,173 abstracts. Each cyan dot represents a single paper (transparency 0.3).

Page 3: Paper Abstracts Matter... But How much?

A journal’s prestige depends on its impact factor

Figure 2. Journals used for the data set and the number of citations per paper published between 1995 and 2010, shown as a violin plot. These differences reflect, to some extent, each journal’s impact factor (the yearly average number of citations).

Page 4: Paper Abstracts Matter... But How much?

Figure 3. Final set of 134,374 papers (1995-2010). The total number of citations per paper (the target, y) was binned into two classes: under 10, or 10 or more, total citations since the paper’s publication date (0 or 1, respectively).

(Left side: example of an abstract and its citation count.)

Abstracts binned in two classes:

0 for 1-9 (25%), or 1 for 10 or more (75%) total citations
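The binning rule above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the author's code: the function name `bin_citations` and the example counts are hypothetical.

```python
from collections import Counter


def bin_citations(citation_counts, threshold=10):
    """Label each paper 1 if it has at least `threshold` citations, else 0."""
    return [1 if count >= threshold else 0 for count in citation_counts]


# Hypothetical citation counts, for illustration only (not from the data set).
example_counts = [1, 9, 10, 250]
labels = bin_citations(example_counts)

# Share of papers in each class, comparable to the 25% / 75% split quoted above.
class_share = {label: n / len(labels) for label, n in Counter(labels).items()}
print(labels, class_share)
```

With roughly three quarters of the papers in class 1, a model that always predicts 1 would already be about 75% accurate, which is worth keeping in mind when reading the accuracy figures later in the deck.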

Page 5: Paper Abstracts Matter... But How much?

LSA, Tf-idf, and POS tagging selected as star features, with Random Forests as the model of choice

Figure 4. ROC and Precision/Recall curves for the top performing models.
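For readers unfamiliar with the metrics behind these curves, here is a minimal standard-library sketch of how the area under the ROC curve and a single precision/recall point can be computed from predicted scores. The labels and scores are made up for illustration, and the function names are hypothetical; the slides do not say how the curves were produced.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting class 1 for scores >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
precision, recall = precision_recall(y_true, scores, threshold=0.5)
print(auc, precision, recall)
```

Sweeping `threshold` over all observed scores and plotting the resulting pairs traces out the precision/recall curve; the pairwise AUC loop is quadratic and is meant for clarity, not for a corpus of this size.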

Page 6: Paper Abstracts Matter... But How much?

Modeling on the previous 5 years (2005-2009) to predict the ‘success’ of 2010 papers:

Figure 5. ROC and Precision/Recall curves for the top performing models, this time trained on 2005-2009 papers to predict the ‘success’ of 2010 papers.
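The train-on-past, test-on-future setup described here can be sketched as a split by publication year rather than a random split. The record layout (`year`, `label`) and the function name `temporal_split` are assumptions made for this example, not taken from the slides.

```python
def temporal_split(papers, train_years, test_year):
    """Split papers by publication year: train on earlier years, test on a later one."""
    train = [p for p in papers if p["year"] in train_years]
    test = [p for p in papers if p["year"] == test_year]
    return train, test


# Hypothetical records, for illustration only.
papers = [
    {"year": 2004, "label": 1},
    {"year": 2005, "label": 0},
    {"year": 2007, "label": 1},
    {"year": 2009, "label": 1},
    {"year": 2010, "label": 0},
    {"year": 2010, "label": 1},
]

# range(2005, 2010) covers 2005 through 2009 inclusive.
train, test = temporal_split(papers, range(2005, 2010), 2010)
print(len(train), len(test))
```

Splitting by year keeps information from 2010 out of training, which a random shuffle across all years would not guarantee.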

Page 7: Paper Abstracts Matter... But How much?

Features identified as important by RF for predicting the success of the coming year’s papers:

Figure 6. Feature importances as ranked by Random Forests, for a model trained on 2005-2009 and tested on 2010 papers. * Abstract LSA (100 comp.); ** Abstract LSA on Tf-idf (100 comp.); *** Title LSA.

[Figure 6 bar labels, in extracted order: C2 **, C2 *, C4 *, C7 **, C4 **, POS tag ‘:’, C8 **, C5 **, Abstract length, C3 **, C1 *, C31 ***, C15 **, C15 *, C14 *, C16 **, C3 *, C6 *, POS tag ‘.’, C29 **]

1st – Next Generation Sequencing (sequenc: 0.20, method: 0.17, data: 0.16, genom: 0.16, avail: 0.14)

2nd – Cellular regulation / gene expression (cell: 0.71, activ: 0.19, induc: 0.08, regul: 0.08, mice: 0.07)

3rd – Cellular models (methods) (cell: 0.28, use: 0.23, data: 0.19, method: 0.17, model: 0.16)

4th – Applied genomics (mutants) (genom: 0.25, sequenc: 0.25, protein: 0.19, mutant: 0.12, human: 0.11)

5th – Basic research (DNA related) (gene: 0.28, dna: 0.27, rna: 0.20, transcript: 0.20, genom: 0.17)
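The ** components above come from LSA applied to Tf-idf weights. As a rough illustration of the Tf-idf step only, here is a standard-library sketch using the plain `tf * log(N / df)` weighting. The exact weighting variant, tokenizer, and stemmer used in the deck are not stated, so everything here (function name, toy documents, whitespace tokenization, no stemming) is an assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, for illustration only (not abstracts from the corpus).
docs = [
    "cell gene expression",
    "genome sequencing method",
    "cell genome data",
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "gene" appears in one document and "cell" in two, so "gene" gets the higher weight in the first document. LSA would then factor the resulting document-term matrix with a truncated SVD and keep the leading components (100 per the Figure 6 caption), whose largest term loadings are what the topic lists above report.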

Page 8: Paper Abstracts Matter... But How much?

Abstracts matter about:

81%

Need to consider:

Are better scientists simply better communicators? Or... are great scientists also really good at communicating?

I did not incorporate a feature to account for novelty. (Quite the opposite.)

It is circular to say that the more papers exist in a field, the more likely a paper in that field is to be cited in the future. However, this suggests that trends exist in academia. *duh*

Page 9: Paper Abstracts Matter... But How much?

Abstracts matter about:

81%

Future directions:

Multi-class case

Extend prediction forecast window. 2017??

Examine those abstracts in which the model did poorly.

Flask app to ‘score’ new abstracts.

Time series: model topic trends over time. Is it too early or too late for a paper to come out?