Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News...
Transcript of Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News...
Predicting Document Creation Timesin News Citation Networks
Andreas Spitz1, Jannik Strötgen2, and Michael Gertz1
April 23, 2018 — TempWeb 2018, Lyon
1 Database Systems Research Group 2 Bosch Center for Artificial IntelligenceHeidelberg University, Germany Germany
Hm, when did this happen again?
1
News Citation Networks
News Citation Network Extraction
2
News Citation Network Overview
News articles from RSS feeds:
I Politics and business feeds
I 34 English news outlets(USA, UK, AUS, CAN, GER, CHN, QAT)
I 2 years (Nov 2015 - Oct 2017)
I 244.6 thousand articles
I 367.2 thousand edges
Used data:
I Hyperlinks in the article body
I Publication dates
I Temporal expressions3
News Outlet Statistics (sample)
short news outlet days 〈articles〉 〈temp exp〉 otherin otherout
AT The Atlantic 334 7.2 10.5 16.7 50.6BBC British Bc. Corp. 730 8.1 6.5 19.1 8.0DW Deutsche Welle 334 1.2 6.1 48.1 5.9FOX Fox News 548 2.7 9.8 0.0 0.0NPR National Public Radio 334 0.4 8.4 63.6 58.5NY The New Yorker 548 3.0 13.2 33.5 30.6NYT New York Times 669 23.8 10.7 26.8 4.7SMH Sydney Morn. Herald 548 2.3 7.0 3.0 51.9WP Washington Post 548 62.7 9.4 13.7 5.1
4
Evolution of Network Metrics
clustering coefficient average path length
average degree undirected diameter
2016−01 2016−07 2017−01 2017−07 2016−01 2016−07 2017−01 2017−07
0
20
40
60
5
10
15
1
2
3
0.0
0.2
0.4
0.6
days
mea
sure
val
ue
network aggregated politics business
5
Exploring Citation Chains
6
Article Publication Time Prediction
Task Definition: Publication Time Prediction
7
Available News Citation Network Data
Predict article publication times from:
I Citation network topology
I Publication dates of adjacent articles
I Temporal expressions in adjacent articles
I Not the metadata of the article itself
I Not the article content
8
Available News Citation Network Data
Predict article publication times from:
I Citation network topology
I Publication dates of adjacent articles
I Temporal expressions in adjacent articles
I Not the metadata of the article itself
I Not the article content
8
Feature Extraction
Network Topology Features
Node degree-based features:
I Incoming degree
I Outgoing degree
I Undirected degree
Density-based features:
I Undirected local clustering coe�icient
Centrality-based features:
I Betweenness centrality
I Incoming closeness centrality
I Outgoing closeness centrality
I Page Rank centrality
9
Network Topology Features
Node degree-based features:
I Incoming degree
I Outgoing degree
I Undirected degree
Density-based features:
I Undirected local clustering coe�icient
Centrality-based features:
I Betweenness centrality
I Incoming closeness centrality
I Outgoing closeness centrality
I Page Rank centrality
9
Network Topology Features
Node degree-based features:
I Incoming degree
I Outgoing degree
I Undirected degree
Density-based features:
I Undirected local clustering coe�icient
Centrality-based features:
I Betweenness centrality
I Incoming closeness centrality
I Outgoing closeness centrality
I Page Rank centrality
9
Temporal Network Features
10
Temporal Expression Features
Correlation of temporal expressions:
I good with publication dates ofreferencing articles (incoming edges)
I bad with publication dates ofreferenced articles (outgoing edges)
11
Temporal Expression Features
Correlation of temporal expressions:
I good with publication dates ofreferencing articles (incoming edges)
I bad with publication dates ofreferenced articles (outgoing edges)
11
Missing Features and Imputation
Missing features
I 30.8% of feature values are missing
I 89.6% of articles are missing at least one feature
Imputation of missing values
I Column mean of the feature
12
Missing Features and Imputation
Missing features
I 30.8% of feature values are missing
I 89.6% of articles are missing at least one feature
Imputation of missing values
I Column mean of the feature
12
Evaluation
Regression Methods
Used regression methods:
I BASE: Baseline (average publication date of adjacent articles)
I LR: Linear regression
I BAY: Bayesian ridge regression (Laplace model)
I RF: Random forest
I GB: Gradient boosting (Laplace distribution, decision trees)
I SVM: Support vector machine (radial kernel)
I NN: Neural network (feedforward, one hidden layer)
13
Evaluation Results: Mean Absolute Error (days)
BASE LR BAY NN RF GB SVM
all 66.72 60.46 59.61 26.88 24.98 22.66 26.19in 88.88 66.48 87.55 34.03 32.25 27.49 32.29
out 87.32 59.54 40.24 32.52 30.10 26.68 30.77in+out 18.68 55.45 54.95 12.62 11.23 12.76 14.31
14
Distribution of Absolute Errors
out in+out
all in
BASE LR BAY NN RF GB SVM BASE LR BAY NN RF GB SVM
0
50
100
150
200
250
0
50
100
150
200
250
regression method
abso
lute
err
or (
days
)
method BASE LR BAY NN RF GB SVM
15
Recall by Varying Absolute Error
out in+out
all in
0 20 40 60 0 20 40 60
0
25
50
75
100
0
25
50
75
100
absolute error (days)
reca
ll (p
erce
ntag
e of
pre
dict
ions
< a
bsol
ute
erro
r)
method BASE LR BAY NN RF GB SVM
16
Feature Importance: Random Forest
●
●●●
●●
●
●
Feature importance: random forestm
ax (T
out)
min
(Tin)
µ (T ou
t)µ
(T in)
min
(Tou
t)m
ax (T
in)
max
(Xin)
µ (X
in) c pr
σ (T ou
t)σ
(Xin)
c cl,o
utsp
an (T
out)
σ (T in
)sp
an (T
in)
min
(Xin)
span
(Xin)
c cl,in
min
(Dis
t)de
g out
µ (D
ist)
deg in
deg al
lm
ax (D
ist)
c btw cc
σ (D
ist)
10−3
10−2
10−1
100
rela
tive
impo
rtan
ce
Feature type: ● network topology temporal expression temporal network
17
Feature Importance: Gradient Boosting
●
●
●
●
●
●●
●
Feature importance: gradient boostingm
ax (T
out)
min
(Tin)
deg ou
tµ
(T out)
min
(Dis
t)de
g in c prσ
(T out)
σ (T in
)µ
(T in)
deg al
lsp
an (T
in)
max
(Tin)
µ (X
in)
min
(Tou
t)µ
(Dis
t)c bt
wm
ax (X
in)
max
(Dis
t)sp
an (X
in)
σ (X
in)
span
(Tou
t)m
in (X
in)
c cl,o
utσ
(Dis
t)c cl
,in cc
10−5
10−4
10−3
10−2
10−1
100
rela
tive
impo
rtan
ce
Feature type: ● network topology temporal expression temporal network
18
Summary & Resources
Summary
News citation networks:
I Focus on anchored links inside the article body
I Constructed like a citation network between articles
Publication date prediction:
I Can be framed as a regression problem
I Average prediction error of 3 weeks
I Temporal network features are most discriminative
19
Resources
Data and implementation are available online:
I [data] News citation network (including URLs)
I [data] Temporal annotations
I [code] Publication date prediction
https://dbs.ifi.uni-heidelberg.de/resources/data/
20
Resources
Data and implementation are available online:
I [data] News citation network (including URLs)
I [data] Temporal annotations
I [code] Publication date prediction
https://dbs.ifi.uni-heidelberg.de/resources/data/
20