TweetsKB: A Public and Large-Scale RDF Corpus of Annotated...

25
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets Pavlos Fafalios 1 Vasileios Iosifidis 1 Eirini Ntoutsi 1 Stefan Dietze 1 1 L3S Research Center Leibniz Universit¨ at Hannover Hannover, Germany June 2018 Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 1/25

Transcript of TweetsKB: A Public and Large-Scale RDF Corpus of Annotated...

Page 1: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

TweetsKB: A Public and Large-Scale RDFCorpus of Annotated Tweets

Pavlos Fafalios1 Vasileios Iosifidis1 Eirini Ntoutsi1

Stefan Dietze1

1L3S Research CenterLeibniz Universitat Hannover

Hannover, Germany

June 2018

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 1/25

Page 2: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Micro-blogging for everyday life

I Micro-blogging services are extremely popular these days.

I Combination of blogging and instant messaging.

I Huge amount of data.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 2/25

Page 3: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

TwitterI The most well known micro-blogging platform.I Twitter’s generated data has been used in a wide area of

research:I data science, sociology, psychology and so on and so forth.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 3/25

Page 4: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Twitter

Figure: Number of tweets per month

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 4/25

Page 5: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Entity Relatedness ?

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 5/25

Page 6: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Opinions Overtime ?

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 6/25

Page 7: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Journalists manually analysed 14,000 tweets1 !

1https://www.nytimes.com/interactive/2016/12/06/upshot/how-to-know-what-donald-trump-really-cares-about-look-at-who-hes-insulting.html

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 7/25

Page 8: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Lack of large scale public social archives

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 8/25

Page 9: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

TweetsKB Overview

I Public RDF corpus of anonymized tweets.

I More than 1.5 billion English tweets.

I Spanning period: Jan-2013 till Nov-2017.

I Includes entities and sentiment annotations.I Ideal for:

I time-aware and entity-centric exploration.I data integration by existing knowledge base (like DBpedia).I multi-aspect and entity-centric analysis.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 9/25

Page 10: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Generating TweetsKB: FilteringI 6 billion tweets (Jan-13 - Nov-17).I Retweets, non-English tweets and spam2 tweets were removed.I We employ some of tweets’ metadata.

45,647

59,194

5,514,931

30,530,907

29,959,886

34,980,039

39,936,540

29,352,003

36,067,474

21,583,787

33,675,162

26,677,285

35,601,761

32,012,415

31,378,550

33,232,571

33,247,115

30,923,457

34,516,026

34,131,838

30,392,248

30,860,493

30,812,913

30,570,787

30,734,059

27,671,321

30,245,433

28,641,364

28,601,539

28,133,412

29,800,237

30,555,993

27,709,765

28,003,261

28,585,095

28,331,869

27,508,112

25,411,034

26,855,489

26,286,412

26,565,430

26,573,886

27,058,115

24,079,006

24,200,624

24,350,615

16,577,648

23,007,020

24,741,745

22,436,542

23,285,384

21,163,478

22,357,313

20,978,546

21,910,272

21,974,665

20,246,552

19,978,891

19,473,362

2013-01

2013-02

2013-03

2013-04

2013-05

2013-06

2013-07

2013-08

2013-09

2013-10

2013-11

2013-12

2014-01

2014-02

2014-03

2014-04

2014-05

2014-06

2014-07

2014-08

2014-09

2014-10

2014-11

2014-12

2015-01

2015-02

2015-03

2015-04

2015-05

2015-06

2015-07

2015-08

2015-09

2015-10

2015-11

2015-12

2016-01

2016-02

2016-03

2016-04

2016-05

2016-06

2016-07

2016-08

2016-09

2016-10

2016-11

2016-12

2017-01

2017-02

2017-03

2017-04

2017-05

2017-06

2017-07

2017-08

2017-09

2017-10

2017-11

Figure: TweetsKB: Number of tweets per month

2for spam detection HSpam14 dataset was used [Sedhai and Sun, 2015]Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 10/25

Page 11: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Entity Linking

I Yahoo’s FEL tool [Blanco et al., 2015].

I FEL has been specifically designed for short texts.

I 1.4 million distinct entities.I Evaluated on NEEL challenge dataset [Rizzo et al., 2016]:

I Dataset consists of 9,289 English tweets from 2011, 2013,2014 and 2015.

I Precisions = 85% , Recall = 39% and F1 = 54%

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 11/25

Page 12: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Sentiment Annotation

I For sentiment analysis we have used SentiStrength [Thelwallet al., 2012].

I SentiStrength assigns both positive and negative scores toshort texts [5, -5].

I Evaluated on TSentiment15 dataset [Iosifidis and Ntoutsi,2017]:

I Dataset consists of 2.5 million English tweets from 2015.I Accuracy = 91% , F1 = 80%

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 12/25

Page 13: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

TweetKB’s overall statistics

I TweetsKB is available as N3 files through Zenodo repo (DOI:10.5281/zenodo.573852)3

I It’s also registered at datahub.ckan.io4

Number of tweets 1,560,096,518

Number of distinct users 125,104,569

Number of distinct hashtags 40,815,854

Number of distinct user mentions 81,238,852

Number of distinct entities 1,428,236

Number of tweets with sentiment 772,044,599

Number of RDF triples 48,207,277,042

3https://zenodo.org/record/5738524https://datahub.ckan.io/dataset/tweetskb

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 13/25

Page 14: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Computational Cost

I Cluster: CPUs 504, RAM 6,784 GB, Disk 2.1 PB

I Sentiment annotation: 6M tweets per minute

I Entity Extraction: 4.8M tweets per minute

I Triplification: 14M tweets per minute

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 14/25

Page 15: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

RDF Schema Our schema follows established vocabularies such asSIOC, schema.org, and Onyx

schema:ShareAction

rdfs:Literal

rdfs:Literal nee:Entity

rdfs:Literalnee:confidence

nee:detectedAs

nee:hasMatchedURI

rdfs:Literal

sioc: http://rdfs.org/sioc/ns#sioc_t: http://rdfs.org/sioc/types#dc: http://purl.org/dc/terms/rdfs: http://www.w3.org/2000/01/rdf-schema#nee: http://www.ics.forth.gr/isl/oae/core#schema: http://schema.org/onyx: http://www.gsi.dit.upm.es/ontologies/onyx/nswna: http://www.gsi.dit.upm.es/ontologies/wnaffect/ns#

dc:created

sioc:Post

rdfs:Resource

sioc:id

rdfs:Literal

schema:mentions

sioc:UserAccountschema:mentions sioc:name

sioc_t:Tag rdfs:Literalrdfs:label

sioc:UserAccountsioc:has_creator

rdfs:Literal

sioc:id

rdfs:Literal

schema:mentionsonyx:hasEmotionSet

onyx:EmotionSet

onyx:Emotion

wna:positive-emotionwna:negative-emotion

onyx:hasEmotion

onyx:hasEmotionCategory

onyx:hasEmotionIntensity

rdfs:Literal

schema:InteractionCounter

schema:interactionStatistic

schema:LikeAction

schema:interactionType

schema:userInteractionCount

Figure: Employed RDF/S model

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 15/25

Page 16: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Tweets in our RDF/S model are associated with six types ofelements:

I metadata: sioc : Post → tweet, sioc : UserAccount → user

I entity mentions: for each entity, we store surface form, URI,confidence score (NEE).

I user mentions: sioc : UserAccount → user

I hashtag mentions: sioc t : Tag

I sentiment scores: onyx : EmotionSet

I interaction statistics:schema : LikeAction → favorite count,schema : ShareAction → retweet count,

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 16/25

Page 17: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Example

nee:Entitynee:confidence

nee:detectedAs

nee:hasMatchedURI

dc:created

sioc:Postsioc:id

schema:mentions

sioc:UserAccount

sioc:name

sioc_t:Tag

rdfs:label

sioc:has_creatorsioc:id

rdf:type

_:ent1

dbr:Roger_Federer

-1.54

“Federer”

rdf:type

_:usr8

rdf:type

“livetennis”

_:tag1 “usopen”

schema:mentions

schema:mentions

06.01.2012 06:40

“9565121266”

_:usr1“2356912”

12

0.75 0.0

_:stat1

schema:interactionStatistic

schema:LikeAction

schema:interactionType

schema:userInteractionCount

_:ems1

onyx:hasEmotionSet

_:em1 _:em2

onyx:hasEmotion

wna:positive-emotion wna:negative-emotion

_:tweet1

rdf:type

onyx:hasEmotionIntensity

onyx:hasEmotionCategory

Figure: Tweet example

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 17/25

Page 18: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Analysis of the top 100K entities and hashtags

Figure: Distribution of top-100,000 entities.

Figure: Distribution of top-100,000 hashtags.Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 18/25

Page 19: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Analysis of the top 100K entities

DBpedia typeNumber of

distinct entities

http://dbpedia.org/ontology/Person 21,139 (21.1%)http://dbpedia.org/ontology/Organisation 14,815 (14.8%)http://dbpedia.org/ontology/Location 8,215 (8,2%)http://dbpedia.org/ontology/Athlete 5,192 (5.2%)http://dbpedia.org/ontology/Artist 3,737 (3.7%)http://dbpedia.org/ontology/City 2,563 (2.6%)http://dbpedia.org/ontology/Event 510 (0.5%)http://dbpedia.org/ontology/Politician 208 (0.2%)

Table: Overview of popular entity types of the top-100,000 entities.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 19/25

Page 20: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Data Exploration and IntegrationSPARQL queries can be used to integrate information fromexternal knowledge bases such as DBpedia.

SELECT DISTINCT ?tweetID ?sentNegScore ?retweetCount ?politician ?birthPlace WHERE {

SERVICE <http://dbpedia.org/sparql> {

?politician dc:subject dbc:German_politicians ; dbo:birthPlace ?birthPlace }

?tweet a sioc:Post ; dc:created ?date ; sioc:id ?tweetID FILTER(year(?date) = 2016) .

?tweet schema:mentions ?entity . ?entity a nee:Entity ; nee:hasMatchedURI ?politician .

?tweet schema:interactionStatistic ?stat . ?stat schema:interactionType schema:ShareAction .

?stat schema:userInteractionCount ?retweetCount FILTER(?retweetCount > 100) .

?tweet onyx:hasEmotionSet ?emotSet . ?emotSet onyx:hasEmotion ?emot .

?emot onyx:hasEmotionCategory wna:negative-emotion ;

onyx:hasEmotionIntensity ?sentNegScore FILTER (?sentNegScore >= 0.75) }

Figure: SPARQL query for retrieving popular tweets in 2016 mentioning German politicians with strong negativesentiment.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 20/25

Page 21: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Temporal Entity AnalyticsGiven an entity and a time period one can make temporal queriessuch as:SELECT ?month xsd:double(?cEnt)/xsd:double(?cAll)

WHERE {

{ SELECT (month(?date) AS ?month) (count(?tweet) AS ?cAll) WHERE {

?tweet a sioc:Post ; dc:created ?date FILTER(year(?date) = 2015)

} GROUP BY month(?date) }

{ SELECT (month(?date) AS ?month) (count(?tweet) AS ?cEnt) WHERE {

?tweet a sioc:Post ; dc:created ?date FILTER(year(?date) = 2015) .

?tweet schema:mentions ?entity .

?entity a nee:Entity ; nee:hasMatchedURI dbr:Alexis_Tsipras

} GROUP BY month(?date) }

} ORDER BY ?month

Figure: SPARQL query for retrieving the monthly popularity of Alexis Tsipras (Greek prime minister) in 2015

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 21/25

Page 22: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Entity RecommendationsEntity co-occurancies indicate entity relatedness.

July’15 Oct’15Apr’15

Greece

Vladimir Putin

Russia

Yanis Varoufakis

Sanctions (law)

Referendum

Reuters

Greek withdrawal from the eurozone

Syriza

Athens

European migrant crisis

Lesbos

Debt relief

The New York Times

Austerity

Greek War of Independ.

Pragmatism

Greek gov. debt crisis

Intern. Monetary Fund

Figure: Co-occuring entities of Tsipras (Greek prime minister) duringApr’15 - Dec’15

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 22/25

Page 23: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Sustainability and Maintenance

I Archive is deposed on Zenodo and registered atdatahub.ckan.io

I Integrated into AFEL5 and ALEXANDRIA6 projects.

I Endpoint7 which currently facilitates 5% of the whole corpus.

I Periodically update TweetsKB (every 6 months) with recentinformation.

5http://afel-project.eu/6http://www.alexandria-project.eu7http://l3s.de/tweetsKB/endpoint/

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 23/25

Page 24: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Conclusions

I Largest annotated RDF corpus of tweets.

I Advanced entity-centric exploration.

I Data integration with other knowledge bases (at queryexecution time).

I Analytics for different disciplines and research problems.

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 24/25

Page 25: TweetsKB: A Public and Large-Scale RDF Corpus of Annotated …l3s.de/~fafalios/files/ppts/TweetsKB_ESWC2018_Slides.pdf · 2018-06-11 · Entity Linking I Yahoo’s FEL tool [Blanco

Thanks

Questions?Contact: {fafalios, iosifidis}@L3S.de

Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 25/25