SC7 Webinar 5 13/12/2017 NCSR "Demokritos" Presentation "Event Detection"

Event Detection. 5th BDE Hang-out “Big Data in Secure societies”, 13/12/2017. George Giannakopoulos and Nikiforos Pittaras, NCSR "Demokritos"

Page 1:

EVENT DETECTION

5th BDE Hang-out “Big Data in Secure societies”, 13/12/2017

George Giannakopoulos and Nikiforos Pittaras, NCSR "Demokritos"

Page 2:

Pilot Architecture

18-Dec-17 www.big-data-europe.eu

Page 3:

Event Detection Workflow

[Workflow diagram; components shown: News & Twitter Crawler, Event Detector, Lookup Service]

Page 4:

ED Workflow: News Crawler

Runs periodically

Stores parsed content and metadata to Cassandra

RSS feeds:

o Crawler complies with privacy regulations

o Default RSS feed list: Reuters generic categories

Direct links to published articles:

o Best-effort parsing
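A minimal sketch of the RSS side of such a crawler, assuming a standard RSS 2.0 feed. The `NewsItem` record, the `parse_rss` function and the sample feed are hypothetical names and data for illustration only; the slides do not describe the actual crawler code or its Cassandra schema, so storage is left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class NewsItem:
    """Hypothetical record for one parsed feed entry."""
    title: str
    link: str
    published: str


def parse_rss(xml_text: str) -> List[NewsItem]:
    """Extract title, link and publication date from every <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(NewsItem(
            # Best-effort parsing: a missing field becomes an empty string.
            title=(item.findtext("title") or "").strip(),
            link=(item.findtext("link") or "").strip(),
            published=(item.findtext("pubDate") or "").strip(),
        ))
    return items


# Made-up feed content, used only to exercise the parser.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Flood hits city</title><link>http://example.org/a1</link>
<pubDate>Wed, 13 Dec 2017 10:00:00 GMT</pubDate></item>
<item><title>Storm warning issued</title><link>http://example.org/a2</link></item>
</channel></rss>"""

for entry in parse_rss(SAMPLE_FEED):
    print(entry.title, entry.link)
```

Entries with missing fields are kept rather than dropped, which is one possible reading of "best-effort parsing".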

Page 5:

ED Workflow: Twitter Crawler

Runs periodically

Stores parsed content and metadata to Cassandra

Multiple operation modes:

o Query specified Twitter accounts

o Monitor all Twitter posts in a specified language

o Keyword-based search

o Parse individual specified posts

Page 6:

ED Workflow: Cassandra

Scalable, distributed NoSQL database

I/O scenarios:

1. News & Tweets storage:

o Individual items (news articles or tweets) from the crawlers

2. Event storage:

o Event objects & metadata, as identified by the Event Detector

3. Frontend queries:

o Queries from Sextant about the stored news items and events

Page 7:

ED Workflow: Event Detector

Runs periodically

Distributed execution based on Apache Spark

Two algorithm steps:

1. Discovers related news items and clusters them into events

2. Augments the produced events with useful metadata: date, locations, images and specified named entities

Detector algorithm based on

Page 8:

ED Workflow: ED Algorithm

1) Identify events:

o Gather all unique article pairs

o Compute the similarity of the two members of each pair using graph representation methods

If similarity > threshold → related pair

o Form clusters based on related pairs

If cluster has support > threshold → event
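The identification step above can be sketched as follows. This is a simplified illustration, not the project's implementation: the character n-gram graph and the Jaccard overlap of edge sets are stand-ins for the graph representation and similarity measure, clusters are taken to be connected components of related pairs, and support is assumed to mean the number of articles in a cluster. All function names, parameters and threshold values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def ngram_graph(text: str, n: int = 3, window: int = 3) -> Set[Edge]:
    """Undirected edges between character n-grams that start close together."""
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges: Set[Edge] = set()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges.add((g, h) if g <= h else (h, g))
    return edges


def similarity(a: Set[Edge], b: Set[Edge]) -> float:
    """Jaccard overlap of two edge sets (a stand-in similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_events(articles: Dict[str, str],
                  sim_threshold: float = 0.3,
                  support_threshold: int = 1) -> List[Set[str]]:
    graphs = {aid: ngram_graph(text) for aid, text in articles.items()}
    parent = {aid: aid for aid in articles}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Gather all unique pairs; similarity > threshold marks a related pair.
    for a, b in combinations(articles, 2):
        if similarity(graphs[a], graphs[b]) > sim_threshold:
            parent[find(a)] = find(b)

    # Form clusters from related pairs (connected components).
    clusters: Dict[str, Set[str]] = {}
    for aid in articles:
        clusters.setdefault(find(aid), set()).add(aid)

    # A cluster whose support exceeds the threshold is reported as an event.
    return [c for c in clusters.values() if len(c) > support_threshold]


# Made-up articles for illustration.
ARTICLES = {
    "a1": "Earthquake strikes central Italy, dozens injured",
    "a2": "Earthquake strikes central Italy, dozens reported injured",
    "a3": "Stock markets rally as tech shares climb",
}
print(detect_events(ARTICLES))
```

The pairwise loop is the expensive part, since the number of pairs grows quadratically with the number of articles, which is presumably what motivates the distributed execution.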

Page 9:

ED Workflow: ED Algorithm

2) Enrich events:

o Assign individual social media items to events

Convert each item to a graph-based representation and classify it by similarity

If similarity > threshold → attach to event

o Augment events with external metadata extractable from their member articles and tweets:

Location names and geocoordinates (GADM)

Named entities (Famous people)

Photographs (Flickr)
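The assignment of social media items to events can be sketched as a nearest-event classification with a cut-off. The similarity function is passed in as a parameter; the word-level Jaccard overlap used here is only a stand-in for the graph-based similarity named on the slide, and the function names, data and threshold are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def word_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: overlap of lower-cased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def assign_to_events(tweets: Dict[str, str],
                     events: Dict[str, List[str]],
                     sim: Callable[[str, str], float] = word_jaccard,
                     threshold: float = 0.2) -> Dict[str, Optional[str]]:
    """Map each tweet id to its most similar event id, or None if no event qualifies."""
    assignment: Dict[str, Optional[str]] = {}
    for tweet_id, text in tweets.items():
        best_event, best_score = None, threshold
        for event_id, member_texts in events.items():
            # Score a tweet against an event by its best-matching member article.
            score = max((sim(text, m) for m in member_texts), default=0.0)
            if score > best_score:  # similarity > threshold: attach to event
                best_event, best_score = event_id, score
        assignment[tweet_id] = best_event
    return assignment


# Made-up events and tweets for illustration.
EVENTS = {
    "quake": ["Earthquake strikes central Italy", "Strong earthquake hits Italy"],
    "markets": ["Stock markets rally as tech shares climb"],
}
TWEETS = {
    "t1": "Just felt the earthquake in central Italy",
    "t2": "Lovely weather today",
}
print(assign_to_events(TWEETS, EVENTS))
```

Tweets that match no event above the threshold stay unassigned instead of being forced into the closest cluster.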

Page 10:

ED Workflow: Location Extraction

Based on Apache Lucene for fuzzy queries

Based on the GADM dataset

o more than 180,000 location names & geometries

Input: Clean text

Output: Location name(s) with their corresponding geocoordinates
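The input/output contract above can be illustrated with a small fuzzy gazetteer lookup. This sketch swaps in Python's `difflib` for Apache Lucene's fuzzy queries and a three-entry, hand-made gazetteer with approximate coordinates for the GADM dataset; `GAZETTEER`, `extract_locations` and the cut-off value are hypothetical.

```python
import difflib
import re
from typing import Dict, List, Tuple

# Tiny illustrative gazetteer: name -> approximate (latitude, longitude).
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "athens": (37.98, 23.73),
    "brussels": (50.85, 4.35),
    "thessaloniki": (40.64, 22.94),
}


def extract_locations(text: str,
                      cutoff: float = 0.85) -> List[Tuple[str, Tuple[float, float]]]:
    """Return (name, coordinates) for each token that fuzzily matches a known location."""
    names = list(GAZETTEER)
    found: List[Tuple[str, Tuple[float, float]]] = []
    seen = set()
    for token in re.findall(r"[A-Za-z]+", text):
        # Fuzzy match tolerates small misspellings such as "Brusels".
        match = difflib.get_close_matches(token.lower(), names, n=1, cutoff=cutoff)
        if match and match[0] not in seen:
            seen.add(match[0])
            found.append((match[0], GAZETTEER[match[0]]))
    return found


print(extract_locations("Protests in Athens and Brusels spread to Thessaloniki"))
```

A real gazetteer of this size would need an index rather than a linear scan per token, which is the role a search library plays in the described setup.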

Page 11:

ED Workflow: Entity extraction

Incorporation of semantic metadata extraction

Augment events by extracting generic named entities

o Grounded to a unique entity URI

o Highly extensible: entity metadata easily queryable from additional RESTful APIs, if needed

APIs & thesauri by the Semantic Web Company

Page 12:

ED Workflow: Entity extraction

Example: famous people thesaurus

Text (https://en.wikipedia.org/wiki/The_Godfather#Cast) → Extractor API → Entities:

http://bde.poolparty.biz/People/20

http://bde.poolparty.biz/People/446473

http://bde.poolparty.biz/People/688722

....

Entities → Metadata API → Entity metadata:

name: Marlon Brando

uri: http://bde.poolparty.biz/People/688722

grounding: http://dbpedia.org/resource/Marlon_Brando

broaders: http://bde.poolparty.biz/People/2

properties: http://www.w3.org/1999/02/22-rdf-syntax-ns#type

...

Page 13:

ED Workflow: Detector Scaling

Study on event detection performance scaling

Distributed execution in Apache Spark

Further experiments on two datasets from two different domains

o News articles (Reuters-21578)

o Biomedical scientific publications (BioASQ)

Up to 10K articles in total (~5 million pairs)

Technical report draft available upon request

Page 14:

ED Workflow: Detector Scaling

Preliminary results on Reuters-21578

Parallel vs distributed execution time (lower is better)

Substantial speedup for sufficiently large workloads (> 8K articles)

Page 15:

ED Workflow: Image extraction

Enrichment of extracted locations with photographs

o Considers a radial area around the centroid of the geocoordinates of a location geometry

o Queries the Flickr API for user-uploaded public photographs within that area

o Filters results to a temporal window relevant to the date of the event in question
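The three steps above can be sketched as the construction of a search request. The parameter names (`lat`, `lon`, `radius`, `radius_units`, `min_taken_date`, `max_taken_date`) are those of Flickr's `flickr.photos.search` method and should be checked against the official documentation before use. The radius, the window size, the vertex-average centroid and the helper names are assumptions made for illustration. No request is sent, and the API key a real call needs is deliberately left out.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def centroid(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average of the (lat, lon) vertices; a simplification of a true polygon centroid."""
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return lat, lon


def flickr_search_params(geometry: List[Tuple[float, float]],
                         event_date: date,
                         radius_km: float = 5.0,
                         window_days: int = 3) -> Dict[str, str]:
    """Build query parameters for a radial, time-windowed photo search."""
    lat, lon = centroid(geometry)
    start = event_date - timedelta(days=window_days)
    end = event_date + timedelta(days=window_days)
    return {
        "method": "flickr.photos.search",
        "lat": f"{lat:.5f}",
        "lon": f"{lon:.5f}",
        "radius": str(radius_km),      # radial area around the centroid
        "radius_units": "km",
        "min_taken_date": f"{start.isoformat()} 00:00:00",  # temporal window
        "max_taken_date": f"{end.isoformat()} 23:59:59",
        "format": "json",
    }


# Made-up square geometry and the webinar date as the event date.
square = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.0)]
print(flickr_search_params(square, date(2017, 12, 13)))
```

A symmetric window around the event date is only one option; a window that starts at the event date would fit events whose aftermath is of interest.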

Page 16:

ED Workflow: Connectivity

Workflow inter-connections

Automatic triggering of the CD (change detection) workflow

o Event support calculated during detection

o Triggers if support greater than a specified threshold

Twitter Crawler source injection

o Targeted consumption of specified posts

Asynchronous non-blocking operations
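The support-based triggering can be sketched with `asyncio`. This illustrates the pattern only: `trigger_cd_workflow` is a hypothetical stand-in for whatever call starts the CD workflow, and the support values and threshold are made up.

```python
import asyncio
from typing import Dict, List


async def trigger_cd_workflow(event_id: str, triggered: List[str]) -> None:
    """Hypothetical stand-in for the request that starts the CD workflow."""
    await asyncio.sleep(0)  # placeholder for non-blocking network I/O
    triggered.append(event_id)


async def dispatch(event_support: Dict[str, int], threshold: int) -> List[str]:
    """Fire one trigger per event whose support exceeds the threshold."""
    triggered: List[str] = []
    tasks = [
        asyncio.create_task(trigger_cd_workflow(event_id, triggered))
        for event_id, support in event_support.items()
        if support > threshold
    ]
    # The caller is free to do other work here; the triggers run concurrently.
    await asyncio.gather(*tasks)
    return sorted(triggered)


print(asyncio.run(dispatch({"e1": 5, "e2": 1, "e3": 9}, threshold=3)))
```

Scheduling the triggers as tasks keeps detection from waiting on each downstream call in turn.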

Page 17:

Thank you!

Questions?

Links

Strabon: http://strabon.di.uoa.gr

GeoTriples: https://github.com/LinkedEOData/GeoTriples

Event Detection: https://github.com/big-data-europe/docker-event-detection
