SC7 Webinar 5 13/12/2017 NCSR "Demokritos" Presentation "Event Detection"
EVENT DETECTION
5th BDE Hang-out “Big Data in Secure Societies”, 13/12/2017
George Giannakopoulos and
Nikiforos Pittaras,
NCSR "Demokritos"
Pilot Architecture
18 Dec 2017, www.big-data-europe.eu
Event Detection Workflow
[Diagram components: News & Twitter Crawler, …, Event Detector, Lookup Service]
ED Workflow: News Crawler
Runs periodically
Stores parsed content and metadata to Cassandra
RSS feeds:
o Crawler complies with privacy regulations
o Default RSS feed list points to the generic Reuters categories
Direct links to published articles:
o Best-effort parsing
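As a rough illustration of the RSS path described above, the sketch below parses a feed document and collects the title, link and publication date of each item. It assumes plain RSS 2.0 and uses only the Python standard library; the helper name parse_rss_items and the sample feed are hypothetical, and neither the pilot's actual crawler nor its Cassandra storage step is shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Return one dict per <item> of an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # Best-effort parsing: a missing field becomes an empty string.
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Made-up feed used only to exercise the parser.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>Story A</title><link>http://example.org/a</link>
<pubDate>Wed, 13 Dec 2017 10:00:00 GMT</pubDate></item>
<item><title>Story B</title><link>http://example.org/b</link></item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
```

Each returned dict could then be stored as one news item, with the link serving as a natural key.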
ED Workflow: Twitter Crawler
Runs periodically
Stores parsed content and metadata to Cassandra
Multiple operation modes:
o Query specified twitter accounts
o Monitor all twitter posts of a specified language
o Keyword-based search
o Parse individual specified posts
ED Workflow: Cassandra
Scalable, distributed NoSQL database
I/O scenarios:
1. News & Tweets storage:
o Individual items (news articles or tweets) from the crawlers
2. Event storage:
o Event objects & metadata, as identified by the Event Detector
3. Frontend queries:
o Queries from Sextant about the stored news items and events
ED Workflow: Event Detector
Runs periodically
Distributed execution based on Apache Spark
Two algorithm steps:
1. Discovers related news items and clusters them into events
2. Augments the produced events with useful metadata: date,
locations, images and specified named entities
Detector algorithm based on
ED Workflow: ED Algorithm
1) Identify events:
o Gather all unique article pairs
o Extract similarity of members in each pair using graph
representation methods
If similarity > threshold → related pair
o Form clusters based on related pairs
If cluster has support > threshold → event
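The identification step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pilot's implementation: a Jaccard overlap of word sets stands in for the graph-based similarity, related pairs are merged with a simple union-find, and "support" is assumed to mean the number of articles in a cluster. All function and parameter names are hypothetical.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for the graph-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def detect_events(articles: dict, sim_threshold: float, support_threshold: int) -> list:
    """Cluster articles linked by related pairs; keep clusters with enough support."""
    parent = {aid: aid for aid in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(articles, 2):  # all unique article pairs
        if similarity(articles[a], articles[b]) > sim_threshold:
            parent[find(a)] = find(b)  # related pair: merge the two clusters

    clusters = {}
    for aid in articles:
        clusters.setdefault(find(aid), set()).add(aid)
    # Assumption: support is the cluster size.
    return [c for c in clusters.values() if len(c) > support_threshold]


articles = {
    "a1": "earthquake strikes central italy",
    "a2": "strong earthquake strikes italy",
    "a3": "italy earthquake strikes towns",
    "a4": "stock markets rally in asia",
}
events = detect_events(articles, sim_threshold=0.5, support_threshold=1)
```

With these made-up articles, the three earthquake stories form one event and the unrelated article is discarded for lack of support.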
ED Workflow: ED Algorithm
2) Enrich events:
o Assign individual social media items to events
Convert to graph-based representation, then classify by similarity
If similarity > threshold → attach to event
o Augment events with external metadata extractable from their member
articles and tweets:
Location names and geocoordinates (GADM)
Named entities (famous people)
Photographs (Flickr)
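The assignment of social media items to events can be sketched as a nearest-event classification with a threshold. The sketch assumes a word-set Jaccard overlap in place of the graph-based representation and scores an item against an event by its best-matching member article; the names attach_items and jaccard are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in for the graph-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def attach_items(events: dict, items: dict, threshold: float) -> dict:
    """Map each event id to the ids of the social media items attached to it."""
    attached = {eid: [] for eid in events}
    for item_id, text in items.items():
        # Score the item against every event and keep the best match.
        best_id, best_sim = None, 0.0
        for eid, member_texts in events.items():
            sim = max((jaccard(text, m) for m in member_texts), default=0.0)
            if sim > best_sim:
                best_id, best_sim = eid, sim
        # Attach only if the best similarity exceeds the threshold.
        if best_id is not None and best_sim > threshold:
            attached[best_id].append(item_id)
    return attached


events = {
    "quake": ["earthquake strikes central italy"],
    "markets": ["stock markets rally in asia"],
}
tweets = {"t1": "earthquake strikes italy", "t2": "good morning everyone"}
attached = attach_items(events, tweets, threshold=0.4)
```

Items that match no event above the threshold are simply left unattached.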
ED Workflow: Location Extraction
Based on Apache Lucene for fuzzy queries
Based on the GADM dataset
o more than 180,000 location names & geometries
Input: Clean text
Output: Location name(s) with their corresponding
geocoordinates
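To illustrate the input/output contract above, the sketch below matches tokens of clean text against a tiny gazetteer with approximate string matching. It uses Python's difflib as a stand-in for Lucene fuzzy queries, handles only single-word names, and the gazetteer entries (with rounded coordinates) are hypothetical placeholders for the GADM data.

```python
import difflib
import re

# Hypothetical mini-gazetteer: name -> approximate (latitude, longitude).
GAZETTEER = {
    "Athens": (37.98, 23.73),
    "Brussels": (50.85, 4.35),
    "Rome": (41.90, 12.50),
}


def extract_locations(text: str, gazetteer: dict = GAZETTEER, cutoff: float = 0.8) -> dict:
    """Return {location name: coordinates} for tokens that fuzzily match a name."""
    by_lower = {name.lower(): name for name in gazetteer}
    found = {}
    for token in re.findall(r"[A-Za-z]+", text):
        # Keep only the closest gazetteer name, if it is similar enough.
        matches = difflib.get_close_matches(
            token.lower(), list(by_lower), n=1, cutoff=cutoff
        )
        if matches:
            name = by_lower[matches[0]]
            found[name] = gazetteer[name]
    return found


locations = extract_locations("Protests in Athenes spread to Brussels")
```

The misspelled "Athenes" still resolves to Athens, which is the kind of tolerance a fuzzy query is meant to provide.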
ED Workflow: Entity extraction
Incorporation of semantic metadata extraction
Augment events by extracting generic named
entities
o Grounded to a unique entity URI
o Highly extensible: entity metadata easily queryable
from additional RESTful APIs, if needed
APIs & thesauri by the Semantic Web Company
ED Workflow: Entity extraction
Example: famous people thesaurus:
Input text: https://en.wikipedia.org/wiki/The_Godfather#Cast
Extractor API → Entities:
o http://bde.poolparty.biz/People/20
o http://bde.poolparty.biz/People/446473
o http://bde.poolparty.biz/People/688722
o ....
Metadata API → Entity metadata:
o name: Marlon Brando
o uri: http://bde.poolparty.biz/People/688722
o grounding: http://dbpedia.org/resource/Marlon_Brando
o broaders: http://bde.poolparty.biz/People/2
o properties: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
o ...
ED Workflow: Detector Scaling
Study on event detection performance scaling
Distributed execution in Apache Spark
Further experiments on two datasets from two different domains
o News articles (Reuters-21578)
o Biomedical scientific publications (bioASQ)
Up to 10K articles in total (~5 million pairs)
Technical report draft available upon request
ED Workflow: Detector Scaling
Preliminary results on Reuters-21578
Parallel vs distributed execution time (lower is better)
Substantial speedup for sufficiently large workloads (> 8K articles)
ED Workflow: Image extraction
Enrichment of extracted locations with photographs
o Considers a radial area around the centroid of the
geocoordinates of a location geometry
o Queries the Flickr API for user-uploaded public
photographs within that area
o Filters results to a temporal window relevant to
the date of the event in question
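The spatial and temporal filtering above can be sketched as follows. This is an illustration under assumptions: the centroid is approximated by the mean of the geometry's vertices, distance uses the haversine formula, and the photo records are local made-up dicts rather than actual Flickr API results. The names select_photos, radius_km and window_days are hypothetical.

```python
import math
from datetime import date, timedelta


def centroid(points):
    """Mean of (lat, lon) vertices; a simple approximation of a geometry centroid."""
    lats, lons = zip(*points)
    return sum(lats) / len(lats), sum(lons) / len(lons)


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def select_photos(photos, geometry, event_date, radius_km=10.0, window_days=3):
    """Keep ids of photos inside the radial area and the temporal window."""
    center = centroid(geometry)
    window = timedelta(days=window_days)
    return [
        p["id"] for p in photos
        if haversine_km(center, (p["lat"], p["lon"])) <= radius_km
        and abs(p["taken"] - event_date) <= window
    ]


geometry = [(37.9, 23.6), (37.9, 23.8), (38.1, 23.8), (38.1, 23.6)]
photos = [
    {"id": "p1", "lat": 38.01, "lon": 23.71, "taken": date(2017, 12, 12)},
    {"id": "p2", "lat": 38.01, "lon": 23.71, "taken": date(2017, 11, 1)},
    {"id": "p3", "lat": 40.60, "lon": 22.90, "taken": date(2017, 12, 13)},
]
selected = select_photos(photos, geometry, event_date=date(2017, 12, 13))
```

Only the first photo is both near the centroid and close in time to the event; the second is too old and the third too far away.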
ED Workflow: Connectivity
Workflow inter-connections
Automatic triggering of the CD (change detection) workflow
o Event support calculated during detection
o Triggers if support greater than a specified threshold
Twitter Crawler source injection
o Targeted consumption of specified posts
Asynchronous non-blocking operations
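The threshold-based triggering and its non-blocking nature can be sketched with asyncio. This is only an illustration: trigger_cd_workflow is a hypothetical stand-in for whatever call actually starts the CD workflow, and the event ids and support values are made up.

```python
import asyncio


async def trigger_cd_workflow(event_id: str) -> str:
    """Hypothetical stand-in for the call that starts the CD workflow."""
    await asyncio.sleep(0)  # placeholder for a non-blocking network request
    return f"triggered:{event_id}"


async def dispatch(events: dict, support_threshold: int) -> list:
    """Trigger the CD workflow for every event whose support exceeds the threshold."""
    selected = [eid for eid, support in events.items() if support > support_threshold]
    # gather() runs the triggers concurrently and preserves the input order.
    return await asyncio.gather(*(trigger_cd_workflow(eid) for eid in selected))


# Made-up mapping of event id -> support computed during detection.
results = asyncio.run(dispatch({"e1": 12, "e2": 3, "e3": 7}, support_threshold=5))
```

Because the triggers are awaited together, a slow response for one event does not hold up the others.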
Thank you!
Questions?
Links
Strabon: http://strabon.di.uoa.gr
GeoTriples: https://github.com/LinkedEOData/GeoTriples
Event Detection: https://github.com/big-data-europe/docker-event-detection