Geo-spatial Event Detection in the Twitter Stream
Michael Kaisser, AGT International
Berlin Buzzwords, June 3, 2013
2
Outline
1. Introduction & Context
• Social Media Analysis in a C2 Center
2. The “Avalanche” event detection approach
• Identify posting “hot spots”
• Evaluate post clusters with a Machine Learning approach
3. Evaluation
4. Future work
3
Background: Social Data
• Social Media continuously creates massive amounts of data
• E.g. 500 Million tweets each day: ~300 GB raw data
• Nature of the data:
• time-stamped
• textual (many languages, lingo and slang; spelling mistakes are rife; only a few words per tweet)
• links to pictures
• links to newspaper articles (more text)
• sometimes geo-spatial (contains coordinates)
• Creating real, actionable insights from this is not an easy problem
This talk gives one specific example of how this can be done
4
Use case: Urban Management & Public Safety
• Cities today are complex and need to be organized
• The administration is responsible for keeping the population safe
• emergency services
• health services
• fire fighters
• police
Command & Control Center
5
Urban Management & Public Safety
Why is Social Media relevant in this context?
“There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy”
“De tering, wat een hel!!! 1,4 miljoen mensen op dat terrein! #loveparade” (Dutch: “Damn, what hell!!! 1.4 million people on those grounds! #loveparade”)
“#Hoboken is on fire. Building above Hoboken Farm Corporation at 300 Washington is all smoked out”
Social Media can help create a situational awareness picture
9
Context: Social Media in a C2 Center
10
Avalanche: Event detection in a C2 Center
16
Two step approach:
1. Identify locations with high tweet activity
• Collect geo-spatial tweet clusters
2. Evaluate clusters with a Machine Learning approach
• Do these clusters constitute a real-world event that the tweeters are witnessing first-hand?
Work in Progress:
3. Classify events according to type
How is it done?
17
Machine Learning – What is the task?
[Map legend: marker = geo-located Social Media post (Tweet)]
Good:
• Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X
• Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2
• Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura
• Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf
vs.
Bad:
• RT @refinery29: This image of Madeleine Albright playing the drums will be the best thing you'll see today: http://t.co/rGwQ5RdG
• «@_PrettyPoison Guess ill fill out more job apps today» make punna fill out some 2!
• The Glamour & Glitz at the 2012 Emmy's that we loved! http://t.co/CiTFszfL
• @IszwanieSyahira: i'm happy and i hope u feel the same too. weeeee ~.~
• How to prepare yourself for Friday's apocalypse http://cnet.co/lPU
We need to automatically determine which of the tweet clusters (tweets issued close to each other in a short time frame) represent real-world events and which are just random chatter.
19
Architecture
• We look for geo-spatial clusters of tweets (e.g. 3 or more tweets within a 200 m radius, posted within 30 minutes)
• These become “event candidates”
• Event candidates are evaluated with a Machine Learning scheme
• We currently use C4.5 decision trees
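The candidate step described on this slide could be sketched roughly as follows. This is an illustration only, not the Avalanche implementation: the `Tweet` class, the function names, and the seed-based grouping strategy are assumptions; only the example thresholds (3 tweets, 200 m, 30 minutes) come from the slide.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:  # hypothetical record type, not from the talk
    user: str
    text: str
    lat: float
    lon: float
    ts: float  # posting time in seconds


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in metres."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def event_candidates(tweets, radius_m=200.0, window_s=30 * 60, min_tweets=3):
    """Group each seed tweet with later tweets that are close in space and time.

    Assumed strategy: every tweet acts as a seed; candidates may overlap.
    """
    ordered = sorted(tweets, key=lambda t: t.ts)
    candidates = []
    for i, seed in enumerate(ordered):
        cluster = [seed]
        for other in ordered[i + 1:]:
            if other.ts - seed.ts > window_s:
                break  # tweets are time-sorted, so nothing later can qualify
            if haversine_m(seed.lat, seed.lon, other.lat, other.lon) <= radius_m:
                cluster.append(other)
        if len(cluster) >= min_tweets:
            candidates.append(cluster)
    return candidates
```

The pairwise scan is quadratic within each time window; a production system handling the full stream would presumably use a spatial index or grid instead.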
20
Machine Learning - Features
Tweet cluster:
• Suspicious package in #GrandCentral #NYC #bomb threat possibility not sure?? http://t.co/VwU7SP3X
• Suspicious package found in Grand Central Station... the 456 train..the trains are closed !! [pic]: http://t.co/9YPki4k2
• Something happened in the #456 #trainstation in #GrandCentral #NYC http://t.co/GGKvQura
• Accident on the #456train in #midtown #NYC http://t.co/fj2mJJmf
21
Scalable Machine Learning … with Weka!
Blue = training, Green = runtime
In offline ML, we train once, but use the predictive model possibly millions of times a day.
It’s okay if training isn’t fast as lightning. But during execution every CPU cycle can count.
22
…
… which can be optimized further in various ways.
See e.g. Nima Asadi, Jimmy Lin, Arjen P. de Vries. Runtime Optimizations for Tree-Based Machine Learning Models. IEEE Transactions on Knowledge and Data Engineering, 2013.
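To illustrate why runtime cost can be kept low, here is a minimal sketch of the "train once, evaluate often" idea: a learned decision tree boils down to a handful of comparisons. This is not Weka's API or output. The tree below is hypothetical; its feature names echo the scores plotted in the evaluation, and the thresholds are invented for illustration.

```python
# Hypothetical trained tree: thresholds are invented, not learned from data.
TREE = {
    "feature": "unique_posters",
    "threshold": 0.5,
    "low": {"label": "bad"},
    "high": {
        "feature": "common_theme",
        "threshold": 0.3,
        "low": {"label": "bad"},
        "high": {"label": "good"},
    },
}


def predict_interpreted(tree, features):
    """Walk the tree data structure node by node until a leaf is reached."""
    node = tree
    while "label" not in node:
        branch = "high" if features[node["feature"]] > node["threshold"] else "low"
        node = node[branch]
    return node["label"]


def predict_compiled(features):
    """The same tree written out as plain comparisons, with no traversal overhead."""
    if features["unique_posters"] > 0.5:
        if features["common_theme"] > 0.3:
            return "good"
    return "bad"
```

Both functions return the same label for any input; the second form shows the kind of hard-coded branching that runtime optimizations for tree models start from.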
23
Machine Learning - Evaluation
Evaluation setup:
• 1,000 hand-labeled tweet clusters.
• 319 good, 681 bad.
• 10-fold cross validation.
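As a rough sketch of this setup, the snippet below splits 1,000 labels with the stated class balance into 10 folds and computes the majority-class baseline that any classifier has to beat. Stratifying the folds is an assumption (the slides do not say how folds were drawn), and `stratified_folds` is a hypothetical helper.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so each fold keeps roughly the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin per class
    return folds


labels = ["good"] * 319 + ["bad"] * 681
folds = stratified_folds(labels)

# Always predicting the larger class ("bad") is right for 681 of 1,000 clusters.
majority_baseline = max(labels.count("good"), labels.count("bad")) / len(labels)
```

With this class balance, always answering "bad" already yields 68.1% accuracy, which is why per-class precision and recall are more informative than accuracy alone.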
[Scatter plot of the labeled clusters. Axes: Unique Posters score and Common Theme score, each from 0 to 1. Blue: event; Red: no event]
26
(Somewhat simplified) Summary
If there are several tweets …
• from roughly the same location
• at roughly the same time
• from different users
• that nevertheless use the same words
… chances are good that we have detected an event.
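The last two criteria in this summary ("different users", "same words") can be turned into numeric features along the following lines. The score names match the axis labels of the evaluation plot, but the formulas, the tokenizer, and the stopword list are assumptions made for illustration; the talk does not define them.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "is", "are", "and", "not"}


def unique_posters_score(users):
    """Share of distinct users among the tweets of a cluster (1.0 = all different)."""
    return len(set(users)) / len(users)


def tokens(text):
    """Lower-case word set of a tweet, without URLs and stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return set(re.findall(r"[a-z0-9]+", text)) - STOPWORDS


def common_theme_score(texts):
    """Fraction of tweets that contain the single most widely shared token."""
    counts = Counter()
    for text in texts:
        counts.update(tokens(text))
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / len(texts)
```

For the Grand Central cluster shown earlier, the most widely shared token is "nyc", which appears in three of the four tweets.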
27
Outlook – work in progress and future work
Derive more coordinates
• from shared pictures
• from toponyms in posts
• use image sharing sites directly
Make use of posts without coordinates
• and add them to already existing clusters
Explore real-time TF-IDF
• to get rid of the Kardashians & Beliebers
Evaluate system with real-world data
• Because recall numbers are currently somewhat misleading
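The TF-IDF idea from the outlook can be sketched as follows: terms that appear all over the background stream (celebrity chatter) get a low weight, while terms that are rare in the stream but present in a cluster stand out. This is a static, batch version with a smoothed IDF; the function names are hypothetical, and a real-time variant would have to update the document frequencies incrementally.

```python
import math
from collections import Counter


def idf_table(background_docs):
    """Smoothed inverse document frequency for each term in the background stream."""
    n = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))  # count each term once per document
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return idf, n


def tfidf(cluster_tokens, idf, n_docs):
    """TF-IDF weight per term of one cluster; unseen terms get the maximum IDF."""
    tf = Counter(cluster_tokens)
    default_idf = math.log(1 + n_docs) + 1.0  # document frequency of zero
    total = len(cluster_tokens)
    return {term: count / total * idf.get(term, default_idf) for term, count in tf.items()}
```

In a toy stream where "bieber" occurs in most background tweets, a cluster term such as "hoboken" that never appears in the background receives the highest weight.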
28
Machine Learning – Relevance Feedback (work in progress)
[Diagram: Machine Learning Model; Users (journalists, C2 operators); Documents (e.g. tweets, post clusters), labeled Good / Bad]
• Users implicitly rate documents by how they interact with them
• User performs follow-up actions → relevant
• User clicks document away → irrelevant
• System learns to present more relevant documents
• System can adapt to changing needs over time
29
Image Analysis of shared pictures (work in progress)
Example: Explosion in an image
Tweet: “OMG!!!OMG!!! http://t.co/maiAgHoh” (explosion detected with Image Analysis)
Problem:
• Not all tweets contain useful textual information
• Shared text might be hard to analyze
Solution:
• ~35% of tweets contain linked images
• Images provide a wealth of information that can be analyzed
• Objects, events, persons
• Coordinates
Thank you!