Extracting City Traffic Events from Social Streams

Extracting City Events from Social Streams

Pramod Anantharam

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing

Wright State University, Dayton, Ohio, USA

http://www.ict-citypulse.eu/page/

Collaborator: Dr. Payam Barnaghi

Advisors: Prof. Amit Sheth, Prof. Krishnaprasad Thirunarayan

Pramod Anantharam, Payam Barnaghi, Krishnaprasad Thirunarayan, and Amit Sheth. 2015. Extracting City Traffic Events from Social Streams. ACM Trans. Intell. Syst. Technol. 6, 4, Article 43 (July 2015), 27 pages. DOI=10.1145/2717317

http://doi.acm.org/10.1145/2717317

http://wiki.knoesis.org/index.php/Citypulse

http://knoesis.org/researchers/pramod/





http://personal.ee.surrey.ac.uk/Personal/P.Barnaghi/




http://knoesis.wright.edu/amit/


http://knoesis.wright.edu/tkprasad/

http://www.knoesis.org/library/resource.php?id=2030





A Historical Perspective on Cities and Their Inhabitants

“kings, emperors and other rulers benefited from being on the front lines with their people when it came to making decisions.”1

1 http://gicoaches.com/what-we-can-learn-from-kings-of-the-past-who-disguised-themselves-as-ordinary-men/

http://en.wikipedia.org/wiki/Qianlong_Emperor

Qianlong Emperor (reigned 8 October 1735 – 9 February 1796), Qing Dynasty (1644–1912)

Disguised as a commoner, Qianlong visited cities to understand a common man’s life

This practice has been popularly known as “Management by Walking Around” since the 1980s


A Modern Perspective on Cities and Their Inhabitants

City authorities, government, and other humanitarian agencies benefit from being on the front lines with their people when it comes to making decisions.

We want to be connected to citizens to understand and prioritize decisions

Image credit: http://www-03.ibm.com/software/products/us/en/intelligent-operations-center

Image credit: http://www.ibm.com/smarterplanet/us/en/smarter_cities/overview/index.html

Life in a City


[Figure labels: Public Safety; Urban Planning; Gov. & Agency Admin.; Energy & Water; Environmental; Transportation; Social Programs; Healthcare; Education]

Pulse of a City (CityPulse)



What Are People Saying About City Infrastructure on Twitter?

• What are people saying about city infrastructure on Twitter?

• How do we extract city infrastructure-related events from Twitter?

• How can we leverage event and location knowledge bases for event extraction?

• How well can we extract city events?

Research Questions

Some Challenges in Extracting Events from Tweets

• No well-accepted definition of ‘events related to a city’
• Tweets are short (140 characters) and their informal nature makes them hard to analyze
  – Entity, location, time, and type of an event
• Multiple reports of the same event and sparse reports of some events (biased sample)
  – Numbers don’t necessarily indicate intensity
• Validation of the solution is hard due to the open-domain nature of the problem

Formal Text Informal Text

Closed Domain

Open Domain [Roitman et al. 2012][Kumaran and Allan 2004]

[Lampos and Cristianini 2012]

[Becker et al. 2011]

[Wang et al. 2012]

[Ritter et al. 2012]

Related Work on Event Extraction

• N-grams + Regression
  – Text analysis to extract uni- and bi-grams (event markers)
  – Feature selection to select best possible event markers
  – Apply regression to estimate conditional probability P(Y|X) to enable prediction, where Y is the target (e.g., rainfall) and X is the input (e.g., traffic jam event)
• Clustering
  – Create event clusters incrementally over time
  – Identify clusters of interest based on their relevance (manual inspection)
  – Granularity remains at the tweet/cluster level (tweets are assigned to clusters of interest)
• Sequence Labeling (CRFs)
  – Text analysis to extract features such as named entities and part-of-speech (POS) tags
  – Each event indicator is modeled as a mixture of event types that are latent variables
  – Each type corresponds to a distribution over named entities n (labels assigned to event types by manual inspection) and other features

Event Extraction -- Techniques

• Event extraction should be open domain (no a priori event types) preferably with event metadata (e.g., event duration, impact).

• Incorporate background knowledge related to city events e.g., 511.org hierarchy, SCRIBE ontology, city location names.

• Assess the intensity of an event using content and network cues.

• Be robust w.r.t. noise, informality, and variability of the data.

City Event Extraction -- Desiderata

• N-grams + Regression
  – Open domain: works best when there is a reference corpus to extract n-grams
  – Event metadata: cannot distinguish between entities and hence hard to extract event metadata
  – Background knowledge: incorporating domain vocabulary (e.g., subsumption) is not natural
  – Event intensity: regression maps the event indicators to some quantified values
  – Robustness: quite robust if there is a reference corpus

Techniques -- Desiderata

• Clustering
  – Open domain: works well for domains with no a priori knowledge of events (may need human inspection)
  – Event metadata: too coarse grained (document level) and event metadata extraction is not natural
  – Background knowledge: incorporating domain vocabulary is not natural
  – Event intensity: not captured
  – Robustness: quite robust for Twitter data with enough data for each cluster


• Sequence Labeling (CRFs)
  – Open domain: works well for domains with no a priori knowledge of events (may need human inspection)
  – Event metadata: event metadata extraction is captured naturally with the named entities
  – Background knowledge: incorporating domain vocabulary is quite natural
  – Event intensity: part-of-speech tags may indirectly capture intensity
  – Robustness: with a deeper model for capturing context, quite robust for Twitter data


[Figure: solution architecture. Components: tweets from a city; City Event Annotation (POS tagging, hybrid NER + event term extraction); City Event Extraction (geohashing, temporal estimation, impact assessment, event aggregation); background knowledge (OSM locations, SCRIBE ontology, 511.org hierarchy); city infrastructure.]

City Event Extraction Solution Architecture

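[Figure, two panels. "A General CRF Model": tags (tag 1, tag 2, tag 3) over tokens (token 1, token 2, token 3), with factors Φ^1_1(tag1, tag2) and Φ^1_2(tag2, tag3) between adjacent tags and factors Φ^2_1(tag1, token1), Φ^2_2(tag2, token2), Φ^2_3(tag3, token3) between each tag and its token. "Regression Based Implementation of CRF Model": training data with tokens and tags (labelled t1 to t3 and T1 to T6 in the figure).]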

City Event Annotation – CRF Formalization

The global normalization distinguishes CRFs from other models, allowing long-distance dependencies to be factored in.
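The formula on this slide did not survive extraction. As a reference point, a standard linear-chain CRF written with the tag-to-tag factors Φ^1 and tag-to-token factors Φ^2 from the figure takes roughly the following form. The symbols x (token sequence), y (tag sequence), n (sequence length), and Z(x) are notation assumed here, not taken from the slide.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \prod_{i=1}^{n-1} \Phi^{1}_{i}(y_i, y_{i+1})
  \prod_{i=1}^{n} \Phi^{2}_{i}(y_i, x_i),
\qquad
Z(x) = \sum_{y'}
  \prod_{i=1}^{n-1} \Phi^{1}_{i}(y'_i, y'_{i+1})
  \prod_{i=1}^{n} \Phi^{2}_{i}(y'_i, x_i)
```

Z(x) sums over every candidate tag sequence y', so the probability of a tagging is normalized over the whole sequence rather than one position at a time.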

City Event Annotation – CRF Annotation Examples

Last O night O in O CA... O (@ O Half B-LOCATION Moon I-LOCATION Bay B-LOCATION Brewing I-LOCATION Company O w/ O 8 O others) O http://t.co/w0eGEJjApY O

#Manteca O accident. B-EVENT two O lanes O blocked B-EVENT on O Hwy O 99 O NB O at O Austin O Rd O #traffic O http://t.co/YehsHpD7aC O

#Fontana O accident. B-EVENT three O lanes O blocked B-EVENT on O I-10 O WB O between O Cherry B-LOCATION Ave I-LOCATION and O Etiwanda O Ave O in O #Ontario O #LAtraffic O http://t.co/e2e6MW3d78 O

Tags used in our approach: B-LOCATION, I-LOCATION, B-EVENT, I-EVENT, O
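To make the tag scheme concrete, here is a minimal sketch that groups a tagged token sequence into labelled phrases. The function name `bio_spans` is hypothetical, and dropping an I- tag that does not continue an open span is a simplifying assumption, not something the slides specify.

```python
def bio_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, phrase) pairs."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if words:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag extends the open span of the same label.
            words.append(token)
        else:
            # "O", or an I- tag with no matching open span (ignored here).
            if words:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if words:
        spans.append((label, " ".join(words)))
    return spans


# Third example tweet from the slide, without the trailing URL.
tokens = ("#Fontana accident. three lanes blocked on I-10 WB between "
          "Cherry Ave and Etiwanda Ave in #Ontario #LAtraffic").split()
tags = ["O", "B-EVENT", "O", "O", "B-EVENT", "O", "O", "O", "O",
        "B-LOCATION", "I-LOCATION", "O", "O", "O", "O", "O", "O"]
print(bio_spans(tokens, tags))
# [('EVENT', 'accident.'), ('EVENT', 'blocked'), ('LOCATION', 'Cherry Ave')]
```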

a) Space: events reported within a grid g_i (g_i ∈ G, where G is the set of all grids in a city) at a certain time are most likely reporting the same event

b) Time: events reported within a time ∆t in a grid g_i are most likely reporting the same event

c) Theme: events with similar entities within a grid g_i and time ∆t are most likely reporting the same event

City Event Extraction -- Key Insights

We will utilize these principles in the event aggregation algorithm
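The three principles can be sketched as a simple grouping step. This is not the aggregation algorithm from the paper, which the transcript does not preserve. It is an illustration under stated assumptions: each report is a hypothetical `(event_type, grid_id, timestamp)` triple, "similar entities" is approximated by an identical event type, and `delta_t` stands in for ∆t.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_reports(reports, delta_t):
    """Merge reports that share a grid and type and fall within delta_t.

    Returns (event_type, grid_id, start, end, report_count) tuples.
    """
    by_key = defaultdict(list)
    for event_type, grid_id, ts in reports:
        # Theme + space: same event type reported in the same grid cell.
        by_key[(event_type, grid_id)].append(ts)

    events = []
    for (event_type, grid_id), stamps in by_key.items():
        stamps.sort()
        start = end = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - end <= delta_t:
                # Time: close enough to the previous report, same event.
                end, count = ts, count + 1
            else:
                events.append((event_type, grid_id, start, end, count))
                start = end = ts
                count = 1
        events.append((event_type, grid_id, start, end, count))
    return events


t0 = datetime(2013, 10, 22, 19, 0, 0)
sample = [
    ("traffic", 5889, t0),
    ("traffic", 5889, t0 + timedelta(minutes=10)),
    ("traffic", 5889, t0 + timedelta(hours=3)),
]
for event in aggregate_reports(sample, timedelta(minutes=30)):
    print(event)
```

With a 30-minute window, the first two reports merge into one event and the third, three hours later, starts a new one.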

[Figure: a geohash grid cell about 0.6 miles wide and 0.38 miles high, bounded by min-lat 37.7490234375, max-lat 37.7545166015625, min-long -122.420654296875, and max-long -122.40966796875, containing the point 37.74933, -122.4106711. Nested grids with alphanumeric cell labels illustrate successive levels of precision.]

Hierarchical spatial structure of geohash for representing locations with variable precision. Here the location string is 5H34.

City Event Extraction – Geohashing
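As an illustration of how a geohash cell is derived, the sketch below implements the widely used base-32 geohash encoding, which repeatedly halves the longitude and latitude ranges. This is an assumption for illustration: the slide's figure uses its own cell labelling (for example 5H34), and the paper's exact variant is not shown here. The function name and return shape are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Return (geohash, (min_lat, max_lat, min_lon, max_lon)) for a point."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    bounds = (lat_range[0], lat_range[1], lon_range[0], lon_range[1])
    return "".join(chars), bounds


# The point shown on the slide.
code, bounds = geohash_encode(37.74933, -122.4106711, precision=6)
print(code, bounds)
```

At six characters (15 longitude bits and 15 latitude bits), the cell bounds for the slide's point coincide with the corner coordinates shown in the figure. Dropping trailing characters yields the enclosing coarser cell, which is the variable precision the caption describes.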

City Event Extraction – Metadata Population Algorithm

City Event Extraction – Event Aggregation Algorithm

Event metadata inference

Spatial filtering

Grouping Events by types

• <traffic, 5889, 2013-10-22 19:24:39, 2013-10-23 18:54:08, 19>

• <concert, 5889, 2013-10-20 19:46:06, 2013-10-21 19:06:29, 35>

• <accident, 32400, 2013-10-20 19:51:10, 2013-10-21 15:53:08, 11>

• <parade, 8672, 2013-08-10 12:57:17, 2013-08-10 18:57:21, 11>

City Event Extraction – A Sample of Extracted Events

Location refers to the geohash number which is mapped to lat-long
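The sample tuples can be read into a small record type for further processing. In the sketch below the class and field names are hypothetical. In particular, the slide does not label the fifth field, so treating it as an impact value is an assumption based on the "Impact Assessment" component of the architecture.

```python
from dataclasses import dataclass
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class CityEvent:
    event_type: str
    geohash_id: int   # geohash number, mapped to lat-long elsewhere
    start: datetime
    end: datetime
    impact: int       # ASSUMED meaning of the unlabelled fifth field


def parse_event(text):
    """Parse a line such as '<traffic, 5889, 2013-10-22 19:24:39, ..., 19>'."""
    body = text.strip().lstrip("• ").strip("<>")
    event_type, geohash_id, start, end, last = [f.strip() for f in body.split(",")]
    return CityEvent(
        event_type=event_type,
        geohash_id=int(geohash_id),
        start=datetime.strptime(start, TIME_FORMAT),
        end=datetime.strptime(end, TIME_FORMAT),
        impact=int(last),
    )


event = parse_event("• <traffic, 5889, 2013-10-22 19:24:39, 2013-10-23 18:54:08, 19>")
print(event.event_type, event.end - event.start)
```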

• City Event Annotation
  – Automated creation of training data
  – Annotation task (our CRF model vs. baseline CRF model)
• City Event Extraction
  – Use aggregation algorithm for event extraction
  – Extracted events AND ground truth
• Dataset (Aug – Nov 2013), ~8 GB of data on disk
  – Over 8 million tweets (extract events using Alg. 1 and 2)
  – Over 162 million sensor data points (find delays by looking at change in travel time for links serving as ground truth)
  – 311 active events and 170 scheduled events (readily available events as ground truth)

Evaluation

Evaluation – Automated Creation of Training Data (to train CRF model)

Evaluation over 500 randomly chosen tweets from around 8,000 annotated tweets

Aho-Corasick [Commentz-Walter 1979] string matching algorithm implemented by LingPipe [Alias-i 2008]
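The idea of creating training data automatically from knowledge bases can be sketched as dictionary matching over tokens. The slide names an Aho-Corasick matcher from LingPipe; the version below swaps in a naive greedy longest-match lookup to stay short, and the dictionary entries, function name, and `max_len` parameter are all hypothetical.

```python
def auto_annotate(tokens, phrase_labels, max_len=4):
    """Assign BIO tags by greedy longest match against a phrase dictionary.

    phrase_labels maps a tuple of lowercase tokens to "LOCATION" or "EVENT".
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrase_labels.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                matched = n
                break
        i += matched if matched else 1
    return tags


# Toy dictionaries standing in for location and event knowledge bases.
phrases = {
    ("cherry", "ave"): "LOCATION",
    ("blocked",): "EVENT",
    ("accident",): "EVENT",
}
print(auto_annotate("two lanes blocked near Cherry Ave".split(), phrases))
# ['O', 'O', 'B-EVENT', 'O', 'B-LOCATION', 'I-LOCATION']
```

A real dictionary with many thousands of phrases is where an automaton-based matcher pays off, since it scans the text once regardless of dictionary size.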

Evaluation – Annotation Task (our CRF model vs. baseline CRF model)

Baseline CRF Model

Our CRF Model

Baseline CRF model (trained on a large, manually created dataset) works well on generic tasks.

Our CRF model trained on automatically generated training data performs on par with the baseline.

Our CRF model does better on the event extraction task due to the availability of event-related knowledge.

Ground Truth Data (only incident reports) -- City Event Extraction

We have around 162 million data records from sensors monitoring over 3,700 links in the San Francisco Bay Area.

A data record: <link_id, link_speed, link_volume, link_travel_time, time_stamp>
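Delays can be derived from such records by looking at changes in travel time per link. The sketch below is one simple way to do that, not the method used in the evaluation: it compares each record with the link's median travel time, and both the median baseline and the `factor` threshold of 1.5 are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import NamedTuple


class LinkRecord(NamedTuple):
    link_id: int
    link_speed: float
    link_volume: float
    link_travel_time: float
    time_stamp: str


def find_delays(records, factor=1.5):
    """Return records whose travel time exceeds factor x the link's median."""
    by_link = defaultdict(list)
    for record in records:
        by_link[record.link_id].append(record)

    delays = []
    for recs in by_link.values():
        baseline = median(r.link_travel_time for r in recs)
        delays.extend(r for r in recs if r.link_travel_time > factor * baseline)
    return delays


records = [
    LinkRecord(1, 55.0, 120.0, 60.0, "2013-10-22 19:00:00"),
    LinkRecord(1, 54.0, 125.0, 62.0, "2013-10-22 19:05:00"),
    LinkRecord(1, 20.0, 300.0, 150.0, "2013-10-22 19:10:00"),
    LinkRecord(1, 56.0, 118.0, 59.0, "2013-10-22 19:15:00"),
]
for delayed in find_delays(records):
    print(delayed.link_id, delayed.time_stamp, delayed.link_travel_time)
```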

GREEN – Active Events

YELLOW – Scheduled Events

311 active events and 170 scheduled events

Evaluation – Use Aggregation Algorithm for Event Extraction

Evaluation – Extracted Events AND Ground Truth Verification

Evaluation Metric for Comparing Events with Ground Truth:
• Complementary Events
  – Additional information
  – e.g., slow traffic from sensor data and accident from textual data
• Corroborative Events
  – Additional confidence
  – e.g., accident event supporting an accident report from ground truth
• Timeliness
  – Additional insight
  – e.g., knowing poor visibility before formal report from ground truth
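One way to read these three categories operationally is sketched below. The matching rule is an assumption made for illustration, since the slides do not define one: an extracted event and a ground-truth report are compared only if they share a grid cell and start within `max_gap` of each other, the same type counts as corroborative, a different type as complementary, and timeliness is the lead time of the extracted event. All names are hypothetical.

```python
from datetime import datetime, timedelta


def compare_with_ground_truth(extracted, reported, max_gap):
    """Each event is a tuple (event_type, grid_id, start_time).

    Returns (relation, lead_time), or None if the two do not match
    in space and time.
    """
    ext_type, ext_grid, ext_start = extracted
    rep_type, rep_grid, rep_start = reported
    if ext_grid != rep_grid or abs(ext_start - rep_start) > max_gap:
        return None
    relation = "corroborative" if ext_type == rep_type else "complementary"
    # Positive lead time: the tweet-derived event preceded the formal report.
    lead_time = rep_start - ext_start
    return relation, lead_time


tweet_event = ("accident", 5889, datetime(2013, 10, 22, 19, 0))
formal_report = ("accident", 5889, datetime(2013, 10, 22, 19, 20))
print(compare_with_ground_truth(tweet_event, formal_report, timedelta(hours=1)))
```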


Complementary Events



Corroborative Events



Corroborative Events Complementary Events


Timeliness


• People in a city do talk about various infrastructure-related issues, specifically traffic.

• City traffic-related events can be extracted from tweets using sequence labeling techniques and spatial aggregation algorithms.

• Domain knowledge of events and locations can be utilized to create large training datasets to train sequence labeling algorithms.

• Traffic events extracted from Twitter can be complementary, corroborative, and timely compared to formal reports of traffic events.

Conclusion

[Kumaran and Allan 2004] Giridhar Kumaran and James Allan. 2004. Text classification and named entities for new event detection. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 297–304.

[Lampos and Cristianini 2012] Vasileios Lampos and Nello Cristianini. 2012. Nowcasting events from the social web with statistical learning. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 4 (2012), 72.

[Roitman et al. 2012] Haggai Roitman, Jonathan Mamou, Sameep Mehta, Aharon Satt, and LV Subramaniam. 2012. Harnessing the Crowds for smart city sensing. In Proceedings of the 1st international workshop on Multimodal crowd sensing. ACM, 17–18.

[Ritter et al. 2012] Alan Ritter, Oren Etzioni, Sam Clark, and others. 2012. Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1104–1112.

[Wang et al. 2012] Xiaofeng Wang, Matthew S Gerber, and Donald E Brown. 2012. Automatic crime prediction using events extracted from twitter posts. In Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 231–238.

[Becker et al. 2011] Hila Becker, Mor Naaman, and Luis Gravano. 2011. Beyond Trending Topics: Real-World Event Identification on Twitter. In ICWSM.

[Alias-i 2008] Alias-i. 2008. LingPipe 4.1.0. (2008). http://alias-i.com/lingpipe

[Commentz-Walter 1979] Beate Commentz-Walter. 1979. A string matching algorithm fast on the average. Springer.

References
