Overview of the 2014 ALTA Shared Task: Identifying Expressions of Locations in Tweets

Post on 10-Jul-2015


Diego Molla (Macquarie University), Sarvnaz Karimi (CSIRO)

ALTA 2014, Melbourne, Australia

The 2013 Task The Tweet Data Kaggle in Class Evaluation Results


The 2014 ALTA Shared Task

The Tweet Data

Kaggle in Class

Evaluation Results

2014 ALTA Shared Task Diego Molla, Sarvnaz Karimi 2/21


The 2014 Shared Task

Task: Identify Expressions of Locations in Tweets

Categories: student, open

Prize: $500 (IBM Research Shared Task Student Prize)

Framework: Kaggle in Class

Student Category

- All members are university students.
- No members are employed full-time.
- No members have a PhD.

Open Category

- Any other teams.


Identify Expressions of Locations in Tweets

Tweet: France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ
Locations: France, Germany, US, UK, Libya

Tweet: Dutch investigators not going to MH17 crash site in eastern Ukraine due to security concerns, OSCE monitors say
Locations: MH17 crash site, eastern Ukraine

Tweet: Seeing early signs of potential flash flooding with stationary storms near St. Marys, Tavistock, Cambridge #onstorm pic.twitter.com/BtogIxgQ5G
Locations: St. Marys, Tavistock, Cambridge



1. When people discuss events, often they mention the location.

2. In the case of emergencies, such locations are very useful.

3. Recommender systems can use location information to improve their recommendations.


Location Expressions in Tweets

What is a location?

Any specific mention of a country, city, suburb, or point of interest (POI).

- Macquarie Centre.
- Ryde Hospital.

Where can we find location mentions?

- In the text.
- In hashtags: #Australia.
- In URLs: http://abc.net.au/melbourne/.
- In mentions: @Australia.
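The four places listed above can be separated with a few regular expressions. The following is a minimal sketch, not the method used by the organisers or the participants; the function name `tweet_parts` and the three patterns are assumptions made for illustration, and real tweets would need more robust tokenisation.

```python
import re

# Illustrative patterns only (assumptions, not part of the shared task).
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def tweet_parts(tweet):
    """Split a tweet into plain-text words, hashtags, mentions and URLs."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so that '#' or '@' inside a URL is not picked up.
    rest = URL_RE.sub(" ", tweet)
    hashtags = HASHTAG_RE.findall(rest)
    mentions = MENTION_RE.findall(rest)
    plain = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", rest))
    return {
        "text": plain.split(),
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
    }


parts = tweet_parts(
    "Floods near Ryde Hospital #Australia http://abc.net.au/melbourne/ @Australia"
)
```

Each of the four lists can then be searched for location names separately, for example by matching against a gazetteer.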


Related Work

Named entity recognition in Twitter

- LabelledLDA for NER and PoS on tweets (Ritter et al. 2011).
- TwiNER: Unsupervised, using external sources (e.g. Wikipedia) for NER on tweets (Li et al. 2012).

Location extraction

- Twitcident: Using NER to identify location information on tweets (Abel et al. 2012).
- Ensemble classifiers to predict home locations of tweets (Mahmud et al. 2012).
- NER tools, used out of the box vs. re-trained on tweets (Lingad et al. 2013).


Tweet Collection


- From Lingad et al. (2013).
- Tweets from late 2010 to late 2012.
- Augmented with additional tweets.
- The data carry several annotations; only the location mentions were used for the ALTA shared task.

- Originally, 3,220 tweets.
- Available for the ALTA shared task: 3,047.
- After removing duplicates: 3,003.


Data Contents

Data for training and development

- Tweet IDs.
- Location mentions.
- Tweet download script.

Copyright restrictions

- Twitter does not allow the distribution of tweets.
- The shared task participants were asked to download the tweets themselves.
- Depending on the network status and changes by Twitter and Twitter users, specific tweets might not be available for download.


Data Format

Format of location mentions

- All multi-word terms split into their single words.
- Word duplicates are numbered.
- All punctuation marks are removed, including #.
- Words are lowercased.
- Data in a CSV file.

- Tweet ID1, france germany us uk libya
- Tweet ID2, australia australia2 australia3
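The formatting rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the organisers' script: the function name `normalise_locations` is hypothetical, and the slides do not say how punctuation inside a word (such as a hyphen) is handled, so this sketch simply deletes it.

```python
import re
from collections import Counter


def normalise_locations(mentions):
    """Turn a list of location mentions into one space-separated string.

    Multi-word mentions are split into single words, words are lowercased,
    punctuation (including '#') is removed, and repeated words are numbered
    from the second occurrence on: australia australia2 australia3.
    """
    seen = Counter()
    tokens = []
    for mention in mentions:
        for word in mention.lower().split():
            # Assumption: drop every non-alphanumeric character.
            word = re.sub(r"[^\w]|_", "", word)
            if not word:
                continue
            seen[word] += 1
            tokens.append(word if seen[word] == 1 else f"{word}{seen[word]}")
    return " ".join(tokens)


row1 = normalise_locations(["France", "Germany", "US", "UK", "Libya"])
row2 = normalise_locations(["#Australia", "Australia", "australia"])
row3 = normalise_locations(["St. Marys", "Tavistock", "Cambridge"])
```

Numbering the duplicates keeps every word in a row unique, which lets a set-based, word-level evaluation still count repeated mentions.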


Kaggle in Class


- Kaggle offers a Web-based framework for data-driven competitions.
- A large base of potential participants.
- Potentially large prizes for the participants.
- Fee-based for the organisers; free for the participants.

Kaggle in Class

- Free for organisers and participants.
- Limited user support by Kaggle.
- Used by course-based competitions.


ALTA Shared Task in Kaggle in Class


Features of Kaggle in Class

- Public leaderboard: all participants can submit and compare with other participants.
- Automated evaluation: organisers can choose among several evaluation metrics.
- Public and private partitions: a private partition of the test data is withheld for the final ranking.
  - Public: 501 tweets.
  - Private: 502 tweets.
- Discussion forum: for communication among participants.


Evaluation Metric

Mean F1-Score

- Compute recall and precision of each individual word.
- This allows evaluation of partially correct location mentions.

$$F_1 = \frac{2pr}{p + r}$$

- Target: senegal senegal2
- System output: senegal christchurch brighton
- p = 1/3
- r = 1/2
- F1 = 0.4
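The word-level metric above can be sketched in Python as follows. This is an illustrative reimplementation, not Kaggle's own code: the function names are hypothetical, and the score for a tweet where both the target and the system output are empty is not given in the slides, so treating it as 1.0 is an assumption.

```python
def f1_score(target, predicted):
    """Word-level F1 for one tweet, given two lists of words."""
    target, predicted = set(target), set(predicted)
    if not target and not predicted:
        return 1.0  # assumption: nothing to find and nothing returned
    correct = len(target & predicted)
    if correct == 0:
        return 0.0
    p = correct / len(predicted)  # precision
    r = correct / len(target)     # recall
    return 2 * p * r / (p + r)


def mean_f1(targets, predictions):
    """Average the per-tweet F1 scores over all tweets."""
    scores = [f1_score(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores)


example = f1_score(
    "senegal senegal2".split(),
    "senegal christchurch brighton".split(),
)
```

For the example on the slide, one of three output words is correct (p = 1/3) and one of two target words is found (r = 1/2), so `example` evaluates to 0.4 up to floating-point rounding.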




- Kaggle in Class, a useful means to run the shared task.
- Few participants, but very active.
  - 168 runs in the combined 4 teams.
- Participants (read the Proceedings!) used a combination of:
  1. sequence labellers,
  2. feature engineering, and
  3. combined classifiers.



Team        Category  Public  Private
MQ          Student   0.781   0.792
AUT NLP     Open      0.748   0.747
Yarra       Student   0.768   0.732
JK Rowling  Open      0.751   0.726
