
Multimedia Lab @ ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations

Frederic Godin
Multimedia Lab, Ghent University - iMinds, Ghent, Belgium
[email protected]

Baptist Vandersmissen
Multimedia Lab, Ghent University - iMinds, Ghent, Belgium
[email protected]

Wesley De Neve
Multimedia Lab & IVY Lab, Ghent University - iMinds & KAIST, Ghent, Belgium & Daejeon, South Korea
[email protected]

Rik Van de Walle
Multimedia Lab, Ghent University - iMinds, Ghent, Belgium
[email protected]

    Abstract

Due to the short and noisy nature of Twitter microposts, detecting named entities is often a cumbersome task. As part of the ACL 2015 Named Entity Recognition (NER) shared task, we present a semi-supervised system that detects 10 types of named entities. To that end, we leverage 400 million Twitter microposts to generate powerful word embeddings as input features and use a neural network to execute the classification. To further boost the performance, we employ dropout to train the network and leaky Rectified Linear Units (ReLUs). Our system achieved the fourth position in the final ranking, without using any kind of hand-crafted features such as lexical features or gazetteers.

    1 Introduction

Users on Online Social Networks such as Facebook and Twitter have the ability to share microposts with their friends or followers. These microposts are short and noisy, and are therefore much more difficult to process for existing Natural Language Processing (NLP) pipelines. Moreover, due to the informal and contemporary nature of these microposts, they often contain Named Entities (NEs) that are not part of any gazetteer.

In this challenge, we tackled Named Entity Recognition (NER) in microposts. The goal was to detect named entities and classify them into one of the following 10 categories: company, facility, geo-location, music artist, movie, person, product, sports team, tv show and other entities. To do so, we only used word embeddings that were automatically inferred from 400 million Twitter microposts as input features. Next, these word embeddings were used as input to a neural network to classify the words in the microposts. Finally, a post-processing step was executed to check for inconsistencies, given that we classified on a word-per-word basis and that a named entity can span multiple words.

The challenge consisted of two subtasks. For the first subtask, the participants only needed to detect NEs without categorizing them. For the second subtask, the NEs also needed to be categorized into one of the 10 categories listed above. Throughout the remainder of this paper, only the latter subtask will be considered, given that solving subtask two makes subtask one trivial.

    2 Related Work

NER in news articles gained substantial popularity with the CoNLL 2003 shared task, where the challenge was to classify four types of NEs: persons, locations, companies and a set of miscellaneous entities (Tjong Kim Sang and De Meulder, 2003). However, all systems used hand-crafted features such as lexical features, look-up tables and corpus-related features. These systems provide good performance at a high engineering cost and need a lot of annotated training data (Nadeau and Sekine, 2007). Therefore, a lot of effort is needed to adapt them to other types of corpora.

More recently, semi-supervised systems have been shown to achieve near state-of-the-art results with much less effort (Turian et al., 2010; Collobert et al., 2011). These systems first learn word representations from large corpora in an unsupervised way and then use these word representations as input features for supervised training, instead of using hand-crafted input features. There exist three major types of word representations: distributional, clustering-based and distributed word representations, where the last type is also known as a word embedding. A very popular and fast-to-train word embedding is the word2vec word representation of Mikolov et al. (2013). When complemented with traditional hand-crafted features, word representations can yield F1-scores of up to 91% (Tkachenko and Simanovsky, 2012).

However, when applied to Twitter microposts, the F1-score drops significantly. For example, Liu et al. (2011) report an F1-score of 45.8% when applying the Stanford NER tagger to Twitter microposts, and Ritter et al. (2011) even report an F1-score of 29% on their Twitter micropost dataset. Therefore, many researchers (Cano et al., 2013; Cano et al., 2014) trained new systems on Twitter microposts, but mainly relied on cost-intensive hand-crafted features, sometimes complemented with cluster-based features.

Therefore, in this paper, we investigate the power of word embeddings for NER applied to microposts. Although adding hand-crafted features such as lexical features or gazetteers would probably improve our F1-score, we focus only on word embeddings, given that this approach can easily be applied to different corpora and thus quickly lead to good results.

    3 System Overview

The system proposed for tackling this challenge consists of three steps. First, the individual words are converted into word representations. For this, only the word embeddings of Mikolov et al. (2013) are used. Next, we feed the word representations to a Feed-Forward Neural Network (FFNN) to classify the individual words with a matching tag. Finally, we execute a simple, rule-based post-processing step in which we check the coherence of individual tags within a Named Entity (NE).

3.1 Creating Feature Representations

Recently, Mikolov et al. (2013) introduced an efficient way for inferring word embeddings that are effective in capturing syntactic and semantic relationships in natural language. In general, a word embedding of a particular word is inferred by using the previous or future words within a number of microposts/sentences. Mikolov et al. (2013) proposed two architectures for doing this: the Continuous Bag Of Words (CBOW) model and the Skip-gram model.

To infer the word embeddings, a large dataset of microposts is used. The algorithm iterates a number of times over this dataset while updating the word embeddings of the words within the vocabulary of the dataset. The final result is a look-up table which can be used to convert every word w(t) into a feature vector we(t). If the word is not in the vocabulary, a vector containing only zeros is used.
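
As a minimal sketch of this look-up step, assuming the embeddings are available as a Python dictionary mapping tokens to vectors (the names below are illustrative, not taken from the paper):

    import numpy as np

    EMBEDDING_DIM = 400  # dimensionality used later in this paper

    def lookup(word, embeddings, dim=EMBEDDING_DIM):
        """Map a token to its embedding; out-of-vocabulary tokens get a zero vector."""
        return embeddings.get(word, np.zeros(dim, dtype=np.float32))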

3.2 Neural Network Architecture

Based on the successful application of Feed-Forward Neural Networks (FFNNs) using word embeddings as input features for both recognizing NEs in news articles (Turian et al., 2010; Collobert et al., 2011) and Part-of-Speech tagging of Twitter microposts (Godin et al., 2014), a FFNN is used as the underlying classification algorithm. Because a NE can consist of multiple words, the BIO (Begin, Inside, Outside NE) notation is used to classify the words. Given that there are 10 different NE categories that each have a Begin and an Inside tag, the FFNN assigns to every word w(t) a tag tag(t) out of 21 different tags.
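
For concreteness, the full tag set can be enumerated as follows; the exact tag strings are an assumption, while the category names are those used in the shared task results:

    # 10 shared-task categories (names as used in the results table)
    CATEGORIES = ['company', 'facility', 'geo-loc', 'movie', 'musicartist',
                  'other', 'person', 'product', 'sportsteam', 'tvshow']

    # one Begin and one Inside tag per category, plus a single Outside tag
    TAGS = ['O'] + ['%s-%s' % (prefix, cat) for cat in CATEGORIES for prefix in ('B', 'I')]
    assert len(TAGS) == 21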

Because the tag tag(t) of a word w(t) is also determined by the surrounding words, a context window centered around w(t) that contains an odd number of words is used. As shown in Figure 1, the corresponding word embeddings we(t) are concatenated and used as input to the FFNN. This is the input layer of the neural network.
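
A minimal sketch of how such an input vector could be assembled, assuming the same dictionary-based look-up as above; the paper's Figure 1 pads the edges of a micropost with BORDER tokens, which are represented here simply as zero vectors (an assumption):

    import numpy as np

    def build_input(tokens, t, embeddings, window=5, dim=400):
        """Concatenate the embeddings of a context window centered on position t."""
        half = window // 2
        vectors = []
        for i in range(t - half, t + half + 1):
            if 0 <= i < len(tokens):
                vectors.append(embeddings.get(tokens[i], np.zeros(dim, dtype=np.float32)))
            else:
                vectors.append(np.zeros(dim, dtype=np.float32))  # BORDER padding
        return np.concatenate(vectors)  # shape: (window * dim,)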

The main design parameters of the neural network are the type of activation function, the number of hidden units and the number of hidden layers. We considered two types of activation functions: the classic tanh activation function and the newer (leaky) Rectified Linear Units (ReLUs). The output layer is a standard softmax function.

    3.3 Postprocessing the Neural Network Output

Given that NEs can span multiple words and given that the FFNN classifies individual words, we apply a postprocessing step after a micropost is completely classified to correct inconsistencies. The tags of the words are changed according to the following two rules (a code sketch follows below):

If the NE does not start with a word that has a B(egin)-tag, we select the word before the word with the I(nside)-tag, replace its O(utside)-tag with a B-tag and copy the category of the I-tag.

If the individual words of a NE have different categories, we select the most frequently occurring category. If it is a tie, we select the category of the last word within the NE.
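
The following is a hedged sketch of these two rules, assuming tags are strings such as 'B-geoloc' and 'I-geoloc' (the exact string format is an assumption); edge cases the paper does not specify, such as an I-tag on the very first word, are left untouched:

    from collections import Counter

    def postprocess(tags):
        """Apply the two consistency rules to the per-word tags of one micropost."""
        tags = list(tags)

        # Rule 1: if an NE starts with an I-tag, give the preceding O-tagged word
        # a B-tag and copy the category of the I-tag.
        for i, tag in enumerate(tags):
            if tag.startswith('I-') and i > 0 and tags[i - 1] == 'O':
                tags[i - 1] = 'B-' + tag[2:]

        # Rule 2: if the words of one NE carry different categories, keep the most
        # frequent category; on a tie, keep the category of the last word in the NE.
        i = 0
        while i < len(tags):
            if tags[i].startswith('B-'):
                j = i + 1
                while j < len(tags) and tags[j].startswith('I-'):
                    j += 1
                categories = [t[2:] for t in tags[i:j]]
                counts = Counter(categories)
                top = [c for c in counts if counts[c] == max(counts.values())]
                category = categories[-1] if len(top) > 1 else top[0]
                tags[i] = 'B-' + category
                for k in range(i + 1, j):
                    tags[k] = 'I-' + category
                i = j
            else:
                i += 1
        return tags

For example, the tag sequence O, I-geoloc, I-person would be corrected to B-geoloc, I-geoloc, I-geoloc under this sketch.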

    4 Experimental Setup

    4.1 Dataset

The challenge provided us with three different datasets: train, dev and dev_2015. These datasets have 1795, 599 and 420 microposts, respectively, containing 1140, 356 and 272 NEs, respectively. The train and dev datasets came from the same period and therefore have some overlap in NEs. Moreover, they contained the complete dataset of Ritter et al. (2011). The microposts within dev_2015, however, were sampled more recently and resemble the test set of this challenge. The test set consisted of 1000 microposts, having 661 NEs. The train and dev datasets will be used as training set throughout the experiments, and the dev_2015 dataset will be used as development set.

For inferring the word embeddings, a set of raw Twitter microposts was used, collected during 300 days using the Twitter Streaming API, from 1/3/2013 till 28/2/2014. After removing all non-English microposts using the micropost language classifier of Godin et al. (2013), 400 million raw English Twitter microposts were left.

    4.2 Preprocessing the Data

For preprocessing the 400 million microposts, we used the same tokenizer as Ritter et al. (2011). Additionally, we used replacement tokens for URLs, mentions and numbers, both on the challenge dataset and on the 400 million microposts we collected. However, we did not replace hashtags, as doing so was experimentally shown to decrease the accuracy.
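
A minimal sketch of such a token replacement step; the placeholder strings and patterns are illustrative assumptions, as the paper does not specify them:

    import re

    def normalize(tokens):
        """Replace URLs, mentions and numbers with placeholder tokens; hashtags stay untouched."""
        normalized = []
        for token in tokens:
            if re.match(r'https?://\S+', token):
                normalized.append('<url>')
            elif token.startswith('@') and len(token) > 1:
                normalized.append('<mention>')
            elif re.fullmatch(r'[-+]?\d+([.,]\d+)?', token):
                normalized.append('<number>')
            else:
                normalized.append(token)
        return normalized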

    4.3 Training the Model

The model was trained in two phases. First, the look-up table containing per-word feature vectors was constructed. To that end, we applied the word2vec software (v0.1c) of Mikolov et al. (2013) to our preprocessed dataset of 400 million Twitter microposts to generate word embeddings.

Next, we trained the neural network. To that end, we used the Theano library (v0.6) (Bastien et al., 2012), which made it easy to use our NVIDIA Titan Black GPU. We used mini-batch stochastic gradient descent with a batch size of 20, a learning rate of 0.01 and a momentum of 0.5. We used the standard negative log-likelihood cost function to update the weights. We used dropout on both the input and hidden layers to prevent overfitting and used (Leaky) Rectified Linear Units (ReLUs) as hidden units (Srivastava et al., 2014). For this, we used the implementation of the Lasagne¹ library. We trained the neural network on both the train and dev datasets and iterated until the accuracy on the dev_2015 set no longer improved.
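
A hedged sketch of such a network in Lasagne/Theano, using the hyperparameters stated above; the 500 hidden units and input window of five words follow the best configuration reported in Section 5, and the dropout rate of 0.5 and leak rate of 0.01 are taken from Section 5.2. This is an illustration, not the authors' exact code:

    import theano
    import theano.tensor as T
    import lasagne

    WINDOW, EMB_DIM, N_TAGS = 5, 400, 21

    input_var = T.matrix('inputs')     # concatenated context-window embeddings
    target_var = T.ivector('targets')  # gold tag indices

    network = lasagne.layers.InputLayer(shape=(None, WINDOW * EMB_DIM), input_var=input_var)
    network = lasagne.layers.DropoutLayer(network, p=0.5)            # dropout on the input layer
    network = lasagne.layers.DenseLayer(
        network, num_units=500,
        nonlinearity=lasagne.nonlinearities.LeakyRectify(0.01))      # leaky ReLU hidden layer
    network = lasagne.layers.DropoutLayer(network, p=0.5)            # dropout on the hidden layer
    network = lasagne.layers.DenseLayer(
        network, num_units=N_TAGS,
        nonlinearity=lasagne.nonlinearities.softmax)                  # softmax over the 21 tags

    prediction = lasagne.layers.get_output(network)
    loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
    params = lasagne.layers.get_all_params(network, trainable=True)
    updates = lasagne.updates.momentum(loss, params, learning_rate=0.01, momentum=0.5)

    train_fn = theano.function([input_var, target_var], loss, updates=updates)

Mini-batches of 20 such input vectors would then be fed to train_fn; categorical cross-entropy over a softmax output corresponds to the negative log-likelihood criterion mentioned above.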

    4.4 Baseline

For evaluating our system, we will use two different baselines. The word embeddings and the neural network architecture will be evaluated in terms of accuracy at the word level. For these components, the baseline system simply assigns the O-tag to every word, yielding an accuracy of 93.53%. For the postprocessing step and the overall system evaluation, we will use the baseline provided by the challenge, which evaluates at the NE level. That baseline system uses lexical features and gazetteers and yields an F1-score of 34.29%. Note that we will only report the performance on subtask two (i.e., categorizing the NEs), except for the final evaluation.
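
The tables that follow report an "error rate reduction" next to each score. The paper does not define it explicitly, but the reported values are consistent with the relative change in error with respect to the baseline, i.e. (a hedged reading):

    \[
    \mathrm{ERR} = \frac{(1 - \mathrm{score}_{\mathrm{model}}) - (1 - \mathrm{score}_{\mathrm{baseline}})}{1 - \mathrm{score}_{\mathrm{baseline}}}
    \]

For example, an accuracy of 95.64% against the 93.53% baseline gives (4.36% - 6.47%) / 6.47% ≈ -32.6%, consistent with Table 1; the same computation with F1-scores reproduces the values in Tables 4 and 5 up to rounding.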

¹ https://github.com/Lasagne/Lasagne

[Figure 1 shows a look-up table with word embeddings feeding a feed-forward neural network with one hidden layer and a softmax output layer. The example micropost is "goin to New York", with w(1) = goin, w(2) = to, w(3) = New, w(4) = York and tags tag(1) = O, tag(2) = O, tag(3) = B-geoloc, tag(4) = I-geoloc; the edges of the context window are padded with BORDER vectors.]

Figure 1: High-level illustration of the FFNN that classifies each word as part of one of the 10 named entity classes. At the input, a micropost containing four words is given. The different words w(t) are first converted into feature representations we(t) using a look-up table of word embeddings. Next, a feature vector is constructed for each word by concatenating all the feature representations we(t) of the other words within the context window. In this example, a context window of size three is used. One-by-one, these concatenated vectors are fed to the FFNN. In this example, a one-hidden-layer FFNN is used. The output of the FFNN is the tag tag(t) of the corresponding word w(t).

    5 Experiments

    5.1 Word Embeddings

For inferring the word embeddings from the 400 million microposts, we mainly followed the suggestions of Godin et al. (2014), namely, that the best word embeddings are inferred using a Skip-gram architecture and negative sampling. We used the default parameter settings of the word2vec software, except for the context window.

As noted by Bansal et al. (2014), the type of word embedding you create depends on the size of the context window. In particular, a bigger context window creates topic-oriented embeddings while a smaller context window creates syntax-oriented embeddings. Therefore, we trained an initial version of our neural network using an input window of five words and 300 hidden nodes, and evaluated the quality of the word embeddings based on the classification accuracy on the dev_2015 dataset. The results of this evaluation are shown in Table 1. Although the difference is small, a smaller context window consistently gave a better result.

Additionally, we evaluated the vector size. As a general rule of thumb, the larger the word embeddings, the better the classification will be (Mikolov et al., 2013). However, too many parameters and too few training examples will lead to suboptimal results and poor generalization. We chose word embeddings of size 400 because smaller embeddings were experimentally shown to capture less detail and resulted in a lower accuracy. Larger word embeddings, on the other hand, made the model too complex to train.

The final word2vec word embeddings model has a vocabulary of 3,039,345 words and word representations of dimensionality 400.

Table 1: Evaluation of the influence of the context window size of the word embeddings on the accuracy of predicting NER tags, using a neural network with an input window of five, 500 hidden Leaky ReLU units and dropout. All word embeddings are inferred using negative sampling and a Skip-gram architecture and have a vector size of 400. The baseline accuracy is achieved when tagging all words of a micropost with the O-tag.

Context Window   Accuracy   Error Rate Reduction
Baseline         93.53%
1                95.64%     -32.57%
3                95.57%     -31.44%
5                95.52%     -30.72%

The model was trained using the Skip-gram architecture and negative sampling (k = 5) for five iterations, with a context window of one and subsampling with a factor of 0.001. Additionally, to be part of the vocabulary, words should occur at least five times in the corpus.
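
These settings could be reproduced with, for example, the gensim re-implementation of word2vec (the paper itself used the original word2vec C tool, v0.1c); a hedged sketch using gensim >= 4.0 parameter names:

    from gensim.models import Word2Vec

    # Illustrative placeholder corpus: an iterable of tokenized, preprocessed microposts.
    corpus = [['goin', 'to', 'new', 'york']] * 10

    model = Word2Vec(
        sentences=corpus,
        vector_size=400,  # embedding dimensionality
        window=1,         # context window of one
        sg=1,             # Skip-gram architecture
        negative=5,       # negative sampling with k = 5
        sample=0.001,     # subsampling factor
        min_count=5,      # keep only words occurring at least five times
        epochs=5,         # five iterations over the corpus
    )
    model.wv.save_word2vec_format('twitter_word2vec_400d.bin', binary=True)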

    5.2 The Neural Network Architecture

The next step is to evaluate the neural network architecture. The most important parameters are the size of the input layer, the size of the hidden layers and the type of hidden units. Although we experimented with multiple hidden layers, the accuracy did not improve. We surmise that (1) the training set is too small and that (2) the word embeddings already contain a fixed summary of the information and therefore limit the feature learning capabilities of the neural network. Note that the word embeddings at the input layer can also be seen as a (fixed) hidden layer.

First, we will evaluate the activation function and the effectiveness of dropout (p = 0.5). We compared the classic tanh function and the leaky ReLU with a leak rate of 0.01². As can be seen in Table 2, both activation functions perform equally well when no dropout is applied during training. However, when dropout is applied, the gap between the two configurations becomes larger.

² This is the default value in Lasagne and was shown to work best for us.

Table 2: Evaluation of the influence of the activation function and dropout on the accuracy of predicting NE tags. A fixed neural network with an input window of five, word embeddings of size 400 and 500 hidden units is used. The baseline accuracy is achieved when tagging all words of a micropost with the O-tag.

Activation Function   Dropout   Accuracy   Error Rate Reduction
Baseline                        93.53%
Tanh                  No        95.01%     -22.78%
Tanh                  Yes       95.49%     -30.30%
Leaky ReLU            No        95.02%     -23.01%
Leaky ReLU            Yes       95.64%     -32.57%

The combination of ReLUs and dropout seems to be the best one, compared to the classic configuration of a neural network³.

Next, we will evaluate a number of neural network configurations for which we varied the input layer size and the hidden layer size. The results are depicted in Table 3. Although the differences are small, the best configuration seems to be a neural network with five input words and 500 hidden nodes.

Finally, we evaluate our best model on the NE level instead of on the word level. To that end, we calculated the F1-score of our best model using the provided evaluation script. The F1-score of our best model on dev_2015 is 45.15%, which is an absolute improvement of almost 11% and an error reduction of 16.52% over the baseline (34.29%) provided by the challenge.

    5.3 Postprocessing

As a last step, we corrected the output of the neural network for inconsistencies, because our classifier does not see the labels of the neighbouring words. As can be seen in Table 4, the postprocessing causes a significant error reduction, yielding a final F1-score of 49.09% on the dev_2015 development set. This is an error reduction of 22.52% over the baseline.


³ Given that dropout also acts as a regularization technique, a comparison with other regularization techniques (e.g., L2 regularization) should be conducted to be complete.

Table 3: Evaluation of the influence of the input layer and hidden layer size on the accuracy/error reduction when predicting NE tags. The fixed neural network is trained with dropout, word embeddings of size 400 and ReLUs. The error reduction values are calculated using the baseline which tags all words with an O-tag.

Window   300 hidden units     500 hidden units     1000 hidden units
Three    95.63% / -32.35%     95.58% / -31.66%     95.55% / -31.21%
Five     95.57% / -31.44%     95.64% / -32.57%     95.61% / -32.12%

Table 4: Evaluation of the postprocessing step for detecting named entities. The baseline was provided by the challenge.

Configuration            F1       Error Rate Reduction
Baseline                 34.29%
Without postprocessing   45.15%   -16.52%
With postprocessing      49.09%   -22.52%

Table 5: Evaluation of the postprocessing step for detecting named entities without categorizing them. The baseline was provided by the challenge.

Configuration            F1       Error Rate Reduction
Baseline                 52.63%
Without postprocessing   60.04%   -15.64%
With postprocessing      60.80%   -17.24%

Additionally, we also report the F1-score on the dev_2015 development set for subtask one, where the task was only to detect the NEs without categorizing them into the 10 different categories. For this, we retrained the model with the best parameters of subtask two. The results are shown in Table 5.

If we compare both subtasks, we see that the neural network has a similar error reduction for both subtasks, but that the postprocessing step effectuates a larger error reduction for subtask two. In other words, a common mistake of the neural network is to assign different categories to different words within one NE. These mistakes are easily corrected by the postprocessing step.

    6 Evaluation on the Test Set

Our best model realized an F1-score of 58.82% on subtask one (no categories), hereby realizing an error reduction of 17.84% over the baseline (49.88%). On subtask two (10 categories), an F1-score of 43.75% was realized, yielding an error reduction of 17.32% over the baseline (31.97%). A break-down of the results on the different NE categories can be found in Table 6. Our system ranked fourth in both subtasks.

    7 Conclusion

In this paper, we presented a system to apply Named Entity Recognition (NER) to microposts. Given that microposts are short and noisy compared to news articles, we did not want to invest time in crafting new features that would improve NER for microposts. Instead, we implemented the semi-supervised architecture of Collobert et al. (2011) for NER in news articles. This architecture only relies on good word embeddings inferred from a large corpus and a simple neural network.

To realize this system, we used the word2vec software to quickly generate powerful word embeddings over 400 million Twitter microposts. Additionally, we employed state-of-the-art neural networks using leaky Rectified Linear Units (ReLUs) and dropout to train the network, showing a significant benefit over classic neural networks. Finally, we checked the output for inconsistencies when categorizing the named entities, resulting in a significant error reduction.

Our word2vec word embeddings trained on 400 million microposts are released to the community and can be downloaded at http://www.fredericgodin.com/software/.

Table 6: Break-down of the results on the test set. The test set contains 661 NEs. Our system detected 523 NEs, of which 259 NEs were correct.

Category      Precision   Recall    F1        #Entities
company       57.89%      28.21%    37.93%    19
facility      30.77%      21.05%    25.00%    26
geo-loc       55.24%      68.10%    61.00%    143
movie         37.50%      20.00%    26.09%    8
musicartist   16.67%      2.44%     4.26%     6
other         20.78%      12.12%    15.31%    77
person        60.89%      63.74%    62.29%    179
product       22.58%      18.92%    20.59%    31
sportsteam    77.42%      34.29%    47.52%    31
tvshow        33.33%      50.00%    40.00%    3
Total         49.52%      39.18%    43.75%    523

    Acknowledgments

The research activities described in this paper were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

References

[Bansal et al. 2014] Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 809-815. The Association for Computer Linguistics.

[Bastien et al. 2012] Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[Cano et al. 2013] Amparo Elizabeth Cano, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2013. Making sense of microposts (#MSM2013) concept extraction challenge. In Making Sense of Microposts (#MSM2013) Concept Extraction Challenge.

[Cano et al. 2014] Amparo E. Cano, Giuseppe Rizzo, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2014. Making sense of microposts (#Microposts2014) named entity extraction & linking challenge. In WWW 2014, 23rd International World Wide Web Conference, Making Sense of Microposts 2014, Seoul, South Korea.

[Collobert et al. 2011] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537.

[Godin et al. 2013] Frederic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web Companion, WWW '13 Companion, pages 593-596, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

[Godin et al. 2014] Frederic Godin, Baptist Vandersmissen, Azarakhsh Jalalvand, Wesley De Neve, and Rik Van de Walle. 2014. Alleviating manual feature engineering for part-of-speech tagging of twitter microposts using distributed word representations. In Workshop on Modern Machine Learning and Natural Language Processing (NIPS 2014).

[Liu et al. 2011] Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 359-367, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Mikolov et al. 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Nadeau and Sekine 2007] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26, January. Publisher: John Benjamins Publishing Company.

[Ritter et al. 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524-1534, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Srivastava et al. 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958.

[Tjong Kim Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142-147. Edmonton, Canada.

[Tkachenko and Simanovsky 2012] Maksim Tkachenko and Andrey Simanovsky. 2012. Named entity recognition: Exploring features. In 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, Vienna, Austria, September 19-21, 2012, pages 118-127.

[Turian et al. 2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384-394, Stroudsburg, PA, USA. Association for Computational Linguistics.
