The NERD project
Giuseppe Rizzo <[email protected]>
What is a Named Entity Recognition task?
A task that aims to locate and classify, in a textual document, the names of persons, organizations, locations, brands and products, as well as numeric expressions such as time, date, money and percent
12 March 2012 Seminar @ Ecole Centrale, Paris 2/21
History of NER benchmarks
CoNLL 2003 and CoNLL 2005
  schema (4 types): person, organization, location and miscellaneous
  language-independent task
ACE 2004, ACE 2005 and ACE 2007
  schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity
  entity recognition, not just names (e.g. descriptions, pronouns)
  find relationships among the extracted entities
TAC 2009 (Knowledge Base Track)
  schema (3 types): person, organization and location
  create a knowledge base from the extracted named entities
ETAPE 2012 (Named Entity Task)
  schema: Quaero (7 main types, 32 sub-types)
12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 3
NER Tools
Standalone software: GATE, Stanford CoreNLP, Temis
Web APIs
Factual comparison of 10 Web NER tools
Tool    | Granularity | Language                | Quota (calls/day) | Sample clients                      | Content chunk
Alchemy | OEN         | EN FR GR IT PT RU SP SW | 30000             | C/C++ C# Java Perl PHP5 Python Ruby | 150KB
DBpedia | OEN         | EN GR* PT* SP*          | unlimited         | Java JS PHP                         | 452KB
Evri    | OED         | EN IT                   | 3000              | AS Java PHP                         | 8KB
Extr    | OEN         | EN                      | 3000              | Java                                | 32KB
Lup     | OEN         | EN FR IT                | unlimited         | N/A                                 | 20KB
Calais  | OEN         | EN FR SP                | 50000             | Java                                | 8KB
Saplo   | OED         | EN SW                   | 1333              | Java JS PHP Python                  | 26KB
WM      | OEN         | EN FR SP                | unlimited         | Java Perl                           | 80KB
Yahoo   | OEN         | EN                      | 5000              | JS PHP                              | 7769KB
Zemanta | OED         | EN                      | 10000             | C# Java JS Perl PHP Python Ruby     | 970KB
Factual comparison (II)
Tool    | Response format     | Entity type number | Entity position
Alchemy | JSON MicroF XML RDF | 324                | N/A
DBpedia | HTML JSON RDF XML   | 320                | char offset
Evri    | HTML JSON RDF       | 5                  | N/A
Extr    | HTML JSON RDF XML   | 34                 | word offset
Lup     | HTML JSON RDFa XML  | 319                | range of chars
Calais  | JSON MicroF         | 95                 | char offset
Saplo   | JSON                | 5                  | N/A
WM      | JSON XML            | 7                  | POS offset
Yahoo   | JSON XML            | 13                 | range of chars
Zemanta | XML JSON RDF        | 81                 | N/A

Tool    | Classification ontologies   | Dereferenceable vocabularies
Alchemy | Alchemy                     | DBpedia FreeBase USCensus UMBEL OpenCyc YAGO MusicBrainz CrunchBase
DBpedia | DBpedia FreeBase Schema.org | DBpedia
Evri    | Evri                        | Evri
Extr    | DBpedia                     | DBpedia
Lup     | DBpedia LinkedMDB           | DBpedia LinkedMDB
Calais  | OpenCalais                  | OpenCalais
Saplo   | N/A                         | N/A
WM      | ESTER                       | DBpedia Geonames CIAFactbook Wikicompanies
Yahoo   | Yahoo                       | Wikipedia
Zemanta | FreeBase                    | Wikipedia IMDB MusicBrainz Amazon YouTube TechCrunch ...
Human-made benchmarks
We performed two evaluation experiments: WEKEX 2011 and ISWC 2011
Each field of an extracted tuple was rated with a Boolean value: true if correct, false otherwise
t = (entity, type, URI, relevant)
Rizzo G., Troncy R. (2011), NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data. In: International Semantic Web Conference 2011 (ISWC'11), Bonn, Germany.
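The rating scheme above can be sketched in a few lines of Python. This is only an illustration: the `Rating` class, the `field_accuracy` helper and the sample judgements are invented here, and reading the per-field share of `true` ratings as a score is an assumption, not necessarily the measure used in the papers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

FIELDS = ("entity", "type", "uri", "relevant")


@dataclass(frozen=True)
class Rating:
    """One human judgement of a tuple t = (entity, type, URI, relevant).

    Each attribute is True if the rater judged that field correct.
    """
    entity: bool
    type: bool
    uri: bool
    relevant: bool


def field_accuracy(ratings: Iterable[Rating]) -> Dict[str, float]:
    """Share of ratings marked True, computed separately for each field."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("need at least one rating")
    return {
        name: sum(getattr(r, name) for r in ratings) / len(ratings)
        for name in FIELDS
    }


# Invented judgements, for illustration only.
SAMPLE = [
    Rating(True, True, True, False),
    Rating(True, False, True, True),
    Rating(True, True, False, True),
    Rating(False, False, False, False),
]
print(field_accuracy(SAMPLE))
```

Keeping the four fields separate makes it possible to see, for example, that an extractor finds the right strings but links them to the wrong URIs.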
WEKEX 2011 Benchmark
Controlled experiment
  4 human raters
  10 English news articles (5 from BBC and 5 from The New York Times)
  each rater evaluated each article for 5 extractors
  200 total evaluations
Fleiss's kappa score: moderate agreement among raters
Rizzo G., Troncy R. (2011), NERD: Evaluating Named Entity Recognition Tools in the Web of Data. In: (ISWC'11) Workshop on Web Scale Knowledge Extraction (WEKEX'11), Bonn, Germany.
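For readers unfamiliar with Fleiss's kappa, here is a minimal sketch of the standard formula applied to Boolean ratings. The function name and the sample counts are invented for illustration and are not the benchmark data. Labels such as "moderate" and "substantial" are commonly read on the Landis and Koch scale (0.41 to 0.60 and 0.61 to 0.80); that the slides use this scale is an assumption.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's kappa for a table of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    for row in counts:
        if len(row) != n_categories or sum(row) != n_raters:
            raise ValueError("rows must have equal length and equal totals")
    n_items = len(counts)

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement, from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)

    if math.isclose(p_e, 1.0):
        return math.nan  # all ratings fell in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Invented example: 5 items, 4 raters, columns = (rated true, rated false).
print(fleiss_kappa([[4, 0], [3, 1], [0, 4], [2, 2], [4, 0]]))
```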
Results
different behavior for different sources
ISWC 2011 Benchmark
Controlled experiment
  10 human raters
  2 English news articles from The New York Times
  each rater evaluated each article for 6 extractors
  120 total evaluations
Fleiss's kappa score: substantial agreement among raters
Results
What is NERD?
ontology: http://nerd.eurecom.fr/ontology
REST API: http://nerd.eurecom.fr/api/application.wadl
UI: http://nerd.eurecom.fr/
The NERD ontology has been integrated into the NIF project, developed within LOD2 (Creating Knowledge out of Interlinked Data), an EU FP7 project
NERD Ontology
Align the taxonomies used by the extractors
Building the NERD ontology
NERD type Occurrence
Person 10
Organization 10
Country 6
Company 6
Location 6
Continent 5
City 5
RadioStation 5
Album 5
Product 5
... ...
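One way to picture the alignment behind this table is a lookup from (extractor, native type) to a NERD type. The sketch below is hypothetical: the extractor keys and native type labels in `ALIGNMENT` are invented examples, not the real NERD mappings, and it assumes that "Occurrence" counts how many extractor taxonomies contain a given type.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

NERD = "http://nerd.eurecom.fr/ontology#"

# Hypothetical alignment entries, for illustration only.
ALIGNMENT: Dict[Tuple[str, str], str] = {
    ("opencalais", "Organization"): "Organization",
    ("opencalais", "Person"): "Person",
    ("dbpedia-spotlight", "DBpedia:Person"): "Person",
    ("dbpedia-spotlight", "DBpedia:Organisation"): "Organization",
    ("alchemy", "City"): "City",
}


def to_nerd_type(extractor: str, native_type: str) -> Optional[str]:
    """Return the NERD type URI for a native type, or None if unmapped."""
    local_name = ALIGNMENT.get((extractor, native_type))
    return NERD + local_name if local_name else None


def occurrences(alignment: Dict[Tuple[str, str], str]) -> Counter:
    """Count, per NERD type, how many distinct extractors map onto it."""
    pairs = {(extractor, nerd) for (extractor, _), nerd in alignment.items()}
    return Counter(nerd for _, nerd in pairs)


print(to_nerd_type("opencalais", "Organization"))
print(occurrences(ALIGNMENT).most_common())
```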
NERD REST API
Methods: GET, POST, PUT, DELETE
Resources: /document /user /annotation/{extractor} /extraction /evaluation ...
Response formats: JSON/RDF*
"entities" : [{
  "entity": "Tim Berners-Lee",
  "type": "Person",
  "uri": "http://dbpedia.org/resource/Tim_berners_lee",
  "nerdType": "http://nerd.eurecom.fr/ontology#Person",
  "startChar": 30,
  "endChar": 45,
  "confidence": 1,
  "relevance": 0.5
}]
Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.
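A client could consume such a response roughly as follows. This is a sketch under stated assumptions: the slide's "entities" fragment is wrapped in an enclosing JSON object, `endChar` is treated as exclusive (which matches 45 - 30 = 15 characters for "Tim Berners-Lee"), and the input sentence and the `check_offsets` helper are invented for the example. No call to the live service is made.

```python
import json

# The enclosing {...} around the slide's fragment is an assumption.
RESPONSE = """
{"entities": [{
  "entity": "Tim Berners-Lee",
  "type": "Person",
  "uri": "http://dbpedia.org/resource/Tim_berners_lee",
  "nerdType": "http://nerd.eurecom.fr/ontology#Person",
  "startChar": 30,
  "endChar": 45,
  "confidence": 1,
  "relevance": 0.5
}]}
"""

# Invented sentence in which the name starts at character 30.
TEXT = "The Web was first proposed by Tim Berners-Lee in 1989."


def check_offsets(text, response_json):
    """Return (entity, nerdType, offsets_match) for each entity in the response."""
    results = []
    for item in json.loads(response_json)["entities"]:
        # Slice the source text with the reported offsets (exclusive end assumed).
        span = text[item["startChar"]:item["endChar"]]
        results.append((item["entity"], item["nerdType"], span == item["entity"]))
    return results


print(check_offsets(TEXT, RESPONSE))
```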
NIF: NLP Interchange Format Framework
NLP tools produce different outputs
Manual effort is required for integration or reuse
  time consuming
  need to capture the definition of the attributes used in the response format
NIF uses RDF to represent NER results as Linked Data
OpenCalais
  "_type": "Organization",
  "name": "North Atlantic Treaty Organization",
  "organizationtype": "governmental civilian",
  "nationality": "N/A",
  "_typeReference": "http://s.opencalais.com/1/type/em/e/Organization",
  ...
DBpedia Spotlight
  "@URI": "http://dbpedia.org/resource/DBpedia",
  "@types": "DBpedia:Software,DBpedia:Work",
  "@surfaceForm": "dbpedia",
  "@offset": "0",
  "@support": "11",
  "@similarityScore": "0.2387271374464035",
  …
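The integration effort described above amounts to writing one adapter per tool. The sketch below uses only the keys visible in the two fragments; the adapter functions and the common record layout are invented for illustration, and fields that a fragment does not show (for example a URI or offsets for OpenCalais) are left as `None` rather than guessed.

```python
def from_opencalais(record):
    """Map an OpenCalais-style record onto a common layout (sketch)."""
    return {
        "entity": record["name"],
        "type": record["_type"],
        "uri": None,        # not present in the fragment shown on the slide
        "startChar": None,  # not present in the fragment shown on the slide
        "endChar": None,
    }


def from_spotlight(record):
    """Map a DBpedia Spotlight-style record onto the same layout (sketch)."""
    surface = record["@surfaceForm"]
    start = int(record["@offset"])  # offsets arrive as strings
    types = [t for t in record["@types"].split(",") if t]
    return {
        "entity": surface,
        "type": types[0] if types else None,
        "uri": record["@URI"],
        "startChar": start,
        "endChar": start + len(surface),  # assumes a character offset
    }


CALAIS = {
    "_type": "Organization",
    "name": "North Atlantic Treaty Organization",
    "_typeReference": "http://s.opencalais.com/1/type/em/e/Organization",
}
SPOTLIGHT = {
    "@URI": "http://dbpedia.org/resource/DBpedia",
    "@types": "DBpedia:Software,DBpedia:Work",
    "@surfaceForm": "dbpedia",
    "@offset": "0",
}

for normalized in (from_opencalais(CALAIS), from_spotlight(SPOTLIGHT)):
    print(normalized)
```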
Named Entities as textual annotations
Let's consider the document: http://www.w3.org/DesignIssues/LinkedData.html
The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data. … All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff ...
entities: { … [entity: W3C, startChar: 23107, endChar: 23110], … }
NERD meets NIF
Model documents through a set of strings dereferenceable on the Web
  :offset_23107_23110 a str:String ;
      str:referenceContext :offset_0_26546 .
Map string to entity
  :offset_23107_23110 sso:oen dbpedia:W3C .
Classification
  dbpedia:W3C rdf:type nerd:Organization .
Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France.
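The three statements above can be produced from an extraction result with plain string formatting. This is a minimal sketch, not the NERD implementation: the `nif_triples` function is invented here, prefix declarations are omitted as on the slide, and the prefixed names are passed in as ready-made strings.

```python
def nif_triples(start, end, doc_length, entity, nerd_type):
    """Build the three Turtle statements for one annotated string.

    start/end are character offsets of the mention, doc_length is the
    length of the whole document, entity and nerd_type are prefixed names.
    """
    mention = f":offset_{start}_{end}"
    context = f":offset_0_{doc_length}"
    return [
        # 1. The mention is a string anchored in the document.
        f"{mention} a str:String ; str:referenceContext {context} .",
        # 2. Map the string to an entity.
        f"{mention} sso:oen {entity} .",
        # 3. Classify the entity with a NERD type.
        f"{entity} rdf:type {nerd_type} .",
    ]


for line in nif_triples(23107, 23110, 26546, "dbpedia:W3C", "nerd:Organization"):
    print(line)
```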
NERD Demo
NERD Timeline and Future Work
beginning
today
Comparison of named entity extractors
NERD REST API and NERD ontology
NERD “smart” service: combining the best of all NER tools
NERD benchmarks
Lift NERD output results to the LOD cloud
Dashboard for improving the NERD user experience
http://www.slideshare.net/giusepperizzo
@giusepperizzo @rtroncy #nerd
http://nerd.eurecom.fr