Transcript of NLP & DBpedia

Page 1: NLP & DBpedia

Using BabelNet in Bridging the Gap Between Natural Language Queries and Linked Data Concepts

Khadija Elbedweihy, Stuart N. Wrigley, Fabio Ciravegna and Ziqi Zhang
OAK Research Group, Department of Computer Science, University of Sheffield, UK

Page 2: NLP & DBpedia

Outline

• Motivation and Problem Statement
• Natural Language Query Approach
• Approach Steps
• Evaluation
• Results and Discussion

Page 3: NLP & DBpedia

Motivation – Semantic Search

• Wikipedia states that Semantic Search “seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results”.

• Semantic search evaluations reported user preference for free natural language as a query approach (simple, fast & flexible) as opposed to controlled or view-based inputs.

Page 4: NLP & DBpedia

Problem Statement

• Complete freedom increases the difficulty of matching query terms with the underlying data and ontologies.

• Word sense disambiguation (WSD) is core to the solution.
Question: “How tall is ..... ?”: property height
– tall is polysemous and should first be disambiguated:
  – great in vertical dimension; tall people; tall buildings, etc.
  – too improbable to admit of belief; a tall story, …

• Another difficulty: Named Entity (NE) recognition and disambiguation.

Page 5: NLP & DBpedia

Approach

• Free-NL semantic search approach, matching user query terms with the underlying ontology using:

1) An extended-Lesk WSD approach.

2) An NE recogniser.

3) A set of advanced string similarity algorithms and ontology-based heuristics to match disambiguated query terms to ontology concepts and properties.

Page 6: NLP & DBpedia

Extended-Lesk WSD approach

• WordNet is the predominant resource; however, its fine sense granularity is a problem for achieving high performance in WSD.

• BabelNet is a very large multilingual ontology with wide coverage, obtained from both WordNet and Wikipedia.

• For disambiguation, bags are extended with senses’ glosses and different lexical and semantic relations.

• These include synonyms, hyponyms, hypernyms, and the attribute, see also and similar to relations.
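As a rough illustration (not the authors' code), the following Python sketch shows the overlap scoring behind the extended-Lesk idea; the sense bags and the context are toy stand-ins for the glosses and relation-derived terms gathered from BabelNet.

def lesk_score(sense_bag, context):
    # Number of distinct words shared by a sense bag and the query context.
    return len(set(sense_bag) & set(context))

def disambiguate(candidate_senses, context):
    # Pick the sense whose extended bag overlaps the context the most.
    return max(candidate_senses, key=lambda s: lesk_score(candidate_senses[s], context))

# Toy bags for the two senses of "tall" from the earlier slide.
senses = {
    "tall#vertical_extent": ["great", "vertical", "dimension", "height", "people", "building"],
    "tall#implausible": ["improbable", "belief", "story"],
}
# Toy context: query words plus expansion terms of co-occurring senses.
context = ["tall", "height", "building", "tower"]
print(disambiguate(senses, context))  # -> "tall#vertical_extent"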

Page 7: NLP & DBpedia

Extended-Lesk WSD approach

• Information added from a Wikipedia page (W) mapped to a WordNet synset includes:

1. labels: the page “Play (theatre)” adds play and theatre
2. the set of pages redirecting to W: Playlet redirects to Play
3. the set of pages linked from W: links in the page Play (theatre) include literature, comedy, etc.

• Synonyms of a synset S associated with Wikipedia page W: the WordNet synonyms of S in addition to the lemmas of the Wikipedia information of W.
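A minimal sketch of how such a sense bag could be assembled for a synset S mapped to a Wikipedia page W, following the three sources above; wordnet_synonyms, page_labels, redirects_to and links_from are assumed lookups over BabelNet/Wikipedia data, not a real API.

def sense_bag(synset, page, wordnet_synonyms, page_labels, redirects_to, links_from):
    bag = set(wordnet_synonyms[synset])   # WordNet synonyms of S
    bag.update(page_labels[page])         # labels of W, e.g. play, theatre
    bag.update(redirects_to[page])        # pages redirecting to W, e.g. Playlet
    bag.update(links_from[page])          # pages linked from W, e.g. literature, comedy
    return bag

bag = sense_bag(
    "play.n.01", "Play (theatre)",
    wordnet_synonyms={"play.n.01": ["play", "drama"]},
    page_labels={"Play (theatre)": ["play", "theatre"]},
    redirects_to={"Play (theatre)": ["playlet"]},
    links_from={"Play (theatre)": ["literature", "comedy"]},
)
print(sorted(bag))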

Page 8: NLP & DBpedia

Extended-Lesk WSD approach

Feature                                               P      R      F1
Baseline                                              58.09  57.98  58.03
Synonyms                                              59.14  59.03  59.09
Syn + hypo                                            62.16  62.07  62.12
Syn + gloss examples (WN)                             61.97  61.86  61.92
Syn + gloss examples (Wiki)                           61.14  61.02  61.08
Syn + gloss examples (WN + Wiki)                      60.21  60.10  60.16
Syn + hyper                                           60.36  60.26  60.31
Syn + semRel                                          59.65  59.54  59.59
Syn + hypo + gloss(WN)                                64.92  64.81  64.86
Syn + hypo + gloss(WN) + hyper                        65.28  65.18  65.23
Syn + hypo + gloss(WN) + hyper + semRel               65.45  65.33  65.39
Syn + hypo + gloss(WN) + hyper + semRel + relGlosses  69.76  69.66  69.71

• Sentences with fewer than seven words: f-measure of 81.34%

Page 9: NLP & DBpedia

Approach – Steps

1. Recognition and disambiguation of Named Entities.

2. Parsing and Disambiguation of the NL query.

3. Matching query terms with ontology concepts and properties.

4. Generation of candidate triples.

5. Integration of triples and generation of SPARQL queries.

Page 10: NLP & DBpedia

1. Recognition and Disambiguation of Named Entities

• Named entities recognised using AlchemyAPI.

• AlchemyAPI had the best recognition performance in the NERD evaluation of state-of-the-art NE recognizers.

• However, AlchemyAPI exhibits poor disambiguation performance.

• Each NE is disambiguated using our BabelNet-based WSD approach.

Page 11: NLP & DBpedia

1. Recognition and Disambiguation of Named Entities

• Example: “In which country does the Nile start?”

• Matches of Nile in BabelNet include:
– http://dbpedia.org/resource/Nile (singer)
– http://dbpedia.org/resource/Nile (TV series)
– http://dbpedia.org/resource/Nile (band)
– http://dbpedia.org/resource/Nile

• Match selected (Nile: the river): the overlap between this sense and the query context (geography, area, culture, continent) is larger than for the other senses.
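A toy sketch of this entity-disambiguation step, assuming each candidate DBpedia resource comes with a bag of terms drawn from its BabelNet sense (glosses, categories, linked pages); the bags and the expanded query context below are illustrative only.

def best_entity(candidates, query_terms):
    # Pick the candidate URI whose sense bag shares the most terms with the query context.
    return max(candidates, key=lambda uri: len(set(candidates[uri]) & set(query_terms)))

candidates = {
    "http://dbpedia.org/resource/Nile": ["river", "geography", "area", "culture", "continent", "egypt"],
    "http://dbpedia.org/resource/Nile (band)": ["metal", "band", "album", "music"],
    "http://dbpedia.org/resource/Nile (TV series)": ["television", "series", "documentary"],
}
query_context = ["country", "nile", "start", "geography", "area", "culture", "continent"]
print(best_entity(candidates, query_context))  # -> the river resource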

Page 12: NLP & DBpedia

2. Parsing and Disambiguation of the NL query

• Stanford Parser used to gather lemmas and POS tags.

• Proper nouns identified by the parser and not recognized by AlchemyAPI are disambiguated and added to the recognized entities.

• Example: “In which country does the Nile start?”

– The algorithm does not miss the entity Nile, although it was not recognized by AlchemyAPI.

Page 13: NLP & DBpedia

2. Parsing and Disambiguation of the NL query

• Example: “Which software has been developed by organizations founded in California?”

Output:

Word           Lemma       POS  Position
software       software    NP   1
developed      develop     VBN  2
organizations  organize    NNS  3
founded        find        VBN  4
California     California  NP   5

• Equivalent output is generated using keywords or phrases.
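The slides use the Stanford Parser for this step; as an illustrative stand-in only, the sketch below produces the same kind of (word, lemma, POS, position) rows with spaCy (assuming the en_core_web_sm model is installed), so lemmas may differ slightly from the table above.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
query = "Which software has been developed by organizations founded in California?"

# Keep only content words (nouns, proper nouns, verbs, adjectives), as in the table above.
content = [t for t in nlp(query) if t.pos_ in {"NOUN", "PROPN", "VERB", "ADJ"}]
for position, token in enumerate(content, start=1):
    print(token.text, token.lemma_, token.tag_, position)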

Page 14: NLP & DBpedia

3. Matching Query Terms with Ontology Concepts & Properties

• Noun phrases, nouns and adjectives are matched with concepts and properties.

• Verbs are matched only with properties.

• Candidate ontology matches are ordered using the Jaro-Winkler and Double Metaphone string similarity algorithms.

• The Jaro-Winkler threshold to accept a match is set to 0.791, shown in the literature to be the best threshold value.
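A small sketch of ordering candidate matches for a query term; it assumes the jellyfish library (which provides jaro_winkler_similarity in recent versions; Double Metaphone, also mentioned on this slide, would come from a separate package) and uses the 0.791 acceptance threshold with toy candidate labels.

import jellyfish

JARO_WINKLER_THRESHOLD = 0.791

def rank_matches(query_term, ontology_labels):
    # Keep labels above the threshold, best match first.
    scored = [(label, jellyfish.jaro_winkler_similarity(query_term.lower(), label.lower()))
              for label in ontology_labels]
    accepted = [(label, score) for label, score in scored if score >= JARO_WINKLER_THRESHOLD]
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)

print(rank_matches("created", ["creator", "creativeDirector", "founder", "author"]))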

Page 15: NLP & DBpedia

3. Matching Query Terms with Ontology Concepts & Properties

• Matching process uses the following, in order:
1. query term (e.g., created)
2. lemma (e.g., create)
3. derivationally related forms (e.g., creator)

• If no matches, disambiguate the query term and use expansion terms, in order:
1. synonyms
2. hyponyms
3. hypernyms
4. semantic relations (e.g., height as an attribute for tall)
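A sketch of this cascading lookup order; find_matches, lemma_of, derived_forms and expansion_terms are placeholders for the real lookups against the ontology index and the BabelNet-based WSD component.

def match_term(term, find_matches, lemma_of, derived_forms, expansion_terms):
    # 1) the query term itself, 2) its lemma, 3) derivationally related forms
    for candidate in [term, lemma_of(term), *derived_forms(term)]:
        matches = find_matches(candidate)
        if matches:
            return matches
    # Otherwise disambiguate and try expansion terms in order:
    # synonyms, hyponyms, hypernyms, then other semantic relations.
    for expansion in expansion_terms(term):
        matches = find_matches(expansion)
        if matches:
            return matches
    return []

# Toy usage with hard-coded lookups:
print(match_term(
    "created",
    find_matches=lambda t: ["dbo:creator"] if t == "creator" else [],
    lemma_of=lambda t: "create",
    derived_forms=lambda t: ["creator", "creation"],
    expansion_terms=lambda t: ["make", "produce"],
))  # -> ["dbo:creator"]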

Page 16: NLP & DBpedia

4. Generation of Candidate Query Triples

• Structure of the ontology (taxonomy of classes and domain and range of properties) is used to link matched concepts and properties and recognized entities to generate query triples.

Three-Terms Rule

• Every three consecutive terms are matched against a set of templates.

E.g., “Which television shows were created by Walt Disney?”

• The template (concept-property-instance) generates the triples:
?television_show <dbo:creator> <res:Walt_Disney>
?television_show <dbp:creator> <res:Walt_Disney>
?television_show <dbo:creativeDirector> <res:Walt_Disney>
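A toy sketch of how the concept-property-instance template could expand into candidate triples; the property URIs are simply the examples shown above.

def concept_property_instance(concept_var, property_uris, instance_uri):
    # One candidate triple per matched property.
    return [f"?{concept_var} <{prop}> <{instance_uri}>" for prop in property_uris]

for triple in concept_property_instance(
        "television_show",
        ["dbo:creator", "dbp:creator", "dbo:creativeDirector"],
        "res:Walt_Disney"):
    print(triple)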

Page 17: NLP & DBpedia

Three-Terms Rule

Examples of templates used in three-terms rule:

• concept-property-instance
– airports located in California
– actors born in Germany

• instance-property-instance
– Was Natalie Portman born in the United States?

• property-concept-instance
– birthdays of actors of the television show Charmed

Page 18: NLP & DBpedia

Two-Terms Rule

Two-Terms Rule, used when:

1) There are fewer than three derived terms
2) There is no match between query terms and a three-terms template
3) The matched template did not generate candidate triples

E.g., “In which films directed by Garry Marshall was Julia Roberts starring?”

<Garry Marshall, Julia Roberts, starring> : matched to a three-terms template but does not generate triples.

Page 19: NLP & DBpedia

Two-Terms Rule

Question: “What is the area code of Berlin?”

• Template (property-instance) generates the triples:

<res:Berlin> <dbp:areaCode> ?area_code

<res:Berlin> <dbo:areaCode> ?area_code

Page 20: NLP & DBpedia

Comparatives

Comparative scenarios:

1) Comparative used with a numeric datatype property: e.g., “companies with more than 500,000 employees”

?company <dbp:numEmployees> ?employee
?company <dbp:numberOfEmployees> ?employee
?company a <dbo:Company>
FILTER (?employee > 500000)
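A sketch of assembling this comparative into candidate SPARQL queries, one per matched property (slide 24 notes that candidates are tried in order until one returns answers); prefixes are assumed to be declared elsewhere.

def numeric_comparative_queries(concept_class, candidate_properties, value_var, operator, value):
    # One candidate query per matched numeric datatype property.
    queries = []
    for prop in candidate_properties:
        queries.append(
            "SELECT ?company WHERE {\n"
            f"  ?company <{prop}> ?{value_var} .\n"
            f"  ?company a <{concept_class}> .\n"
            f"  FILTER (?{value_var} {operator} {value})\n"
            "}"
        )
    return queries

for q in numeric_comparative_queries("dbo:Company",
                                     ["dbp:numEmployees", "dbp:numberOfEmployees"],
                                     "employee", ">", 500000):
    print(q)
    print()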

Page 21: NLP & DBpedia

Comparatives

2) Comparative is used with a concept: e.g., “places with more than 2 caves”

• Generate the same triples for places with caves:
?place a <http://dbpedia.org/ontology/Place> .
?cave a <http://dbpedia.org/ontology/Cave> .
?place ?rel1 ?cave .
?cave ?rel1 ?place .

• Add the aggregate restriction:
GROUP BY ?place
HAVING (COUNT(?cave) > 2)
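The complete query for this example could look as follows (a sketch only, printed from Python); the slide links place and cave in both directions with an unconstrained relation, which is rendered here as a UNION as an assumption about the intended semantics.

query = """SELECT ?place WHERE {
  ?place a <http://dbpedia.org/ontology/Place> .
  ?cave  a <http://dbpedia.org/ontology/Cave> .
  { ?place ?rel ?cave . } UNION { ?cave ?rel ?place . }
}
GROUP BY ?place
HAVING (COUNT(?cave) > 2)"""
print(query)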

Page 22: NLP & DBpedia

Comparatives

3) Comparative is used with an object property: e.g., “countries with more than 2 official languages”

• Similarly, generate the same triples for country and official language and add the restriction:

GROUP BY ?country HAVING (COUNT(?official_language) > 2)

4) Generic comparatives
e.g., “Which mountains are higher than the Nanga Parbat?”

Page 23: NLP & DBpedia

Generic Comparatives

• Difficulty: identify the property referred to by the comparative term.

1) Select the best relation according to the query context.
– Identify all numeric datatype properties associated with the concept “mountain”; these include: “latS, longD, prominence, firstAscent, elevation, longM, …”

2) Disambiguate the synsets of all properties and use the WSD approach to identify the synset most related to the query.
– The property elevation is correctly selected.
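A toy sketch of this selection step, reusing the same overlap idea as the WSD component; the bags of words per candidate property are illustrative stand-ins for the disambiguated synset glosses.

def best_property(candidate_bags, context):
    # Pick the property whose synset bag overlaps the query context the most.
    return max(candidate_bags, key=lambda prop: len(set(candidate_bags[prop]) & set(context)))

candidates = {
    "elevation": ["height", "altitude", "high", "above", "sea", "level"],
    "prominence": ["summit", "drop", "col", "peak"],
    "firstAscent": ["climb", "first", "year", "ascent"],
}
context = ["mountain", "higher", "high", "height", "nanga", "parbat"]
print(best_property(candidates, context))  # -> "elevation"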

Page 24: NLP & DBpedia

5. Integration of Triples and Generation of SPARQL Queries

• Generated triples integrated to produce SPARQL query.

• Query term positions used to order the generated triples.

• Triples originating from the same query term are executed in order until an answer is found.

• Duplicates are removed while merging the triples.

• SELECT and WHERE clauses added in addition to any aggregate restrictions or solution modifiers.
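A simplified sketch of this integration step (an approximation of the described behaviour, not the authors' code): triples are grouped by originating query term, alternatives in each group are tried in order, duplicates are dropped while merging, and the result is wrapped in SELECT/WHERE plus any modifiers; run_query stands in for the call to the SPARQL endpoint.

def build_query(select_vars, patterns, modifiers=""):
    body = "\n  ".join(dict.fromkeys(patterns))  # remove duplicates, keep order
    query = f"SELECT {' '.join(select_vars)} WHERE {{\n  {body}\n}}"
    return query + ("\n" + modifiers if modifiers else "")

def answer(select_vars, triple_groups, run_query, modifiers=""):
    chosen = []
    for group in triple_groups:      # groups ordered by query-term position
        for candidate in group:      # alternatives for one term, tried in order
            if run_query(build_query(select_vars, chosen + [candidate], modifiers)):
                chosen.append(candidate)
                break
    return build_query(select_vars, chosen, modifiers)

# Toy usage: the stand-in endpoint accepts the first alternative of each group.
print(answer(
    ["?television_show"],
    [["?television_show <dbo:creator> <res:Walt_Disney>",
      "?television_show <dbp:creator> <res:Walt_Disney>"]],
    run_query=lambda sparql: True,
))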

Page 25: NLP & DBpedia

Evaluation

• Test data from the 2nd Open Challenge at QALD-2.

• Results produced by QALD-2 evaluation tool.

• Very promising results: 76% of the answered questions were correct.

Approach     Answered  Correct  Precision  Recall  F1
BELA         31        17       0.62       0.73    0.67
QAKiS        35        11       0.39       0.37    0.38
Alexandria   25        5        0.43       0.46    0.45
SenseAware   54        41       0.51       0.53    0.52
SemSeK       80        32       0.44       0.48    0.46
MHE          97        30       0.36       0.4     0.38

Page 26: NLP & DBpedia

Discussion

• Design choices are affected by whether priority is given to precision or recall:

1. Query Relaxation
e.g., “Give me all actors starring in Last Action Hero”
– Restricting results to actors harms recall
– Not all entities in LD are typed, let alone correctly typed
– Query relaxation favors recall but affects precision

e.g., “How many films did Leonardo DiCaprio star in?”
– Relaxation returns TV series rather than only films, such as res:Parenthood (1990 TV series)

• Decision: favor precision; keep restriction when specified.

Page 27: NLP & DBpedia

Discussion

2. Best or All Matches
e.g., “software by organizations founded in California”
– Properties matched: foundation and foundationPlace
– Using only the best match (foundation) does not generate all results, which affects recall.
– Using all properties (some may not be relevant to the query) would harm precision.

• Decision: use all matches with a high value for the similarity threshold, and perform checks against the ontology structure to ensure that only relevant matches are used.

Page 28: NLP & DBpedia

Discussion

3. Query Expansion
• Can be useful for recall, when the query term is not sufficient to return all answers.
• Example: use both “website” and “homepage” if either of them is used in a query and both have matches in the ontology.
• The quality of expansion terms is influenced by the WSD approach; wrong sense identification will lead to a noisy list of terms.

• Decision: perform query expansion only when no matches are found in the ontology for a term, or when no results are generated using the identified matches.

Page 29: NLP & DBpedia

Questions?

Page 30: NLP & DBpedia

Additional Slides