Automated Focus Extraction for Question Answering over Topic Maps

TMRA’09, Leipzig

Rani Pinchuk, Alexander Mikhailian and Tiphaine Dalmas


Context: Domain-Portable Question Answering over Topic Maps

• Partly funded by the Flemish government as part of the ITEA2 project LINDO (ITEA2-06011).

• The research towards domain-portable question answering over Topic Maps is done within the Belgian part of the LINDO project.


Why Topic Maps?

• The space industry needs a solution to the knowledge retention problem.

• More structured than mind maps, less formal than RDF/OWL.

• Allows information to be organized in an ontological view.

• An ISO standard.


Why Topic Maps?

Who is the composer of La Bohème?

→ Puccini


LINDO-BE General Architecture

[Architecture diagram. Components: Question, Time Exp. Extractor, Focus Extractor, Anchorer, Graph Reducer, Topic Map Engine, Answer Extractor, Answer]



Question Focus

The focus is the type of the answer, expressed in the terminology of the question.

Who is the composer of La Bohème?

→ Puccini


Focus

• Asking Point (AP), explicit: “Who is the librettist of La Tilda?”

• Expected Answer Type (EAT), implicit: HUMAN: “Who wrote the libretto for La Tilda?”

EAT Classes: TIME, NUMERIC, DEFINITION, LOCATION, HUMAN, OTHER


Is it difficult to find the focus?

• Where was Puccini born?

• What is Puccini's place of birth?

• What is Puccini's birthplace?

• What is the birth place of Puccini?

• What city was Puccini born in?

• What place was Puccini born in?

• Where is Puccini from?

[Diagram: topic map fragment in which Puccini (role: person) is linked by a “born in” association to Lucca (role: place), and Lucca is a City]


Why should AP take precedence over EAT?

“Who is the librettist of La Tilda?”

EAT = HUMAN → Person

AP = Librettist
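The precedence rule can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the LINDO implementation: the `Focus` type, the `select_focus` helper and the `EAT_TO_TOPIC_TYPE` table are hypothetical names, and only the HUMAN to Person mapping comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Mapping from EAT classes to topic types. Only HUMAN -> Person comes from
# the slide; the LOCATION entry is a hypothetical placeholder.
EAT_TO_TOPIC_TYPE = {"HUMAN": "Person", "LOCATION": "Place"}


@dataclass(frozen=True)
class Focus:
    label: str   # term used to look up the answer type in the topic map
    source: str  # "AP" or "EAT"


def select_focus(asking_point: Optional[str], eat: Optional[str]) -> Optional[Focus]:
    """Prefer the explicit asking point; fall back to the expected answer type."""
    if asking_point:
        return Focus(asking_point, "AP")
    if eat in EAT_TO_TOPIC_TYPE:
        return Focus(EAT_TO_TOPIC_TYPE[eat], "EAT")
    return None


print(select_focus("librettist", "HUMAN"))  # the more specific AP wins
print(select_focus(None, "HUMAN"))          # no AP found, so fall back to the EAT
```

Trying the AP first keeps the focus as specific as the question allows; the coarser EAT is used only when no asking point is available.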


Precision and Recall

$$P = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}$$

$$R = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}$$


“Who is the librettist of La Tilda?”

EAT = HUMAN → Person

AP = Librettist

$P_{AP} = 57/57 = 1$

$P_{EAT} = 57/1165 = 0.049$
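The two precision values can be reproduced with plain set operations. The sketch below reads the slide's numbers as 57 librettist topics among 1165 person topics in the topic map; that reading, the synthetic topic ids and the helper names `precision` and `recall` are assumptions made for this example.

```python
def precision(relevant: set, retrieved: set) -> float:
    """|relevant ∩ retrieved| / |retrieved|"""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: set, retrieved: set) -> float:
    """|relevant ∩ retrieved| / |relevant|"""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


# Synthetic topic ids: 1165 persons, the first 57 of which are librettists.
persons = {f"person-{i}" for i in range(1165)}
librettists = {f"person-{i}" for i in range(57)}

relevant = librettists  # topics of the type the question asks for

p_ap = precision(relevant, retrieved=librettists)  # focus taken from the AP
p_eat = precision(relevant, retrieved=persons)     # focus taken from the EAT

print(round(p_ap, 3), round(p_eat, 3))  # 1.0 0.049
```

Both choices retrieve every librettist, so recall is the same; the difference lies entirely in how many irrelevant topics the broader EAT type drags in.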


Results over 100 annotated questions:

Name  Precision  Recall
AP    0.311      0.30
EAT   0.089      0.21


Focus Branching


Focus Extractor Architecture

• Supervised machine learning based on the principle of maximum entropy (Maxent).

• 2100 questions have been annotated:

  • 1500 from the Li & Roth corpus

  • 500 from TREC-10

  • 100 asked over the Italian Opera topic map

• The corpus was split into 80% training and 20% test data. The evaluation was repeated 10 times, each time reshuffling the training and test data.

[Diagram: Question → Focus Extractor (Tokenizer, POS Tagger, Syntactic Parser, Lexical Analysis) → Focus]
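The evaluation protocol (shuffle, split 80/20, train, test, repeat 10 times) can be sketched as follows. This is a minimal illustration, not the authors' code: a majority-class baseline stands in for the Maxent model, the corpus is toy data with a made-up label mix, and the names `train_majority`, `accuracy` and `repeated_holdout` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev


def train_majority(examples):
    """Stand-in for the Maxent trainer: remembers the most frequent label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def accuracy(model, examples):
    """Fraction of test questions whose predicted label matches the annotation."""
    return sum(model == label for _, label in examples) / len(examples)


def repeated_holdout(examples, runs=10, train_share=0.8, seed=0):
    """Shuffle, split into training and test parts, train, test; repeat `runs` times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        model = train_majority(shuffled[:cut])
        scores.append(accuracy(model, shuffled[cut:]))
    return scores


# Toy corpus of 2100 (question, label) pairs; the label mix is made up.
corpus = [(f"question {i}", "HUMAN" if i % 3 else "LOCATION") for i in range(2100)]
scores = repeated_holdout(corpus)
print(len(scores), round(mean(scores), 3), round(stdev(scores), 3))
```

Repeating the split gives a spread of scores rather than a single number, which is what makes a standard deviation over runs reportable.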


Questions Annotation

Asking Point (AP classifier):

O: What
AP: opera
O: did
O: Puccini
O: write
O: ?

Expected Answer Type (EAT classifier):

HUMAN: Who is Puccini?
DEFINITION: What is Tosca?
LOCATION: Where did Dante die?
TIME: When did Puccini die?
NUMERIC: How many characters have been killed by poisoning?
OTHER: What did Heinrich Heine write?
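The two annotation schemes differ in granularity: the AP is tagged per token, the EAT per question. A small sketch of how such annotations might be represented is shown below; the list-of-tuples layout and the `asking_point` helper are assumptions for illustration, not the format used by the authors.

```python
from typing import List, Optional, Tuple

# Token-level annotation for the AP classifier: each token is tagged
# either "AP" (part of the asking point) or "O" (other).
ap_annotation: List[Tuple[str, str]] = [
    ("What", "O"), ("opera", "AP"), ("did", "O"),
    ("Puccini", "O"), ("write", "O"), ("?", "O"),
]

# Question-level annotation for the EAT classifier: one class per question.
eat_annotation: List[Tuple[str, str]] = [
    ("Where did Dante die?", "LOCATION"),
    ("When did Puccini die?", "TIME"),
]


def asking_point(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Join the tokens tagged "AP"; return None if the question has none."""
    tokens = [token for token, tag in tagged if tag == "AP"]
    return " ".join(tokens) if tokens else None


print(asking_point(ap_annotation))  # opera
```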


AP Results

Class        Precision  Recall  F-Score
AskingPoint  0.854      0.734   0.789
Other        0.973      0.987   0.980
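The slide does not say how the F-score was computed. Assuming it is the balanced F-score (the harmonic mean of precision and recall), the reported AP values can be checked with a short sketch; the function name `f_score` is chosen for this example.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the AP results table.
ap_results = {"AskingPoint": (0.854, 0.734), "Other": (0.973, 0.987)}

for name, (p, r) in ap_results.items():
    print(name, f_score(p, r))
```

Under that assumption the recomputed values agree with the reported 0.789 and 0.980 to within 0.001. Figures averaged over several runs need not match a recomputation from averaged precision and recall exactly.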


EAT Results

Class       Precision  Recall  F-Score
DEFINITION  0.887      0.800   0.841
LOCATION    0.834      0.812   0.821
HUMAN       0.904      0.753   0.820
TIME        0.880      0.802   0.838
NUMERIC     0.943      0.782   0.854
OTHER       0.746      0.893   0.812


Overall Results

                Value  Std dev  Std err
Focus (AP+EAT)  0.827  0.020    0.006

The overall results are provided as the accuracy of the classifier.

Accuracy = correct instances / overall instances
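The accuracy definition and the relation between the reported standard deviation and standard error can be sketched as below. The slide does not state how the standard error was obtained; dividing the standard deviation by the square root of the 10 evaluation runs is an assumption here, though it is consistent with the reported figures.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Accuracy = correct instances / overall instances."""
    return correct / total


def standard_error(std_dev: float, runs: int) -> float:
    """Standard error of the mean over `runs` independent evaluations."""
    return std_dev / math.sqrt(runs)


# Reported standard deviation, and the 10 evaluation runs from the setup.
print(round(standard_error(0.020, 10), 3))  # 0.006
```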


Prediction of Accuracy


Conclusions

• We achieved 82.7% accuracy for focus extraction.

• The specificity of the focus degrades gracefully (we first try to extract the AP, and fall back to the EAT).

• The focus is identified dynamically instead of relying on a static taxonomy of question types.

• Machine learning techniques were used throughout the application stack.

• The results could be improved with more training data.

• The whole setting is domain independent.


Thank you

Questions?
