Text Processing for Procedural Question Answering

Ongoing work for the TextCoop project, ILPL group. Presentation by Estelle Delpech.

description

Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team. August 30-31, 2007. Institution: Institut de Recherche en Informatique de Toulouse (IRIT) Patan Dhoka, Lalitpur, Nepal.

Transcript of Text Processing for Procedural Question Answering

Page 1: Text Processing for Procedural Question Answering

Text Processing for Procedural Question Answering

Ongoing work for the TextCoop project

ILPL group, presentation by Estelle Delpech

Page 2: Text Processing for Procedural Question Answering

Text Processing for Procedural Question Answering

I. INTRODUCTION : GLOBAL ARCHITECTURE

II. CLUES TO IDENTIFY TITLES / INSTRUCTIONAL COMPOUNDS

III. THE WHOLE PROCESS

IV. MAIN ISSUES

V. DEMO

Page 3: Text Processing for Procedural Question Answering

I. INTRODUCTION : GLOBAL ARCHITECTURE

Page 4: Text Processing for Procedural Question Answering

A global architecture (Surdeanu & Pasca)

[Figure: architecture diagram with the labels "How to…?", Goal, Task and TEXT PROCESSING]

Page 5: Text Processing for Procedural Question Answering

TEXT PROCESSING for Procedural QA : Identification of task structure

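Modules shown between the .html pages and the DATABASE :

PRE-PROCESSING : HTML cleaning, MS tagging

SEGMENTER : identification of terminal symbols (Title, Instructional Compound)

TEXT GRAMMAR : Xbar analysis of task structure

[Figure: Xbar tree of a TASK, with nodes spec, G’ and complement, and the labels Goal and Pre-requisite]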

Page 6: Text Processing for Procedural Question Answering

II . CORPUS OBSERVATION :

WHAT CLUES TO IDENTIFY
- INSTRUCTIONAL COMPOUNDS ?
- TITLES ?

Page 7: Text Processing for Procedural Question Answering

1. Clues for Instructional Compounds Identification

Definition : kernel instructions linked to various clauses by rhetorical or logical relations.

Identification in two steps :

1. Detect presence of instructions : expression of obligation
2. Find instructional compound boundaries, e.g. connectors…

Example text :

Fixing the first wall plate (or shelf bracket)

We are going to mark the first wall plate (or bracket) for drilling. First, position the face plate so one screw lines up with the mark on the wall you made in the last step and place the level on top of the face plate to ensure it is level. Second, you should mark the wall in the next screw hole, again by turning the screw until it bites into the wall (see fig 1.3). It is advised that you mark any remaining screw holes while keeping the wall plate firmly in position. Now you have to choose a suitable drill bit (masonry or the right type for the surface). It should be the same width as the wall plug to be used. Get to hand one of the wall plugs, and place it against the tip of the drill bit (see fig 1.4). Finally, place a piece of masking tape on the drill bit to use as a guide, this will ensure you don't drill too deep.

Page 8: Text Processing for Procedural Question Answering

1. Clues for Instructional Compounds Identification

Presence of instructions : Morpho-lexical patterns

- shall Adv* base form verb
- have to Adv* base form verb
- ## Op? Adv* base form verb
- it be Adv* (necessary|compulsory) that

Examples :

- You should pre-heat the oven
- You have to pre-heat the oven
- Do not pre-heat the oven
- It is better that you pre-heat the oven

Compound boundaries : Morpho-lexical patterns

## to Adv* base form verb .* ,(##|Conj) (if|then|after )

Examples :

- [To cook the cake, pre-heat the oven] [and then start peeling …
- [If you want to cook the cake, pre-heat the oven.] [If you don’t want to cook …

HTML tags (typo-disposition) : <p> </p> <li> </li>

Example :

- <li> [ Pre-heat the oven … ]</li>
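The instruction patterns on this slide could be approximated with plain regular expressions. This is only a minimal sketch, not the project's implementation: it matches raw words instead of tagger output, it assumes that `##` stands for a sentence start, it treats any word ending in "ly" as an adverb, and the names `INSTRUCTION_PATTERNS` and `has_instruction` are hypothetical.

```python
import re

# Crude stand-in for "Adv*": zero or more words ending in "ly".
ADV = r"(?:\w+ly\s+)*"

# Hypothetical regex approximations of the slide's morpho-lexical patterns.
# A real system would match on part-of-speech tags, not on raw words.
INSTRUCTION_PATTERNS = [
    # shall Adv* base form verb
    re.compile(rf"\b(?:shall|should)\s+{ADV}\w+", re.I),
    # have to Adv* base form verb
    re.compile(rf"\b(?:have|has)\s+to\s+{ADV}\w+", re.I),
    # ## Op? Adv* base form verb (only the negated "do not" form is covered here)
    re.compile(rf"(?:^|[.!?]\s+)do\s+not\s+{ADV}\w+", re.I),
    # it be Adv* (necessary|compulsory) that
    re.compile(rf"\bit\s+(?:is|was)\s+{ADV}(?:necessary|compulsory)\s+that\b", re.I),
]


def has_instruction(sentence: str) -> bool:
    """Return True if any of the approximated patterns occurs in the sentence."""
    return any(p.search(sentence) for p in INSTRUCTION_PATTERNS)
```

Plain imperatives such as "Pre-heat the oven" are not caught by this sketch, because recognising a base form verb at sentence start needs the morpho-syntactic tags.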

Page 9: Text Processing for Procedural Question Answering

2. Titles identification : about the HTML encoding of titles

The <hn> tag cannot be used as a single clue for title identification

HTML encoding is free, the code can be underspecified (css)

Corpus observation :
- 80 % of titles are encoded with <b>
- 57 % of <b> tags encode titles
- 64 % of <h> tags encode titles
- the coding varies from one web site to another

We had to find some other clues …
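The two figures quoted for <b> measure different things: the share of titles that carry the tag, and the share of tagged chunks that really are titles. A minimal sketch of how such figures can be computed from a hand-annotated corpus follows; the function name `tag_statistics` and the toy data are hypothetical and are not the project's corpus. It assumes the corpus contains at least one title and at least one chunk carrying the tag.

```python
def tag_statistics(chunks, tag):
    """chunks: list of (set_of_tags, is_title) pairs from a hand-annotated corpus.

    Returns (coverage, reliability):
      coverage    = share of titles that carry the tag
      reliability = share of chunks carrying the tag that are titles
    """
    titles = [tags for tags, is_title in chunks if is_title]
    tagged = [is_title for tags, is_title in chunks if tag in tags]
    coverage = sum(tag in tags for tags in titles) / len(titles)
    reliability = sum(tagged) / len(tagged)
    return coverage, reliability


# Toy annotated data, for illustration only.
toy = [({"b"}, True), ({"b"}, False), ({"h"}, True), ({"b"}, True), (set(), False)]
coverage, reliability = tag_statistics(toy, "b")
```

A tag with high coverage but low reliability, as reported for <b>, finds most titles but also flags many chunks that are not titles.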

Page 10: Text Processing for Procedural Question Answering

2. Clues for Title Identification

Some helpful visual clues :
- Short sequence of words
- Spaced from the rest of the text
- Emphasized

[Screenshot annotations: "not emphasized", "not a title", "not short"]

Page 11: Text Processing for Procedural Question Answering

2. Clues for Title Identification

Linguistic clues :
- Rarely contains a tensed verb
- Can be a single question

Textual environment clues :
- Occurs between two paragraphs of text
- Occurs between a title and a paragraph of text ?

No single clue, but a bundle of clues

Page 12: Text Processing for Procedural Question Answering

III. THE WHOLE PROCESS

PRE-PROCESSING : HTML cleaning, MS tagging

SEGMENTER : identification of terminal symbols (Title, Instructional Compound)

Page 13: Text Processing for Procedural Question Answering

1. HTML Cleaning module

[Figure: raw HTML code goes through HTML Cleaning, which keeps the main typo-dispositional information]

Emphasis tags : <h> <b> <u> <i>

Text chunks tags : <p> <div> <ol> <ul>

Subdivision tags : <br> <li>

The output of the HTML Cleaning module is :

- a list of text chunks, corresponding more or less to paragraph breaks
- their corresponding typo-dispositional structure
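A cleaning step of this kind can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the module presented here: the class `ChunkExtractor`, the function `clean_html` and the dictionary layout are hypothetical, `<h>` is taken to mean `<h1>` to `<h6>`, and a chunk counts as emphasized as soon as any part of it is.

```python
from html.parser import HTMLParser

CHUNK_TAGS = {"p", "div", "ol", "ul"}            # text chunk tags
EMPHASIS_TAGS = {"b", "u", "i", "h1", "h2", "h3", "h4", "h5", "h6"}
SUBDIVISION_TAGS = {"br", "li"}                  # subdivision tags


class ChunkExtractor(HTMLParser):
    """Split a page into text chunks and record their typo-dispositional clues."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._new_chunk()

    def _new_chunk(self):
        self.current = {"text": "", "emphasis": set(), "subdivisions": 0}

    def flush(self):
        # Keep the current chunk only if it contains some text.
        if self.current["text"].strip():
            self.current["text"] = " ".join(self.current["text"].split())
            self.chunks.append(self.current)
        self._new_chunk()

    def handle_starttag(self, tag, attrs):
        if tag in CHUNK_TAGS:
            self.flush()
        elif tag in EMPHASIS_TAGS:
            self.current["emphasis"].add(tag)
        elif tag in SUBDIVISION_TAGS:
            self.current["subdivisions"] += 1
            self.current["text"] += " "          # keep list items apart

    def handle_endtag(self, tag):
        if tag in CHUNK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current["text"] += data


def clean_html(html):
    """Return the list of text chunks found in an HTML string."""
    parser = ChunkExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()                               # last chunk, if any
    return parser.chunks


chunks = clean_html(
    "<h2>Fixing the plate</h2><p>First, mark the wall.<br>Then drill.</p>"
    "<ul><li>Step one</li><li>Step two</li></ul>"
)
```

Each chunk keeps its text together with the emphasis tags seen and the number of subdivisions, which is the kind of information the later steps rely on.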

Page 14: Text Processing for Procedural Question Answering

2. Clues Collection module

[Figure: MS tagging (TreeTagger) followed by clues collection. Each text chunk is shown in three columns: STRUCTURE (HTML tags such as <b>, <br>, <li>), TEXT (with TAGS) and CLUES]

CLUES :
- Nb of instructions
- Instructions types
- Nb of goals
- Nb of words
- Nb of sentences
- Nb of questions
- Nb of tensed verbs

The output of the Clues Collection module is the list of text chunks with :

- their corresponding typo-dispositional structure
- text with tagged instructions, goals, connectors
- linguistic information

This information is used for :

- titles identification
- instructional compounds identification
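Some of the clues listed on this slide can be counted without a tagger. The sketch below covers only the number of words, sentences and questions; counting tensed verbs, instructions and goals needs the morpho-syntactic tags and is left out. The names `TextClues` and `collect_text_clues` are hypothetical, and the sentence splitter is a rough approximation.

```python
import re
from dataclasses import dataclass


@dataclass
class TextClues:
    nb_words: int
    nb_sentences: int
    nb_questions: int


def collect_text_clues(text: str) -> TextClues:
    """Count surface clues for one text chunk (rough, tagger-free approximation)."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return TextClues(
        nb_words=len(re.findall(r"[\w'-]+", text)),
        nb_sentences=len(sentences),
        nb_questions=sum(s.endswith("?") for s in sentences),
    )


clues = collect_text_clues("How do I fix a shelf? Mark the wall. Then drill.")
```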

Page 15: Text Processing for Procedural Question Answering

3. Processing each chunk : text or title ?

Every text chunk starts with TYPE unknown.

Identification of unambiguous titles :
- Short chunk
- Spaced from the rest of the text
- With emphasis
- A single question

Identification of unambiguous paragraphs of text :
- Long chunk
- No emphasis
- Subdivided
- More than 1 instruction
- Presence of tensed verbs

[Figure: the chunk list after this step, typed title, text, ambiguous, text, text, title, ambiguous, text, ambiguous, ambiguous]
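A first pass of this kind can be written as a small rule-based function. The slide does not say how the clues are combined, so the combination below is an assumption: a chunk is text if it is long, subdivided, or holds more than one instruction; otherwise it is a title if it is emphasized or is a single question; everything else stays ambiguous. The threshold `MAX_TITLE_WORDS` and the names `Chunk` and `classify` are hypothetical.

```python
from dataclasses import dataclass

MAX_TITLE_WORDS = 10  # hypothetical threshold for "short"; the slides give no number


@dataclass
class Chunk:
    nb_words: int
    emphasized: bool = False
    subdivisions: int = 0
    nb_instructions: int = 0
    nb_sentences: int = 1
    nb_questions: int = 0


def classify(chunk: Chunk) -> str:
    """Return 'title', 'text' or 'ambiguous' (assumed combination of the clues)."""
    short = chunk.nb_words <= MAX_TITLE_WORDS
    single_question = chunk.nb_sentences == 1 and chunk.nb_questions == 1
    # Unambiguous paragraph of text: long, subdivided, or several instructions.
    if not short or chunk.subdivisions > 0 or chunk.nb_instructions > 1:
        return "text"
    # Unambiguous title: short and emphasized, or a single short question.
    if chunk.emphasized or single_question:
        return "title"
    # Short chunks with no emphasis, including instruction-like ones.
    return "ambiguous"
```

For example, `classify(Chunk(nb_words=5, emphasized=True))` gives a title, while a short unemphasized chunk with one instruction is left ambiguous for the next step.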

Page 16: Text Processing for Procedural Question Answering

3. Ambiguous chunks : text or title ?

Short chunks with no emphasis

Instruction-like short chunks

Use of textual environment clues :
1. Identify unambiguous titles/paragraphs of text
2. Disambiguate the remaining chunks

Page 17: Text Processing for Procedural Question Answering

3. Ambiguous chunks : text or title ?

Disambiguation using textual environment clues :
- a series of ambiguous paragraphs becomes text
- an ambiguous paragraph between two paragraphs of text becomes a title

[Figure: TEXT CHUNKS before disambiguation (title, text, ambiguous, text, text, title, ambiguous, text, ambiguous, ambiguous) and after, where every chunk is typed title or text]
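The two environment rules stated on this slide can be sketched as a single pass over the list of chunk types. The function name `disambiguate` is hypothetical. Cases that neither rule covers, such as a lone ambiguous chunk between a title and a paragraph, are left as ambiguous in this sketch rather than guessed.

```python
from itertools import groupby


def disambiguate(types):
    """Apply the two textual-environment rules to a list of chunk types."""
    result = []
    i = 0  # index of the first chunk of the current run
    for label, run in groupby(types):
        n = len(list(run))
        if label != "ambiguous":
            result += [label] * n
        elif n > 1:
            # A series of ambiguous paragraphs becomes text.
            result += ["text"] * n
        else:
            before = types[i - 1] if i > 0 else None
            after = types[i + 1] if i + 1 < len(types) else None
            # A lone ambiguous paragraph between two paragraphs of text
            # becomes a title; anything else stays ambiguous here.
            result.append("title" if before == after == "text" else "ambiguous")
        i += n
    return result
```

Neighbours are read from the original list, so the outcome does not depend on the order in which the rules fire.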

Page 18: Text Processing for Procedural Question Answering

OUTPUT EXAMPLE

[Screenshot: output showing a MAIN GOAL and a MAIN TASK, with three chunks labelled goal and three labelled task]

Page 19: Text Processing for Procedural Question Answering

IV. Main issues : noise in web pages

« noise » of web pages (advertisements, lists of links, navigation help...) interferes with compounds / title identification.

Example : a chunk with
- short sequence
- emphasis
- linguistic form : base form verb at the beginning of a sentence
is typical of a title or an instruction, but it is a list of links !!

[Screenshot annotations: titles, instruction, titles]

Page 20: Text Processing for Procedural Question Answering

IV. Main issues : refining goal/titles identification

- only sub-goals / sub-tasks relations are identified : what about the hierarchy task/sub-task(s) ?
- what about the head title / main goal ?
  - the head title is not always the 1st identified title (noise)
  - sometimes there is no head title
- what if the action is implicit ? ex : "the room and the bed", implicit : how to clean the room and the bed
- some ideas :
  - choose a title that has vocabulary in common with instructions
  - identify action verbs in relation with the nouns of the title
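The first idea, choosing a title that shares vocabulary with the instructions, can be sketched as a simple word-overlap score. This is an illustration only: the names `vocabulary` and `pick_head_title` are hypothetical, no stop words are filtered (so "the" counts as shared vocabulary), ties go to the first candidate, and the list of titles is assumed to be non-empty.

```python
import re


def vocabulary(text):
    """Lower-cased set of alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_head_title(titles, instructions):
    """Return the candidate title sharing the most words with the instructions."""
    instr_vocab = set().union(*(vocabulary(i) for i in instructions))
    return max(titles, key=lambda t: len(vocabulary(t) & instr_vocab))


head = pick_head_title(
    ["Related links", "Fixing the wall plate"],
    ["Position the face plate on the wall.", "Mark the wall."],
)
```

A title coming from page noise, such as a list of links, tends to score low because its words rarely appear in the instructions.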

Page 21: Text Processing for Procedural Question Answering

V. DEMO