Text Processing for Procedural Question Answering
Ongoing work for the TextCoop project
ILPL group, presentation by Estelle Delpech
I. INTRODUCTION : GLOBAL ARCHITECTURE
II. CLUES TO IDENTIFY TITLES / INSTRUCTIONAL COMPOUNDS
III. THE WHOLE PROCESS
IV. MAIN ISSUES
V. DEMO
I. INTRODUCTION : GLOBAL ARCHITECTURE
A global architecture (Surdeanu & Pasca)
[Diagram : global architecture. Labels : « How to…? » question, Goal, Task, .html pages, DATABASE.]
TEXT PROCESSING for procedural QA : identification of task structure
- PRE-PROCESSING : HTML cleaning, MS tagging
- SEGMENTER : identification of terminal symbols (Title, Instructional Compound)
- TEXT GRAMMAR : X-bar analysis of task structure (TASK, spec, G’, complement, Goal, Pre-requisite)
II. CORPUS OBSERVATION : WHAT CLUES TO IDENTIFY
- INSTRUCTIONAL COMPOUNDS ?
- TITLES ?
1. Clues for Instructional Compounds Identification
Definition : kernel instructions linked to various clauses by rhetorical or logical relations.
Identification in two steps :
1. Detect the presence of instructions : expressions of obligation
2. Find instructional compound boundaries, e.g. connectors…
Example text :
Fixing the first wall plate (or shelf bracket)
We are going to mark the first wall plate (or bracket) for drilling. First, position the face plate so one screw lines up with the mark on the wall you made in the last step and place the level on top of the face plate to ensure it is level. Second, you should mark the wall in the next screw hole, again by turning the screw until it bites into the wall (see fig 1.3). It is advised that you mark any remaining screw holes while keeping the wall plate firmly in position. Now you have to choose a suitable drill bit (masonry or the right type for the surface). It should be the same width as the wall plug to be used. Get to hand one of the wall plugs, and place it against the tip of the drill bit (see fig 1.4). Finally, place a piece of masking tape on the drill bit to use as a guide, this will ensure you don't drill too deep.
Presence of instructions : morpho-lexical patterns
- shall Adv* base form verb
- have to Adv* base form verb
- ## Op? Adv* base form verb
- it be Adv* (necessary|compulsory) that
Examples :
- You should pre-heat the oven
- You have to pre-heat the oven
- Do not pre-heat the oven
- It is better that you pre-heat the oven
Compound boundaries : morpho-lexical patterns
- ## to Adv* base form verb .* ,
- (##|Conj) (if|then|after)
Examples :
- [To cook the cake, pre-heat the oven] [and then start peeling …
- [If you want to cook the cake, pre-heat the oven.] [If you don’t want to cook …
Compound boundaries : HTML tags (typo-disposition)
- <p> </p> <li> </li>
Example :
- <li> [ Pre-heat the oven … ]</li>
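The instruction patterns above can be tried out with ordinary regular expressions. The following is only a minimal sketch, not the project's implementation: it assumes each sentence has already been tagged and is written as space-separated `lemma/TAG` tokens, that `ADV` and `VB` (base form verb) come from a simplified, hypothetical tagset, that « should » is lemmatised to « shall », and that `##` stands for the start of a sentence. The pattern names are invented for the illustration.

```python
import re

# Sentences are assumed to be pre-tagged as "lemma/TAG lemma/TAG ...".
# The tagset (ADV, VB, ...) and the pattern names are hypothetical.
PATTERNS = {
    # shall Adv* base form verb
    "shall": re.compile(r"\bshall/\S+ (?:\S+/ADV )*\S+/VB\b"),
    # have to Adv* base form verb
    "have_to": re.compile(r"\bhave/\S+ to/\S+ (?:\S+/ADV )*\S+/VB\b"),
    # ## Op? Adv* base form verb (sentence start, optional "do")
    "imperative": re.compile(r"^(?:do/\S+ )?(?:\S+/ADV )*\S+/VB\b"),
    # it be Adv* (necessary|compulsory) that
    "it_be_that": re.compile(
        r"\bit/\S+ be/\S+ (?:\S+/ADV )*(?:necessary|compulsory)/\S+ that/\S+"
    ),
}


def instruction_types(tagged_sentence):
    """Return the names of all instruction patterns found in the sentence."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.search(tagged_sentence)]


print(instruction_types("you/PRO shall/MD pre-heat/VB the/DET oven/N"))
print(instruction_types("do/AUX not/ADV pre-heat/VB the/DET oven/N"))
```

Working on tagged tokens rather than raw text is what lets « Adv* » and « base form verb » be expressed at all; on raw text the same patterns would need a word list for every category.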
2. Title identification : about the HTML encoding of titles
The <hn> tag cannot be used as the sole clue for title identification
HTML encoding is unconstrained ; the markup can be underspecified (CSS)
Corpus observation :
- 80 % of titles are encoded with <b>
- 57 % of <b> tags encode titles
- 64 % of <h> tags encode titles
- the encoding varies from one web site to another
We had to find some other clues …
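The two kinds of percentage quoted for a tag (share of titles that carry the tag, share of tag occurrences that are titles) can be computed as follows on a hand-annotated corpus. This is an illustrative sketch only: the chunk representation (`is_title`, `tags`) and the toy data are invented here and do not reproduce the corpus figures.

```python
def tag_clue_stats(chunks, tag):
    """Return (share of titles carrying `tag`, share of `tag` chunks that are titles)."""
    titles = [c for c in chunks if c["is_title"]]
    tagged = [c for c in chunks if tag in c["tags"]]
    both = [c for c in tagged if c["is_title"]]
    coverage = len(both) / len(titles) if titles else 0.0
    precision = len(both) / len(tagged) if tagged else 0.0
    return coverage, precision


# Toy annotated chunks (invented data, not the corpus used in the slides).
toy_chunks = [
    {"is_title": True, "tags": {"b"}},
    {"is_title": True, "tags": {"b"}},
    {"is_title": True, "tags": {"b"}},
    {"is_title": True, "tags": {"h"}},
    {"is_title": False, "tags": {"b"}},
    {"is_title": False, "tags": {"b"}},
    {"is_title": False, "tags": set()},
]
print(tag_clue_stats(toy_chunks, "b"))
```

A tag with high coverage but modest precision, as reported for <b>, finds most titles but also marks many chunks that are not titles, which is why it cannot serve as a clue on its own.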
2. Clues for Title Identification
Some helpful visual clues :
- Short sequence of words
- Spaced from the rest of the text
- Emphasized
[Annotations on the example page : « not emphasized », « not a title », « not short »]
Linguistic clues :
- Rarely contains a tensed verb
- Can be a single question
Textual environment clues :
- Occurs between two paragraphs of text
- Occurs between a title and a paragraph of text
No single clue, but a bundle of clues
III. THE WHOLE PROCESS
[Diagram : PRE-PROCESSING (HTML cleaning, MS tagging) -> SEGMENTER (identification of terminal symbols : Title, Instructional Compound)]
1. HTML Cleaning module
[Diagram : raw HTML code -> HTML cleaning -> main typo-dispositional information]
Tag categories :
- Emphasis tags : <h> <b> <u> <i>
- Text chunk tags : <p> <div> <ol> <ul>
- Subdivision tags : <br> <li>
The output of the HTML Cleaning module is :
- a list of text chunks, roughly corresponding to paragraphs
- their typo-dispositional structure
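A cleaning step of this kind could look like the sketch below, written with Python's standard `html.parser`. It is an illustration under assumptions, not the module described in the slides: the chunk representation (`text`, `emphasis`, `subdivisions`), the mapping of <h> to `h1`..`h6`, and the names `Cleaner` and `clean` are all hypothetical. A chunk counts as emphasized as soon as any emphasis tag opens inside it, which is a deliberate simplification.

```python
from html.parser import HTMLParser

# Tag categories taken from the slide; <h> is expanded to h1..h6 (assumption).
EMPHASIS = {"b", "u", "i", "h1", "h2", "h3", "h4", "h5", "h6"}
CHUNK = {"p", "div", "ol", "ul"}
SUBDIVISION = {"br", "li"}


class Cleaner(HTMLParser):
    """Split a page into text chunks and record their typo-dispositional clues."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._new_chunk()

    def _new_chunk(self):
        self.current = {"text": "", "emphasis": set(), "subdivisions": 0}

    def flush(self):
        # Close the current chunk; chunks without text are discarded.
        text = " ".join(self.current["text"].split())
        if text:
            self.current["text"] = text
            self.chunks.append(self.current)
        self._new_chunk()

    def handle_starttag(self, tag, attrs):
        if tag in CHUNK:
            self.flush()
        elif tag in SUBDIVISION:
            self.current["subdivisions"] += 1
            self.current["text"] += " "
        elif tag in EMPHASIS:
            self.current["emphasis"].add(tag)

    def handle_endtag(self, tag):
        if tag in CHUNK:
            self.flush()

    def handle_data(self, data):
        self.current["text"] += data


def clean(html):
    parser = Cleaner()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text that follows the last chunk tag
    return parser.chunks


page = ("<b>Fixing the plate</b><p>First, position the plate.<br>"
        "Second, mark the wall.</p><ul><li>Drill</li><li>Plug</li></ul>")
for chunk in clean(page):
    print(chunk)
```

All other tags are ignored and their text is kept, so navigation bars and advertisements still end up as chunks; filtering that noise is a separate problem.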
2. Clues Collection module
[Diagram : MS tagging (TreeTagger) -> clues collection ; each chunk is shown with its STRUCTURE, TEXT, TAGS and CLUES]
Clues collected for each chunk :
- number of instructions
- instruction types
- number of goals
- number of words
- number of sentences
- number of questions
- number of tensed verbs
The output of the Clues Collection module is the list of text chunks with :
- their typo-dispositional structure
- text with tagged instructions, goals and connectors
- linguistic information
This information is used for :
- title identification
- instructional compound identification
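The surface counts among these clues (words, sentences, questions) need no tagger and can be sketched in a few lines. The function below is an assumption-laden illustration: the name `collect_clues`, the dictionary keys and the naive sentence splitter are invented here, and the counts that depend on morpho-syntactic tags (instructions, goals, tensed verbs) are deliberately left out.

```python
import re


def collect_clues(text):
    """Count surface clues for one chunk (tagger-based clues are not covered)."""
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    return {
        "nb_words": len(words),
        "nb_sentences": len(sentences),
        "nb_questions": sum(s.endswith("?") for s in sentences),
    }


print(collect_clues("How do I fix a shelf? First, mark the wall. Then drill."))
# {'nb_words': 12, 'nb_sentences': 3, 'nb_questions': 1}
```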
3. Processing each chunk : text or title ?
[Diagram : every text chunk starts with the type « unknown » and leaves this step typed « title », « text » or « ambiguous »]
Identification of unambiguous titles :
- short chunk
- spaced from the rest of the text
- with emphasis
- a single question
Identification of unambiguous paragraphs of text :
- long chunk
- no emphasis
- subdivided
- more than 1 instruction
- presence of tensed verbs
Chunk types in the example after this step : title, text, ambiguous, text, text, title, ambiguous, text, ambiguous, ambiguous
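A first pass over the chunks could be sketched as below. The slides list the clues but not how they are combined, so the combination used here (which clues are required, which are alternatives), the word-count thresholds and the dictionary keys are all assumptions made for the illustration.

```python
def first_pass(chunk, short_max=8, long_min=30):
    """Type a chunk as 'title', 'text' or 'ambiguous' from its collected clues.

    The thresholds and the way clues are combined are hypothetical.
    """
    short = chunk["nb_words"] <= short_max
    long_chunk = chunk["nb_words"] >= long_min
    # Unambiguous title: short, spaced, no instruction, and either
    # emphasized or made of a single question.
    if (short and chunk["spaced"] and chunk["nb_instructions"] == 0
            and (chunk["emphasized"] or chunk["single_question"])):
        return "title"
    # Unambiguous text: long, subdivided, or more than one instruction.
    if long_chunk or chunk["subdivided"] or chunk["nb_instructions"] > 1:
        return "text"
    return "ambiguous"


heading = {"nb_words": 5, "spaced": True, "emphasized": True,
           "single_question": False, "subdivided": False, "nb_instructions": 0}
print(first_pass(heading))  # title
```

Under these assumptions a short chunk without emphasis, or a short chunk that looks like an instruction, falls through both tests and stays ambiguous.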
3. Ambiguous chunks : text or title ?
Ambiguous chunks :
- short chunks with no emphasis
- instruction-like short chunks
Use of textual environment clues :
1. Identify unambiguous titles / paragraphs of text
2. Disambiguate the remaining chunks
3. Ambiguous chunks : text or title ?
Disambiguation using textual environment clues :
- a series of ambiguous paragraphs becomes text
- an ambiguous paragraph between two paragraphs of text becomes a title
[Diagram : chunk types before disambiguation : title, text, ambiguous, text, text, title, ambiguous, text, ambiguous, ambiguous. After disambiguation the ten chunks are typed as four titles and six paragraphs of text.]
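The two environment rules stated on this slide can be sketched as a single pass over the sequence of chunk types. This is a minimal illustration, assuming the types are plain strings; it applies only those two rules, so an ambiguous chunk in any other position (for example between a title and a paragraph of text) is left untouched.

```python
def disambiguate(types):
    """Apply the two textual-environment rules to a list of chunk types."""
    result = list(types)
    i = 0
    while i < len(types):
        if types[i] != "ambiguous":
            i += 1
            continue
        # Find the end of the run of consecutive ambiguous chunks.
        j = i
        while j < len(types) and types[j] == "ambiguous":
            j += 1
        if j - i > 1:
            # Rule 1: a series of ambiguous paragraphs becomes text.
            result[i:j] = ["text"] * (j - i)
        else:
            before = types[i - 1] if i > 0 else None
            after = types[j] if j < len(types) else None
            # Rule 2: a single ambiguous paragraph between two
            # paragraphs of text becomes a title.
            if before == "text" and after == "text":
                result[i] = "title"
        i = j
    return result


before = ["title", "text", "ambiguous", "text", "text",
          "title", "ambiguous", "text", "ambiguous", "ambiguous"]
print(disambiguate(before))
```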
OUTPUT EXAMPLE
[Figure labels : MAIN GOAL, MAIN TASK, goal (x3), task (x3)]
IV. Main issues : noise in web pages
« Noise » in web pages (advertisements, lists of links, navigation help...) interferes with compound / title identification. A noisy chunk can show all the clues of a title or an instruction :
- short sequence
- emphasis
- linguistic form : base form verb at the beginning of a sentence
... but it is a list of links !!
[Figure annotations on the example page : « titles », « instruction », « titles »]
IV. Main issues : refining goal / title identification
- only sub-goal / sub-task relations are identified : what about the task / sub-task(s) hierarchy ?
- what about the head title / main goal ?
  - the head title is not always the 1st identified title (noise)
  - sometimes there is no head title
- what if the action is implicit ?
  - ex : « the room and the bed », implicit : how to clean the room and the bed
- some ideas :
  - choose a title that has vocabulary in common with the instructions
  - identify action verbs related to the nouns of the title
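The first idea, choosing the title that shares the most vocabulary with the instructions, could be prototyped as follows. It is only a sketch of that idea: the stopword list, the function names and the plain word-overlap score are assumptions, and no stemming is applied, so « fixing » and « fix » do not count as common vocabulary.

```python
import re

# Small hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on",
             "for", "with", "your", "you", "how", "us"}


def content_words(text):
    """Lower-cased words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def pick_head_title(titles, instructions):
    """Return the title sharing the most content words with the instructions."""
    vocabulary = set()
    for instruction in instructions:
        vocabulary |= content_words(instruction)
    # max() keeps the first title in case of a tie.
    return max(titles, key=lambda t: len(content_words(t) & vocabulary))


titles = ["Related links", "Fixing the wall plate", "Contact us"]
instructions = ["Position the face plate on the wall.",
                "Mark the wall in the next screw hole."]
print(pick_head_title(titles, instructions))  # Fixing the wall plate
```

Titles coming from navigation menus or link lists rarely share words with the instructions, so a score of this kind may also help against the noise problem, although that remains to be checked on real pages.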
V. DEMO