
IE by Candidate Classification: Jansche & Abney, Cohen et al

William Cohen

1/19/03

SCAN: Search & Summarization for Audio Collections (AT&T Labs)

Why IE from personal voicemail

• A unified interface for email, voicemail, fax, … requires uniform headers:
  – Sender, Time, Subject, …
  – Headers are key to a uniform interface

• Independently, voicemail access is slow:
  – useful to have fast access to important parts of a message (contact number, caller)

Why else to read this paper

• Robust information extraction
  – Generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts

• The place of hand-coding vs. learning in information extraction
  – How to break up the task
  – Where and how to use engineering

Candidate Generator → Candidate phrase → Learned filter → Extracted phrase

Voicemail corpus

• About 10,000 manually transcribed and annotated voice messages.

• 1869 were used for evaluation.

Observation: caller phrases are short and near the beginning of the message.

Caller-phrase extraction

• Propose start positions i1, …, iN
• Use a learned decision tree to pick the best i
• Propose end positions i+j1, i+j2, …, i+jM
• Use a learned decision tree to pick the best j (a sketch of the scheme follows)
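A minimal sketch of the start-position stage, under assumptions: the features, cue words, and toy training data are hypothetical stand-ins, and scikit-learn's DecisionTreeClassifier plays the role of the learned decision tree (the end-offset stage is analogous):

```python
# Sketch of the propose-and-pick stage for caller-phrase start positions.
# Features, cue words, and training data are illustrative stand-ins.
from sklearn.tree import DecisionTreeClassifier

def position_features(words, i):
    # Per-position features: offset from the message start and nearby
    # caller-introduction cues ("hi", "it's", "this is", ...).
    return [i,
            int(words[i].lower() in {"hi", "hey", "hello"}),
            int(i > 0 and words[i - 1].lower() in {"it's", "this", "is"})]

# Hypothetical training messages with the true caller-phrase start index.
train = [("hi there it's bill and i wanted to ask".split(), 3),
         ("hey this is mary calling about the meeting".split(), 3)]

X = [position_features(w, i) for w, s in train for i in range(len(w))]
y = [int(i == s) for w, s in train for i in range(len(w))]
start_tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

def pick_start(words, candidates):
    # Score every proposed start i and keep the most confident one.
    probs = start_tree.predict_proba(
        [position_features(words, i) for i in candidates])[:, 1]
    return max(zip(probs, candidates))[1]

print(pick_start("hi there it's bill and".split(), range(5)))  # -> 3
```

The same propose-and-pick pattern, trained on end offsets j, would complete the extractor.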

Baseline (HZP, Col log-linear)

• IE as tagging:
• Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via a MaxEnt model
• Beam search to find the best tag sequence given the word sequence
• Features of the model are words, word pairs, word pair + tag trigrams, …

Hi   there  it's  Bill  and  …
OUT  OUT    IN    IN    OUT  …
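A minimal sketch of the beam search over tag sequences, assuming a toy stand-in score() for the MaxEnt model's log Pr(tag_i | …); the stand-in will not reproduce the slide's exact tagging:

```python
import math

def score(words, i, tag, prev_tag):
    # Hypothetical stand-in for log Pr(tag_i | word_i, ..., tag_{i-1});
    # it just favors IN for capitalized tokens.
    p = 0.7 if (tag == "IN") == words[i][0].isupper() else 0.3
    return math.log(p)

def beam_search(words, tags=("IN", "OUT"), beam=4):
    # Hypotheses are (log-prob, tag sequence); keep the best `beam`
    # extensions at each position.
    hyps = [(0.0, [])]
    for i in range(len(words)):
        hyps = sorted(((lp + score(words, i, t, seq[-1] if seq else None),
                        seq + [t])
                       for lp, seq in hyps for t in tags),
                      reverse=True)[:beam]
    return hyps[0][1]

print(beam_search("Hi there it's Bill and".split()))
# -> ['IN', 'OUT', 'OUT', 'IN', 'OUT'] under the toy scorer
```

The beam keeps only the top hypotheses at each position, trading exactness for speed when the tag history matters.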

Performance

Observation: caller names are really short and near the beginning of the message.

What about ASR transcripts?

Extracting phone numbers

• Phase 1: a hand-coded grammar proposes candidate phone numbers
  – Not too hard, due to the limited vocabulary
  – Optimize recall (96%), not precision (30%)

• Phase 2: a learned decision tree filters candidates
  – Uses length, position, context, … (a sketch of both phases follows)
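A minimal sketch of both phases, assuming a rough, deliberately over-generating digit-word pattern in place of the hand-coded grammar, and toy labels that exist only to exercise the filter:

```python
import re
from sklearn.tree import DecisionTreeClassifier

# Phase 1: hand-coded, recall-oriented candidate generator for spoken
# digit strings (a rough approximation, not the paper's grammar).
DIGIT = r"(?:zero|oh|one|two|three|four|five|six|seven|eight|nine)"
CANDIDATE = re.compile(rf"(?:{DIGIT}[ -]?){{4,}}", re.IGNORECASE)

def propose(transcript):
    return [(m.start(), m.group().strip())
            for m in CANDIDATE.finditer(transcript)]

def features(transcript, start, text):
    # Phase 2 features: candidate length, position, and left context.
    left = transcript[:start].split()[-2:]
    return [len(text.split()), start, int("at" in left or "number" in left)]

# Phase 2: learned filter over the candidates (toy labels for illustration).
msgs = ["please call me back at five five five one two one two thanks",
        "it goes one two one two over and over"]
cands = [(m,) + c for m in msgs for c in propose(m)]
X = [features(m, s, t) for m, s, t in cands]
y = [1, 0]  # hypothetical: only the first candidate is a phone number
tree = DecisionTreeClassifier().fit(X, y)
print([t for (m, s, t), keep in zip(cands, tree.predict(X)) if keep])
```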

Results

Their Conclusions

Cohen, Wang, Murphy

• Another paper with a similar flavor:
  – IE for a particular task
  – IE using a similar propose-and-filter approach
  – When and how do you engineer, and when and how do you use learning?

Background – subcellular localization

The most important tool for studying protein localizations is fluorescence microscopy.

New image processing techniques can automatically produce a quantitative description of subcellular localization.

Background – subcellular localization

Two golgi proteins that cannot be distinguished by eye

Background – subcellular localization

Entrez: "a new 376kD Golgi complex outer membrane protein"
SWISSProt: "INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE"

Entrez: "GPP130; type II Golgi membrane protein"
SWISSProt: nothing

Overview of SLIF: image analysis of existing images from online publications

Pipeline: On-line paper → Figure Finder → Figure → Panel Splitter → Panels → Panel Classifier → Fl. Micr. Panel → Scale Finder → Micr. Scale
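The same image-side pipeline as function composition, as a sketch only: every stage below is a hypothetical stub standing in for the component named in the diagram:

```python
# Hypothetical stubs for the SLIF image-side pipeline; the real
# components do image processing, these just thread data through
# the same stages.
def figure_finder(paper):     return paper["figures"]
def panel_splitter(figure):   return figure["panels"]
def is_fl_micr_panel(panel):  return panel["kind"] == "fluorescence"
def scale_finder(panel):      return panel["scale_um"]

def slif_image_side(paper):
    # On-line paper -> figures -> panels -> fluorescence panels + scale.
    for figure in figure_finder(paper):
        for panel in panel_splitter(figure):
            if is_fl_micr_panel(panel):
                yield panel, scale_finder(panel)

paper = {"figures": [{"panels": [
    {"kind": "fluorescence", "scale_um": 5.0},
    {"kind": "gel", "scale_um": None}]}]}
print(list(slif_image_side(paper)))
```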

Overview of SLIF: image analysis of existing images from online publications

End result: collection of on-line fluorescence microscope images, with quantitative description of localization.

E.g.: we know this figure section shows a tubulin-like protein…

…but not which one!

Background – overview of SLIF2.0

Caption side: Caption → Image Pointer Finder → Scope Finder → Name Finder → Panel Label Matcher
Image side: Image → Panel Splitter → Panel Classifier → Fl. Micr. Panel → Scale Finder
Outputs: Micr. Scale, Cell Type, Protein Name

Background – overview of SLIF2.0

Figure 1. (A) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and autoantibody against p80 coilin (right panel). Three nuclei are shown, and the bright GFP spots colocalize with bright foci of anti-coilin labeling. There is some labeling of the cytoplasm by anti-p80 coilin. (B) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and 4G3 antibody (right panel). Three nuclei are shown. Most coiled bodies are in the nucleoplasm, but occasionally are seen in the nucleolus (arrows). All coiled bodies that contain U2B″ also express the U2B″-GFP fusion. Bars, 5 µm.

An old issue: entity recognition

BY-2
U2B″-GFP

p80-coilin

anti-p80 coilin

A new issue: “caption understanding” - where are the entities in the image?


Why caption understanding?

– Location proteomics.
– Remove extraneous junk from caption text for "ordinary" IE, NLP, indexing, …
– Better text- or content-based image retrieval for scientific images.


Identify image pointers: substrings that refer to parts of the image

Will focus on text issues, not matching



Classify image pointers as citation-style or bullet-style.



Compute scopes:
– The scope of a bullet-style image pointer is all words after it, but before the next "bullet": in the running caption, the scope of (A) runs up to (B), and the scope of (B) runs to the end.
– The scope of a citation-style image pointer is some set of words nearby it (heuristically determined by separating words and punctuation). A sketch of the bullet-style rule appears below.

Image pointers share all entities in their “scope”.

Entities are assigned to panels based on matches of image-pointers to annotations in panels.
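A minimal sketch of the bullet-style scope rule on a shortened version of the running caption; the panel-label pattern is a rough approximation of mine, and the citation-style heuristics are not reproduced:

```python
import re

# Rough pattern for bullet-style panel labels such as "(A)" or "(B)".
BULLET = re.compile(r"\(([A-H])\)")

def bullet_scopes(caption):
    # The scope of each bullet runs from the bullet to the next one
    # (or to the end of the caption for the last bullet).
    hits = list(BULLET.finditer(caption))
    ends = [m.start() for m in hits[1:]] + [len(caption)]
    return {m.group(1): caption[m.end():end].strip()
            for m, end in zip(hits, ends)}

cap = ("Figure 1. (A) Single confocal optical section of BY-2 cells ... "
       "(B) Single confocal optical section ... Bars, 5 um.")
for label, scope in bullet_scopes(cap).items():
    print(label, "->", scope[:45])
```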

Outline

• Details on caption understanding
  – Baseline hand-coded methods
  – Learning methods
  – Experimental results

Task

• Identify image pointers in captions.
• Classify image pointers:
  – bullet-style, citation-style, or NP-style
  – E.g., "Panels A and C show the …"
• Won't talk about scoping
• Will focus first on extracting image pointers, i.e., binary classification of substrings: "is this an image pointer?"
• Data: 100 captions from 100 papers; about 600 positive examples.

Baseline methods

• Labeled 100 sample figure captions.

• HANDCODE-1: patterns like (A), (B-E), (c and d), etc.

• HANDCODE-2: all short parenthesized expressions & patterns like "panel A" or "in B-C" (rough approximations of both baselines are sketched below)
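Rough, hypothetical approximations of the two baselines (the paper's exact pattern sets are not reproduced here):

```python
import re

# HANDCODE-1-ish: parenthesized panel labels like (A), (B-E), (c and d).
HC1 = re.compile(r"\(\s*[A-Ha-h]\s*(?:(?:-|,|and)\s*[A-Ha-h]\s*)*\)")

# HANDCODE-2-ish: any short parenthesized expression, plus patterns
# like "panel A" or "in B-C".
HC2 = re.compile(r"\([^()]{1,15}\)|\b(?:panel|in)\s+[A-H](?:\s*-\s*[A-H])?",
                 re.IGNORECASE)

text = "GFP (left panel) and anti-coilin (B). Panel A shows three nuclei."
print([m.group() for m in HC1.finditer(text)])  # ['(B)']
print([m.group() for m in HC2.finditer(text)])  # ['(left panel)', '(B)', 'Panel A']
```

As the slide's numbers show, the narrow HC1-style patterns buy precision at the cost of recall, while the permissive HC2-style patterns do the opposite.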

            HC-1   HC-2
Precision   98.5   74.5
Recall      45.6   98.0
F1          62.3   84.6

Some plausible tricks (like filtering HC-2) don’t help much…

            HC-1   HC-2f  HC-2
Precision   98.5   89.0   74.5
Recall      45.6   54.8   98.0
F1          62.3   67.8   84.6

How hard is the problem?

Some citation-style image pointers

How hard is the problem?

NP-style

non-image pointers

The difficulty of the task suggests using a learning approach

Another use of propose-and-filter

Candidate Generator → Candidate phrase → Learned filter → Extracted phrase

Note that HANDCODE-2 (recall 98%) is a natural candidate generator.

We’ll start with “off the shelf” features…

Learning methods: boosting

Generalized version of AdaBoost (Schapire & Singer, 1999).

Allows "real-valued" predictions for each "base hypothesis", including a value of zero.

Learning methods: boosting rules

Weak learner: to find weak hypothesis h_t:

1. Split the data into a Growing set and a Pruning set
2. Let R_t be an empty conjunction
3. Greedily add conditions to R_t, guided by the Growing set
4. Greedily remove conditions from R_t, guided by the Pruning set
5. Convert to a weak hypothesis:

   h_t(x) = 1/2 * ln(W+^ / W-^) if x satisfies R_t, and 0 otherwise

where W+ (W-) is the total weight of positive (negative) examples covered by R_t, the caret denotes smoothing, and the constraint W+^ > W-^ ensures the rule predicts the positive class. (A sketch of this computation follows.)
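A sketch of the confidence computation, assuming the 1/(2n) smoothing used in SLIPPER-style confidence-rated boosting:

```python
import math

def rule_confidence(weights, labels, covered):
    # c_t = 1/2 * ln(W+^ / W-^), where W+ (W-) sums the weights of
    # positive (negative) examples covered by the rule R_t, smoothed
    # by 1/(2n). The rule is only useful if W+^ > W-^.
    n = len(weights)
    eps = 1.0 / (2 * n)
    w_pos = sum(w for w, y, c in zip(weights, labels, covered) if c and y > 0)
    w_neg = sum(w for w, y, c in zip(weights, labels, covered) if c and y < 0)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

# Uniform weights; the rule covers three examples, two of them positive.
print(rule_confidence([0.25] * 4, [1, 1, -1, -1], [True, True, True, False]))
```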

Learning methods: boosting rules

SLIPPER also produces fairly compact rule sets.

Learning methods: BWI

• Boosted wrapper induction (BWI) learns to extract substrings from a document.
  – Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
  – Conditions are tests on tokens before/after x
    • E.g., tok_{i-2} = 'from', isNumber(tok_{i+1})
  – SLIPPER weak learner, no pruning.
  – Greedy search extends the window size by at most L in each iteration, uses lookahead L, and has no fixed limit on window size.

• Good results in (Freitag & Kushmerick, 2000)

Learning methods: ABWI

• "Almost boosted wrapper induction" (ABWI) learns to extract substrings:
  – Learns to filter candidate substrings (HANDCODE-2)
  – Conditions are the same tests on tokens near x:
    • E.g., tok_{i-2} = 'from', isNumber(tok_{i+1})
  – SLIPPER weak learner, no pruning.
  – Greedy search extends the window size by any amount, uses no lookahead, and has a fixed limit on window size.

• Optimal window sizes for this problem seem to be small…

Learning methods

• Features: W tokens before/after, all tokens inside (a sketch of this feature extraction follows).

• Learner: 100 rounds of boosting conjunctions of feature tests
  – Inspired by BWI (Freitag & Kushmerick)
  – Implemented with the SLIPPER learner
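A minimal sketch of this feature extraction; the feature naming is mine and the paper's exact encoding may differ:

```python
def window_features(tokens, start, end, W=2):
    # Features for candidate tokens[start:end]: every token inside the
    # candidate, plus the W tokens on each side tagged with position.
    feats = [f"inside={t}" for t in tokens[start:end]]
    feats += [f"before{k}={tokens[start - k]}"
              for k in range(1, W + 1) if start - k >= 0]
    feats += [f"after{k}={tokens[end + k - 1]}"
              for k in range(1, W + 1) if end + k - 1 < len(tokens)]
    return feats

toks = "double labeled with GFP ( left panel ) and".split()
print(window_features(toks, 4, 8))  # candidate: "( left panel )"
```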

            HC-1   HC-2f  HC-2   ABWI (W=2)
Precision   98.5   89.0   74.5   89.7
Recall      45.6   54.8   98.0   91.0
F1          62.3   67.8   84.6   90.3

Other learning methods

Method        Precision  Recall  F1
HC-1          98.5       45.6    62.3
HC-2f         89.0       54.8    67.8
HC-2          74.5       98.0    84.6
ABWI (W=2)    89.7       91.0    90.3
ABWI SLIPPER  96.1       85.2    90.3
ABWI RIPPER   88.1       87.1    87.6
ABWI SVM1     69.0       78.0    73.2
ABWI SVM2     100.0      75.2    85.6

All learning methods are competitive with hand-coded methods

Additional features

• Check if the candidate contains certain "special" substrings:
  – Matches a color name: labeled color
  – Matches a HANDCODE-1 pattern: handcode1
  – Matches "mm", "mg", etc.: measure
  – Matches 1980, …, 2003, "et al": citation
  – Matches "top", "left", etc.: place

• Added "sentence boundary" substrings:
  – Feature is "distance to boundary".

(Hypothetical versions of these tests are sketched below.)
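Hypothetical versions of these feature tests; the actual lexicons and patterns are not reproduced:

```python
import re

# Illustrative "special substring" tests, keyed by feature name.
SPECIAL = {
    "color":    re.compile(r"\b(red|green|blue|yellow|white)\b", re.I),
    "measure":  re.compile(r"\b\d+(\.\d+)?\s*(mm|mg|nm)\b", re.I),
    "citation": re.compile(r"\b(19[89]\d|200[0-3])\b|\bet al\b", re.I),
    "place":    re.compile(r"\b(top|bottom|left|right)\b", re.I),
}

def special_features(candidate):
    return [name for name, pattern in SPECIAL.items()
            if pattern.search(candidate)]

print(special_features("(Smith et al., 1999)"))  # ['citation']
print(special_features("(red, top right)"))      # ['color', 'place']
```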

Learning with expanded feature set

            HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI+NA
Precision   98.5   89.0   74.5   89.7        85.9
Recall      45.6   54.8   98.0   91.0        92.2
F1          62.3   67.8   84.6   90.3        89.0

Many of the new features are inversely correlated with the class (e.g., citation), but ABWI looks only for positively correlated patterns.

Learning with expanded feature set

            HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI+NA  SABWI+NA
Precision   98.5   89.0   74.5   89.7        85.9     88.6
Recall      45.6   54.8   98.0   91.0        92.2     93.8
F1          62.3   67.8   84.6   90.3        89.0     91.1

SABWI is a symmetric version of ABWI: it can use rules and/or conditions that are negatively or positively correlated with the class.

Task

• Identify image pointers in captions.

• Classify image pointers:
  – bullet-style, citation-style, or NP-style

• Combine these to get a four-class problem:
  – bullet-style, citation-style, NP-style, or other
  – no hand-coded baseline methods

Four-class extraction results

Method      Error rate
            W=2    W=3    W=5
ABWI        24.6   27.5   26.7
ABWI+NA     26.7   22.2   26.7
SABWI+NA    24.2   18.2   22.6

Further improvement is probable with additional labeled data.