Usami bionlp2011

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii Graduate School of Information Science and Technology University of Tokyo

Transcript of Usami bionlp2011

Page 1: Usami bionlp2011

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition

Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii

Graduate School of Information Science and Technology University of Tokyo

Page 2: Usami bionlp2011

Introduction

Named Entity Recognition

AM/B ,/O cystatin/B C/I and/O cathepsin/B B/I are/O present/O as/O ...

Labels B: Beginning of NE I: Inside of NE O: Outside of NE

Recent approach:

Machine learning on manually annotated corpus

• BioCreAtIvE task 1A (Yeh et al., 2005)

• Semi-supervised (Vlachos and Gasperin, 2010)

Expensive

• Cost

• Time

Page 11: Usami bionlp2011

Our Idea

Utilize inexpensive and large resources:

• Lexical database

• Unlabeled text

1. Build a dictionary from the lexical database

2. String match the dictionary against the unlabeled text

3. Acquire an annotated corpus for training
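
The string-match step can be sketched as a greedy longest-match labeler that emits B/I/O tags. This is only an illustration: the function name `label_by_string_match`, the `max_len` limit, and the toy dictionary are assumptions, not details taken from the slides.

```python
def label_by_string_match(tokens, dictionary, max_len=5):
    """Label tokens with B/I/O tags by greedy longest dictionary match.

    `dictionary` is a set of token tuples, e.g. {("cystatin", "C")}.
    `max_len` (an assumed limit) bounds the length of a candidate span.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


tokens = "AM , cystatin C and cathepsin B are present as".split()
dictionary = {("AM",), ("cystatin", "C"), ("cathepsin", "B")}  # toy entries
labels = label_by_string_match(tokens, dictionary)
for token, label in zip(tokens, labels):
    print(token, label)
```

Longest match is one reasonable choice here because multi-token names such as "cystatin C" would otherwise be cut short by a shorter entry.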

Page 12: Usami bionlp2011

Dictionary Building

Symbol: CD177

Official Name: CD177 molecule

Synonyms: NB1, PRV1, HNA2A, CD177

Dictionary entries: CD177, CD177 molecule, NB1, PRV1, HNA2A
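
A minimal sketch of the dictionary-building step, assuming each record is available as a dict with `symbol`, `official_name`, and `synonyms` keys. These field names are hypothetical and do not reflect the actual Entrez Gene file format.

```python
def build_dictionary(records):
    """Collect the symbol, official name, and synonyms of every record.

    Duplicate names are dropped; first-seen order is kept.
    """
    names = {}
    for record in records:
        candidates = [record["symbol"], record["official_name"], *record["synonyms"]]
        for name in candidates:
            names.setdefault(name, None)
    return list(names)


# Toy record mirroring the CD177 example on the slide.
records = [
    {
        "symbol": "CD177",
        "official_name": "CD177 molecule",
        "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
    }
]
dictionary = build_dictionary(records)
print(dictionary)
```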

Page 18: Usami bionlp2011

Task Settings

Task: Single class NER

Target Class: Gene-or-gene-product (GGP)

Resources:

• Lexical database: Entrez Gene

Includes 6,816,109 gene (protein) records

• Unlabeled text: 2009 MEDLINE

Includes 17,764,827 articles

Page 19: Usami bionlp2011

Ineffectiveness of Simple Approach

Dictionary-based NER

• String match the dictionary against the test data

ML-based NER trained on acquired training data

• String match the dictionary against unlabeled text to acquire training data

• Learn a model from the training data

• Apply the model to the test data

Page 27: Usami bionlp2011

Ineffectiveness of Simple Approach

Dictionary-based NER

ML-based NER trained on acquired training data

Method      P       R       F1
Dic-based   39.03   42.69   40.78
ML-based    10.18   23.83   14.27

Page 28: Usami bionlp2011

Problem of Simple Approach

Stats: Acquired 1,715,344,107 labeled tokens, of which 10.0% are NEs

Examples

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

String match labels "AM" as an NE in both sentences, although in (B) it denotes the time of day.

Page 29: Usami bionlp2011

Goal of This Study

Our Contribution

Acquire huge high-quality training data with lexical database and unlabeled text

Methodology

1. Utilize references (links) for disambiguation

2. Expand NEs based on coordination analysis

3. Gain new NEs by using self-training

Page 30: Usami bionlp2011

Disambiguation

Utilize lexical database references

record AM

reference PMID 1984484

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
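
The reference check can be sketched as follows: a name is labeled only in articles that its database record refers to. The record layout (`names`, `pmids`) and both function names are assumptions made for illustration, and the sketch handles single-token names only.

```python
def names_referenced_by(pmid, records):
    """Return the names of all records whose reference list contains `pmid`."""
    names = set()
    for record in records:
        if pmid in record["pmids"]:
            names.update(record["names"])
    return names


def label_with_references(pmid, tokens, records):
    """Label a token as an NE only if its record refers to this article."""
    allowed = names_referenced_by(pmid, records)
    return ["B" if token in allowed else "O" for token in tokens]


# Toy record: "AM" refers to PMID 1984484, as on the slide.
records = [{"names": {"AM"}, "pmids": {1984484}}]

tokens_a = "in culture media of AM , cystatin C".split()
tokens_b = "is higher in AM , lower in PM".split()
labels_a = label_with_references(1984484, tokens_a, records)
labels_b = label_with_references(23456, tokens_b, records)
print(labels_a)
print(labels_b)
```

With this filter, "AM" is labeled in sentence (A) but left unlabeled in sentence (B), whose PMID is not referred to by the record.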

Page 33: Usami bionlp2011

Side Effect of Using References

Lack of references in the lexical database

Record   Reference (PMID)
entA     19025
entB     1021
entC     4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

String match only if referred: only entC refers to PMID 4928016, so entA and entB are not labeled in this sentence.

Page 34: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Page 35: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Start from Here

Page 36: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Coordinate token

Page 38: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Is this mention included in the dictionary?

Page 39: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Yes

Page 46: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

Not a coordinate token

Not included

Page 47: Usami bionlp2011

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

Coordination Analysis

End
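
The walk-through above can be sketched roughly as below. The set of coordinate tokens, the scan in both directions, and the restriction to single-token mentions are assumptions; the slides only show the checks "coordinate token?" and "included in the dictionary?".

```python
COORDINATE_TOKENS = {",", "and", "or"}  # assumed set of coordinate tokens


def expand_by_coordination(tokens, labels, dictionary):
    """Spread NE labels from already labeled mentions to coordinated ones.

    From each seed (a token labeled "B"), step over one coordinate token at
    a time; if the token behind it is in the dictionary, label it and keep
    going, otherwise stop.
    """
    labels = list(labels)
    seeds = [i for i, label in enumerate(labels) if label == "B"]
    for seed in seeds:
        for step in (-1, 1):  # scan to the left, then to the right
            pos = seed
            while True:
                coord, mention = pos + step, pos + 2 * step
                if not 0 <= mention < len(tokens):
                    break  # ran off the sentence
                if tokens[coord] not in COORDINATE_TOKENS:
                    break  # not a coordinate token
                if tokens[mention] not in dictionary:
                    break  # not included in the dictionary
                labels[mention] = "B"
                pos = mention
    return labels


tokens = "( designated entA , entB and entC )".split()
labels = ["O"] * len(tokens)
labels[tokens.index("entC")] = "B"  # seed: the referenced mention
expanded = expand_by_coordination(tokens, labels, {"entA", "entB", "entC"})
print(list(zip(tokens, expanded)))
```

In this toy run the scan stops at "designated", which is neither a coordinate token nor a dictionary entry.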

Page 48: Usami bionlp2011

Self-training

1. Learning: train a classifier model on the training data

2. Apply: run the classifier model on the remaining data

3. Add new NEs to the training data and repeat
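
A generic self-training loop might look like the sketch below. The confidence threshold, the round limit, and the `train`/`predict` callables are assumptions for illustration; the toy "model" is just a set of name prefixes, not the SVM used in the study.

```python
def self_train(labeled, remaining, train, predict, threshold=0.9, max_rounds=10):
    """Repeatedly move confidently predicted NEs into the training data."""
    labeled = list(labeled)
    remaining = list(remaining)
    for _ in range(max_rounds):
        model = train(labeled)                      # Learning
        new_nes, still_remaining = [], []
        for item in remaining:                      # Apply
            label, confidence = predict(model, item)
            if label != "O" and confidence >= threshold:
                new_nes.append((item, label))
            else:
                still_remaining.append(item)
        if not new_nes:
            break                                   # nothing new was found
        labeled.extend(new_nes)                     # Add new NEs
        remaining = still_remaining
    return train(labeled), labeled


# Toy stand-ins: the "model" is the set of 3-character prefixes of known NEs.
def toy_train(labeled):
    return {token[:3] for token, label in labeled if label == "B"}


def toy_predict(model, token):
    return ("B", 1.0) if token[:3] in model else ("O", 1.0)


model, labeled = self_train(
    [("entA", "B"), ("the", "O")], ["entB", "of"], toy_train, toy_predict
)
print(labeled)
```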

Page 49: Usami bionlp2011

Evaluation Settings

Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)

Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)
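
Sequential prediction can be sketched as a left-to-right loop in which each decision may use the previously predicted label as a feature. The feature names and the rule-based `toy_classify` below are stand-ins chosen for illustration; in the actual setting the classifier would be the linear kernel SVM.

```python
def decode_sequentially(tokens, classify):
    """Predict one label per token, left to right.

    `classify` is any callable that maps a feature dict to a label.
    """
    labels = []
    prev_label = "O"  # assumed context before the first token
    for token in tokens:
        features = {"token": token, "prev_label": prev_label}
        prev_label = classify(features)
        labels.append(prev_label)
    return labels


GENE_HEADS = {"cystatin", "cathepsin"}  # toy lexicon for the stand-in classifier


def toy_classify(features):
    token, prev_label = features["token"], features["prev_label"]
    if token in GENE_HEADS:
        return "B"
    # A single capital letter right after an NE token continues the NE.
    if prev_label in ("B", "I") and len(token) == 1 and token.isupper():
        return "I"
    return "O"


labels = decode_sequentially("cystatin C and cathepsin B are".split(), toy_classify)
print(labels)
```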

Page 50: Usami bionlp2011

NER Results

Dic-based

Method              Prec.   Recall   F1
String match        39.03   42.69    40.78
+ References        90.62   13.52    23.53
+ Coord Analysis    89.66   13.77    23.87

ML-based

Method              Prec.   Recall   F1
String match        10.18   23.83    14.27
+ References        69.25   39.12    50.00
+ Coord Analysis    66.79   47.44    55.47
+ Self-training     63.72   51.18    56.77
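
For reference, F1 is the harmonic mean of precision and recall, so the F1 column can be rechecked from the other two. The helper name `f1_score` is ours, and small deviations in the last digit are expected because the reported P and R are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dic-based string match, then ML-based with self-training.
print(round(f1_score(39.03, 42.69), 2))
print(round(f1_score(63.72, 51.18), 2))
```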

Page 51: Usami bionlp2011

Automatic vs Manual

Type        Total tokens   NE tokens
Manual      161,577        12,603
Automatic   48,677,426     3,055,362

NER Performance (trained on each corpus: Manual, Automatic)

[Bar chart of P, R, and F1 for the two corpora. Values shown: 62.66, 67.89, 57.92, 58.56, 68.26, 80.76. Highlighted: F1: 67.89 and F1: 62.66]

Page 53: Usami bionlp2011

Conclusion

Acquired high-quality training data automatically

• Use of references for high precision

• Improve recall with

‣ Coordination analysis

‣ Self-training

Acquired a large amount of training data

• Used 10% (Memory limitation)

Page 54: Usami bionlp2011

Future Work

Utilize all of the acquired training data for learning

‣ Online learning

Improve self-training performance

Semi-supervised approach with acquired data

Apply to another domain or semantic class