iSimp: A Sentence Simplification System for Biomedical Text

Yifan Peng, Catalina O. Tudor, Manabu Torii, Cathy H. Wu, and K. Vijay-Shanker
Computer & Information Sciences, University of Delaware
Oct 6, 2012

Transcript of iSimp: A Sentence Simplification System for Biomedical Text

Page 1: iSimp: A Sentence Simplification System for Biomedical Text

iSimp: A Sentence Simplification System for Biomedical Text

Yifan Peng, Catalina O. Tudor, Manabu Torii, Cathy H. Wu, and K. Vijay-Shanker

Computer & Information Sciences, University of Delaware

Oct 6, 2012

Page 2: iSimp: A Sentence Simplification System for Biomedical Text

Outline

1 Introduction and motivation

2 iSimp: what to simplify, and how

3 Evaluation

4 Summary and future work

Yifan Peng, et al. iSimp: A Sentence Simplification System for Biomedical Text

Page 3: iSimp: A Sentence Simplification System for Biomedical Text

Introduction

• Many text mining applications have been developed for biomedical text
• The complexity of sentences is a challenge
• iSimp simplifies the text so that existing text mining tools can be improved
• This topic is still new, though we are not the first to address it (Miwa, 2010; Jonnalagadda, 2010; Siddharthan, 2003)

Page 7: iSimp: A Sentence Simplification System for Biomedical Text

How do we extract relations?

Sentences
• PKC phosphorylates GAP-43 on serine 41.
• It was suggested that Yak1 phosphorylates Crf1 to promote its nuclear entry.

Word sequence
... ProteinA phosphorylates ProteinB ...

Page 8: iSimp: A Sentence Simplification System for Biomedical Text

Three alternative ways

Sentence
Raf-1 phosphorylates and activates MEK1, which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2.

Word sequence
... ProteinA phosphorylates ProteinB ...

They are the same because of the subject-object relation.

1 Design rules for all possible variations
  • There are TOO many variations
2 Improve deep representations of sentences
  • Parsers become error-prone for long and complex sentences
  • Parsers will be less efficient for long sentences
3 Simplify sentences to reduce variations

Page 13: iSimp: A Sentence Simplification System for Biomedical Text

Simplification as a preprocessing module

Assume we are building a phosphorylation system.

Sentence
Raf-1 phosphorylates and activates MEK1, which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2.

Simplify
• Raf-1 phosphorylates MEK1
• MEK1 phosphorylates ERK1
• MEK1 phosphorylates ERK2
• Raf-1 activates MEK1
• ...

Word sequence
... ProteinA phosphorylates ProteinB ...
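As a rough illustration of why simplification helps a word-sequence matcher, the sketch below applies one pattern to the original sentence and to its simplified forms. The protein list, the regular expression, and the function name `extract_pairs` are assumptions made for this example; they are not taken from iSimp or from any real extraction system.

```python
import re

# Hypothetical entity list; a real system would use a named-entity tagger.
PROTEINS = ["Raf-1", "MEK1", "ERK1", "ERK2"]
NAME = "|".join(re.escape(p) for p in PROTEINS)

# Word-sequence pattern: ProteinA phosphorylates ProteinB
PATTERN = re.compile(rf"({NAME}) phosphorylates ({NAME})")

def extract_pairs(sentences):
    """Return (ProteinA, ProteinB) pairs matched by the word-sequence pattern."""
    pairs = []
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            pairs.append((match.group(1), match.group(2)))
    return pairs

original = [
    "Raf-1 phosphorylates and activates MEK1, which in turn phosphorylates "
    "and activates the MAP kinases/extracellular signal regulated kinases, "
    "ERK1 and ERK2."
]
simplified = [
    "Raf-1 phosphorylates MEK1",
    "MEK1 phosphorylates ERK1",
    "MEK1 phosphorylates ERK2",
    "Raf-1 activates MEK1",
]

print(extract_pairs(original))    # no match on the complex sentence
print(extract_pairs(simplified))  # three phosphorylation pairs
```

The pattern never fires on the original sentence because a conjunction or a relative pronoun always sits between the verb and its arguments. On the simplified sentences the same unchanged pattern finds all three pairs.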

Page 16: iSimp: A Sentence Simplification System for Biomedical Text

Outline

1 Introduction and motivation

2 iSimp: what to simplify, and how

3 Evaluation
  • iSimp accuracy
  • Improvement of recall of information extraction systems
  • Improvement of sentence ranking systems

4 Summary and future work

Page 17: iSimp: A Sentence Simplification System for Biomedical Text

How does iSimp work?

Active Raf-1 phosphorylates and activates MEK1, which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2.

The sentence is simplified one construct at a time:

1 Verb conjunction: Raf-1 phosphorylates MEK1; which phosphorylates ...
2 Relative clause: which phosphorylates ... → MEK1 phosphorylates ...
3 Apposition: MEK1 phosphorylates ... → MEK1 phosphorylates ERK1 and ERK2
4 Noun conjunction: MEK1 phosphorylates ERK1 and ERK2 → MEK1 phosphorylates ERK1; MEK1 phosphorylates ERK2

Result

1 Raf-1 phosphorylates MEK1
2 MEK1 phosphorylates ERK1
3 MEK1 phosphorylates ERK2

Page 26: iSimp: A Sentence Simplification System for Biomedical Text

Types of simplification constructs

• Conjunction
• Relative clause
• Apposition
• Subordinate clause
• Introductory phrase
• Parenthesized element

Conjunction, relative clause, and apposition:
• Almost all abstracts contain at least one of these three constructs
• They are challenging to detect

Page 28: iSimp: A Sentence Simplification System for Biomedical Text

iSimp pipeline

1 Part-of-speech tagging & simple phrase chunking
  • Tagger and chunker are trained on the GENIA corpus*
2 Detection of complex constructions
  • 6 constructions are detected: conjunction, relative clause, apposition, etc.
3 Generation of simplified sentences
  • For each type of construct, use a proper template

* Y. Tateisi, A. Yakushiji, T. Ohta, and J. Tsujii, “Syntax annotation for the GENIA corpus,” in Procs. of the IJCNLP, Companion volume, 2005, pp. 222–227.
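To make the generation stage more concrete, here is a minimal sketch of what a per-construct template could look like once a construct and its boundaries are known. The two functions and their argument layout are hypothetical; the slides do not describe iSimp's actual templates.

```python
def expand_conjunction(prefix, conjuncts, suffix=""):
    """Conjunction template (assumed form): one output sentence per conjunct."""
    return [" ".join(part for part in (prefix, conjunct, suffix) if part)
            for conjunct in conjuncts]

def resolve_relative_clause(antecedent, clause_body):
    """Relative clause template (assumed form): the antecedent noun phrase
    takes the place of the relative pronoun."""
    return f"{antecedent} {clause_body}"

# "MEK1, which ... phosphorylates ... ERK1 and ERK2"
clause = resolve_relative_clause("MEK1", "phosphorylates")
sentences = expand_conjunction(clause, ["ERK1", "ERK2"])
print(sentences)
```

Keeping each template as a small function of already-detected spans mirrors the pipeline's separation between detection and generation.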

Page 31: iSimp: A Sentence Simplification System for Biomedical Text

Detection

• Look for triggers: “and”, “which”, etc.
• Scan to the right and left of the trigger to determine the type of construct
• Use part-of-speech tags and chunking boundaries to determine the boundary of the construct
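The trigger-based scan can be sketched as follows. This is an illustrative reading of the three steps, not iSimp's implementation: the `(text, chunk_type)` input format, the comma-skipping rule, and the decision rules are all assumptions, and only the two triggers named on the slide are used.

```python
# Trigger words named on the slide; the full trigger list is not given there.
TRIGGERS = {"and", "which"}

def neighbor_type(chunks, i, step):
    """Chunk type of the nearest neighbor of position i, skipping commas."""
    j = i + step
    while 0 <= j < len(chunks) and chunks[j][0] == ",":
        j += step
    return chunks[j][1] if 0 <= j < len(chunks) else None

def detect_constructs(chunks):
    """chunks: list of (text, chunk_type) pairs (hypothetical chunker output)."""
    found = []
    for i, (text, _) in enumerate(chunks):
        word = text.lower()
        if word not in TRIGGERS:
            continue
        left = neighbor_type(chunks, i, -1)
        right = neighbor_type(chunks, i, +1)
        if word == "which" and left == "NP":
            found.append((i, "relative clause"))
        elif word == "and" and left == right == "NP":
            found.append((i, "noun conjunction"))
        elif word == "and" and left == right == "VP":
            found.append((i, "verb conjunction"))
    return found

chunks = [("Raf-1", "NP"), ("phosphorylates", "VP"), ("and", "O"),
          ("activates", "VP"), ("MEK1", "NP"), (",", "O"), ("which", "O"),
          ("phosphorylates", "VP"), ("ERK1", "NP"), ("and", "O"), ("ERK2", "NP")]
print(detect_constructs(chunks))
```

Each trigger is examined once and only its immediate neighbors are inspected, so this kind of scan stays linear in the sentence length.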

Page 32: iSimp: A Sentence Simplification System for Biomedical Text

Challenges – detection of simplification types

Example
• eIF2alpha dephosphorylation, GADD34 and CreP, ... (noun phrase conjunction)
• Two markers, [D16S3070 and D16S3275], ... (apposition and noun conjunction)

Criteria for apposition detection
• One of the two noun phrases begins with a number, a determiner (e.g., “a”, “an”, “the”), or the words “other” or “another”
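The apposition criterion can be written as a small predicate. The sketch below is one possible encoding: the word lists, the treatment of spelled-out numbers, and the function names are assumptions for illustration only.

```python
DETERMINERS = {"a", "an", "the"}
OTHER_WORDS = {"other", "another"}
# Assumed short list of spelled-out numbers; digits are handled separately.
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def starts_like_apposition(noun_phrase):
    """True if the phrase begins with a number, a determiner, or other/another."""
    words = noun_phrase.split()
    if not words:
        return False
    first = words[0].lower()
    return (first.isdigit() or first in NUMBER_WORDS
            or first in DETERMINERS or first in OTHER_WORDS)

def is_apposition(np1, np2):
    """Apply the criterion to the two noun phrases around the comma."""
    return starts_like_apposition(np1) or starts_like_apposition(np2)

print(is_apposition("Two markers", "D16S3070 and D16S3275"))
print(is_apposition("eIF2alpha dephosphorylation", "GADD34"))
```

On the two slide examples, the first pair satisfies the criterion because "Two" is a number, while the second pair does not and is left to be treated as a noun phrase conjunction.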

Page 35: iSimp: A Sentence Simplification System for Biomedical Text

Challenges – detection of boundary

Example
hyperglycemic clamps in carriers of a CA repeat in the IGF-I promoter and an ApaI polymorphism in the IGF-II gene...

Candidate boundaries
• noun phrase “of” [noun phrase and noun phrase]...
• [noun phrase “of” noun phrase and noun phrase]...

Criteria for noun phrase similarity
• same word
• numbers
• Greek letters (alpha, beta)
• numbers followed by letters
• ...
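One way to read the similarity criteria is that two noun phrases are similar when they share a word after numbers and Greek letter names are abstracted away. The sketch below encodes that reading; the normalization scheme, the Greek and stop-word lists, and the function names are assumptions, and the slide does not say how iSimp combines the criteria.

```python
import re

GREEK = r"alpha|beta|gamma|delta|kappa"   # assumed short list
STOP_WORDS = {"a", "an", "the", "of", "in"}  # assumed; ignored when comparing

def shape(word):
    """Abstract a word so that names differing only in numbers or Greek
    letters map to the same shape."""
    w = word.lower()
    w = re.sub(r"\d+[a-z]*", "#", w)  # numbers, or numbers followed by letters
    w = re.sub(GREEK, "@", w)         # Greek letter names
    return w

def similar(np1, np2):
    """True if the two noun phrases share a word shape (stop words excluded)."""
    shapes1 = {shape(w) for w in np1.split() if w.lower() not in STOP_WORDS}
    shapes2 = {shape(w) for w in np2.split() if w.lower() not in STOP_WORDS}
    return bool(shapes1 & shapes2)

print(similar("ERK1", "ERK2"))
print(similar("IFN-alpha", "IFN-beta"))
print(similar("the kinase", "the gene"))
```

A similarity test of this kind could help choose between the candidate boundaries above by preferring the bracketing whose conjuncts look alike.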

Page 38: iSimp: A Sentence Simplification System for Biomedical Text

Outline

1 Introduction and motivation

2 iSimp: what to simplify, and how

3 Evaluation
  • iSimp accuracy
  • Improvement of recall of information extraction systems
  • Improvement of sentence ranking systems

4 Summary and future work

Page 39: iSimp: A Sentence Simplification System for Biomedical Text

Results of simplification detection

• 100 abstracts from PubMed, for a total of 998 sentences
• 5 judges annotated the corpus

[Figure: bar chart of precision and recall (0% to 100%) for conjunctions, relative clauses, and appositions, under two criteria: type and type+boundary. Bar labels, in slide order, for type: 100%, 100%, 93.8%, 87.9%, 93.0%, 83.3%; for type+boundary: 76.8%, 88.5%, 93.8%, 85.5%, 91.3%, 83.3%.]

Page 42: iSimp: A Sentence Simplification System for Biomedical Text

Improvement of recall of information extraction systems

RLIMS-P*
• Protein phosphorylation information extraction system
• Hand-coded patterns

• 1,000 Medline abstracts related to phosphorylation
• With simplification, we expect the recall to go up
• Number of pairs matched: 1,768 → 2,111 (20% more)
• Manual verification shows that the precision stays the same

* Z.-Z. Hu, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker, and C. H. Wu, “Literature mining and database annotation of protein phosphorylation using a rule-based system,” Bioinformatics, vol. 21, no. 11, pp. 2759–2765, 2005.
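The "20% more" figure can be checked directly from the two pair counts on the slide. The helper name `relative_increase` is introduced here only for the check.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before

pairs_gain = relative_increase(1768, 2111)
print(f"{pairs_gain:.1%}")  # 19.4%, which the slide rounds to 20%
```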

Page 46: iSimp: A Sentence Simplification System for Biomedical Text

Improvement of sentence ranking systems

RankPref*
• Ranks sentences containing a particular gene and a relevant term
• SVM with linear kernel

• 100 gene-term pairs
• With simplification, we expect the relation between the gene and the relevant term to be clearer
• nDCG of ranked sentences containing gene and relevant term: 67% → 74% (relative improvement: 10.4%)

nDCG (normalized discounted cumulative gain) is a widely used metric in information retrieval for evaluating the quality of ranked lists.

* C. O. Tudor and K. Vijay-Shanker, “RankPref: Ranking sentences describing relation between biomedical entities with an application,” in Procs. of BioNLP in conjunction with NAACL-HLT, 2012.
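For readers unfamiliar with nDCG, the sketch below computes one common formulation (linear gain with a log2 rank discount) and checks the relative improvement quoted on the slide. The slides do not say which nDCG variant was used in the evaluation, so the formula choice here is an assumption; another common variant uses a gain of 2^rel - 1.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))  # ideal ordering
print(ndcg([1, 2, 3]))  # worst ordering of the same judgments

# Relative improvement reported on the slide: 67% -> 74%
relative_improvement = (0.74 - 0.67) / 0.67
print(f"{relative_improvement:.1%}")  # 10.4%
```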

Page 49: iSimp: A Sentence Simplification System for Biomedical Text

Summary

We developed a sentence simplifier, iSimp
• Detects six simplification structures
• Generates simplified sentences
• Runs efficiently in linear time

We confirmed in experiments that iSimp helps improve text mining tools
• Rule-based information extraction tools
• Machine-learning-based sentence ranking tools

Page 51: iSimp: A Sentence Simplification System for Biomedical Text

Future work

Analysis and development
• Improve boundary detection for conjunction constructs
• Examine the utility of iSimp for other biomedical text mining tools
• Analyze the use of simplification for different entity/concept relations

Dissemination
• Make iSimp available as a software module
• Release the benchmark corpus used in the study

Page 53: iSimp: A Sentence Simplification System for Biomedical Text

Acknowledgment

• National Science Foundation (grant number 1062520)
• National Institutes of Health (grant number 1G08LM010720)
• OpenNLP for MaxEnt tagger and chunker
• GENIA for the training corpus
• Judges who helped annotate the corpus

Page 54: iSimp: A Sentence Simplification System for Biomedical Text

Q & A
