Towards Building A Database of Phosphorylate Interactions Extracting Information from the Literature...

Towards Building A Database of Phosphorylate Interactions Extracting Information from the Literature M. Narayanaswamy & K. E. Ravikumar AU-KBC Center, Chennai, India & K. Vijay-Shanker University of Delaware

Page 1

Towards Building A Database of Phosphorylate Interactions

Extracting Information from the Literature

M. Narayanaswamy & K. E. Ravikumar

AU-KBC Center, Chennai, India

& K. Vijay-Shanker

University of Delaware

Page 2

Information Extraction from the Literature

• Wealth of information available (only) in unstructured form (scientific literature)

• Need to store data in structured form (databases) for bioinformatics applications

• Information extraction is an active field.

• Focus in the biological domain -- extracting information on protein pairs that interact

Page 3

Phosphorylation Extraction

• <Agent = Frp-1, Theme = p53, Site = Ser 15>

• <Agent = JNK, Theme = c-Jun, Site = unk>
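One way to picture the target output is as a small record with three slots. The sketch below is a hypothetical representation: the class name `PhosphorylationRecord`, its field names, and the `as_template` helper are assumptions made for illustration, not the authors' schema. An unfilled slot is stored as `None` and printed as `unk`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhosphorylationRecord:
    """Hypothetical record for one extracted phosphorylation event."""
    agent: Optional[str] = None  # the kinase, e.g. "Frp-1"
    theme: Optional[str] = None  # the substrate, e.g. "p53"
    site: Optional[str] = None   # the residue, e.g. "Ser 15"; None means unknown

    def as_template(self) -> str:
        """Render the record in the <Agent = ..., Theme = ..., Site = ...> style."""
        def show(value: Optional[str]) -> str:
            return value if value is not None else "unk"
        return (f"<Agent = {show(self.agent)}, "
                f"Theme = {show(self.theme)}, "
                f"Site = {show(self.site)}>")


examples = [
    PhosphorylationRecord(agent="Frp-1", theme="p53", site="Ser 15"),
    PhosphorylationRecord(agent="JNK", theme="c-Jun"),  # site not stated
]
```

Keeping unknown slots explicit, rather than dropping the record, lets a later step fill them in from other sentences.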

Page 4

Why Phosphorylation

• Central role in signal transduction.

• One of the more widely studied

• IE should generalize to other post-translational modifications and binding.

• Different challenges

– Agent and target – not just proteins

– Site of phosphorylation

Page 5

Steps in Text Processing and Information Extraction

• Basic Text Processing – e.g., sentence boundary detection.

• Part of Speech Tagging

• Name/term Detection

• Phrase (esp., Noun and Verb Phrase) chunking

• Type Classification of Terms and noun phrases

• Template Pattern Matching

Page 6

BioNEx (PSB 2003)

• Detects Names/Terms of the following types:

– Protein/gene

– Protein/gene parts

– Chemicals

– Source

– Others

• Two tasks – Detection and Classification

Page 7

Classification -- F-Term

• Names are often descriptive NPs

– Simian immunodeficiency virus

– T cell

– Mitogen-activated protein kinase

– Ras guanine nucleotide exchange factor

• Description of function/class of entities

• Useful to assign types

Page 8

Additional Sources for Classification

• Using context – h-terms such as “expression” – “…IL-2 expression…”

• Appositives – “Mek1, a tyrosine kinase,…”

• Acronyms

– Mitogen-activated protein kinase (MAPK) … MAPK …

• Coordination

• High precision and recall (PSB 2003)
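The acronym cue can be sketched as follows: once a long form such as "Mitogen-activated protein kinase" has been typed, a later bare "MAPK" can inherit that type. The function below only covers the first half, finding "long form (ACRONYM)" definitions by matching initials. It is a minimal illustration under simple assumptions (one initial per word, hyphens split words) and is not the BioNEx implementation; the name `find_acronym_definitions` is hypothetical.

```python
import re


def find_acronym_definitions(text):
    """Return {acronym: long_form} for 'long form (ACRONYM)' occurrences.

    Assumption: each acronym letter is the initial of one preceding word,
    where hyphens also separate words.
    """
    found = {}
    for m in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = m.group(1)
        # Words that precede the opening parenthesis, with their offsets.
        tokens = list(re.finditer(r"[A-Za-z0-9]+", text[:m.start()]))
        if len(tokens) < len(short):
            continue
        candidate = tokens[-len(short):]
        initials = "".join(t.group(0)[0] for t in candidate)
        if initials.lower() == short.lower():
            found[short] = text[candidate[0].start():m.start()].strip()
    return found


sentence = ("Mitogen-activated protein kinase (MAPK) is activated. "
            "MAPK then phosphorylates its substrates.")
definitions = find_acronym_definitions(sentence)
```

Real biomedical acronyms often skip words or take several letters from one word, so a production system would need a more tolerant alignment.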

Page 9

Steps in IE

• Basic Text Processing – e.g., sentence boundary detection.

• Part of Speech Tagging

• Name/term Detection

• Phrase (esp., Noun and Verb Phrase) chunking

• Type Classification of Terms and noun phrases

• Template Pattern Matching

Page 10

Phrase Chunking

• Detect BaseNPs

– Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10

Page 11

Phrase Chunking

• Detect BaseNPs and Verb Groups

– Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10

Page 12

Phrase Chunking

• Detect BaseNPs

– Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10

• Verb groups (passive vs. active forms)

Page 13

Phrase Chunking

• Detect BaseNPs

– Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10

• Verb groups

• Appositives

– … Sic1, an inhibitor …, is phosphorylated

• Relative Clauses

– … Ser38 which is phosphorylated …
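The active/passive distinction for verb groups matters because it decides which side of the verb holds the agent. A rough sketch of that check is given below; it assumes the chunker has already isolated the verb group as a list of tokens, and the function name `verb_group_voice` and the `BE_FORMS` set are illustrative assumptions rather than part of the described system.

```python
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


def verb_group_voice(verb_group):
    """Classify a verb group whose last token is a form of 'phosphorylate'.

    Returns "passive", "active", or None when the head is another verb.
    """
    tokens = [t.lower() for t in verb_group]
    if not tokens or not tokens[-1].startswith("phosphorylat"):
        return None
    head = tokens[-1]
    # Passive: participle head with a form of "be" earlier in the group.
    if head == "phosphorylated" and any(t in BE_FORMS for t in tokens[:-1]):
        return "passive"
    return "active"


active = verb_group_voice("was found to be able to phosphorylate".split())
passive = verb_group_voice(["is", "phosphorylated"])
```

Note that "was found to be able to phosphorylate" contains forms of "be" but is still active, because the head verb is not a participle.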

Page 14

Steps in IE

• Basic Text Processing – e.g., sentence boundary detection.

• Part of Speech Tagging

• Name/term Detection

• Phrase (esp., Noun and Verb Phrase) chunking

• Type Classification of Terms and noun phrases

• Template Pattern Matching

Page 15

Why type classification?

• A phosphorylated B in C

– ATR/FRP-1 also phosphorylated p53 in Ser 15 …

– Active Chk2 phosphorylated the SQ/TQ sites in Chk2 SCD …

– cdk9/cyclinT2 could phosphorylate the retinoblastoma gene (pRb) in human cell lines
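The three sentences share the surface form "A phosphorylated B in C", yet C is a site in the first, a protein part in the second, and a source in the third. The sketch below shows how type labels could drive slot filling. The function `toy_type` is a deliberately crude stand-in for a real type classifier and both function names are assumptions for illustration only.

```python
import re


def toy_type(np):
    """Very rough stand-in for a real type classifier (illustrative only)."""
    text = np.lower()
    if re.search(r"\b(ser|thr|tyr)\s?-?\d+\b", text) or text.endswith(("site", "sites")):
        return "site"
    if text.endswith(("cell", "cells", "cell line", "cell lines")):
        return "source"
    return "protein"


def fill_slots(a, b, c):
    """Map 'A phosphorylated B in C' onto Agent/Theme/Site using types of B and C."""
    slots = {"agent": a, "theme": None, "site": None}
    type_b, type_c = toy_type(b), toy_type(c)
    if type_b == "site" and type_c == "protein":
        slots["theme"], slots["site"] = c, b   # "... the SQ/TQ sites in Chk2 SCD"
    elif type_c == "site":
        slots["theme"], slots["site"] = b, c   # "... p53 in Ser 15"
    else:
        slots["theme"] = b                     # C is a source, not a site
    return slots


first = fill_slots("ATR/FRP-1", "p53", "Ser 15")
second = fill_slots("Active Chk2", "the SQ/TQ sites", "Chk2 SCD")
third = fill_slots("cdk9/cyclinT2", "the retinoblastoma gene (pRb)", "human cell lines")
```

Without the type labels, a single pattern would wrongly report "human cell lines" as a phosphorylation site.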

Page 16

Type Classification

• Extensive use of type information in rules

• Typing done by means of

– Phrase internal -- e.g., Ras guanine nucleotide exchange factor Sos

– Contextual – e.g., homolog of TAF(II)

– Syntactic information – appositive, coordination etc.

Page 17

Steps in IE

• Basic Text Processing – e.g., sentence boundary detection.

• Part of Speech Tagging

• Name/term Detection

• Phrase (esp., Noun and Verb Phrase) chunking

• Type Classification of Terms and noun phrases

• Template Pattern Matching

Page 18

Patterns and Templates

• <Agent> <VG-active-phosphorylate> <Target> (in/at <SITE>)?

– Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10

• Active, Passive, Adjectival forms for phosphorylate/phosphorylated

• Different orders and optionality of arguments
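The active-voice template can be sketched as a match over a sequence of labelled chunks. The input format, a list of `(label, text)` pairs, and the labels `PROTEIN`, `PREP`, and `SITE` are assumptions about what an upstream chunker and type classifier might emit; the slides note that agents and targets are not always proteins, so the single `PROTEIN` label is a simplification.

```python
def match_active_pattern(chunks):
    """Match <Agent> <VG-active-phosphorylate> <Target> (in/at <SITE>)?

    chunks: list of (label, text) pairs from a hypothetical chunker/typer.
    Returns a dict with agent/theme/site, or None if the pattern is absent.
    """
    for i in range(len(chunks) - 2):
        (l1, agent), (l2, _), (l3, target) = chunks[i:i + 3]
        if (l1, l2, l3) != ("PROTEIN", "VG-active-phosphorylate", "PROTEIN"):
            continue
        site = None
        rest = chunks[i + 3:i + 5]
        # Optional tail: a preposition "in" or "at" followed by a site.
        if (len(rest) == 2 and rest[0][0] == "PREP"
                and rest[0][1] in ("in", "at") and rest[1][0] == "SITE"):
            site = rest[1][1]
        return {"agent": agent, "theme": target, "site": site}
    return None


chunked = [
    ("PROTEIN", "Active p90Rsk2"),
    ("VG-active-phosphorylate", "was found to be able to phosphorylate"),
    ("PROTEIN", "histone H3"),
    ("PREP", "at"),
    ("SITE", "Ser10"),
]
result = match_active_pattern(chunked)
```

Passive and adjectival forms would need their own variants with the argument order reversed or with optional arguments.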

Page 19

Patterns for Phosphorylation

• Non-verbal patterns (less common than verbal ones, but still frequent)

• Phosphorylation of <Target> (by <Agent>)? (in/at <Site>)?

• Phosphorylation of <Site> …

• <Agent> <VG-active> <Target> by/via phosphorylation (at <Site>)?

• Altogether, a large number of patterns, derived from examining 300 abstracts and 10 journal articles.
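The nominal pattern "Phosphorylation of Target (by Agent)? (in/at Site)?" can be approximated with a regular expression over raw text. This is a minimal sketch under strong assumptions: names are single tokens, and sites look like Ser/Thr/Tyr followed by a number. The described system matches over typed phrases instead, and the names `NOMINAL` and `match_nominal` are hypothetical.

```python
import re

TOKEN = r"[\w/-]+"  # assumption: an entity name is one token, e.g. "c-Jun"
NOMINAL = re.compile(
    r"phosphorylation of (?P<theme>" + TOKEN + r")"
    r"(?: by (?P<agent>" + TOKEN + r"))?"
    r"(?: (?:in|at) (?P<site>(?:Ser|Thr|Tyr) ?\d+))?",
    re.IGNORECASE,
)


def match_nominal(sentence):
    """Return agent/theme/site for the first nominal match, or None."""
    m = NOMINAL.search(sentence)
    if m is None:
        return None
    return {"agent": m.group("agent"),
            "theme": m.group("theme"),
            "site": m.group("site")}


full = match_nominal("Phosphorylation of p53 by ATR at Ser15 was observed.")
partial = match_nominal("We observed phosphorylation of c-Jun by JNK.")
```

A pure regex cannot tell "Phosphorylation of Ser15" from "Phosphorylation of p53", which is one reason the patterns rely on type information.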

Page 20

Sentence-Based Evaluation

            Precision   Recall   F-measure
Agent           91        88        89
Theme           96        87        92
Site            94        73        82
Relation        89        77        83

Agent           89        89        81
Theme           98        93        96
Site           100        96        96
Relation        96        89        92
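For reference, the F-measure column can be related to the other two with the usual harmonic mean. The sketch below assumes the balanced F-measure (equal weight on precision and recall); the slides do not state which variant was used. Because the table shows rounded percentages, recomputing from the rounded precision and recall may differ slightly from the reported figure.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Inputs and output share the same scale (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


agent_f = round(f_measure(91, 88))     # first block, Agent row
relation_f = round(f_measure(89, 77))  # first block, Relation row
```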

Page 21

Utility in Building Databases

• IE on 1000 abstracts – 5m/3s

• Precision on 200 abstracts – Relation > 92%

• Scales up well

• Useful for constructing DBs.

Page 22

Discussion

• High precision and recall

• Beyond protein-protein (e.g., site)

• Non-verbal

• Generalizes to other post-translational modifications? (acetylation, methylation, …)

Page 23

Future Work

• Piecemeal information specification

• X phosphorylates Y

+

phosphorylation of Y at Z

=

X phosphorylates Y at Z

• Fusion/Information Merging
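The merging step proposed above can be sketched as combining two partial records that agree on the theme and do not conflict in any other slot. This is a naive illustration of the idea, not the authors' method: records are plain dicts, `merge_records` is a hypothetical name, and names are compared as exact strings with no normalization or coreference resolution.

```python
def merge_records(a, b):
    """Merge two partial records (dicts with agent/theme/site keys).

    Returns the merged dict, or None if the themes differ or any slot conflicts.
    """
    if a.get("theme") is None or a.get("theme") != b.get("theme"):
        return None
    merged = {}
    for slot in ("agent", "theme", "site"):
        va, vb = a.get(slot), b.get(slot)
        if va is not None and vb is not None and va != vb:
            return None  # conflicting fillers: keep the records separate
        merged[slot] = va if va is not None else vb
    return merged


verbal = {"agent": "X", "theme": "Y", "site": None}   # "X phosphorylates Y"
nominal = {"agent": None, "theme": "Y", "site": "Z"}  # "phosphorylation of Y at Z"
combined = merge_records(verbal, nominal)
```

Two records with the same theme may still describe different events, so a real fusion step would need more evidence than slot compatibility alone.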