Towards Building A Database of Phosphorylate Interactions
Extracting Information from the Literature
M. Narayanaswamy & K. E. Ravikumar
AU-KBC Center, Chennai, India
&
K. Vijay-Shanker
University of Delaware
Information Extraction from the Literature
• Wealth of information available (only) in unstructured form (scientific literature)
• Need to store data in structured form (databases) for bioinformatics applications
• Information extraction is an active field.
• Focus in the biological domain -- extracting information on protein pairs that interact
Phosphorylation Extraction
• <Agent = Frp-1
Theme = p53
Site = Ser 15>
• <Agent = JNK
Theme = c-Jun
Site = unk>
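The two templates above can be held in a small record type. This is only a sketch: the class name `PhosphorylationRecord` and its `render` method are hypothetical, and `None` stands in for the slide's "unk".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhosphorylationRecord:
    """One extracted phosphorylation event (hypothetical record type)."""
    agent: Optional[str]  # what does the phosphorylating, if the text states it
    theme: str            # what gets phosphorylated
    site: Optional[str]   # residue/position, or None when not stated ("unk")

    def render(self) -> str:
        """Format the record in the <Agent = ..., Theme = ..., Site = ...> style."""
        return "<Agent = {}, Theme = {}, Site = {}>".format(
            self.agent or "unk", self.theme, self.site or "unk"
        )


records = [
    PhosphorylationRecord(agent="Frp-1", theme="p53", site="Ser 15"),
    PhosphorylationRecord(agent="JNK", theme="c-Jun", site=None),
]
for record in records:
    print(record.render())
```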
Why Phosphorylation
• Central role in signal transduction.
• One of the more widely studied post-translational modifications.
• The IE approach may generalize to other post-translational modifications and to binding.
• Different challenges
  – Agent and target are not just proteins
  – Site of phosphorylation
Steps in Text Processing and Information Extraction
• Basic Text Processing – e.g., sentence boundary detection.
• Part of Speech Tagging
• Name/term Detection
• Phrase (esp., Noun and Verb Phrase) chunking
• Type Classification of Terms and noun phrases
• Template Pattern Matching
BioNEx (PSB 2003)
• Detects Names/Terms of the following types:
  – Protein/gene
  – Protein/gene parts
  – Chemicals
  – Source
  – Others
• Two tasks – Detection and Classification
Classification -- F-Term
• Names are often descriptive NPs
  – Simian immunodeficiency virus
  – T cell
  – Mitogen-activated protein kinase
  – Ras guanine nucleotide exchange factor
• Description of function/class of entities
• Useful to assign types
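A minimal sketch of typing a descriptive NP by its functional term. The `F_TERMS` table and the right-to-left scan are illustrative assumptions, not the actual BioNEx word lists or algorithm.

```python
# Hypothetical f-term table: functional word -> BioNEx-style type.
F_TERMS = {
    "kinase": "Protein/gene",
    "factor": "Protein/gene",
    "receptor": "Protein/gene",
    "domain": "Protein/gene parts",
    "virus": "Source",
    "cell": "Source",
}


def classify_by_f_term(noun_phrase: str) -> str:
    """Return the type of the rightmost word that is a known f-term."""
    words = noun_phrase.lower().split()
    # Scan from the right, since the head of an English NP is usually last.
    for word in reversed(words):
        if word in F_TERMS:
            return F_TERMS[word]
    return "Others"


for name in [
    "Simian immunodeficiency virus",
    "T cell",
    "Mitogen-activated protein kinase",
    "Ras guanine nucleotide exchange factor",
]:
    print(name, "->", classify_by_f_term(name))
```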
Additional Sources for Classification
• Using context – h-terms such as “expression”
  – “…IL-2 expression…”
• Appositives
  – “Mek1, a tyrosine kinase,…”
• Acronyms
  – Mitogen-activated protein kinase (MAPK) … MAPK …
• Coordination
• High precision and recall (PSB 2003)
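The acronym source can be sketched as follows: find a "long form (ACRONYM)" definition so that a type assigned to the long form can carry over to later bare uses of the acronym. The initial-letter matching is a simplifying assumption, not the method used in BioNEx.

```python
import re

# A parenthesised token that starts with a capital letter, e.g. "(MAPK)".
ACRONYM_DEF = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_acronym_definitions(text: str) -> dict:
    """Map each acronym to the long form whose word initials spell it."""
    definitions = {}
    for match in ACRONYM_DEF.finditer(text):
        acronym = match.group(1)
        prefix = text[:match.start()]
        # Words before the parenthesis; hyphens also separate words.
        tokens = list(re.finditer(r"[^\s-]+", prefix))
        if len(tokens) < len(acronym):
            continue
        candidate = tokens[-len(acronym):]
        initials = "".join(token.group()[0] for token in candidate)
        if initials.lower() == acronym.lower():
            definitions[acronym] = prefix[candidate[0].start():].strip()
    return definitions


text = "Mitogen-activated protein kinase (MAPK) is activated. MAPK then acts."
print(find_acronym_definitions(text))
```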
Steps in IE
• Basic Text Processing – e.g., sentence boundary detection.
• Part of Speech Tagging
• Name/term Detection
• Phrase (esp., Noun and Verb Phrase) chunking
• Type Classification of Terms and noun phrases
• Template Pattern Matching
Phrase Chunking
• Detect BaseNPs and Verb Groups
  – Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10
• Verb groups (passive vs. active forms)
• Appositives
  – … Sic1, an inhibitor …, is phosphorylated
• Relative Clauses
  – … Ser38 which is phosphorylated …
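Distinguishing passive from active verb groups could look like the sketch below. It is a crude heuristic written for illustration; the `BE_FORMS` list and the adverb-skipping rule are assumptions, and a real chunker would rely on part-of-speech tags.

```python
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}


def verb_group_voice(verb_group: str) -> str:
    """Label a verb group 'passive' or 'active' from its last words."""
    words = verb_group.lower().split()
    if not words:
        raise ValueError("empty verb group")
    head = words[-1]
    if not head.endswith("ed"):
        return "active"  # base or present form, e.g. "... to phosphorylate"
    # Look at the nearest preceding word that is not an "-ly" adverb.
    for word in reversed(words[:-1]):
        if word.endswith("ly"):
            continue
        return "passive" if word in BE_FORMS else "active"
    return "active"  # bare past tense, e.g. "phosphorylated"


print(verb_group_voice("was found to be able to phosphorylate"))
print(verb_group_voice("is phosphorylated"))
```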
Steps in IE
• Basic Text Processing – e.g., sentence boundary detection.
• Part of Speech Tagging
• Name/term Detection
• Phrase (esp., Noun and Verb Phrase) chunking
• Type Classification of Terms and noun phrases
• Template Pattern Matching
Why type classification?
• A phosphorylated B in C
  – ATR/FRP-1 also phosphorylated p53 in Ser 15 …
  – Active Chk2 phosphorylated the SQ/TQ sites in Chk2 SCD …
  – cdk9/cyclinT2 could phosphorylate the retinoblastoma gene (pRb) in human cell lines
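In the examples above the phrase after "in" is a site only in the first sentence, so the surface pattern alone cannot fill the Site slot. A minimal sketch of one typing check follows; the residue regex is an assumption and covers only Ser/Thr/Tyr followed by a position number.

```python
import re

# Hypothetical site pattern: a phosphorylatable residue plus a position.
RESIDUE = re.compile(
    r"\b(?:Ser|Thr|Tyr|serine|threonine|tyrosine)[\s-]*\d+\b",
    re.IGNORECASE,
)


def looks_like_site(phrase: str) -> bool:
    """True if the phrase contains a residue-plus-position mention."""
    return RESIDUE.search(phrase) is not None


for phrase in ["Ser 15", "Ser10", "Chk2 SCD", "human cell lines"]:
    print(phrase, "->", "Site" if looks_like_site(phrase) else "not a site")
```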
Type Classification
• Extensive use of type information in rules
• Typing done by means of
  – Phrase internal -- e.g., Ras guanine nucleotide exchange factor Sos
  – Contextual – e.g., homolog of TAF(II)
  – Syntactic information – appositive, coordination etc.
Steps in IE
• Basic Text Processing – e.g., sentence boundary detection.
• Part of Speech Tagging
• Name/term Detection
• Phrase (esp., Noun and Verb Phrase) chunking
• Type Classification of Terms and noun phrases
• Template Pattern Matching
Patterns and Templates
• <Agent> <VG-active-phosphorylate> <Target> (in/at <SITE>)?
  – Active p90Rsk2 was found to be able to phosphorylate histone H3 at Ser10
• Active, Passive, Adjectival forms for phosphorylate/phosphorylated
• Different orders and optionality of arguments
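A sketch of matching the active pattern over a sentence that has already been chunked and typed. The chunk labels ("PROTEIN", "PREP", "SITE") and the tuple representation are assumptions made for illustration, not the system's internal format.

```python
from typing import Dict, List, Optional, Tuple

Chunk = Tuple[str, str]  # (type label, text); labels are hypothetical


def match_active_pattern(chunks: List[Chunk]) -> Optional[Dict[str, Optional[str]]]:
    """Match <Agent> <VG-active-phosphorylate> <Target> (in/at <SITE>)?"""
    for i in range(len(chunks) - 2):
        (t1, agent), (t2, _), (t3, target) = chunks[i:i + 3]
        if (t1, t2, t3) != ("PROTEIN", "VG-active-phosphorylate", "PROTEIN"):
            continue
        site = None
        tail = chunks[i + 3:i + 5]  # optional "in/at <SITE>" part
        if (len(tail) == 2 and tail[0][0] == "PREP"
                and tail[0][1].lower() in ("in", "at")
                and tail[1][0] == "SITE"):
            site = tail[1][1]
        return {"Agent": agent, "Target": target, "Site": site}
    return None


sentence = [
    ("PROTEIN", "Active p90Rsk2"),
    ("VG-active-phosphorylate", "was found to be able to phosphorylate"),
    ("PROTEIN", "histone H3"),
    ("PREP", "at"),
    ("SITE", "Ser10"),
]
print(match_active_pattern(sentence))
```

Matching on chunk types rather than raw words is what lets one pattern cover long verb groups such as "was found to be able to phosphorylate".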
Patterns for Phosphorylation
• Non-verbal patterns (less common than verbal ones, but still frequent)
• Phosphorylation of <Target> (by <Agent>)? (in/at <Site>)?
• Phosphorylation of <Site> …
• <Agent> <VG-active> <Target> by/via phosphorylation (at <Site>)?
• Altogether, a large number of patterns from examining 300 abstracts and 10 journal articles.
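The first nominal pattern can be approximated with a regular expression over a single phrase. This is a simplified sketch: it works on raw text instead of typed chunks, and the function name `extract_nominal` is hypothetical.

```python
import re
from typing import Dict, Optional

# Phosphorylation of <Target> (by <Agent>)? (in/at <Site>)?
NOMINAL = re.compile(
    r"phosphorylation of (?P<target>.+?)"
    r"(?: by (?P<agent>.+?))?"
    r"(?: (?:in|at) (?P<site>.+?))?$",
    re.IGNORECASE,
)


def extract_nominal(phrase: str) -> Optional[Dict[str, Optional[str]]]:
    """Fill Target/Agent/Site from a 'phosphorylation of ...' phrase."""
    match = NOMINAL.search(phrase.strip().rstrip("."))
    if match is None:
        return None
    return {
        "Target": match.group("target"),
        "Agent": match.group("agent"),
        "Site": match.group("site"),
    }


print(extract_nominal("Phosphorylation of c-Jun by JNK"))
print(extract_nominal("phosphorylation of histone H3 at Ser10."))
```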
Sentence-Based Evaluation
            Precision   Recall   F-measure
Agent          91         88        89
Theme          96         87        92
Site           94         73        82
Relation       89         77        83

            Precision   Recall   F-measure
Agent          89         89        81
Theme          98         93        96
Site          100         96        96
Relation       96         89        92
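For reference, the F-measure column is assumed here to be the usual balanced harmonic mean of precision and recall, F = 2PR / (P + R); the slides do not state the formula. Values recomputed from the rounded table entries need not match the reported column exactly.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the first block of the table above.
for slot, p, r in [("Agent", 91, 88), ("Site", 94, 73), ("Relation", 89, 77)]:
    print(slot, round(f_measure(p, r)))
```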
Utility in Building Databases
• IE on 1000 abstracts – 5m/3s
• Precision on 200 abstracts – Relation > 92%
• Scales up well
• Useful for constructing DBs.
Discussion
• High precision and recall
• Beyond protein-protein (e.g., site)
• Handles non-verbal patterns
• Generalizes to other post-translational modifications? (acetylation, methylation, …)
Future Work
• Piecemeal information specification
  – X phosphorylates Y + phosphorylation of Y at Z = X phosphorylates Y at Z
• Fusion/Information Merging
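Merging piecemeal information could be sketched as below. Since this is listed as future work, the merge rule shown (same Theme, no conflicting slot values) is purely an assumption, and `merge_records` is a hypothetical name.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def merge_records(a: Record, b: Record) -> Optional[Record]:
    """Combine two partial records about the same Theme, or return None."""
    if a["Theme"] != b["Theme"]:
        return None
    merged: Record = {"Theme": a["Theme"]}
    for slot in ("Agent", "Site"):
        left, right = a.get(slot), b.get(slot)
        if left and right and left != right:
            return None  # conflicting values: do not merge
        merged[slot] = left or right
    return merged


first = {"Agent": "X", "Theme": "Y", "Site": None}   # X phosphorylates Y
second = {"Agent": None, "Theme": "Y", "Site": "Z"}  # phosphorylation of Y at Z
print(merge_records(first, second))
```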