Biocuration2012 poster P113



TEXT MINING ASSISTANTS IN WIKIS FOR BIOCURATION

Bahar Sateli (1), Caitlin Murphy (2), René Witte (1), Marie-Jean Meurs (1, 2) and Adrian Tsang (2)

(1) Semantic Software Lab. & (2) Centre for Structural and Functional Genomics
[{b_satel,mjmeurs}@encs,rwitte@cse,{cmurphy,tsang}@gene].concordia.ca

Literature Curation
Manually refining and updating bioinformatics databases
Labor-intensive, error-prone and expensive task

[Figure: manual literature curation workflow. Labels: WWW, Web Crawler, Downloaded Literature, Curator, Spreadsheet, Online Query Interface, Database]

mycoMINE
Natural Language Processing (NLP) techniques supporting Biofuel Research

[Figure: biofuel research process and experiments. Labels: fungus (organism), gene, enzyme, substrate, products, pH, temperature, buffer, conditions, assay, activity assay, kinetic assay, specific activity, substrate specificity]

Automatic extraction of semantic entities and facts:
Assay, Enzyme, Gene, Host, Organism, Substrate
Activity Assay Conditions, Glycosylation, Kinetic Assay Conditions, pH, Product Analysis, Specific Activity, Substrate Specificity, Temperature

mycoMINE [1]: a text mining pipeline based on GATE [2].

mycoMINE Workflow

[Figure: mycoMINE workflow. Labels, in extraction order: Ontology; Instantiated; Processed Documents; Populated Ontology for; BRENDA; External DBs: SwissProt, NCBI; mapping lists; gazetteer lists]

Processing components labeled in the figure:
Grammar: Named Entity and Fact recognition
Normalization: resolve ambiguities, abbreviations...
Gazetteer: find matching entities, assign ontology classes
NLP preprocessing: tokenisation, sentence splitting...
Coreference Resolution: determine identical individuals
Grounding: link to external DBs
OWL Ontology Export
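As a rough illustration of two of the components named above (NLP preprocessing and the gazetteer), the following Python sketch splits text into sentences and assigns ontology classes to matching terms. It is not the mycoMINE or GATE implementation: the gazetteer entries, the `Annotation` type, and the function names are hypothetical, and the order in which the stages run is an assumption, since the figure layout is not recoverable from the text.

```python
import re
from dataclasses import dataclass

# Hypothetical gazetteer list: surface form -> ontology class.
# The actual mycoMINE gazetteer lists are not shown on the poster.
GAZETTEER = {
    "xylanase": "Enzyme",
    "aspergillus niger": "Organism",
    "xylan": "Substrate",
}


@dataclass
class Annotation:
    text: str
    start: int
    end: int
    ontology_class: str


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def gazetteer_lookup(text: str) -> list[Annotation]:
    """Find whole-word, case-insensitive gazetteer matches."""
    annotations = []
    for term, ontology_class in GAZETTEER.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                Annotation(m.group(0), m.start(), m.end(), ontology_class)
            )
    # Return matches in document order.
    return sorted(annotations, key=lambda a: a.start)


def run_pipeline(text: str) -> dict:
    """Assumed stage order: preprocessing first, then gazetteer lookup."""
    return {
        "sentences": split_sentences(text),
        "annotations": gazetteer_lookup(text),
    }


if __name__ == "__main__":
    sample = "Aspergillus niger secretes xylanase. The enzyme degrades xylan at pH 5."
    for annotation in run_pipeline(sample)["annotations"]:
        print(annotation.ontology_class, "->", annotation.text)
```

The whole-word match keeps "xylan" from being tagged inside "xylanase". The later stages (normalization, grammar rules, coreference, grounding) would need more than a dictionary lookup and are left out of this sketch.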

GenWiki
An NLP-enhanced wiki system based on the MediaWiki engine
Seamlessly integrates NLP capabilities in a wiki environment
Allows curators to discover knowledge embodied in the wiki content

GenWiki User Interface

Semantic Infrastructure
NLP capabilities provided by the Semantic Assistants Framework [3]
Brokers NLP pipelines as Web services
Wiki-NLP communication through wiki plug-ins

[Figure: GenWiki semantic infrastructure. Labels: Browser (JavaScript); Wiki System (Web Server, Graphical User Interface, Rendering Engine, Database Interface, Plug-in, Database, Wiki); Client-Side Abstraction Layer; Wiki-SA Connector; Service API; Service Information; Service Invocation; Semantic Assistants (Web Server, NLP Service Connector, Language Descriptions, Ontologies)]

Publications

[1] MJ. Meurs, C. Murphy, I. Morgenstern, G. Butler, J. Powlowski, A. Tsang, and R. Witte, “Semantic Text Mining Support for Lignocellulose Research”, BMC MIDM, Vol. 12, Suppl. 1, 2012, to appear.

[2] H. Cunningham et al., “Text Processing with GATE (Version 6)”, University of Sheffield, Department of Computer Science, 2011.

[3] R. Witte and T. Gitzinger, “Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients”, ASWC 2008, Bangkok, Thailand. Springer LNCS 5367, pp. 360–374, 2009.

Application
Templating mechanism to present the NLP results to users
Semantic metadata embedded in each page
Using the Semantic MediaWiki (SMW) markup
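A minimal sketch of how extracted entities might be written into a page as SMW annotations, using the standard `[[Property::value]]` syntax. The property layout (a `hasType` property plus one property per entity type) is an assumption inferred from the property names that appear elsewhere on the poster. The section heading and the function name `render_entity_section` are hypothetical, and the actual GenWiki templates are not shown.

```python
def render_entity_section(entities: list[tuple[str, str]]) -> str:
    """Render (entity_type, surface_text) pairs as wiki text with SMW annotations.

    Assumed layout: each entity becomes [[<Type>::<text>]], and the page gets
    one [[hasType::<Type>]] annotation per distinct entity type.
    """
    lines = ["== Extracted entities =="]
    seen_types: list[str] = []
    for entity_type, text in entities:
        if entity_type not in seen_types:
            seen_types.append(entity_type)
        lines.append(f"* {entity_type}: [[{entity_type}::{text}]]")
    # One page-level type annotation per distinct entity type.
    lines.extend(f"[[hasType::{entity_type}]]" for entity_type in seen_types)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_entity_section([
        ("Enzyme", "xylanase"),
        ("Organism", "Aspergillus niger"),
        ("Enzyme", "cellulase"),
    ]))
```

The sketch does not escape characters such as `|` or `]]` in entity text. A real template would need to handle them before embedding values in wiki markup.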

Semantic Entity Retrieval
Retrieve literature that contains certain types of entities
Using SMW inline queries

{{#ask: [[hasType::Enzyme]]
 | ?Enzyme = Enzyme Entities Found
 | format = table
 | headers = plain
 | default = No pages found!
 | mainlabel = Page Name
}}
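The same query shape can be produced for other entity types. The helper below is a sketch that only builds the query string. The function name `build_ask_query` and its parameters are hypothetical, and it assumes every entity type follows the `hasType` / `?<Type>` pattern used in the Enzyme query above.

```python
def build_ask_query(
    entity_type: str,
    column_label: str,
    default: str = "No pages found!",
    mainlabel: str = "Page Name",
) -> str:
    """Build an SMW inline #ask query listing pages that contain entity_type."""
    parts = [
        f"[[hasType::{entity_type}]]",       # condition: page has this type
        f"?{entity_type} = {column_label}",  # printout column and its label
        "format = table",
        "headers = plain",
        f"default = {default}",
        f"mainlabel = {mainlabel}",
    ]
    return "{{#ask: " + "\n | ".join(parts) + "\n}}"


if __name__ == "__main__":
    print(build_ask_query("Organism", "Organism Entities Found"))
```

Pasting the returned string into a wiki page should list the matching pages as a table, provided the pages carry the corresponding annotations.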

Evaluation
Methodology:
Corpus of 15 documents
Manually pre-filled GenWiki with full-text papers
Used mycoMINE to automatically extract relevant entities
Kept track of curation time

Average Curation Time: No Semantic Support vs. GenWiki

Task                  No Support    GenWiki      Reduction
Abstract Selection    1 min.        20 sec.      67%
Full Paper            37.5 min.     30.6 min.    20%
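The reported reductions can be checked against the listed averages with a short calculation. Recomputing from the rounded times on the poster gives about 66.7% for abstract selection and about 18.4% for full papers, so the 20% figure is presumably rounded or based on unrounded times that the poster does not show. The function name below is illustrative.

```python
def percent_reduction(before_seconds: float, after_seconds: float) -> float:
    """Relative time saved, as a percentage of the baseline time."""
    return 100.0 * (before_seconds - after_seconds) / before_seconds


# Averages as listed on the poster, converted to seconds.
abstract_selection = percent_reduction(60.0, 20.0)         # 1 min. vs. 20 sec.
full_paper = percent_reduction(37.5 * 60.0, 30.6 * 60.0)   # 37.5 min. vs. 30.6 min.

print(f"Abstract selection: {abstract_selection:.1f}%")  # 66.7%
print(f"Full paper: {full_paper:.1f}%")                  # 18.4%
```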

Conclusion
GenWiki shows that seamlessly integrating NLP capabilities into a wiki is efficient: in our evaluation, it reduced the time curators need to annotate and retrieve entities of interest from the domain literature.

Acknowledgments
Funding: Genome Canada & Génome Québec.
Collaboration: Sherry Wu.