Knowledge Discovery in Biomedicine Limsoon Wong Institute for Infocomm Research.
Copyright © 2005 by Limsoon Wong
Some Interesting Issues in Constructing Gene/Protein Networks
Limsoon Wong
Institute for Infocomm Research
Singapore
Issues
• Sound:
– Are the contents of our databases correct?
– Trying our hands on a data cleansing problem
• Complete:
– Is the structure of our databases expressive enough to capture critical information explicitly?
• Understandable:
– Are our databases or search results understandable?
• Other issues relating to NLP/IE
Soundness: Are the contents of our databases correct?
This part is based on the work of Judice Koh and Vladimir Brusic
Categories of errors found
Example Spelling Errors
[Figure: taxonomy of error categories across the levels ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE and MULTIPLE SOURCE DATABASE. Labels shown: Invalid values, Ambiguity, Incompatible schema, Spelling errors, Format violation, Annotation error, Dubious sequences, Sequence redundancy, Data provenance flaws, Cross-annotation error, Sequence structure violation, Vector contaminated sequence, Erroneous data transformation.]
• Usually typo errors
• Occur in different fields of the record
• We identified 569 possibly misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspelling: asociated
Correction: associated
Context: EMBL:Y18050, E.faecium pbp5 gene. TITLE Modification of penicillin-binding protein 5 asociated with highlevel ampicillin resistance in Enterococcus faecium. gi|1143442|emb|X92687.1|EFPBP5G

Misspelling: tranmembrane
Correction: transmembrane
Context: Swiss-Prot:P03385, Env polyprotein precursor. DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein]. gi|119478|sp|P03385|ENV_MLVMO

Misspelling: Cassete
Correction: Cassette
Context: Patent Database:A76783, Sequence 11 from Patent WO9315210. CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker". gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]

Misspelling: Immunoglobin
Correction: Immunoglobulin
Context: GenBank:AAD26534, nectin-1 [Rattus norvegicus]. TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein. gi|4590334|gb|AAD26534.1
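One way to flag misspellings like these is a dictionary lookup with approximate matching. The sketch below is only an illustration and not the method used in the work described here. The vocabulary, the function name suggest_corrections and the similarity cutoff are all assumptions. It uses Python's standard difflib module.

```python
import difflib

# Hypothetical reference vocabulary; a real run would need a much larger
# domain dictionary.
VOCABULARY = ["associated", "transmembrane", "cassette",
              "immunoglobulin", "protein", "gene"]

def suggest_corrections(text, vocabulary=VOCABULARY, cutoff=0.85):
    """Map each unknown word to its closest vocabulary entry, if any."""
    known = set(vocabulary)
    suggestions = {}
    for raw in text.split():
        word = raw.strip('.,;:()[]"').lower()
        # Skip tokens with digits or hyphens, and words already known.
        if not word.isalpha() or word in known:
            continue
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if match:
            suggestions[word] = match[0]
    return suggestions

annotation = ("Modification of penicillin-binding protein 5 asociated "
              "with ampicillin resistance")
print(suggest_corrections(annotation))
```

A high cutoff keeps false alarms down, at the cost of missing misspellings that differ from the dictionary form by more than a character or two.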
Example Overlapping Intron/Exon Errors
[Figure: taxonomy of error categories across the levels ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE and MULTIPLE SOURCE DATABASE, with "Overlapping intron/exon" listed among the categories.]
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.
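A check for this kind of error can be sketched as an interval-overlap test over the annotated features of a record. This is a minimal sketch. The function name find_overlaps, the (name, start, end) layout and the coordinates are made up for illustration and are not taken from BN000507 or AF055916.

```python
from itertools import combinations

def find_overlaps(features):
    """Return pairs of feature names whose inclusive intervals overlap."""
    overlaps = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(features, 2):
        # Two inclusive intervals overlap when each starts before the other ends.
        if start_a <= end_b and start_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical coordinates, for illustration only.
features = [
    ("exon 5", 100, 250),
    ("intron 5", 251, 400),
    ("exon 6", 390, 520),
]
print(find_overlaps(features))  # [('intron 5', 'exon 6')]
```

The pairwise comparison is quadratic in the number of features, which is fine for a single gene. Sorting by start position first would scale better across whole records.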
Example Seqs w/ Identical Info
[Figure: taxonomy of error categories across the levels ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE and MULTIPLE SOURCE DATABASE, with "Replication of sequence information", "Different views" and "Overlapping annotations of the same sequence" also shown.]
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
Soundness: Trying our hands on a data cleansing problem
This part is based on work of Judice Koh
[Figure: data artifacts at each level and the cleaning methods applied to them]
Levels: ATTRIBUTE, RECORD, SINGLE-SOURCE DATABASE, MULTI-SOURCE DATABASE
Artifacts: Concatenated values, Spelling errors, Format violation, Synonyms, Homonyms/Abbreviations, Misuse of fields, Mis-fielded values, Features do not correspond with sequence, Putative features, Undersized sequences, Uninformative sequences, Replication of sequence information, Different views, Overlapping annotations of the same sequence, Fragments, Sequence structure violation, Vector contaminated sequences, Cross-annotation error
Cleaning methods: Schema remapping, Sequence Structure Parser, Dictionary lookup, Integrity constraints, Duplicate detection, Comparative analysis, Vector screening
Dataset
Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
• Scorpion venom dataset containing 520 records
– Query: scorpion AND (venom OR toxin)
– 251 duplicate pairs
• Snake PLA2 venom dataset containing 780 records
– Query: serpentes AND venom AND PLA2
– 444 duplicate pairs
• Expert annotation: 695 duplicate pairs are collectively identified.
Results
Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Identical sequences with the same sequence length, of the same species and not originating from PDB are 96.8% likely to be duplicates.
What else do we learn?
• The definitions of the sequence records do not play a role in identifying duplicate records
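As an illustration, Rule 1 could be applied to a pair of records roughly as follows. This is a sketch under stated assumptions: S(Seq)=1 is read as exact sequence identity, N(Seq Length)=1 as equal length, and M(PDB)=0 as neither record coming from PDB, following the plain-language gloss of the rule. The record layout, field names, accessions and sequences are hypothetical.

```python
def rule1_flags_duplicate(rec_a, rec_b):
    """Apply Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0."""
    # S(Seq) = 1, read here as exact sequence identity (assumption).
    same_sequence = rec_a["sequence"] == rec_b["sequence"]
    # N(Seq Length) = 1, read here as equal sequence lengths (assumption).
    # With exact identity this is implied, but the rule lists it separately.
    same_length = len(rec_a["sequence"]) == len(rec_b["sequence"])
    # M(PDB) = 0, read here as neither record originating from PDB (assumption).
    neither_from_pdb = "PDB" not in (rec_a["database"], rec_b["database"])
    return same_sequence and same_length and neither_from_pdb

# Hypothetical records; accessions and sequences are made up.
a = {"accession": "REC_A", "database": "GenBank", "sequence": "MKTLLVAGA"}
b = {"accession": "REC_B", "database": "SwissProt", "sequence": "MKTLLVAGA"}
c = {"accession": "REC_C", "database": "PDB", "sequence": "MKTLLVAGA"}

print(rule1_flags_duplicate(a, b))  # True
print(rule1_flags_duplicate(a, c))  # False
```

A pair flagged this way is only a candidate duplicate. The percentages attached to the rules describe how often such pairs were duplicates in the annotated datasets.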
Completeness: Is the structure of our databases expressive enough to capture critical information explicitly?
Expressive Power
• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation.
• Is there a structured database that is able to capture all info in that paper explicitly?
• Is there a semi-structured database that is able to capture all info in that paper explicitly?
• How well does this (semi-)structured database generalize to other similar types of papers?
Understandability: Are our databases or search results understandable?
Self-Organization
• Take a search on p53. You will get >300k hits, or some number like that, on MEDLINE
• It is not feasible for anyone to go through all of that to find what they want! And this problem keeps growing, as MEDLINE doubles every 1-2 years.
• Need to organize the database and/or the search results into a hierarchy or “semantic” net to make it easier for users to understand or browse the results
• How do we define this hierarchy/net?
• Can this hierarchy/net be self-organized?
Problems relating to NLP/IE
This part is mostly based on work of Chris Tan and See-Kiong Ng
Handling full-length papers
• Source document structure parsing
• Hyper-linked file tracking
• Figure and table processing
• Special symbol handling
Information retrieval
• Document and sentence retrieval
• Relevant interaction filtering
Bio name recognition
• Nomenclature loosely followed
• Frequent use of conjunction and disjunction in bio names with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes and proteins used interchangeably
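To make the shared-head-noun problem concrete, the sketch below expands a coordinated name into its individual names. It is a naive illustration, not a technique from the work summarised here. It assumes the head noun is the last word of the phrase, and the function name expand_shared_head and the example phrases are hypothetical.

```python
import re

def expand_shared_head(phrase):
    """Expand 'A, B and C head' into ['A head', 'B head', 'C head']."""
    # Split on commas and on the standalone words "and" / "or".
    parts = [p.strip() for p in re.split(r",|\band\b|\bor\b", phrase) if p.strip()]
    if len(parts) < 2:
        return [phrase.strip()]
    # Assume the last word of the final conjunct is the shared head noun.
    *modifier_words, head = parts[-1].split()
    if not modifier_words:
        return [phrase.strip()]
    modifiers = parts[:-1] + [" ".join(modifier_words)]
    return [f"{modifier} {head}" for modifier in modifiers]

print(expand_shared_head("IL-2 and IL-4 receptor"))
# ['IL-2 receptor', 'IL-4 receptor']
```

Real names break the last-word assumption easily, for example with multi-word heads or long descriptive names, which is part of why this task is hard.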
Bio-interaction extraction
• Inherent complexity of biological interactions
– Sentences describing them also tend to be complicated
• Domain knowledge is often needed for interaction template filling
Extraction of other relevant info
• Contextual information
– Species, cell type, cellular localisation, etc.
• Negative information
• Speculative & incomplete facts
Information integration
• Bio-name mapping
• Bio-interaction mapping
– How do you know two complex sentences are talking about the same interaction?
Resource for training & benchmarking
• Is there such a good resource, especially for the more complex tasks?
[Figure: I2R organisation chart. Labels shown: Communications & Devices, Services & Applications, Media, Media Processing, Human Centric Media, Media Semantics, Infocomm Security, Context-Aware Systems, Knowledge Discovery, Radio Systems, Networking, Lightwave, Embedded Systems, Digital Wireless.]
Acknowledgements
Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan
NLP/IE Issues: See-Kiong Ng, Chris Tan