The BIOCREATIVE Task in SEER
Outline
• Background for biomedical information extraction and BIOCREATIVE
• BIOCREATIVE NER Task
• Stanford-Edinburgh System
• Problems
Terms and Resources
Gene: An ordered sequence of nucleotides that encodes a product such as a protein.
Protein: Gene products composed of chains of amino acids, with sophisticated structures; kinases, enzymes, etc. are types of proteins.
Nucleotide: Thousands of nucleotides link to form a DNA/RNA molecule.
Molecular Biology: Branch of biology studying all of the above.
MEDLINE: The primary research database of the biomedical community, from nursing to drugs to genetics.
Gene Databases: FlyBase, MGI (mouse), Saccharomyces Genome Database (yeast).
Other Databases: Swiss-Prot (amino acid sequences of proteins), GenBank (nucleotide sequences of genes).
Biotechnology Information Explosion
David Landsman NCBI Presentation
NER in the Biomedical Domain
• Many types of entities can be studied in the biomedical domain (drug names, chemicals)
• Much research has focused on molecular biological entities, particularly genes and proteins
Gene Names
• Genes and gene products are constantly being discovered and new names invented
• Nomenclatures exist but vary from organism to organism
• Diverse:
  – ‘bride of frizzled disco’, ‘cheap date’, ‘broken heart’
  – ‘REP2’, ‘RFM’
• Ambiguous:
  – With other genes
  – Acronyms
  – With proteins, where genes and their products are often referred to by the same name (the 1st gene in LocusLink is officially alpha-1-B-glycoprotein)
F-Score | Evaluation Corpus | Publication
0.92/Gene | 750 sentences from FlyBase in which each gene is referred to by its official, single-word name; only sentences with at least 2 gene mentions were kept; all mentions appear in the dictionary; all articles concern Drosophila melanogaster | Proux et al 1998
0.97/Protein | 30 abstracts on SH3 protein | Fukuda et al 1998 (KeX)
0.92/Protein | SWISSPROT annotations on Transpath database | Hanisch et al 2003
0.15/DNA, 0.72/Protein | 100 MEDLINE abstracts | Nobata et al 1999
0.64/Protein | 99 MEDLINE abstracts | Eriksson et al 2002 (Yapex)
0.76/Protein, 0.03/RNA | 100 MEDLINE abstracts | Collier et al 2000
0.56 (24 classes) | GENIA corpus | Kazama et al 2002
0.70/Protein Molecule | GENIA corpus | Yamamoto et al 2003
Varying Tasks, Results and Evaluation Methods
BIOCREATIVE Motivations
• Seeking to be the MUC of the biomedical information extraction field
The BIOCREATIVE NER Task
• Given a single sentence from an abstract, identify all mentions of genes
• “(or proteins where there is ambiguity)”
• In November, the task was changed to identifying all mentions of genes and proteins (without distinguishing between them)
The BIOCREATIVE NER Data
Data Set Sentences Words Genes
Training 7500 200,000 9000
Development 2500 70,000 3000
Evaluation 5000 130,000 6000
Data consisted of MEDLINE abstracts annotated for the single NE “GENE”
The BIOCREATIVE NER Evaluation Method
• Only exact matches to the gold standard (which includes alternate correct boundaries for several cases) are counted as correct.
• Genes detected with incorrect boundaries are doubly penalized as false negatives and false positives.
Gold: chloramphenicol acetyl transferase reporter gene (missed, so FN)
Detected: transferase reporter gene (not in gold, so FP)
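The exact-match rule can be illustrated with a minimal sketch. It assumes mentions are represented as token spans `(start, end)` with an exclusive end, and that each gold mention carries a set of acceptable alternate boundaries; the function names `score` and `precision_recall_f` are hypothetical and are not the official BIOCREATIVE scorer.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumed representation)


def score(gold_mentions: List[Set[Span]], predicted: Set[Span]) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) under exact matching.

    Each gold mention is a set of acceptable spans (alternate boundaries).
    A prediction is correct only if it equals one of those spans exactly.
    """
    matched: Set[Span] = set()
    tp = 0
    for alternatives in gold_mentions:
        hits = alternatives & (predicted - matched)
        if hits:
            tp += 1
            matched.add(sorted(hits)[0])
    fn = len(gold_mentions) - tp    # gold mentions with no exact match
    fp = len(predicted) - len(matched)  # predictions matching no gold mention
    return tp, fp, fn


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and balanced F-score."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# "chloramphenicol acetyl transferase reporter gene" occupies tokens 0..4.
gold = [{(0, 5)}]
# The system found only "transferase reporter gene" (tokens 2..4).
print(score(gold, {(2, 5)}))  # one FP and one FN for a single boundary slip
```

A single boundary slip therefore lowers both precision and recall, which is why boundary errors weigh so heavily under this metric.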
Baseline System
• Maximum Entropy Tagger in Java
• Based on Klein et al (2003) CoNLL submission
• Baseline Performance: Precision 0.79, Recall 0.74, F-Score 0.76
• Efforts were mostly in trying different features, including different POS taggers, NP-chunking, Parsing, Gazetteers, Web, Abbreviations, Word Shapes, Tokenization…
Feature Set
Word Features (all times, e.g. Monday, April, are mapped to lower case): w_i, w_i-1, w_i+1, last “real” word, next “real” word, any of the 4 previous words, any of the 4 next words
Bigrams: w_i + w_i-1, w_i + w_i+1
TnT POS (trained on GENIA POS): POS_i, POS_i-1, POS_i+1
Character Substrings: up to a length of 6
Abbreviations: abbr_i, abbr_i-1 + abbr_i, abbr_i + abbr_i+1, abbr_i-1 + abbr_i + abbr_i+1
Word + POS: w_i + POS_i, w_i-1 + POS_i, w_i+1 + POS_i
Word Shape: shape_i, shape_i-1, shape_i+1, shape_i-1 + shape_i, shape_i + shape_i+1, shape_i-1 + shape_i + shape_i+1
Word Shape + Word: w_i-1 + shape_i, w_i+1 + shape_i
Previous NE: NE_i-1, NE_i-2 + NE_i-1, NE_i-1 + w_i
Previous NE + POS: NE_i-1 + POS_i-1 + POS_i, NE_i-2 + NE_i-1 + POS_i-2 + POS_i-1 + POS_i
Previous NE + Word Shape: NE_i-1 + shape_i, NE_i-1 + shape_i+1, NE_i-1 + shape_i-1 + shape_i, NE_i-2 + NE_i-1 + shape_i-2 + shape_i-1 + shape_i
Parentheses: Paren-Matching, a feature that signals when one parenthesis in a pair has been assigned a different tag than the other in a window of 4 words
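A few of the local features can be sketched in code. This is only an illustration: the shape mapping (uppercase to `X`, lowercase to `x`, digits to `d`, runs collapsed) is an assumption and is not necessarily the shape function used in the actual system, and the feature-string names are hypothetical.

```python
from typing import List, Set


def word_shape(word: str) -> str:
    """Assumed shape mapping: X = uppercase, x = lowercase, d = digit.

    Other characters are kept, and runs of the same class are collapsed.
    """
    shape: List[str] = []
    for ch in word:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)


def char_substrings(word: str, max_len: int = 6) -> Set[str]:
    """All character substrings of the word up to max_len characters."""
    return {
        word[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(word) - n + 1)
    }


def token_features(words: List[str], i: int) -> Set[str]:
    """Word, bigram, shape and substring features for position i."""
    prev_w = words[i - 1] if i > 0 else "<S>"
    next_w = words[i + 1] if i + 1 < len(words) else "</S>"
    feats = {
        f"w={words[i]}",
        f"w-1={prev_w}",
        f"w+1={next_w}",
        f"w+w-1={words[i]}+{prev_w}",
        f"w+w+1={words[i]}+{next_w}",
        f"shape={word_shape(words[i])}",
        f"shape-1+shape={word_shape(prev_w)}+{word_shape(words[i])}",
    }
    feats |= {f"sub={s}" for s in char_substrings(words[i])}
    return feats


print(word_shape("rLHR-t653"))  # xX-xd
print(sorted(token_features(["agent", ",", "Y-8004", "demonstrated"], 2))[:5])
```

In a maximum entropy tagger, each such string would act as a binary indicator feature conjoined with the candidate tag.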
Features – External
Gazetteers: 1,731,581 entries, adapted from Locus Link, Gene Ontology and BIOCREATIVE data
ABGENE: A transformation-based NE tagger based on gazetteers and pattern matching
GENIA: Biomedical corpus using a different tag set consisting of 37 Named Entities
Web Test: Initial tagger output submitted to the Web in patterns such as “X gene”
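One way a gazetteer could feed the tagger is as a per-token flag marking membership in a known entry. The sketch below assumes case-insensitive, greedy longest-match lookup over whitespace-tokenized entries; the function name `gazetteer_flags` and the matching strategy are illustrative assumptions, not a description of how the system actually used its gazetteers.

```python
from typing import Iterable, List


def gazetteer_flags(words: List[str], gazetteer: Iterable[str], max_len: int = 5) -> List[bool]:
    """Flag every token covered by a gazetteer entry (greedy longest match)."""
    entries = {tuple(entry.lower().split()) for entry in gazetteer}
    lowered = [w.lower() for w in words]
    flags = [False] * len(words)
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                for j in range(i, i + n):
                    flags[j] = True
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return flags


words = "the estrogen receptor ligand".split()
print(gazetteer_flags(words, {"estrogen receptor"}))
# [False, True, True, False]
```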
Postprocessing
• Discarded results with mismatched parentheses
• Different boundaries were detected when searching the sentence forwards versus backwards
• Unioned the results of both; in cases where boundary disagreements meant that one detected gene was contained in the other, we kept the shorter gene
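The postprocessing steps can be sketched as follows, assuming each pass returns token spans `(start, end)` with an exclusive end. The helper names are hypothetical, and the order of operations (parenthesis filter first, then union) is an assumption taken from the order of the bullets.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumed representation)


def parens_balanced(text: str) -> bool:
    """True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def merge_directions(forward: Iterable[Span], backward: Iterable[Span]) -> List[Span]:
    """Union both passes; if one span contains another, keep the shorter one."""
    union: Set[Span] = set(forward) | set(backward)
    kept = [
        span
        for span in union
        if not any(
            other != span and span[0] <= other[0] and other[1] <= span[1]
            for other in union
        )
    ]
    return sorted(kept)


def postprocess(words: List[str], forward: Iterable[Span], backward: Iterable[Span]) -> List[Span]:
    """Drop spans with mismatched parentheses, then merge the two passes."""
    def ok(span: Span) -> bool:
        return parens_balanced(" ".join(words[span[0]:span[1]]))

    return merge_directions(filter(ok, forward), filter(ok, backward))


words = "the LH / CG receptor ( rLHR ) was used".split()
forward = {(1, 5)}           # "LH / CG receptor"
backward = {(3, 5), (6, 8)}  # "CG receptor", "rLHR )"
print(postprocess(words, forward, backward))  # [(3, 5)]
```

In the example, "rLHR )" is discarded for its unmatched parenthesis, and "CG receptor" is kept over the longer span that contains it.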
Final System and Results
Precision Recall F-Score
Closed 0.791 0.854 0.821
Open 0.828 0.836 0.832
Preliminary Best-Closed 0.855 0.854 0.825
Preliminary Best-Open 0.863 0.836 0.832
• Trained on training+development data (10,000 sentences)
• 1,247,775 features
Performance Discrepancy
C&C Precision Recall F-Score
CoNLL-2003 84.3 85.5 84.9
BIOCREATIVE 77.6 75.9 76.8
Klein et al Precision Recall F-Score
CoNLL-2003 86.1 86.5 86.3
BIOCREATIVE 78.8 73.5 76.1
Gene Entity Pitfalls
• Language is complex
  Stably transfected human kidney 293 cells expressing the wild type rat LH / CG receptor ( rLHR ) or receptors with C-terminal tails truncated at residues 653 , 631 , or 628 (designated rLHR-t653 , rLHR-t631 , and rLHR-t628 ) were used to probe the importance of this region on the regulation of hormonal responsiveness.
• Gene names are frequently uncapitalized
  The chick axon-associated surface glycoprotein neurofascin is implicated in axonal growth and fasciculation as revealed by antibody perturbation experiments .
• Looking weird is not indicative of a gene name
  A newly synthesized anti-inflammatory agent , Y-8004 demonstrated a greater inhibition than did indomethacin ( IM ) . on inflammatory response such as ultraviolet erythema in guinea pigs , carrageenin edema , evans blue and carrageenin-induced pleuritis and acetic acid-induced peritonitis in rats .
Boundary Problems
• Gene names can be long and complex
• 37% of our false positives and 39% of false negatives were boundary problems
• Gold: chloramphenicol acetyl transferase reporter gene
  chloramphenicol acetyl transferase reporter gene deletion
• Gold: estrogen receptor
  estrogen receptor ligand
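A share of boundary errors like the one reported above could be estimated by checking which wrong spans overlap a span on the other side. The sketch below assumes token spans `(start, end)` with an exclusive end and defines a boundary problem as any non-exact overlap; that definition and the function names are assumptions, not necessarily the criterion behind the reported percentages.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive (assumed representation)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def boundary_error_rates(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float]:
    """Fraction of FPs and of FNs that overlap a span on the other side."""
    gold_set, pred_set = set(gold), set(predicted)
    false_pos = pred_set - gold_set
    false_neg = gold_set - pred_set
    fp_boundary = [p for p in false_pos if any(overlaps(p, g) for g in gold_set)]
    fn_boundary = [g for g in false_neg if any(overlaps(g, p) for p in pred_set)]
    fp_rate = len(fp_boundary) / len(false_pos) if false_pos else 0.0
    fn_rate = len(fn_boundary) / len(false_neg) if false_neg else 0.0
    return fp_rate, fn_rate


gold = {(0, 5), (8, 10), (12, 13)}
predicted = {(0, 6), (8, 10), (20, 21)}  # (0, 6) adds a trailing word such as "deletion"
print(boundary_error_rates(gold, predicted))  # (0.5, 0.5)
```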
Interannotator Agreement
• MUC-7 interannotator agreement was measured at 97% F-Score
• Demetriou and Gaizauskas: interannotator agreement for biomedical terms at 89% F-Score
• Hirschman measured interannotator agreement for gene names at 87% F-Score