Tackling the difficult areas of chemical entity extraction

ACS National Meeting, Indianapolis, USA, 8th September 2013. Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities. Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK


Page 1: Tackling the difficult areas of chemical entity extraction

ACS National Meeting, Indianapolis, USA 8th September 2013

Tackling the difficult areas of chemical entity extraction:

Misspelt chemical names and unconventional entities

Daniel Lowe and Roger Sayle

NextMove Software

Cambridge, UK

Page 2: Tackling the difficult areas of chemical entity extraction

Text mining is big business

2013 Bio-IT World Best Practices winner

Page 3: Tackling the difficult areas of chemical entity extraction

Approaches to Entity recognition

• Dictionary based

• Grammar based

• Machine Learning

LeadMine

Page 4: Tackling the difficult areas of chemical entity extraction

Approaches to Entity recognition

• Dictionary based approaches are ideal for relating entities to concepts but only recognise a finite number of terms

– Will not recognise novel compound names

• Hence for chemistry, dictionary approaches need to be used in conjunction with another method

Page 5: Tackling the difficult areas of chemical entity extraction

Advantages of grammars

• Don’t require annotated corpora

• Encode knowledge about the domain

• Very fast recognition

• Allow spelling correction if an entity is a near match to one recognised by the grammar

Page 6: Tackling the difficult areas of chemical entity extraction

Simple grammar Example

Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Digit : Digit1to9 | '0'

Cid : 'CID:' Digit1to9 Digit*

[Figure: state machine for Cid: C → I → D → : → 1..9, followed by a loop on 0..9]
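As a rough illustration, the Cid rule can be written out by hand as a small state machine. This is only a sketch in Python; the function name `match_cid` and the state numbering are assumptions made for the example, not how LeadMine compiles its grammars.

```python
DIGITS_1_TO_9 = set("123456789")
DIGITS = DIGITS_1_TO_9 | {"0"}
PREFIX = "CID:"


def match_cid(text, start=0):
    """Return the end offset of a Cid match beginning at `start`, or None.

    States 0..3 consume the literal 'CID:', state 4 expects Digit1to9,
    and state 5 is the accepting state that loops on Digit.
    """
    state = 0
    end = None
    pos = start
    while pos < len(text):
        ch = text[pos]
        if state < 4:
            if ch != PREFIX[state]:
                break
            state += 1
        elif state == 4:
            if ch not in DIGITS_1_TO_9:  # no leading zero allowed
                break
            state = 5
            end = pos + 1
        else:
            if ch not in DIGITS:
                break
            end = pos + 1
        pos += 1
    return end
```

For example, `match_cid("see CID:702 here", 4)` returns the offset just past the final digit, while `match_cid("CID:0123")` returns `None` because the first digit must be 1 to 9.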

Page 7: Tackling the difficult areas of chemical entity extraction

Grammar for IUPAC names

• Grammar for complete molecules: 485 rules

– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...

– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...

• Generally aims to match a superset of the nomenclature covered by IUPAC

• Specifically, this is the superset that can theoretically be converted to structures

Page 8: Tackling the difficult areas of chemical entity extraction

State machine size

[Figure: states required (0 to 14,000,000) plotted against recall on names from the MayBridge catalogue (0.82 to 1)]

Page 9: Tackling the difficult areas of chemical entity extraction

Two Level State Machines

• Breaks the problem into two levels: one state machine that keeps track of when concepts have to be matched, and one state machine per concept that matches it, e.g. an acyclic group

– Avoids duplication of states to match the same concept in slightly different contexts

– Slower, as several concepts that start with the same characters may be possible at the same point
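A toy sketch of the two-level idea follows, in Python. The top-level table only records which concept is expected next; each concept is matched by its own recogniser, here simplified to a tuple of literal strings. All names (`CONCEPTS`, `TOP_LEVEL`, `matches`) and the tiny vocabulary are hypothetical and chosen only for illustration.

```python
# Second level: one recogniser per concept (literal alternatives stand in
# for what would really be a state machine per concept).
CONCEPTS = {
    "substituent": ("methyl", "ethyl", "chloro", "bromo"),
    "parent": ("methane", "ethane", "benzene"),
}

# First level: state -> (concept to match, next state).
# State 0 accepts any number of substituents followed by one parent.
TOP_LEVEL = {
    0: (("substituent", 0), ("parent", 1)),
    1: (),
}
ACCEPTING = {1}


def matches(text, pos=0, state=0):
    """Return True if text[pos:] is accepted starting from `state`."""
    if pos == len(text):
        return state in ACCEPTING
    for concept, next_state in TOP_LEVEL[state]:
        for token in CONCEPTS[concept]:
            # Several concepts may share leading characters ("ethyl" and
            # "ethane"), so each candidate is tried in turn.
            if text.startswith(token, pos) and matches(
                text, pos + len(token), next_state
            ):
                return True
    return False
```

The substituent recogniser is defined once and reused wherever the top level asks for it, which is where the saving in states comes from. The cost is visible too: at "eth..." both the substituent and the parent recogniser have to be tried.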

Page 10: Tackling the difficult areas of chemical entity extraction

State machine revisited

[Figure: states required (0 to 14,000,000) plotted against recall on names from the MayBridge catalogue (0.82 to 1)]

Page 11: Tackling the difficult areas of chemical entity extraction

Grammar inheritance

• Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar

– Inherit rules rather than duplicate them

– Allow overriding of rules

pluralisedChemical : chemical 's'

elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
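One way to picture rule inheritance is as a dictionary of rules where a derived grammar copies the base rules and overrides some of them. The sketch below is in Python and assumes that a leading underscore, as in `_elementaryMetalAtom`, refers to the inherited definition of the rule being overridden. The base rule contents and the function name `inherit` are hypothetical.

```python
# Hypothetical fragment of a base (molecule) grammar: rule -> alternatives.
BASE_GRAMMAR = {
    "elementaryMetalAtom": ["'lithium'", "'sodium'"],
    "chemical": ["elementaryMetalAtom"],
}


def inherit(base, overrides):
    """Build a derived grammar from `base`, adding or overriding rules.

    Inside an override of rule X, the alternative '_X' is replaced by the
    alternatives X had in the base grammar (assumed semantics).
    """
    derived = {name: list(alts) for name, alts in base.items()}
    for name, alternatives in overrides.items():
        expanded = []
        for alt in alternatives:
            if alt == "_" + name:
                expanded.extend(base[name])  # splice in the inherited rule
            else:
                expanded.append(alt)
        derived[name] = expanded
    return derived


GENERIC_GRAMMAR = inherit(
    BASE_GRAMMAR,
    {
        "pluralisedChemical": ["chemical 's'"],
        "elementaryMetalAtom": [
            "'lanthanide'",
            "'lanthanoid'",
            "'transition metal'",
            "'transuranic element'",
            "_elementaryMetalAtom",
        ],
    },
)
```

Rules that are not overridden, such as `chemical` here, are shared unchanged, so a fix to the base grammar carries over to every grammar derived from it.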

Page 12: Tackling the difficult areas of chemical entity extraction

Unconventional entities #1

• Formulae:

– Sum formulae

• C20H25NO6

– Line formulae

• CH3CH2CH2Cl (complete molecule)

• CH2CH2 (linker)

• CH3CH2 (substituent)

– Salts

• MgSO4
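As a minimal sketch of what recognising a formula involves, the Python below tokenises a formula into element symbols and counts. It is an illustrative stand-in, not the grammar used in the talk: the name `parse_sum_formula` is hypothetical, and the pattern accepts any capital letter plus optional lowercase letter without checking that it is a real element symbol.

```python
import re

# One element symbol followed by an optional count, e.g. "C20", "N", "Cl".
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_sum_formula(formula):
    """Return {symbol: count} for a formula, or None if it does not tokenise."""
    counts = {}
    pos = 0
    while pos < len(formula):
        match = ELEMENT_COUNT.match(formula, pos)
        if match is None:
            return None
        symbol, digits = match.groups()
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
        pos = match.end()
    return counts or None
```

The same tokeniser handles the line formula and salt examples on this slide, although telling a complete molecule from a linker or substituent needs valence information that this sketch does not model.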

Page 13: Tackling the difficult areas of chemical entity extraction

Unconventional entities #2

• Peptide formulae

– Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2

• Oligosaccharides

– α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol

• Oligonucleotides

– 3'-AATG-5'

Page 14: Tackling the difficult areas of chemical entity extraction

Unconventional entities #3

• Patent numbers

– U.S. Pat. No. 6,677,355

• Journal references

– (1974) J. Biol. Chem. 249, 4250-4256

• CAS numbers

– 90-13-1

• InChI and SMILES
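CAS numbers are a case where a pattern alone over-matches, because many digit groups look like one. The final digit is a check digit: the other digits are weighted 1, 2, 3, ... from right to left, summed, and taken modulo 10. A short Python sketch (the function name `is_valid_cas` is chosen for this example):

```python
import re

# Two to seven digits, two digits, then the single check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(candidate):
    """Return True if `candidate` has CAS number shape and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))
```

For the slide's example, 90-13-1, the weighted sum is 3×1 + 1×2 + 0×3 + 9×4 = 41, giving the check digit 1.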

Page 15: Tackling the difficult areas of chemical entity extraction

navigating

Page 16: Tackling the difficult areas of chemical entity extraction

Fast spelling correction

• Historically we have used Levenshtein-like distance measures (all possible corrections)

• Only use spelling correction when recognition fails

• Allow a certain level of “look behind”

– 13 characters empirically found to yield identical results

– Speeds up spelling correction ~80%

• Dictionary of common English words can be used to prevent attempting spelling correction
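The control flow described above can be sketched in Python as follows. This is a simplified stand-in: it corrects against a small set of known names using plain Levenshtein distance, whereas the talk describes correction against a grammar with a limited look behind, which is not modelled here. The names `recognise`, `known_names` and `english_words` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def recognise(token, known_names, english_words, max_distance=1):
    """Return the recognised (possibly corrected) name, or None."""
    if token in known_names:
        return token  # recognition succeeded, so no correction is attempted
    if token.lower() in english_words:
        return None   # common English word: do not try to correct it
    scored = [(levenshtein(token, name), name) for name in known_names]
    close = [pair for pair in scored if pair[0] <= max_distance]
    return min(close)[1] if close else None
```

The English word check matters for pairs such as "either" and "ether", which are a single edit apart.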

Page 17: Tackling the difficult areas of chemical entity extraction

Words Ignored for spelling correction (gray)

Page 18: Tackling the difficult areas of chemical entity extraction

Exceptions to local errors

• Whether a space is allowed may only be decidable once the suffix of a chemical name is encountered

propyl bromochloromethanol

propylbromochloromethanol

propyl bromochloromethanoate

19 character look behind required!

Page 19: Tackling the difficult areas of chemical entity extraction

BioCreative IV

• CHEMDNER (Chemical compound and drug name recognition task)

• 10000 annotated PubMed abstracts (3500 for training, 3500 for development and 3000 for testing)

• Deadline for submission: This Thursday

Page 20: Tackling the difficult areas of chemical entity extraction

Typical annotated Abstract

Page 21: Tackling the difficult areas of chemical entity extraction

Dictionaries… bigger is better

• For high recall of trivial names, dictionaries with high coverage are required.

• The largest publicly available dictionary is PubChem, with over 94 million terms

• However most of these terms are either not useful or actually detrimental to text mining

Page 22: Tackling the difficult areas of chemical entity extraction

Aggressive filtering

• “what you don't see won't hurt you”

• Hence remove terms that are also English words or start with an English word

– Accomplished using a large English dictionary with chemistry terms removed

• Remove internal identifiers used by depositors

• Remove terms that are matched by our grammars

• Ultimate result: 94 million → less than 3 million
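The filtering steps can be pictured as a single predicate applied to each dictionary term. The Python below is a sketch under stated assumptions: the shape of a depositor identifier (`DEPOSITOR_ID`) is a guess made for illustration, `grammar_matches` is a caller-supplied stand-in for the real grammars, and splitting on whitespace or a hyphen is just one possible reading of "starts with an English word".

```python
import re

# Assumed shape of a depositor's internal identifier: capital letters
# followed by a run of digits. Real identifiers vary widely.
DEPOSITOR_ID = re.compile(r"^[A-Z]{2,10}[- ]?\d{3,}$")


def keep_term(term, english_words, grammar_matches):
    """Return True if `term` should stay in the filtered dictionary."""
    lowered = term.lower()
    first_word = re.split(r"[\s-]", lowered, maxsplit=1)[0]
    if lowered in english_words or first_word in english_words:
        return False  # is, or starts with, an English word
    if DEPOSITOR_ID.match(term):
        return False  # internal identifier rather than a name
    if grammar_matches(term):
        return False  # already recognised by a grammar
    return True
```

A usage example with toy inputs: with `english_words = {"and", "lead"}` and a grammar stand-in that accepts names ending in "ane", the terms "and", "lead acetate", "ZINC00001234" and "2-methylpropane" are all dropped, while "aspirin" is kept.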

Page 23: Tackling the difficult areas of chemical entity extraction

Structure Aware filtering

• “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”

• About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria

Page 24: Tackling the difficult areas of chemical entity extraction

Entity Extension

• Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits

– α-santalol can be recognised from santalol in the dictionary

• Extension is bracketing aware and blocked by English words

• Entity trimming is also performed to comply with the annotation guidelines

– ‘Allura Red AC dye’ → ‘Allura Red AC’

Page 25: Tackling the difficult areas of chemical entity extraction

Entity Merging

• Adjacent entities may actually be a single entity

– Ethyl ester → one entity

– (+)-limonene epoxide → one entity

BUT

– Hexane-benzene → two entities

Page 26: Tackling the difficult areas of chemical entity extraction

Using an ontology to determine when terms add information

• Genistein isoflavone → two entities

• Glycine ester → one entity

[Figure: genistein showing isoflavone core structure]
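One reading of this slide is that two adjacent terms are kept separate when the second is already a class of the first in the ontology, because merging would add no information. A Python sketch of that decision follows. The `IS_A` table is a hypothetical toy ontology written for this example, not data from ChEBI or any other real resource.

```python
# Hypothetical toy ontology: term -> direct parent classes.
IS_A = {
    "genistein": {"isoflavone"},
    "isoflavone": {"flavonoid"},
    "glycine": {"amino acid"},
}


def ancestors(term):
    """Return every class reachable from `term` via is-a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def should_merge(first, second):
    """Merge adjacent terms only if `second` is not already implied by `first`."""
    return second.lower() not in ancestors(first.lower())
```

With this table, `should_merge("Genistein", "isoflavone")` is `False` (genistein is already an isoflavone, so the terms stay as two entities), while `should_merge("Glycine", "ester")` is `True` (the ester is a different compound, so the terms form one entity).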

Page 27: Tackling the difficult areas of chemical entity extraction

Abbreviation detection

• Based on the Schwartz and Hearst algorithm

• Detects abbreviations of the following forms:

– Tetrahydrofuran (THF)

– THF (tetrahydrofuran)

– Tetrahydrofuran (THF;

– (tetrahydrofuran, THF)

– THF = tetrahydrofuran

Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
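The core of the Schwartz and Hearst approach is a right-to-left character match between the short form and the text preceding it. The Python below is a simplified sketch covering only the "long form (SHORT)" pattern from the list above; the other forms would need their own candidate extraction. The function names and the regular expression for short forms are assumptions made for this example.

```python
import re

# Assumed shape of a parenthesised short form: 2 to 10 characters.
PAREN_SHORT = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def best_long_form(short, candidate):
    """Match `short` against `candidate` right to left; return the long form or None."""
    s_idx = len(short) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short[s_idx].lower()
        if ch.isalnum():
            # Move left until the character matches. The first character of
            # the short form must additionally sit at the start of a word.
            while l_idx >= 0 and (
                candidate[l_idx].lower() != ch
                or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
            ):
                l_idx -= 1
            if l_idx < 0:
                return None
            l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_abbreviations(sentence):
    """Return {short form: long form} for 'long form (SHORT)' patterns."""
    pairs = {}
    for match in PAREN_SHORT.finditer(sentence):
        short = match.group(1)
        words = sentence[: match.start()].split()
        # Only look back a bounded number of words for the long form.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs[short] = long_form
    return pairs
```

For the sentence "The solid was dissolved in tetrahydrofuran (THF) and stirred." this yields the pair THF and tetrahydrofuran.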

Page 28: Tackling the difficult areas of chemical entity extraction

Anti-abbreviation detection

• Finds entities detected as abbreviations of unrecognised entities

– Can mean a common chemical abbreviation has been redefined in the scope of the document

current good manufacturing practice (cGMP)

cGMP = Cyclic guanosine monophosphate

Page 29: Tackling the difficult areas of chemical entity extraction

Grammars used

• Systematic molecule

• Systematic prefix

• Systematic generic name

• Registry number

• CAS number

• Chemical formulae

• Systematic polymer

• Semi systematic chemical name

– Systematic prefix + common trivial name/name from PubChem

Page 30: Tackling the difficult areas of chemical entity extraction

Dictionaries used

• Noise words e.g. lead

• Trivial polymer

• Generic chemical terms (some from ChEBI)

• Common abbreviations

• Common trivial names

• Filtered PubChem

• Alloys

• Allotropes

• Minerals

Page 31: Tackling the difficult areas of chemical entity extraction

Making the most of the knowledge provided

• Use training data to identify terms that are not currently recognised (a whitelist)

• Identify terms that are often false positives (a blacklist)

• Each false positive and false negative is placed into such a list if its inclusion increases the F-score (harmonic mean of precision and recall)
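The selection rule can be sketched in Python for the blacklist case; a whitelist would work the same way with the counts moving in the opposite direction. The greedy term-by-term order and the names `f_score` and `build_blacklist` are assumptions for illustration, and the counts in the usage example are invented.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def build_blacklist(tp, fp, fn, term_counts):
    """Greedily blacklist terms whose removal raises the F-score.

    term_counts maps a term to (true positive hits, false positive hits)
    observed on the training data.
    """
    blacklist = []
    for term, (term_tp, term_fp) in sorted(term_counts.items()):
        # Blacklisting removes the term's false positives, but its true
        # positives are no longer found and become false negatives.
        new_counts = (tp - term_tp, fp - term_fp, fn + term_tp)
        if f_score(*new_counts) > f_score(tp, fp, fn):
            blacklist.append(term)
            tp, fp, fn = new_counts
    return blacklist
```

For example, starting from 80 true positives, 20 false positives and 20 false negatives, a term that is right once and wrong ten times is blacklisted, while a term that is right ten times and wrong once is not.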

Page 32: Tackling the difficult areas of chemical entity extraction

Results (on development set)

Configuration          Precision  Recall  F-score
Baseline               0.87       0.82    0.84
WhiteList              0.86       0.85    0.85
BlackList              0.88       0.80    0.84
WhiteList + BlackList  0.87       0.83    0.85

Page 33: Tackling the difficult areas of chemical entity extraction

Future work

• Typically we are focused on generating structures from the entities we recognise

– Line formula parsing

– Generic chemical name parsing (difficult to do in a way that the results are not tied to a particular toolkit)

• Grammars serve as an excellent starting point for writing parsers

Page 34: Tackling the difficult areas of chemical entity extraction

Conclusions

• Two level state machines allow many complicated grammars to be represented by far fewer states

• Backtracking spelling correction can provide significant speed improvements without affecting recall

• Check out our blog (nextmovesoftware.co.uk/blog) in a couple of weeks to find out how we did in BioCreative!

Page 35: Tackling the difficult areas of chemical entity extraction

[email protected]

Tackling the difficult areas of chemical entity extraction:

Misspelt chemical names and unconventional entities

Thank you for your attention