Transcript of "Finding the right stuff: Mining chemical structural information from the drug literature"

Page 1:

Mining chemical structural information from the drug literature… finding the right stuff

Debra L. Banville, Ph.D.

Page 2:

September 2006 ACS Meeting San Francisco, CA

BACKGROUND: The problem…

[Bar chart: Total # GPCR publications to date (articles + patents), y-axis 0 to 14,000. Pre-1988: 9; 1992: 290; 2005: >14,000.]

Page 3:

BACKGROUND:

Document Retrieval (DR): Gathering

Information Extraction (IE): Mining

…would complicate interpretation of the results. Therefore, phosphorylation of sphingosine was measured in membranes prepared from mCB1-CHO cells and mouse cerebellum. No detectable levels of S1P were formed by any of the membrane preparations (Fig. 4). In contrast, formation of S1P from sphingosine was readily detected in membranes from HEK cells transfected with SphK1 only in the presence of added ATP, suggesting that membranes from CHO cells and cerebellum do not phosphorylate sphingosine in the binding assays. These results also suggest that …

Page 4:

DEFINE: How can a researcher…

– Filter: without losing information?

– Discover new information: infusing knowledge & experience?

– Manage: over time?

Page 5:

DEFINE: The First Step => Document Retrieval

TITLE

AUTHOR/AFFILIATION

ABSTRACT

INDEXING/KEYWORDS
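A minimal sketch of what retrieval over these bibliographic fields can look like. The `Record` class, the `search` function, and both sample records are hypothetical, invented for illustration. They do not represent any particular database or search engine.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One bibliographic record with the four fields named on the slide."""
    title: str
    authors: str
    abstract: str
    keywords: list = field(default_factory=list)


def search(records, query, fields=("title", "abstract", "keywords")):
    """Return records whose chosen fields contain the query (case-insensitive)."""
    needle = query.lower()
    hits = []
    for record in records:
        for name in fields:
            value = getattr(record, name)
            text = " ".join(value) if isinstance(value, list) else value
            if needle in text.lower():
                hits.append(record)
                break  # one matching field is enough
    return hits


# Hypothetical records, invented purely for illustration.
RECORDS = [
    Record("Receptor binding study", "Doe J", "Binding assays in membranes.", ["GPCR", "binding"]),
    Record("Enzyme kinetics note", "Roe R", "Kinetics of a liver enzyme.", ["CYP3A"]),
]

gpcr_hits = search(RECORDS, "gpcr")                       # found via the keyword field
title_only = search(RECORDS, "gpcr", fields=("title",))   # missed when only titles are searched
```

In this toy data the first record is found only because an indexer supplied the keyword. Restricting the search to titles misses it, which is the dependence on indexing that the later slides discuss.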

Page 6:

DEFINE: The Second Step => Information Extraction

Gathering => Mining

1992: 290

2005: >14,000

Page 7:

SOLUTIONS: Pharmaceutical Information Mining

Ideally, the marriage of biological & chemical information needs to be the ultimate focus of information mining applications.

MVCEGKRSASCPCFFLLTAKFYWILTMMQRTHSQEYAHSIRVDGDIILGGLFPVHAKGERGCGELKKEKGIHRLEAMLYAIDQINKDPDLLSNITLGVRILDTCSRDTYALEQSLTFVQALIEKDASDVKCANGDPPIFTKPDKISGVIGAAASSVSIMVANILRLFKIPIA

STAPELSDNTRYDFFSRVVPPDSYQAQAMVDIVTALGWNYVSTLASEGNYGESGVEAFTQIS

REIGGVCIAQSQKIPREPRPGEFEKIIKRLLETPNARAVIMFANEDDIRRILEAAKKLNQSGHFLWIGSDSWGSKIAPVYQQEEIAEGAVTILPK

RASIDGFDRYFRSRTLANNRRNVWFAEFWEENFGCKLGSHGKRNSHIKKCTGLERIARDSSYEQEGKVQFVIDAVYSMAYALHNMHKDLCPGYIGLCPRMSTIDGKELLGYIRAVNFNGSAGTPVTFNENGDAPGRYDIFQYQITNKSTEYKVIGHWTNQLHLKVEDMQWAHREHTHPASVCSLPCKPGERKKTVKGVPCCWHCERCEGYNYQVDELSCELCPLDQRPNMNRTGCQLIPIIKLEWHSPWAVVPVFVAILGIIATTFVIVTFVRYNDTPIVRASGRELSYVLLTGIFLCYSITFLMIAAPDTII

CSFRRVFLGLGMCFSYAALLTKTNRIHRIFEQGKKSVTAPKFISPASQLVITFSLISVQLLGVFVWFVVDPPHIIIDYGEQRTLDPEKARGVLKCDISDLSLICSLGYSILLMVTCTVYANKTRGVPETFNEAKPIGFTMYTTCIIWLAFIPIFFGTAQSAEKMYIQTTTLTVSMSLSASVSLGMLYMPKVYIIIFHPEQNVQKRKRSFKAVVTAATMQSKLIQKGNDRPNGEVKSEL

Page 8:

The Major Barrier to information mining

Lack of universal publication standards & structure
– Terminology
– Indexing policies
– Etc…

Page 9:

Terminology Barriers… How many ways can you say aspirin…

– Abbreviations
– Systematic: Acetylsalicylic acid; salicylic acid, acetate; 2-acetyloxybenzoate; 2-carboxyl phenylacetate
– Common/generic
– Company Codes
– Tradenames
– Index & Ref. Anaphors: Compound 10…
– Generic & fragmented
– Chemical Structures
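One common way to cope with this many names for one compound is a synonym lookup that maps every variant to a single canonical key. The sketch below uses only the names given on the slide. The `SYNONYMS` table, the `normalize` helper, and the choice of "aspirin" as the canonical key are illustrative assumptions, not a description of any existing thesaurus.

```python
import re

# Canonical key -> name variants taken from the slide (table layout is an assumption).
SYNONYMS = {
    "aspirin": [
        "aspirin",
        "acetylsalicylic acid",
        "salicylic acid, acetate",
        "2-acetyloxybenzoate",
        "2-carboxyl phenylacetate",
    ],
}


def _key(name):
    """Lowercase and collapse runs of whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", name.strip().lower())


# Reverse index: normalized variant -> canonical key.
LOOKUP = {_key(variant): canonical
          for canonical, variants in SYNONYMS.items()
          for variant in variants}


def normalize(name):
    """Return the canonical key for a known name, or None if it is not in the table."""
    return LOOKUP.get(_key(name))


canonical = normalize("Acetylsalicylic  Acid")   # extra space and capitals are tolerated
unknown = normalize("Compound 10")               # anaphors need document context, not a table
```

A table like this handles spelling and spacing variants. It cannot resolve anaphors such as "Compound 10", whose meaning depends on the document in which they appear.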

Page 10:

Indexing Barriers…

Mapping interleukin-8 Receptor to G-protein coupled receptors

EMBASE or Medline indexing?

CAS indexing?

No

Yes
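Whether a search for a broad class finds a specific receptor depends on whether the index links the narrow term to the broad one. A minimal sketch of such a roll-up, assuming a toy broader-term table: the `BROADER` mapping and both helper functions are hypothetical and are not taken from the MeSH, EMTREE, or CAS vocabularies.

```python
# Hypothetical narrower-term -> broader-term links, for illustration only.
BROADER = {
    "interleukin-8 receptor": "chemokine receptor",
    "chemokine receptor": "g-protein coupled receptor",
}


def ancestors(term, broader=BROADER):
    """Walk the broader-term chain upward and return every ancestor in order."""
    chain = []
    seen = {term}
    current = term
    while current in broader:
        current = broader[current]
        if current in seen:  # guard against accidental cycles in the table
            break
        seen.add(current)
        chain.append(current)
    return chain


def maps_to(term, target, broader=BROADER):
    """True if the index links `term` (directly or indirectly) to `target`."""
    return target in ancestors(term, broader)


linked = maps_to("interleukin-8 receptor", "g-protein coupled receptor")
# An index without the intermediate link cannot make the connection.
unlinked = maps_to("interleukin-8 receptor", "g-protein coupled receptor",
                   broader={"interleukin-8 receptor": "chemokine receptor"})
```

The second call shows the barrier: with one link missing from the index, the broad-class search no longer reaches the specific receptor.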

Page 11:

Formatting & Copyright/Licensing Barriers…

– Textual and/or images mixed together
– Diverse document formats
– Some are images only!
– Access to full text restricted

Page 12:

Other challenges… AMBIGUITY

Methyl + Ethyl + Malonate

– methyl ethyl malonate [structure drawing; atom labels O, O, O, O]
– methylethyl malonate [structure drawing; atom labels O, O, O, OH]
– methyl ethylmalonate [structure drawing; atom labels O, O, O, HO]
– methylethylmalonate [structure drawing; atom labels O-, O, O, -O]
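The four names on this slide are exactly the four ways of placing or omitting a space between three fragments. A small sketch that enumerates them; the function name `spacing_variants` is an illustrative choice.

```python
from itertools import product


def spacing_variants(fragments):
    """Return every way of joining the fragments with or without a space."""
    variants = []
    # Each gap between fragments is independently a space or nothing.
    for separators in product([" ", ""], repeat=len(fragments) - 1):
        name = fragments[0]
        for separator, fragment in zip(separators, fragments[1:]):
            name += separator + fragment
        variants.append(name)
    return variants


variants = spacing_variants(["methyl", "ethyl", "malonate"])
# variants holds the four names from the slide, each drawn there as a different structure
```

With n fragments there are 2^(n-1) spacings. A name recognizer that normalizes whitespace too aggressively would therefore merge names that the slide shows as distinct compounds.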

Page 13:

Other Challenges… Structured & Unstructured text

Br J Clin Pharmacol. 1993 Dec;36(6):521-30.

Identification of human liver cytochrome P450 isoforms mediating omeprazole metabolism.

Andersson T, Miners JO, Veronese ME, Tassaneeyakul W, Tassaneeyakul W, Meyer UA, Birkett DJ.

Clinical Pharmacology, Astra Hassle AB, S-43183 Molndal, Sweden.

1 The in vitro metabolism of omeprazole was studied in human liver microsomes in order to define the metabolic pathways and identify the cytochrome P450 (CYP) isoforms responsible for the formation of the major omeprazole metabolites.

2 The four major metabolites identified in vitro, in tentative order of importance, were hydroxyomeprazole, omeprazole sulphone, 5-O-desmethylomeprazole, and an unidentified compound termed metabolite X. Omeprazole pyridone was also detected but could not be quantitated. Incubation of hydroxyomeprazole and omeprazole sulphone with human microsomes resulted in both cases in formation of the hydroxysulphone. The kinetics of formation of the four primary metabolites studied were biphasic suggesting the involvement of multiple CYP isoforms in each case. Further studies used substrate concentrations at which the high affinity activities predominated.

3 Formation of the major metabolite, hydroxyomeprazole, was significantly correlated with S-mephenytoin hydroxylase and with benzo[a]pyrene metabolism and CYP3A content. Inhibition studies with isoform selective inhibitors also indicated a dominant role of S-mephenytoin hydroxylase with some CYP3A contribution in the formation of hydroxyomeprazole. Correlation and inhibition data for the sulphone and metabolite X were consistent with a predominant role of the CYP3A subfamily in formation of these metabolites. Formation of 5-O-desmethylomeprazole was inhibited by both R, S-mephenytoin and quinidine, indicating that both S-mephenytoin hydroxylase and CYP2D6 may mediate this reaction in human liver microsomes and in vivo.

4 The Vmax/Km (indicator of intrinsic clearance in vivo) for hydroxyomeprazole was four times greater than that for omeprazole sulphone. Consistent with findings in vivo, the results predict that omeprazole clearance in vivo would be reduced in poor metabolisers of mephenytoin due to reduction in the dominant partial metabolic clearance to hydroxyomeprazole.

MeSH Terms: Analysis of Variance; Anti-Ulcer Agents/metabolism*; Cytochrome P-450 Enzyme System/chemistry; Cytochrome P-450 Enzyme System/metabolism*; Enzyme Inhibitors/metabolism*; Human; Isomerism; Liver/metabolism*; Microsomes, Liver/metabolism; Omeprazole/metabolism*; Support, Non-U.S. Gov't

Substances: Anti-Ulcer Agents; Enzyme Inhibitors; Omeprazole; Cytochrome P-450 Enzyme System

Text mining or data mining?

Unstructured data

Structured data

UNSTRUCTURED
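The indexed fields of the record on this slide are structured enough to parse mechanically, unlike the free-text abstract. A sketch that splits MeSH entries of the form `Descriptor/qualifier*`, where the trailing asterisk marks a major topic in MEDLINE records. The function name and the returned dictionary layout are illustrative assumptions.

```python
def parse_mesh_term(term):
    """Split 'Descriptor/qualifier*' into descriptor, qualifier, and major-topic flag."""
    term = term.strip()
    major = term.endswith("*")
    core = term.rstrip("*")
    descriptor, _, qualifier = core.partition("/")
    return {
        "descriptor": descriptor.strip(),
        "qualifier": qualifier.strip() or None,  # None when no qualifier is given
        "major": major,
    }


# A few entries copied from the record on the slide.
TERMS = [
    "Omeprazole/metabolism*",
    "Microsomes, Liver/metabolism",
    "Analysis of Variance",
]

parsed = [parse_mesh_term(t) for t in TERMS]
major_descriptors = [p["descriptor"] for p in parsed if p["major"]]
```

No comparably simple rule recovers the metabolite names or enzyme assignments from the abstract itself, which is where text mining rather than data mining is needed.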

Page 14:

Barriers in identifying the “right” set of sources & managing diverse output…

MULTIPLE SOURCES

Document Retrieval

The “right” content

Bibliographic & Indexed Citations: Medline/MeSH, Embase/EMTREE, CAS/CT, etc…

Full Text: Scirus, HighWire, Journals @ OVID, internal reports, recent USPTO, eJournals

e.g. Patent authorities: USPTO, WO, EP, JP, etc…

Page 15:

Lowering these barriers

Chemical reading capability:
– Recognition: phenylacetate
– Extraction: …interaction of phenylacetate with…
– Conversion to searchable form: phenylacetate => [structure drawing; atom labels O, HO]
– Annotation/tagging of text to entity: …interaction of phenylacetate with…

Dream or reality…
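The four capabilities on this slide can be illustrated with a toy dictionary-based tagger. This is a sketch under stated assumptions, not any vendor's method: the one-entry `LEXICON`, the `<chem>` tag format, and the `annotate` function are hypothetical. The SMILES string is assumed to be the acid form suggested by the O/HO labels in the slide's drawing.

```python
import re

# Hypothetical name -> structure dictionary. The SMILES (phenylacetic acid) is an assumption.
LEXICON = {"phenylacetate": "OC(=O)Cc1ccccc1"}


def annotate(text, lexicon=LEXICON):
    """Recognize known names, extract them, convert to SMILES, and tag the text."""
    names = sorted(lexicon, key=len, reverse=True)  # prefer the longest match
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
                         re.IGNORECASE)
    entities = []

    def tag(match):
        name = match.group(1)
        smiles = lexicon[name.lower()]           # conversion to a searchable form
        entities.append({"name": name,           # extraction, with character offsets
                         "start": match.start(),
                         "end": match.end(),
                         "smiles": smiles})
        return f'<chem smiles="{smiles}">{name}</chem>'  # annotation of the text

    tagged = pattern.sub(tag, text)
    return tagged, entities


sentence = "interaction of phenylacetate with"
tagged, entities = annotate(sentence)
```

A fixed dictionary only finds names it already contains. Systematic names that have never been seen before need name-to-structure conversion, which is the harder problem the following slides trace historically.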

Page 16:

Historical Solution (1958): Chemical name recognition & extraction

Eugene Garfield: Recognizing Chemical Names by Machine…

…Opler estimated that at least ten man-years would be required just to write the necessary computer programs for display any type of chemical diagram after suitable linguistic analyses…(Opler, Private communication to Garfield, 1959)….Subsequently, I turned to the possibility of calculating molecular formulas…

Essays of an Information Scientist (1984) 7, 441

Page 17:

Historical Solutions (30+ years later): Chemical name recognition & extraction

Standard Generalized Markup Language (SGML)

1980s: Gail Hodge et al.: extraction and conversion of names from text fields

1990s: Chowdhury & Lynch, then Kemp & Lynch
– Extraction from abstract summaries
– Segmentation algorithms & statistics
– Chemical names
– Full text patents

2000s: Focus on full unstructured text

Page 18:

Other Solutions: Chemical name recognition & extraction

Commercial developments:
– Reel2’s SureChem
– MDL/Temis Reading Machine
– SimBioSys’ CLiDE
– Etc…

In-house/AZ example

Page 19:

SureChem Example

Page 20:

MDL Reading Machine: Name Candidates

Identify:

• Formal structure
• Physical properties
• Name Candidates
• Labels, Abbreviations, Anaphors

Page 21:

SimBioSys CLiDE™ Example

Table w/ chemical images

Page 22:

Chemical Image Extraction

[Figure: WO-Patent Application]

Page 23:

Business rationale at AZ…

AstraZeneca patent applications describing alpha‑7 agonists

Patent Number   Cmpds
WO1996006098    024
WO1997030998    036
WO1999003859    118
WO2000042044    082
WO2001029034    126
WO2001036417    110
WO2001060821    261

Totals 757 named compounds including reagents & intermediates

1. 3'-Methylspiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride

2. 3-Hydroxy-1-azabicyclo[2.2.2]octane-3-acetic acid

3. (3S)-Spiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride

4. 3-Hydroxy-1-azabicyclo[2.2.1]heptane-3-acetic acid hydrazide

5. Spiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride

6. Etc..

• Nonstandard IUPAC name

– Spiro[1-azabicyclo[2.2.2]octane-3,2'(1'H)-furo[2,3-c]isoquinoline]
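The per-patent compound counts on this slide can be pulled out of plain text with a pattern and checked against the stated total of 757. A short sketch: the `LISTING` string reproduces the slide's numbers, while the regular expression and variable names are illustrative choices.

```python
import re

# Patent numbers and compound counts as listed on the slide.
LISTING = """
WO1996006098 024
WO1997030998 036
WO1999003859 118
WO2000042044 082
WO2001029034 126
WO2001036417 110
WO2001060821 261
"""

# "WO" followed by ten digits, whitespace, then the compound count.
ROW = re.compile(r"(WO\d{10})\s+(\d+)")

counts = {number: int(count) for number, count in ROW.findall(LISTING)}
total = sum(counts.values())
```

Here `total` comes to 757, matching the slide. This is the number of chemical names that would otherwise have to be read out of seven patent applications by hand.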

Page 24:

Defining a vision

2005: >14,000

Contextual: Extracted Summaries

Page 25:

Undiscovered Public Knowledge

David R. Swanson

Science may be better served by a new image of its literature as a…

vast mosaic of undiscovered connections…

– a world with its own endless frontier.

a potential source of countless recombinant ideas

Reference: Swanson DR: Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association 1990 78(1): 29-37.

Page 26:

The End

Acknowledgments: James Rosamond, James Damewood, Bob Stumpo, Jessica Pfennig and…

many thanks to you for your kind attention!!