Mining chemical structural information from the drug literature…
finding the right stuff
Debra L. Banville, Ph.D.
September 2006 ACS Meeting San Francisco, CA
BACKGROUND: The problem…
[Figure: Total number of GPCR publications to date (articles + patents): 9 (pre-1988), 290 (1992), >14,000 (2005)]
BACKGROUND:
Gathering: Document Retrieval (DR)
Mining: Information Extraction (IE)
…would complicate interpretation of the results. Therefore, phosphorylation of sphingosine was measured in membranes prepared from mCB1-CHO cells and mouse cerebellum. No detectable levels of S1P were formed by any of the membrane preparations (Fig. 4). In contrast, formation of S1P from sphingosine was readily detected in membranes from HEK cells transfected with SphK1 only in the presence of added ATP, suggesting that membranes from CHO cells and cerebellum do not phosphorylate sphingosine in the binding assays. These results also suggest that …
DEFINE: How can a researcher…
Filter – without losing information?
Discover new information – infusing knowledge & experience?
Manage – over time?
DEFINE: The First Step => Document Retrieval
TITLE
AUTHOR/AFFILIATION
ABSTRACT
INDEXING/KEYWORDS
DEFINE: The Second Step => Information Extraction
Mining
Gathering
290 (1992)
>14,000 (2005)
SOLUTIONS: Pharmaceutical Information Mining
Ideally, the marriage of biological and chemical information should be the ultimate focus of information mining applications.
MVCEGKRSASCPCFFLLTAKFYWILTMMQRTHSQEYAHSIRVDGDIILGGLFPVHAKGERGCGELKKEKGIHRLEAMLYAIDQINKDPDLLSNITLGVRILDTCSRDTYALEQSLTFVQALIEKDASDVKCANGDPPIFTKPDKISGVIGAAASSVSIMVANILRLFKIPIA
STAPELSDNTRYDFFSRVVPPDSYQAQAMVDIVTALGWNYVSTLASEGNYGESGVEAFTQIS
REIGGVCIAQSQKIPREPRPGEFEKIIKRLLETPNARAVIMFANEDDIRRILEAAKKLNQSGHFLWIGSDSWGSKIAPVYQQEEIAEGAVTILPK
RASIDGFDRYFRSRTLANNRRNVWFAEFWEENFGCKLGSHGKRNSHIKKCTGLERIARDSSYEQEGKVQFVIDAVYSMAYALHNMHKDLCPGYIGLCPRMSTIDGKELLGYIRAVNFNGSAGTPVTFNENGDAPGRYDIFQYQITNKSTEYKVIGHWTNQLHLKVEDMQWAHREHTHPASVCSLPCKPGERKKTVKGVPCCWHCERCEGYNYQVDELSCELCPLDQRPNMNRTGCQLIPIIKLEWHSPWAVVPVFVAILGIIATTFVIVTFVRYNDTPIVRASGRELSYVLLTGIFLCYSITFLMIAAPDTII
CSFRRVFLGLGMCFSYAALLTKTNRIHRIFEQGKKSVTAPKFISPASQLVITFSLISVQLLGVFVWFVVDPPHIIIDYGEQRTLDPEKARGVLKCDISDLSLICSLGYSILLMVTCTVYANKTRGVPETFNEAKPIGFTMYTTCIIWLAFIPIFFGTAQSAEKMYIQTTTLTVSMSLSASVSLGMLYMPKVYIIIFHPEQNVQKRKRSFKAVVTAATMQSKLIQKGNDRPNGEVKSEL
The Major Barrier to information mining
Lack of universal publication standards & structure
– Terminology
– Indexing policies
– Etc…
Terminology Barriers… How many ways can you say aspirin?
Abbreviations
Systematic
Acetylsalicylic acid
salicylic acid, acetate
2-acetyloxybenzoate
2-carboxyl phenylacetate
Common/generic
Company Codes
Tradenames
Index & Ref. Anaphors
Compound 10…
Generic & fragmented
Chemical Structures
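One common way to cope with this kind of synonymy is a lookup table that maps every known surface form to a single canonical name. The sketch below is only an illustration, not a method described in the talk: the canonical key `aspirin`, the table layout, and the function name `normalize` are assumptions, and only the names shown on the slide are included.

```python
import re

# Surface forms taken from the slide; the canonical key "aspirin" is an
# illustrative choice, not an identifier used in the talk.
SYNONYMS = {
    "aspirin": [
        "aspirin",
        "acetylsalicylic acid",
        "salicylic acid, acetate",
        "2-acetyloxybenzoate",
        "2-carboxyl phenylacetate",
    ],
}


def _key(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", name.strip().lower())


# Invert the table: normalized surface form -> canonical name.
LOOKUP = {_key(form): canon for canon, forms in SYNONYMS.items() for form in forms}


def normalize(name: str):
    """Return the canonical name for a known synonym, else None."""
    return LOOKUP.get(_key(name))
```

A table like this only covers names someone has already listed; company codes, anaphors such as "Compound 10", and drawn structures need other techniques.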
Indexing Barriers… Mapping interleukin-8 receptor to G-protein coupled receptors
EMBASE or Medline indexing?
CAS indexing?
No
Yes
Formatting & Copyright/Licensing Barriers…
Textual and/or images mixed together
Diverse document formats
Some are images only!
Access to full text restricted
Other challenges… AMBIGUITY
[Figure: four chemical structure drawings, one for each reading of the name]
methyl ethyl malonate
methylethyl malonate
methyl ethylmalonate
methylethylmalonate
Methyl + Ethyl + Malonate
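The four spellings on this slide are exactly the ways of keeping or dropping the space at each gap between the three fragments. A minimal sketch that enumerates them follows; the function name `spacing_variants` is hypothetical, and the code assumes a non-empty list of fragments.

```python
from itertools import product


def spacing_variants(fragments):
    """Return every way of writing the fragments with or without spaces between them."""
    variants = []
    # One separator choice (" " or "") for each gap between neighbouring fragments.
    for seps in product([" ", ""], repeat=len(fragments) - 1):
        name = fragments[0]
        for sep, frag in zip(seps, fragments[1:]):
            name += sep + frag
        variants.append(name)
    return variants


for variant in spacing_variants(["methyl", "ethyl", "malonate"]):
    print(variant)
```

With n fragments there are 2^(n-1) spellings, and each one can denote a different compound, so a name recognizer cannot safely treat whitespace as insignificant.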
Br J Clin Pharmacol. 1993 Dec;36(6):521-30.
Identification of human liver cytochrome P450 isoforms mediating omeprazole metabolism.
Andersson T, Miners JO, Veronese ME, Tassaneeyakul W, Tassaneeyakul W, Meyer UA, Birkett DJ.
Clinical Pharmacology, Astra Hassle AB, S-43183 Molndal, Sweden.
1 The in vitro metabolism of omeprazole was studied in human liver microsomes in order to define the metabolic pathways and identify the cytochrome P450 (CYP) isoforms responsible for the formation of the major omeprazole metabolites. 2 The four major metabolites identified in vitro, in tentative order of importance, were hydroxyomeprazole, omeprazole sulphone, 5-O-desmethylomeprazole, and an unidentified compound termed metabolite X. Omeprazole pyridone was also detected but could not be quantitated. Incubation of hydroxyomeprazole and omeprazole sulphone with human microsomes resulted in both cases in formation of the hydroxysulphone. The kinetics of formation of the four primary metabolites studied were biphasic suggesting the involvement of multiple CYP isoforms in each case. Further studies used substrate concentrations at which the high affinity activities predominated. 3 Formation of the major metabolite, hydroxyomeprazole, was significantly correlated with S-mephenytoin hydroxylase and with benzo[a]pyrene metabolism and CYP3A content. Inhibition studies with isoform selective inhibitors also indicated a dominant role of S-mephenytoin hydroxylase with some CYP3A contribution in the formation of hydroxyomeprazole. Correlation and inhibition data for the sulphone and metabolite X were consistent with a predominant role of the CYP3A subfamily in formation of these metabolites. Formation of 5-O-desmethylomeprazole was inhibited by both R, S-mephenytoin and quinidine, indicating that both S-mephenytoin hydroxylase and CYP2D6 may mediate this reaction in human liver microsomes and in vivo. 4 The Vmax/Km (indicator of intrinsic clearance in vivo) for hydroxyomeprazole was four times greater than that for omeprazole sulphone. Consistent with findings in vivo, the results predict that omeprazole clearance in vivo would be reduced in poor metabolisers of mephenytoin due to reduction in the dominant partial metabolic clearance to hydroxyomeprazole.
MeSH Terms: Analysis of Variance; Anti-Ulcer Agents/metabolism*; Cytochrome P-450 Enzyme System/chemistry; Cytochrome P-450 Enzyme System/metabolism*; Enzyme Inhibitors/metabolism*; Human; Isomerism; Liver/metabolism*; Microsomes, Liver/metabolism; Omeprazole/metabolism*; Support, Non-U.S. Gov't
Substances: Anti-Ulcer Agents; Enzyme Inhibitors; Omeprazole; Cytochrome P-450 Enzyme System
Text mining or data mining?
Unstructured data
Structured data
UNSTRUCTURED
Other Challenges… Structured & Unstructured text
Barriers in identifying the “right” set of sources & managing diverse output…
Document Retrieval
MULTIPLE SOURCES
The “right” content
Bibliographic & Indexed Citations: Medline/MeSH, Embase/EMTREE, CAS/CT, etc…
Full Text: Scirus, HighWire, Journals @ OVID, internal reports, recent USPTO, eJournals
e.g. Patent authorities – USPTO, WO, EP, JP, etc…
Lowering these barriers
Chemical reading capability:
– Recognition: phenylacetate
– Extraction: …interaction of phenylacetate with…
– Conversion to searchable form: phenylacetate => [chemical structure drawing]
– Annotation/tagging of text to entity: …interaction of phenylacetate with…
Dream or reality…
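The steps on this slide (recognition, extraction, conversion, annotation) can be sketched with a toy dictionary-based tagger. This is an illustration only, not any of the tools discussed in the talk. The dictionary, the function names, and the `<chem>` tag format are all hypothetical. The SMILES string is for phenylacetic acid, which is an assumption about what the slide's drawing shows.

```python
import re

# Hypothetical mini-dictionary: name -> searchable form (a SMILES string).
# The SMILES is for phenylacetic acid, an assumption about the slide's drawing;
# a real system would call a name-to-structure converter instead.
KNOWN_NAMES = {"phenylacetate": "OC(=O)Cc1ccccc1"}


def recognize(text):
    """Recognition + extraction: return (start, end, name) for each known name."""
    hits = []
    for name in KNOWN_NAMES:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), name))
    return sorted(hits)


def annotate(text):
    """Annotation: wrap each recognized name in a tag carrying its searchable form.

    Assumes recognized spans do not overlap.
    """
    out, last = [], 0
    for start, end, name in recognize(text):
        out.append(text[last:start])
        out.append('<chem smiles="%s">%s</chem>' % (KNOWN_NAMES[name], text[start:end]))
        last = end
    out.append(text[last:])
    return "".join(out)


print(annotate("interaction of phenylacetate with"))
```

A fixed dictionary cannot recognize names it has never seen, which is why the approaches surveyed in the following slides rely on segmentation, statistics, and name-to-structure conversion.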
Historical Solution (1958): Chemical name recognition & extraction
Eugene Garfield: Recognizing Chemical Names by Machine…
…Opler estimated that at least ten man-years would be required just to write the necessary computer programs for display any type of chemical diagram after suitable linguistic analyses… (Opler, Private communication to Garfield, 1959)… Subsequently, I turned to the possibility of calculating molecular formulas…
Essays of an Information Scientist (1984) 7, 441
Historical Solutions (30+ years later): Chemical name recognition & extraction
Standard Generalized Markup Language (SGML)
1980s: Gail Hodge et al.: extraction and conversion of names from text fields
1990s: Chowdhury & Lynch, then Kemp & Lynch
– Extraction from abstract summaries:
– Segmentation algorithms & statistics
– Chemical names
– Full text patents
2000s: Focus on full unstructured text
Other Solutions: Chemical name recognition & extraction
Commercial developments:
– Reel2’s SureChem
– MDL/Temis Reading Machine
– SimBioSys’ CLiDE
– Etc…
In-house/AZ example
SureChem Example
MDL Reading Machine
Name Candidates
Identify:
• Formal structure
• Physical properties
• Name Candidates
• Labels, Abbreviations, Anaphors
SimBioSys CLiDE™ Example
Table w/ chemical images
Chemical Image Extraction
[Figure: WO-Patent Application]
Business rationale at AZ…
AstraZeneca patent applications describing alpha-7 agonists

Patent Number    Cmpds
WO1996006098        24
WO1997030998        36
WO1999003859       118
WO2000042044        82
WO2001029034       126
WO2001036417       110
WO2001060821       261
Total              757 named compounds, including reagents & intermediates
1. 3'-Methylspiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride
2. 3-Hydroxy-1-azabicyclo[2.2.2]octane-3-acetic acid
3. (3S)-Spiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride
4. 3-Hydroxy-1-azabicyclo[2.2.1]heptane-3-acetic acid hydrazide
5. Spiro[1-azabicyclo[2.2.2]octane-3,5'-oxazolidin]-2'-one monohydrochloride
6. Etc…
• Nonstandard IUPAC name
– Spiro[1-azabicyclo[2.2.2]octane-3,2'(1'H)-furo[2,3-c]isoquinoline]
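The stated total of 757 named compounds can be cross-checked against the per-patent counts on this slide. A minimal sketch follows; the variable names are illustrative, and the figures are copied from the slide.

```python
# Compound counts per patent application, copied from the slide.
COMPOUNDS_PER_PATENT = {
    "WO1996006098": 24,
    "WO1997030998": 36,
    "WO1999003859": 118,
    "WO2000042044": 82,
    "WO2001029034": 126,
    "WO2001036417": 110,
    "WO2001060821": 261,
}

# Sum the per-patent counts to compare with the total stated on the slide.
total = sum(COMPOUNDS_PER_PATENT.values())
print(total)  # 757
```

Seven applications yielding several hundred systematic names, some of them nonstandard, is the scale at which manual name-to-structure work becomes impractical.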
Defining a vision
>14,000 (2005)
Contextual – Extracted Summaries
Undiscovered Public Knowledge
David R. Swanson
Science may be better served by a new image of its literature as…
a potential source of countless recombinant ideas
a vast mosaic of undiscovered connections…
– a world with its own endless frontier.
Reference: Swanson DR: Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association 1990 78(1): 29-37.
The End
Acknowledgments: James Rosamond, James Damewood, Bob Stumpo, Jessica Pfennig and…
many thanks to you for your kind attention!!