Page 1:

ChemAxon UGM, Budapest

20/05/2015

SureChEMBL: Open Patent Data

George Papadatos, PhD

ChEMBL Group, EMBL-EBI

georgep@ebi.ac.uk

Page 2:

EMBL-EBI Resources

• Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, 1000 Genomes, European Genome-phenome Archive, Metagenomics portal

• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, PRIDE

• Protein sequences, families & motifs: InterPro, Pfam, UniProt

• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

• Chemical biology: ChEMBL, ChEBI

• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights

• Systems: BioModels, Enzyme Portal, BioSamples

• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

Page 3:

Bioactivity data

Compound

Assay/Target

>Thrombin

MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE

RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT

NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT

TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT

THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY

CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF

EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR

WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR

ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA

NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG

PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

3. Insight, tools and resources for translational drug discovery

2. Organization, integration, curation and standardization of pharmacology data

1. Scientific facts

Ki = 4.5nM

APTT = 11 min.

ChEMBL: Data for drug discovery

Page 4:

Why look at patent documents?

• Patent filing and searching

  • Legal, financial and commercial incentives & interests

  • Prior art, novelty, freedom to operate searches

  • Competitive intelligence

• Unprecedented wealth of knowledge

  • Most of this knowledge will never be disclosed anywhere else

  • Average lag of 2-3 years between patent document and journal publication disclosure for chemistry

Page 5:

From SureChem to SureChEMBL

• Digital Science/Macmillan donated SureChem to EMBL-EBI

• SureChem: commercial patent chemistry mining product

• Wellcome Trust funds further development

• EMBL-EBI provides an on-going, live service

• Full functionality freely available to everyone

• Query, view and export chemistry from patents

• Complemented with biological annotations

Page 6:

SureChEMBL data processing

[Diagram: SureChEMBL data processing pipeline. Labels:]

• Patent offices: WO; EP applications & granted; US applications & granted; JP abstracts

• Patent PDFs (service)

• Processed patents (service)

• SureChem IP

• OCR

• Entity Recognition

• Name to Structure (five methods)

• Image to Structure (one method)

• Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

• Chemistry Database

• SureChEMBL System; Database; API; Application; Users


Page 8:

Homepage (www.surechembl.org)

• Help

• Search by keyword and meta-data

• Search by chemical structure (sketch compound)

• Search by SMILES, MOL, SMARTS, name

• Search by patent number

• Filter by authority (US, EP, WO and JP)

• Filter by document section (title, claims, abstract, description and images)

• Chemical search type (substructure, similarity, identical)

• Filter by date

• Filter by MW

Page 12:

Data growth

• ~80K novel compounds every month

• ~800K novel compounds since EBI took over

• 2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL

[Figure: Cumulative growth of SureChEMBL compounds (compound count vs. time)]
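The growth figures above can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers quoted on the slides, including the 16M total from the chemistry resources slide. The linear projection assumes a constant monthly rate, which is an illustration and not a claim made in the talk.

```python
# Figures quoted on the slides (approximate, as presented).
NOVEL_PER_MONTH = 80_000        # "~80K novel compounds every month"
NOVEL_SINCE_TAKEOVER = 800_000  # "~800K novel compounds since EBI took over"
TOTAL_COMPOUNDS = 16_000_000    # "16M" from the chemistry resources slide

# How many months of growth at the quoted rate the ~800K figure corresponds to.
months_implied = NOVEL_SINCE_TAKEOVER / NOVEL_PER_MONTH

# Fraction of the whole collection added since the takeover.
share_since_takeover = NOVEL_SINCE_TAKEOVER / TOTAL_COMPOUNDS


def project_total(months_ahead, total=TOTAL_COMPOUNDS, per_month=NOVEL_PER_MONTH):
    """Naive linear projection; assumes the monthly rate stays constant."""
    return total + per_month * months_ahead


print(f"months of growth implied: {months_implied:.0f}")
print(f"share added since takeover: {share_since_takeover:.1%}")
print(f"projected total in 12 months: {project_total(12):,}")
```

With these rounded inputs, the two growth bullets are consistent with roughly ten months of operation, and the new compounds make up about 5% of the collection.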

Page 13:

EMBL-EBI chemistry resources

RDF and REST API interfaces

• Atlas: ligand induced transcript response (750)

• PDBe: ligand structures from protein complexes (15K)

• ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)

• SureChEMBL: chemical structures from patent literature (16M)

• ChEMBL: bioactivity data from literature and depositions (1.5M)

• 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, …. (~65M)

UniChem – InChI-based chemical resolver (full + relaxed ‘lenses’) >90M

REST API Interface - https://www.ebi.ac.uk/unichem/

Page 14:

Data access & exports

• Full compound repository

  • FTP download, SDF and CSV format

  • Updates quarterly

• Full compound-patent map

  • FTP download, flat file

  • Updates quarterly

• Data feed client

  • Creates a local replica database of SureChEMBL

  • Updates daily

Page 15:

Compound-patent map

• Flat file with the following fields:

  • Compound, global frequency, document, section, section frequency, publication date

• Back file

  • 187,958,584 unique patent-compound pairs

  • 14,076,090 unique compound IDs

  • 3,585,233 EP, JP, WO and US patent docs

  • 1960-2014

• Quarterly incremental updates

  • Q1 2015 is also now available on the FTP site

http://chembl.blogspot.co.uk/2015/03/the-surechembl-map-file-is-out.html
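A minimal sketch of how such a flat file might be read, assuming it is tab-separated, has no header row, and lists the fields in the order given on the slide. None of these layout details is stated in the talk, so check the actual file before relying on them. The `MapRow` class, the `read_map` helper and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MapRow:
    compound_id: str        # compound identifier (format assumed)
    global_frequency: int   # meaning assumed: corpus-wide count for the compound
    document: str           # patent publication number
    section: str            # e.g. title, abstract, claims, description, images
    section_frequency: int  # meaning assumed: occurrences within that section
    publication_date: str   # kept as text; the date format is not stated


def read_map(handle, delimiter="\t"):
    """Yield MapRow records from an open text handle (layout is an assumption)."""
    for fields in csv.reader(handle, delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        cid, gfreq, doc, section, sfreq, date = fields
        yield MapRow(cid, int(gfreq), doc, section, int(sfreq), date)


# Made-up sample rows, for illustration only.
SAMPLE = (
    "SCHEMBL1\t2\tUS-0000001-A1\tclaims\t3\t2014-01-02\n"
    "SCHEMBL1\t2\tEP-0000002-A1\tdescription\t1\t2014-02-03\n"
    "SCHEMBL2\t1\tUS-0000001-A1\tdescription\t5\t2014-01-02\n"
)

rows = list(read_map(io.StringIO(SAMPLE)))
pairs = {(r.document, r.compound_id) for r in rows}
compounds_per_doc = Counter(doc for doc, _ in pairs)

# Back-file figures quoted on the slide.
PAIRS, COMPOUNDS, DOCS = 187_958_584, 14_076_090, 3_585_233
pairs_per_doc = PAIRS / DOCS
pairs_per_compound = PAIRS / COMPOUNDS
print(f"{pairs_per_doc:.1f} pairs per document, "
      f"{pairs_per_compound:.1f} pairs per compound")
```

The quoted back-file totals work out to roughly 52 compound mentions per patent document and about 13 documents per compound on average.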

Page 16:

Data feed client

http://vartree.blogspot.co.uk/2015/01/how-to-create-your-own-replica-of.html

Page 17:

Use cases with SureChEMBL

• Chemoinformatics

  • Chemistry landscape for a particular biological target/disease

  • Novel chemistry & scaffolds

  • MDS, MCS and R-group analysis of the chemistry claimed in a particular patent family

  • (Negative) novelty checking with UniChem

• Competitive intelligence

  • Reporting

  • Patent alerts

  • Per target/disease/company
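One reading of "(negative) novelty checking" is that a hit in a reference collection shows a compound is already disclosed, while a miss does not prove novelty. The sketch below illustrates that logic with plain set membership on standard InChIKeys. It does not call UniChem. The function name and the placeholder keys are hypothetical, and in practice the known set would come from a UniChem lookup or a local export.

```python
def negative_novelty_check(candidates, known):
    """Split candidate InChIKeys into (already_known, not_found).

    A key in `already_known` is demonstrably not novel with respect to
    the reference set. A key in `not_found` is only "not yet seen":
    absence from the set is not proof of novelty.
    """
    known_set = set(known)
    already_known = [key for key in candidates if key in known_set]
    not_found = [key for key in candidates if key not in known_set]
    return already_known, not_found


# Placeholder keys (made up, not real compounds).
reference_keys = {
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
}
my_compounds = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
]

hits, misses = negative_novelty_check(my_compounds, reference_keys)
print("already disclosed:", hits)
print("not found in reference set:", misses)
```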

Page 18:

Bioactivity data extraction?

Compounds

Target/Assay

Bioactivity

Page 19:

Markush structure extraction?

-alkyl

-aryl

-heteroaryl

-heterocyclyl

-cycloalkyl

….

Page 20:

Biological annotations

Bioannotations will soon be integrated into the SureChEMBL interface, using SciBite’s Termite text mining engine

Page 21:

US-9012636-B2

Page 22:

Future steps

• OpenPHACTS ENSO

  • Biological tagging of targets, genes, indications and diseases

  • Development of integrated use-cases

  • Combine chemistry & biology from patents, literature, pathways, etc.

• OpenPHACTS API

  • Accessible via KNIME nodes

• Further improvements/added value

  • Data quality and accuracy

  • Target and compound relevance score

Page 23:

Acknowledgements

ChEMBL team:

• John Overington

• Anne Hersey

• Anna Gaulton

• Mark Davies

• Nathan Dedman

• Michal Nowotka

Collaborators:

• James Siddle

• Richard Koks

• Lee Harland

• Kevin Clark

Support:

[email protected]

Webinar:

http://www.ebi.ac.uk/training/online/course/surechembl-accessing-chemical-patent-data-webinar

Page 24:

Technology partners


Page 26:

Back-up slides

Page 27:

ChEMBL-SureChEMBL compound overlap

• Connectivity match on single components - UniChem

[Venn diagram: ChEMBL (1.5M) and SureChEMBL (16M); overlap 21.4%]
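A connectivity-level comparison of this kind can be sketched using the first 14-character block of the standard InChIKey, which encodes the molecular skeleton and ignores stereochemistry and isotopes. The sketch assumes the inputs are already single-component InChIKeys. It is not how UniChem implements the match, and the function names and placeholder keys are hypothetical. The slide does not say which set the 21.4% is measured against, so the denominator here is simply whichever set is passed first.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first (14-character) block of a standard InChIKey."""
    block = inchikey.strip().upper().split("-", 1)[0]
    if len(block) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block


def overlap_fraction(reference, other):
    """Fraction of distinct connectivity blocks in `reference` also in `other`."""
    ref_blocks = {connectivity_block(key) for key in reference}
    other_blocks = {connectivity_block(key) for key in other}
    if not ref_blocks:
        return 0.0
    return len(ref_blocks & other_blocks) / len(ref_blocks)


# Placeholder keys (made up). The first two share a skeleton and differ
# only in the second block, as stereoisomers would.
set_a = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-XXXXXXXXSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
]
set_b = [
    "AAAAAAAAAAAAAA-YYYYYYYYSA-N",
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
]

fraction = overlap_fraction(set_a, set_b)
print(f"{fraction:.1%} of set_a skeletons also occur in set_b")
```

Matching on the first block alone is deliberately loose: stereoisomers collapse to one entry, which suits a landscape-level overlap estimate but not an exact identity check.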

Page 28:

Too granular? Try scaffolds instead

Page 29:

Level 1 scaffold overlap

[Venn diagram: SureChEMBL and ChEMBL level 1 scaffolds; overlap 57%; set sizes 61K and 298K]


Page 31:

Can we have everything?

Cost, Time, Quality

Page 32:

Common sources of errors

• Small, poor quality images

• OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors

  -> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3-vDbenzamide’

• Reliability is better for US patents due to the inclusion of mol files
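To illustrate what an OCR correction step for chemical names could look like, here is a toy sketch built around the garbled name above. It is not the correction step used by IFI or SureChEMBL. The substitution table is hypothetical and derived from this single example, and the "corrected" name is only a plausible reading of what the patent intended. The bracket check shows one cheap way to flag names that are probably damaged.

```python
# Hypothetical confusion table, derived only from the example on the slide.
OCR_FIXES = [
    ("Λ/", "N"),            # "N" misread as lambda plus slash
    (" -r(", "-[("),        # "[" misread as "r", with a stray space
    ("methvn", "methyl]"),  # "yl]" misread as "vn"
    ("1 H-", "1H-"),        # stray space inside the "1H" locant
    ("vD", "yl}"),          # "yl}" misread as "vD"
]

CLOSERS = {")": "(", "]": "[", "}": "{"}


def apply_ocr_fixes(name, fixes=OCR_FIXES):
    """Apply each substitution in order; a naive, example-specific repair."""
    for wrong, right in fixes:
        name = name.replace(wrong, right)
    return name


def brackets_balanced(name):
    """True if (), [] and {} are properly nested; a cheap sanity check."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack


garbled = ("2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 "
           "H-pyrazol-3-vDbenzamide")
fixed = apply_ocr_fixes(garbled)

print(brackets_balanced(garbled), garbled)
print(brackets_balanced(fixed), fixed)
```

A fixed table like this only repairs errors it has seen before, which is consistent with the slide's point that a correction step cannot fix everything.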