JChem PostgreSQL Cartridge · 2017-06-27


  • JChem PostgreSQL Cartridge

    Handling large volumes of chemical data in a scalable way and on a tight budget

    By Ellert van Koperen

  • Why use a database? Let's make a list:

    - Data consistency

    - Safety and security

    - Central storage and backup

    - Speed -> Only important if the task has to be repeated several times

    - Centralized tools with centralized licensing

  • A medium-short intermission

  • Why the PostgreSQL cartridge?

    If you want to use Chemaxon tools* there are several alternatives for data storage:

    Text files (e.g. SDF)

    MySQL

    PostgreSQL

    Oracle

    Let's look at the landscape

  • Why the PostgreSQL cartridge? Let's look at the landscape

    Storage options compared: Files, MySQL, PostgreSQL, Oracle

    - Ease of use with tools (examples of tools: PaDEL)

    - Search speed on ‘normal’ values

    - Search speed on chemical expressions

    - Cost: Files 0,-; MySQL 0,-; PostgreSQL €; Oracle €€€€€

    [The per-option ratings for the first three criteria were shown graphically on the slide.]

  • Why the PostgreSQL cartridge? Issues:

    - Not totally mature yet, so bugs may lurk in corners

    - Mixing chemical and non-chemical clauses may need hints and tweaking to make the query use the chemindex

    - Mixing several chemical clauses is not possible without tricks

    - The NOT operator can NOT be used with a chemindex

  • A medium-short intermission

  • Use case example

    Run Reactor on a subset of all commercially available compounds. This test case: Grignard reaction

    Question: Why pump it through a database?

    1. Pre-filter. There are ±12.000.000 compounds in the test set. The Grignard reaction has 2 reactants, so Reactor would need to try 144.000.000.000.000 combinations.

    2. Pre-segment / group / organize, to keep data clustered (e.g. per vendor or per chemotype) or to distribute the workload.

    3. Ensure correctness of classes without the need to fix the reaction filters in Reactor. (See for example the problem with Fries rearrangement on rings.)

    4. Easy test-and-fix iterations due to high speed.
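The pre-filter argument is plain arithmetic, and a minimal sketch makes the scale visible. The 12.000.000 figure comes from the slide. The two hit counts after filtering are hypothetical numbers chosen only for illustration, not results from the talk.

```python
# Back-of-the-envelope count of reactant pairs for a two-component reaction.
TOTAL_COMPOUNDS = 12_000_000  # size of the test set quoted on the slide


def pair_count(n_first: int, n_second: int) -> int:
    """Number of ordered (first reactant, second reactant) pairs."""
    return n_first * n_second


# Without a pre-filter every compound is tried in both reactant positions.
unfiltered = pair_count(TOTAL_COMPOUNDS, TOTAL_COMPOUNDS)

# Hypothetical hit counts after pre-filtering (assumptions, NOT from the talk).
reagent_hits = 5_000
reactant_hits = 200_000
filtered = pair_count(reagent_hits, reactant_hits)

print(f"unfiltered pairs: {unfiltered:,}")
print(f"filtered pairs:   {filtered:,}")
print(f"reduction factor: {unfiltered // filtered:,}")
```

With these assumed hit counts the work for Reactor shrinks by several orders of magnitude, which is the point of "N^2 != Performance".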

  • Use case example

    Run Reactor on a subset of all commercially available compounds. This test case: Grignard reaction

    Steps:

    1. Gather, curate and store compounds in DB

    2. Query / Segment data

    3. Feed it to Reactor

    4. (Collect and use the result)

  • Step 1: Gather, Curate and Store

    Three ways to get all that data into the database:

    1. Direct SQL loading

    2. Through KNIME

    3. Using JCMAN
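For the first route (direct SQL loading), vendor SDF files first have to be split into records. The sketch below is one possible way to do that in plain Python, not the procedure used in the talk. It relies only on the SDF layout (records end with `$$$$`, data items start with a `> <NAME>` header and end with a blank line). The sample data, the abbreviated molblocks and the field names `VENDOR` and `CATALOG_ID` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, Dict[str, str]]  # (molblock, data fields)


def _split_record(lines: List[str]) -> Record:
    """Separate the molblock from the '> <NAME>' data items of one record."""
    molblock: List[str] = []
    fields: Dict[str, str] = {}
    name = None
    in_data = False
    for line in lines:
        if line.startswith(">") and "<" in line:
            in_data = True
            name = line[line.index("<") + 1 : line.rindex(">")]
            fields[name] = ""
        elif not in_data:
            molblock.append(line)
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current data item
            else:
                fields[name] = f"{fields[name]}\n{line}" if fields[name] else line
    return "\n".join(molblock), fields


def iter_sdf_records(text: str) -> Iterator[Record]:
    """Yield (molblock, fields) for every '$$$$'-terminated record."""
    lines: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if lines:
                yield _split_record(lines)
            lines = []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):  # trailing record without terminator
        yield _split_record(lines)


def chunked(items: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into batches, e.g. one batch per multi-row INSERT."""
    batch: List[Record] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical sample; the molblocks are abbreviated to an empty molecule.
EMPTY_COUNTS = "  0  0  0  0  0  0  0  0  0  0999 V2000"
SAMPLE_SDF = "\n".join([
    "mol-1", "  hypothetical", "", EMPTY_COUNTS, "M  END",
    "> <VENDOR>", "Acme", "", "> <CATALOG_ID>", "A-001", "", "$$$$",
    "mol-2", "  hypothetical", "", EMPTY_COUNTS, "M  END",
    "> <VENDOR>", "Bolt", "", "$$$$",
])

records = list(iter_sdf_records(SAMPLE_SDF))
batches = list(chunked(records, 1))
for molblock, fields in records:
    print(molblock.splitlines()[0], fields)
```

Each batch can then be handed to whatever database driver is in use as the parameters of a parameterized insert. The target table and column layout are left open here because the talk does not show them.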

  • Use case sample: step 1

    Gather, curate and store compounds in DB

    Loading it through JCMAN:

    1. Prepare the tables from the GUI

    2. Invoke JCMAN with the ‘a’ command to load the data

    3. Convert the ‘jchem’ type structure to ‘cartridge’ internal format using a simple command:

    4. Create the Chemindex:

  • Use case sample: step 1

    Creating the Chemindex is where the hard work is done:

  • Use case sample: step 2

    Pre-filter data for Grignard reaction

    * Why pre-filter? N^2 != Performance.

    A Grignard reagent can react with:

    - nitriles to form ketones

    - imines to form amines

    - epoxides, aziridines to form alcohols

    - diazonium salts to form azo compounds

    - amine, thio, phosphorus, boron compounds

    - disulfanes to form sulfanes

    - esters, carboxylic acids and acid halides and sulfur analogues

    - alkyl or aryl halides

    A Grignard reagent can undergo reduction with acidic groups.

  • Use case sample: step 2

    Pre-filter data for Grignard reaction

    Query 1, for the reagent: [#6]-[Mg][#17,#35,#53] (a carbon bonded to Mg, which is bonded to Cl, Br or I)

    Query 2, for the reactant, has a problem: it is not possible to use a NOT in a clause on a chemindex:

  • Use case sample: step 2

    Pre-filter data for Grignard reaction

    Got a trick up my sleeve:

    Etc…

  • Use case sample: step 2

    Pre-filter data for Grignard reaction

    Different trick: use the EXCEPT operator

    A very elegant solution, in theory. In practice, it breaks after the first EXCEPT (a bug).
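The set logic behind the EXCEPT idea can be sketched outside the cartridge. The example below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and hand-filled hit tables as a stand-in for chemindex-backed substructure searches. The table names, the SMILES and the choice of nitriles and epoxides as groups to exclude are all assumptions made for illustration. It shows only how chained EXCEPT clauses emulate "NOT matching A and NOT matching B". It says nothing about the cartridge behavior reported on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT);
    -- Stand-ins for the results of two substructure searches (hypothetical).
    CREATE TABLE hits_nitrile (id INTEGER);
    CREATE TABLE hits_epoxide (id INTEGER);

    INSERT INTO compounds VALUES
        (1, 'CCO'), (2, 'CC#N'), (3, 'C1CO1'), (4, 'CCBr'), (5, 'N#CC1CO1');
    INSERT INTO hits_nitrile VALUES (2), (5);
    INSERT INTO hits_epoxide VALUES (3), (5);
    """
)

# Start from everything, then subtract each unwanted hit set in turn.
rows = conn.execute(
    """
    SELECT id FROM compounds
    EXCEPT SELECT id FROM hits_nitrile
    EXCEPT SELECT id FROM hits_epoxide
    ORDER BY id
    """
).fetchall()

kept_ids = [row[0] for row in rows]
print(kept_ids)
conn.close()
```

Only compounds 1 and 4 survive both subtractions, which is the result a NOT clause would have given if it could be used on the chemindex.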

  • Use case sample: step 3

    Feed it to Reactor

    - Export the selections to files, using a join with the temp_table

    - Create the rest of the SDF fields programmatically

    - Optionally in chunks

    - Easy to automate with a scripting language
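A scripted export of this kind could look roughly like the sketch below, which writes query results to SDF files in fixed-size chunks and generates the data fields programmatically. It assumes each row arrives as a (molblock, fields) pair. The row contents, the field name `ID`, the file prefix and the chunk size are hypothetical, and the database query itself is left out.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, Dict[str, str]]  # (molblock, data fields)


def sdf_record(molblock: str, fields: Dict[str, str]) -> str:
    """Render one SDF record: molblock, data items, '$$$$' terminator."""
    parts = [molblock.rstrip("\n")]
    for name, value in fields.items():
        parts.append(f">  <{name}>")
        parts.append(str(value))
        parts.append("")  # a blank line ends the data item
    parts.append("$$$$")
    return "\n".join(parts) + "\n"


def write_chunks(rows: Iterable[Row], chunk_size: int, out_dir: Path,
                 prefix: str = "chunk") -> List[Path]:
    """Write rows to numbered SDF files holding at most chunk_size records."""
    paths: List[Path] = []
    handle = None
    try:
        for i, (molblock, fields) in enumerate(rows):
            if i % chunk_size == 0:  # time to start a new file
                if handle:
                    handle.close()
                path = out_dir / f"{prefix}_{i // chunk_size + 1:04d}.sdf"
                handle = path.open("w", encoding="utf-8", newline="\n")
                paths.append(path)
            handle.write(sdf_record(molblock, fields))
    finally:
        if handle:
            handle.close()
    return paths


# Hypothetical rows; the molblocks are abbreviated to an empty molecule.
EMPTY_MOL = "\n".join([
    "{title}", "  hypothetical", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
])
demo_rows = [(EMPTY_MOL.format(title=f"mol-{n}"), {"ID": str(n)}) for n in range(1, 6)]

out_dir = Path(tempfile.mkdtemp())
paths = write_chunks(demo_rows, chunk_size=2, out_dir=out_dir)
for p in paths:
    print(p.name, p.read_text(encoding="utf-8").count("$$$$"), "records")
```

Separate files per chunk make it straightforward to run several Reactor jobs side by side or to rerun a single failed chunk.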

  • Use case sample: step 3

    Feed it to Reactor

    - Export the selections to files

    - Run Reactor on the generated files

    - Import and process the results

  • Handling large volumes of chemical data in a scalable way and on a tight budget is possible.

  • JChem PostgreSQL Cartridge

    Thank you: all the people of Chemaxon, and especially Andras Volford, Annamaria Kovacs, Eszter Szabo, Ivan Solt, János Fejérvári, Janos Papdeak, Krisztina Vajda, Mihály Medzihradszky, Miklos Vargyas, Tamas Varga, Zsolt Mohacsi, Zsuzsanna Szabo.

    And: Anna Karawajczyk, Daniel Blanco Ania, Daniel Gironés