UGM 2006 Miklós Vargyas Whats new in JKlustor. Overview An introduction to JKlustor –Brief...

Post on 26-Mar-2015


UGM 2006

Miklós Vargyas

What’s new in JKlustor

Overview

• An introduction to JKlustor
– Brief history of the product
– Main features
– Usage examples
– Performance

• LibMCS, an alternative approach to clustering chemical structures
– Concepts, motivation
– Features
– Performance

• Future of JKlustor

Brief history of JKlustor

• First discovery tool in the JChem package
– Jarp released in version 1.5.2 (March 22, 2001)
– Compr 1.5.7 (May 27, 2001)
– Ward 1.5.9 (Jun 25, 2001)

• API released in JChem 1.6.2 (May 16, 2002)

• Experimental LibMCS first released in JChem 3.0 (Dec 1, 2004)

• New JKlustor GUI to be released in JChem 3.?

JKlustor features

• Similarity based clustering
– ChemAxon’s topological fingerprint
– External data points, arbitrary dimension
– Tanimoto, weighted Euclidean

• Hierarchical clustering: Ward
– Reciprocal nearest neighbor algorithm
– Kelley method

• Non-hierarchical clustering: Jarvis-Patrick

• Diversity calculation: Compr

• Structure based clustering: LibMCS
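To make the Tanimoto metric named above concrete, here is a minimal sketch that treats a binary fingerprint as the set of its on-bit indices. The function name, the example bit sets, and the convention for two empty fingerprints are illustrative assumptions, not ChemAxon's API.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 7}
similarity = tanimoto(fp1, fp2)
dissimilarity = 1.0 - similarity
```

Clustering tools typically work with the dissimilarity (one minus the coefficient), so that identical fingerprints sit at distance zero.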

JKlustor usage

• Command line tools
– Pipelining commands
– Option flags
– Structure file/database input
– Manual creation of cluster views

Input SDFile → GenerateMD → NNeib → JarvisPatrick → CreateView → MarvinView → Picture

JKlustor usage

• Prepare data and run clustering

generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
nneib -f 512 -t 0.1 -g -i fingerprints.txt -o neighborlists.txt
jarp -c 0.2 -y -i neighborlists.txt -o clusters.txt

• View first cluster

crview -i id -c "clid=1" -s input.sdf -t clusters.txt -o jarp_cluster1.sdf
mview -c 3 -r 3 jarp_cluster1.sdf

• View centroids, display cluster id and size

crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt -o jarp_centroids.sdf
mview -c 3 -r 3 -f "clid:size" jarp_centroids.sdf
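The clustering step above consumes nearest-neighbor lists. As a rough illustration of what a Jarvis-Patrick pass does with such lists, here is a sketch of the textbook rule: two items join the same cluster when each appears in the other's neighbor list and they share at least a minimum number of neighbors. The function name, the `min_shared` parameter, and the toy data are assumptions for illustration. This is not the jarp implementation, and it makes no claim about how jarp's option flags map onto these parameters.

```python
def jarvis_patrick(neighbors: dict, min_shared: int) -> list:
    """Textbook Jarvis-Patrick clustering over precomputed neighbor lists."""
    parent = {item: item for item in neighbors}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, near_i in neighbors.items():
        for j in near_i:
            mutual = i in neighbors[j]
            shared = len(set(near_i) & set(neighbors[j]))
            if mutual and shared >= min_shared:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for item in neighbors:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical neighbor lists for five structures.
toy_neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["e", "c"],
    "e": ["d", "c"],
}
toy_clusters = jarvis_patrick(toy_neighbors, min_shared=1)
```

In the toy data, "d" lists "c" as a neighbor but "c" does not list "d", so the two groups stay separate.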

JKlustor usage

[Chart: run time (s), axis 0 to 14000, against library size (100, 1000, 10000, 20000, 40000, 100000); series: Ward 512, Jarp 512]

JKlustor performance

• Memory: O(n)

• Time: Jarvis-Patrick O(n^1.5), Ward O(n^2)
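Reading the time bounds as n to the power 1.5 for Jarvis-Patrick and n squared for Ward, a short calculation shows how differently the two scale. The sketch below only evaluates the asymptotic growth factor for a tenfold larger library. It ignores constant factors and is not a measurement.

```python
def growth_factor(size_ratio: float, exponent: float) -> float:
    """Factor by which run time grows when the library grows by size_ratio."""
    return size_ratio ** exponent


# A library ten times larger, under the stated complexity bounds.
jarp_factor = growth_factor(10, 1.5)  # roughly 31.6 times slower
ward_factor = growth_factor(10, 2.0)  # 100 times slower
```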

What is MCS?

• The Maximum Common Substructure of two chemical structures
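To illustrate the definition, here is a brute-force sketch that finds the size of the largest connected common substructure of two tiny molecules, each given as a list of atom symbols plus a list of bonds. It matches atoms by element only and ignores bond orders. The representation and function names are assumptions for illustration. The search is exponential, so it is only suitable for toy inputs, and it is not the algorithm LibMCS uses.

```python
def adjacency(atom_count: int, bonds: list) -> list:
    """Neighbor sets for a molecule given as (i, j) bond index pairs."""
    adj = [set() for _ in range(atom_count)]
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def mcs_size(atoms_a, bonds_a, atoms_b, bonds_b) -> int:
    """Atom count of the largest connected common induced substructure."""
    adj_a = adjacency(len(atoms_a), bonds_a)
    adj_b = adjacency(len(atoms_b), bonds_b)
    best = 0

    def extend(mapping):
        nonlocal best
        best = max(best, len(mapping))
        used_b = set(mapping.values())
        for u in range(len(atoms_a)):
            if u in mapping:
                continue
            # Keep the substructure connected: grow only next to mapped atoms.
            if mapping and not any(n in mapping for n in adj_a[u]):
                continue
            for v in range(len(atoms_b)):
                if v in used_b or atoms_a[u] != atoms_b[v]:
                    continue
                # Bonds to already mapped atoms must agree in both molecules.
                if all((w in adj_a[u]) == (mapping[w] in adj_b[v]) for w in mapping):
                    mapping[u] = v
                    extend(mapping)
                    del mapping[u]

    extend({})
    return best


# Toy chains C-C-O and C-C-N share the two-carbon fragment.
common = mcs_size(["C", "C", "O"], [(0, 1), (1, 2)],
                  ["C", "C", "N"], [(0, 1), (1, 2)])
```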

Clustering by MCS?

• Find the MCS of a group of structures

Very brief history of LibMCS

• Reaction automapper, based on Maximum Common Subgraph Search

• MCS class API made public

• Customer requested MCS based clustering
– More intuitive than similarity based
– Focused set analysis
  • screens: 2000 – 10000 structures
  • lead optimization: 3000 – 5000 structures
– Should be hierarchical (outliers)
– Ultimate goal: cluster 5000 compounds in 5 seconds

LibMCS features

• MCS based hierarchical clustering

• Flexible search options

• Hierarchy browser

• Filtering by chemical properties

• Cluster statistics

• No size limitation

• Fast operation

LibMCS – Dendrogram view

LibMCS – Molecule view

LibMCS – Table view

LibMCS – Statistics

LibMCS – Selections

LibMCS – Property filters

LibMCS – Output files

LibMCS – Output files

CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)CN(C2CCCC2)C1=O 0 21 0
CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)C2CCCN2C1=O 0 21 0
OC(=O)C1CCCN1C(=O)CCS CC(CS)C(=O)N1CCCC1C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS [H]C1(CCCN1C(=O)CCS)C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCC2SC(=O)C3=CC=CC=C3 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCCC2S 0 19 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1SC(=O)C2(C)CC3=CC=CC=C3CN2C1=O 0 20 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1CSC(=O)C2CC3=C(CN2C1=O)C=CC=C3 0 20 0
CC1SC(=O)C2CCCN2C1=O CC1SC(=O)C2CCCN2C1=O 0 30 0
CC1SC(=O)CNC1=O CC1SC(=O)CNC1=O 0 29 0
OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 0 31 0
CC(S)C(=O)NCC(O)=O CC(S)C(=O)NCC(O)=O 0 24 0
CCC1=CC=CC=C1 CC(NC(CCC1=CC=CC=C1)C(O)=O)C(=O)N2CCCC2C(O)=O 0 22 0
CCC1=CC=CC=C1 CCOC(=O)C(CC1=CC=CC=C1)NC(=O)NC(CC2=CC=CC=C2)C(=O)OCC 0 22 0
OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 0 23 0
C\C(Cl)=N/OC(N)=O C\C(Cl)=N/OC(N)=O 0 27

> <Cluster_ID>
1163

> <Element_count>
1

> <Parent_ID>
1

$$$$

Marvin 05290619172D

23 24 0 0 0 0 999 V2000
2.4230 -0.3587 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.1375 0.0538 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1375 0.8788 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.4349 -1.1837 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.1494 -1.5962 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8638 -1.1837 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0
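The SDF output carries the cluster assignment in data fields such as Cluster_ID, Element_count and Parent_ID. As a minimal sketch of how a downstream script might read them, the function below collects the data items of one record using the standard SDF layout (a "> <name>" header line followed by value lines and a blank line). The function name and the sample record are illustrative assumptions, and a real pipeline would more likely use a cheminformatics toolkit.

```python
def read_sdf_fields(record: str) -> dict:
    """Collect the '> <name>' data items of a single SDF record."""
    fields = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            values = []
            # Value lines run until a blank line or the record terminator.
            while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
                values.append(lines[i])
                i += 1
            fields[name] = "\n".join(values)
        else:
            i += 1
    return fields


# Sample data block modeled on the output shown above.
record = """> <Cluster_ID>
1163

> <Element_count>
1

> <Parent_ID>
1

$$$$"""
fields = read_sdf_fields(record)
```

Grouping records by Parent_ID would then recover the hierarchy, assuming each Parent_ID refers to the Cluster_ID of the enclosing cluster.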

LibMCS – RGroup decomposition

LibMCS – RGroup decomposition

LibMCS – Performance

• Depends on
– average structure size
– total diversity
– minimal required MCS size
– atom/bond constraints

• Scales linearly

• Maximum speed achieved
– 1 000 structures in 3 seconds

• Memory requirements
– 100 000 structures occupy 200 MB

LibMCS – Performance

[Chart: running time (sec), axis 0 to 4000, against structure count, axis 0 to 35000]

LibMCS – Further applications

• Find the MCS of existing clusters

• Data retrieval

• Assay analysis

• Compound acquisition

• Combinatorial library profiling

Development plans

• Disconnected MCS

• Multi-group clustering

• More chemical sense (e.g. avoid opening rings, consider chirality)

• Performance tuning (e.g. NN)

• Integrate Ward/Jarp into new GUI

• Additive clustering

• Clustering million compound libraries

• Integrate Chemical Terms

• Integrate molecular descriptors, optimized metrics

Summary

• New tool in JKlustor based on MCS

• More plausible grouping

• Hierarchical with dendrogram browser

• Statistics

• Filtering, coloring, selection

Acknowledgements

• Developers
– Ferenc Csizmadia, Árpád Tamási, András Volford, Szilárd Doránt
– Péter Vadász, Nóra Máté

• Special thanks