sMOL Explorer 1.1

sMOL Explorer*: User’s Guide
Copyright © 2007

* Supawadee Ingsriswang; Eakasit Pacharawongsakda (2007), "sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets", Bioinformatics, Vol 23(18), September, pp. 2498-2500 http://bioinformatics.oxfordjournals.org/cgi/reprint/23/18/2498

© 2007 ISL/BIOTEC. All Rights reserved.
Information Systems Laboratory (ISL), National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand


Table of Contents

1. Getting Started with sMOL Explorer
   1.1. User Registration
   1.2. Menus

2. Structure Data Management
   2.1. Direct Entry
   2.2. Batch upload
   2.3. Data Workspace

3. Structural Similarity and Text Search
   3.1. Structure Search
   3.2. Text Search

4. Clustering Analysis
   4.1. Loading or Selecting Data
   4.2. Selecting a Clustering Method
   4.3. The Clustering Output

5. Finding Frequent Substructure
   5.1. Loading or Selecting Data
   5.2. Specifying the minimum support threshold
   5.3. The List of Frequent Substructures

6. Feature Selection
   6.1. Loading or Selecting Data
   6.2. Selecting a Feature Selection Method
   6.3. The Output

7. Classification
   7.1. Loading or Selecting Data
   7.2. Testing options
   7.3. Selecting a Classifier
   7.4. The Classification output

8. Utilities
   8.1. Data Preparation
   8.2. File Conversion
   8.3. Computing Molecular Descriptors
   8.4. Administrator Tasks


1. Getting started with sMOL Explorer

1.1 User Registration

Click on Sign Up in the Login Page to register a user account for sMOL Explorer. On the Sign-Up page, follow these two steps (Figure 1-1):

• Type the required information: username, password, first name, last name, email address and telephone number.

• Click on the Submit Button to send your request to the system administrator. After the administrator grants permission, you can sign in to sMOL Explorer.

Note: Default username is administrator and password is 1q2w3e4r.

Figure 1-1 Sign-Up page

1.2 Menus

Once you sign in, you can begin using sMOL Explorer. The menu bar always appears at the top of the screen, as shown in Figure 1-2. It contains, from left to right, the Structure Registration Menu, the Search Menu, the Data Analysis Menu, the Data Workspace Menu, the Utility Menu and Logout. To navigate the menus, move the mouse over the menu title, then left-click (or just click with a single-button mouse) on the item you want.

Figure 1-2 Menu bar

Figure 1-3 Structure Registration Menu

The Structure Registration Menu is in the top-left corner of the screen and, when clicked, shows a menu containing three items (Figure 1-3):

• Direct Entry: Register molecules one at a time into the database
• Batch upload: Prepare data for multiple molecules in a data file and upload it into the database
• Edit/Delete Compound: Edit or remove molecules

For more information, see details on how to manage the structure database with sMOL Explorer in Section 2.

Figure 1-4 Search Menu

The Search Menu consists of two items (Figure 1-4):

• Structure Search: Find all the compounds in the database that have the given structure or substructure. sMOL Explorer offers three basic categories of structure search: exact structure, substructure and structural similarity.

• Text Search: Find the compounds that have information relevant to the query text.

See details in Section 3.


Figure 1-5 Data Analysis Menu

The Data Analysis Menu lists the tools for exploring data within sMOL Explorer. Click on one of the following items (Figure 1-5):

• Clustering: Cluster the selected molecules based on molecular fingerprints.
• Molecular Substructure Miner: Find the list of frequent substructures that occur in molecules above the minimum support in the dataset.
• Feature Selection: Remove irrelevant features from the dataset before attempting to train a classifier.
• Classification: Train and test a model for classifying the compounds.

Detailed information on how to analyze or explore the structure data with sMOL Explorer is in Sections 4 to 7.

Figure 1-6 Data Workspace menu

The Data Workspace Menu consists of three workspace categories relating to three common operations (Figure 1-6):

• Upload Workspace: Manage the previously uploaded data in the Data Workspace.

• Search Workspace: Work with the search results saved in the Data Workspace.

• Analysis Workspace: Keep the saved analysis results in the Data Workspace.

To use the Data Workspace efficiently, go to Section 2.

Figure 1-7 Utility menu

The Utility Menu integrates utilities for file format conversion, computing molecular descriptors of a structure, and administrator tasks. It offers four items (Figure 1-7):

• Data Preparation: Prepare data in the sMOL-defined tab-delimited format. This file will be used as input for data analysis.

• File Conversion: Convert between chemical data file formats.

• Calculate Descriptor: Compute molecular descriptors of the query structure.

• Administrator Tasks: Manage user accounts, set up the system configuration and update the URLs to external databases.

See details in Section 8.

Logout: Click on Logout when you want to close the program and log out of the system.


2. Structure Data Management

2.1 Direct Entry

In direct entry mode, users can add the structure of a small molecule to the database via the web in one of the following ways:

• Interactively draw the 2D structure of the molecule, or paste a SMILES string, in the JChemPaint editor
• Upload the Mol file directly into the database

After submitting the structure data, the user can enter the associated screening data.

2.2 Batch Upload

For batch mode, users can prepare structure and screening data in either an SDF file or an sMOL-defined XML file and upload it into the database.

SDF File

The SDF file format is defined by MDL (Molecular Design Ltd). An SDF file can contain multiple compounds together with properties and references. The SDF file for sMOL Explorer must contain the following SD fields.

Required SD Fields

- Datasource
  The name of the data source.
- CompoundID
  The number of compounds contained in this file.
- SMILES
  SMILES string used to represent the chemical structure of the compound being registered. It is ignored if a chemical structure with atoms is also provided in the SD file format.
- CompoundName
  The name of the compound.
- Category
  The category of the compound. It can be:
  (1) Natural Product, or
  (2) Commercially Available, or
  (3) Semi-synthesis
- CompoundType
  The type of the compound. It can be:
  (1) Terpenes / Steroids, or
  (2) Alkaloids, or
  (3) Polyketides, or
  (4) Fatty acids, or
  (5) Unknown
- Available
  This field determines which users are permitted to view this compound. It should be “Only registed user” or “Everyone”.

Allowed SD Fields

- CASNumber
  Chemical Abstracts Service identification number (if the molecule is in CAS).
- PubChemID
  NCBI's PubChem database identification number (if the molecule is in PubChem).


- KEGGID
  Kyoto Encyclopedia of Genes and Genomes compound identification number (if the molecule is in KEGG).
- IUPACName
  IUPAC or standard chemical name for the compound.
- MeltingPoint
  Melting point (if solid) or boiling point (if liquid) in degrees Celsius.
- SpecimenVoucher
  A specimen voucher is the remainder from which this compound has been isolated. Currently sMOL Explorer supports only one specimen voucher per compound.
- Type
  The type of organism. It should be “Microbe” or “Plant”.
- Phylum
  The systematic name that represents the biological Phylum of this organism.
- Order
  The systematic name that represents the biological Order of this organism.
- Family
  The systematic name that represents the biological Family of this organism.
- Genus
  The systematic name that represents the biological Genus of this organism.
- Species
  The systematic name that represents the biological Species of this organism.
- NumberOfActivities
  The number of biological activities of this compound.
- ActivityName
  The name of the biological activity.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example ActivityName1, ActivityName2.
- ActivityMeasure
  The toxicity and cell viability assessment. It can be:
  - IC50: the concentration likely to cause a 50% reduction in light output from the population.
  - EC50: the effective concentration that inhibits growth in 50% of the tested population.
  - MIC: the minimal inhibitory concentration.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example ActivityMeasure1, ActivityMeasure2.
- ActivityValue
  The value of the bioassay test.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example ActivityValue1, ActivityValue2.
- ActivityConfidence
  The confidence level of the biological activity. It can be one of the following items:
  (1) Weakly Active
  (2) Moderately Active
  (3) Strongly Active
  (4) Unknown
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example ActivityConfidence1, ActivityConfidence2.
- Application
  The utilization of this organism.
- NumberOfReferences
  The number of publications related to this compound.
- ReferenceTitle
  The title of the publication.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferenceTitle1, ReferenceTitle2.
- ReferenceAuthor
  All authors of the publication. If the publication has several authors, separate them with ; (semicolon).
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferenceAuthor1, ReferenceAuthor2.
- ReferenceYear
  The year of publication.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferenceYear1, ReferenceYear2.
- ReferenceJournal
  The journal name of the publication.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferenceJournal1, ReferenceJournal2.
- ReferenceVolume
  The volume of the journal.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferenceVolume1, ReferenceVolume2.
- ReferencePage
  The pages of the publication in the journal. Use a dash (-) to separate the start page and the end page.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example ReferencePage1, ReferencePage2.
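The numbering convention for repeated SD fields can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names (numbered, sd_record), the sample values and the atom-free mol block are hypothetical, and the "> <FieldName>" data item and "$$$$" record delimiter are the general SD file conventions rather than anything specific to sMOL Explorer.

```python
def numbered(prefix, values):
    """Return (field name, value) pairs following the guide's convention:
    a single value keeps the bare name, several values get 1, 2, ... appended."""
    if len(values) == 1:
        return [(prefix, values[0])]
    return [(f"{prefix}{i}", v) for i, v in enumerate(values, start=1)]


def sd_record(mol_block, fields):
    """Format one SD record: mol block, data items, then the $$$$ delimiter."""
    lines = [mol_block.rstrip("\n")]
    for name, value in fields:
        lines += [f"> <{name}>", str(value), ""]
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


# Atom-free V2000 mol block, so the structure would come from the SMILES field.
# Assumption: sMOL Explorer accepts an empty mol block in this situation.
EMPTY_MOL = "\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END"

# All values below are placeholders chosen for illustration only.
fields = [
    ("Datasource", "example-source"),
    ("CompoundID", "1"),
    ("SMILES", "CCO"),
    ("CompoundName", "ethanol"),
    ("Category", "Commercially Available"),
    ("CompoundType", "Unknown"),
    ("Available", "Everyone"),
    ("NumberOfActivities", 2),
]
fields += numbered("ActivityName", ["placeholder assay A", "placeholder assay B"])
fields += numbered("ActivityMeasure", ["IC50", "MIC"])
fields += numbered("ActivityConfidence", ["Unknown", "Unknown"])

record = sd_record(EMPTY_MOL, fields)
print(record)
```

With two activities the sketch emits ActivityName1 and ActivityName2; with a single activity the bare name ActivityName would be used instead.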

sMOL-defined XML

As an alternative to the SDF format, data can be supplied in sMOL-defined XML. Its DTD is shown below.


<!DOCTYPE DataSet [
<!ELEMENT DataSet ( Compound+ ) >
<!ELEMENT Compound ( Structure, Characteristic, Bioresource, Activities, Application, References ) >
<!ATTLIST Compound number NMTOKEN #REQUIRED >

<!ELEMENT Structure ( MOL?, SMILES? ) >
<!ELEMENT MOL ( #PCDATA ) >
<!ELEMENT SMILES ( #PCDATA ) >

<!ELEMENT Characteristic ( Datasource, CASNumber?, PubChemID?, KEGGID?, CompoundName, IUPACName?, MeltingPoint?, Category, CompoundType, Available ) >
<!ELEMENT Datasource ( #PCDATA ) >
<!ELEMENT CASNumber ( #PCDATA ) >
<!ELEMENT PubChemID ( #PCDATA ) >
<!ELEMENT KEGGID ( #PCDATA ) >
<!ELEMENT CompoundName ( #PCDATA ) >
<!ELEMENT IUPACName ( #PCDATA ) >
<!ELEMENT MeltingPoint ( #PCDATA ) >
<!ELEMENT Category ( #PCDATA ) >
<!ELEMENT CompoundType ( #PCDATA ) >
<!ELEMENT Available ( #PCDATA ) >

<!ELEMENT Bioresource ( SpecimenVoucher?, Type?, Phylum?, Order?, Family?, Genus?, Species ) >
<!ELEMENT SpecimenVoucher ( #PCDATA ) >
<!ELEMENT Type ( #PCDATA ) >
<!ELEMENT Phylum ( #PCDATA ) >
<!ELEMENT Order ( #PCDATA ) >
<!ELEMENT Family ( #PCDATA ) >
<!ELEMENT Genus ( #PCDATA ) >
<!ELEMENT Species ( #PCDATA ) >

<!ELEMENT Activities ( Activity+ ) >
<!ELEMENT Activity ( Name, Measure, Value, Confidence ) >
<!ATTLIST Activity number NMTOKEN #REQUIRED >
<!ELEMENT Name ( #PCDATA ) >
<!ELEMENT Measure ( #PCDATA ) >
<!ELEMENT Value ( #PCDATA ) >
<!ELEMENT Confidence ( #PCDATA ) >

<!ELEMENT Application ( #PCDATA ) >

<!ELEMENT References ( Reference+ ) >
<!ELEMENT Reference ( Title, Authors, Year, Journal, Volume?, Page? ) >
<!ATTLIST Reference no NMTOKEN #REQUIRED >
<!ELEMENT Title ( #PCDATA ) >
<!ELEMENT Authors ( Author+ ) >
<!ELEMENT Author ( FirstName, MiddleName, LastName ) >
<!ATTLIST Author no NMTOKEN #REQUIRED >
<!ELEMENT FirstName ( #PCDATA ) >
<!ELEMENT MiddleName ( #PCDATA ) >
<!ELEMENT LastName ( #PCDATA ) >
<!ELEMENT Year ( #PCDATA ) >
<!ELEMENT Journal ( #PCDATA ) >
<!ELEMENT Volume ( #PCDATA ) >
<!ELEMENT Page ( Start, End ) >
<!ELEMENT Start ( #PCDATA ) >
<!ELEMENT End ( #PCDATA ) >
]>
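As a rough illustration of the DTD, the following Python sketch builds a minimal DataSet document containing one Compound with only the required child elements, in the order the DTD declares them. The helper names (add_children, build_dataset) and every value are placeholders invented for the example; whether sMOL Explorer imposes further constraints on the values is not covered here.

```python
import xml.etree.ElementTree as ET


def add_children(parent, items):
    """Append one child element per (tag, text) pair, in the given order."""
    for tag, text in items:
        ET.SubElement(parent, tag).text = text


def build_dataset():
    """Build a DataSet with one Compound holding only the required elements."""
    dataset = ET.Element("DataSet")
    compound = ET.SubElement(dataset, "Compound", number="1")

    structure = ET.SubElement(compound, "Structure")
    add_children(structure, [("SMILES", "CCO")])

    characteristic = ET.SubElement(compound, "Characteristic")
    add_children(characteristic, [
        ("Datasource", "example-source"),       # placeholder
        ("CompoundName", "ethanol"),
        ("Category", "Commercially Available"),
        ("CompoundType", "Unknown"),
        ("Available", "Everyone"),
    ])

    bioresource = ET.SubElement(compound, "Bioresource")
    add_children(bioresource, [("Species", "placeholder species")])

    activities = ET.SubElement(compound, "Activities")
    activity = ET.SubElement(activities, "Activity", number="1")
    add_children(activity, [
        ("Name", "placeholder assay"),
        ("Measure", "IC50"),
        ("Value", "0"),                          # placeholder, not a real result
        ("Confidence", "Unknown"),
    ])

    add_children(compound, [("Application", "placeholder application")])

    references = ET.SubElement(compound, "References")
    reference = ET.SubElement(references, "Reference", no="1")
    add_children(reference, [("Title", "placeholder title")])
    authors = ET.SubElement(reference, "Authors")
    author = ET.SubElement(authors, "Author", no="1")
    # The DTD requires MiddleName, so it is emitted even when empty.
    add_children(author, [("FirstName", "First"), ("MiddleName", ""), ("LastName", "Last")])
    add_children(reference, [("Year", "2007"), ("Journal", "placeholder journal")])
    return dataset


xml_text = ET.tostring(build_dataset(), encoding="unicode")
print(xml_text)
```

Note that the DTD names the numbering attribute number on Compound and Activity but no on Reference and Author; the sketch follows that distinction.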


2.3 Data Workspace

The Data Workspace is an environment for handling the data used by each user. It contains three types of workspace, one for each of three common operations in sMOL Explorer.

• Click on Upload Workspace to manage the previously uploaded data in the Upload Workspace.

• Click on Search Workspace to work with the saved search results in the Search Workspace.

• Click on Analysis Workspace to process the saved analysis results in the Analysis Workspace.

Upload Workspace:

Once a dataset has been uploaded into sMOL Explorer for analysis, it is stored in the Upload Workspace.

Figure 2-1 Upload Workspace page.

In the Upload Workspace Page, you can perform the following tasks (Figure 2-1):

• Input the dataset name or click on [button] to select a data file, and then click on [button] to upload a new dataset
• Click on Delete or Download in the row corresponding to the dataset you want to remove from, or download from, the database.
• Click on the Clear All Link to delete all datasets


• Click on the dataset name to select and view the data (Figure 2-2). At the bottom of the data page, you can
  o Click on the Export Link to export the dataset
  o Click on the Analyze Link to transfer the dataset to the analysis main page.
  o Click on the Clear All to delete the dataset

Figure 2-2 Compounds in each data set.

Search Workspace:

Each time you search the database in sMOL Explorer, you can select molecules from the search result to be collected in the Search Workspace. In the Search Workspace Page, you can (Figure 2-3)

• Set the number of records to display on this page
• Click on Delete in the row corresponding to the molecule you want to remove from the Search Workspace.
• Click on the generic name in the row corresponding to the molecule whose data you want to view.
• At the bottom of the Search Workspace page, you can
  o Click on the Export Link to export all the data from the Search Workspace
  o Click on the Analyze Link to transfer all the data from the Search Workspace to the analysis main page.
  o Click on the Clear All to delete all the data in the Search Workspace


Figure 2-3 Search workspace page.

Analysis Workspace:

The Analysis Workspace keeps the saved results from data analysis for each user. When a data analysis finishes and displays its result, the user can click on the Save to WorkSpace button to save the result into the Analysis Workspace. The Analysis Workspace Page lists the saved analysis results by type of analysis. In the Analysis Workspace Page, you can perform the following tasks (Figure 2-4):

• Click on Delete in the row corresponding to the saved result you want to remove from the workspace.

• Click on the dataset name in the row corresponding to the saved result you want to select and view.


Figure 2-4 Analysis workspace page.


3. Structural Similarity and Text Search

3.1 Structure Search

sMOL Explorer supports structure search in three basic categories:

• Exact Search
• Substructure Search
• Similarity Search

To use structure search in sMOL Explorer, users can paste a molecular structure into the JChemPaint interface or upload a data file, then select the database and search type to find similar compounds in the database. sMOL Explorer also allows users to search molecules against publicly accessible databases, including PubChem, KEGG, DrugBank and eMolecules, over the Internet. For a similarity search, users must specify the similarity measure (Tanimoto, Cosine or Simpson) and the similarity threshold (Figure 3-1).
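To make the three similarity measures concrete, here is a minimal Python sketch using their standard textbook definitions on binary fingerprints, represented as sets of "on" bit positions. How sMOL Explorer itself generates fingerprints and evaluates the measures is not described in this section, so the function names, the set representation and the sample fingerprints are assumptions made for illustration.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the two bit counts."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def simpson(a, b):
    """Shared bits divided by the smaller of the two bit counts."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical fingerprints: positions of the bits that are set.
query = {1, 2, 3, 4}
candidate = {3, 4, 5}

threshold = 0.5  # example similarity threshold
for measure in (tanimoto, cosine, simpson):
    score = measure(query, candidate)
    print(measure.__name__, round(score, 4), score >= threshold)
```

For the same pair of fingerprints the three measures give different scores (here Tanimoto is the lowest and Simpson the highest), so a given threshold is not equally strict across measures.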

3.2 Text Search

In text search, users specify search terms to find the compounds that have information relevant to the query text, as shown in Figure 3-2.


Figure 3-1 Structure search page.


Figure 3-2 Text search page.


4. Clustering Analysis

4.1 Loading or Selecting Data

Users can directly upload a new dataset or select a dataset from the workspace for clustering analysis. In Figure 4-1, you can perform the following tasks.

• Click on the New Data Set, then enter the filename or click on [button] to browse for a data file to upload. Then enter the name of the data set (Figure 4-1).

• Otherwise click on the Data Workspace. You can select one of the following workspace types:
  o Upload Workspace for the previously uploaded datasets
  o Search Workspace for the saved search result
  o Analysis Workspace for the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded/saved in that workspace (Figure 4-2).

• Type a name in the Result Name Textbox for the clustering result

Figure 4-1 Clustering page with new data set as input.

Figure 4-2 Clustering page with upload data set as input.


4.2 Selecting a Clustering Method

Presently, clustering methods in sMOL Explorer can be grouped into two types: Partitioning and Hierarchical algorithms. Hierarchical Methods include Agglomerative Nesting Clustering (R: AGNES) and Hierarchical clustering (R: HClust) from R-packages, while partitioning methods are K-Centroids Cluster analysis from R-packages, and Minimum Entropy clustering (Figure 4-3).

Figure 4-3 Clustering algorithm.

• Partitioning methods
  o K-Centroids
    - Number of Clusters: Specify the initial number of clusters
    - Family: Select a clustering method such as K-Means, K-Medians, Angle, Expectation-based Jaccard
  o Minimum Entropy clustering
    - Number of Clusters: Specify the initial number of clusters
    - Alpha
    - Kernel: Hypercube / Gaussian
    - Bandwidth
    - Number of K-Means Iterations

• Agglomerative Nesting and Hierarchical Clustering
  o Similarity Metric: You can choose one of the following methods for measuring the similarity between a pair of samples in the dataset.
    - Euclidean
    - Maximum
    - Binary
    - Canberra
    - Manhattan
    - Minkowski
  o Clustering Methods: To combine or separate two clusters of data, you need to measure the distance between groups or clusters. Different inter-group distance measures give rise to the following clustering methods.
    - Group Average Method
    - Single Linkage Method
    - Complete Linkage Method
    - Ward’s Method
    - Weighted Average Method
    The other three methods, Mcquitty, Median and Centroid, are available only for Hierarchical Clustering.
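The interplay between a similarity metric and a clustering (linkage) method can be sketched in a few lines of Python. This is a naive illustration of agglomerative clustering in general, not the R AGNES or HClust code that sMOL Explorer calls; the function names, the three linkage rules shown and the sample points are assumptions chosen for the example.

```python
from itertools import combinations


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


# Each linkage rule reduces all pairwise distances between two clusters
# to one inter-cluster distance.
LINKAGES = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # group average
}


def agglomerate(points, n_clusters, distance=euclidean, linkage="single"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pair_distances = [distance(points[i], points[j])
                              for i in clusters[a] for j in clusters[b]]
            d = combine(pair_distances)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]


# Hypothetical two-dimensional descriptors for four molecules.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(points, 2, distance=manhattan, linkage="average"))
# prints [[0, 1], [2, 3]]
```

Swapping the distance function or the linkage rule changes which clusters are merged first, which is the choice the two option groups above expose.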

4.3 The Clustering Output

The outputs produced are available for online inspection, download, and your own analysis (Figure 4-4 and Figure 4-5).

• You can download the solutions and visualizations in PDF format.

• To save the clustering result, click on the Save to WorkSpace button.

Figure 4-4 Clustering result of K-Centroids Cluster Analysis algorithm.


Figure 4-5 Clustering result of Agglomerative Nesting Clustering.


5. Finding Frequent Substructure

5.1 Loading or Selecting Data

Users can directly upload a new dataset or select a dataset from the workspace (Figure 5-1).

Figure 5-1 Molecular Substructure Miner.

In Figure 5-1, you can perform the following tasks.

• Click on the New Data Set, then enter the filename or click on [button] to browse for a data file to upload

• Otherwise click on the Data Workspace. You can select one of the following workspace types:
  o Upload Workspace contains the previously uploaded datasets
  o Search Workspace keeps the current search result
  o Analysis Workspace includes the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded/saved in that workspace.

• Type a name in the Result Name Textbox for the frequent substructures result

5.2 Specifying the minimum support threshold

The support of a substructure is the frequency of small molecules in the dataset that contain it. Users must specify a minimum support threshold and click on [button] to find the frequent substructures in the dataset.
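The role of the minimum support threshold can be shown with a small Python sketch. It assumes the substructures of each molecule have already been enumerated (the hard part, which the Molecular Substructure Miner performs and which is not reproduced here) and that the threshold is given as a fraction of the dataset; the function name, the data layout and the sample molecules are hypothetical.

```python
from collections import Counter


def frequent_substructures(molecules, min_support):
    """Return {substructure: support} for substructures whose support,
    the fraction of molecules containing them, is at least min_support."""
    counts = Counter()
    for substructures in molecules.values():
        counts.update(set(substructures))  # count each molecule once
    total = len(molecules)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


# Hypothetical dataset: molecule name -> substructures it contains (as SMILES).
molecules = {
    "mol1": {"c1ccccc1", "C=O"},
    "mol2": {"c1ccccc1"},
    "mol3": {"C=O", "N"},
    "mol4": {"c1ccccc1", "C=O"},
}

result = frequent_substructures(molecules, min_support=0.5)
for smiles, support in sorted(result.items()):
    print(smiles, support)
```

Here the ring and the carbonyl each occur in three of four molecules (support 0.75) and are kept, while "N" occurs in one of four (support 0.25) and is dropped. Raising the threshold shortens the list; lowering it lengthens the list.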

5.3 The List of Frequent Substructures

The output from this analysis is a list of frequent substructures that occur in molecules above the specified minimum support in the dataset. Users can export the result into a file (XML or tab-delimited format) by clicking on the export button, or save it into the analysis workspace by clicking on the Save to WorkSpace button (Figure 5-2). You can also view previously saved results by clicking on a result name in the Previous Saved Results Box at the right corner of the screen (Figure 5-1).

Figure 5-2 List of frequent substructure.


6. Feature Selection

6.1 Loading or Selecting Data

Users can directly upload a new dataset or select a dataset from the workspace for identifying the important attributes. In Figure 6-1, you can perform the following tasks.

• Click on the New Data Set, then enter the filename or click on [button] to browse for a data file to upload

• Otherwise click on the Data Workspace. You can select one of the following workspace types:
  o Upload Workspace contains the previously uploaded datasets
  o Search Workspace keeps the current search result
  o Analysis Workspace includes the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of previously uploaded/saved datasets in that workspace.

• Type a name in the Result Name Textbox for the feature selection result

Figure 6-1 Feature Selection.

6.2 Selecting a Feature Selection Method

sMOL Explorer provides two feature selection techniques: Variable Selection from Random Forest and Regression Subset Selection. Users can select an algorithm, specify parameters and click on the Run button to start the feature selection process.


• Variable Selection from Random Forest (R: varSelRF)Random forest, a classification algorithm developed by Breiman, is an ensemble of individual tree predictor. Each of unpruned classification trees is built using a bootstrap sample of the data. Each node is split using the best split from random sampling of the variables. Thus, to classify new data, the predicted values from a number of trees are combined into a vote on class identity. During bootstrap iteration and the OOB (out-of-bag) prediction, predicting the data not in the bootstrap sample, random forests estimate the error rate and return several measures of variable importance, which can be used to perform variable or feature selection. The randomForest package and varSelRF package implemented in R are integreted in sMOL Explorer for comparing the importance of the features in classification. sMOL Explorer allows users to tune three parameters of the varSelRF as follows.

o mtryFactor: Enter the multiplication factor of sqrt(number of variables) that gives the number of variables to use for the mtry argument of randomForest

o ntree: Input the number of trees to be generated for the first forest

o ntreeIterat: Input the number of trees to use (ntree of randomForest) for all additional forests
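As a rough illustration of how mtryFactor translates into the number of variables tried at each split, the following Python sketch computes the value that would be passed as mtry. The function name mtry_from_factor is hypothetical, and the rounding down and the lower bound of 1 are assumptions made for this sketch, not details taken from this guide.

```python
import math

def mtry_from_factor(n_variables: int, mtry_factor: float = 1.0) -> int:
    """Variables sampled at each split: mtry_factor * sqrt(n_variables).

    Rounding down and the lower bound of 1 are assumptions of this sketch.
    """
    if n_variables < 1:
        raise ValueError("n_variables must be at least 1")
    return max(1, math.floor(mtry_factor * math.sqrt(n_variables)))

# Example: 100 descriptors with mtryFactor = 1 gives 10 variables per split.
print(mtry_from_factor(100, 1.0))  # 10
```

A factor below 1 makes each tree consider fewer variables per split, while a factor above 1 makes it consider more.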

• Regression Subset Selection (R: regsubsets)
sMOL Explorer includes the leaps package implemented in R for regression subset selection. It searches for the best subsets of the variables in x for predicting y in linear regression. There are two parameters:

o Search Method: Choose the search method: exhaustive search, forward selection, backward selection, or sequential replacement

o nvmax: Specify the maximum size of feature subsets to examine
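The idea behind an exhaustive search limited by nvmax can be sketched in Python as follows. This is only an illustration, not the code of the leaps package: the function name exhaustive_best_subsets, the feature names, and the toy score (standing in for the residual sum of squares of a fitted linear model) are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

def exhaustive_best_subsets(
    features: Sequence[str],
    nvmax: int,
    score: Callable[[Tuple[str, ...]], float],
) -> Dict[int, Tuple[str, ...]]:
    """Return the lowest-scoring subset for each size from 1 to nvmax."""
    best: Dict[int, Tuple[str, ...]] = {}
    for size in range(1, min(nvmax, len(features)) + 1):
        # Exhaustive search: every subset of this size is evaluated.
        best[size] = min(combinations(features, size), key=score)
    return best

# Toy score standing in for a residual sum of squares (hypothetical numbers):
# each feature removes a fixed amount of error from a baseline of 10.
gain = {"logP": 5.0, "MW": 3.0, "TPSA": 1.0}

def toy_rss(subset: Tuple[str, ...]) -> float:
    return 10.0 - sum(gain[name] for name in subset)

result = exhaustive_best_subsets(list(gain), nvmax=2, score=toy_rss)
print(result)  # {1: ('logP',), 2: ('logP', 'MW')}
```

The number of subsets grows quickly with the number of features, which is why a small nvmax or one of the stepwise search methods is usually preferred for large descriptor sets.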

6.3 The Output of Feature Selection
The output from this analysis normally contains two parts (Figure 6-2):

• The list of feature subsets that are selected to evaluate their predictive ability

• The final set of selected features from the dataset.


Figure 6-2 Feature Selection result.

Similar to other analyses, users can export the result into a file (XML or tab-delimited format) by clicking on the export button, or click on the save button to save it into the analysis workspace. You can also view the previously saved results by clicking on a result name from the Previous Saved Results Box at the right corner of the screen.


7. Classification
7.1 Loading or Selecting Data

Users can directly upload a new dataset or select a dataset from the workspace to build the classification model. In Figure 7-1, you can perform the following tasks.

• Click on the New Data Set, then enter the filename or click on the browse button to select a data file to be uploaded
• Otherwise click on the Data Workspace. You can select one of the following types of workspaces:
o Upload Workspace contains the previously uploaded datasets
o Search Workspace keeps the current search result
o Analysis Workspace includes the previously saved results from data analysis.
When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded/saved in that workspace.

• Type a name in the Result Name Textbox for the classification result

Figure 7-1 Classification.

7.2 Testing options
Users can specify a data set or use the training data for testing the classification model. There are two options in the classifier evaluation: K-fold cross-validation and leave-one-out (LOO) cross-validation.

• K-fold cross-validation: The dataset is divided into K subsets. One of the K subsets is used as the testing data, and the remaining K − 1 subsets are put together to form a training dataset. The model evaluation is then repeated K times, with each of the K subsets used exactly once as the testing data. In this option, users must enter the number of folds (K) to use in the cross-validation process.

• LOO cross-validation: The training set consists of the whole dataset except one sample, and the testing set contains only that sample. The evaluation is repeated so that each sample is used exactly once for testing.
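The two testing options can be sketched in Python as follows. This is a minimal illustration of how the sample indices are divided, not the code that sMOL Explorer runs; the function names are hypothetical, and the sketch does not shuffle or stratify the data, which real tools often do.

```python
from typing import Iterator, List, Tuple

def k_fold_splits(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

def leave_one_out_splits(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """LOO is the special case k = n_samples: each test set holds one sample."""
    return k_fold_splits(n_samples, n_samples)

folds = list(k_fold_splits(6, 3))
print(folds[0])  # ([2, 3, 4, 5], [0, 1])
```

LOO builds as many models as there are samples, so it is usually reserved for small datasets, while K-fold with a moderate K is cheaper on larger ones.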

7.3 Selecting a Classifier
sMOL Explorer provides five classification algorithms from Weka and R packages: Naïve Bayes, C4.5 Decision Tree, Random Forest, Neural Network, and Support Vector Machine. Users can select an algorithm with the default parameter settings and click on the Classify button to train and test the data. To change the parameter values for a classifier, select the Advanced Setup checkbox before clicking on the Classify button.

Below is how to set the parameter values for each algorithm.
• Naïve Bayes (weka.classifiers.bayes.NaiveBayes)

o Use kernel density: Select True if you want to use a kernel estimator for numeric attributes rather than a normal distribution and False otherwise.

o Use supervised discretization: Select True if you want to use supervised discretization to convert numeric attributes to nominal ones and False otherwise.

• C4.5 Decision Tree (weka.classifiers.trees.J48)
o Use Unpruned Tree: Select False if pruning is to be performed; otherwise select True.
o Confidence Threshold: Input smaller values if more pruning is required.
o The Minimum Number of Instances per Leaf: Specify the minimum number of instances allowed in a leaf.
o Use Reduced Error Pruning: Select True if you want to use reduced-error pruning instead of C4.5 pruning and False otherwise.
o The Number of Folds for Reduced Error: Determine the number of folds, K, used for reduced-error pruning. The dataset is divided into K subsets. One subset is used for pruning, and the remaining K − 1 subsets are used for growing the tree.
o Use Binary Splits only: Select True if you want to use binary splits when building the tree and False otherwise.
o Seed for Random Data Shuffling: Specify the seed used for randomizing the data when reduced-error pruning is used.

• Random Forest (weka.classifiers.trees.RandomForest)
o Number of Trees: Input the number of trees to be generated.
o Number of Features to consider: Specify the number of randomly chosen attributes
o Seed for Random Number Generator: Set the random number seed to be used
o The Maximum Depth of the Trees: Specify the maximum depth of the trees, 0 for unlimited.
• Neural Network (weka.classifiers.functions.MultilayerPerceptron)
o Learning Rate
o Momentum Rate
o Number of Epochs
• Support Vector Machine (R: e1071 package)

o Kernel Function: Specify the Kernel Type used in training and predicting

o Degree: a parameter needed for kernel of type polynomial (default: 3)

o Gamma: a parameter needed for all kernels except linear (default: 1/(data dimension))

o Coef0: a parameter needed for kernels of type polynomial and sigmoid (default: 0)

o Cost: cost of constraints violation (default: 1); it is the ‘C’-constant of the regularization term in the Lagrange formulation.
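To show where Degree, Gamma, and Coef0 enter the computation, the following Python sketch evaluates the four kernel types using the formulas given in the e1071 documentation for svm. It is an illustration only: the function names dot and kernel are hypothetical, and the sketch does not train a model.

```python
import math
from typing import Optional, Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two descriptor vectors."""
    return sum(a * b for a, b in zip(u, v))

def kernel(u: Sequence[float], v: Sequence[float], kind: str = "radial",
           degree: int = 3, gamma: Optional[float] = None,
           coef0: float = 0.0) -> float:
    """Evaluate one kernel value between two descriptor vectors."""
    if gamma is None:
        gamma = 1.0 / len(u)  # default: 1/(data dimension)
    if kind == "linear":
        return dot(u, v)
    if kind == "polynomial":
        return (gamma * dot(u, v) + coef0) ** degree
    if kind == "radial":
        sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-gamma * sq_dist)
    if kind == "sigmoid":
        return math.tanh(gamma * dot(u, v) + coef0)
    raise ValueError(f"unknown kernel: {kind}")

u, v = [1.0, 2.0], [3.0, 4.0]
print(kernel(u, v, "linear"))      # 11.0
print(kernel(u, v, "polynomial"))  # (0.5 * 11 + 0) ** 3 = 166.375
```

Degree affects only the polynomial kernel, Coef0 affects the polynomial and sigmoid kernels, and Gamma is ignored by the linear kernel, which matches the parameter notes above.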

7.4 The Classification output
The classification output consists of three parts:

• Summary: This part provides the summary of classification performance of the model

• Detailed Accuracy by Class: This part indicates how accurate the classification/prediction model can be made for each data class.

• Prediction Result: This part lists the prediction for each individual sample from the validating dataset.
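The relationship between the overall summary and the per-class figures can be sketched in Python as follows. This section does not list the exact measures reported, so the use of accuracy, precision, and recall here is an assumption, and the function name per_class_metrics and the class labels are hypothetical.

```python
from typing import Dict, Sequence

def per_class_metrics(actual: Sequence[str],
                      predicted: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall for every class label seen in the data."""
    metrics: Dict[str, Dict[str, float]] = {}
    pairs = list(zip(actual, predicted))
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(a == label and p == label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics

actual = ["active", "active", "inactive", "inactive"]
predicted = ["active", "inactive", "inactive", "inactive"]

# Summary: fraction of samples whose predicted class matches the actual class.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.75
print(per_class_metrics(actual, predicted)["active"])  # {'precision': 1.0, 'recall': 0.5}
```

A model can have a high overall accuracy and still perform poorly on a small class, which is why the per-class part is worth checking alongside the summary.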


Figure 7-2 Classification result.

You can also export the classification result into a file (XML or tab-delimited format) by clicking on the export button, or click on the save button to save it into the analysis workspace. To view the previously saved results, click on a result name from the Previous Saved Results Box at the right corner of the screen.


8. Utilities
8.1 Data Preparation

To upload data for analysis, you need to prepare a data file in the sMOL-defined tab-delimited format. This page assumes that your original file is in the sMOL-defined XML format. Enter the filename or browse for the original SML file, and click on the button to get the sMOL-defined tab-delimited file (Figure 8-1).

Figure 8-1 Data preparation.

8.2 File Conversion
In the conversion page, you can convert a chemical data file into another format with the following steps (Figure 8-2):
• Input or browse for the molecule files you want to convert
• Select the format of the input file
• Select the output format
• Specify additional options such as addition and deletion of hydrogens
• Click the button to start the conversion


Figure 8-2 File conversion.

8.3 Computing Molecular Descriptors
Upload a Mol file, select a chemical structure level (Bond, Atomic, or Molecule), and click on the Calculate button. sMOL Explorer will generate the molecular descriptors corresponding to the selected chemical structure level (Figure 8-3).

Figure 8-3 Calculate Molecular Descriptors.


8.4 Administrator Tasks
Only users with administration permissions can perform the following tasks:

• User Management
This part describes the basic functions for user data control (Figure 8-4):
o At the bottom of the User Management page,
- Add New Users: To add a new user to sMOL Explorer, click on Add New User to open the Add New User page.
- Click on Check ALL or Uncheck All to select or de-select all the users.
- Click on Register or Delete after With Selected to register or remove all the selected/checked users.

o Edit User: Click on Edit at the row corresponding to the user whose information you want to modify.

o Delete User: Click on Delete at the row corresponding to the user that you want to remove from the system.

Figure 8-4 User management.

• Setup Configuration: This section allows you to edit the following system configurations

o The Number of Data/Thread: sMOL Explorer speeds up the search process using multi-threading, so the data is divided for each thread. This parameter defines the maximum number of data samples per thread.

o Database Configurations: You can change the database parameters including server name, database name, database user account and password.

o The R Home Directory: Specify the directory path to R.
o The sMOL Home Directory: Specify the directory path to sMOL Explorer.
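The effect of the Number of Data/Thread setting can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the implementation used by sMOL Explorer: the function names, the one-thread-per-chunk design, and the toy molecule names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

def split_into_chunks(items: Sequence, max_per_thread: int) -> List[Sequence]:
    """Cut the data into consecutive chunks of at most max_per_thread items."""
    if max_per_thread < 1:
        raise ValueError("max_per_thread must be at least 1")
    return [items[i:i + max_per_thread]
            for i in range(0, len(items), max_per_thread)]

def threaded_search(items: Sequence, max_per_thread: int,
                    matches: Callable[[object], bool]) -> list:
    """Search each chunk in its own worker thread and merge the hits in order."""
    chunks = split_into_chunks(items, max_per_thread)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial = pool.map(lambda chunk: [x for x in chunk if matches(x)], chunks)
        return [hit for hits in partial for hit in hits]

molecules = ["benzene", "ethanol", "phenol", "methanol", "toluene"]
print(split_into_chunks(molecules, 2))
# [['benzene', 'ethanol'], ['phenol', 'methanol'], ['toluene']]
print(threaded_search(molecules, 2, lambda name: name.endswith("ol")))
# ['ethanol', 'phenol', 'methanol']
```

A smaller value creates more threads with less work each, while a larger value creates fewer threads with more work each, so the best setting depends on the size of the database and the server hardware.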


Figure 8-5 Edit configuration file.

• Setup Public Database Links: Input or change the URL or WWW address of public databases including PubChem, KEGG, DrugBank, and eMolecules. (Figure 8-6)

Figure 8-6 Edit public database link.
