simple question - ChemAxon Forum Archive

User e9249ba1fe

21-10-2007 06:37:43

this may be a bit of a foolish question, but i can't help asking

1. i have academic license for jchem and generatemd

2. i have a windows xp desktop pc

3. i have a file containing ~50,000 molecules represented as SMILES

4. i wish to compute as many possible descriptors as i can for qspr/qsar

please help

thanks a lot

User 677b9c22ff

23-10-2007 08:14:08

Hi,

Do you know how to program in Java and use Eclipse? Then you can take the book Molecular Descriptors by Todeschini and program all the descriptors with the JChem API.

If you don't know Java, it's going to be harder. You can in principle use cxcalc or generateMD to generate all the descriptors using an XML command sheet. Or you can use Instant-JChem to generate most of the descriptors, but due to current restrictions Instant-JChem can only have 256 or 512 columns, so you cannot include all the possible descriptors. Furthermore, it is not possible to apply a general XML sheet (write once, use often) to generate most of the descriptors.

You also have to be clear about what kind of descriptors you want to use: 0D, 1D, 2D, 3D or 4D molecular descriptors? All of them? (See Engel/Gasteiger, Chemoinformatics.)

* 0D - bond counts, mol weight, atom counts

* 1D - fragment counts, H-Bond acc/don, Crippen, PSA, SMARTS

* 2D - topological descriptors (Balaban, Randic, Wiener, BCUT, kappa, chi)

* 3D - geometrical descriptors (3D WHIM, 3D autocorrelation, 3D-Morse) + surface properties + COMFA

* 4D - 3D coordinates + conformations (JCHEM conformer, CORINA, gold set, Crystaleye)

The good thing about the JChem API is that in principle you can implement most of this very easily. The functions are attached at the bottom. The 1D fragment counts can be implemented using a SMARTS matcher function. Among such fragment fingerprints are the PubChem fingerprints or the public OpenBabel SMARTS implementation. You can also use maximum common substructures (MCS, LibMCS) to create such patterns specifically for your dataset or any other dataset (like PubChem).
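As a rough sketch of how such a pattern list could be prepared for a matcher: the snippet below only parses "Name: SMARTS" lines (the layout of the OpenBabel pattern list further down) into an ordered map and skips "#" comment lines. The class name SmartsPatternFile is made up for illustration. The actual counting would be done by whatever SMARTS matcher you use and is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses "Name: SMARTS" lines into an ordered name -> pattern map. */
class SmartsPatternFile {

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> patterns = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // skip blank lines and '#' comment lines
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // split at the first ": " only, because SMARTS itself may contain ':'
            int sep = line.indexOf(": ");
            if (sep < 0) {
                continue; // not a "Name: SMARTS" line
            }
            String name = line.substring(0, sep);
            String smarts = line.substring(sep + 2).trim();
            // a later line with the same name overwrites the earlier one
            patterns.put(name, smarts);
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<String> demo = Arrays.asList(
            "# SMARTS Patterns for Functional Group Classification",
            "Alkyne: [CX2]#[CX2]",
            "Ketene: [CX3]=[CX2]=[OX1]");
        parse(demo).forEach((name, smarts) -> System.out.println(name + "\t" + smarts));
    }
}
```

Each map entry then becomes one descriptor column: the number of hits of that pattern in a molecule.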

You can easily calculate 2000 descriptors with different software applications; see moleculardescriptors.eu. For a small test set of 150 molecules you can use VCCLAB from Igor Tetko to test the effectiveness of some of the descriptors you want to implement with the JChem API. Or you can use JOELib or, better, the CDK Descriptor Calculator GUI from Rajarshi Guha.

Beware! Most of the descriptors you can calculate will have no impact on the model. You need to use feature selection to find the descriptors that are useful for regression or classification.
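A very first filtering step, before any real feature selection, is to throw out descriptor columns that are constant over the whole dataset, since they cannot carry information. A minimal sketch, assuming rows are molecules and columns are descriptors (the class and method names are invented here, and a variance filter is only the crudest form of feature selection):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds descriptor columns whose variance exceeds a threshold. */
class VarianceFilter {

    /** Returns the indices of columns with population variance > minVariance. */
    static List<Integer> informativeColumns(double[][] data, double minVariance) {
        List<Integer> keep = new ArrayList<>();
        if (data.length == 0) {
            return keep;
        }
        int nCols = data[0].length;
        for (int c = 0; c < nCols; c++) {
            // mean of column c
            double mean = 0.0;
            for (double[] row : data) {
                mean += row[c];
            }
            mean /= data.length;
            // population variance of column c
            double var = 0.0;
            for (double[] row : data) {
                double d = row[c] - mean;
                var += d * d;
            }
            var /= data.length;
            if (var > minVariance) {
                keep.add(c);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // toy data: the first column is constant and gets dropped
        double[][] descriptors = {
            {6, 1, 2.00},
            {6, 2, 1.64},
            {6, 2, 1.74},
        };
        System.out.println(informativeColumns(descriptors, 0.0));
    }
}
```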

It also helps to guard against overfitting by dividing your dataset into a 70% development set and a 30% test set and keeping an independent external validation set at hand. You can additionally use v-fold cross-validation or bootstrapping on your development set.
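The 70/30 split can be sketched in a few lines of plain Java. This is only an illustration: the class name DatasetSplit and the fixed seed are my own choices, and it is a plain random split with no stratification by activity class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: shuffles molecule indices and splits them into development and test sets. */
class DatasetSplit {

    /** Returns two index lists: element 0 is the development set, element 1 the test set. */
    static List<List<Integer>> split(int nMolecules, double devFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < nMolecules; i++) {
            indices.add(i);
        }
        // fixed seed so that the same split can be reproduced later
        Collections.shuffle(indices, new Random(seed));
        int nDev = (int) Math.round(nMolecules * devFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, nDev)));
        parts.add(new ArrayList<>(indices.subList(nDev, nMolecules)));
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(50000, 0.7, 42L);
        System.out.println("development: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

The external validation set should come from a different source and never pass through this split.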

All those methods have been known since the 1970s. Do not fall for the R^2=0.999999999 linear fit scam. Use prediction errors or R^2, Q^2 on independent datasets, or other measures (do not fool yourself).
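To make this concrete, here is a small sketch of two such measures computed on molecules the model has not seen. The definition used for Q^2 is 1 - PRESS/SS with SS taken around the mean of the observed values. That is one common convention and an assumption on my side (some authors use the training set mean instead). The class name PredictionStats is invented.

```java
/** Hypothetical helper: error measures computed on held-out molecules. */
class PredictionStats {

    /** Root mean squared error of the predictions. */
    static double rmse(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = observed[i] - predicted[i];
            sum += e * e;
        }
        return Math.sqrt(sum / observed.length);
    }

    /** 1 - PRESS / SS, with SS taken around the mean of the observed values. */
    static double q2(double[] observed, double[] predicted) {
        double mean = 0.0;
        for (double y : observed) {
            mean += y;
        }
        mean /= observed.length;
        double press = 0.0; // sum of squared prediction errors
        double ss = 0.0;    // sum of squared deviations from the mean
        for (int i = 0; i < observed.length; i++) {
            press += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
            ss += (observed[i] - mean) * (observed[i] - mean);
        }
        return 1.0 - press / ss;
    }

    public static void main(String[] args) {
        double[] observed = {1.0, 2.0, 3.0, 4.0};
        double[] predicted = {1.5, 1.5, 3.5, 3.5};
        System.out.println("RMSE = " + rmse(observed, predicted));
        System.out.println("Q^2  = " + q2(observed, predicted));
    }
}
```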

For the classification or regression statistics it absolutely does not matter which method you use. The best approach is to test all methods, or to build ensemble methods or group contribution methods, which may include:

Generalized Linear Models (GLM)

General Discriminant Analysis

Binary logit (logistic) regression

Binary probit regression

Nonlinear models

Multivariate adaptive regression splines (MARS)

Tree models

Standard Classification Trees (CART)

Standard General Chi-square Automatic Interaction Detector (CHAID)

Exhaustive CHAID

Boosting classification trees

Neural Networks

Multilayer Perceptron neural network (MLP)

Radial Basis Function neural network (RBF)

Machine Learning

Support Vector Machines (SVM)

Naive Bayes classifier

k-Nearest Neighbors (KNN)

You can implement such methods with MEV, Statistica Dataminer, Yale or WEKA.

Tobias

JCHEM descriptors supported in the API:

Code:

Fragment counts using OpenBabel counts and the

JCHEM SMARTS matching function:

Code:

# SMARTS Patterns for Functional Group Classification

#

# written by Christian Laggner

# Copyright 2005 Inte:Ligand Software-Entwicklungs und Consulting GmbH

#

# Released under the Lesser General Public License (LGPL license)

# see http://www.gnu.org/copyleft/lesser.html

# Modified from Version 221105

# Project homepage: http://sourceforge.net/projects/openbabel

Primary_carbon: [CX4H3][#6]

Secondary_carbon: [CX4H2]([#6])[#6]

Tertiary_carbon: [CX4H1]([#6])([#6])[#6]

Quaternary_carbon: [CX4]([#6])([#6])([#6])[#6]

Alkene: [CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]=[CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]

Alkyne: [CX2]#[CX2]

Allene: [CX3]=[CX2]=[CX3]

Alkylchloride: [ClX1][CX4]

Alkylfluoride: [FX1][CX4]

Alkylbromide: [BrX1][CX4]

Alkyliodide: [IX1][CX4]

Alcohol: [OX2H][CX4;!$(C([OX2H])[O,S,#7,#15])]

Primary_alcohol: [OX2H][CX4H2;!$(C([OX2H])[O,S,#7,#15])]

Secondary_alcohol: [OX2H][CX4H;!$(C([OX2H])[O,S,#7,#15])]

Tertiary_alcohol: [OX2H][CX4D4;!$(C([OX2H])[O,S,#7,#15])]

Dialkylether: [OX2]([CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])])[CX4;!$(C([OX2])[O,S,#7,#15])]

Dialkylthioether: [SX2]([CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])])[CX4;!$(C([OX2])[O,S,#7,#15])]

Alkylarylether: [OX2](c)[CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])]

Diarylether: [c][OX2][c]

Alkylarylthioether: [SX2](c)[CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])]

Diarylthioether: [c][SX2][c]

Oxonium: [O+;!$([O]~[!#6]);!$([S]*~[#7,#8,#15,#16])]

Amine: [NX3+0,NX4+;!$([N]~[!#6]);!$([N]*~[#7,#8,#15,#16])]

Primary_aliph_amine: [NX3H2+0,NX4H3+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]

Secondary_aliph_amine: [NX3H1+0,NX4H2+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]

Tertiary_aliph_amine: [NX3H0+0,NX4H1+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]

Quaternary_aliph_ammonium: [NX4H0+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]

Primary_arom_amine: [NX3H2+0,NX4H3+]c

Secondary_arom_amine: [NX3H1+0,NX4H2+;!$([N][!c]);!$([N]*~[#7,#8,#15,#16])]

Tertiary_arom_amine: [NX3H0+0,NX4H1+;!$([N][!c]);!$([N]*~[#7,#8,#15,#16])]

Quaternary_arom_ammonium: [NX4H0+;!$([N][!c]);!$([N]*~[#7,#8,#15,#16])]

Secondary_mixed_amine: [NX3H1+0,NX4H2+;$([N]([c])[C]);!$([N]*~[#7,#8,#15,#16])]

Tertiary_mixed_amine: [NX3H0+0,NX4H1+;$([N]([c])([C])[#6]);!$([N]*~[#7,#8,#15,#16])]

Quaternary_mixed_ammonium: [NX4H0+;$([N]([c])([C])[#6][#6]);!$([N]*~[#7,#8,#15,#16])]

Ammonium: [N+;!$([N]~[!#6]);!$(N=*);!$([N]*~[#7,#8,#15,#16])]

Alkylthiol: [SX2H][CX4;!$(C([SX2H])~[O,S,#7,#15])]

Dialkylthioether: [SX2]([CX4;!$(C([SX2])[O,S,#7,#15,F,Cl,Br,I])])[CX4;!$(C([SX2])[O,S,#7,#15])]

Alkylarylthioether: [SX2](c)[CX4;!$(C([SX2])[O,S,#7,#15])]

Disulfide: [SX2D2][SX2D2]

1,2-Aminoalcohol: [OX2H][CX4;!$(C([OX2H])[O,S,#7,#15,F,Cl,Br,I])][CX4;!$(C([N])[O,S,#7,#15])][NX3;!$(NC=[O,S,N])]

1,2-Diol: [OX2H][CX4;!$(C([OX2H])[O,S,#7,#15])][CX4;!$(C([OX2H])[O,S,#7,#15])][OX2H]

1,1-Diol: [OX2H][CX4;!$(C([OX2H])([OX2H])[O,S,#7,#15])][OX2H]

Hydroperoxide: [OX2H][OX2]

Peroxo: [OX2D2][OX2D2]

Organolithium_compounds: [LiX1][#6,#14]

Organomagnesium_compounds: [MgX2][#6,#14]

Organometallic_compounds: [!#1;!#5;!#6;!#7;!#8;!#9;!#14;!#15;!#16;!#17;!#33;!#34;!#35;!#52;!#53;!#85]~[#6;!-]

Aldehyde: [$([CX3H][#6]),$([CX3H2])]=[OX1]

Ketone: [#6][CX3](=[OX1])[#6]

Thioaldehyde: [$([CX3H][#6]),$([CX3H2])]=[SX1]

Thioketone: [#6][CX3](=[SX1])[#6]

Imine: [NX2;$([N][#6]),$([NH]);!$([N][CX3]=[#7,#8,#15,#16])]=[CX3;$([CH2]),$([CH][#6]),$([C]([#6])[#6])]

Immonium: [NX3+;!$([N][!#6]);!$([N][CX3]=[#7,#8,#15,#16])]

Oxime: [NX2](=[CX3;$([CH2]),$([CH][#6]),$([C]([#6])[#6])])[OX2H]

Oximether: [NX2](=[CX3;$([CH2]),$([CH][#6]),$([C]([#6])[#6])])[OX2][#6;!$(C=[#7,#8])]

Acetal: [OX2]([#6;!$(C=[O,S,N])])[CX4;!$(C(O)(O)[!#6])][OX2][#6;!$(C=[O,S,N])]

Hemiacetal: [OX2H][CX4;!$(C(O)(O)[!#6])][OX2][#6;!$(C=[O,S,N])]

Aminal: [NX3v3;!$(NC=[#7,#8,#15,#16])]([#6])[CX4;!$(C(N)(N)[!#6])][NX3v3;!$(NC=[#7,#8,#15,#16])][#6]

Hemiaminal: [NX3v3;!$(NC=[#7,#8,#15,#16])]([#6])[CX4;!$(C(N)(N)[!#6])][OX2H]

Thioacetal: [SX2]([#6;!$(C=[O,S,N])])[CX4;!$(C(S)(S)[!#6])][SX2][#6;!$(C=[O,S,N])]

Thiohemiacetal: [SX2]([#6;!$(C=[O,S,N])])[CX4;!$(C(S)(S)[!#6])][OX2H]

Halogen_acetal_like: [NX3v3,SX2,OX2;!$(*C=[#7,#8,#15,#16])][CX4;!$(C([N,S,O])([N,S,O])[!#6])][FX1,ClX1,BrX1,IX1]

Acetal_like: [NX3v3,SX2,OX2;!$(*C=[#7,#8,#15,#16])][CX4;!$(C([N,S,O])([N,S,O])[!#6])][FX1,ClX1,BrX1,IX1,NX3v3,SX2,OX2;!$(*C=[#7,#8,#15,#16])]

Halogenmethylen_ester_and_similar: [NX3v3,SX2,OX2;$(**=[#7,#8,#15,#16])][CX4;!$(C([N,S,O])([N,S,O])[!#6])][FX1,ClX1,BrX1,IX1]

NOS_methylen_ester_and_similar: [NX3v3,SX2,OX2;$(**=[#7,#8,#15,#16])][CX4;!$(C([N,S,O])([N,S,O])[!#6])][NX3v3,SX2,OX2;!$(*C=[#7,#8,#15,#16])]

Hetero_methylen_ester_and_similar: [NX3v3,SX2,OX2;$(**=[#7,#8,#15,#16])][CX4;!$(C([N,S,O])([N,S,O])[!#6])][FX1,ClX1,BrX1,IX1,NX3v3,SX2,OX2;!$(*C=[#7,#8,#15,#16])]

Cyanhydrine: [NX1]#[CX2][CX4;$([CH2]),$([CH]([CX2])[#6]),$(C([CX2])([#6])[#6])][OX2H]

Chloroalkene: [ClX1][CX3]=[CX3]

Fluoroalkene: [FX1][CX3]=[CX3]

Bromoalkene: [BrX1][CX3]=[CX3]

Iodoalkene: [IX1][CX3]=[CX3]

Enol: [OX2H][CX3;$([H1]),$(C[#6])]=[CX3]

Endiol: [OX2H][CX3;$([H1]),$(C[#6])]=[CX3;$([H1]),$(C[#6])][OX2H]

Enolether: [OX2]([#6;!$(C=[N,O,S])])[CX3;$([H0][#6]),$([H1])]=[CX3]

Enolester: [OX2]([CX3]=[OX1])[#6X3;$([#6][#6]),$([H1])]=[#6X3;!$(C[OX2H])]

Enamine: [NX3;$([NH2][CX3]),$([NH1]([CX3])[#6]),$([N]([CX3])([#6])[#6]);!$([N]*=[#7,#8,#15,#16])][CX3;$([CH]),$([C][#6])]=[CX3]

Thioenol: [SX2H][CX3;$([H1]),$(C[#6])]=[CX3]

Thioenolether: [SX2]([#6;!$(C=[N,O,S])])[CX3;$(C[#6]),$([CH])]=[CX3]

Acylchloride: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[ClX1]

Acylfluoride: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[FX1]

Acylbromide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[BrX1]

Acyliodide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[IX1]

Acylhalide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[FX1,ClX1,BrX1,IX1]

Carboxylic_acid: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[$([OX2H]),$([OX1-])]

Carboxylic_ester: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[OX2][#6;!$(C=[O,N,S])]

Lactone: [#6][#6X3R](=[OX1])[#8X2][#6;!$(C=[O,N,S])]

Carboxylic_anhydride: [CX3;$([H0][#6]),$([H1])](=[OX1])[#8X2][CX3;$([H0][#6]),$([H1])](=[OX1])

Carboxylic_acid_derivative: [$([#6X3H0][#6]),$([#6X3H])](=[!#6])[!#6]

Carbothioic_acid: [CX3;!R;$([C][#6]),$([CH]);$([C](=[OX1])[$([SX2H]),$([SX1-])]),$([C](=[SX1])[$([OX2H]),$([OX1-])])]

Carbothioic_S_ester: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[SX2][#6;!$(C=[O,N,S])]

Carbothioic_S_lactone: [#6][#6X3R](=[OX1])[#16X2][#6;!$(C=[O,N,S])]

Carbothioic_O_ester: [CX3;$([H0][#6]),$([H1])](=[SX1])[OX2][#6;!$(C=[O,N,S])]

Carbothioic_O_lactone: [#6][#6X3R](=[SX1])[#8X2][#6;!$(C=[O,N,S])]

Carbothioic_halide: [CX3;$([H0][#6]),$([H1])](=[SX1])[FX1,ClX1,BrX1,IX1]

Carbodithioic_acid: [CX3;!R;$([C][#6]),$([CH]);$([C](=[SX1])[SX2H])]

Carbodithioic_ester: [CX3;!R;$([C][#6]),$([CH]);$([C](=[SX1])[SX2][#6;!$(C=[O,N,S])])]

Carbodithiolactone: [#6][#6X3R](=[SX1])[#16X2][#6;!$(C=[O,N,S])]

Amide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[#7X3;$([H2]),$([H1][#6;!$(C=[O,N,S])]),$([#7]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])])]

Primary_amide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[NX3H2]

Secondary_amide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[#7X3H1][#6;!$(C=[O,N,S])]

Tertiary_amide: [CX3;$([R0][#6]),$([H1R0])](=[OX1])[#7X3H0]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])]

Lactam: [#6R][#6X3R](=[OX1])[#7X3;$([H1][#6;!$(C=[O,N,S])]),$([H0]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])])]

Alkyl_imide: [#6X3;$([H0][#6]),$([H1])](=[OX1])[#7X3H0]([#6])[#6X3;$([H0][#6]),$([H1])](=[OX1])

N_hetero_imide: [#6X3;$([H0][#6]),$([H1])](=[OX1])[#7X3H0]([!#6])[#6X3;$([H0][#6]),$([H1])](=[OX1])

Imide_acidic: [#6X3;$([H0][#6]),$([H1])](=[OX1])[#7X3H1][#6X3;$([H0][#6]),$([H1])](=[OX1])

Thioamide: [$([CX3;!R][#6]),$([CX3H;!R])](=[SX1])[#7X3;$([H2]),$([H1][#6;!$(C=[O,N,S])]),$([#7]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])])]

Thiolactam: [#6R][#6X3R](=[SX1])[#7X3;$([H1][#6;!$(C=[O,N,S])]),$([H0]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])])]

Oximester: [#6X3;$([H0][#6]),$([H1])](=[OX1])[#8X2][#7X2]=,:[#6X3;$([H0]([#6])[#6]),$([H1][#6]),$([H2])]

Amidine: [NX3;!$(NC=[O,S])][CX3;$([CH]),$([C][#6])]=[NX2;!$(NC=[O,S])]

Hydroxamic_acid: [CX3;$([H0][#6]),$([H1])](=[OX1])[#7X3;$([H1]),$([H0][#6;!$(C=[O,N,S])])][$([OX2H]),$([OX1-])]

Hydroxamic_acid_ester: [CX3;$([H0][#6]),$([H1])](=[OX1])[#7X3;$([H1]),$([H0][#6;!$(C=[O,N,S])])][OX2][#6;!$(C=[O,N,S])]

Imidoacid: [CX3R0;$([H0][#6]),$([H1])](=[NX2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[$([OX2H]),$([OX1-])]

Imidoacid_cyclic: [#6R][#6X3R](=,:[#7X2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[$([OX2H]),$([OX1-])]

Imidoester: [CX3R0;$([H0][#6]),$([H1])](=[NX2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[OX2][#6;!$(C=[O,N,S])]

Imidolactone: [#6R][#6X3R](=,:[#7X2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[OX2][#6;!$(C=[O,N,S])]

Imidothioacid: [CX3R0;$([H0][#6]),$([H1])](=[NX2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[$([SX2H]),$([SX1-])]

Imidothioacid_cyclic: [#6R][#6X3R](=,:[#7X2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[$([SX2H]),$([SX1-])]

Imidothioester: [CX3R0;$([H0][#6]),$([H1])](=[NX2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[SX2][#6;!$(C=[O,N,S])]

Imidothiolactone: [#6R][#6X3R](=,:[#7X2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[SX2][#6;!$(C=[O,N,S])]

Amidine: [#7X3v3;!$(N([#6X3]=[#7X2])C=[O,S])][CX3R0;$([H1]),$([H0][#6])]=[NX2v3;!$(N(=[#6X3][#7X3])C=[O,S])]

Imidolactam: [#6][#6X3R;$([H0](=[NX2;!$(N(=[#6X3][#7X3])C=[O,S])])[#7X3;!$(N([#6X3]=[#7X2])C=[O,S])]),$([H0](-[NX3;!$(N([#6X3]=[#7X2])C=[O,S])])=,:[#7X2;!$(N(=[#6X3][#7X3])C=[O,S])])]

Imidoylhalide: [CX3R0;$([H0][#6]),$([H1])](=[NX2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[FX1,ClX1,BrX1,IX1]

Imidoylhalide_cyclic: [#6R][#6X3R](=,:[#7X2;$([H1]),$([H0][#6;!$(C=[O,N,S])])])[FX1,ClX1,BrX1,IX1]

Amidrazone: [$([$([#6X3][#6]),$([#6X3H])](=[#7X2v3])[#7X3v3][#7X3v3]),$([$([#6X3][#6]),$([#6X3H])]([#7X3v3])=[#7X2v3][#7X3v3])]

Alpha_aminoacid: [NX3,NX4+;!$([N]~[!#6]);!$([N]*~[#7,#8,#15,#16])][C][CX3](=[OX1])[OX2H,OX1-]

Alpha_hydroxyacid: [OX2H][C][CX3](=[OX1])[OX2H,OX1-]

Peptide_middle: [NX3;$([N][CX3](=[OX1])[C][NX3,NX4+])][C][CX3](=[OX1])[NX3;$([N][C][CX3](=[OX1])[NX3,OX2,OX1-])]

Peptide_C_term: [NX3;$([N][CX3](=[OX1])[C][NX3,NX4+])][C][CX3](=[OX1])[OX2H,OX1-]

Peptide_N_term: [NX3,NX4+;!$([N]~[!#6]);!$([N]*~[#7,#8,#15,#16])][C][CX3](=[OX1])[NX3;$([N][C][CX3](=[OX1])[NX3,OX2,OX1-])]

Carboxylic_orthoester: [#6][OX2][CX4;$(C[#6]),$([CH])]([OX2][#6])[OX2][#6]

Ketene: [CX3]=[CX2]=[OX1]

Ketenacetal: [#7X2,#8X3,#16X2;$(*[#6,#14])][#6X3]([#7X2,#8X3,#16X2;$(*[#6,#14])])=[#6X3]

Nitrile: [NX1]#[CX2]

Isonitrile: [CX1-]#[NX2+]

Vinylogous_carbonyl_or_carboxyl_derivative: [#6X3](=[OX1])[#6X3]=,:[#6X3][#7,#8,#16,F,Cl,Br,I]

Vinylogous_acid: [#6X3](=[OX1])[#6X3]=,:[#6X3][$([OX2H]),$([OX1-])]

Vinylogous_ester: [#6X3](=[OX1])[#6X3]=,:[#6X3][#6;!$(C=[O,N,S])]

Vinylogous_amide: [#6X3](=[OX1])[#6X3]=,:[#6X3][#7X3;$([H2]),$([H1][#6;!$(C=[O,N,S])]),$([#7]([#6;!$(C=[O,N,S])])[#6;!$(C=[O,N,S])])]

Vinylogous_halide: [#6X3](=[OX1])[#6X3]=,:[#6X3][FX1,ClX1,BrX1,IX1]

Carbonic_acid_dieester: [#6;!$(C=[O,N,S])][#8X2][#6X3](=[OX1])[#8X2][#6;!$(C=[O,N,S])]

Carbonic_acid_esterhalide: [#6;!$(C=[O,N,S])][OX2;!R][CX3](=[OX1])[OX2][FX1,ClX1,BrX1,IX1]

Carbonic_acid_monoester: [#6;!$(C=[O,N,S])][OX2;!R][CX3](=[OX1])[$([OX2H]),$([OX1-])]

Carbonic_acid_derivatives: [!#6][#6X3](=[!#6])[!#6]

Thiocarbonic_acid_dieester: [#6;!$(C=[O,N,S])][#8X2][#6X3](=[SX1])[#8X2][#6;!$(C=[O,N,S])]

Thiocarbonic_acid_esterhalide: [#6;!$(C=[O,N,S])][OX2;!R][CX3](=[SX1])[OX2][FX1,ClX1,BrX1,IX1]

Thiocarbonic_acid_monoester: [#6;!$(C=[O,N,S])][OX2;!R][CX3](=[SX1])[$([OX2H]),$([OX1-])]

Thiourea: [#7X3;!$([#7][!#6])][#6X3](=[SX1])[#7X3;!$([#7][!#6])]

Isourea: [#7X2;!$([#7][!#6])]=,:[#6X3]([#8X2&!$([#8][!#6]),OX1-])[#7X3;!$([#7][!#6])]

Isothiourea: [#7X2;!$([#7][!#6])]=,:[#6X3]([#16X2&!$([#16][!#6]),SX1-])[#7X3;!$([#7][!#6])]

Guanidine: [N;v3X3,v4X4+][CX3](=[N;v3X2,v4X3+])[N;v3X3,v4X4+]

Carbaminic_acid: [NX3]C(=[OX1])[O;X2H,X1-]

Urethan: [#7X3][#6](=[OX1])[#8X2][#6]

Biuret: [#7X3][#6](=[OX1])[#7X3][#6](=[OX1])[#7X3]

Semicarbazide: [#7X3][#7X3][#6X3]([#7X3;!$([#7][#7])])=[OX1]

Carbazide: [#7X3][#7X3][#6X3]([#7X3][#7X3])=[OX1]

Semicarbazone: [#7X2](=[#6])[#7X3][#6X3]([#7X3;!$([#7][#7])])=[OX1]

Carbazone: [#7X2](=[#6])[#7X3][#6X3]([#7X3][#7X3])=[OX1]

Thiosemicarbazide: [#7X3][#7X3][#6X3]([#7X3;!$([#7][#7])])=[SX1]

Thiocarbazide: [#7X3][#7X3][#6X3]([#7X3][#7X3])=[SX1]

Thiosemicarbazone: [#7X2](=[#6])[#7X3][#6X3]([#7X3;!$([#7][#7])])=[SX1]

Thiocarbazone: [#7X2](=[#6])[#7X3][#6X3]([#7X3][#7X3])=[SX1]

Isocyanate: [NX2]=[CX2]=[OX1]

Cyanate: [OX2][CX2]#[NX1]

Isothiocyanate: [NX2]=[CX2]=[SX1]

Thiocyanate: [SX2][CX2]#[NX1]

Carbodiimide: [NX2]=[CX2]=[NX2]

Orthocarbonic_derivatives: [CX4H0]([O,S,#7])([O,S,#7])([O,S,#7])[O,S,#7,F,Cl,Br,I]

Phenol: [OX2H][c]

1,2-Diphenol: [OX2H][c][c][OX2H]

Arylchloride: [Cl][c]

Arylfluoride: [F][c]

Arylbromide: [Br][c]

Aryliodide: [I][c]

Arylthiol: [SX2H][c]

Iminoarene: [c]=[NX2;$([H1]),$([H0][#6;!$([C]=[N,S,O])])]

Oxoarene: [c]=[OX1]

Thioarene: [c]=[SX1]

Hetero_N_basic_H: [nX3H1+0]

Hetero_N_basic_no_H: [nX3H0+0]

Hetero_N_nonbasic: [nX2,nX3+]

Hetero_O: [o]

Hetero_S: [sX2]

Heteroaromatic: [a;!c]

Nitrite: [NX2](=[OX1])[O;$([X2]),$([X1-])]

Thionitrite: [SX2][NX2]=[OX1]

Nitrate: [$([NX3](=[OX1])(=[OX1])[O;$([X2]),$([X1-])]),$([NX3+]([OX1-])(=[OX1])[O;$([X2]),$([X1-])])]

Nitro: [$([NX3](=O)=O),$([NX3+](=O)[O-])][!#8]

Nitroso: [NX2](=[OX1])[!#7;!#8]

Azide: [NX1]~[NX2]~[NX2,NX1]

Acylazide: [CX3](=[OX1])[NX2]~[NX2]~[NX1]

Diazo: [$([#6]=[NX2+]=[NX1-]),$([#6-]-[NX2+]#[NX1])]

Diazonium: [#6][NX2+]#[NX1]

Nitrosamine: [#7;!$(N*=O)][NX2]=[OX1]

Nitrosamide: [NX2](=[OX1])N-*=O

N-Oxide: [$([#7+][OX1-]),$([#7v5]=[OX1]);!$([#7](~[O])~[O]);!$([#7]=[#7])]

Hydrazine: [NX3;$([H2]),$([H1][#6]),$([H0]([#6])[#6]);!$(NC=[O,N,S])][NX3;$([H2]),$([H1][#6]),$([H0]([#6])[#6]);!$(NC=[O,N,S])]

Hydrazone: [NX3;$([H2]),$([H1][#6]),$([H0]([#6])[#6]);!$(NC=[O,N,S])][NX2]=[#6]

Hydroxylamine: [NX3;$([H2]),$([H1][#6]),$([H0]([#6])[#6]);!$(NC=[O,N,S])][OX2;$([H1]),$(O[#6;!$(C=[N,O,S])])]

Sulfon: [$([SX4](=[OX1])(=[OX1])([#6])[#6]),$([SX4+2]([OX1-])([OX1-])([#6])[#6])]

Sulfoxide: [$([SX3](=[OX1])([#6])[#6]),$([SX3+]([OX1-])([#6])[#6])]

Sulfonium: [S+;!$([S]~[!#6]);!$([S]*~[#7,#8,#15,#16])]

Sulfuric_acid: [SX4](=[OX1])(=[OX1])([$([OX2H]),$([OX1-])])[$([OX2H]),$([OX1-])]

Sulfuric_monoester: [SX4](=[OX1])(=[OX1])([$([OX2H]),$([OX1-])])[OX2][#6;!$(C=[O,N,S])]

Sulfuric_diester: [SX4](=[OX1])(=[OX1])([OX2][#6;!$(C=[O,N,S])])[OX2][#6;!$(C=[O,N,S])]

Sulfuric_monoamide: [SX4](=[OX1])(=[OX1])([#7X3;$([H2]),$([H1][#6;!$(C=[O,N,S])]),$([#7]([#6;!$(C=

User e9249ba1fe

23-10-2007 11:50:08

hey thanks a lot.

i was hoping that there would be some kind of simple answer.

i know a lot about ml stuff.

but biggest problem is getting the descriptors.

VCCLAB allows only 150 molecules to be processed at a time, so it seems i will have to go back to CDK etc.

thanks for help

User 677b9c22ff

23-10-2007 20:12:09

akshayubhat wrote:

hey thanks a lot.

i was hoping that there would be some kind of simple answer.

A) open the DOS commandline and call cxcalc

I am not quite sure what is simpler than that.

Output will be something like:

Code:

D:\temp>cxcalc plattIndex randicIndex balabanIndex hararyindex wienerindex fusedRingcount largestringsize c6h6.smi

1 12 3.00 2.00 10.00 27 0 6

2 20 2.97 1.64 10.33 27 0 3

3 20 2.97 1.74 10.67 25 2 4

4 28 2.98 2.21 11.50 22 3 5

5 36 3.00 1.29 12.00 21 4 4

6 36 3.00 1.29 12.00 21 4 4

7 8 2.91 2.34 8.70 35 0 0

8 14 2.93 1.88 9.50 31 0 3

9 14 2.93 2.01 9.75 29 0 4

10 22 2.93 1.65 10.25 28 2 3

11 12 3.00 2.00 10.00 27 0 6

12 14 2.93 1.88 9.50 31 0 3

B) Going back to CDK does not help you if you cannot program in Java. If you can program in Java, it goes like this:

1) Use MolImporter

2) Load and loop through all molecules

3) Initialize the plugin (see table above)

4) Perform calculation

5) Output calculation

For each of the plugins from the large list above you can repeat this by simply calling them and adding more functions. For the topological descriptors it looks like the code below, and to be honest I am not quite sure what is simpler than that (if you know Java). The code is not pretty, but it works and it is quick to build.

Code:

package examples;

import chemaxon.formats.*;

import chemaxon.struc.*;

import chemaxon.marvin.calculations.*;

import chemaxon.marvin.plugin.*;

import java.io.*;

public class CalcDescSimple {

/** Creates a MolImporter for the given structure file; exits the program if the file cannot be opened. */

private static MolImporter createMolImporter(String filename) {

   MolImporter mi = null;

   try{

      File f = new File(filename);

      FileInputStream fis = new FileInputStream(f);

      mi = new MolImporter(fis);

   } catch(FileNotFoundException ex) {

      System.err.println(filename+": not found");

      System.exit(1);

   } catch(MolFormatException ex) {

      System.err.println(filename+": "+ex.getMessage());

      System.exit(1);

   } catch(Exception ex) {

      System.err.println("Error: "+filename+" is not a structure file.");

      System.exit(1);

   }

   return mi;

}

/** Counts the molecules in a structure file by reading it once from start to end. */

private static long countMolecules(String filename) throws PluginException, MolFormatException, IOException

{

   MolImporter mi = createMolImporter(filename);

   long globalmolcounter = 0;

   while (( mi.read()) != null) {

      globalmolcounter++;

   }

   mi.close();

   return globalmolcounter;

}

public static void main(String[] args) throws PluginException, MolFormatException, IOException {

   String filename = "d:/temp/c6h6.smi";

   System.out.println("Number of molecules in " + filename+ ": "+ countMolecules(filename));

   MolImporter mi = createMolImporter(filename);

   TopologyAnalyserPlugin topologyplugin = new TopologyAnalyserPlugin();

   // for each input molecule run the calculation and display the results

   Molecule target = null;

   long molcounter = 0;

   long totalerrors = 0;

   while ((target = mi.read()) != null) {

      // set the input molecule

      topologyplugin.setMolecule(target);

      try {

         // run the calculation

         topologyplugin.run();

         // format doubles as strings for output (12 decimals, so precision may be lost);

         // use the tempXXX doubles if you want to calculate with the values

         java.text.DecimalFormat df12 = new java.text.DecimalFormat("0.000000000000");

         // maybe prettier to put them in array or LIST ?

         int count = target.getAtomCount();

         int aliphaticatomCount = topologyplugin.getAliphaticAtomCount();

         int aliphaticbondcount = topologyplugin.getAliphaticBondCount();

         int aliphaticringcount = topologyplugin.getAliphaticRingCount();

         int aromaticatomcount = topologyplugin.getAromaticAtomCount();

         int aromaticbondcount = topologyplugin.getAromaticBondCount();

         int aromaticringcount = topologyplugin.getAromaticRingCount();

         int asymmetricatomcount = topologyplugin.getAsymmetricAtomCount();

         double tempbalabanindex = topologyplugin.getBalabanIndex();

         String balabanindex = df12.format(tempbalabanindex);

         int bondcount = topologyplugin.getBondCount();

         int carboaromaticringcount = topologyplugin.getCarboaromaticRingCount();

         int carboringcount = topologyplugin.getCarboRingCount();

         int chainatomcount = topologyplugin.getChainAtomCount();

         int chainbondcount = topologyplugin.getChainBondCount();

         int chiralcentercount = topologyplugin.getChiralCenterCount();

         boolean tempconnectedGraph = topologyplugin.isConnectedGraph();

         int connectedGraph= tempconnectedGraph?1:0;

         int cyclomaticNumber = topologyplugin.getCyclomaticNumber();

         int fusedaliphaticringcount = topologyplugin.getFusedAliphaticRingCount();

         int fusedaromaticringcount = topologyplugin.getFusedAromaticRingCount();

         int fusedringcount = topologyplugin.getFusedRingCount();

         double temphararyIndex = topologyplugin.getHararyIndex();

         String hararyIndex = df12.format(temphararyIndex);

         int heteroaromaticringcount = topologyplugin.getHeteroaromaticRingCount();

         int heteroringcount = topologyplugin.getHeteroRingCount();

         int hyperWienerIndex = topologyplugin.getHyperWienerIndex();

         int largestringsize = topologyplugin.getLargestRingSize();

         int plattIndex = topologyplugin.getPlattIndex();

         double temprandicIndex = topologyplugin.getRandicIndex();

         String randicIndex = df12.format(temprandicIndex);

         int ringatomcount = topologyplugin.getRingAtomCount();

         int ringbondcount = topologyplugin.getRingBondCount();

         int ringcount = topologyplugin.getRingCount();

         int rotatablebondcount = topologyplugin.getRotatableBondCount();

         int smallestringsize = topologyplugin.getSmallestRingSize();

         int szegedIndex = topologyplugin.getSzegedIndex();

         int wienerIndex = topologyplugin.getWienerIndex();

         int wienerPolarity = topologyplugin.getWienerPolarity();

         //*******************************************************************

         String TopologyResult = molcounter + "\t"+count+"\t" + aliphaticatomCount + "\t" + aliphaticbondcount + "\t" + aliphaticringcount + "\t";

         TopologyResult = TopologyResult + aromaticatomcount + "\t" +aromaticbondcount + "\t" + aromaticringcount + "\t";

         TopologyResult = TopologyResult + asymmetricatomcount + "\t" +balabanindex+ "\t"+bondcount+ "\t";

         TopologyResult = TopologyResult + carboaromaticringcount + "\t" +carboringcount + "\t" +chainatomcount + "\t" + chainbondcount + "\t";

         TopologyResult = TopologyResult + chiralcentercount +"\t" + connectedGraph + "\t" + cyclomaticNumber+ "\t";

         TopologyResult = TopologyResult + fusedaliphaticringcount +"\t" + fusedaromaticringcount +"\t" + fusedringcount +"\t" ;

         TopologyResult = TopologyResult + hararyIndex+"\t" +heteroaromaticringcount+"\t" +heteroringcount+"\t"+hyperWienerIndex+"\t" ;

         TopologyResult = TopologyResult + largestringsize +"\t" +plattIndex+"\t" +randicIndex+"\t";

         TopologyResult = TopologyResult + ringatomcount+"\t"+ringbondcount+"\t"+ringcount +"\t";

         TopologyResult = TopologyResult + rotatablebondcount+"\t"+smallestringsize +"\t"+szegedIndex+"\t";

         TopologyResult = TopologyResult + wienerIndex +"\t"+ wienerPolarity +"\t";

         System.out.println();

         System.out.print(TopologyResult);

      } //this is for plugin-errors

      catch (Exception e)

      {

         System.out.println ("Error - " + e );

         totalerrors++;

      }

      // advance the running molecule index printed in the first output column

      molcounter++;

   }

   System.out.println();

   System.out.println("Number of errors:"+totalerrors);

   mi.close();

}

}

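The output is something like this (the first column is the molecule counter, which stayed at 0 in this run because it was never incremented):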

Code:

Number of molecules in d:/temp/c6h6.smi: 217

0 6 0 0 0 6 6 1 0 2.000000000000 12 1 1 0 0 0 1 1 0 0 0 10.00000000000 0 0 42 6 12 3.000000000000 6 6 1 0 6 54 27 3

0 6 6 7 2 0 0 0 0 1.641897173182 13 0 2 0 1 0 1 2 0 0 0 10.33333333333 0 0 43 3 20 2.966326495189 6 6 2 1 3 27 27 4

0 6 6 7 2 0 0 0 0 1.738063991517 13 0 2 0 0 2 1 2 2 0 2 10.66666666666 0 0 37 4 20 2.966326495189 6 7 2 0 4 59 25 2

0 6 6 8 3 0 0 0 0 2.213093912396 14 0 3 0 0 4 1 3 3 0 3 11.50000000000 0 0 29 5 28 2.983163247594 6 8 3 0 3 33 22 0

0 6 6 9 4 0 0 0 0 1.285714285714 15 0 4 0 0 6 1 4 4 0 4 12.00000000000 0 0 27 4 36 3.000000000000 6 9 4 0 3 51 21 0

0 6 6 9 4 0 0 0 0 1.285714285714 15 0 4 0 0 6 1 4 4 0 4 12.00000000000 0 0 27 4 36 3.000000000000 6 9 4 0 4 81 21 0

0 6 6 5 0 0 0 0 0 2.339092314976 11 0 0 6 5 0 1 0 0 0 0 8.700000000000 0 0 70 0 8 2.914213562373 0 0 0 2 0 35 35 3

0 6 6 6 1 0 0 0 0 1.876285894838 12 0 1 3 3 0 1 1 0 0 0 9.500000000000 0 0 56 3 14 2.931851652578 3 3 1 2 3 31 31 3

0 6 6 6 1 0 0 0 1 2.014266206296 12 0 1 2 2 1 1 1 0 0 0 9.750000000000 0 0 49 4 14 2.931851652578 4 4 1 1 4 45 29 3

0 6 6 7 2 0 0 0 2 1.647800297284 13 0 2 2 2 2 1 2 2 0 2 10.25000000000 0 0 47 3 22 2.931851652578 4 5 2 1 3 34 28 3

0 6 6 6 1 0 0 0 0 2.000000000000 12 0 1 0 0 0 1 1 0 0 0 10.00000000000 0 0 42 6 12 3.000000000000 6 6 1 0 6 54 27 3

0 6 6 6 1 0 0 0 0 1.876285894838 12 0 1 3 3 0 1 1 0 0 0 9.500000000000 0 0 56 3 14 2.931851652578 3 3 1 1 3 31 31 3

0 6 6 6 1 0 0 0 0 2.014266206296 12 0 1 2 2 0 1 1 0 0 0 9.750000000000 0 0 49 4 14 2.931851652578 4 4 1 1 4 45 29 3

0 6 6 7 2 0 0 0 2 1.795593921009 13 0 2 0 0 2 1 2 2 0 2 10.83333333333 0 0 34 5 20 2.966326495189 6 7 2 0 3 34 24 1

0 6 6 6 1 0 0 0 0 2.184105569636 12 0 1 1 1 0 1 1 0 0 0 10.16666666666 0 0 39 5 14 2.893846850117 5 5 1 0 5 33 26 2

0 6 6 6 1 0 0 0 0 2.014266206296 12 0 1 2 2 0 1 1 0 0 0 9.750000000000 0 0 49 4 14 2.931851652578 4 4 1 1 4 45 29 3

0 6 6 6 1 0 0 0 0 1.876285894838 12 0 1 3 3 0 1 1 0 0 0 9.500000000000 0 0 56 3 14 2.931851652578 3 3 1 1 3 31 31 3

0 6 6 7 2 0 0 0 0 1.641897173182 13 0 2 0 1 0 1 2 0 0 0 10.33333333333 0 0 43 3 20 2.966326495189 6 6 2 1 3 27 27 4

...snip

I attached the three files; the example SMILES of all C6H6 isomers were calculated using the CDK.

Given that you only need five lines of code with the JChem API to actually perform the calculation, I think it's quite simple. It's actually a no-brainer: just adding up routines. The only thing that would be nice to have is a parser which loops through all the XML properties and then automatically adds each new descriptor and a calculation line to the Java code. But this would require some serious programming and I am just too lazy for that, or let's say it goes beyond my programming knowledge.
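One way such an automatic loop might look, without touching any XML, is plain Java reflection: collect every public no-argument getXxx()/isXxx() method that returns a primitive value. This is only a sketch. GetterDump and FakePlugin are invented names, FakePlugin is a stand-in and not a ChemAxon class, and with a real plugin you would still have to set the molecule and call run() first and check which getters are meaningful.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: collects the values of all public no-argument getXxx()/isXxx() methods of an object. */
class GetterDump {

    static Map<String, Object> dump(Object target) {
        Map<String, Object> values = new TreeMap<>(); // sorted, so the column order is stable
        for (Method m : target.getClass().getMethods()) {
            String name = m.getName();
            Class<?> type = m.getReturnType();
            boolean getter = name.startsWith("get") || name.startsWith("is");
            // only primitive results (int, double, boolean, ...), but not void
            boolean simple = type.isPrimitive() && type != void.class;
            if (getter && simple && m.getParameterCount() == 0
                    && !Modifier.isStatic(m.getModifiers())) {
                try {
                    values.put(name, m.invoke(target));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("cannot call " + name, e);
                }
            }
        }
        return values;
    }

    /** Stand-in for a calculator plugin; NOT a ChemAxon class. */
    public static class FakePlugin {
        public int getRingCount() { return 1; }
        public double getBalabanIndex() { return 2.0; }
        public boolean isConnectedGraph() { return true; }
    }

    public static void main(String[] args) {
        dump(new FakePlugin()).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```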

Kind regards

Tobias

ChemAxon efa1591b5a

24-10-2007 08:22:36

Hi Tobias, great man, thank you for all the useful suggestions, detailed explanations and for the source codes you provided.

Btw: have you received a ChemAxon User Forum t-shirt yet?

Best regards,

Miklos

User e9249ba1fe

24-10-2007 08:53:11

THANKS TOBIAS

FOR ALL THE CODE AND HELP.

I WILL TRY IT.

THANKS AGAIN

ChemAxon efa1591b5a

24-10-2007 09:26:57

Quote:

1. i have academic license for jchem and generatemd

2. i have a windows xp desktop pc

3. i have a file containing ~50,000 molecules represented as SMILES

4. i wish to compute as many possible descriptors as i can for qspr/qsar

Hi,

as an academic user you are entitled to use all JChem tools without any size or other kind of limitation. Your ~50K compounds can be processed without any problems (i.e. practical memory or time limits will not be reached).

You can use generatemd to calculate complex descriptors like fingerprints; you can even incorporate your own descriptors, in which case, however, you need to write some Java code.

As Tobias mentioned, cxcalc can also be quite useful and relevant for your project. That program can calculate a large number of physico-chemical properties as well as topological and geometrical descriptors and write the results to standard text files that are easy to process further. For a detailed list of available properties you may wish to follow this link:

http://www.chemaxon.com/marvin/chemaxon/marvin/help/calculator-plugins.html
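To illustrate "easy to process further": a whitespace-separated results table like the cxcalc output shown earlier in this thread can be read into a numeric matrix with a few lines of Java. This is a sketch under the assumption that every data row is purely numeric. Rows that do not parse (such as a header line) are simply skipped. The class name DescriptorTable is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns whitespace-separated numeric text rows into a matrix. */
class DescriptorTable {

    static double[][] parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            double[] row = new double[fields.length];
            try {
                for (int i = 0; i < fields.length; i++) {
                    row[i] = Double.parseDouble(fields[i]);
                }
            } catch (NumberFormatException e) {
                continue; // header line or other non-numeric row
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        // two rows copied from the cxcalc output earlier in the thread
        double[][] table = parse(Arrays.asList(
            "1 12 3.00 2.00 10.00 27 0 6",
            "2 20 2.97 1.64 10.33 27 0 3"));
        System.out.println(table.length + " rows, " + table[0].length + " columns");
    }
}
```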

Regards,

Miklos

User e9249ba1fe

06-11-2007 10:17:18

finally i could calculate the chemical fingerprints using

generatemd c aids.sdf -k CF -o descp.txt

however the descp.txt contains 34 integers

what do these integers represent? are these binary fingerprints?

if yes how can i get 1/0 values?

ChemAxon efa1591b5a

06-11-2007 12:30:53

Hi,

great!

What you got in the output file is a binary fingerprint in decimal text representation. Each consecutive 32 bits of the binary fingerprint are represented as an integer value and that value is printed in decimal format as readable text. This is a compact representation, much shorter than a 0,1 text. If you insist on using 0,1 text then add the -2 flag to the command line of generatemd. (See the command line help, generatemd -x. The user's guide may also be useful: http://www.chemaxon.com/jchem/doc/user/GenerateMD.html)
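If you ever need to expand such integers yourself, the idea looks roughly like this. Note the assumption: the order in which the bits of each 32-bit word are written (lowest bit first here) is my choice for the sketch and is not taken from the generatemd documentation, so compare against the tool's own 0,1 output before relying on it. The class name FingerprintBits is invented.

```java
/** Hypothetical helper: expands the 32-bit words of a packed fingerprint into a 0/1 string. */
class FingerprintBits {

    /** Bit order is an assumption: bit 0 (lowest bit) of each word is written first. */
    static String toBitString(int[] words) {
        StringBuilder sb = new StringBuilder(words.length * 32);
        for (int word : words) {
            for (int bit = 0; bit < 32; bit++) {
                // unsigned shift, so negative values are handled like any other bit pattern
                sb.append(((word >>> bit) & 1) == 1 ? '1' : '0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBitString(new int[] {5, -1}));
    }
}
```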

I still do not understand your real goal, but in most cases the binary text format is not needed and not very useful. For most calculations the integers are just fine: you can compare them directly with Tanimoto etc.
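To illustrate comparing the integers directly, here is a minimal Python sketch of a Tanimoto coefficient computed on the packed values by counting set bits. It assumes both fingerprints have the same number of 32-bit words. The function names and the convention of returning 1.0 for two empty fingerprints are choices made for this sketch, not part of any ChemAxon tool.

```python
def popcount(value):
    """Number of set bits in the low 32 bits of value."""
    return bin(value & 0xFFFFFFFF).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as lists of integers."""
    common = sum(popcount(a & b) for a, b in zip(fp_a, fp_b))
    union = sum(popcount(a | b) for a, b in zip(fp_a, fp_b))
    # Assumption: two all-zero fingerprints count as identical.
    return common / union if union else 1.0


# Hypothetical two-word fingerprints, not real generatemd output.
fp_a = [0b1011, 0]
fp_b = [0b0110, 1]
print(tanimoto(fp_a, fp_b))  # 0.2 (1 common bit, 5 bits in the union)
print(tanimoto(fp_a, fp_a))  # 1.0
```

Because only bit counts are used, the result does not depend on the bit order inside each integer.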

Does this help at all?

regards,

Miklos

User e9249ba1fe

06-11-2007 16:10:48

i want to use those fingerprints as descriptors.

also i want to load them into matlab for further calculation of tanimoto etc, hence i am using text format so that i can load the delimited file into matlab.

however i would like to know whether it is possible to calculate a similarity matrix, i.e. there are 4773 molecules (bursi mutagenicity) and i want a tanimoto similarity matrix of 4773 * 4773 similarity values. is it possible with screenmd?

Thanks

User 677b9c22ff

07-11-2007 04:25:20

Hi,

besides using generateMD, generFP and all the other tools,

you can again use the Evaluator very easily.

Assume you have the SMARTS you want to use, for example from

Derivation and Validation of Toxicophores for Mutagenicity Prediction;

Jeroen Kazius, Ross McGuire, and Roberta Bursi;

J. Med. Chem.; 2005; 48(1) pp 312 - 320; DOI: 10.1021/jm040835a

or any other SMARTS, such as those from

Performance of Kier-Hall E-state Descriptors in Quantitative

Structure Activity Relationship (QSAR) Studies of

Multifunctional Molecules; Darko Butina; Molecules 2004, 9, 1004-1009:

Code:

RowNo smarts-definitions estates-atom-types-Kier-Hall

1 [OH1][*] sOH

2 O=[*] dO

3 [OH0]([*])[*] ssO

4 [o] aaO

5 [NH2][*] sNH2

6 [NH1]=[*] dNH

7 [NH1]([*])[*] ssNH

8 [nH1] aaNH

9 N#[*] tN

10 [ND2](=[*])[*] dsN

11 [nH0] aaN

12 N([*])([*])[*] sssN

13 N(=[*])(=[*])[*] ddsN

14 [N;+]([*])([*])([*])[*] ssssN+

15 [SH1][*] sSH

16 S=[*] dS

17 [SX2]([*])[*] ssS

18 [s] aaS

19 S(=[*])(=[*])([*])[*] ddssS

20 [F][*] sF

21 [Cl][*] sCl

22 [Br][*] sBr

23 [I][*] sI

24 [CH3][*] sCH3

25 [CH2]([*])[*] ssCH2

26 [CH2]=[*] dCH2

27 [CH1]([*])([*])[*] sssCH1

28 [CH1](=[*])[*] dsCH1

29 [CH1]#[*] tCH

30 [cH] aaCH

31 [cH0] aasC

32 C(=[*])=[*] ddC

33 C(#[*])[*] tsC

34 C(=[*])([*])[*] dssC

35 C([*])([*])([*])[*] ssssC

What you do is create an Evaluator expression file:

Code:

array(

matchCount("[OH1][*]"),

matchCount("O=[*]"),

matchCount("[OH0]([*])[*]"),

matchCount("[o]"),

matchCount("[NH2][*]"),

matchCount("[NH1]=[*]"),

matchCount("[NH1]([*])[*]"),

matchCount("[nH1]"),

matchCount("N#[*]"),

matchCount("[ND2](=[*])[*]"),

matchCount("[nH0]"),

matchCount("N([*])([*])[*]"),

matchCount("N(=[*])(=[*])[*]"),

matchCount("[N;+]([*])([*])([*])[*]"),

matchCount("[SH1][*]"),

matchCount("S=[*]"),

matchCount("[SX2]([*])[*]"),

matchCount("[s]"),

matchCount("S(=[*])(=[*])([*])[*]"),

matchCount("[F][*]"),

matchCount("[Cl][*]"),

matchCount("[Br][*]"),

matchCount("[I][*]"),

matchCount("[CH3][*]"),

matchCount("[CH2]([*])[*]"),

matchCount("[CH2]=[*]"),

matchCount("[CH1]([*])([*])[*]"),

matchCount("[CH1](=[*])[*]"),

matchCount("[CH1]#[*]"),

matchCount("[cH]"),

matchCount("[cH0]"),

matchCount("C(=[*])=[*]"),

matchCount("C(#[*])[*]"),

matchCount("C(=[*])([*])[*]"),

matchCount("C([*])([*])([*])[*]"))
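Typing such an expression by hand gets tedious for longer SMARTS lists. As a convenience, the text can be generated from a list of patterns. This is a minimal Python sketch that only reproduces the array(...) layout shown above. The helper name build_expression is hypothetical, and the sketch does not check that the SMARTS are valid or that the Evaluator accepts the result.

```python
def build_expression(smarts_list):
    """Return an array(...) expression with one matchCount call per SMARTS."""
    calls = ['matchCount("%s")' % smarts for smarts in smarts_list]
    return "array(\n" + ",\n".join(calls) + ")"


# First three patterns from the table above.
patterns = ["[OH1][*]", "O=[*]", "[OH0]([*])[*]"]
expression = build_expression(patterns)
print(expression)
```

The printed text can be saved to a file and passed to the evaluator in the same way as the hand-written version.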

and you call it with the evaluator like this (but beware: this does not work

for logP, because logP is a real number and the array function

is only defined for integers):

evaluate -f SMARTS-kier-hall-QSAR.txt NCI2000.smi >kier-hall-smarts-out.txt

The output is a nice matrix for any tool like Statistica or WEKA.

Code:

0;2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;3;0;0;0;0;0;3;0

0;0;0;0;0;0;0;0;0;0;2;0;0;0;0;0;2;2;0;0;0;0;0;0;0;0;0;0;0;8;6;0;0;0;0

1;2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;2;4;0;0;0;0

0;1;0;0;0;1;0;1;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;0;0;1;2;0;0;0;0

0;2;0;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;7;5;0;0;2;0

2;2;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0;0;0;0;0;8;11;0;0;1;0

0;2;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;1;0;0;2;0;0;0;0;0;4;2;0;0;4;0

0;3;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;6;6;0;0;2;0

2;0;0;0;0;0;0;0;0;2;0;0;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0;0;0;0;0;0;0;2;0

0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;15;3;0;0;0;0

2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;6;0;0;0;0;0;2;4;0;0;0;2

0;1;0;0;0;0;0;0;0;1;0;1;0;0;0;0;0;0;0;0;0;0;0;1;1;0;0;0;0;5;1;0;0;2;0

0;0;0;0;1;0;0;0;0;0;1;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0;0;5;4;0;0;0;0

...snip
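Since the rows are plain semicolon-separated integers, they are easy to load elsewhere. The following is a minimal Python sketch using only the standard library. The function name read_count_matrix is made up for this example, and it assumes every non-empty line contains nothing but integers.

```python
import csv
import io


def read_count_matrix(handle, delimiter=";"):
    """Parse delimiter-separated integer rows, skipping blank lines."""
    rows = []
    for record in csv.reader(handle, delimiter=delimiter):
        if not record:  # csv.reader yields [] for empty lines
            continue
        rows.append([int(value) for value in record])
    return rows


# First two rows of the output shown above, with a blank line in between.
sample = (
    "0;2;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0;3;0;0;0;0;0;3;0\n"
    "\n"
    "0;0;0;0;0;0;0;0;0;0;2;0;0;0;0;0;2;2;0;0;0;0;0;0;0;0;0;0;0;8;6;0;0;0;0\n"
)
matrix = read_count_matrix(io.StringIO(sample))
print(len(matrix), len(matrix[0]))  # 2 35
```

For a real file, pass an open file object instead of the StringIO. A non-numeric line such as the "...snip" marker would raise a ValueError, so remove such lines first.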

Tobias

User e9249ba1fe

07-11-2007 09:50:16

thanks tobias

to calculate the similarity matrix i tried the following command using screenmd.

csa.sdf and csa2.sdf are copies of the same file containing the same molecules.

i used

screenmd csa.sdf csa2.sdf -g -k CF -M Tanimoto -o output.txt

it seems to work.

actually while writing this post the calculation is going on.

ChemAxon efa1591b5a

07-11-2007 10:07:26

Indeed, you can calculate the similarity matrix using screenmd, just make sure that the dissimilarity threshold is 1 (for Tanimoto, or a very large number when using the Euclidean metric).

Regarding the use of the chemical fingerprint as a descriptor: it is possible to use the decimal values for further analysis, e.g. in MATLAB, so there is no need to use the binary 0,1 text format. However, if you would like to perform any kind of dimension reduction then the binary form must be used.

Does this help?

Regards,

Miklos

User e9249ba1fe

07-11-2007 16:58:06

thanks Miklos

i got the similarity matrix.

could you tell me how the dissimilarity threshold affects the whole procedure and how i can set it on the command line?

also i want similarity values between 0 and 1, where 1 indicates the most similar or an equivalent molecule.

thanks

User e9249ba1fe

08-12-2007 22:14:26

when i calculated the similarity matrix using the above procedure i got all diagonal elements as 0 while they should have been 1!!!

please help

thanks

ChemAxon efa1591b5a

10-12-2007 09:17:01

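This is because you calculate the dissimilarity, not the similarity. For the Tanimoto metric the dissimilarity is 1 minus the similarity, so identical molecules give 0 on the diagonal, and subtracting each value from 1 yields the similarity values you expect. The dissimilarity is often preferred over similarity as there are many common metrics (e.g. Euclidean) that aren't similarity but distance metrics and thus aren't upper bounded.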
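Turning a Tanimoto dissimilarity matrix into a similarity matrix is a one-line transformation once the values are loaded. This minimal Python sketch assumes the matrix is already available as a list of rows of floats and that the dissimilarity is defined as 1 minus the Tanimoto coefficient. It makes no assumption about the screenmd output layout and does not apply to unbounded metrics such as Euclidean distance.

```python
def to_similarity(dissimilarity_rows):
    """Convert Tanimoto dissimilarities (0 = identical) to similarities."""
    return [[1.0 - value for value in row] for row in dissimilarity_rows]


# Hypothetical 2 x 2 dissimilarity matrix with zeros on the diagonal.
dissimilarity = [
    [0.0, 0.75],
    [0.75, 0.0],
]
similarity = to_similarity(dissimilarity)
print(similarity)  # [[1.0, 0.25], [0.25, 1.0]]
```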

Hope this helps.

Regards,

Miklos