screening using scalar descriptors

User 1cbb912c5a

15-12-2010 11:47:48

Dear all


I am wondering whether it is possible (and how) to screen a large compound database stored in an SDF file (or another format) using ChemAxon tools. Let's say my workflow is as follows: I have a couple of million compounds in an SDF file. Using cxcalc I could calculate e.g. pKa, acceptorcount, donorcount and many other parameters for my database. But I want to keep only the compounds that have the right values for those descriptors. For example, I define that my hits should meet all of these criteria:


pKa should be higher than 4


acceptorcount should be lower than 4


and so on...


So my question is: how can I remove the compounds that do not match my criteria?


I would be very grateful for any help.


Best regards


Rafal

ChemAxon e08c317633

16-12-2010 21:34:42

Yes, it can be done with Chemical Terms. Chemical Terms Evaluator is a command line application which can do this.


Example:


$ evaluate -e 'pKa("1") > 4 && acceptorCount() < 4 && donorCount() < 4' -x smiles nci100.smiles
NC1=CC=NC2=C1C=CC(Cl)=C2
CCCCCCC1CCCCN1
CCCCCCCCCCCCCCCC1=C(N)C=CC(O)=C1
CC1=CC(C)=C(N)C(C)=C1
CC1=CN=C2C=CC=C(C)C2=C1Cl
CN(C)C(C1=CC=CC=C1)C1=C(C)C=CC=C1
C1=CC=C(C=C1)C(N=C(C1=CC=CC=C1)C1=CC=CC=C1)C1=CC=CC=C1
CCC1=C(N)C=C(C)N=C1
CC1=C(C2=CC=CC=C2)C(OC2=CC=CC=C2)=C2C=CC=CC2=N1
CC(C#N)N(C)C
N(C(C1=CC=CC=C1)C1=CC=CC=C1)C(C1=CC=CC=C1)C1=CC=CC=C1
C1CCN(CC1)C(C1=CC=CC=C1)C1=CC=CC=C1
O=C(CC(N1CCCCC1)C1=CC=CC=C1)C1=CC=CC=C1
COC(N(C)C)(C1=CC=CC=C1)C1=CC=CC=C1
CN(C)CCC(=O)C1=CC=CC=C1
CCN(CC)CCC(=O)OC
CCN(CC)CCC#N


Expression 'pKa("1") > 4 && acceptorCount() < 4 && donorCount() < 4' means:


  - strongest pKa is higher than 4  (Note: it is recommended to use apKa() and bpKa() functions instead of pKa())
  - and acceptorcount is lower than 4
  - and donorcount is lower than 4


For more functions and details see Chemical Terms Reference Tables.


The "-x" command line option sets the extract mode.


-x, --extract <format>                extract mode: write exactly those
                                      molecules in the specified format that
                                      satisfy the input boolean expression

The example writes out only those molecules from the input file that satisfy the expression. Evaluator can handle millions of input structures.
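When the list of criteria grows, the expression can be assembled programmatically instead of typed by hand. The following is only a minimal Python sketch, assuming `evaluate` is available on the PATH. The helper names `build_expression` and `build_evaluate_command` are hypothetical, and only the `-e` and `-x` options shown above are used:

```python
import shlex

def build_expression(conditions):
    """Join individual Chemical Terms conditions with logical AND (&&)."""
    return " && ".join(conditions)

def build_evaluate_command(conditions, out_format, input_file):
    """Return the argument list for an `evaluate` run in extract mode.

    Hypothetical helper: it only arranges the -e and -x options from the
    example above; it does not validate the expression itself.
    """
    return ["evaluate", "-e", build_expression(conditions),
            "-x", out_format, input_file]

conditions = ['pKa("1") > 4', "acceptorCount() < 4", "donorCount() < 4"]
cmd = build_evaluate_command(conditions, "smiles", "nci100.smiles")

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires a JChem installation):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as an argument list avoids shell quoting problems with the double quotes inside the expression.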


With Instant JChem and Chemical Terms, the filtering can be done directly on databases; see these parts of the IJC documentation:
 - Chemical Terms Fields
 - Query builder


Regards,
Zsolt

User 1cbb912c5a

22-12-2010 07:49:31

Dear Zsolt,


Thank you very much for such useful hints, this is really what I wanted. In fact, evaluator can save me time and it works very fast :) excellent!


By the way, I have another problem. What if I have, e.g., ADMET descriptors calculated in other software and included in the SDF file? Let's say I have more than 1 M molecules in the SDF. It would be very cumbersome and time-consuming to load them into Instant JChem, use the query builder and remove the compounds with unsuitable ADMET descriptor values. So my question is: is it possible to use one of your tools to screen an SDF file (in batch mode, from the command line, etc.), using given field names and thresholds for them as the query?


I would appreciate any help.


 


Best regards for all


Rafal



ChemAxon e08c317633

03-01-2011 10:59:11

Yes, it is possible. Evaluator can refer to SDF fields; see the field() Chemical Terms function.


I attached the previously used nci100 file in SDF format; it contains "logP" SDF fields. Here is an example of how you can refer to these fields:


$ evaluate -e 'pKa("1") > 4 && acceptorCount() < 4 && donorCount() < 4 && field("logP") < 4' -x smiles nci100logP.sdf
NC1=CC=NC2=C1C=CC(Cl)=C2
CCCCCCC1CCCCN1
CC1=CC(C)=C(N)C(C)=C1
CC1=CN=C2C=CC=C(C)C2=C1Cl
CCC1=C(N)C=C(C)N=C1
CC(C#N)N(C)C
CN(C)CCC(=O)C1=CC=CC=C1
CCN(CC)CCC(=O)OC
CCN(CC)CCC#N

A new condition is added to the expression: && field("logP") < 4. The logP values are read from the "logP" SDF field, and only those molecules whose logP field value is less than 4 are written to the output.
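To make the idea of thresholding on a stored field concrete, here is a plain-Python sketch of the same kind of filter. It is not how Evaluator is implemented. It assumes the usual SD file layout (a `> <name>` header line followed by the value, records terminated by `$$$$`), and the two records and their logP values are made up for illustration:

```python
import io

def iter_sdf_records(handle):
    """Yield each SD file record as a list of lines, without the '$$$$' line."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):  # trailing record without delimiter
        yield record

def read_field(record, name):
    """Return the first value line of data field `name`, or None if absent."""
    for i, line in enumerate(record):
        if line.startswith(">") and f"<{name}>" in line:
            return record[i + 1].strip() if i + 1 < len(record) else None
    return None

def filter_records(handle, name, upper_limit):
    """Keep the records whose numeric field `name` is below `upper_limit`."""
    kept = []
    for record in iter_sdf_records(handle):
        value = read_field(record, name)
        try:
            if value is not None and float(value) < upper_limit:
                kept.append(record)
        except ValueError:
            pass  # non-numeric field value: treated as not matching
    return kept

# Hypothetical two-record input; structures are left empty on purpose.
SAMPLE = """mol_a


  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
1.91

$$$$
mol_b


  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logP>
4.50

$$$$
"""

kept = filter_records(io.StringIO(SAMPLE), "logP", 4.0)
print([rec[0] for rec in kept])  # names (first line) of the records kept
```

For real files, Evaluator remains the better choice, since it can combine stored fields with calculated properties in a single expression.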


To write the logP field values alongside the output SMILES, use the smiles:TlogP output format:


$ evaluate -e 'pKa("1") > 4 && acceptorCount() < 4 && donorCount() < 4 && field("logP") < 4' -x smiles:TlogP nci100logP.sdf
#SMILES logP
NC1=CC=NC2=C1C=CC(Cl)=C2 1.91
CCCCCCC1CCCCN1 3.37
CC1=CC(C)=C(N)C(C)=C1 2.68
CC1=CN=C2C=CC=C(C)C2=C1Cl 3.76
CCC1=C(N)C=C(C)N=C1 1.02
CC(C#N)N(C)C 0.29
CN(C)CCC(=O)C1=CC=CC=C1 1.66
CCN(CC)CCC(=O)OC 0.76
CCN(CC)CCC#N 0.68
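The two-column output above is easy to post-process, for example to rank the hits. A small Python sketch, assuming the columns are separated by whitespace and the header line starts with `#`. The function name `parse_smiles_table` is hypothetical, and only three of the rows shown above are reused here:

```python
def parse_smiles_table(text):
    """Parse 'SMILES value' lines into (smiles, float) tuples.

    Lines starting with '#' are treated as headers and skipped. A '#'
    inside a SMILES string (a triple bond) does not affect this check,
    because only the start of the line is tested.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        smiles, value = line.split()[:2]
        rows.append((smiles, float(value)))
    return rows

OUTPUT = """#SMILES logP
NC1=CC=NC2=C1C=CC(Cl)=C2 1.91
CCCCCCC1CCCCN1 3.37
CC(C#N)N(C)C 0.29
"""

rows = parse_smiles_table(OUTPUT)
rows.sort(key=lambda row: row[1])  # ascending logP
print(rows[0])  # the hit with the lowest logP
```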

Zsolt

User 1cbb912c5a

03-01-2011 12:38:44

Hi Zsolt


Great news.


I checked it and it works very well.


Thanks for the help.


 


Best regards


Rafal